
Hindawi Publishing Corporation

EURASIP Journal on Embedded Systems


Volume 2006, Article ID 38494, Pages 1–15
DOI 10.1155/ES/2006/38494

Scalable MPEG-4 Encoder on FPGA Multiprocessor SOC

Ari Kulmala, Olli Lehtoranta, Timo D. Hämäläinen, and Marko Hännikäinen

Department of Information Technology, Institute of Digital and Computer Systems, Tampere University of Technology,
P.O. Box 553, Korkeakoulunkatu 1, 33101 Tampere, Finland
Received 15 December 2005; Revised 16 May 2006; Accepted 27 June 2006
High computational requirements combined with rapidly evolving video coding algorithms and standards are a great challenge for contemporary encoder implementations. Rapid specification changes favor full programmability and configurability for both software and hardware. This paper presents a novel scalable MPEG-4 video encoder on an FPGA-based multiprocessor system-on-chip (MPSOC). The MPSOC architecture is truly scalable and is based on a vendor-independent intellectual property (IP) block interconnection network. The scalability in video encoding is achieved by spatial parallelization where images are divided into horizontal slices. A case design is presented with up to four synthesized processors on an Altera Stratix 1S40 device. A truly portable ANSI-C implementation that supports an arbitrary number of processors gives 11 QCIF frames/s at 50 MHz without processor specific optimizations. The parallelization efficiency is 97% for two processors and 93% with three. The FPGA utilization is 70%, requiring 28 797 logic elements. The implementation effort is significantly lower compared to traditional multiprocessor implementations.

Copyright © 2006 Ari Kulmala et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Video is becoming an essential part of embedded multimedia terminals. There are, however, many contradicting constraints in video codec implementations. One challenge is the rapid evolution of compression standards with several different algorithms. This requires programmability that is easy to achieve with processor-based platforms. However, achieving the best power, energy, and silicon area efficiency requires custom hardware implementations. On the other hand, hardware (HW) design is more demanding than software (SW) development, and modifications are very expensive and time consuming. For example, nonrecurring engineering (NRE) costs, especially photo mask fabrication costs, increase rapidly with each technology generation, making frequent HW upgrades less favorable. A software-only implementation solves the flexibility and upgradeability problem but is not an optimal solution from a performance versus silicon area point of view.

The work in this paper solves the HW video codec design flexibility and upgradeability problem with a fully programmable, scalable MPSOC approach [1, 2]. The key idea is to use synthesizable soft-core processors and a synthesizable system-on-chip (SOC) interconnection network, which allows prototyping and implementation on any FPGA platform, or in an ASIC technology with a rapid design cycle. In addition, our implementation framework enables a seamless trade-off between performance and area without creating an extra burden in system design by scaling the number of identical processors. Furthermore, the architecture is designed to be easily reusable for any kind of application.

A data parallel MPEG-4 simple profile (SP) software encoder is implemented on the MPSOC to demonstrate the effectiveness and scalability of the presented solution. The video images are divided into independent horizontal slices which are mapped to different processors for encoding. A master-slave configuration is used where the master processor is responsible for overall control and the slaves perform the video encoding.

In this paper, we use an Altera Stratix FPGA as the target platform [3], Altera Nios processors, and our heterogeneous IP block interconnection v.2 (HIBI) [4] as the communication network. No MPEG-4 specific HW accelerators, for example, for motion estimation, are currently used, but HIBI provides a very convenient plug-and-play method to add intellectual property (IP) blocks independent of the vendor.

Topics of interest in this paper include practical implementation issues, such as utilized FPGA resources and achieved performance, design cycle improvement, scalability, and encoder specific issues like memory optimization due to scarce on-chip memories. The implementation works in practice with an FPGA board attached to a PC that sends source video streams and receives compressed data.

This paper is organized as follows. Related work is reviewed in Section 2. The MPSOC architecture is described in Section 3. The video encoding software and our encoder parallelization approach are presented in Section 4. Section 5 explains the integration of the HW architecture and software. In Section 6 the results are presented. Finally, Section 7 summarizes the paper and discusses future work.

2. RELATED WORK

In this section we consider the related work in two categories, parallel video encoding and FPGA-based MPSOC architectures.

2.1. Parallel encoder implementations

Due to the high computational complexity of video encoding [5], several parallel solutions have been developed in contrast to traditional sequential program flow [6]. There are at least four general parallelization methods used: functional, temporal, data, and video-object parallelism. For functional parallelism [7], different functions, such as DCT and motion estimation, are connected in a functional pipeline to be executed in parallel by different processing units. However, scaling a functional parallel application requires a lot of work (high scaling effort). When each processor executes a specific function, adding or removing processors requires a whole system redesign to balance the computational load in the pipeline. For temporal parallelism (i.e., parallel in time), a full frame is assigned to every CPU. The scalability of this style is high. However, as the number of parallel encoders increases, it introduces a significant latency in encoding, since one frame is buffered for each encoding CPU. Works in this category include [8–10]. In data parallelism, the image is divided into slices that are assigned to different CPUs. The slices are encoded in parallel frame-by-frame. This approach is used in [11–13]. For video-object parallelism, which is specific to MPEG-4, arbitrarily sized shapes referred to as video-objects in the image are assigned to different CPUs. The objects can be considerably unequal in size, which may lead to unbalanced execution time between different CPUs if care is not taken. Such work is presented, for example, in [14].

We are mainly interested in real-time encoding. Functional, data, and video-object parallelism are all eligible for real-time, low-latency video encoding, because they do not require frames to be buffered. We chose data parallelism because functional parallelism has a high scaling effort and video-object parallelism is strictly MPEG-4 specific. Scalability is the most feasible criterion for comparing different architectures and parallel implementations, because reported results typically vary in accuracy. The scalability of different parallelization methods is compared in Section 6.

Contemporary FPGA designs tend to use single encoder cores with HW accelerators arranged in a functional pipeline. Our implementation is one of the first utilizing multiple parallel encoders on an FPGA in a data parallel configuration. In [15], an FPGA-based H.263 encoder is demonstrated, requiring 400 kgates for HW accelerators while providing 30 QCIF frames/s at 12 MHz. Reference [16] presents FPGA implementations of hardware accelerators for an H.264 video codec. In [17], an H.264 coder is designed and verified with an FPGA emulator platform. An interface between a host PC and an FPGA-based MPEG-4 encoder is built in [18], enabling fast prototyping and debugging.

2.2. FPGA multiprocessor architectures

Although multiprocessor systems have been researched for a while, most of the work has concentrated on ASIC implementations. FPGAs have only recently grown large enough to hold such implementations, which is one reason for the low number of reported FPGA-based MPSOCs. However, two trends of research can be identified.

First, FPGA multiprocessor systems are used to develop parallel applications. In these works, the main emphasis is usually on the application and its parallelization. The hardware architectures are briefly summarized. Typical implementations rely on vendor-dependent solutions, because they are usually easy to use. The hardware restrictions to scalability or flexibility and the ease of adding or removing components are often not addressed.

In [19], Martina et al. have developed a shared memory FPGA multiprocessor system of digital signal processor (DSP) cores of their own design that run basic signal processing algorithms. The implemented bus interconnection is not described. There are no synthesis results for the whole system, but the system runs at 89 MHz on a Xilinx XCV1000 FPGA. Wang and Ziavras presented a system of six soft-core Nios processors [20]. The processors are interconnected with a multimaster Avalon bus [21]. No figures of the area required for the whole system are presented. However, the maximum clock frequency is 40 MHz using an Altera EP20K200EFC484-2x FPGA board.

Second, a hardware-oriented point of view for future multiprocessor requirements is presented. Reconfigurability is often emphasized [22]. Also, IP-block-based systems are stressed and a need for a scalable, standard-interface interconnection network is anticipated. Kalte et al. [22] have presented a multilayer advanced microcontroller bus architecture (AMBA) interconnection architecture used in an FPGA. In AMBA, access to slaves is multiplexed between masters and different masters can use different peripherals simultaneously. A conceptual view of a system is depicted although not implemented. The interconnection architecture is synthesized separately, as is the processor. No application is described.

This work combines both of the above categories, since an application and a working prototype are implemented on the proposed architecture. The architecture itself is constructed of IP blocks and a general purpose interconnection architecture with support for an OCP-IP interface [23], which is a standard interconnection interface. A standardized IP block interface along with high scalability ensures the future use of the architecture.

Table 1: Summary of HIBI characteristics.

Property | HIBI implementation
Topology | Hierarchical bus with wrappers and bridges between bus segments
Interface | FIFO, OCP
Clocking | Multiple clock domains
Arbitration | Distributed, pipelined
Arbitration algorithm | Priority, round-robin, time division multiple access (TDMA), or combination
Synthesis-time configurable parameters | FIFO sizes, data width, addresses, initial configuration, number of configuration pages and their type (RAM or ROM), included properties
Run-time configurable parameters | All arbitration parameters and algorithm, cycle counters, power mode
Quality-of-service (QoS) | TDMA, send limit + priority/round-robin, multiple priorities for data, fast reconfiguration
Bus resolution | OR-network
Addressing | Multiple addresses per IP, multiplexed in the bus with data, allows multicast

Figure 1: High-level view of the architecture on an FPGA development board. (The figure shows CPU0 to CPUn-1, each with an N2H2 DMA and an HIBI wrapper, two HIBI bus segments in separate clock domains joined by an HIBI bridge, an SDRAM controller on the Stratix EP1S40 FPGA, the external SRAM (1 MB) program memory reached over an Avalon external bridge, the external SDRAM (16 MB) picture memory, and an Ethernet connection on the development board.)

3. MPSOC ARCHITECTURE

In this section we present our novel architecture for multiprocessor systems. The key elements of the architecture are processor programmability as well as the scalability and flexibility that are obtained with the HIBI on-chip network [4]. A high-level view of the architecture is depicted in Figure 1. It contains a parameterizable number of soft-core processors, an HIBI network, direct memory access (DMA) blocks (N2H2), and two external memories located on the development board. In this case, external SRAM memory is used for instructions for all processors. HIBI is used to access the external SDRAM memory, which is used for shared data, and to let processors send messages directly to each other.

3.1. Heterogeneous IP block interconnection v.2

HIBI is a hierarchical interconnection for SOCs, which was originally developed for ASICs. The objective of HIBI is to provide a topology-independent, scalable, yet high-performance on-chip network. It is highly configurable and supports multiple clock domains. To provide maximum efficiency, there are no empty cycles during bus transfers under a high load. Split transactions are used in read operations. Bus capacity is fully exercised by starting new transfers right after the preceding one has finished. The HIBI network can be reconfigured at run-time, using a configuration RAM. This should not be confused with FPGA reconfiguration.

The basic building block of the HIBI network is an HIBI wrapper. As HIBI utilizes distributed arbitration, each HIBI wrapper is responsible for seizing the bus at the right time. The main characteristics of HIBI are presented in Table 1 [1]. In Figure 1, three processors and an SDRAM controller form one segment, Clock domain 1, and the rest of the processors form another segment, Clock domain 2. There is no master/slave configuration in the interconnection and, thus, every IP block can freely access any other IP block. An HIBI bridge is used to connect the segments together to allow transfers between segments. Every wrapper is assigned an address space, which can vary depending on the number of wrappers and the need for different addresses per wrapper. When data is sent via HIBI, only the destination address is delivered. In order to distinguish between different sources, each source must use a separate destination address within the recipient's address space.

There are several different HIBI wrapper interfaces for IP blocks. It is left to the designer to decide which interface is the most appropriate, and there may be a mix of different kinds of interfaces. For example, FIFO and OCP interfaces can be utilized in the same architecture. Also, HIBI supports multiple priorities for data. Optionally, there can be different FIFOs for every priority. In that case, the highest priority FIFOs are always served first. Thus, they can interrupt lower priority data transfers to get service immediately.

3.2. Soft-core processors

Currently, we have used the soft-core processors Nios [24] and Nios II [25, 26] in our architecture. The master processor is a Nios with the configuration shown in Figure 2. The peripherals are timers, LED and button parallel I/Os (PIOs), and a UART. Nios can be used as a 16-bit or 32-bit processor, which affects the data bus width. Nios always uses 16-bit instruction words, which restricts immediate values to five bits. With prefix instructions this is increased to 16 bits. A large number of prefix instructions reduces code efficiency.

Nios II is a 32-bit CPU with 32-bit instructions. Nios II has more limited configurability.

We use the fastest version, Nios II/f (fast), which provides 1.16 DMIPS/MHz at an area consumption of around 2000 LEs. Nios II is used for the video encoding slaves with the configuration depicted in Figure 3. Unlike Nios, Nios II uses tightly-coupled memory between the processor and the data RAM. Data does not use the Avalon [21] bus, which significantly reduces the memory access time.

Nios natively uses the Avalon bus to connect to memories and peripherals. Avalon bus masters can command slaves, but slaves can only get the attention of a master by raising an interrupt. A typical master is a processor. A drawback is that masters cannot communicate directly with each other. Both Nios processors have separate instruction and data buses. In Figures 2 and 3 only the data bus is drawn for simplicity. The instruction bus connects to the boot ROM, the instruction memory (ext. SRAM), and the vector table. With the Avalon bus, the instruction bus master has 20 address lines, 32 data lines, and waitrequest and read signals. The data bus master has 20 address lines, two 32-bit data buses for read and write, and the signals waitrequest, read, write, irq, and irq number. This makes 92 signal lines. There are possibly other signal lines as well, but even with this practical minimum, there are 146 signal lines in the buses. As a comparison, a 32-bit HIBI bus consists of only 38 signal lines (32-bit data, 3-bit command, address valid, lock, and full). In addition, HIBI supports interprocessor communication without restrictions. The features of Avalon are not sufficient for data intensive multiprocessor communication, motivating the use of HIBI.

Figure 2: Master processor configuration. (A 32-bit Nios CPU with an 8 KB instruction cache, 64 KB data RAM, 256 B vector table, 2 KB boot ROM, two timers, LED and button PIOs, a UART, an Avalon external bridge to the Ethernet and the 1 MB SRAM, and an N2H2 DMA attached to HIBI.)

Figure 3: Slave processor configuration. (A Nios II/f CPU with an 8 KB instruction cache, 64 KB data RAM, 256 B vector table, 2 KB boot ROM, a timer, an LED PIO, a UART, an Avalon interface to the external SRAM, and an N2H2 DMA attached to HIBI.)

3.3. Nios-to-HIBI v.2 DMA

Nios processors do not have native support for the HIBI on-chip network. Therefore, a DMA block, Nios-to-HIBI v.2 (N2H2), was implemented to attach the processors to HIBI. DMA minimizes CPU intervention in transfers. This allows the CPU to execute the application while the DMA transfers data in the background.

N2H2 includes three Avalon interfaces. The slave interface is used by a CPU to control and query N2H2. Its configuration and status registers include the state of the DMA block, DMA transfer instructions, and DMA receiving instructions. Two master interfaces are used separately for receiving and transmitting. In order to increase reusability, these interfaces have been isolated from each other as much as possible. Thus, with minimal modifications, the same block can be applied to different processors.

The transmitter side is fairly straightforward. First, the CPU writes the memory address (e.g., a pointer), the amount of data, the priority, and the destination address to the configuration registers. Following this, the transmitter sends the address to the HIBI. The transmitter then reads the data from memory and instantly transfers the data.

Receiving is more complicated. Data sent through HIBI may get fragmented. To circumvent this, we have implemented multiple receiving channels that wait for a given amount of data before interrupting the CPU. Each channel can be configured to receive data from any source and save it to memory locations defined by the CPU. There can be several data transfers going on simultaneously, so N2H2 uses the HIBI address to distinguish between them. For example, if two sources are sending data simultaneously, two channels are used. When the expected amount of data has arrived on a channel, the CPU is notified by an interrupt or via a poll register.
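To make the register-level protocol concrete, the following C sketch shows how a CPU might program an N2H2-style transfer. The base address, register offsets, and names are illustrative assumptions; the paper only states that a send is started by writing a memory address, a word count, a priority, and a HIBI destination address, and that a receive channel is armed with a target location, an expected amount of data, and a source address.

    #include <stdint.h>

    /* Assumed N2H2 base address and register offsets (32-bit words);
     * the real register map is not given in the paper. */
    #define N2H2_BASE        ((volatile uint32_t *)0x80001000)
    #define N2H2_TX_MEM_ADDR 0   /* source pointer in local data RAM */
    #define N2H2_TX_COUNT    1   /* number of 32-bit words to send   */
    #define N2H2_TX_PRIO     2   /* HIBI priority of the transfer    */
    #define N2H2_TX_DEST     3   /* HIBI destination address         */
    #define N2H2_TX_START    4   /* write 1 to launch the transfer   */
    #define N2H2_RX_CH_BASE  8   /* first of the eight Rx channels   */

    /* Start a send; the CPU continues executing while the DMA runs. */
    static void n2h2_send(const uint32_t *src, uint32_t words,
                          uint32_t prio, uint32_t hibi_dest)
    {
        N2H2_BASE[N2H2_TX_MEM_ADDR] = (uint32_t)src;
        N2H2_BASE[N2H2_TX_COUNT]    = words;
        N2H2_BASE[N2H2_TX_PRIO]     = prio;
        N2H2_BASE[N2H2_TX_DEST]     = hibi_dest;
        N2H2_BASE[N2H2_TX_START]    = 1;
    }

    /* Arm Rx channel 'ch': interrupt the CPU only after 'words' words
     * addressed from 'hibi_src' have been stored at 'dst'. */
    static void n2h2_arm_rx(int ch, uint32_t *dst, uint32_t words,
                            uint32_t hibi_src)
    {
        volatile uint32_t *r = &N2H2_BASE[N2H2_RX_CH_BASE + 4 * ch];
        r[0] = (uint32_t)dst;   /* target memory location  */
        r[1] = words;           /* expected amount of data */
        r[2] = hibi_src;        /* sender's HIBI address   */
        r[3] = 1;               /* channel enable          */
    }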
Figure 4 depicts a data transfer over HIBI. CPU1 sends four words to CPU2. On cycle 0, CPU1 gives a start command to the N2H2 DMA. The IRQ is acknowledged in clock cycle 1 and the transfer is started immediately. The address is sent first and then the data. Clock cycles 3–8 are consumed by the HIBI wrapper and arbitration latency. During this delay, another transmission can be proceeding in HIBI, so the latency is hidden. When access to the HIBI is gained, the address and the data are sent forward in clock cycles 9–13. The data propagates through the receiving unit and buffers until, at clock cycle 15, N2H2 sees the address and determines the right channel. Clock cycles 16–19 are used to store the data in the memory of CPU2. After all the expected data has been received, an IRQ is given to CPU2 at clock cycle 21.

4. MPEG-4 SOFTWARE IMPLEMENTATION

One of the key advantages of data parallel encoding methods is that they enable scalability by using macroblock row, macroblock, or block-level image subdivision. Moreover, spatial data parallelization can be performed with vertical, horizontal, rectangular, or arbitrarily shaped slices.

Figure 4: An arbitrary four-word data transfer over HIBI between two CPUs. (The timing diagram shows the transmit, HIBI network, and receive phases of the two N2H2 DMAs across clock cycles 0–21; A denotes the HIBI address, D data, and I an IRQ.)

Figure 5: (a) Motion vector dependency problem in vertical parallelization, where a predictor vector (MV1, MV2, MV3) of a macroblock on the slice boundary may not yet be available on the neighboring processor, and (b) horizontal parallelization for distributed memory machines, where the reconstructed images of neighboring slices overlap by several MB rows, overlaps are exchanged with memory-to-memory transfers, and a video packet header (VPH) starts each slice.

The problem of vertical parallelization, shown on the left side of Figure 5, is that predictive coding is not considered, leading to motion vector (MV) and DQUANT (denoting changes in quantization parameter (QP)) dependency problems [27]. For example, H.263/MPEG-4 vector prediction is performed by computing the median of three neighboring vectors, referred to as MV1, MV2, and MV3 in Figure 5. Due to data-dependent computations and the different shape of slices, computations do not proceed in a synchronized manner in different slices. For this reason, a data dependency problem arises at the slice boundaries where one of the predictor vectors may not be available.

Horizontal spatial partitioning, however, is natural to raster scan macroblock (MB) coding. The right side of Figure 5 depicts our previous implementation on a distributed memory DSP using MB row granularity [27]. The reconstructed images are made to overlap slightly to allow motion vectors to point over slice boundaries. The overlapping areas are also exchanged between processors after local image reconstruction. Prediction dependencies are eliminated by inserting slice headers such as H.263 group-of-block (GOB) headers or MPEG-4 video packet headers (VPH) at the beginning of a slice. Clearly, this results in some overhead in the bit stream, but prediction dependencies are avoided. In addition, interprocessor communication and extra memory are needed to implement the overlapping.

However, a drawback of [27] is a somewhat coarse granularity leading to unbalanced computational loads due to the unequal size of slices. For this reason, the original approach is improved by subdividing images using macroblock granularity as in Figure 6. Interprocessor communication and overlapping are further avoided by exploiting a shared memory in an MPSOC type of platform. The new method is highly scalable since the whole image is assignable to a single processor while the largest configuration dedicates a processor to each MB. No interprocessor communication is needed since data can be read directly from the global memory buffer. The shared memories, however, are potential bottlenecks, and thus efficient techniques for hiding transfer latencies are needed.
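As an illustration of the macroblock-granularity subdivision, the following C sketch divides a frame into contiguous raster-order slices; the names are ours, not taken from the encoder source.

    /* Illustrative macroblock-granularity slice assignment following
     * the scheme of Figure 6. */
    typedef struct { int first_mb; int num_mbs; } Slice;

    /* Divide mbs_total macroblocks into n contiguous raster-order
     * slices; the first (mbs_total % n) slices get one extra MB, so
     * slice sizes differ by at most one macroblock. */
    static void assign_slices(int mbs_total, int n, Slice *out)
    {
        int base  = mbs_total / n;
        int extra = mbs_total % n;
        int next  = 0;
        for (int i = 0; i < n; i++) {
            out[i].first_mb = next;
            out[i].num_mbs  = base + (i < extra ? 1 : 0);
            next += out[i].num_mbs;
        }
    }
    /* QCIF (176 x 144) has 11 * 9 = 99 macroblocks: n = 3 gives three
     * slices of 33 MBs; n = 2 gives slices of 50 and 49 MBs. */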

Figure 6: Horizontal data parallelization for shared memory. (The reconstructed image resides in memory shared by all PEs; each slice begins at a video packet header (VPH) position, which removes the prediction dependencies.)

The data parallelization in Figure 6 was implemented with a master control program and slave MPEG-4 encoders. In addition, a host PC program has been implemented as a user interface. It should be noted that the master and the slave refer to the encoding processors, not, for example, to the Avalon bus master or slave used for Avalon communication. The flow graphs and the synchronization of the SW are depicted in Figure 7, while the implementation and integration details of the master and the slave processors are discussed in Section 5. The master processor controls and synchronizes the encoding. The tasks of the host PC, the master, and the slaves are presented in the following.

4.1. Software implementation

The host PC implements a user interface for inputting encoding parameters to the master. The user interface enables the selection of a video format (resolution and frame rate), bit rate control mode (constant or variable), quantization parameter (QP), as well as the number of slaves used in the encoding. The host PC and the master communicate via a custom UDP/IP-based messaging protocol, which supports its own flow control, retransmissions, packet structures, fragmentation, and assembly. Our messaging protocol allows real-time modification of the frame rate, QP, and bit rate parameters during the encoding.

The tasks of the host PC also include capturing and loading a raw video image, sending the raw data to the master, and decoding the output. Received bits are stored to the local disk for debugging. In addition, the host PC measures statistics such as the average encoding frame rate and bit rate. At any time, the host PC can issue a reinitialization command to stop the encoding, release dynamically allocated SW resources, for example, memory, and return to the initial state. For example, this feature enables changes in the video resolution and the number of slaves without rebooting the platform. Also, prototyping and testability are improved since several video formats can be successively tested by changing the parameters.

The tasks of the master are illustrated in the middle of Figure 7. To encode a frame, the master first waits for the parameters from the host PC. Next, the PC sends the raw image (one frame at a time). The master slices the received image, configures the slaves, and signals them to start the encoding. As the slaves complete, each informs the master that it has finished. After all the slaves have completed encoding, the master finds out the sizes of the bit streams of the slaves, merges the bit streams, and sends the merged bit stream (encoded image) to the PC.

Slave tasks are illustrated on the right of Figure 7. First, the slave waits for the parameters from the master. Then, the slave downloads a local motion estimation (ME) window and the pixels of the corresponding image macroblock and encodes the macroblock. This continues as long as there are macroblocks left to encode in the slice. If the local bit buffer becomes full, the bits are uploaded to the external image memory. After all the macroblocks have been encoded, the slave uploads the bit buffer to the external memory and begins to wait for the next slice.

The video encoder can run in two different modes. First, it can run in real-time, so one frame at a time is transferred back and forth. Second, it can run in buffered mode, where the PC downloads a video sequence to the master. It is encoded as a whole and sent back to the PC. The video sequence length is parameterizable. Buffered mode mimics a situation where, for example, a video camera is attached to the system and feeding the encoder.
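The master's per-frame control flow of Figure 7 can be condensed into the following C-style sketch; the helper functions stand in for the actual N2H2 transfers and signaling primitives and are assumptions made for illustration.

    /* Placeholder primitives; all names are ours. */
    void wait_host_parameters(void);
    void wait_image_upload(void);
    void signal_slice_ready(int cpu);
    void wait_slave_done(int cpu);
    int  get_bit_stream_size(int cpu);
    void merge_bits(int cpu, int size);
    void send_bits_to_host(void);

    void master_encode_frame(int num_slaves)
    {
        wait_host_parameters();           /* parameters from the host PC  */
        wait_image_upload();              /* raw frame into shared SDRAM  */
        for (int cpu = 0; cpu < num_slaves; cpu++)
            signal_slice_ready(cpu);      /* configure and start a slave  */
        for (int cpu = 0; cpu < num_slaves; cpu++) {
            wait_slave_done(cpu);         /* "slice encoded" notification */
            merge_bits(cpu, get_bit_stream_size(cpu));
        }
        send_bits_to_host();              /* merged stream back to the PC */
    }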
5. INTEGRATION

We have now presented a highly scalable hardware platform and a data parallel video encoder. Their integration is presented in the following. The main properties of the Stratix FPGA chip [28] used for this project are given in Table 2. A logic element (LE) contains a four-input look-up table and a flip-flop. A digital signal processing (DSP) block contains multiply-and-accumulate (MAC) blocks. These blocks can also be used as fast embedded multipliers, which are utilized by the processors. Phase-locked loops (PLL) are used to generate different clock frequencies. Embedded memory provides on-chip storage.

Apart from the Stratix 1S40 FPGA, the development board offers two UARTs and an Ethernet connection. It has 8 MB of external Flash memory, 1 MB of external SRAM memory, and 16 MB of external SDRAM memory. Downloading and debugging are done via a JTAG connection.

5.1. Initial constraints for the architecture

The application requires 64 KB local data memories. As this memory is used for the stack and local variables, it needs to be fast on-chip memory. Also, some memory is used for instruction caches. Thus, the limited amount of on-chip RAM in the FPGA bounds the maximum number of processors to four. Optionally, external memories could be used, but the development board does not contain any free, suitable memories, as the external SRAM and external SDRAM are already utilized.

Figure 7: Software flow graphs and task synchronization. (Three flow graphs show the host PC loop of updating parameters, uploading images to the shared memory, and decoding the returned bits; the master loop of waiting for an image, signaling slice-ready to each slave, waiting for slave bits, reading the bit stream sizes, and merging the bit streams; and the slave loop of waiting for parameters and a synchronization event, loading the 48 x 48 pixel local ME window and the pixels of the current MB, MPEG-4 encoding the MB, and uploading bits to the image memory when the local bit buffer is full. A reinitialization command returns all three to their initial states.)

Table 2: Stratix 1S40 FPGA properties.

Feature | Stratix 1S40 contains
Logic elements (LEs) | 41 250
Embedded memory, RAM (bits) | 3 423 744
DSP blocks | 14
PLLs | 12

Therefore, at maximum one master processor and three slaves are used. The amount of LEs in the FPGA is more than sufficient to allow for scalability.

Three different memories are used for the video encoding: the on-chip embedded memory, the external SRAM, and the external SDRAM. The flash memory is only used to configure the FPGA upon power-up. The same encoding software can be used for all slaves, which provides an opportunity to use the same instruction memory for all of them. As the application fits in 512 KB, the 1 MB SRAM memory was divided evenly between the master and the slave processors. To reduce shared memory contention, instruction memory caches were used. Each cache utilizes 8 KB of on-chip memory.

The video encoder was configured in such a way that 64 KB of local data memory is sufficient for each processor. Small buffers and memories, like the 2 KB boot program ROMs, were also assigned to the on-chip memory. The external 16 MB SDRAM was allocated as the frame memory. A custom SDRAM controller was implemented with special DMA functions. General block transfer commands are implemented with an option to support application-specific commands. For example, we have currently implemented a command for automatic square block retrieval, for instance an 8 × 8 block, for the video encoder application. In this way we need to commit only a single transfer instead of eight separate transfers. However, the SDRAM controller is fully reusable and can also be used without the application-specific features. The control interface also supports sequential block writes and reads, increasing the efficiency compared to single write/read operations. The SDRAM DMA is configured with the highest priority messages. However, in practice, using a higher priority does not have a notable effect on performance in this application.

5.2. Configurations of the components

The exact configurations of the processors are shown in Figures 2 and 3. The bus to the external SRAM goes through the master processor in practice, as opposed to the architecture shown in Figure 1. The slaves have no access to the address space of the master. The master, however, can access the slave portion of the memory. That gives an opportunity to reconfigure the slaves at run-time, changing the program on the fly. This feature is not utilized in the presented encoder. The data RAM in all CPUs is dual-port; the alternate port is used by the N2H2. The slave UARTs are multiplexed so that the PC can monitor any of the slaves. Timers are used for benchmarking and profiling the program. The master needs an extra timer to be used with the Ethernet.

HIBI is configured as a 32-bit wide single bus and used with the single-FIFO interface. HIBI uses priority-based arbitration, in which the master processor has the highest priority. Each of the HIBI wrappers has a five-word deep low priority data FIFO and a three-word deep high priority data FIFO. The HIBI bus single burst transfer length is limited to 25 words. Each N2H2 has eight Rx channels, supports 256 separate addresses for channels, and has a maximum packet size of 65536 32-bit words. HIBI bridges are not used, because there is no need for multiple clock domains. The MPEG-4 encoding SW exploits the MPSOC features as explained in the following sections.

5.3. Implementation of master's tasks

The tasks of the master are discussed in Section 4 and illustrated in the middle of Figure 7. All parameterization is performed via the external shared SDRAM. However, since the N2H2 DMA is used to access the shared data memory, SDRAM, one cannot refer to SDRAM via pointers. For example, the C language statement sdramVariable = pSdramAddr[0] is not possible. Instead, data between external and local memories is moved with the help of dedicated software library calls, for example, sdramRead() and sdramWrite().
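The contrast between pointer access and the library-call style can be illustrated as follows; the argument order of sdramRead()/sdramWrite() is not specified in the paper, so the prototypes below are assumptions.

    #include <stdint.h>

    /* Assumed prototypes: move 'words' 32-bit words between local RAM
     * and SDRAM through the N2H2 DMA and the SDRAM controller. */
    void sdramRead(uint32_t sdramAddr, uint32_t *localBuf, int words);
    void sdramWrite(uint32_t sdramAddr, const uint32_t *localBuf, int words);

    void example(uint32_t sdramAddr)
    {
        uint32_t localBuf[256];               /* on-chip staging buffer */
        /* sdramVariable = pSdramAddr[0];  -- not possible: SDRAM is
         *                                    not mapped into the CPU's
         *                                    address space             */
        sdramRead(sdramAddr, localBuf, 256);  /* SDRAM -> local RAM  */
        localBuf[0] ^= 1;                     /* work on local data  */
        sdramWrite(sdramAddr, localBuf, 256); /* local RAM -> SDRAM  */
    }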
The current 64 KB limit of the local data memory, however, presents a more demanding challenge considering that a raw QCIF image takes 37.1 KB. Our solution is to allocate the large memory buffers, such as the currently encoded image, the reconstructed images, and the output bit buffers, in the external SDRAM. Two additional 1 KB buffers are allocated in the on-chip RAM, which are used for processing data from the large buffers a small segment at a time.

For example, bit stream merging is implemented as a three-step process, illustrated in Figure 8, which is repeated in a loop. As long as there are slave bits remaining, the master first reads a small portion of the slave's bit stream into the input buffer A. Second, the master concatenates and shifts the data after the tail of the master's global bit buffer in buffer B. Third, the master writes the result to the SDRAM and resynchronizes buffer B with the updated tail of the global bit buffer. The tail contains the bits that did not form a full 32-bit word. The tail is stored to the SDRAM to keep the buffers synchronized, but the master uses the local copy to avoid unnecessary memory traffic. This allows merging of large, arbitrarily sized bit streams, and realizes a general merging procedure for variable length coded (VLC) streams that are not aligned to byte boundaries.
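A bit-exact merge routine depends on details not given in the paper, but the shift-and-concatenate step can be sketched as follows for an MSB-first stream. The buffer sizes and helper names are illustrative, and updating tailBits from the slave stream's exact bit count is omitted.

    #include <stdint.h>

    #define WORDS_PER_CHUNK 256                    /* one 1 KB buffer */
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    void sdramRead(uint32_t addr, uint32_t *buf, int words);   /* assumed */
    void sdramWrite(uint32_t addr, const uint32_t *buf, int words);

    static uint32_t tailWord;   /* partial last word of the global stream */
    static int      tailBits;   /* number of valid MSBs in tailWord       */
    static uint32_t globalTail; /* SDRAM address of the global stream end */

    void merge_slave_stream(uint32_t slaveBase, int slaveWords)
    {
        uint32_t bufA[WORDS_PER_CHUNK], bufB[WORDS_PER_CHUNK];
        for (int done = 0; done < slaveWords; done += WORDS_PER_CHUNK) {
            int n = MIN(WORDS_PER_CHUNK, slaveWords - done);
            sdramRead(slaveBase + 4 * done, bufA, n);           /* step 1 */
            for (int i = 0; i < n; i++) {                       /* step 2 */
                bufB[i]  = tailWord | (tailBits ? bufA[i] >> tailBits
                                                : bufA[i]);
                tailWord = tailBits ? bufA[i] << (32 - tailBits) : 0;
            }
            sdramWrite(globalTail, bufB, n);                    /* step 3 */
            globalTail += 4 * n;
        }
        /* Updating tailBits from the slave stream's exact bit length and
         * resynchronizing the last partial word are omitted for brevity. */
    }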
Figure 8: Master's bit stream merge task. (1. A segment of slave bits is read from the slave's bit buffer in SDRAM into buffer A in the master's local memory; 2. it is merged after the tail of the global stream in buffer B; 3. the result is written to the global bit buffer in SDRAM.)

A similar buffering scheme is also used in the "send bits" phase of the master, except that no bit shifting is needed since the bit stream is ensured to be byte aligned by the implementation.

5.4. Implementation of slaves' tasks

Our MPEG-4 encoder software is identical to [29] except that all DSP specific optimizations have been omitted. A platform-independent, portable ANSI-C implementation has been used for Nios II. A single program multiple data (SPMD) approach has been used, so the slave programs are identical. The program execution is, however, greatly dependent on the data, so the execution flows differ from one CPU to another. Slave tasks are discussed in Section 4 and illustrated on the right side of Figure 7.

The slaves have to operate under low memory conditions. To circumvent this issue, the ME process is carried out in a 48 × 48 sliding window under the programmer's control, as illustrated in Figure 9(a). The ME window moves in a raster scan pattern centered on the current MB position. An exception is an image boundary, where the window is clipped inside the image. The ME window loading is optimized by packing four consecutive pixels into a 32-bit word. Furthermore, the overlapping of subsequent ME window positions could be used to minimize accesses to the external memory.
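The pixel packing mentioned above can be expressed compactly; the helper name and the byte order within the word are our assumptions.

    #include <stdint.h>

    /* Four consecutive 8-bit pixels packed into one 32-bit word, so a
     * 48-pixel window row moves as 12 words and the whole 48 x 48
     * window as 48 * 48 / 4 = 576 words. */
    static uint32_t pack4(const uint8_t *p)
    {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }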

Figure 9: (a) The 48 × 48 sliding ME window, copied from the old reconstructed image (176 × 144) in the external SDRAM to the Nios II data RAM by the SDRAM DMA and N2H2 DMA over HIBI, with four 8-bit pixels packed into a 32-bit word; (b) two ME window update approaches: choice 1 reloads all MBs of the window at every MB position, while choice 2 minimizes SDRAM access by copying the overlapping window columns locally and loading only the new rightmost column.

Figure 9(b) shows two approaches for updating the ME window. First, one can load a whole window from SDRAM every time the MB position changes, for example, with DMA. In the second approach, only the rightmost column is loaded from SDRAM while the remaining pixels are copied inside the on-chip memory. The data transmissions are executed beforehand in the background in order to minimize the time spent waiting for data to be processed.

The drawback of the first approach is a high SDRAM bandwidth requirement. For example, the ME of a 4CIF video (704 × 576) at 30 frames/s demands 104 MB/s. In comparison, the second approach requires 34 MB/s, but the drawback is that 136 million operations per second (MOPS) are consumed by the load/store operations needed for copying. Considering that current SDRAM is fast, for example, 133 MHz 32-bit SDRAM yields a maximum of 4 bytes ∗ 133 M = 507 MB/s, the first approach is still a viable option due to DMA [27]. Hence, the second approach is well suited for systems with slow SDRAM, while fast SDRAM and DMA can be used to reduce CPU utilization. In this work, we use the first approach because we have an efficient DMA that can be used to minimize the CPU load. When ME is carried out, the rest of the encoding is performed similarly to a nonparallel encoder except that each slave works on a different slice. The slave bit output uses a local/global buffering scheme comparable to Figure 8.
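The quoted bandwidth figures can be checked with a few lines of C; the window and column sizes follow from the 48 × 48 window and the 16-pixel macroblock step.

    #include <stdio.h>

    int main(void)
    {
        const double mbs    = (704.0 / 16) * (576.0 / 16); /* 1584 MBs per 4CIF frame */
        const double fps    = 30.0;
        const double window = 48.0 * 48.0;  /* bytes per MB, full reload   */
        const double column = 16.0 * 48.0;  /* bytes per MB, column update */
        printf("full reload:   %.0f MB/s\n",
               mbs * window * fps / (1 << 20));   /* ~104 MB/s              */
        printf("column update: %.0f MB/s\n",
               mbs * column * fps / (1 << 20));   /* ~35 MB/s (34 in text)  */
        return 0;
    }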
6. RESULTS

The performance of our system is evaluated by measuring the FPGA utilization, the encoding frame rate as a function of the number of slaves, and the complexities of the encoding tasks. All timing and profiling procedures for the measurements are implemented with HW timers running at the CPU clock frequency, 50 MHz. The system was benchmarked in buffered mode, since the main concern is the pure video encoding speed with no Ethernet activity.

The measurements were carried out with two standard QCIF sequences, carphone and news. The encoder was configured to use an IPPP... frame structure, where only the first frame is Intra (I) coded while the rest are motion compensated Inter (P) frames. All tests were done in variable bit rate (VBR) mode, where different bit rates are realized by changing the QP. Video sequence relative input/output frame rates were fixed to 30 frames/s in all test cases.

During a benchmarking run, 120 frames of the selected test sequence were encoded. The average encoding frame rate (FPS_avg) is computed as

FPS_avg(n) = f_cpu / C_frame(n),  (1)

where f_cpu is the CPU frequency and C_frame(n) denotes the average encoding cycles per QCIF frame with n encoding slaves. Due to the presence of data/instruction caches, IRQ processing, and the underlying on-chip network causing deviations in the results, three benchmarking runs were made and their average is reported as the result. Scalability is evaluated by computing the speed-up S(n) as

S(n) = FPS_avg(n) / FPS_avg(1),  (2)

where FPS_avg(n) is the average frame rate of a multislave configuration and FPS_avg(1) is the average frame rate with a single slave. In addition, the parallelization efficiency E(n) is computed as

E(n) = FPS_avg(n) / (n ∗ FPS_avg(1)) ∗ 100%.  (3)
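Expressed in code, the metrics of (1)–(3) are as follows; the per-frame cycle counts used here are illustrative values chosen to roughly match the reported efficiencies, not measured data.

    #include <stdio.h>

    int main(void)
    {
        const double f_cpu = 50e6;  /* 50 MHz CPU clock */
        /* Assumed illustrative cycles per frame for n = 1, 2, 3 slaves */
        const double c_frame[3] = { 13.2e6, 6.8e6, 4.7e6 };
        const double fps1 = f_cpu / c_frame[0];
        for (int n = 1; n <= 3; n++) {
            double fps = f_cpu / c_frame[n - 1];    /* FPS_avg(n), (1) */
            double s   = fps / fps1;                /* S(n),       (2) */
            double e   = 100.0 * fps / (n * fps1);  /* E(n),       (3) */
            printf("n = %d: %.2f frames/s, S = %.2f, E = %.0f%%\n",
                   n, fps, s, e);
        }
        return 0;
    }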
6.1. FPGA utilization

Table 3 shows the FPGA utilization of the MPSOC HW modules. The area consumption is reported in terms of logic elements (LE), and the memory usage is the utilization of the on-chip RAM. The statistics have been obtained by synthesizing the MPSOC with Quartus II 5.0 into a Stratix 1S40. Currently, the maximum frequency (50 MHz) is dictated by the connection to the external SRAM. Total memory and area refer to the maximum capacity of the FPGA chip. The memory figures are determined from the theoretical maximum number of available memory bits. However, if we also count the bits that cannot be used due to the memory block architecture, the memory usage rises to 87% of the available memory resources. Therefore, the memory utilization restricts the system, not the logic utilization. The FIFO buffers in the system have been implemented with on-chip RAM. We have also implemented a FIFO using LE flip-flops as data storage. Thus, we can optionally save the on-chip RAM and use the spare LEs instead.

Table 3: FPGA utilization statistics.

HW module | Module count | Total mem [KB] | % of total mem | Module area [LE] | Total area [LE] | % of LEs
Master Nios | 1 | 80.8 | 19.3 | 2 720 | 2 720 | 6.6
Slave Nios II | 3 | 225.1 | 53.9 | 2 324 | 6 972 | 16.9
N2H2 DMA | 4 | 0 | 0 | 1 894 | 7 576 | 18.4
HIBI network | 1 | 0.4 | 0.1 | 8 506 | 8 506 | 20.6
SDRAM DMA | 1 | 0.3 | 0.1 | 3 205 | 3 205 | 7.8
Utilization | | 306.6 | 73.3 | | 28 979 | 70.2

Figure 10: Frame rates for sequence carphone.qcif (176 × 144) with one to three encoding slaves: the ideal speed-up line and measured rates for QP = 25 (26 kbps @ 30 fps), QP = 12 (65 kbps @ 30 fps), and QP = 4 (301 kbps @ 30 fps).

Figure 11: Frame rates for sequence news.qcif (176 × 144) with one to three encoding slaves: the ideal speed-up line and measured rates for QP = 25 (21 kbps @ 30 fps), QP = 12 (52 kbps @ 30 fps), and QP = 4 (199 kbps @ 30 fps).

LE resources are abundant and have been exploited in the architecture. For example, in the SDRAM controller there are four read and four write ports, ensuring that no CPU has to wait unnecessarily. N2H2 has eight channels to provide flexibility for the software programmer. There are spare LEs on the FPGA, since only 70% have been utilized.

6.2. Encoding speed and software scalability

Figures 10 and 11 present the average encoding frame rates as a function of the QP and the number of slaves for the carphone and news QCIF sequences. The bit rates are reported relative to the fixed 30 frames/s sequence output rate. The straight lines depict an ideal speed-up, which is obtained by multiplying the frame rate of a single slave by the number of slaves. The frame rates are measured from the master's main encoding loop.

As scalability was one of our main objectives, the results indicate very good success. The parallelization efficiency of carphone using two slaves is within 97% of the ideal result. If we further increase the number of slaves to three, the parallelization efficiency is 93%. As the current FPGA restricts the number of processors to four (one master and three slaves), we estimate the performance of larger configurations in the following.

6.2.1. Performance estimation for larger configurations

The complexity of the image encoding task depends on the slice encoding times as well as the overhead of the master. This information yields

C_img(x, y, n, C_mb) = C_slice(x, y, n, C_mb) + C_master(n),  (4)

where C_img is the number of clock cycles required to encode a frame, x and y are the width and height of the luma image in pixels, n is the number of encoding processors, and C_mb is the average number of clock cycles required to encode a macroblock. The term C_master denotes the master's overhead resulting from the sequentially computed parts. C_slice represents the parallelized computations and is the number of clock cycles required to encode the largest slice in the system. Mathematically, C_slice is computed as

C_slice = ⌈(⌈x/16⌉ ∗ ⌈y/16⌉) / n⌉ ∗ C_mb,  (5)

where the inner rounding for x and y takes care that the image is always divisible into macroblocks, that is, x and y do not need to be divisible by 16. The outer rounding finds the size of the largest slice in case the number of macroblocks is not evenly divisible by the number of processors.

The master's overhead results from four subfunctions, which can be combined as

C_master(n) = C_config(n) + C_getBitStreamSize(n) + C_merge(n) + C_oth(n),  (6)

where C_config is due to the configuration of the encoding parameters for the slaves, C_getBitStreamSize results from reading the sizes of the slave bit streams from SDRAM, C_merge is the number of clock cycles due to merging the slave bit streams, and C_oth is related to the IRQs of the master and internal state management. It is pointed out that for an optimized system, all Ethernet related tasks are omitted. The measured average clock cycles for the aforementioned subfunctions are presented in Table 4 as a function of the number of encoding processors. In Table 4, f identifies the subfunction.

Table 4: Measured clock cycles for the master's subfunctions.

f (subfunction) | 1 slave CPU | 2 slave CPUs | 3 slave CPUs
Merge | 32876.4150 | 38009.8750 | 43160.7267
Config | 2821.6367 | 5313.8333 | 7878.8450
GetBitStreamSize | 1264.2967 | 2517.6617 | 3772.6450
Oth | 117016.7367 | 117465.5550 | 117982.0483

For the mathematical model, it is necessary to model the growth of the master task complexity as a function of n. The complexity change is illustrated in Figure 12, which is plotted using the values in Table 4. For each subfunction, a curve was obtained by plotting the clock cycles with n encoding processors divided by the clock cycles required with one encoding CPU.

Figure 12: Growth rate of the complexity of the master's subtasks (normalized execution time versus the number of encoding slaves for the bits descriptor read, slave configuration loop, bits merge, and other subtasks).

The results in Figure 12 show a linear increase in complexity for all subfunctions of the master. Therefore, the complexity of each subfunction as a function of n is approximated with

C_f(n) = (a(f) ∗ n + b(f)) ∗ c(f),  f ∈ {merge, config, getBitStreamSize, oth},  (7)

where a is the slope of the line (the gradient), b is the intercept on the vertical axis in Figure 12, and c is the number of clock cycles of the subfunction with one encoding processor. In practice, the clock cycles of one encoding processor are scaled with the linear model to obtain a prediction for an n-CPU system. The subfunction specific parameters for (7) are presented in Table 5.

Table 5: Parameterization of the linear equations for complexity modeling.

f (subfunction) | a(f) | b(f) | c(f) (CPU cycles)
Merge | 0.1564 | 0.8436 | 32876.4150
Config | 0.8961 | 0.1039 | 2821.6367
GetBitStreamSize | 0.9920 | 0.0080 | 1264.2967
Oth | 0.0041 | 0.9959 | 117016.7367

Due to the simultaneous access to the shared data memory at the beginning of each frame encoding, the slave's start-up latency, that is, the time to get enough data to start processing, also increases as the number of slaves increases. This time is not included in the estimate. Each slave requires one motion estimation window, 2560 bytes (640 words), to start processing. It can be assumed that this amount can be transferred from SDRAM to a CPU in around 1000 cycles. Thus, since the frame encoding time is millions of clock cycles, the impact is quite insignificant.

Finally, the encoding frame rate estimate for the MPEG-4 MPSOC system is computed with

FPS_MPSOC(x, y, n, C_mb, f_cpu) = f_cpu / C_img(x, y, n, C_mb),  (8)

where f_cpu is the clock frequency of 50 MHz. With benchmarking it was found that C_mb is on average 133394.8 clock cycles per macroblock for carphone if a QP value of 12 is used.
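Equations (4)–(8) and the Table 5 parameters combine into a small C program; this is our restatement of the model, with ceil() implementing the rounding in (5).

    #include <math.h>
    #include <stdio.h>

    /* C_master(n) of (6)-(7) with the a, b, c parameters of Table 5. */
    static double c_master(double n)
    {
        return (0.1564 * n + 0.8436) * 32876.4150    /* merge            */
             + (0.8961 * n + 0.1039) * 2821.6367     /* config           */
             + (0.9920 * n + 0.0080) * 1264.2967     /* getBitStreamSize */
             + (0.0041 * n + 0.9959) * 117016.7367;  /* oth              */
    }

    /* Equations (4), (5), and (8): predicted frame rate for n slaves. */
    static double fps_mpsoc(int x, int y, int n, double c_mb, double f_cpu)
    {
        double mbs     = ceil(x / 16.0) * ceil(y / 16.0); /* MBs per frame */
        double c_slice = ceil(mbs / n) * c_mb;            /* largest slice */
        return f_cpu / (c_slice + c_master(n));
    }

    int main(void)
    {
        /* QCIF carphone, QP = 12: C_mb = 133394.8 cycles, f_cpu = 50 MHz */
        for (int n = 1; n <= 32; n *= 2)
            printf("n = %2d: %4.1f frames/s\n",
                   n, fps_mpsoc(176, 144, n, 133394.8, 50e6));
        return 0;
    }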
Figure 13: Estimated frame rate for an n-processor MPSOC system (estimated FPS against the ideal speed-up for up to 34 encoding slaves).

Figure 13 presents the predicted encoding frame rate of the optimized MPSOC as a function of n for the QCIF video format. The values are obtained with (8) using the parameters in Table 5. The system scales nearly linearly when n is smaller than 12. After 12 encoding processors, the complexity of the master's sequential overhead starts to grow faster than the benefit of data parallelization, and the frame rate saturates after 24 slaves. The small variation at large n is due to the unbalanced sizes of the slices.

Figure 14: Ideal, measured, and estimated frame rates for carphone.qcif (176 × 144) with QP = 12.

One solution to smooth out the variations would be to use a finer subdivision granularity, for example, 8 × 8 or 4 × 4 blocks, but this is impractical from an implementation point of view since a macroblock would span two processors. In practice, the advanced horizontal parallelization scales better than the row-wise approach used in [27] due to its finer granularity.

Figure 14 shows that the model corresponds well to the real, measured performance. The estimate is slightly lower than the measured frame rates, with the maximum error being about 4%. The model is applicable to QCIF video streams and also to larger formats by changing the measured execution times appropriately. It takes into account the number of macroblocks, and as the number of processors increases, the sizes of the slices may not be equal in terms of the number of macroblocks, resulting in computational unbalance. However, if we use a larger video size, for example CIF, the number of macroblocks also increases. Therefore, with larger video sizes, we can benefit from having more processors than is currently practical for QCIF.

6.3. Relative complexity of encoding tasks

The complexities of the different CPU tasks show how computing power is shared within one processor. For the master processor, up to 96% of the time is spent waiting for the slaves to complete. In buffered mode, frame transmissions over the Ethernet are not executed until the whole video sequence is encoded and, thus, this time is not included in the utilization figures. In this case, the master operates as a state machine. However, the master processor is designed to handle all the external tasks that are not directly related to the video encoding. This can include I/O handling (e.g., an Ethernet protocol stack), audio coding, and user interfaces, like changing the quantization parameters at run-time. The greatest requirement is a fast response time for the user interface processor. Double buffering could also be used.

Figure 15 illustrates how the execution time is divided when one slave processor encodes one macroblock. Motion estimation (ME) is by far the most computationally challenging task. Other time consuming tasks are interpolation, quantization (Q), inverse quantization (IQ), DCT, and IDCT. The time for MasterWait + Poll is consumed by waiting for the master to collect and merge the bit stream, deliver a new raw slice, and provide the MPEG-4 coding parameters for the next frame.

6.4. Hardware scalability

HIBI offers good scalability. For this architecture, adding or removing processors is simple; it only takes minutes to parameterize the system and prepare it for synthesis. HIBI also offers a convenient way to add new IP components, for example, new processors or hardware accelerators. It is possible to add new features to the system without altering the video encoding. A simple example is the addition of a hardware utilization monitor. The addition does not require any changes to the encoding, and it is easy to plug into the system by just adding one HIBI wrapper to the interconnection. Encouraging preliminary results have been obtained from attaching a hardware accelerator (DCT, IDCT, quantizer, and inverse quantizer) to the system (around a couple of FPS increase).

The utilization of the HIBI bus is measured to be only 3%. Therefore, the interconnection is not likely to become a bottleneck even with larger configurations. From a performance point of view, the interconnection architecture should be as invisible as possible in hiding the data transmission time from the CPU.

The speed-up gained by adding processors (e.g., Figure 10) shows that the interconnection architecture performs well. Figure 15 shows that only 4% (SDRAM config + comm wait) of the CPU encoding time is spent on data transmissions and waiting for data.

As the HIBI utilization is low, it is expected that the shared data memory (SDRAM) will not become the bottleneck in the future. We have determined that the interconnection is capable of transmitting the required data for a CIF image, which is four times larger, at 25 frames/s using a clock frequency of 28 MHz without forming a significant performance bottleneck. The application can perform data fetches from memory in parallel with computation. Therefore, memory latencies can be tolerated.

Shared instruction memory utilization, via the Avalon bus, is at most 20% of the available bandwidth. We have investigated the effect of the shared instruction memory on performance, and preliminary results indicate that it is currently negligible with respect to the total frame encoding time. However, four processors could share one instruction memory port. The absolute maximum for sufficient performance [30] is ten processors per port.

6.5. Comparison of scalability to related work

In Figure 16, the scalability of different video encoders is presented. The optimal situation would be a gain of 100% parallel efficiency, that is, a speed-up of 2.0 for 2 CPUs and 3.0 for 3 CPUs. Of the four general categories of parallelization, results from three are presented. No publications were found that describe a functional parallel encoder that scales with a varying number of processors.

Figure 15: Relative complexities of the encoding tasks on one slave CPU: ME (SAD) 47.23%, interpolation 10.26%, Q + IQ 9.02%, MasterWait + Poll 6.88%, DCT + IDCT 6.76%, block sub/add 3.47%, others 3.46%, intra/inter decision 3.42%, SDRAM config. 3.39%, dcPred 2.97%, vlcMB 1.6%, comm. wait 0.88%, copy 16 × 16 0.59%, and intra LD/ST 0.06%.

Nang and Kim [8], Agi and Jagannathan [9], and Barbosa et al. [10] use temporal parallelism. Nang et al. have achieved great parallelization efficiency. However, as temporal parallelism needs frames to be buffered (increasing the latency), it is not considered low-latency real-time video encoding. In our work we present a data parallel application. Similar works are Peng and Zhao [11], Akramullah et al. [12], and Yung and Leung [13]. One work based on video-object parallelism, which is closely related to data parallelism, is presented by He et al. [14].

The results from related research are frequently presented as speed-up or frame rate curves, so the exact numerical values cannot be obtained. We have reproduced the curves in Figure 16 as accurately as possible. Results from Nang and Kim, Agi and Jagannathan, He et al., and Yung and Leung are plotted. Barbosa et al. have given results for three cases; the result for 16 CPUs is estimated from the sketched figure. Our results are from the model described earlier. Results for Peng and Zhao and for Akramullah et al. are plotted from the given numerical figures.

Figure 16: Speed-ups of different video encoder implementations (overall speed-up versus the number of encoding processors, 1–16). Plotted curves: Nang, temporal parallel MPEG-1; Agi, temporal parallel MPEG-1; Barbosa, temporal parallel MPEG-1; Yung, data parallel H.261 (S2/ver); Akramullah, data parallel MPEG-2 (rect); He, VO parallel MPEG-4; Peng, data parallel H.264 (hor); this work, data parallel MPEG-4 (hor).

Figure 16 shows that the scalability of our implementation is the highest of all the data parallel implementations. That implies both good parallelization efficiency achieved with software and an efficient hardware architecture.

7. CONCLUSIONS

A highly scalable MPSOC architecture with an MPEG-4 encoder has been presented. The parallelization efficiency of the application, when the number of encoding processors is increased from one to two, is 97%, and to three, 93%. No other real-time video encoder was found with such high scalability. Our benefit is due to both a well-designed architecture and a well-designed application. The architecture efficiently hides the data transmissions from the processors. The software takes full advantage of the parallelism by elegantly sharing the encoding load. The software does not need any changes when the number of processors is altered; thus, the scaling effort is very low.
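To illustrate why no software changes are needed, the following is a minimal sketch of how a horizontal slice division can be derived from a slave's ID alone; the helper slice_bounds and the even row-split policy are hypothetical illustrations, not the encoder's actual code.

```c
#include <stdio.h>

/* Hypothetical helper: divide mb_rows macroblock rows as evenly as
 * possible among n_slaves slave CPUs. Each slave computes its own
 * slice from its ID, so the same binary scales to any CPU count. */
static void slice_bounds(int mb_rows, int n_slaves, int id,
                         int *first_row, int *num_rows)
{
    int base  = mb_rows / n_slaves;    /* rows every slave receives */
    int extra = mb_rows % n_slaves;    /* remainder, one per low ID */
    *num_rows  = base + (id < extra ? 1 : 0);
    *first_row = id * base + (id < extra ? id : extra);
}

int main(void)
{
    /* QCIF is 176 x 144 pixels, that is, 11 x 9 macroblocks. */
    for (int id = 0; id < 3; id++) {
        int first, rows;
        slice_bounds(9, 3, id, &first, &rows);
        printf("slave %d: MB rows %d..%d\n", id, first, first + rows - 1);
    }
    return 0;
}
```

With three slaves, each receives three of the nine macroblock rows of a QCIF frame; changing the processor count changes only the arithmetic, not the code, which matches the low scaling effort reported above.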
The scalability and flexibility of the MPSOC architecture are gained by using the HIBI on-chip network and soft-core processors in a plug-and-play fashion. The performance and area usage can be flexibly traded off by changing the number of processors. It takes only minutes to change the number of processors, after which the new system is ready to be synthesized. Since the architecture is implemented in an FPGA, the amount of on-chip memory becomes a limiting factor.

In the future, platform flexibility will be demonstrated with the use of different soft-core processors and hardware accelerators. The connection to the external instruction memory will be replaced with an HIBI interface to achieve better clock frequencies. Furthermore, scalability will be further evaluated with larger FPGA chips and by connecting several of them together to form a larger system.
REFERENCES

[1] E. Salminen, A. Kulmala, and T. D. Hämäläinen, "HIBI-based multiprocessor SoC on FPGA," in IEEE International Symposium on Circuits and Systems (ISCAS '05), pp. 3351–3354, Kobe, Japan, May 2005.
[2] O. Lehtoranta, E. Salminen, A. Kulmala, M. Hännikäinen, and T. D. Hämäläinen, "A parallel MPEG-4 encoder for FPGA based multiprocessor SoC," in 15th International Conference on Field Programmable Logic and Applications (FPL '05), pp. 380–385, Tampere, Finland, August 2005.
[3] Altera Corporation, Nios Development Board: Reference Manual, Stratix Professional Edition, Ver. 1.1, July 2003.
[4] E. Salminen, T. Kangas, J. Riihimäki, V. Lahtinen, K. Kuusilinna, and T. D. Hämäläinen, "HIBI v.2 communication network for system-on-chip," in Computer Systems: Architectures, Modeling, and Simulation, A. D. Pimentel and S. Vassiliadis, Eds., vol. 3133 of Lecture Notes in Computer Science, pp. 412–422, Springer, Berlin, Germany, July 2004.
[5] P. Kuhn, Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation, Kluwer Academic, Dordrecht, The Netherlands, 1999.
[6] I. Ahmad, Y. He, and M. L. Liou, "Video compression with parallel processing," Parallel Computing, vol. 28, no. 7-8, pp. 1039–1078, 2002.
[7] O. Cantineau and J.-D. Legat, "Efficient parallelisation of an MPEG-2 codec on a TMS320C80 video processor," in IEEE International Conference on Image Processing (ICIP '98), vol. 3, pp. 977–980, Chicago, Ill, USA, October 1998.
[8] J. Nang and J. Kim, "An effective parallelizing scheme of MPEG-1 video encoding on Ethernet-connected workstations," in Proceedings of the Conference on Advances in Parallel and Distributed Computing (APDC '97), pp. 4–11, Shanghai, China, March 1997.
[9] I. Agi and R. Jagannathan, "A portable fault-tolerant parallel software MPEG-1 encoder," Multimedia Tools and Applications, vol. 2, no. 3, pp. 183–197, 1996.
[10] D. M. Barbosa, J. P. Kitajima, and W. Meira Jr., "Parallelizing MPEG video encoding using multiprocessors," in Proceedings of the XII Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI '99), pp. 215–222, Sao Paulo, Brazil, October 1999.
[11] Q. Peng and Y. Zhao, "Study on parallel approach in H.26L video encoder," in 4th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT '03), pp. 834–837, Chengdu, China, August 2003.
[12] S. M. Akramullah, I. Ahmad, and M. L. Liou, "Performance of software-based MPEG-2 video encoder on parallel and distributed systems," IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 4, pp. 687–695, 1997.
[13] N. H. C. Yung and K.-K. Leung, "Spatial and temporal data parallelization of the H.261 video coding algorithm," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 1, pp. 91–104, 2001.
[14] Y. He, I. Ahmad, and M. L. Liou, "MPEG-4 based interactive video using parallel processing," in International Conference on Parallel Processing (ICPP '98), pp. 329–336, Minneapolis, Minn, USA, August 1998.
[15] M. J. Garrido, C. Sanz, M. Jiménez, and J. M. Menasses, "An FPGA implementation of a flexible architecture for H.263 video coding," IEEE Transactions on Consumer Electronics, vol. 48, no. 4, pp. 1056–1066, 2002.
[16] R. Kordasiewicz and S. Shirani, "Hardware implementation of the optimized transform and quantization blocks of H.264," in Canadian Conference on Electrical and Computer Engineering (CCECE '04), vol. 2, pp. 943–946, Niagara Falls, Ontario, Canada, May 2004.
[17] Q. Peng and J. Jing, "H.264 codec system-on-chip design and verification," in 5th International Conference on ASIC (ASICON '03), vol. 2, pp. 922–925, Beijing, China, October 2003.
[18] Y.-L. Lin, C.-P. Young, Y.-J. Chang, Y.-H. Chung, and A. W. Y. Su, "Versatile PC/FPGA based verification/fast prototyping platform with multimedia applications," in Proceedings of the 21st IEEE Instrumentation and Measurement Technology Conference (IMTC '04), vol. 2, pp. 1490–1495, Como, Italy, May 2004.
[19] M. Martina, A. Molino, and F. Vacca, "FPGA system-on-chip soft IP design: a reconfigurable DSP," in Proceedings of the 45th Midwest Symposium on Circuits and Systems (MWSCAS '02), vol. 3, pp. 196–199, Tulsa, Okla, USA, August 2002.
[20] X. Wang and S. G. Ziavras, "Parallel direct solution of linear equations on FPGA-based machines," in IEEE International Parallel & Distributed Processing Symposium (IPDPS '03), pp. 113–120, Nice, France, April 2003.
[21] Altera Corporation, "Avalon Interface Specification," Ver. 2.4, January 2004.
[22] H. Kalte, D. Langen, E. Vonnahme, A. Brinkmann, and U. Rückert, "Dynamically reconfigurable system-on-programmable-chip," in 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing (PDP '02), pp. 235–242, Canary Islands, Spain, January 2002.
[23] OCP-IP Alliance, "Open Core Protocol specification," Release 2.0, 2003.
[24] Altera Corporation, "Nios 3.0 CPU datasheet," October 2004.
[25] Altera Corporation, "Nios II Processor Reference Handbook," May 2005.
[26] Altera Corporation, "Nios II," site visited 28.11.2005, http://www.altera.com/products/ip/processors/nios2/ni2-index.html.
[27] O. Lehtoranta, T. D. Hämäläinen, V. Lappalainen, and J. Mustonen, "Parallel implementation of video encoder on quad DSP system," Microprocessors and Microsystems, vol. 26, no. 1, pp. 1–15, 2002.
[28] Altera Corporation, "Stratix Device Handbook," January 2005.
[29] O. Lehtoranta and T. D. Hämäläinen, "Feasibility study of a real-time operating system for a multichannel MPEG-4 encoder," in Multimedia on Mobile Devices, vol. 5684 of Proceedings of SPIE, pp. 292–299, San Jose, Calif, USA, January 2005.
[30] A. Kulmala, E. Salminen, O. Lehtoranta, T. D. Hämäläinen, and M. Hännikäinen, "Impact of shared instruction memory on performance of FPGA-based MP-SoC video encoder," in 9th IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems (DDECS '06), pp. 59–64, Prague, Czech Republic, April 2006.

Ari Kulmala received the M.S. degree in computer science from the Tampere University of Technology (TUT), Finland, in August 2005. He is currently pursuing the Ph.D. degree and working as a Research Scientist in the DACI Research Group in the Institute of Digital and Computer Systems at TUT. His research interests include system-on-chip architectures, interconnections, and low-power design.

Olli Lehtoranta was born in Vantaa, Finland, on the 2nd of June 1976. He received his M.S. degree with distinction in computer science from Tampere University of Technology in November 2001. In 1999, he joined the Institute of Digital and Computer Systems, where he currently works towards his Ph.D. degree. He has authored two journal papers and nine conference papers in refereed scientific publications. His current research interests include parallel processing for video encoding, error-resilient video encoding methods, computational complexity analysis of video encoding algorithms, and low-level assembly optimization of video codecs for media instruction set architectures (ISA) as well as DSPs.

Timo D. Hämäläinen received the M.S. degree in 1993 and the Ph.D. degree in 1997, both from Tampere University of Technology (TUT). He acted as a Senior Research Scientist and Project Manager at TUT during 1997–2001. He was appointed Full Professor at TUT, Institute of Digital and Computer Systems, in 2001. He heads the DACI Research Group that focuses on three main lines: wireless local area networking and wireless sensor networks, high-performance DSP/HW-based video encoding, and interconnection networks with design flow tools for heterogeneous SoC platforms.

Marko Hännikäinen received the M.S. degree in 1998 and the Ph.D. degree in 2002, both from Tampere University of Technology (TUT). Currently he acts as a Senior Research Scientist in the Institute of Digital and Computer Systems at TUT, and a Project Manager in the DACI Research Group. His research interests include wireless local and personal area networking, wireless sensor and ad hoc networks, and novel web services.
