


Rank Metric Decoder Architectures for Random Linear Network Coding With Error Control

Ning Chen, Member, IEEE, Zhiyuan Yan, Senior Member, IEEE, Maximilien Gadouleau, Member, IEEE, Ying Wang, Member, IEEE, and Bruce W. Suter, Senior Member, IEEE

Abstract: While random linear network coding is a powerful tool for disseminating information in communication networks, it is highly susceptible to errors caused by various sources. Due to error propagation, errors greatly deteriorate the throughput of network coding and seriously undermine both reliability and security of data. Hence, error control for network coding is vital. Recently, constant-dimension codes (CDCs), especially Kötter–Kschischang (KK) codes, have been proposed for error control in random linear network coding. KK codes can also be constructed from Gabidulin codes, an important class of rank metric codes. Rank metric decoders have been recently proposed for both Gabidulin and KK codes, but they have high computational complexities. Furthermore, it is not clear whether such decoders are feasible and suitable for hardware implementations. In this paper, we reduce the complexities of rank metric decoders and propose novel decoder architectures for both codes. The synthesis results of our decoder architectures for Gabidulin and KK codes with limited error-correcting capabilities over small fields show that our architectures not only are affordable, but also achieve high throughput.

Index Terms: Constant-dimension codes (CDCs), decoding, error correction coding, Gabidulin codes, Galois fields, integrated circuits, Kötter–Kschischang (KK) codes, network coding, rank metric codes, subspace codes.
I. INTRODUCTION

NETWORK coding [1] is a promising candidate for a new unifying design paradigm for communication networks, due to its advantages in throughput and robustness to network failures. Hence, network coding is already used or considered in gossip-based data dissemination, 802.11 wireless ad hoc networking, peer-to-peer networks, and mobile ad hoc networks (MANETs).

Random linear network coding (RLNC) [2] is arguably the most important class of network coding. RLNC treats all packets as vectors over some finite field and forms an outgoing packet by linearly combining incoming packets using random coefficients. Due to its random linear operations, RLNC not only achieves network capacity in a distributed manner, but also provides robustness to changing network conditions. Unfortunately, it is highly susceptible to errors caused by various reasons, such as noise, malicious or malfunctioning nodes, or insufficient min-cut [3]. Since linearly combining packets results in error propagation, errors greatly deteriorate the throughput of network coding and seriously undermine both reliability and security of data. Thus, error control for random linear network coding is critical.

Error control schemes proposed for RLNC assume two types of transmission models. The schemes of the first type (see, for example, [4]) depend on and take advantage of the underlying network topology or the particular linear network coding operations performed at various network nodes. The schemes of the second type [3], [5] assume that the transmitter and receiver have no knowledge of such channel transfer characteristics. The two transmission models are referred to as coherent and noncoherent network coding, respectively.
It has been recently shown [3] that an error control code for noncoherent network coding, called a subspace code, is a set of subspaces (of a vector space), and information is encoded in the choice of a subspace as a codeword; a set of packets that generate the chosen subspace is then transmitted [3]. A subspace code is called a constant-dimension code (CDC) if its subspaces are of the same dimension. CDCs are of particular interest since they lead to simplified network protocols due to the fixed dimension. A class of asymptotically optimal CDCs has been proposed in [3], and this class is referred to as the KK codes. A decoding algorithm based on interpolation for bivariate linearized polynomials is also proposed in [3] for the KK codes. It was shown that KK codes correspond to lifting [5] of Gabidulin codes [6], [7], which are a class of optimal rank metric codes. Gabidulin codes are also called maximum rank distance (MRD) codes, since they achieve the Singleton bound in the rank metric [6], as Reed–Solomon (RS) codes achieve the Singleton bound of Hamming distance. Due to the connection between Gabidulin and KK codes, the decoding of KK codes can be viewed as generalized decoding of Gabidulin codes, which involves deviations as well as errors and erasures [5]. Gabidulin codes are significant in themselves: for coherent network coding, the error correction capability of error control schemes is succinctly described by the rank metric [8]; thus, error control codes for coherent network coding are essentially rank metric codes.

Manuscript received March 29, 2010; revised September 10, 2010; accepted October 17, 2010. Date of publication January 06, 2011; date of current version January 18, 2012. This work was supported in part by Thales Communications, Inc., a summer extension grant from Air Force Research Lab, and the National Science Foundation under Grant ECCS-0925890. This paper was presented in part at the IEEE Workshop on Signal Processing Systems, Tampere, Finland, October 2009. N. Chen is with SandForce, Inc., Saratoga, CA 95070 USA (e-mail: nchen@sandforce.com). Z. Yan is with the Department of Electrical and Computer Engineering, Lehigh University, Bethlehem, PA 18015 USA (e-mail: yan@lehigh.edu). M. Gadouleau is with the Department of Computer Science, Queen Mary, University of London, London E1 4NS, U.K. (e-mail: mgadouleau@eecs.qmul.ac.uk). Y. Wang is with the New Jersey Research Center, QUALCOMM, Inc., Bridgewater, NJ 08807 USA (e-mail: aywang11@gmail.com). B. W. Suter is with the Air Force Research Laboratory, Rome, NY 13441 USA (e-mail: bruce.suter@rl.af.mil). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2010.2096239

The benefits of network coding above come at the price of additional operations needed at the source nodes for encoding, at

the intermediate nodes for linear combining, and at the destination node(s) for decoding. In practice, the decoding complexities at destination nodes are much greater than the encoding and combining complexities. The decoding complexities of RLNC are particularly high when large underlying fields are assumed and when additional mechanisms such as error control are accounted for. Clearly, the decoding complexities of RLNC are critical to both software and hardware implementations. Furthermore, area/power overheads of their VLSI implementations are important factors in system design. Unfortunately, prior research efforts have mostly focused on theoretical aspects of network coding, and complexity reduction and efficient VLSI implementation of network coding decoders have not been sufficiently investigated so far. For example, although the decoding complexities of Gabidulin and KK codes were analyzed in [9], [10] and [3], [5], respectively, they do not reflect the impact of the size of the underlying finite fields. To ensure high probability of success for RLNC, a field of size 2^8 or 2^16 is desired [11]. However, these large field sizes will increase decoding complexities and hence complicate hardware implementations. Finally, to the best of our knowledge, hardware architectures for these decoders have not been investigated in the open literature.

In this paper, we fill this significant gap by investigating complexity reduction and efficient hardware implementation for decoders in RLNC with error control. This effort is significant to the evaluation and design of network coding for several reasons. First, our results evaluate the complexities of decoders for RLNC as well as the area, power, and throughput of their hardware implementations, thereby helping to determine the feasibility and suitability of network coding for various applications. Second, our research results provide instrumental guidelines to the design of network coding from the perspective of complexity as well as hardware implementation. Third, our research results lead to efficient decoders and hence reduce the area and power overheads of network coding.

In this paper, we focus on the generalized Gabidulin decoding algorithm [5] for the KK codes and the modified Berlekamp–Massey decoding algorithm in [12] for Gabidulin codes, for two reasons. First, compared with the decoding algorithm in [3], the generalized Gabidulin decoding [5] has a smaller complexity, especially for high-rate KK codes [5]. Second, components in the errors-only Gabidulin decoding algorithm in [12] can be easily adapted in the generalized Gabidulin decoding of KK codes. Thus, among the decoding algorithms for Gabidulin codes, we focus on the decoding algorithm in [12].

Although we focus on RLNC with error control in this paper, our results can be easily applied to RLNC without error control. For RLNC without error control, the decoding complexity is primarily due to inverting of the global coding matrix via Gauss–Jordan elimination, which is also considered in this paper.

Our main contributions include several algorithmic reformulations that reduce the computational complexities of decoders for both Gabidulin and KK codes. Our complexity-saving algorithmic reformulations are given here.

• We first adopt normal basis representations for all finite field elements and then significantly reduce the complexity of bit-parallel normal basis multipliers by using our common subexpression elimination (CSE) algorithm [13];
• The decoding algorithms of both Gabidulin and KK codes involve solving key equations. We adapt the inversionless Berlekamp–Massey algorithm (BMA) in [14], [15] to solving key equations for rank metric codes. Our inversionless BMA leads to reduced complexities as well as efficient architectures;
• The decoding algorithm of KK codes requires that the input be arranged in a row reduced echelon (RRE) form [16]. We define a more generalized form called the n-RRE form, and show that it is sufficient if the input is in the n-RRE form. This change not only reduces the complexity of reformulating the input, but also enables parallel processing of decoding KK codes based on Cartesian products.

Another main contribution of this paper is efficient decoder architectures for both Gabidulin and KK codes. Aiming to reduce the area and to improve the regularity of our decoder architectures, we have also reformulated other steps in the decoding algorithm. To evaluate the performance of our decoder architectures for Gabidulin and KK codes, we implement our decoder architecture for two rate-1/2 Gabidulin codes and their corresponding KK codes. Our KK decoders can be used in network coding with various packet lengths by Cartesian product [5]. The synthesis results of our decoders show that our decoder architectures for Gabidulin and KK codes over small fields with limited error-correcting capabilities not only are affordable, but also achieve high throughput. Our decoder architectures and implementation results are novel to the best of our knowledge.

The decoders considered in this work are bounded distance decoders, and their decoding capability is characterized in [5, Theorem 11]. The thrust of our work is to reduce complexities and to devise efficient architectures for such decoders, while maintaining their decoding capability. To this end, our reformulations of the decoding algorithms do not affect the decoding capability of the bounded distance decoders of Gabidulin and KK codes. The error performance of the bounded distance decoders has been investigated in our previous works [17]–[19]. Hence, despite its significance, a detailed error performance analysis is beyond the scope of this paper, and we do not include it due to limited space.

The remainder of this paper is organized as follows. After briefly reviewing the background in Section II, we present our complexity-saving algorithmic reformulations and efficient decoder architectures in Sections III and IV, respectively. In Section V, the proposed architectures are implemented in Verilog and synthesized for area/performance evaluation. The conclusion is given in Section VI.

II. PRELIMINARIES

A. Notation

Let q denote a power of a prime and GF(q) denote a finite field of order q. We use I, GF(q)^n, and GF(q)^{N x n} to denote an identity matrix, an n-dimensional vector space over GF(q), and the set of all N x n matrices over GF(q), respectively. For a set U of column indices, U^c denotes the complement set, and B^U denotes the columns of a matrix B indexed by U. In this paper, all vectors and matrices are in bold face.

The rank weight of a vector over GF(q^m) is defined as the maximal number of its coordinates that are linearly independent over

the base field GF(q). The rank metric between two vectors over GF(q^m) is the rank weight of their difference [20]. For a column vector in GF(q^m)^n, we can expand each of its components into a row vector over the base field GF(q). Such a row expansion leads to an n x m matrix over GF(q). In this paper, we slightly abuse the notation so that a symbol can represent a vector in GF(q^m)^n or a matrix in GF(q)^{n x m}, although the meaning is usually clear given the context.

Given a matrix X, its row space, rank, and RRE form are denoted by <X>, rank X, and RRE(X), respectively. For a subspace V, its dimension is denoted by dim V. The rank distance of two vectors x and y in GF(q^m)^n is defined as d_R(x, y) = rank(x - y). The subspace distance [3] of their row spaces is defined as d_S(<x>, <y>) = dim<x> + dim<y> - 2 dim(<x> INTERSECT <y>).

A linearized polynomial [21], [22] (or q-polynomial) over GF(q^m) is a polynomial of the form f(x) = SUM_i f_i x^{q^i}, where f_i is in GF(q^m). For a linearized polynomial f(x), its q-degree is defined to be the greatest value of i for which f_i is nonzero. For convenience, let [i] denote q^i. The symbolic product of two linearized polynomials f(x) and g(x), denoted by f(x) (x) g(x) (that is, f(x) (x) g(x) = f(g(x))), is also a linearized polynomial. The q-reverse of a linearized polynomial f(x) = SUM_{i=0}^{k} f_i x^{[i]} is given by the polynomial f~(x) = SUM_{i=0}^{k} f~_i x^{[i]}, where f~_i = f_{k-i}^{[i-k]} for i = 0, 1, ..., k and k is the q-degree of f(x). For a set of field elements, its minimal linearized polynomial is the monic linearized polynomial of least degree such that all the elements of the set are its roots.
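To make the symbolic product concrete, its coefficients follow directly from composition: (f (x) g)_k = SUM_{i+j=k} f_i g_j^{q^i}. The following minimal Python sketch is an illustration under our own assumptions (q = 2, GF(2^8) with the primitive polynomial x^8 + x^4 + x^3 + x + 1, polynomials stored as coefficient lists with f[i] multiplying x^(2^i)); it checks the coefficient rule against direct composition:

```python
M, POLY = 8, 0x11B  # GF(2^8), x^8 + x^4 + x^3 + x + 1 (illustrative choice)

def gmul(a, b):
    # carryless multiply with modular reduction in GF(2^8)
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return r

def frob(a, i):
    # i-fold Frobenius map a -> a^(2^i)
    for _ in range(i % M):
        a = gmul(a, a)
    return a

def lin_eval(f, z):
    # evaluate the linearized polynomial sum_i f[i] * z^(2^i)
    acc = 0
    for i, fi in enumerate(f):
        acc ^= gmul(fi, frob(z, i))
    return acc

def symbolic_product(f, g):
    # coefficients of f(g(x)): h_k = sum_{i+j=k} f_i * g_j^(2^i)
    h = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] ^= gmul(fi, frob(gj, i))
    return h

f, g = [0x03, 0x05], [0x07, 0x02, 0x01]  # arbitrary test polynomials
h = symbolic_product(f, g)
for z in (0x01, 0x35, 0xC4):
    assert lin_eval(h, z) == lin_eval(f, lin_eval(g, z))
print("symbolic product agrees with composition")
```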

B. Gabidulin Codes and Their Decoding

A Gabidulin code [6] is a linear (n, k) code over GF(q^m) whose parity-check matrix has the form

H = [ h_0^{[0]}   h_1^{[0]}   ...  h_{n-1}^{[0]}
      h_0^{[1]}   h_1^{[1]}   ...  h_{n-1}^{[1]}
      ...
      h_0^{[d-2]} h_1^{[d-2]} ...  h_{n-1}^{[d-2]} ]    (1)

where h_0, h_1, ..., h_{n-1} in GF(q^m) are linearly independent over GF(q). Since GF(q^m) is an m-dimensional vector space over GF(q), it is necessary that n <= m. The minimum rank distance of a Gabidulin code is d = n - k + 1, and hence Gabidulin codes are MRD codes.

The decoding process of Gabidulin codes includes five major steps: syndrome computation, key equation solver, finding the root space, finding the error locators by Gabidulin's algorithm [6], and finding error locations. The data flow of Gabidulin decoding is shown in Fig. 1.

Fig. 1. Data flow of Gabidulin decoding.

Key equation solvers based on a modified BMA [12] or a modified Welch–Berlekamp algorithm (WBA) [23] have been proposed. In this paper, we focus on the modified BMA due to its low complexity.

As in RS decoding, we can compute syndromes for Gabidulin codes as S_j = SUM_{i=0}^{n-1} r_i h_i^{[j]}, j = 0, 1, ..., d - 2, for any received vector r. Then the syndrome polynomial S(x) = SUM_{j=0}^{d-2} S_j x^{[j]} can be used to solve the key equation [12, Theorem 3]

sigma(x) (x) S(x) = omega(x) mod x^{[d-1]}    (2)

for the error span polynomial sigma(x), using the BMA. Up to t = floor((d - 1)/2) error values can be obtained by finding a basis E_1, E_2, ..., E_tau for the root space of sigma(x) using the methods in [24], [25]. Then we can find the error locators X_l corresponding to the E_l's by solving the system of equations

S_j = SUM_{l=1}^{tau} E_l X_l^{[j]},  j = 0, 1, ..., d - 2    (3)

where tau is the number of errors. Gabidulin's algorithm [6] in Algorithm 1 can be used to solve (3). Finally, the error locations are obtained from the X_l's by solving

X_l = SUM_{i=0}^{n-1} Y_{l,i} h_i,  Y_{l,i} in GF(q).    (4)

Algorithm 1 (Gabidulin's Algorithm [6]).
Input: the syndromes S_0, S_1, ..., S_{d-2} and the root space basis E_1, E_2, ..., E_tau
Output: the error locators X_1, X_2, ..., X_tau
1.1 Compute the matrices A and Q: A is initialized with the q-powers of the E_l's and is reduced to triangular form by a division-free forward elimination, while Q records the corresponding combinations of the syndromes.
1.2 Compute the X_l's recursively by backward substitution, starting from the last row of the triangular system.

In total, the decoding complexity of Gabidulin codes is roughly O(n^2 (1 - R)) operations over GF(q^m) [9], where R is the code rate, with the corresponding count of operations over the base field GF(q) given in [10]. Note that all polynomials involved in the decoding process are linearized polynomials.

Gabidulin codes are often viewed as the counterpart in rank metric codes of the well-known RS codes. As shown in Table I, an analogy between RS and Gabidulin codes can be established in many aspects. Such an analogy helps us understand the decoding of Gabidulin codes and, in some cases, allows us to adapt innovations proposed for RS codes to Gabidulin codes.

C. KK Codes and Their Decoding

By the lifting operation [5], KK codes can be constructed from Gabidulin codes. Lifting can also be seen as a generalization of the standard approach to random linear network coding

Fig. 2. Data flow of KK decoding.

TABLE I
ANALOGY BETWEEN RS AND GABIDULIN CODES

[2], which transmits matrices in the form X = [I | x], where I is the n x n identity matrix and x in GF(q)^{n x m} is the row expansion of a codeword of a Gabidulin code.

In practice, the packet length could be very long. To accommodate long packets based on the KK codes, very large m and n are needed, which results in prohibitively high complexity due to the huge field size of GF(q^m). A low-complexity approach in [5] suggested that instead of using a single long Gabidulin code, a Cartesian product of many short Gabidulin codes with the same distance can be used to construct CDCs for long packets via the lifting operation.
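As a small illustration of the lifting operation, the sketch below forms [I | x] from the row expansion of a vector over GF(2^8). The assumptions are ours (q = 2, m = 8, and an arbitrary vector standing in for a Gabidulin codeword, since encoding itself is not shown):

```python
M = 8  # GF(2^m) with m = 8 (illustrative choice)

def row_expand(x):
    # expand each GF(2^m) element into a row of m bits
    return [[(xi >> b) & 1 for b in range(M)] for xi in x]

def lift(x):
    # prepend the n x n identity to the n x m row expansion: X = [I | x]
    n = len(x)
    return [[1 if j == i else 0 for j in range(n)] + row
            for i, row in enumerate(row_expand(x))]

X = lift([0x1D, 0xA3, 0x4F])  # n = 3 packets, each n + m = 11 symbols long
for row in X:
    print(row)
```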
Let the received matrix be Y in GF(q)^{N x (n+m)}, where N is the number of received packets. Note that we always assume the received matrix is full-rank [5]. The column and row rank deficiencies of Y are mu and delta, respectively. In the decoding algorithm of [5], the matrix Y is first turned into an RRE form, and then the RRE form of Y is expanded into a three-tuple (r, L, E), where the expansion is determined by the column positions of the leading entries in the first n - mu rows. The tuple (r, L, E) is called a reduction of Y [5]. It was proved [5] that the subspace distance d_S(<X>, <Y>) is determined by the rank distance between the transmitted codeword and r, together with mu and delta. Now the decoding problem to minimize the subspace distance becomes a problem to minimize the rank distance.

For a KK code, the generalized rank decoding [5] finds an error word e that minimizes this rank distance. The error word e is expanded as a summation of products of column and row vectors [5] such that e = SUM_{j=1}^{tau} L_j E_j. Each term L_j E_j is called either an erasure, if L_j is known, or a deviation, if E_j is known, or an error, if neither L_j nor E_j is known. In this general decoding problem, the reduction provides mu known columns (erasures) and delta known rows (deviations). Given a Gabidulin code of minimum rank distance d, the corresponding KK code is able to correct epsilon errors, mu erasures, and delta deviations as long as 2*epsilon + mu + delta < d.

Algorithm 2 was proposed [5] for generalized decoding of the KK codes, and its data flow is shown in Fig. 2. It requires O(dm) operations in GF(q^m) [5].

Algorithm 2 (General Rank Decoding [5]).
Input: received tuple (r, L, E)
Output: error word e
2.1 Compute the syndromes of r, the minimal linearized polynomials determined by the erasures and by the deviations, and the modified syndrome polynomial obtained from the syndrome polynomial by symbolic products with these polynomials, one factor being the q-reverse of the deviation polynomial.
2.2 Compute the error span polynomial:
a) Use the modified BMA [12] to solve the key equation for the error part of the span polynomial from the modified syndrome, where the number of correctable errors is bounded by the numbers of erasures and deviations.
b) Compute the auxiliary syndrome polynomial needed for the erasures by a further symbolic product.
c) Use Gabidulin's algorithm [6] to solve the resulting system of equations for the erasure-related coefficients.
d) Combine the factors by symbolic products into the error span polynomial sigma(x).
2.3 Find a basis E for the root space of sigma(x).
2.4 Find the error locations:
a) Solve (3) using Gabidulin's algorithm [6] to find the error locators X_1, ..., X_tau.
b) Compute the error locations by solving (4).
c) Compute the error word e as the sum of the corresponding products, where each term uses the row expansion of its value.

III. COMPUTATIONAL COMPLEXITY REDUCTION

In general, RLNC is carried out over GF(q), where q is any prime power. That is, packets are treated as vectors over GF(q). Since our investigation of computational complexities is for both software and hardware implementations of RLNC, where data are stored and transmitted in bits, we focus on RLNC over characteristic-2 fields in our work, i.e., q is a power of two. In some cases, we further assume q = 2, as it leads to further complexity reductions.

A. Finite Field Representation

Finite field elements can be represented by vectors using different types of bases: polynomial basis, normal basis, and dual basis [26]. In rank metric decoders, most polynomials involved are linearized polynomials, and hence their evaluations and symbolic products require computing q^i th powers. Suppose a field element is represented by a vector over GF(q) with respect to a normal basis; computing q^i th powers (i is a positive or negative integer) of the element is simply a cyclic shift of the corresponding vector by i positions, which significantly reduces computational complexities. For example,

TABLE II
COMPLEXITIES OF BIT-PARALLEL NORMAL BASIS MULTIPLIERS OVER FINITE FIELDS (FOR THESE TWO FIELDS, ALL THREE IMPLEMENTATIONS HAVE THE SAME CPD.)

the computational complexity of Algorithm 1 is primarily due to the update equations in Step 1.1, referred to as (5), which require divisions and computing [-1]th powers. With normal basis representation, [-1]th powers are obtained by a single cyclic shift. When q = 2, the updates can be computed in an inversionless form, which also avoids finite field divisions or inversions. Thus, using normal basis representation also reduces the complexity of Gabidulin's algorithm.
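The cyclic-shift property can be checked numerically. The following sketch is our illustration (it assumes q = 2, GF(2^4) with primitive polynomial x^4 + x + 1, and finds a normal element by brute force); it verifies that squaring permutes normal-basis coordinates cyclically:

```python
M, POLY = 4, 0b10011  # GF(2^4), x^4 + x + 1 (illustrative choice)

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << M):
            a ^= POLY
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def is_normal(b):
    # {b, b^2, b^4, b^8} must be linearly independent over GF(2)
    rows, rank = [gf_pow(b, 2**i) for i in range(M)], 0
    for bit in reversed(range(M)):
        piv = next((r for r in rows if r >> bit & 1), None)
        if piv is None:
            continue
        rows.remove(piv)
        rows = [r ^ piv if r >> bit & 1 else r for r in rows]
        rank += 1
    return rank == M

b = next(x for x in range(1, 16) if is_normal(x))
basis = [gf_pow(b, 2**i) for i in range(M)]

def to_normal_coords(x):
    # solve x = sum c_i * basis[i] by brute force (the field is tiny)
    for c in range(16):
        acc = 0
        for i in range(M):
            if c >> i & 1:
                acc ^= basis[i]
        if acc == x:
            return c  # bit i of c is the coefficient of b^(2^i)
    raise ValueError

for x in (3, 7, 13):
    cx, cx2 = to_normal_coords(x), to_normal_coords(gf_mul(x, x))
    shifted = ((cx << 1) | (cx >> (M - 1))) & 0xF  # cyclic shift by one
    assert cx2 == shifted
print("squaring = cyclic shift in normal-basis coordinates: verified")
```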
In addition to lower complexities of finite field arithmetic operations, normal basis representation leads to reduced complexities in the decoding of Gabidulin and KK codes for several reasons. First, it was shown that using normal basis can facilitate the computation of symbolic products [9]. Second, it was also suggested [9] that solving (4) can be trivial using normal basis: if h_0, h_1, ..., h_{n-1} is a normal basis, the matrix whose rows are the vector representations of the h_i's with respect to that basis becomes an identity matrix with additional all-zero columns. Hence, solving (4) requires no computation. These two complexity reductions were also observed in [10]. Third, if a normal basis of GF(q^m) is used as the h_i's, the parity check matrix H in (1) becomes a cyclic matrix. Thus, syndrome computation becomes part of a cyclic convolution of the basis vector and the received vector, for which fast algorithms are available (see, for example, [27]). Using fast cyclic convolution algorithms is favorable when n is large.
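The cyclic structure of the syndromes can also be checked numerically. The sketch below is our illustration (q = 2, GF(2^8) with polynomial 0x11B, n = m = 8, and h_i = beta^(2^i) for a normal element beta found by search). Since h_i^(2^j) = beta^(2^(i+j)), the syndrome S_j is a cyclic correlation of the received vector with the table b[k] = beta^(2^k):

```python
M, POLY = 8, 0x11B  # GF(2^8) (illustrative choice)

def gmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return r

def frob(a, i):
    for _ in range(i % M):
        a = gmul(a, a)
    return a

def is_normal(b):
    rows, rank = [frob(b, i) for i in range(M)], 0
    for bit in reversed(range(M)):
        piv = next((r for r in rows if r >> bit & 1), None)
        if piv is None:
            continue
        rows.remove(piv)
        rows = [r ^ piv if r >> bit & 1 else r for r in rows]
        rank += 1
    return rank == M

beta = next(x for x in range(2, 256) if is_normal(x))
b = [frob(beta, k) for k in range(M)]                 # b[k] = beta^(2^k)
r = [0x1D, 0xA3, 0x4F, 0x08, 0xEE, 0x21, 0x7C, 0x9B]  # arbitrary received vector

for j in range(M):
    direct = 0
    corr = 0
    for i in range(M):
        direct ^= gmul(r[i], frob(b[i], j))  # r_i * h_i^(2^j)
        corr ^= gmul(r[i], b[(i + j) % M])   # cyclic correlation form
    assert direct == corr
print("syndrome computation is a cyclic correlation with the normal basis")
```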
B. Normal Basis Arithmetic Operations

We also propose finite field arithmetic operations with reduced complexities when normal basis representation is used. When represented by vectors, the addition and subtraction of two elements are simply component-wise addition, which is straightforward to implement. For characteristic-2 fields GF(2^m), inverses can be obtained efficiently by a sequence of squaring and multiplying, since a^{-1} = a^{2^m - 2} for nonzero a [26]. Since the 2^i th powers require no computation, the complexity of inversion in turn depends on that of multiplication. Division can be implemented by a concatenation of inversion and multiplication, a/b = a * b^{-1}, and hence the complexity of division also depends on that of multiplication in the end.
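A minimal sketch of this inversion follows (our illustration, assuming q = 2 and GF(2^8) with polynomial 0x11B). With a normal basis the m - 1 squarings below are free cyclic shifts, leaving m - 2 = 6 true multiplications for a^(-1) = a^2 * a^4 * ... * a^128:

```python
M, POLY = 8, 0x11B  # GF(2^8) (illustrative choice)

def gmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return r

def ginv(a):
    # a^(2^m - 2) by square-and-multiply; squarings are shifts in normal basis
    assert a != 0
    s, r = a, 1
    for k in range(M - 1):
        s = gmul(s, s)                   # s runs over a^2, a^4, ..., a^128
        r = s if k == 0 else gmul(r, s)  # m - 2 true multiplications
    return r

for a in (1, 0x53, 0xCA):
    assert gmul(a, ginv(a)) == 1
print("inversion via square-and-multiply verified")
```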
There are serial and parallel architectures for normal basis finite field multipliers. To achieve high throughput in our decoder, we consider only parallel architectures. Most normal basis multipliers are based on the Massey–Omura (MO) architecture [26], [28]. The complexity of a serial MO normal basis multiplier over GF(2^m), C_N, is defined as the number of terms in computing a bit of the product c = a * b, where a, b, and c are represented over a normal basis. It has been shown [29] that a parallel MO multiplier over GF(2^m) needs m^2 AND gates and at most m(C_N - 1) XOR gates. For instance, for the fields GF(2^8) and GF(2^16), the C_N's are minimized to 21 and 85, respectively [26]. Using a common subexpression elimination algorithm [13], we significantly reduce the number of XOR gates while maintaining the same critical path delays (CPDs) as direct implementations: one AND plus five XOR gates and one AND plus seven XOR gates, respectively. Our results are compared with those in [26] and [29] in Table II, where we also provide the prime polynomial for each field.

The reduced gate count for normal basis multiplication is particularly important for hardware implementations of RLNC. This improvement is transparent to the complexity of decoders in terms of finite field operations. When decoders for RLNC are realized in hardware, the reduced gate count for normal basis multiplication will be reflected in reduced area and power consumption.

C. Inversionless BMA

The modified BMA for rank metric codes [12] is similar to the BMA for RS codes except that polynomial multiplications are replaced by symbolic products. The modified BMA [12] requires finite field divisions, which are more complex than other arithmetic operations. Following the idea of the inversionless RS decoder [14], we propose an inversionless variant in Algorithm 3.

Algorithm 3 (iBMA).
Input: Syndromes S_0, S_1, ..., S_{d-2}
Output: Lambda(x)
3.1 Initialize: Lambda^{(0)}(x) = B^{(0)}(x) = x, Theta^{(0)} = 1, and L = 0.
3.2 For r = 0, 1, ..., d - 2,
a) Compute the discrepancy Delta_r = SUM_{i=0}^{L} Lambda_i^{(r)} S_{r-i}^{[i]}.
b) If Delta_r = 0, go to (e).
c) Modify the connection polynomial: Lambda^{(r+1)}(x) = Theta^{(r)} Lambda^{(r)}(x) - Delta_r (x^{[1]} (x) B^{(r)}(x)).
d) If 2L > r, go to (e). Otherwise, set L to r + 1 - L, Theta^{(r+1)} = Delta_r, and B^{(r+1)}(x) = Lambda^{(r)}(x); continue with the next iteration.
e) Set B^{(r+1)}(x) = x^{[1]} (x) B^{(r)}(x) and Theta^{(r+1)} = Theta^{(r)}.
3.3 Set Lambda(x) = Lambda^{(d-1)}(x).

Using a similar approach as in [14], we prove that the output of Algorithm 3 is the same as that produced by the

modified BMA, except that it is scaled by a constant. However, this scaling is inconsequential since the two polynomials have the same root space.

Using normal basis, the modified BMA in [12] requires inversions in addition to multiplications and additions over GF(q^m) [9]. Our inversionless version, Algorithm 3, requires only multiplications and additions. Since a normal basis inversion is obtained by m - 2 normal basis multiplications, the complexity of normal basis inversion is roughly m - 2 times that of normal basis multiplication. Hence, Algorithm 3 reduces the complexity considerably. Algorithm 3 is also more suitable for hardware implementation, as shown in Section IV.
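For reference, a compact software model of the key-equation solver follows. This is our behavioral sketch, not the paper's architecture: it implements the division-based modified BMA of [12] over GF(2^8) (Algorithm 3 removes the division marked below by cross-multiplying with the previous discrepancy), and the test syndromes S_j = E1*X1^(2^j) + E2*X2^(2^j) of a rank-2 error are our own assumptions:

```python
M, POLY = 8, 0x11B  # GF(2^8) (illustrative choice)

def gmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return r

def frob(a, i):
    for _ in range(i % M):
        a = gmul(a, a)
    return a

def ginv(a):
    r = a
    for _ in range(M - 2):        # addition chain for a^(2^8 - 2)
        r = gmul(gmul(r, r), a)
    return gmul(r, r)

def lin_eval(f, z):
    acc = 0
    for i, fi in enumerate(f):
        acc ^= gmul(fi, frob(z, i))
    return acc

def rank_bma(S):
    lam, B, L = [1], [1], 0
    for r in range(len(S)):
        # discrepancy: sum_i lam_i * S_{r-i}^(2^i)
        delta = 0
        for i in range(min(L, r) + 1):
            delta ^= gmul(lam[i], frob(S[r - i], i))
        Bs = [0] + [frob(c, 1) for c in B]  # x^[1] (x) B(x): shift with Frobenius
        width = max(len(lam), len(Bs))
        new = [(lam[i] if i < len(lam) else 0) ^
               gmul(delta, Bs[i] if i < len(Bs) else 0) for i in range(width)]
        if delta and 2 * L <= r:
            # division used here; the inversionless variant tracks delta instead
            B, L = [gmul(ginv(delta), c) for c in lam], r + 1 - L
        else:
            B = Bs
        lam = new
    return lam

E1, E2, X1, X2 = 0x02, 0x04, 0x03, 0x07  # rank-2 error (our test data)
S = [gmul(E1, frob(X1, j)) ^ gmul(E2, frob(X2, j)) for j in range(4)]
sigma = rank_bma(S)
assert lin_eval(sigma, E1) == 0 and lin_eval(sigma, E2) == 0
print("error span polynomial annihilates both error values")
```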
and , respectively. We
D. Finding the Root Space

Instead of finding roots of polynomials as in RS decoding, we need to find the root spaces of linearized polynomials in rank metric decoding. Hence, the Chien search [30] used in RS decoding would have a high complexity, for two reasons. First, it requires polynomial evaluations over the whole field, whose complexity is very high. Second, it cannot find a set of linearly independent roots.

A probabilistic algorithm to find the root space was proposed in [25]. For Gabidulin codes, it can be further simplified as suggested in [9]. However, hardware implementations of probabilistic algorithms require random number generators. Furthermore, the algorithm in [25] requires symbolic long division, which is also not suitable for hardware implementations. According to [5], the average complexity of the probabilistic algorithm in [25] and that of Berlekamp's deterministic method [24], which requires operations in GF(q^m) plus Gaussian elimination over GF(q), differ only slightly. Since their complexity difference is small, we focus on the deterministic method, which is much easier to implement.

Suppose we need to find the root space of a linearized polynomial sigma(x). Berlekamp's deterministic method first evaluates the polynomial on a basis beta_0, beta_1, ..., beta_{m-1} of the field GF(q^m), so that v_j = sigma(beta_j) for j = 0, 1, ..., m - 1. Then it expands the v_j's in the base field as columns of an m x m matrix and finds linearly independent roots: each vector in the null space of this matrix, via the representation based on the beta_j's, yields a root of the given polynomial. Finding the null space amounts to obtaining the linearly dependent combinations of the columns, which can be done by Gaussian elimination.
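A minimal model of this deterministic method follows (our sketch, with our own test data: q = 2, GF(2^8) with polynomial 0x11B, and f(x) = x + x^4 = x + x^(2^2), chosen because its root space is the subfield GF(4), a 2-dimensional GF(2)-subspace). It evaluates f on the polynomial basis 1, a, ..., a^7 and extracts the null space over GF(2):

```python
M, POLY = 8, 0x11B  # GF(2^8) (illustrative choice)

def gmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return r

def frob(a, i):
    for _ in range(i % M):
        a = gmul(a, a)
    return a

def lin_eval(f, z):
    acc = 0
    for i, fi in enumerate(f):
        acc ^= gmul(fi, frob(z, i))
    return acc

basis = [1]
for _ in range(M - 1):
    basis.append(gmul(basis[-1], 0x02))  # 1, a, a^2, ..., a^7

f = [1, 0, 1]   # f(x) = x + x^(2^2) = x + x^4
pivots = {}     # leading bit -> (reduced value, combination of basis columns)
roots = []
for j in range(M):
    v, c = lin_eval(f, basis[j]), 1 << j
    while v:
        msb = v.bit_length() - 1
        if msb not in pivots:
            pivots[msb] = (v, c)
            break
        pv, pc = pivots[msb]
        v ^= pv
        c ^= pc
    if v == 0:  # dependent column: the tracked combination is a root of f
        root = 0
        for k in range(M):
            if c >> k & 1:
                root ^= basis[k]
        roots.append(root)

assert len(roots) == 2 and all(lin_eval(f, z) == 0 for z in roots)
print("root space basis:", [hex(z) for z in roots])
```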
, whereas that of the latter prepro-
E. n-RRE Form

Given a received subspace spanned by a set of received packets, the input of Algorithm 2 is a three-tuple, called a reduction of the received space represented by its generator matrix Y; the three-tuple is obtained based on Y when it is in its RRE form [5]. Thus, before the decoding starts, preprocessing is performed on the received packets so as to obtain the RRE form of Y. We show that the processed matrix needs to satisfy only a relaxed constraint, which does not affect the decoding outcome, while leading to two advantages. First, the relaxed constraint results in reduced complexities in the preprocessing step. Second and more importantly, the relaxed constraint enables parallel processing of decoding KK codes based on Cartesian products.

We first define an n-RRE form for received matrices. Given a matrix Y in GF(q)^{N x (n+m)}, the matrix is in its n-RRE form as long as its leftmost n columns are in RRE form. Compared with the RRE form, the n-RRE form is more relaxed as it puts no constraints on the right part. We note that an n-RRE form of a matrix is not unique.

We now show that the relaxed constraint does not affect the decoding. Similar to [5, Proposition 7], we first show that a reduction based on an n-RRE form of Y always exists. Given Y and a nonsingular matrix W representing the reducing row operations, the product WY is in its n-RRE form, and the column and row rank deficiencies mu and delta of Y are preserved. We have the following result about the reduction based on the n-RRE form.

Lemma 1: Let Y and its n-RRE form be as above. There exists a tuple (r, L, E) and a set U satisfying the conditions of a reduction, so that it is a reduction of Y.

See Appendix A for the proof of Lemma 1. Lemma 1 shows that we can find an alternative reduction based on an n-RRE form of Y, instead of an RRE form of Y. The key of our alternative reduction is that it is mostly determined by the first n columns of the matrix. Also, this alternative reduction does not come as a surprise: as shown in [5, Proposition 8], row operations can produce alternative reductions. Next, we show that decoding based on our alternative reduction is the same as in [5]. Similar to [5, Theorem 9], we have the following result.

Lemma 2: Let (r, L, E) be a reduction of Y determined by its n-RRE form. Then the subspace decoding problem for (r, L, E) reduces to the generalized rank decoding problem exactly as in [5, Theorem 9].

See Appendix B for the proof. Lemma 2 shows that the subspace decoding problem is equivalent to the generalized Gabidulin decoding problem with the alternative reduction, which is obtained from an n-RRE form of Y.

Our alternative reduction leads to two advantages. First, it results in reduced complexity in preprocessing. Given a matrix Y, the preprocessing needed to transform Y into its n-RRE form is only part of the preprocessing to transform Y into its RRE form: the former row-reduces only the leftmost n columns, whereas the latter must process the whole matrix. Hence, the relaxed constraint leads to a lower complexity. Second, the reduction for n-RRE forms is completely determined by the leftmost n columns of Y instead of the whole matrix, which greatly simplifies hardware implementations. This advantage is particularly important for the decoding of CDCs that are lifted from Cartesian products of Gabidulin codes. Since the row operations to obtain an n-RRE form depend on the leftmost columns only, decoding can be divided into parallel and smaller decoding problems, one per constituent code. Thus, for these CDCs, we can decode in a serial manner with only one small decoder, or in a partly parallel fashion with more

decoders, or even in a fully parallel fashion. This flexibility allows tradeoffs between cost/area/power and throughput. Furthermore, since the erasures are determined by the leftmost columns and are the same for all the constituent decoding problems, the computation of the erasure-related polynomials in Algorithm 2 can be shared among these parallel decoding problems, thereby reducing overall complexity.
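A bit-packed sketch of this relaxed preprocessing follows (our illustration: GF(2), rows stored as Python ints with bit (width - 1) as the leftmost column). An n-RRE form requires only the leftmost n columns to be in RRE form; the right part just undergoes the same row operations:

```python
def n_rre(rows, n, width):
    # row-reduce only the leftmost n columns; the rest follow along
    rows = rows[:]
    r = 0
    for col in range(width - 1, width - 1 - n, -1):
        piv = next((i for i in range(r, len(rows)) if rows[i] >> col & 1), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i] >> col & 1:
                rows[i] ^= rows[r]
        r += 1
    return rows

# 3 received packets: n = 2 coding symbols + m = 4 payload bits each
received = [0b110101, 0b111100, 0b010011]
for row in n_rre(received, n=2, width=6):
    print(f"{row:06b}")
```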
F. Finding Minimal Linearized Polynomials

Minimal linearized polynomials can be computed by solving systems of linear equations. Given roots w_1, w_2, ..., w_r, the minimal linearized polynomial F(x) = x^{[r]} + SUM_{i=0}^{r-1} f_i x^{[i]} satisfies

SUM_{i=0}^{r-1} f_i w_j^{[i]} = -w_j^{[r]},  j = 1, 2, ..., r.    (6)

Thus, it can be solved by Gaussian elimination over the extension field GF(q^m). Gabidulin's algorithm is not applicable because the rows of the matrix are not the powers of the same element.

The complexity to solve (6) is very high. Instead, we reformulate the method from [21, Ch. 1, Theorem 7]. The main idea of [21, Ch. 1, Theorem 7] is to recursively construct the minimal linearized polynomial using symbolic products instead of the polynomial multiplications used in polynomial interpolation. Given linearly independent roots w_1, w_2, ..., w_r, we can construct a series of linearized polynomials as F^{(0)}(x) = x and F^{(l+1)}(x) = F^{(l)}(x)^q - F^{(l)}(w_{l+1})^{q-1} F^{(l)}(x) for l = 0, 1, ..., r - 1.

Although the recursive method in [21, Ch. 1, Theorem 7] is stated for p-polynomials, we can adapt it to linearized polynomials readily. A serious drawback of [21, Ch. 1, Theorem 7] is that the evaluation of F^{(l)}(w_{l+1}) has a rapidly increasing complexity when the q-degree of F^{(l)}(x) gets higher. To eliminate this drawback, we reformulate the algorithm so that the evaluation is done in a recursive way. Our reformulated algorithm is based on the fact that the evaluations satisfy the same recursion, F^{(l+1)}(w_j) = F^{(l)}(w_j)^q - F^{(l)}(w_{l+1})^{q-1} F^{(l)}(w_j). Representing the running evaluations as gamma_j^{(l)} = F^{(l)}(w_j), we obtain Algorithm 4.
Algorithm 4 (Minimal Linearized Polynomials).
Input: Roots w_1, w_2, ..., w_r
Output: The minimal linearized polynomial F(x)
4.1 Set F^{(0)}(x) = x and gamma_j^{(0)} = w_j for j = 1, 2, ..., r.
4.2 For l = 0, 1, ..., r - 1,
a) If gamma_{l+1}^{(l)} is nonzero, set F^{(l+1)}(x) = F^{(l)}(x)^q - (gamma_{l+1}^{(l)})^{q-1} F^{(l)}(x) and gamma_j^{(l+1)} = (gamma_j^{(l)})^q - (gamma_{l+1}^{(l)})^{q-1} gamma_j^{(l)} for j = l + 2, ..., r;
b) Otherwise, set F^{(l+1)}(x) = F^{(l)}(x) and gamma_j^{(l+1)} = gamma_j^{(l)} for j = l + 2, ..., r.
4.3 F(x) = F^{(r)}(x).

Since q th powers require only cyclic shifting, the operations in Algorithm 4 are simple. Also, Algorithm 4 does not require the roots to be linearly independent: if w_{l+1} is linearly dependent on w_1, ..., w_l, then gamma_{l+1}^{(l)} = F^{(l)}(w_{l+1}) = 0 and w_{l+1} is ignored. So Algorithm 4 integrates detection of linear dependency at no extra computational cost.

Essentially, Algorithm 4 breaks down evaluations of high q-degree polynomials into evaluations of polynomials with q-degree of one. It avoids operations with very high complexity while maintaining the same total complexity of the algorithm.
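A direct software model of this recursion follows (our sketch; q = 2 and GF(2^8) with polynomial 0x11B are our illustrative choices). For q = 2 the update is F <- F^2 + F(w) * F, and a dependent root gives F(w) = 0 and is skipped:

```python
M, POLY = 8, 0x11B  # GF(2^8) (illustrative choice)

def gmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return r

def frob(a, i):
    for _ in range(i % M):
        a = gmul(a, a)
    return a

def lin_eval(f, z):
    acc = 0
    for i, fi in enumerate(f):
        acc ^= gmul(fi, frob(z, i))
    return acc

def min_lin_poly(roots):
    F = [1]  # F(x) = x
    for w in roots:
        gamma = lin_eval(F, w)
        if gamma == 0:  # w already in the root space: ignored for free
            continue
        # F_new(x) = F(x)^2 + gamma * F(x); squaring shifts coefficients up one slot
        Fsq = [0] + [gmul(c, c) for c in F]
        F = [(Fsq[i] if i < len(Fsq) else 0) ^
             gmul(gamma, F[i] if i < len(F) else 0) for i in range(len(Fsq))]
    return F

F = min_lin_poly([0x02, 0x04, 0x06])  # 0x06 = 0x02 ^ 0x04 is dependent
assert len(F) == 3                    # q-degree 2: only two independent roots
span = [0, 0x02, 0x04, 0x06]
assert all(lin_eval(F, z) == 0 for z in span)
print("minimal linearized polynomial found; dependent root detected for free")
```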
IV. ARCHITECTURE DESIGN

Aiming to reduce the storage requirement and total area as well as to improve the regularity of our decoder architectures, we further reformulate the steps in the decoding algorithms of both Gabidulin and KK codes. Again, we assume the decoder architectures are for RLNC over GF(q), where q is a power of two.

A. High-Speed BMA Architecture

To increase the throughput, regular BMA architectures with shorter CPD are necessary. Following the approaches in [15], we develop two architectures based on Algorithm 3, which are analogous to the riBM and RiBM algorithms in [15].

In Algorithm 3, the critical path is in step 3.2(a). Note that Delta_r is a coefficient of the discrepancy polynomial Delta^{(r)}(x) = Lambda^{(r)}(x) (x) S(x). By using the update rule of Lambda^{(r)}(x), the discrepancy polynomial can be computed as

Delta^{(r+1)}(x) = Theta^{(r)} Delta^{(r)}(x) - Delta_r (x^{[1]} (x) Theta^{(r)}(x))    (7)

where Theta^{(r)}(x) = B^{(r)}(x) (x) S(x), which has the same structure as step 3.2(c). Hence, this reformulation is more conducive to a regular implementation. Given the similarities between step 3.2(a) and (7), Lambda^{(r)}(x) and Delta^{(r)}(x) can be combined together into one polynomial delta^{(r)}(x). Similarly, B^{(r)}(x) and Theta^{(r)}(x) can be combined into one polynomial theta^{(r)}(x). These changes are incorporated in our RiBMA algorithm, shown in Algorithm 5.

Algorithm 5 (RiBMA).
Input: Syndromes S_0, S_1, ..., S_{d-2}
Output: Lambda(x)
5.1 Initialize: load the syndromes into the combined polynomial delta^{(0)}(x) and its companion theta^{(0)}(x), and set Theta^{(0)} = 1 and k = 0.
5.2 For r = 0, 1, ..., d - 2,
a) Modify the combined polynomial: delta^{(r+1)}(x) = Theta^{(r)} delta^{(r)}(x) - delta_0^{(r)} (x^{[1]} (x) theta^{(r)}(x));
b) Set k to k + 1;
c) If delta_0^{(r)} is nonzero and k > 0, set theta^{(r)}(x) = delta^{(r)}(x), Theta^{(r+1)} = delta_0^{(r)}, and k to -k;
d) Otherwise, set theta^{(r+1)}(x) = x^{[1]} (x) theta^{(r)}(x) and Theta^{(r+1)} = Theta^{(r)};
e) Shift the combined polynomial so that the next discrepancy coefficient appears at position zero.
5.3 Set Lambda(x) to the designated segment of delta^{(d-1)}(x).

Fig. 3. RiBMA architecture.

Fig. 4. Processing element BE (x^{[1]} is a cyclic shift and requires no hardware but wiring).

Following Algorithm 5, we propose the systolic RiBMA architecture shown in Fig. 3, which consists of identical processing elements (BEs), whose circuitry is shown in Fig. 4. The central control unit BCtrl, the rightmost cell in Fig. 3, updates the scalar Theta, generates the global control signals, and passes along the discrepancy coefficient. The control signal is set to 1 only if the discrepancy is nonzero and the length test passes. In each processing element, there are two critical paths, both of which consist of one multiplier and one adder over GF(2^m).
B. Generalized BMA

The key equation of KK decoding is essentially the same as (2), but the modified syndrome has a q-degree less than d - 1 - mu - delta instead of d - 1. Actually, in KK decoding, we do not know the exact number of errors before solving the key equation. All we need is to determine the maximum number of correctable errors given mu erasures and delta deviations, which is given by t = floor((d - 1 - mu - delta)/2). Hence, we adapt our BMA in Section III-C to KK decoding, as in Algorithm 6. To apply Algorithm 6 to Gabidulin decoding, we can simply use mu = delta = 0.
algorithm operates on matrices over . Here, we focus on
Algorithm 6 (Generalized RiBMA).
Input: Syndromes and the numbers of erasures and deviations, mu and delta
Output: Lambda(x)
6.1 Initialize as in Algorithm 5, but pre-shift the combined polynomials according to mu + delta so that the desired output is aligned to a fixed position.
6.2 For r = 0, 1, ..., 2t - 1, carry out the same steps as in step 5.2.
6.3 Set Lambda(x) to the fixed segment of the combined polynomial.

Compared with Algorithm 5, we replace d - 1 with 2t. The variable number of iterations makes it difficult to design regular architectures. By carefully initializing the combined polynomials, we ensure that the desired output is always at a fixed position, regardless of mu and delta. Hence, the only irregular part is the initialization. The initialization of Algorithm 6 can be done by shifting in at most mu + delta cycles. Hence, the RiBMA architecture in Fig. 3 can be adapted to the KK decoder while keeping the same worst case latency.
C. Gaussian Elimination

We need Gaussian elimination to obtain n-RRE forms as well as to find root spaces. Furthermore, Gabidulin's algorithm in Algorithm 1 is essentially a smart way of Gaussian elimination, which takes advantage of the properties of the matrix. The reduction (to obtain n-RRE forms) and finding the root space are Gaussian eliminations on matrices over GF(q), while Gabidulin's algorithm operates on matrices over GF(q^m). Here, we focus on Gaussian eliminations over GF(q); Gabidulin's algorithm will be discussed in Section IV-D.

For high-throughput implementations, we adapt the pivoting architecture in [31], which was developed for nonsingular matrices. It always keeps the pivot element on the top-left location of the matrix, by cyclically shifting the rows and columns. Our Gaussian elimination algorithm, shown in Algorithm 7, has three key differences from the pivoting architecture in [31]. First, Algorithm 7 is applicable to matrices over any field. Second, and more importantly, Algorithm 7 can be used for singular matrices. This feature is necessary since singular matrices occur in the reduction for the RRE form and in finding the root space. Third, Algorithm 7 is also flexible about matrix sizes, which are determined by the variable numbers of errors, erasures, and deviations.

Algorithm 7 (Gaussian Elimination for Root Space).
Input: a matrix whose rows are the evaluations of sigma(x) over the normal basis, together with a companion identity matrix
Output: linearly independent roots of sigma(x)
7.1 Set the zero-column counter to zero.
7.2 For each column,
a) examine the top-left pivot;
b) while the pivot is zero and untried rows remain, cyclically shift the rows up;
c) if the pivot column is not zero, eliminate it and advance; otherwise, cyclically shift all columns left and increase the counter.
7.3 The first rows of the reduced matrix (as many as the counter) are all zeros, and the corresponding first rows of the companion matrix are roots.

The eliminate and shiftup operations are quite similar to those

Fig. 5. Regular architecture for Gaussian elimination.

in [31, Algorithm 2]. In the eliminate operation, the pivot row is used to cancel the leading entries of all other rows; note that a cyclic row shift and a cyclic column shift are already embedded in the eliminate operation. In the shiftup operation, the first row is moved to the last working position while the remaining rows are moved up; the shiftup operation essentially mimics all row operations in eliminate without the column shift. In the shiftleft operation, all columns are cyclically shifted to the left.

By adding a shiftleft operation, Algorithm 7 handles both singular and nonsingular matrices, while [31, Algorithm 2] works for nonsingular matrices only. Since the companion matrix is always full rank, the roots obtained are guaranteed to be linearly independent.

We can get the root space using Algorithm 7, and we can also use it in KK decoding to reduce the received matrix to an n-RRE form. However, Algorithm 7 only produces the reduced matrix. We extend it to Algorithm 8 below so as to obtain the reduction simultaneously.
Algorithm 8 (Gaussian Elimination for n-RRE Forms).
Input: the received matrix, split into its leftmost n columns and the remaining m columns
Output: the reduction (r, L, E) and the set of deviation positions
8.1 Initialize the working matrices and set the lists of erasures and deviations to be empty.
8.2 For each column of the left part,
a) examine the pivot for the current row;
b) while the pivot is zero and untried rows remain, shift the rows up;
c) if the pivot column is not zero, eliminate it and advance;
d) otherwise, append the corresponding column of the right part to the erasure matrix, set its top-right element to one, and add the row to the deviation list.
8.3 The deviations are given by the recorded rows of the right part.
8.4 For each erased column, extract the corresponding erasure vector.
8.5 The received vector r is given by the remaining rows.

Fig. 6. Processing element GE.

In Algorithm 8, we incorporate the extraction of r, L, and E into Gaussian elimination. Our architecture has the same worst case latency as Algorithm 7 and requires no extra cycles to extract the reduction out of the n-RRE form. Hence, the throughput also remains the same.

Algorithm 7 is implemented by the regular architecture shown in Fig. 5, which is a 2-D array of processing elements (GEs). The leftmost columns of processing elements correspond to the matrix being reduced, and the rightmost columns to the companion matrix. Algorithm 8 can be implemented with the same architecture: the leftmost columns of processing elements correspond to the left part of the received matrix, and the rightmost columns to the right part; unused elements are omitted in the figure. The circuitry of the processing element GE is shown in Fig. 6. The control signal chooses from five inputs based on the operation: keeping the value, shiftleft, eliminate (or reduce), and shiftup (using the first row or the next row).

D. Gabidulin's Algorithm

In Algorithm 1, the matrix is first reduced to a triangular form. It takes advantage of the properties of the matrix so that it requires no division in the first stage. In the first stage, we need to perform elimination on only one row, so we use a pivoting scheme similar to that of Algorithm 7. When a row is reduced to having only one nonzero element, a division is used to obtain one coefficient of the solution, and a backward elimination is performed after getting each coefficient. Hence, we introduce a backward pivoting scheme, where the pivot element is always at the bottom-right corner.

Fig. 7. Our architecture of Gabidulin's algorithm.

Fig. 8. Processing element AE.

Fig. 9. Processing element QE.

In Algorithm 1, there are two matrices over GF(q^m), A and Q. In Step 1.2, only some of the entries of Q are required to compute the coefficients, and each update in (5) requires only the entry itself and one neighboring entry. Recursively, only the entries needed by later steps are necessary, and the needed entries at any step can be computed from the entries kept at the previous step. Hence, we need to store only a band of Q to keep track of the elimination, which reduces the storage of Q from a full array of m-bit registers down to a band. We cannot reduce the storage of A in the same way because we have to use the pivoting scheme for short critical paths.

In our decoder, Algorithm 1 is implemented by the regular architecture shown in Fig. 7, which includes an array of AEs and a one-dimensional array of QEs. The circuitry of the processing elements AE and QE is shown in Figs. 8 and 9. The upper MUX in AE controls the output sent upward along the diagonal. Its control signal is 1 for the second row and 0 for other rows, since we update one row in a cycle and we keep the pivot on the upper left corner in Step 1.1. The control of the lower MUX in AE is 0 when working on Step 1.1 and 1 when working on Step 1.2. Similarly, the control of the MUX in QE is 0 for Step 1.1 and 1 for Step 1.2. However, in Step 1.1, only part of the QEs need to update; the others maintain their values, and their control signals are set to 2. Step 1.1 proceeds in substeps: in each of the first substeps one row is updated, and the last substep puts the updated row back into its original position, with all control signals set to 2. In Step 1.2, the pivot is in the lower right corner, where we compute the coefficients. Step 1.2 needs the same number of substeps, in which all control signals are set to 1. First, a coefficient is computed by a division; note that the inversion may need multiple clock cycles. In each substep, the matrix moves down the diagonal so that the element to be inverted is always at the bottom right corner. At the same time, the computed coefficients also move down. Basically, in each substep, the architecture performs one backward elimination to update the remaining rows.

E. Low-Complexity Linearized Interpolation

It would seem that three registers are needed to store the polynomial coefficients, the roots, and the running evaluations, respectively, in Algorithm 4. However, we can implement Algorithm 4 with a single register. First, we note that the roots are used only to initialize the evaluations, and only the evaluations are used in the updates. Second, after the lth iteration of step 4.2, the q-degree of F^{(l)}(x) is no more than l + 1, and only the evaluations with index greater than l + 1 are needed thereafter. Thus, we can store the coefficients of F^{(l)}(x) and the remaining evaluations together in one register, which we index from left to right. Note that the evaluations are stored at the lower end of the register, and the coefficients of F^{(l)}(x) are stored at the higher end. At each iteration, the content of the register is shifted to the left by one position, so that the working evaluation is always stored at a fixed position.

Algorithm 9 (Reformulated Algorithm for Minimal Linearized Polynomials).
Input: Roots w_1, w_2, ..., w_r
Output: The minimal linearized polynomial F(x)
9.1 Initialization: load the roots into the lower end of the register and set the leading coefficient to one.
9.2 For l = 0, 1, ..., r - 1,
a) If the working evaluation is nonzero,
i) For the positions holding coefficients, apply the update of step 4.2(a);
ii) For the positions holding evaluations, apply the same update;
b) Otherwise, shift the register contents to the left by one position without updating.
9.3 F(x) is given by the higher end of the register.

TABLE III
WORST CASE DECODING LATENCY (IN TERMS OF CLOCK CYCLES). GAUSSIAN ELIMINATION OVER GF(q) (ROOT SPACE IN GABIDULIN AND KK DECODERS) HAS THE LONGEST CRITICAL PATH OF ONE MULTIPLIER, ONE ADDER, ONE TWO-INPUT MUX, AND ONE FIVE-INPUT MUX.

Fig. 10. Architecture of linearized polynomial interpolation.

Fig. 11. Processing element ME (x^{[1]} is a cyclic shift, and requires no hardware but wiring). For simplicity, we have omitted the superscripts.

We note that the updates involve one register cell that is always set to zero (see Fig. 10). When an input is not linearly independent of the previous ones, its evaluation becomes zero. In this case, the algorithm simply ignores the input, and the registers are shifted to the left by one position. Hence, whether or not the inputs are linearly independent, the minimal linearized polynomial for the inputs will be available after r iterations. This flexibility is important for our decoder architecture, since the number of linearly independent inputs varies.

Algorithm 9 is implemented by the systolic architecture shown in Fig. 10, which consists of identical processing elements (MEs). The circuitry of the processing element is shown in Fig. 11. The cr signal is 1 only at the boundary between evaluations and coefficients. The control signal for each cell is 1 only if the cell currently holds a coefficient; basically, it controls whether the update is for a coefficient or an evaluation, as in Algorithm 4.

F. Decoding Failure

A complete decoder declares decoding failure when no valid codeword is found within the decoding radius of the received word. To the best of our knowledge, decoding failures of Gabidulin and KK codes were not discussed in previous works. Similar to RS decoding algorithms, a rank decoder can return decoding failure when the roots of the error span polynomial are not unique, that is, when the root space of sigma(x) has a dimension smaller than the q-degree of sigma(x). Note that this applies to both Gabidulin and KK decoders. For KK decoders, another condition of decoding failure is when the total number of erasures and deviations exceeds the decoding bound d - 1.

G. Latency and Throughput

We analyze the worst case decoding latencies of our decoder architectures, in terms of clock cycles, in Table III.

As in [31], the latency of Gaussian elimination for the n-RRE form is bounded by a fixed number of cycles determined by the matrix dimensions, and similarly for finding the root space. For Gabidulin's algorithm, it needs one cycle per row for forward elimination and the same for backward elimination; for each coefficient, it takes additional cycles to perform a division. The latencies of finding the minimal linearized polynomials are determined by the number of registers, which must accommodate the erasure polynomial, the deviation polynomial, and their combinations. The syndromes can be computed by sets of multiply-and-accumulators. Note that the computations of the syndrome polynomial and the erasure and deviation polynomials can be done concurrently. The latency of RiBMA is determined by its 2t iterations. The latency of a symbolic product is determined by the q-degree of its factors; when computing the error span polynomial, we are concerned with only the terms of q-degree less than d - 1, because only those are meaningful for the key equation. For computing the auxiliary polynomial, a previously computed symbolic product in the modified syndrome can be reused, so it needs only one symbolic product. In total, the worst case decoding latencies of our Gabidulin and KK decoders are as summarized in Table III.

One assumption in our analysis is that the unit that computes the inverse in Figs. 9 and 11 is implemented with pure combinational logic, which leads to a long CPD for large field sizes. To achieve a short CPD for large fields, it is necessary to pipeline the inversion unit. There are two ways to pipeline it: computing the inverse by repeated multiplications, or computing the division directly by multiplications. To maintain a short CPD, the inversion needs to be implemented sequentially, with one clock cycle for each multiplication. In this case, obtaining the minimal linearized polynomials requires additional cycles, and similarly the forward elimination in Gabidulin's algorithm for the error locator incurs additional cycles proportional to the number of multiplications per inversion.

In our architectures, we use a block-level pipeline scheme for high throughput. Data transfers between modules are buffered into multiple stages, so the throughput is determined by only the longest latency of a single module. For brevity, we present only the data flow of our pipelined Gabidulin decoder in Fig. 12. The data in different pipeline stages are for different decoding sessions. Hence, these five units can work on five different sessions concurrently for higher throughput. If some block finishes before others, it cannot start another session until all are finished. So the throughput of our block-level pipeline decoders is determined by the block with the longest latency. For Gabidulin decoders, the block of finding the root space is the bottleneck, which requires

the longest latency in the worst case scenario. For KK decoders, the bottleneck is the RRE block.

Fig. 12. Data flow of our pipelined Gabidulin decoder.

TABLE IV
SYNTHESIS RESULTS OF DECODERS FOR GABIDULIN AND KK CODES

TABLE V
PERFORMANCE OF KK DECODERS FOR 512-byte PACKETS

V. IMPLEMENTATION RESULTS AND DISCUSSIONS

To evaluate the performance of our decoder architectures, we implement our architectures for Gabidulin and KK codes for RLNC over GF(2). Note that, although the random linear combinations are carried out over GF(2), decoding of Gabidulin and KK codes is performed over extension fields of GF(2).

We restrict N, the number of received packets, to save hardware while maintaining the error correction capability. We note that a large N leads to more rows in the architecture in Fig. 5. Note that we assume the input matrix is full rank, as in [5]. When N >= n + d, the number of deviations is at least d and the received word is uncorrectable. Hence, in our implementation of KK decoders, we assume N is less than n + d.

A. Implementation Results

We implement our decoder architecture in Verilog for an (8, 4) Gabidulin code over GF(2^8) and a (16, 8) one over GF(2^16), which can correct errors of rank up to two and four, respectively. We also implement our decoder architectures for their corresponding KK codes, which can correct epsilon errors, mu erasures, and delta deviations as long as 2*epsilon + mu + delta is less than five or nine, respectively. Our designs are synthesized using Cadence RTL Compiler 9.1 and the FreePDK 45-nm standard cell library [32]. The synthesis results are given in Table IV. In these tables, the total area includes both cell area and estimated net area, the gate counts are in equivalent numbers of two-input NAND gates, and the total power includes both leakage and estimated dynamic power. All estimations are made by the synthesis tool. The throughput is computed as the number of bits decoded per block divided by the product of the worst case decoding latency and the clock period.
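As a worked example of this formula, with hypothetical numbers (the synthesized figures of Table IV are not reproduced here, so the latency and clock frequency below are our placeholders):

```python
# throughput = bits per block / (worst-case latency x clock period)
n, m = 8, 8               # (8, 4) Gabidulin code over GF(2^8): n*m bits per block
latency_cycles = 300      # hypothetical worst-case latency (see Table III)
f_clk = 500e6             # hypothetical synthesized clock frequency, Hz
throughput = n * m * f_clk / latency_cycles
print(f"{throughput / 1e6:.1f} Mbit/s")  # about 106.7 Mbit/s under these assumptions
```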

and in turn larger fields are required. A larger field size implies REFERENCES
a higher complexity for finite field arithmetic, and longer codes [1] R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung, Network infor-
with greater error correction capability also lead to higher com- mation flow, IEEE Trans. Inf. Theory, vol. 46, no. 4, pp. 12041216,
Jul. 2000.
plexity. It remains to be seen whether the decoder architectures [2] T. Ho, M. Mdard, R. Koetter, D. R. Karger, M. Effros, J. Shi, and B.
continue to be affordable for longer codes over larger fields, and Leong, A random linear network coding approach to multicast, IEEE
this will be the subject of our future work. Trans. Inf. Theory, vol. 52, no. 10, pp. 44134430, Oct. 2006.
[3] R. Ktter and F. R. Kschischang, Coding for errors and erasures in
random network coding, IEEE Trans. Inf. Theory, vol. 54, no. 8, pp.
VI. CONCLUSION 35793591, Aug. 2008.
[4] N. Cai and R. W. Yeung, Network coding and error correction, in
This paper presents novel hardware architectures for Proc. IEEE Inf. Theory Workshop (ITW02), Oct. 2025, 2002, pp.
119122.
Gabidulin and KK decoders. Our work not only reduces the [5] D. Silva, F. R. Kschischang, and R. Ktter, A rank-metric approach
computational complexity for the decoder but also devises to error control in random network coding, IEEE Trans. Inf. Theory,
regular architectures suitable for hardware implementations. vol. 54, no. 9, pp. 39513967, Sep. 2008.
[6] E. M. Gabidulin, Theory of codes with maximum rank distance,
Synthesis results using a standard cell library confirm that our Probl. Inf. Transm., vol. 21, no. 1, pp. 112, Jan.Mar. 1985.
designs achieve high speed and high throughput. [7] R. M. Roth, Maximum-rank array codes and their application to criss-
cross error correction, IEEE Trans. Inf. Theory, vol. 37, no. 2, pp.
APPENDIX A
PROOF OF LEMMA 1

Proof: This follows the proof of [5, Proposition 7] closely. Suppose we are given the RRE of the received matrix as well as an n-RRE form of it. The key fact used is that the RRE form of a matrix is unique, so the corresponding blocks of the two forms coincide. In the proof of [5, Proposition 7], the auxiliary quantities are chosen based on the RRE; we make the analogous choices based on the n-RRE form. Since each quantity in the decoding tuple is uniquely determined by the form from which it is derived, the two sets of choices agree. Finally, with these choices, the rest follows the same steps as in the proof of [5, Proposition 7].
APPENDIX B
PROOF OF LEMMA 2

Proof: This follows an approach similar to that of [5, Appendix C]. We derive the chain of equalities (8) and (9); combining them with the dimension count, the subspace distance follows.
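For reference, the subspace distance evaluated here is the metric introduced in [3], which for subspaces U and V can be written as

\[
d_S(U, V) = \dim(U + V) - \dim(U \cap V) = 2\dim(U + V) - \dim U - \dim V ,
\]

a standard identity that underlies the dimension count above.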
ACKNOWLEDGMENT

The authors would like to thank Dr. D. Silva and Prof. F. R. Kschischang for valuable discussions and the reviewers for their constructive comments.

REFERENCES

[1] R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung, "Network information flow," IEEE Trans. Inf. Theory, vol. 46, no. 4, pp. 1204–1216, Jul. 2000.
[2] T. Ho, M. Médard, R. Koetter, D. R. Karger, M. Effros, J. Shi, and B. Leong, "A random linear network coding approach to multicast," IEEE Trans. Inf. Theory, vol. 52, no. 10, pp. 4413–4430, Oct. 2006.
[3] R. Kötter and F. R. Kschischang, "Coding for errors and erasures in random network coding," IEEE Trans. Inf. Theory, vol. 54, no. 8, pp. 3579–3591, Aug. 2008.
[4] N. Cai and R. W. Yeung, "Network coding and error correction," in Proc. IEEE Inf. Theory Workshop (ITW'02), Oct. 20–25, 2002, pp. 119–122.
[5] D. Silva, F. R. Kschischang, and R. Kötter, "A rank-metric approach to error control in random network coding," IEEE Trans. Inf. Theory, vol. 54, no. 9, pp. 3951–3967, Sep. 2008.
[6] E. M. Gabidulin, "Theory of codes with maximum rank distance," Probl. Inf. Transm., vol. 21, no. 1, pp. 1–12, Jan.–Mar. 1985.
[7] R. M. Roth, "Maximum-rank array codes and their application to crisscross error correction," IEEE Trans. Inf. Theory, vol. 37, no. 2, pp. 328–336, Mar. 1991.
[8] D. Silva and F. R. Kschischang, "On metrics for error correction in network coding," IEEE Trans. Inf. Theory, vol. 55, no. 12, pp. 5479–5490, Dec. 2009.
[9] M. Gadouleau and Z. Yan, "Complexity of decoding Gabidulin codes," in Proc. 42nd Ann. Conf. Inf. Sci. Syst. (CISS'08), Princeton, NJ, Mar. 19–21, 2008, pp. 1081–1085.
[10] F. R. Kschischang and D. Silva, "Fast encoding and decoding of Gabidulin codes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT'09), Seoul, Korea, Jun. 28–Jul. 3, 2009, pp. 2858–2862.
[11] P. A. Chou, Y. Wu, and K. Jain, "Practical network coding," in Proc. 41st Ann. Allerton Conf. Commun., Control, Comput., Monticello, IL, Oct. 2003.
[12] G. Richter and S. Plass, "Error and erasure decoding of rank-codes with a modified Berlekamp–Massey algorithm," in Proc. 5th Int. ITG Conf. Source Channel Coding (SCC'04), Erlangen, Germany, Jan. 2004, pp. 249–256.
[13] N. Chen and Z. Yan, "Cyclotomic FFTs with reduced additive complexities based on a novel common subexpression elimination algorithm," IEEE Trans. Signal Process., vol. 57, no. 3, pp. 1010–1020, Mar. 2009.
[14] H. Burton, "Inversionless decoding of binary BCH codes," IEEE Trans. Inf. Theory, vol. IT-17, no. 4, pp. 464–466, Jul. 1971.
[15] D. V. Sarwate and N. R. Shanbhag, "High-speed architectures for Reed–Solomon decoders," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 5, pp. 641–655, Oct. 2001.
[16] P. Lancaster and M. Tismenetsky, The Theory of Matrices, ser. Comput. Sci. Appl. Math., 2nd ed. Orlando, FL: Academic, 1985.
[17] M. Gadouleau and Z. Yan, "On the decoder error probability of bounded rank-distance decoders for maximum rank distance codes," IEEE Trans. Inf. Theory, vol. 54, no. 7, pp. 3202–3206, Jul. 2008.
[18] M. Gadouleau and Z. Yan, "Decoder error probability of bounded distance decoders for constant-dimension codes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT'09), Seoul, Korea, Jun. 28–Jul. 3, 2009, pp. 2226–2230.
[19] M. Gadouleau and Z. Yan, "On the decoder error probability of bounded rank distance decoders for rank metric codes," in Proc. IEEE Inf. Theory Workshop (ITW'09), Sicily, Italy, Oct. 11–16, 2009, pp. 485–489.
[20] P. Delsarte, "Bilinear forms over a finite field, with applications to coding theory," J. Comb. Theory, Ser. A, vol. 25, pp. 226–241, 1978.
[21] O. Ore, "On a special class of polynomials," Trans. Amer. Math. Soc., vol. 35, no. 3, pp. 559–584, 1933.
[22] O. Ore, "Contributions to the theory of finite fields," Trans. Amer. Math. Soc., vol. 36, no. 2, pp. 243–274, 1934.
[23] P. Loidreau, "A Welch–Berlekamp like algorithm for decoding Gabidulin codes," in Proc. 4th Int. Workshop Coding Cryptogr. (WCC'05), Bergen, Norway, Mar. 14–18, 2005, vol. 3969, Lecture Notes in Computer Science, pp. 36–45.
[24] E. R. Berlekamp, Algebraic Coding Theory. New York: McGraw-Hill, 1968.
[25] V. Skachek and R. M. Roth, "Probabilistic algorithm for finding roots of linearized polynomials," Des. Codes Cryptogr., vol. 46, no. 1, pp. 17–23, Jan. 2008.
[26] E. D. Mastrovito, "VLSI architectures for computations in Galois fields," Ph.D. dissertation, Dept. Electr. Eng., Linköping Univ., Linköping, Sweden, 1991.
[27] M. Wagh and S. Morgera, "A new structured design method for convolutions over finite fields, Part I," IEEE Trans. Inf. Theory, vol. IT-29, no. 4, pp. 583–595, Jul. 1983.
[28] J. K. Omura and J. L. Massey, "Computational method and apparatus for finite field arithmetic," U.S. Patent 4 587 627, May 6, 1986.
[29] A. Reyhani-Masoleh and M. A. Hasan, "A new construction of Massey–Omura parallel multiplier over GF(2^m)," IEEE Trans. Comput., vol. 51, no. 5, pp. 511–520, May 2002.
[30] E. R. Berlekamp, Algebraic Coding Theory, revised ed. Laguna Hills, CA: Aegean Park Press, 1984.
[31] A. Bogdanov, M. C. Mertens, C. Paar, J. Pelzl, and A. Rupp, "A parallel hardware architecture for fast Gaussian elimination over GF(2)," in Proc. 14th Ann. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM'06), Napa Valley, CA, Apr. 24–26, 2006, pp. 237–248.
[32] J. E. Stine, I. Castellanos, M. Wood, J. Henson, F. Love, W. R. Davis, P. D. Franzon, M. Bucher, S. Basavarajaiah, J. Oh, and R. Jenkal, "FreePDK: An open-source variation-aware design kit," in Proc. IEEE Int. Conf. Microelectron. Syst. Education (MSE'07), San Diego, CA, Jun. 3–4, 2007, pp. 173–174.
[33] H. Lee, "High-speed VLSI architecture for parallel Reed–Solomon decoder," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 2, pp. 288–294, Apr. 2003.

Ning Chen (S'06–M'10) received the B.E. and M.E. degrees from Tsinghua University, Beijing, China, in 2001 and 2004, respectively, and the Ph.D. degree from Lehigh University, Bethlehem, PA, in 2010, all in electrical engineering.
Currently, he is with the Enterprise Storage Division, PMC-Sierra, Allentown, PA. His research interests are in the VLSI design and implementation of digital signal processing and communication systems.

Zhiyuan Yan (S'00–M'03–SM'08) received the B.E. degree in electronic engineering from Tsinghua University, Beijing, China, in 1995, and the M.S. and Ph.D. degrees in electrical engineering from the University of Illinois, Urbana, in 1999 and 2003, respectively.
During the summers of 2000 and 2002, he was a Research Intern with Nokia Research Center, Irving, TX. He joined the Electrical and Computer Engineering Department, Lehigh University, Bethlehem, PA, as an Assistant Professor in August 2003. He has authored or coauthored over 70 technical papers in refereed journals and conference proceedings. He is an associate editor for the Journal of Signal Processing Systems. His current research interests are in coding theory, cryptography, wireless communications, and VLSI implementations of communication and signal processing systems.
Dr. Yan is a member of Tau Beta Pi, Sigma Xi, and Phi Kappa Phi. He served as technical program committee co-chair and general co-chair for the ACM Great Lakes Symposium on VLSI in 2007 and 2008, respectively. He has served as an associate editor for the IEEE COMMUNICATIONS LETTERS since 2008. He is a member of the IEEE Information Theory, Communications, Signal Processing, and Computer Societies.

Maximilien Gadouleau (S'06–M'10) received the M.S. degree in electrical and computer engineering from Esigelec, Saint-Etienne du Rouvray, France, in 2004, and the M.S. and Ph.D. degrees in computer engineering from Lehigh University, Bethlehem, PA, in 2005 and 2009, respectively.
From 2009 to 2010, he was a Postdoctoral Researcher with the Université de Reims Champagne-Ardenne, Reims, France. In 2010, he joined the School of Electronic Engineering and Computer Science at Queen Mary, University of London, London, U.K., as a Postdoctoral Research Assistant. His current research interests are coding theory, network coding, and cryptography, and their relations to combinatorics and graph theory.
Dr. Gadouleau is a member of the IEEE Information Theory Society.

Ying Wang (S'00–M'06) received the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign, Urbana, IL, in 2006.
Currently, she is a Staff Engineer with the New Jersey Research Center of QUALCOMM Corporate Research and Development, Bridgewater, NJ. Her research interests include wireless communications, statistical signal processing, multimedia security and forensics, and detection and estimation theory.

Bruce W. Suter (S'80–M'85–SM'92) received the B.S. and M.S. degrees in electrical engineering and the Ph.D. degree in computer science from the University of South Florida, Tampa, in 1972 and 1988, respectively.
In 1998, he joined the Technical Staff of the U.S. Air Force Research Laboratory, Rome, NY, where he was the founding Director of the Center for Integrated Transmission and Exploitation (CITE). He has held visiting appointments at Harvard University and the Massachusetts Institute of Technology. His previous positions include academic appointments at the U.S. Air Force Institute of Technology and the University of Alabama at Birmingham, together with industrial positions at Honeywell Inc. and Litton Industries. He is the author or coauthor of over 100 technical publications and the author of a widely accepted monograph, Multirate and Wavelet Signal Processing (Academic Press, 1998). His current research interests are focused on wireless computing networks and their applications to signal and image processing.
Dr. Suter is a member of Tau Beta Pi and Eta Kappa Nu. He was the recipient of the Air Force Research Laboratory (AFRL) Fellow award, an AFRL-wide award for his accomplishments in the theory, application, and implementation of signal processing algorithms; the Arthur S. Flemming Award, a government-wide award for his pioneering Hankel transform research; the General Ronald W. Yates Award for Excellence in Technology Transfer for his patented Fourier transform processor; and the Fred I. Diamond Award for best laboratory research publication. He is a former associate editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING.