
FEATURE ARTICLE by Evgeni Stavinov

A Practical Parallel CRC Generation Method
Do you understand the mechanics of the cyclic redundancy check (CRC) well
enough to build a customized parallel CRC circuit described by an arbitrary
CRC generator polynomial? This article covers a practical method of generating
Verilog or VHDL code for the parallel CRC. The result is the fast generation of
a parallel CRC code for an arbitrary polynomial and data width.

CIRCUIT CELLAR® • www.circuitcellar.com
January 2010 – Issue 234

Most electrical and computer engineers are familiar with the cyclic redundancy check (CRC). Many know that it’s used in communication protocols to detect bit errors, and that it’s essentially a remainder of the modulo-2 long division operation. Some have had closer encounters with the CRC and know that it’s implemented as a linear feedback shift register (LFSR) using flip-flops and XOR gates. They likely used an online tool or an existing example to generate parallel CRC code for a design. But very few engineers understand the mechanics of the CRC well enough to build a customized parallel CRC circuit described by an arbitrary CRC generator polynomial. What about you?

In this article, I’ll present a practical method for generating Verilog or VHDL code for the parallel CRC. This method allows for the fast generation of a parallel CRC code for an arbitrary polynomial and data width. I’ll also briefly describe other interesting methods and provide more information on the subject.

So why am I covering parallel CRC? There are several existing tools that can generate the code, and a lot of examples for popular CRC polynomials. However, it’s often beneficial to understand the underlying principles in order to implement a customized circuit or make optimizations to an existing one. This is a subject every practicing logic design engineer should understand.

Figure 1: This is a USB CRC5 implementation as LFSR using generator polynomial G(x) = x^5 + x^2 + 1.

Figure 2: This is a parallel CRC block. The next state CRC output is a function of the current state CRC and the data.

CRC OVERVIEW
Every modern communication protocol uses one or more error-detection algorithms. CRC is by far the most popular. CRC properties are defined by the generator polynomial length and coefficients. The protocol specification usually defines CRC in hex or polynomial notation. For
example, CRC5 used in the USB protocol is represented as 0x5 in hex notation or as G(x) = x^5 + x^2 + 1 in polynomial notation:

$$\text{Hex notation } \mathtt{0x5} \;\leftrightarrow\; \text{polynomial notation } G(x) = x^5 + x^2 + 1$$

This CRC is typically implemented in hardware as a linear feedback shift register (LFSR) with a serial data input (see Figure 1).

In many cases the serial LFSR implementation of the CRC is suboptimal for a given design. Because of the serial data input, it only allows the CRC calculation of one data bit every clock. If a design has an N-bit datapath, meaning that every clock the CRC module has to calculate CRC on N bits of data, serial CRC will not work. One example is USB 2.0, which transmits data at 480 MHz on the physical level. A typical USB PHY chip has an 8- or 16-bit data interface to the chip that does protocol processing. A circuit that checks or generates CRC has to work at that speed.

Another more esoteric application I’ve encountered has to do with calculating 64-bit CRC on data written and read from a 288-bit-wide memory controller (two 64-bit DDR DIMMs with ECC bits). To achieve higher throughput, the CRC’s serial LFSR implementation must be converted into a parallel N-bit-wide circuit, where N is the design datapath width, so that N bits are processed in every clock. This is a parallel CRC implementation, which is the subject of this article. Figure 2 is a simplified block diagram of the parallel CRC.

Even though the CRC was invented almost half a century ago and has gained widespread use, it still sparks a lot of interest in the research community. There is a constant stream of research papers and patents that offer different parallel CRC implementations with speed and logic area improvements. I searched the available literature and web resources about parallel CRC calculation methods for hardware description languages (HDL) and found a handful of papers. (Refer to the Resources section at the end of this article.) However, most were academic and focused on the theoretical aspect of the parallel CRC generation. They were too impractical to implement in software or hardware for a quick HDL code generation of CRC with arbitrary data and polynomial widths.

Listing 1: This Verilog module implements parallel USB CRC5 with 4-bit data.

//==========================================================================
// Verilog module that implements parallel USB CRC5 with 4-bit data
//==========================================================================
module crc5_parallel(
    input [3:0] data_in,
    output reg [4:0] crc5,
    input rst,
    input clk);

    // LFSR for USB CRC5
    function [4:0] crc5_serial;
        input [4:0] crc;
        input data;

        begin
            crc5_serial[0] = crc[4] ^ data;
            crc5_serial[1] = crc[0];
            crc5_serial[2] = crc[1] ^ crc[4] ^ data;
            crc5_serial[3] = crc[2];
            crc5_serial[4] = crc[3];
        end
    endfunction

    // 4 iterations of USB CRC5 LFSR
    function [4:0] crc_iteration;
        input [4:0] crc;
        input [3:0] data;
        integer i;

        begin
            crc_iteration = crc;

            for(i=0; i<4; i=i+1)
                crc_iteration = crc5_serial(crc_iteration, data[3-i]);
        end
    endfunction

    always @(posedge clk, posedge rst) begin
        if(rst) begin
            crc5 <= 5'h1F;
        end
        else begin
            crc5 <= crc_iteration(crc5,data_in);
        end
    end
endmodule
//==========================================================================

An additional requirement for the method is that the parallel CRC generator must be able to accept any data width (not only power-of-2) to be useful. Going back to the USB 2.0 CRC5
example, a convenient data width to use for the parallel CRC of polynomial width 5 is 11 because USB packets using CRC5 are 16 bits. Another example is the 16-lane PCI Express with a 128-bit datapath (16 8-bit symbols). Because the beginning of a packet is a K-code symbol and doesn’t participate in the CRC calculation, the parallel CRC data is 120 bits wide.

Listing 2: This Verilog function implements the serial USB CRC5.

//=============================================================
// Verilog function that implements serial USB CRC5
//=============================================================
function [4:0] crc5_serial;
    input [4:0] crc;
    input data;

    begin
        crc5_serial[0] = crc[4] ^ data;
        crc5_serial[1] = crc[0];
        crc5_serial[2] = crc[1] ^ crc[4] ^ data;
        crc5_serial[3] = crc[2];
        crc5_serial[4] = crc[3];
    end
endfunction
//=============================================================

Listing 3: This pseudocode is an example of CRCparallel.

//=============================================================
routine CRCparallel(Nin, Min)
    Mout = Min
    for(i=0;i<N;i++)
        // one data bit per call, MSB first as in Listing 1
        Mout = CRCserial(Nin[N-1-i], Mout)
    return Mout
//=============================================================

Before going any further into the topic of parallel CRC, I’ll briefly review modulo-2 polynomial arithmetic. A polynomial is a value expressed in the following form:

$$P(x) = \sum_{i=0}^{N} p(i)\,x^i = p(0) + p(1)x + \dots + p(N)x^N$$

where each coefficient p(i) is 0 or 1. Polynomial addition and subtraction operations use bitwise XOR. Here is an example:

$$P(x) = x^3 + x^2 + 1$$
$$Q(x) = x^2 + x + 1$$
$$P(x) + Q(x) = x^3 + (x^2 + x^2) + x + (1 + 1) = x^3 + x$$
Polynomial multiplication by two is a left shift, and unsigned division by two is a right shift. Modulo-2 polynomial division is realized the same way as long division over integers. Cyclic left and right shifts are multiplication and division by (2 mod 2^n - 1).
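As a rough illustration of modulo-2 long division, the Python sketch below divides one bit-mask polynomial by another using only shifts and XOR. The function names `poly_mul` and `poly_divmod` and the bit-mask encoding (bit i holds the coefficient of x^i) are assumptions chosen for this example.

```python
def poly_mul(a, b):
    """Carry-less (modulo-2) multiplication of two bit-mask polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # add the shifted copy of a (addition is XOR)
        a <<= 1           # multiply a by x: a left shift
        b >>= 1
    return result


def poly_divmod(dividend, divisor):
    """Modulo-2 long division; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("polynomial division by zero")
    quotient = 0
    remainder = dividend
    dlen = divisor.bit_length()
    while remainder.bit_length() >= dlen:
        shift = remainder.bit_length() - dlen
        quotient |= 1 << shift
        remainder ^= divisor << shift  # subtract the aligned divisor (XOR)
    return quotient, remainder


G = 0b100101                      # USB CRC5: x^5 + x^2 + 1
q, r = poly_divmod(0b100000, G)   # divide x^5 by G(x)
print(q, bin(r))                  # 1 0b101
```

The remainder of x^5 is x^2 + 1, which corresponds to the two XOR taps (bits 0 and 2) in the serial LFSR of Listing 2.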
PARALLEL CRC GENERATION
I’ll start the discussion with a Verilog module that generates parallel USB CRC5 with 4-bit data (see Listing 1). A synthesis tool will do its magic and produce a circuit depending on the target FPGA or ASIC technology. However, the purpose of this article is to explain how to get a parallel CRC circuit using XOR gates and flip-flops.
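Before deriving the XOR equations, it can help to have a software reference for what Listing 1 computes. The Python sketch below is one possible bit-accurate model of its two functions. The name `crc5_parallel_model` and the integer encoding of the CRC state are assumptions made for illustration.

```python
def crc5_serial(crc, data):
    """One clock of the USB CRC5 LFSR, mirroring crc5_serial in Listing 1."""
    c = [(crc >> i) & 1 for i in range(5)]          # c[i] is CRC bit i
    nxt = [c[4] ^ data, c[0], c[1] ^ c[4] ^ data, c[2], c[3]]
    return sum(bit << i for i, bit in enumerate(nxt))


def crc5_parallel_model(crc, data_in, width=4):
    """Apply the serial LFSR once per data bit, MSB first, like crc_iteration."""
    for i in range(width):
        crc = crc5_serial(crc, (data_in >> (width - 1 - i)) & 1)
    return crc


state = 0x1F                                   # reset value used in Listing 1
state = crc5_parallel_model(state, 0b0000)     # one clock with 4 zero data bits
print(f"{state:05b}")                          # 00110
```

A model like this can serve as a golden reference when checking any hand-derived or generated parallel equations against the serial behavior.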
Next I’ll describe a practical method that I use to generate parallel CRC in a number of projects. It works on any polynomial and data size, independent of the target technology. Later I’ll present other methods that have some useful properties.

The step-by-step description is accompanied by an example of parallel CRC generation for the USB CRC5 polynomial G(x) = x^5 + x^2 + 1 with 4-
bit data width. The method, which takes advantage of the theory described in a paper by Giuseppe Campobello et al. titled “Parallel CRC Realization,” as well as in a paper by G. Albertengo and R. Sisto titled “Parallel CRC Generation,” leverages a simple serial CRC generator and the linear properties of the CRC to build a parallel CRC circuit.

In Step 1, denote N = data width and M = CRC polynomial width. For parallel USB CRC5 with a 4-bit datapath, N = 4 and M = 5.

In Step 2, implement a serial CRC generator routine for a given polynomial. It’s a straightforward process and can be done using different programming languages or scripts (e.g., C, Java, Verilog, or Perl). You can use the Verilog function crc5_serial in Listing 2 for the serial USB CRC5. Denote this routine as CRCserial. You can also build a routine CRCparallel(Nin, Min) that simply calls CRCserial N times (the number of data bits) and returns Mout. The pseudocode in Listing 3 is an example of CRCparallel.

In Step 3, parallel CRC implementation is a function of N-bit data input and M-bit current CRC state, as shown in Figure 2. We’re going to build two matrices. Matrix H1 describes Mout (next CRC state) as a function of Nin (input data) when Min = 0. Thus, Mout = CRCparallel(Nin, Min = 0), and the H1 matrix is of size [N × M]. Matrix H2 describes Mout (next CRC state) as a function of Min (current CRC state) when Nin = 0. Thus, Mout = CRCparallel(Nin = 0, Min), and the H2 matrix is of size [M × M].

In Step 4, build the matrix H1. Using the CRCparallel routine from Step 2, calculate the CRC for the N values of Nin when Min = 0. The values are one-hot encoded; that is, each of the Nin values has only one bit set. For N = 4, the values are 0x1, 0x2, 0x4, 0x8 in hex representation. Table 1 shows matrix H1 values for USB CRC5 with N = 4.

Min = 0    Mout[4]  Mout[3]  Mout[2]  Mout[1]  Mout[0]
Nin[0]        0        0        1        0        1
Nin[1]        0        1        0        1        0
Nin[2]        1        0        1        0        0
Nin[3]        0        1        1        0        1
Table 1: This is the matrix H1 for USB CRC5 with N = 4.

In Step 5, build the matrix H2. Using the CRCparallel routine from Step 2, calculate CRC for the M values of Min when Nin = 0. The values are one-hot encoded. For M = 5, Min values are 0x1, 0x2, 0x4, 0x8, 0x10 in hex representation. Table 2 shows the matrix H2 values for USB CRC5 with N = 4.

Nin = 0    Mout[4]  Mout[3]  Mout[2]  Mout[1]  Mout[0]
Min[0]        1        0        0        0        0
Min[1]        0        0        1        0        1
Min[2]        0        1        0        1        0
Min[3]        1        0        1        0        0
Min[4]        0        1        1        0        1
Table 2: This is the matrix H2 for USB CRC5 with N = 4.

In Step 6, you’re ready to construct the parallel CRC equations. Each set bit j in column i of the matrix H1 (and that’s the critical part of the method) participates in the parallel CRC equation of the bit Mout[i] as Nin[j]. Likewise, each set bit j in column i of the matrix H2 participates in the parallel CRC equation of the bit Mout[i] as Min[j].

All participating inputs Min[j] and Nin[j] that form Mout[i] are XORed together. For USB CRC5 with N = 4, the parallel CRC equations are as follows:

Mout[0] = Min[1] ^ Min[4] ^ Nin[0] ^ Nin[3]
Mout[1] = Min[2] ^ Nin[1]
Mout[2] = Min[1] ^ Min[3] ^ Min[4] ^ Nin[0] ^ Nin[2] ^ Nin[3]
Mout[3] = Min[2] ^ Min[4] ^ Nin[1] ^ Nin[3]
Mout[4] = Min[0] ^ Min[3] ^ Nin[2]

Mout is the parallel CRC implementation. I used Table 1 and Table 2 to derive the equations.

The reason this method works is in the way we constructed matrices H1 and H2, where rows are linearly independent. We also used the fact that CRC is a linear operation:

$$\mathrm{CRC}(A + B) = \mathrm{CRC}(A) + \mathrm{CRC}(B)$$

The resulting Verilog module generates parallel USB CRC5 with 4-bit data (see Listing 4).

OTHER PARALLEL CRC METHODS
There are many other methods for parallel CRC generation. Each method has advantages and drawbacks. Some are more suitable for high-speed designs where logic area is less of an issue. Others offer the most compact designs, but for lower speed. As with almost everything else in engineering, you have to make trade-offs to bring your designs to completion.

Let’s review the most notable methods. One method derives a recursive formula for parallel CRC directly from a serial implementation. The idea is to represent an LFSR for serial CRC as a discrete-time linear system:

$$X(i+1) = F\,X(i) + U(i)$$

Vector X(i) is the current LFSR output. X(i + 1) is the output in the next clock. Vector U(i) is the ith element of the input sequence. F is a matrix chosen according


to the equations of the serial LFSR. For example, USB CRC5 G(x) = x^5 + x^2 + 1 will produce Figure 3, where p(i) are polynomial coefficients. Addition and multiplication operations are bitwise logic XOR and AND, respectively.

$$F = \begin{bmatrix} p(4) & 1 & 0 & 0 & 0 \\ p(3) & 0 & 1 & 0 & 0 \\ p(2) & 0 & 0 & 1 & 0 \\ p(1) & 0 & 0 & 0 & 1 \\ p(0) & 0 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \end{bmatrix}$$

Figure 3: This is matrix F in the formula X(i + 1) = FX(i) + U(i) for the recursive parallel CRC method. The values are for the USB CRC5 polynomial G(x) = x^5 + x^2 + 1.

After m clocks, the state is X(i + m), and the solution can be obtained recursively.

$$X(i+m) = F^{m} X(i) + F^{m-1} U(i) + \dots + F\,U(i+m-2) + U(i+m-1)$$

Here m is the desired data width. Each row k of the X(i + m) solution is a parallel CRC equation of bit k. An important result of this method is that it establishes a formal proof of solution existence. It’s not immediately obvious that it’s possible to derive a parallel CRC circuit from a serial one.

Another method uses two-stage CRC calculation. The idea is that checking and generating CRC is done not with generator polynomial G(x), but with another polynomial M(x) = G(x)P(x). M(x) is chosen so that it has fewer terms than G(x) to simplify the complexity of the circuit that realizes the division. The result of the division by M(x), which has a fixed length, is divided again by G(x) to get the CRC.

Calculating the CRC with “byte enable” is another method that is important in many cases. For example, if the data width is 16 bits but a packet ends on an 8-bit boundary, it would require having two separate CRC modules for 8 and 16 bits. The byte enable method allows for the reuse of the 16-bit CRC circuit to calculate an 8-bit CRC.

There is also a DSP unfolding technique to build a parallel CRC. The idea is to model an LFSR as a digital filter and use graph-based unfolding to unroll loops and obtain the parallel processing.

Other methods include using look-up tables (LUTs) with precomputed CRC values.

Listing 4: This is a Verilog module that implements parallel USB CRC5 with 4-bit data using XOR gates.

//===================================================================
// Verilog module that implements parallel USB CRC5 with 4-bit
// data using XOR gates
//===================================================================
module crc5_4bit(
    input [3:0] data_in,
    output [4:0] crc_out,
    input rst,
    input clk);

    reg [4:0] lfsr_q,lfsr_c;
    assign crc_out = lfsr_q;

    always @(*) begin
        lfsr_c[0] = lfsr_q[1] ^ lfsr_q[4] ^ data_in[0] ^ data_in[3];
        lfsr_c[1] = lfsr_q[2] ^ data_in[1];
        lfsr_c[2] = lfsr_q[1] ^ lfsr_q[3] ^ lfsr_q[4] ^ data_in[0] ^
                    data_in[2] ^ data_in[3];
        lfsr_c[3] = lfsr_q[2] ^ lfsr_q[4] ^ data_in[1] ^ data_in[3];
        lfsr_c[4] = lfsr_q[0] ^ lfsr_q[3] ^ data_in[2];
    end // always

    always @(posedge clk, posedge rst) begin
        if(rst) begin
            lfsr_q <= 5'h1F;
        end
        else begin
            lfsr_q <= lfsr_c;
        end
    end // always

endmodule // crc5_4bit
//===================================================================

PERFORMANCE RESULTS
The logic use and timing performance of a parallel CRC circuit largely depends on the underlying target FPGA or ASIC technology, data width, and polynomial width. For instance, Verilog or VHDL code will be synthesized differently for the Xilinx Virtex5 and Virtex4 FPGA families because of the differences in the underlying LUT input sizes. Virtex5 has 6-input LUTs, whereas the Virtex4 has 4-input LUTs.

In general, the logic utilization of a parallel CRC circuit will grow linearly with the data width. Using the big-O notation, logic size complexity is O(n), where n is the data width. For example, each of the CRC5’s five output bits is a function of four data input bits:

CRCout[i] = Function(CRCin[4:0], Data[3:0])

Doubling the data width to 8 bits doubles the number of participating data bits in each CRC5 bit equation. That will make the total CRC circuit size up to 10 times bigger (i.e., 5 × 2). Of course, not all bits will double; that depends on the polynomial. But the point is that the circuit size will grow linearly.

Logic utilization will grow as a second power of the polynomial width, or O(n^2). Doubling the polynomial width in CRC5 from 5 to 10 (let’s call it CRC10, which has different properties) doubles the size of each CRC10 output bit. The number of CRC outputs is also doubled, so the total size increase is up to 4 times (i.e., 2^2).
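To get a feel for how the equations grow with the data width, the following Python sketch builds H1 and H2 from one-hot inputs (as in Steps 4 and 5), turns their columns into XOR equations (Step 6), and counts the XOR inputs for N = 4 and N = 8. It is a minimal illustration under stated assumptions: the helper names `crc_parallel`, `build_matrices`, and `equations` are hypothetical, the serial routine is a software model of Listing 2, and the term count is only a crude stand-in for synthesized logic size.

```python
def crc5_serial(crc, data):
    """Software model of the serial USB CRC5 LFSR from Listing 2."""
    c = [(crc >> i) & 1 for i in range(5)]
    nxt = [c[4] ^ data, c[0], c[1] ^ c[4] ^ data, c[2], c[3]]
    return sum(bit << i for i, bit in enumerate(nxt))


def crc_parallel(n_in, m_in, n):
    """Call the serial routine once per data bit, MSB first (Listing 3)."""
    for i in range(n):
        m_in = crc5_serial(m_in, (n_in >> (n - 1 - i)) & 1)
    return m_in


def build_matrices(n, m=5):
    """Rows of H1/H2 are the CRC of one-hot Nin (Min = 0) and one-hot Min (Nin = 0)."""
    h1 = [crc_parallel(1 << j, 0, n) for j in range(n)]
    h2 = [crc_parallel(0, 1 << j, n) for j in range(m)]
    return h1, h2


def equations(n, m=5):
    """For each Mout[i], list the Min[j]/Nin[j] terms whose row has bit i set."""
    h1, h2 = build_matrices(n, m)
    eqs = []
    for i in range(m):
        terms = [f"Min[{j}]" for j in range(m) if (h2[j] >> i) & 1]
        terms += [f"Nin[{j}]" for j in range(n) if (h1[j] >> i) & 1]
        eqs.append(terms)
    return eqs


for i, terms in enumerate(equations(4)):
    print(f"Mout[{i}] = " + " ^ ".join(terms))

for n in (4, 8):
    total = sum(len(terms) for terms in equations(n))
    print(f"N = {n}: {total} XOR inputs across all five equations")
```

For N = 4 the printed equations should match the ones derived from Table 1 and Table 2. Actual LUT counts still depend on the synthesis tool and target device, so the term count is only a relative indicator.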
The circuit’s timing performance decreases because it requires more combinational logic levels to synthesize CRC output logic given the wider data and polynomial inputs.

I used free Xilinx WebPACK tools to simulate and synthesize parallel CRC circuits for USB CRC5 and the popular Ethernet CRC32. You can explore the results in the available Verilog code and project files.

Xilinx’s Virtex5 LX30 is the target FPGA. Table 3a shows USB CRC5 with 4-bit data using the “for loop” Verilog implementation. Table 3b shows USB CRC5 with 4-bit data using the “XOR” Verilog implementation. Table 3c shows CRC32 with 32-bit data using the “for loop” Verilog implementation. Table 3d shows CRC32 with 32-bit data using the “XOR” Verilog implementation. Note that a single Xilinx Virtex5 Slice contains four FFs and four LUTs.

                     a)      b)      c)      d)
Number of LUTs       5       5       214     161
Number of FFs        5       5       32      32
Number of Slices     2       2       93      71
Table 3: a) Logic utilization for USB CRC5, 4-bit data using the “for loop” method. b) Logic utilization for USB CRC5, 4-bit data using the “XOR” method. c) Logic utilization for CRC32, 32-bit data using the “for loop” method. d) Logic utilization for CRC32, 32-bit data using the “XOR” method.

As expected, the number of FFs is five and 32 for CRC5 and CRC32. For a small CRC5 circuit, there is no difference in the logic utilization. However, for a larger CRC32, the code using the XOR method produces more compact logic than the “for loop” approach.

These synthesis results should be taken with a grain of salt. The results are specific to the targeted technology (ASIC or FPGA family) and synthesis tool settings.

METHOD SUMMARY
The parallel CRC generation method leverages a simple serial CRC generator and the linear properties of the CRC to build the H1 [N × M] and H2 [M × M] matrices. Row [i] of the H1 matrix is the CRC value of Nin with a single bit [i] set, while Min = 0. Row [i] of the H2 matrix is the CRC value of Min with a single bit [i] set, while Nin = 0. Column [j] of the H1 and H2 matrices contains the polynomial coefficients of the CRC output bit [j].

I’ve used this method successfully in several communication and test-and-measurement projects. An online parallel CRC generator tool available at OutputLogic.com uses this method to produce Verilog or VHDL code given an arbitrary data and polynomial width. A similar method is also used to generate parallel scramblers. Perhaps I’ll cover the topic in a future article.

If you’ve read this article carefully, you should be able to solve the following problem.
Problem: Consider the polynomial G(x) = x + 1. What well-known error detection code does this polynomial represent? Derive a parallel equation of this polynomial for 8-bit data input. Hint: Draw a circuit with serial data input and think about how the output depends on the number of “1” bits in the input datastream.
The answer is available on the Circuit Cellar FTP site.

Evgeni Stavinov (evgeni@outputlogic.com) is a system design engineer for Xilinx who holds an MSEE from USC and a BSEE from The Technion, Israel Institute of Technology. He has more than 10 years of design experience in the areas of FPGA logic design, embedded software, and networking. Evgeni worked for CATC, LeCroy, and SerialTek designing test and measurement tools for USB, Wireless USB, PCI Express, Bluetooth, SAS, and SATA protocols. He also created OutputLogic.com, a web portal that offers online tools for FPGA and ASIC designers, and serves as its main developer.

PROJECT FILES
To download the code, go to ftp://ftp.circuitcellar.com/pub/Circuit_Cellar/2009/234.

RESOURCES
G. Albertengo and R. Sisto, “Parallel CRC Generation,” IEEE Micro, Vol. 10, No. 5, 1990.

G. Campobello, G. Patane, M. Russo, “Parallel CRC Realization,” http://ai.unime.it/~gp/publications/full/tccrc.pdf.

R. J. Glaise, “A Two-Step Computation of Cyclic Redundancy Code CRC-32 for ATM Networks,” IBM Journal of Research and Development, Vol. 41, Issue 6, 1997.

A. Perez, “Byte-wise CRC Calculations,” IEEE Micro, Vol. 3, No. 3, 1983.

A. Simionescu, “CRC Tool: Computing CRC in Parallel for Ethernet,” Nobug Consulting, http://space.ednchina.com/upload/2008/8/27/5300b83c-43ea-459b-ad5c-4dc377310024.pdf.