Codec 2
David Rowe
1 Introduction
Codec 2 is an open source speech codec designed for communications quality
speech between 700 and 3200 bit/s. The main application is low bandwidth
HF/VHF digital radio. It fills a gap in open source voice codecs beneath 5000
bit/s and is released under the GNU Lesser General Public License (LGPL).
Key features include:
2. Modest CPU (a few 10s of MIPs) and memory (a few 10s of kbytes of
RAM) requirements such that it can run on stm32 class microcontrollers
with hardware FPU.
3. Codec 2 has been designed for digital voice over radio applications, and
retains intelligible speech at a few percent bit error rate.
Section 3 provides a detailed description using math and signal processing theory. Combined with the C
source code, it is intended to give the reader enough information to understand
the operation of Codec 2 in detail and embark on source code level projects,
such as improvements, ports to other languages, student or academic research
projects. Issues with the current algorithms and topics for further work are also
included. Section 4 provides a summary of the Codec 2 modes, and Section 5
a guide to the C source files. A glossary of terms and symbols is provided in
Section 6, and Section 7 has suggestions for further documentation work.
The production of this document was kindly supported by an ARDC grant
[1]. As an open source project, many people have contributed to Codec 2 over
the years - we deeply appreciate all of your support.
On the right hand side it also appears to repeat itself - one cycle looks very
similar to the last. This cycle time is the “pitch period”, which for this example
is around P = 35 samples. Given we are sampling at Fs = 8000 Hz, the pitch
period is P/Fs = 35/8000 = 0.0044 seconds, or 4.4ms.
Figure 1: A 40ms segment from the word “these” from a female speaker,
sampled at 8kHz. Top is a plot against time, bottom (blue) is a plot of the same
speech against frequency. The waveform repeats itself every 4.3ms (F0 = 230
Hz); this is the “pitch period” of this segment. The red crosses are the sine
wave amplitudes, explained in the text.
Now if the pitch period is 4.4ms, the pitch frequency or fundamental fre-
quency F0 is about 1/0.0044 ≈ 230 Hz. If we look at the blue frequency domain
plot at the bottom of Figure 1, we can see spikes that repeat every 230 Hz.
If the signal is repeating itself in the time domain, it also repeats itself in the
frequency domain. Those spikes separated by about 230 Hz are harmonics of
the fundamental frequency F0 .
Note that each harmonic has its own amplitude, which varies across frequency.
The red crosses mark the amplitude of each harmonic. In this example, there is
a peak around 500 Hz and another broader peak around 2300 Hz. The ear
perceives speech by the location of these peaks and troughs.
Figure 2: The sinusoidal speech model. If we sum a series of sine waves, we can
generate a speech signal. Each sinewave has its own amplitude (A1 , A2 , ...AL ),
frequency, and phase (not shown). We assume the frequencies are multiples of
the fundamental frequency F0 . L is the total number of sinewaves we can fit in
4 kHz.
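To make the model in Figure 2 concrete, the sketch below sums L harmonics to generate a frame of synthetic speech. This is a minimal illustration rather than the Codec 2 implementation (which synthesises in the frequency domain); the function and argument names are chosen for this example only.

```c
#include <math.h>

/* Sum L harmonics of the fundamental Wo (radians/sample) to produce
   nsam output samples.  A[m] and theta[m] are the amplitude and phase
   of harmonic m (1-based, as in the text). */
void synth_sum_of_sines(float out[], int nsam, float Wo, int L,
                        const float A[], const float theta[]) {
    for (int n = 0; n < nsam; n++) {
        float acc = 0.0f;
        for (int m = 1; m <= L; m++)
            acc += A[m] * cosf(m * Wo * n + theta[m]);
        out[n] = acc;
    }
}
```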
The model parameters evolve over time, but can generally be considered
constant for a short time window (a few 10s of ms). For example, pitch evolves
over time, moving up or down as a word is articulated.
As the model parameters change over time, we need to keep updating them.
This is known as the frame rate of the codec, which can be expressed in terms
of frequency (Hz) or time (ms). For sampling model parameters, Codec 2 uses
a frame rate of 10ms. For transmission over the channel, we reduce this to
20-40ms in order to lower the bit rate. The trade off with a lower frame rate is
reduced speech quality.
The parameters of the sinusoidal model are:
1. The frequency of each sine wave. As they are all harmonics of F0 we can
just send F0 to the decoder, and it can reconstruct the frequency of each
harmonic as F0 , 2F0 , 3F0 , ..., LF0 . We use 5-7 bits/frame to represent F0
in Codec 2.
2. The amplitude of each sine wave, A1 , A2 , ..., AL . These “spectral ampli-
tudes” are really important as they convey the information the ear needs
to understand speech. Most of the bits are used for spectral amplitude
information. Codec 2 uses between 18 and 50 bits/frame for spectral am-
plitude information.
3. Voicing information. Speech can be approximated into voiced speech (vow-
els) and unvoiced speech (like consonants), or some mixture of the two.
The example in Figure 1 above is voiced speech. So we need some way to
describe voicing to the decoder. This requires just a few bits/frame.
4. The phase of each sine wave. Codec 2 discards the phases of each harmonic
at the encoder and reconstructs them at the decoder using an algorithm,
so no bits are required for phases. This results in some drop in speech
quality.
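Collected together, the per-frame parameter set described above might look like the following structure. This is a simplified sketch for illustration; the structures in the C source differ in detail.

```c
#define MAX_HARMONICS 80           /* L <= Fs/(2*F0min) = 8000/(2*50)        */

/* Model parameters for one 10ms analysis frame (simplified sketch). */
struct model_frame {
    float Wo;                      /* fundamental frequency, radians/sample  */
    int   L;                       /* number of harmonics that fit in 4 kHz  */
    float A[MAX_HARMONICS + 1];    /* harmonic amplitudes A[1]..A[L]         */
    float phi[MAX_HARMONICS + 1];  /* harmonic phases (discarded before tx)  */
    int   voiced;                  /* 1 = voiced, 0 = unvoiced               */
};
```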
Figure 3: Codec 2 Encoder.
The sequence of model parameters is decimated to lower the frame rate, which helps us lower the bit
rate. Decimating to 20ms (throwing away every 2nd set of model parameters)
doesn’t have much effect, but beyond that the speech quality starts to degrade.
So there is a trade off between decimation rate and bit rate over the channel.
Once we have the desired frame rate, we “quantise” each model parameter.
This means we use a fixed number of bits to represent it, so we can send the bits
over the channel. Parameters like pitch and voicing are fairly easy, but quite
a bit of DSP goes into quantising the spectral amplitudes. For the higher bit
rate Codec 2 modes, we design a filter that matches the spectral amplitudes,
then send a quantised version of the filter over the channel. Using the example
in Figure 1 - the filter would have band pass peaks at 500 and 2300 Hz. Its
frequency response would follow the red line. The filter is time varying - we
redesign it for every frame.
You’ll notice the term “estimate” being used a lot. One of the problems
with model based speech coding is the algorithms we use to extract the model
parameters are not perfect. Occasionally the algorithms get it wrong. Look
at the red crosses on the bottom plot of Figure 1. These mark the amplitude
estimate of each harmonic. If you look carefully, you’ll see that above 2000Hz,
the crosses fall a little short of the exact centre of each harmonic. This is an
example of a “fine” pitch estimator error, a little off the correct value.
Often the errors interact, for example the fine pitch error shown above will
mean the amplitude estimates are a little bit off as well. Fortunately, these errors
tend to be temporary and are sometimes not even noticeable to the listener -
remember this codec is often used for HF/VHF radio where channel noise is
part of the normal experience.
Figure 4 shows the operation of the Codec 2 decoder. We take the sequence
of bits received from the channel and recover the quantised model parameters,
pitch, spectral amplitudes, and voicing. We then resample the model parameters
Figure 4: Codec 2 Decoder
back up to the 10ms frame rate using a technique called interpolation. For
example, say we receive an F0 = 200 Hz pitch value, then 20ms later F0 = 220
Hz. We can use the average F0 = 210 Hz for the middle 10ms frame.
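A minimal sketch of this interpolation step, with an illustrative helper name rather than the codec2.c API:

```c
/* Linear interpolation between two sets of received model parameters.
   frac = 0.0 gives the previous frame, 1.0 the next; the middle 10ms
   frame in the example above uses frac = 0.5 (200 Hz, 220 Hz -> 210 Hz). */
float interp_param(float prev, float next, float frac) {
    return (1.0f - frac) * prev + frac * next;
}
```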
The phases of each harmonic are generated using the other model parameters
and some DSP. It turns out that if you know the amplitude spectrum, you can
determine a “reasonable” phase spectrum using some DSP operations, which in
practice is implemented with a couple of FFTs. We also use the voicing infor-
mation - for unvoiced speech we use random phases (a good way to synthesise
noise-like signals) - and for voiced speech we make sure the phases are chosen
so the synthesised speech transitions smoothly from one frame to the next.
Frames of speech are synthesised using an inverse FFT. We take a blank array
of FFT samples, and at intervals of F0 insert samples with the amplitude and
phase of each harmonic. We then inverse FFT to create a frame of time domain
samples. These frames of synthesised speech samples are carefully aligned with
the previous frame to ensure smooth frame-frame transitions and output to the
listener.
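The sketch below shows the idea: write each harmonic's amplitude and phase into the nearest DFT bin, then inverse transform to obtain time domain samples. A naive O(N²) inverse DFT is used for clarity and the output scaling is simplified; the real implementation in sine.c uses an FFT together with an overlap-add window.

```c
#include <complex.h>
#include <math.h>

#define NDFT 512

/* Build a conjugate-symmetric harmonic spectrum and inverse DFT it to a
   frame of speech samples (naive inverse DFT for clarity). */
void synth_frame_idft(float out[NDFT], float Wo, int L,
                      const float A[], const float theta[]) {
    complex float S[NDFT] = {0};
    float r = Wo * NDFT / (2.0f * (float)M_PI);    /* harmonic -> bin map */

    for (int m = 1; m <= L; m++) {
        int k = (int)floorf(m * r + 0.5f);         /* nearest DFT bin     */
        S[k]        = A[m] * cexpf(I * theta[m]);
        S[NDFT - k] = conjf(S[k]);                 /* keep output real    */
    }
    for (int n = 0; n < NDFT; n++) {               /* inverse DFT         */
        complex float acc = 0.0f;
        for (int k = 0; k < NDFT; k++)
            acc += S[k] * cexpf(I * 2.0f * (float)M_PI * k * n / NDFT);
        out[n] = crealf(acc) / NDFT;
    }
}
```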
If the quantiser sees anything unusual (for example, a different microphone frequency response
or background noise), the quantisation can become very rough and speech qual-
ity poor. We train the tables at design time using a database of speech samples
and a training algorithm - an early form of machine learning.
Codec 2 3200 uses the method of fitting a filter to the spectral amplitudes. This approach tends to be more forgiving of small variations in the input speech
spectrum, but is not as efficient in terms of bit rate.
3 Detailed Design
3.1 Overview
Codec 2 is based on sinusoidal [8] and Multi-Band Excitation (MBE) [3] vocoders
that were first developed in the late 1980s. Descendants of the MBE vocoders
(IMBE, AMBE etc) have enjoyed widespread use in many applications such
as VHF/UHF handheld radios and satellite communications. In the 1990s the
author studied sinusoidal speech coding [10], which provided the skill set and a
practical, patent free baseline for starting the Codec 2 project.
Some features of the Codec 2 Design:
1. A pitch estimator based on a 2nd order non-linearity developed by the
author.
2. A single voiced/unvoiced binary voicing model.
3. A frequency domain IFFT/overlap-add synthesis model for voiced and
unvoiced speech.
4. Phases are not transmitted, they are synthesised at the decoder from the
magnitude spectrum and voicing decision.
5. For the higher bit rate modes (1200 to 3200 bits/s), spectral magnitudes
are represented using LPCs extracted from time domain analysis and
scalar LSP quantisation.
6. For Codec 2 700C, vector quantisation of resampled spectral magnitudes
in the log domain.
7. Minimal interframe prediction in order to minimise error propagation and
maximise robustness to channel errors.
8. A post filter that enhances the speech quality of the baseline codec, espe-
cially for low pitched (male) speakers.
The time domain speech signal s(n) is divided into overlapping analysis
windows (frames) of Nw = 279 samples. The centre of each analysis window
is separated by N = 80 or 10ms. Codec 2 operates at an internal frame rate
of 100 Hz. To analyse the l-th frame it is convenient to convert the fixed time
reference to a sliding time reference centred on the current analysis window:
where the energy in the window is normalised such that:

\sum_{n=0}^{N_w-1} w^2(n) = \frac{1}{N_{dft}}    (5)
To analyse s(n) in the frequency domain the Ndf t point Discrete Fourier Trans-
form (DFT) can be computed:
S_w(k) = \sum_{n=-N_{w2}}^{N_{w2}} s_w(n) e^{-j 2\pi k n / N_{dft}}    (6)
Figure 6: Sinusoidal Synthesis. At frame l the windowing function generates
2N samples. The first N samples complete the current frame. The second N
samples are stored for summing with the next frame.
Speech is synthesised in the frequency domain by constructing a DFT frame with the harmonic amplitudes and phases placed at the bins closest to the harmonic centres:

\hat{S}_w(k) = \begin{cases} A_m e^{j\theta_m}, & k = \lfloor mr \rceil, \; m = 1..L \\ 0, & \text{otherwise} \end{cases}    (10)
The frame size, N = 80, is the same as the encoder. The shape and overlap
of the synthesis window is not important, as long as sections separated by the frame size (frame to frame shift) sum to 1:

t(n) + t(n + N) = 1, \quad n = 0, 1, \ldots, N-1
The continuous synthesised speech signal ŝ(n) for the l-th frame is obtained using:

\hat{s}(n + lN) = \begin{cases} \hat{s}(n + (l-1)N) + \hat{s}_l(N_{dft} - N + 1 + n)\,t(n), & n = 0, 1, \ldots, N-2 \\ \hat{s}_l(n - N - 1)\,t(n), & n = N-1, \ldots, 2N-1 \end{cases}    (15)
From the Ndf t samples produced by the IDFT (12), after windowing we have
2N output samples. The first N output samples n = 0, ...N − 1 complete the
current frame l and are output from the synthesiser. However we must also
compute the contribution to the next frame n = N, N + 1, ..., 2N − 1. These are
stored, and added to samples from the next synthesised frame.
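A sketch of the overlap-add bookkeeping in (15), assuming the IDFT output has already been windowed by t(n) into a buffer of 2N samples, with a persistent array carrying the previous frame's tail (illustrative; see sine.c for the actual buffer handling):

```c
#define N 80                       /* frame shift, 10ms at 8 kHz             */

/* Overlap-add one synthesised frame.  frame[] holds 2N windowed samples
   from the IDFT, overlap[] carries the tail of the previous frame. */
void overlap_add(float out[N], float overlap[N], const float frame[2 * N]) {
    for (int n = 0; n < N; n++) {
        out[n]     = overlap[n] + frame[n];    /* complete current frame     */
        overlap[n] = frame[N + n];             /* save tail for next frame   */
    }
}
```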
A notch filter is applied to the squared speech signal to remove the large DC component introduced by squaring:

H_{notch}(z) = \frac{1 - z^{-1}}{1 - 0.95 z^{-1}}    (16)
Before transforming the squared signal to the frequency domain, the signal
is low pass filtered and decimated by a factor of 5. This operation is performed
to limit the bandwidth of the squared signal to the approximate range of the
fundamental frequency.

Figure 7: The Non-Linear Pitch (NLP) algorithm

All energy in the squared signal above 400 Hz is superfluous and would lower the resolution of the frequency domain peak picking
stage. The low pass filter used for decimation is an FIR type with 48 taps and
a cut off frequency of 600 Hz. The decimated signal is then windowed and the
Ndf t = 512 point DFT power spectrum Fw (k) is computed by zero padding the
decimated signal, where k is the DFT bin.
The DFT power spectrum of the squared signal Fw (k) generally contains several local maxima. In most cases the global maximum will correspond to F0, however occasionally the global maximum |Fw (kmax )| corresponds to a spurious peak or a multiple of F0. Thus it is not appropriate to simply choose the global maximum as the fundamental estimate for this frame. Instead, we look at submultiples of the global maximum frequency kmax /2, kmax /3, ... kmin for local maxima. If a local maximum exists and is above an experimentally derived threshold we choose the submultiple as the F0 estimate. The threshold is biased down for F0 candidates near the previous frame's F0 estimate, a form of backwards pitch tracking.
The accuracy of the pitch estimate is then refined by maximising the function:

E(\omega_0) = \sum_{m=1}^{L} |S_w(\lfloor rm \rceil)|^2    (17)
where r = ω0 Ndf t /2π maps the harmonic number m to a DFT bin. This
function will be maximised when mω0 aligns with the peak of each harmonic,
corresponding with an accurate pitch estimate. It is evaluated in a small range
about the coarse F0 estimate.
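A sketch of the refinement search, evaluating (17) over a small grid of candidates around the coarse estimate. The ±3% search range and step size below are illustrative choices, not the values used in nlp.c.

```c
#include <math.h>

#define NDFT 512

/* Refine a coarse Wo (radians/sample) by maximising the sum of squared
   harmonic magnitudes |Sw(round(r*m))|^2 over a small search range. */
float refine_Wo(const float SwMag[NDFT], float Wo_coarse, int L) {
    float best_Wo = Wo_coarse, best_E = -1.0f;

    for (float Wo = 0.97f * Wo_coarse; Wo <= 1.03f * Wo_coarse;
         Wo += 0.001f * Wo_coarse) {
        float r = Wo * NDFT / (2.0f * (float)M_PI);
        float E = 0.0f;
        for (int m = 1; m <= L; m++) {
            int k = (int)floorf(r * m + 0.5f);   /* nearest bin to harmonic */
            E += SwMag[k] * SwMag[k];
        }
        if (E > best_E) { best_E = E; best_Wo = Wo; }
    }
    return best_Wo;
}
```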
There is nothing particularly unique about this pitch estimator or its per-
formance. There are occasional artefacts in the synthesised speech that can be
traced to “gross” and “fine” pitch estimator errors. In the real world no pitch es-
timator is perfect, partially because the model assumptions around pitch break
down (e.g. in transition regions or unvoiced speech). The NLP algorithm could
benefit from additional review, tuning and better pitch tracking. However it ap-
pears sufficient for the use case of a communications quality speech codec, and
is a minor source of artefacts in the synthesised speech. Other pitch estimators
could also be used, provided they have practical, real world implementations
that offer comparable performance and CPU/memory requirements.
where r = ω0 Ndf t /2π is a constant that maps the m-th harmonic to a DFT bin,
and ⌊x⌉ is the rounding operator. As w(n) is real and even, W (k) is real and
even so we can write:
B_m = \frac{\sum_{k=a_m}^{b_m} S_w(k)\,W(k + \lfloor mr \rceil)}{\sum_{k=a_m}^{b_m} |W(k + \lfloor mr \rceil)|^2}    (19)
The error between the input and synthesised speech in this band is then:

E_m = \sum_{k=a_m}^{b_m - 1} |S_w(k) - \hat{S}_w(k)|^2 = \sum_{k=a_m}^{b_m - 1} |S_w(k) - B_m W(k + \lfloor mr \rceil)|^2    (21)
The signal to noise ratio (SNR) of the harmonic model over the bands up to 1000 Hz is used as a voicing measure:

SNR = 10\log_{10}\frac{\sum_{m=1}^{m_{1000}} A_m^2}{\sum_{m=1}^{m_{1000}} E_m}    (22)

where m1000 = ⌊L/4⌉ is the band closest to 1000 Hz, and {Am } are computed from (7). If the energy in the bands up to 1000 Hz is a good match to a harmonic series of sinusoids then Ŝw (k) ≈ Sw (k) and Em will be small compared to the energy in the band, resulting in a high SNR. Voicing is declared using the following rule:

v = \begin{cases} 1, & SNR > 6\ \text{dB} \\ 0, & \text{otherwise} \end{cases}    (23)
The voicing decision is post processed by several experimentally derived rules
to prevent common voicing errors, see the C source code in sine.c for details.
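A sketch of the voicing rule, assuming the harmonic energies A²m and the band errors Em of (21) have already been computed for the bands up to 1000 Hz (simplified; sine.c adds several post-processing rules on top of this):

```c
#include <math.h>

/* Voicing decision: compare harmonic energy to MBE model error over the
   bands up to 1000 Hz and threshold the SNR at 6 dB. */
int voicing_decision(const float A[], const float E[], int L) {
    int   m1000 = L / 4;           /* band closest to 1000 Hz (Fs = 8 kHz)   */
    float sig = 1e-6f, err = 1e-6f;

    for (int m = 1; m <= m1000; m++) {
        sig += A[m] * A[m];
        err += E[m];
    }
    float snr_dB = 10.0f * log10f(sig / err);
    return snr_dB > 6.0f;          /* 1 = voiced, 0 = unvoiced               */
}
```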
The phases are synthesised using a source-filter model of speech production, Ŝ(z) = E(z)H(z), where E(z) is an excitation signal with a relatively flat spectrum, and H(z) is a synthesis filter that shapes the magnitude spectrum. The phase of each harmonic is the sum of the excitation and synthesis filter phase:

\arg\left[\hat{S}(e^{j\omega_0 m})\right] = \arg\left[E(e^{j\omega_0 m})\,H(e^{j\omega_0 m})\right]
For voiced speech E(z) is an impulse train (in both the time and frequency domain). We can construct a time domain excitation pulse train using a sum of sinusoids:

e(n) = \sum_{m=1}^{L} \cos\!\left(m\omega_0(n - n_0)\right)    (26)
where n0 is a time shift that represents the pulse position relative to the centre
of the synthesis frame n = 0. By finding the DTCF transform of e(n) we can
determine the phase of each excitation harmonic:
ϕm = −mω0 n0 (27)
The excitation pulses occur at a rate of ω0 (one for each pitch period). The
phase of the first harmonic advances by N ω1 radians over a synthesis frame of N samples. For example if ω1 = π/20 (200 Hz), then over a 10ms (N = 80 sample)
frame, the phase of the first harmonic would advance (π/20)80 = 4π radians or
two complete cycles. We therefore derive n0 from the excitation phase of the
fundamental, which we treat as a timing reference. Each frame we advance the
phase of the fundamental:
\phi_1^l = \phi_1^{l-1} + N\omega_0    (28)
Given ϕ1 we can compute n0 and the excitation phase of the other harmonics:

n_0 = -\phi_1/\omega_0
\phi_m = -m\omega_0 n_0 = m\phi_1, \quad m = 2, \ldots, L    (29)
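A sketch of the voiced excitation phase recursion (28) and (29), keeping the running phase of the fundamental as state between frames (illustrative; see phase.c for the actual implementation):

```c
#include <math.h>

#define N 80                       /* 10ms synthesis frame at 8 kHz          */

/* Advance the fundamental excitation phase by N*Wo each frame, then
   derive the excitation phase of every harmonic as m*phi1. */
void excitation_phases(float *phi1, float Wo, int L, float phi_ex[]) {
    *phi1 += N * Wo;                                     /* equation (28)    */
    *phi1 -= 2.0f * (float)M_PI * floorf(*phi1 / (2.0f * (float)M_PI));

    for (int m = 1; m <= L; m++)
        phi_ex[m] = m * (*phi1);                         /* equation (29)    */
}
```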
For unvoiced speech E(z) is a white noise signal. At each frame, we sample a
random number generator on the interval −π...π to obtain the excitation phase
of each harmonic. We set F0 = 50 Hz to use a large number of harmonics
L = 4000/50 = 80 for synthesis to best approximate a noise signal.
The second phase component is provided by sampling the phase of H(z) at
the harmonic centres. The phase spectra of H(z) is derived from the magnitude
response using minimum phase techniques. The method for deriving the phase
spectra of H(z) differs between Codec 2 modes and is described below in Sections
3.7 and 3.8. This component of the phase tends to disperse the pitch pulse
energy in time, especially around spectral peaks (formants).
The zero phase model tends to make speech with background noise sound
“clicky”. With high levels of background noise the low level inter-formant parts
of the spectrum will contain noise rather than speech harmonics, so modelling
them as voiced (i.e. a continuous, non-random phase track) is inaccurate. Some
codecs (like MBE) have a mixed voicing model that breaks the spectrum into
voiced and unvoiced regions. However 5-12 bits/frame are required
to transmit the frequency selective voicing information. Mixed excitation also
requires accurate voicing estimation (parameter estimators always break occa-
sionally under exceptional conditions).
In our case we use a post processing approach which requires no additional
bits to be transmitted. The decoder measures the average level of the back-
ground noise during unvoiced frames. If a harmonic is less than this level it is
made unvoiced by randomising its phase. See the C source code for implemen-
tation details.
Compared to speech synthesised using original phases {θm } the following
observations have been made:
1. Through headphones speech synthesised with this model drops in quality.
Through a small loudspeaker it is very close to original phases.
2. If there are voicing errors, the speech can sound clicky or staticy. If voiced
speech is mistakenly declared unvoiced, this model tends to synthesise
annoying impulses or clicks, as for voiced speech H(z) is relatively flat
(broad, high frequency formants), so there is very little dispersion of the
excitation impulses through H(z).
3. When combined with amplitude modelling or quantisation, such that H(z)
is derived from {Âm } there is an additional drop in quality.
4. This synthesis model (e.g. a pulse train exciting a LPC filter) is effectively
the same as a simple LPC-10 vocoder, and yet (especially when arg[H(z)]
is derived from unquantised {Am }) sounds much better. Conventional
wisdom (AMBE, MELP) says mixed voicing is required for high quality
speech.
6. The recent crop of neural vocoders produce high quality speech using
a similar parameter set, and notably without transmitting phase infor-
mation. Although many of these vocoders operate in the time domain,
this approach can be interpreted as implementing a function {θ̂m } =
F (ω0 , {Am}, v). This validates the general approach used here, and as
future work Codec 2 may benefit from being augmented by machine learn-
ing.
The higher bit rate modes (1200 to 3200 bit/s) represent the spectral magnitudes {Am } using a p-th order Linear Predictive Coding (LPC) model, with an all-pole synthesis filter:

H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{G}{A(z)}    (30)
Figure 8: LPC spectrum |H(ejω )| (green line) and LSP frequencies {ωi } (green
crosses) for the speech frame in Figure 1. The original speech spectrum (blue)
and Am estimates (red) are provided as references.
The LSP representation [4] decomposes the order p polynomial A(z) into symmetric and antisymmetric polynomials P (z) and Q(z):

P(z) = A(z) + z^{-(p+1)}A(z^{-1})
Q(z) = A(z) - z^{-(p+1)}A(z^{-1})    (31)

where ω2i−1 and ω2i are the LSP frequencies, found by evaluating the polynomials on the unit circle. The LSP frequencies are interlaced with each other, where 0 < ω1 < ω2 < ... < ωp < π. The separation of adjacent LSP frequen-
cies is related to the bandwidth of spectral peaks in H(z) = G/A(z). A small
separation indicates a narrow bandwidth, as shown in Figure 8. A(z) may be
reconstructed from P (z) and Q(z) using:

A(z) = \frac{P(z) + Q(z)}{2}    (32)
Thus to transmit the LPC coefficients using LSPs, we first transform the LPC
model A(z) to P (z) and Q(z) polynomial form. We then solve P (z) and Q(z)
for z = ejω to obtain p LSP frequencies {ωi }. The LSP frequencies are then
quantised and transmitted over the channel. At the receiver the quantised LSPs
are then used to reconstruct an approximation of A(z). More details on LSP
analysis can be found in [10] and many other sources.
Figure 9 presents the LPC/LSP mode encoder. Overlapping input speech
frames are processed every 10ms (N = 80 samples). LPC analysis determines
a set of p = 10 LPC coefficients {ak } that describe the spectral envelope of the
current frame and the LPC energy E = G2 . The LPC coefficients are trans-
formed to p = 10 LSP frequencies {ωi }. The source code for these algorithms is
in lpc.c and lsp.c. The LSP frequencies are then quantised to a fixed number of
bits/frame. Other parameters include the pitch ω0 , LPC energy E, and voicing
v. The quantisation and bit packing source code for each Codec 2 mode can be
found in codec2.c. Note the spectral magnitudes {Am } are not transmitted but
are still computed for use in voicing estimation (22).
Figure 9: LPC/LSP mode encoder
Some disadvantages [7] are that the LPC spectrum |H(ejω )| doesn't follow the spectral magnitudes Am exactly; in other words it requires a non-flat excitation
spectrum to accurately model the amplitude spectrum. The slope of the LPC
spectrum near 0 and π must be 0, which means it does not track perceptually
important low frequency information well. For high pitched speakers, LPC
tends to place poles around single harmonics, rather than tracking the spectral
envelope described by {Am}. All of these problems can be observed in Figure
8. Thus exciting the LPC model by a simple, spectrally flat E(z) will result in
some errors in the reconstructed magnitude speech spectrum.
In CELP codecs these problems can be accommodated by the (high bit rate)
excitation used to construct a non-flat E(z), and some low rate codecs such as
MELP supply supplementary low frequency information to “correct” the LPC
model.
Before bit packing, the Codec 2 parameters are decimated in time. An
update rate of 20ms is used for the highest rate modes, which drops to 40ms
for Codec 2 1300, with a corresponding drop in speech quality. The number
of bits used to quantise the LPC model via LSPs is also reduced in the lower
bit rate modes. This has the effect of making the speech less intelligible, and
can introduce annoying buzzy or clicky artefacts into the synthesised speech.
Lower fidelity spectral magnitude quantisation also results in more noticeable
artefacts from phase synthesis. Nevertheless at 1300 bits/s the speech quality
is quite usable for HF digital voice, and at 3200 bits/s comparable to closed
source codecs at the same bit rate.
Figure 10: LPC/LSP mode decoder
Figure 10 shows the LPC/LSP mode decoder. Frames of bits received at the
frame rate are unpacked and resampled to the 10ms internal frame rate using
linear interpolation. The spectral magnitude information is resampled by linear
interpolation of the LSP frequencies, and converted back to a quantised LPC
model Ĥ(z). The harmonic magnitudes are recovered by averaging the energy
of the LPC spectrum over the region of each harmonic:

\hat{A}_m = \sqrt{\sum_{k=a_m}^{b_m - 1} |\hat{H}(k)|^2}    (33)
where Ĥ(k) is the Ndf t point DFT of the received LPC model for this frame.
For phase synthesis, the arg[H(z)] component is determined by sampling Ĥ(k) in the centre of each harmonic:

\arg\left[H(e^{j\omega_0 m})\right] = \arg\left[\hat{H}(\lfloor mr \rceil)\right]    (34)
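A sketch of (33) and (34), assuming the received LPC model has already been evaluated as an Ndft point complex spectrum H[k], and that the band edges am, bm sit half way between neighbouring harmonics (an assumed choice for this example; quantise.c and phase.c implement this with FFTs of the LPC coefficients):

```c
#include <complex.h>
#include <math.h>

#define NDFT 512

/* Recover harmonic magnitudes (33) and phases (34) from the Ndft point
   spectrum H[] of the received LPC model. */
void sample_lpc_spectrum(const complex float H[NDFT], float Wo, int L,
                         float Am[], float argHm[]) {
    float r = Wo * NDFT / (2.0f * (float)M_PI);

    for (int m = 1; m <= L; m++) {
        int am = (int)floorf((m - 0.5f) * r + 0.5f);     /* assumed band    */
        int bm = (int)floorf((m + 0.5f) * r + 0.5f);     /* edge definition */
        float e = 0.0f;
        for (int k = am; k < bm; k++)
            e += crealf(H[k]) * crealf(H[k]) + cimagf(H[k]) * cimagf(H[k]);
        Am[m]    = sqrtf(e);                             /* equation (33)   */
        argHm[m] = cargf(H[(int)floorf(m * r + 0.5f)]);  /* equation (34)   */
    }
}
```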
Figure 11: LPC post filter. LPC spectrum before |H(ejω )| (green line) and after
(red) post filtering. The distance between the spectral peaks and troughs has
been increased. The step change at 1000 Hz is a +3dB low frequency boost (see
source code).
Prior to sampling the amplitude and phase, a frequency domain post filter
is applied to the LPC power spectrum. The algorithm is based on the MBE
frequency domain post filter [6, Section 8.6, p 267], which is in turn based on
the frequency domain post filter from McAulay and Quatieri [5, Section 4.3, p
148]. The authors report a significant improvement in speech quality from the
post filter, which has also been our experience when applied to Codec 2. The
post filter is given by:
P_f(e^{j\omega}) = g\left[R_w(e^{j\omega})\right]^{\beta}    (35)
R_w(e^{j\omega}) = A(e^{j\omega/\gamma})\,/\,A(e^{j\omega})
where g is chosen to normalise the gain of the post filter, and β = 0.2, γ = 0.5
are experimentally derived constants. The post filter raises the spectral peaks
(formants), and lowers the inter-formant energy. The γ term compensates for
spectral tilt, providing equal emphasis at low and high frequencies. The authors
suggest the post filter reduces the noise level between formants, an explanation
commonly given to post filters used for CELP codecs where significant inter-
formant noise exists from the noisy excitation source. However, in harmonic
sinusoidal codecs, there is no excitation noise between formants in E(z). Our
theory is the post filter also acts to reduce the bandwidth of spectral peaks,
modifying the energy distribution across the time domain pitch cycle which
improves speech quality, especially for low pitched speakers.
A disadvantage of the post filter is the need for experimentally derived con-
stants. It performs a non-linear operation on the speech spectrum, and if mis-
applied can worsen speech quality. As its operation is not completely under-
stood, it represents a source of future quality improvement.
Figure 12: Codec 2 700C (newamp1) encoder
Consider a vector a of L harmonic spectral magnitudes expressed in dB:
a = 20log10 A1 , 20log10 A2 , . . . 20log10 AL (36)
L = \frac{F_s}{2F_0} = \frac{\pi}{\omega_0}    (37)
F0 and L are time varying as the pitch track evolves over time. For speech
sampled at Fs = 8 kHz F0 is typically in the range of 50 to 400 Hz, giving L in
the range of 10 . . . 80.
To quantise and transmit a, it is convenient to resample a to a fixed length
K element vector b using a resampling function:
b = B1 , B2 , . . . BK = R(a) (38)
fk = warp(k, K) Hz k = 1...K
warp(1, K) = 200 Hz (39)
warp(K, K) = 3700 Hz
where warp() is a frequency warping function. Codec 2 700C uses K = 20, and warp() is defined using the Mel function [9, p 150] (Figure 13), which samples the spectrum more densely at low frequencies and less densely at high frequencies:

mel(f) = 2595\log_{10}\!\left(1 + \frac{f}{700}\right)    (41)
We wish to use mel(f ) to construct warp(k, K), such that there are K
evenly spaced points on the mel(f ) axis (Figure 14). Solving for the equation of
a straight line we can obtain mel(f ) as a function of k, and hence warp(k, K)
(Figure 15):
g = \frac{mel(3700) - mel(200)}{K - 1}
mel(f) = g(k - 1) + mel(200)    (42)
where g is the gradient of the line. Substituting (41) into the LHS and solving for f gives:

f_k = warp(k, K) = mel^{-1}\!\left(g(k - 1) + mel(200)\right)    (43)
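A sketch of warp(k, K), assuming the common 2595 log10(1 + f/700) form of the mel function; the constants in the Codec 2 700C source may differ:

```c
#include <math.h>

/* Assumed mel scale and its inverse (standard 2595*log10(1 + f/700) form). */
static float mel(float f)     { return 2595.0f * log10f(1.0f + f / 700.0f); }
static float mel_inv(float m) { return 700.0f * (powf(10.0f, m / 2595.0f) - 1.0f); }

/* warp(k, K): K points evenly spaced on the mel axis between 200 Hz and
   3700 Hz, k = 1..K, per equations (39)-(43). */
float warp(int k, int K) {
    float g = (mel(3700.0f) - mel(200.0f)) / (K - 1);
    return mel_inv(g * (k - 1) + mel(200.0f));
}
```

For K = 20 this places the first sample at 200 Hz and the last at 3700 Hz, matching (39).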
Figure 13: Mel function
The input speech may be subject to arbitrary filtering, for example, due
to the microphone frequency response, room acoustics, and anti-aliasing filter.
This filtering is fixed or slowly time-varying. The filtering biases the target
vectors away from the VQ training material, resulting in significant additional
mean square error. The filtering does not greatly affect the input speech quality,
however the VQ distortion increases and the output speech quality is reduced. This is exacerbated by operating in the log domain: the VQ will try to match very low level, perceptually insignificant energy near 0 and 4000 Hz. A
microphone equaliser algorithm has been developed to help adjust to arbitrary
microphone filtering.
For every input frame l, the equaliser (EQ) updates the dimension K equaliser
vector e:
el = el−1 + β(b − t) (45)
where t is a fixed target vector set to the mean of the VQ quantiser, and β is a
small adaption constant.
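A sketch of the equaliser update (45); the value of β below is illustrative:

```c
#define K 20                       /* rate K vector dimension                */

/* One equaliser update step: nudge e towards the difference between the
   current rate K vector b and the fixed VQ target t (equation (45)). */
void eq_update(float e[K], const float b[K], const float t[K]) {
    const float beta = 0.01f;      /* small adaption constant (illustrative) */
    for (int k = 0; k < K; k++)
        e[k] += beta * (b[k] - t[k]);
}
```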
The equalised, mean removed rate K vector d is vector quantised for transmission.
Figure 14: Linear mapping of mel(f ) to rate K sample index k, a straight line from (1, mel(200)) to (K, mel(3700))
Codec 2 700C uses a two stage VQ with 9 bits (512 entries) per stage. The mbest
multi-stage search algorithm is used to jointly search the two stages (using 5
survivors from the first stage). Note that VQ is performed in the log amplitude
(dB) domain. The mean of c is removed prior to VQ and scalar quantised and
transmitted separately as the frame energy. At the decoder, the rate L vector
â can then be recovered by resampling back to rate L:
â = S(ĉ + p) (47)
where p is a post filter vector. The post filter vector is generated from the
mean-removed rate K vector d̂ in the log frequency domain:
p = G + P_{gain}(\hat{d} + r) - r
r = [R_1, R_2, \ldots, R_K]
R_k = 20\log_{10}(f_k/300), \quad k = 1, \ldots, K    (48)
where G is an energy normalisation term, and 1.2 < Pgain < 1.5 describes the
amount of post filtering applied. G and Pgain are similar to g and β in the
LPC/LSP post filter (35). The r term is a high pass (pre-emphasis) filter with
+20 dB/decade gain after 300 Hz (fk is given in (43)). The post filtering is
applied on the pre-emphasised vector, then the pre-emphasis is removed from
the final result. Multiplying by Pgain in the log domain is similar to the β power function in (35); spectral peaks are moved up, and troughs pushed down. This filter enhances the speech quality but also introduces some artefacts.

Figure 15: warp(k, K) function for K = 20
Figure 16 is the block diagram of the decoder signal processing. Cepstral
techniques are used to synthesise a phase spectrum arg[H(ejω )] from â using a
minimum phase model.
Some notes on the Codec 2 700C newamp1 algorithms:
1. The amplitudes and Vector Quantiser (VQ) entries are in dB, which
matches the ear’s logarithmic amplitude response.
2. The mode is capable of communications quality speech and is in common
use with FreeDV, but is close to the lower limits of intelligibility, and
doesn’t do well in some languages (problems have been reported with
German and Japanese).
3. The VQ was trained on just 120 seconds of data - way too short.
4. The parameter set (pitch, voicing, log spectral magnitudes) is very similar
to that used for the latest neural vocoders.
5. The Rate K algorithms were recently revisited, and several improvements
were proposed and prototyped [2].
Figure 16: Codec 2 700C (newamp1) Decoder
The 3200 mode quantises the LSP differences ωi+1 − ωi , which provides low
distortion at the expense of robustness to bit errors, as an error in a low order
LSP difference will propagate through the frame. The 2400 and 1200 bit/s
modes use a joint delta ω0 and energy VQ, which is efficient but also suffers
from error propagation so is not suitable for high BER use cases.
There is an unfortunate overlap in the naming conventions of Codec 2 and
FreeDV. The Codec 2 700C mode is used in the FreeDV 700C, 700D, and 700E
modes.
The cmake system builds the libcodec2 library, which is called by user applications
via the Codec 2 API in codec2.h. See the repository README for information
on building, demo applications, and an introduction to other features of the
codec2 repository.
File Description
c2dec Sample decoder application
c2enc Sample encoder application
c2sim Simulation and development application
codebook Directory containing quantiser tables
codec2.c Quantised encoder and decoder functions that implement each mode
codec2_fft.c Wrapper for FFT (usually kiss FFT)
defines.h Constants
lpc.c LPC functions
mbest.c Multistage VQ search
newamp1.c Codec 2 700C newamp1 mode
nlp.c Non-linear Pitch (NLP)
sine.c Sinusoidal analysis, synthesis, voicing estimation
phase.c Phase synthesis
quantise.c Quantisation, in particular for LPC/LSP modes
6 Glossary
Acronym Description
DFT Discrete Fourier Transform
DTCF Discrete Time Continuous Frequency Fourier Transform
EQ (microphone) Equaliser
IDFT Inverse Discrete Fourier Transform
LPC Linear Predictive Coding
LSP Line Spectrum Pair
MBE Multi-Band Excitation
MSE Mean Square Error
NLP Non Linear Pitch (algorithm)
VQ Vector Quantiser
Symbol Description Units
A(z) LPC (analysis) filter
am Lower DFT index of current band
bm Upper DFT index of current band
{Am } Set of harmonic magnitudes m = 1, ...L dB
a {Am } in vector form
Bm Complex spectral amplitudes used for voicing estimation
E Frame energy
E(z) Excitation in source-filter model
F0 Fundamental frequency (pitch) Hz
Fs Sample rate (usually 8 kHz) Hz
Fw (k) DFT of squared speech signal in NLP pitch estimator
G LPC gain
H(z) Synthesis filter in source-filter model
Ĥ(z) Synthesis filter approximation after quantisation
l Frame index
L Number of harmonics
N Processing frame size in samples
n0 Excitation pulse position
P Pitch period ms or samples
P (z), Q(z) LSP polynomials
Pf (ejω ) LPC post filter
{θm } Set of harmonic phases m = 1, ...L radians
r Maps a harmonic number m to a DFT index
s(n) Input time domain speech
ŝ(n) Output (synthesised) time domain speech
sw (n) Time domain windowed input speech
Sw (k) Frequency domain windowed input speech
Ŝw (k) Frequency domain output (synthesised) speech
t(n) Triangular synthesis window
ϕm Phase of excitation harmonic
ω0 Fundamental frequency (pitch) radians/sample
{ωi } Set of LSP frequencies
w(n) Window function
W (k) DFT of window function
v Voicing decision for the current frame
1. The c2sim utility is presently undocumented. We could add some worked
examples aimed at the experimenter - e.g. using c2sim to extract and
plot model parameters. Demonstrate how to listen to various stages of
quantisation.
2. Several GNU Octave scripts exist that were used to develop Codec 2. We
could add information describing how to use the Octave tools to single
step through the codec operation.
References
[1] Enhancing HF Digital Voice with FreeDV, 2023.
https://www.ardc.net/apply/grants/2023-grants/
enhancing-hf-digital-voice-with-freedv/.
[2] FreeDV-015 Codec 2 Rate K Resampler, 2023. https://github.com/
drowe67/misc/blob/master/ratek_resampler/ratek_resampler.pdf.
[3] Daniel W Griffin and Jae S Lim. Multiband excitation vocoder. IEEE
Transactions on Acoustics, Speech, and Signal Processing, 36(8):1223–1235,
1988.
[4] Fumitada Itakura. Line spectrum representation of linear predictor coeffi-
cients of speech signals. The Journal of the Acoustical Society of America,
57(S1):S35–S35, 1975.
[5] W Bastiaan Kleijn and Kuldip K Paliwal. Speech coding and synthesis.
Elsevier Science Inc., 1995.
[6] Ahmet M Kondoz. Digital speech: coding for low bit rate communication
systems. John Wiley & Sons, 1994.
[7] John Makhoul. Linear prediction: A tutorial review. Proceedings of the
IEEE, 63(4):561–580, 1975.
[8] Robert McAulay and Thomas Quatieri. Speech analysis/synthesis based
on a sinusoidal representation. IEEE Transactions on Acoustics, Speech,
and Signal Processing, 34(4):744–754, 1986.