Module 2 SSP


SPEECH SIGNAL PROCESSING

ECE3028

BY
DR. K. GOWRI
ASSISTANT PROFESSOR,
DEPARTMENT OF ECE,
PRESIDENCY UNIVERSITY,
BANGALORE
MODULE 2: Discrete-Time Speech Signals
Introduction, time-dependent processing of speech,
short-time energy and average magnitude, short-time
average zero-crossing rate, speech vs. silence
discrimination using energy and zero-crossings, pitch
period estimation using the parallel processing
approach.
INTRODUCTION
• Objective: to show how digital signal processing methods can be applied to
speech signals to estimate the properties and parameters of speech models.
• The first step usually is to obtain a convenient and useful parametric
representation of the information carried by the speech signal.
• This can be achieved by assuming that the speech signal, s[n], is the output of
a parametric synthesis model such as the one shown in Figure 1.
Fig. 1 Model for speech production and synthesis.
• The information carried by the speech signal includes,
• The (time-varying) pitch period (in samples), Np, (or pitch frequency, Fp =
Fs/Np, where Fs is the speech sampling frequency), for regions of voiced
speech, including possibly the locations of the pitch excitation impulses that
define the periods between adjacent pitch pulses.
• The glottal pulse model, g[n]
• The time-varying amplitude of voiced excitation, AV
• The time-varying amplitude of unvoiced excitation, AN
• The time-varying excitation type for the speech signal; i.e., quasi-periodic pitch
pulses for voiced sounds or pseudo-random noise for unvoiced sounds
• The time-varying vocal tract model impulse response, v[n], or equivalently, a
set of vocal tract parameters that control a vocal tract model.
• The radiation model impulse response, r[n] .
Different “representations” of the speech signal as
depicted in Figure 2.

• The waveform representation or time-domain representation is simply the
sample sequence s[n].
• We shall refer to any representation derived by digital signal processing
operations as an alternate representation.
• If the alternate representation is a parameter or set of parameters of a speech
model (Figure 1), then the alternate representation is a parametric
representation.
Different “representations” of the speech signal- Contd…
• Time-domain processing of speech - performing direct operations
on the speech waveform (or a filtered version of the waveform),
• Frequency-domain processing as performing operations on a
Fourier representation of the speech signal.
• Time-domain representation methods involve direct operations on
the waveform of the speech signal.
• Examples of time-domain parametric representations of the
speech signal
short-time (log) energy,
short-time zero-crossing (or level-crossing) rate,
short-time autocorrelation.
Different “representations” of the speech signal- Contd…
• The required precision depends on the particular information in the speech
signal that is to be measured.
• For example –
1. The purpose of the digital processing may be to facilitate the
determination of whether a particular section of waveform is speech or
background signal (or possibly noise).
2. To make a three-way classification as to whether a section of the
signal is voiced speech, unvoiced speech, or silence (background
signal).
• Reconstruction:
1. Some representations discard “irrelevant” information while emphasizing
the desired features, yet retain the inherent information in the speech signal.
2. Digital transmission may require the most accurate representation
of the speech signal (e.g., under bit-rate constraints), from which a signal
equivalent to the original sampled speech can be reconstructed.
Alternative representations of the speech signal:
• Short-time energy
• Short-time zero-crossing rate
• Short-time autocorrelation function
• Short-time Fourier transform
SHORT-TIME ANALYSIS OF SPEECH
• A plot of waveform samples, along with annotated phoneme regions (at a
sampling rate of Fs = 10,000 samples/sec) representing a speech signal
produced by a male speaker is shown in Figure 3.
• This plot shows how the properties of the speech signal change slowly with
time, moving from one relatively stationary state (e.g., phoneme) to another.
• In this example, the excitation mode begins as unvoiced, then switches to
voiced, then back to unvoiced, then back to voiced, and finally back to
unvoiced.
• There is a significant variation in the peak amplitude of the signal, and there
is a steady variation of fundamental (pitch) frequency within voiced regions.
SHORT-TIME ANALYSIS OF SPEECH – Contd…

Fig. 3 Waveform of an utterance of /SH UH D - W IY - CH EY S/ (“should we chase”). The sampling rate is 10
kHz, and the samples are connected by straight lines. Each line in the plot corresponds to 1200 speech
samples or 0.12 seconds of signal.
SHORT-TIME ANALYSIS OF SPEECH – Contd…
• Figure 3 suggests that simple time-domain processing techniques should be
capable of providing useful estimates of signal features such as signal energy,
excitation mode and state (voiced or unvoiced), pitch, and possibly even
estimates of vocal tract parameters such as formant frequencies and bandwidths.
• The underlying assumption in almost all speech processing systems is that the
properties of the speech signal change relatively slowly with time (compared
to the detailed sample-to-sample variations of the waveform) with rates of
change on the order of 10–30 times/sec.
• This slow variation corresponds to speech production rates on the order of 5–
15 sounds/sec.
• This assumption leads to a variety of “short-time” processing methods - short
segments of the speech signal are isolated and processed.
SHORT-TIME ANALYSIS OF SPEECH – Contd…
• Segmentation can be repeated until we reach the end of a sentence
or phrase to be processed, or in real-time applications, it could
continue indefinitely. The resulting speech segments are generally
referred to as analysis frames.
• The result of the processing on each frame can be either a single
number, or a set of numbers.
• A key issue: the choice of segment duration or frame length.
1. The shorter the segment, the less likely that the speech properties will vary
significantly over the segment duration; however, very short segments yield
highly variable (uncertain) parameter estimates.
2. Speech parameters estimated from medium-length segments are a
compromise, but can still provide variable estimates of basic speech parameters.
3. The use of long speech segments also leads to inaccuracies because of
the large amount of sound change over such a long duration.
SHORT-TIME ANALYSIS OF SPEECH – Contd…
• Long segments make it difficult to pinpoint sharp changes (such as
voiced/unvoiced transitions).
• There is always a degree of “uncertainty” in short-time measurements on
speech signals, and we cannot eliminate completely the variability of speech
parameter estimates based on finite-duration speech segments.
• Therefore, a compromise analysis frame duration of between 10 and 40
msec is most often used in speech processing systems.
General Framework for Short-Time Analysis
• General short-time representation:

  Qnˆ = Σ (from m = −∞ to ∞) T(x[m]) · w˜[nˆ − m]   ---(1)
• where w˜[nˆ − m] is the sliding analysis window,
• T( · ) is some operation on the input signal,
• and Qnˆ is the value (or vector of values) of the alternate short-time
representation of a speech signal x[n] at analysis time nˆ.
General representation of the short-time analysis
principle (Fig.4).

• The speech signal is subjected to a transformation, T( · ), which may be either
linear or non-linear, and which may depend upon some adjustable parameter or
set of parameters.
• The purpose of this transformation is to make some property of the speech
signal more prominent.
• The resulting sequence is then multiplied by a lowpass window sequence, w˜
[nˆ − m], positioned at a particular analysis time nˆ.
General representation of the short-time analysis
principle.
• The multiplication by the lowpass window represents the framing operation;
i.e., the focusing of the analysis on the time interval “at or around” the
analysis time, nˆ.
• The product is then summed over all non-zero values inside the shifted
window.
• This corresponds to averaging the transformed signal over the time interval
selected by the window.
• The combined effect of multiplying by the shifted window and summing the
products is equivalent to lowpass filtering by a filter whose impulse response
is the window sequence.
• Eq. (1) is often normalized by dividing by the effective window length,
Leff = Σm w˜[m], i.e.,

  Qnˆ = (1 / Leff) Σ (from m = −∞ to ∞) T(x[m]) · w˜[nˆ − m]   ---(2)
General representation of the short-time analysis
principle.
• This has the effect of normalizing the frequency response of the effective
lowpass filter so that W˜ (e^j0) = 1.
• Usually the window sequence (impulse response) will be of finite duration, but
this is not a strict requirement.
• w˜ [m] should be “smooth” and concentrated in time so that it has lowpass
filter characteristics.
• The values Qnˆ are therefore a sequence of local weighted averages of values
of the sequence T(x[m]).
Filtering and Sampling in Short-Time Analysis
• Eq. (1) states that Qnˆ is the discrete-time convolution of the modified speech
signal T(x[n]) with the window sequence w˜ [n].
• Thus, w˜ [n] serves as the impulse response of a linear filter.
• Although the output of the filter can be computed at the input speech
sampling rate by moving the analysis window in steps of 1 sample, typically
the window is moved in jumps of R > 1 samples.
• This corresponds to downsampling the output of the filter by the factor R; i.e.,
the shorttime representation is evaluated at times nˆ = rR, where r is an
integer.
• The choice of R depends on the window length.
• Clearly, if the window is of length L samples, then we should choose R < L so
that each speech sample is included in at least one analysis segment.
Filtering and Sampling in Short-Time Analysis
• Typically, the analysis windows overlap by more than 50% of the window
length.
• To be more specific about how the sampling interval (window shift), R, should
be chosen, it will be useful to consider two commonly used window
sequences;
• 1. The rectangular window:

  w[n] = 1, 0 ≤ n ≤ L − 1;  w[n] = 0, otherwise   ---(3)

• 2. The Hamming window:

  wH[n] = 0.54 − 0.46 cos(2πn/(L − 1)), 0 ≤ n ≤ L − 1;  wH[n] = 0, otherwise   ---(4)

• Time-domain plots of the rectangular and Hamming windows for L = 21 are
given in Figure 5.
• If we substitute Eq. (3) for w[n] in Eq. (1), we obtain,

  Qnˆ = Σ (from m = nˆ − L + 1 to nˆ) T(x[m])   ---(5)
Fig. 5 Plots of the time responses of a 21-point (a) rectangular window; (b) Hamming
window.
• The rectangular window corresponds to applying equal weight to all samples
in the interval (nˆ − L + 1) to nˆ.
• If the Hamming window is used, Eq. (5) will incorporate the window
sequence wH[nˆ − m] from Eq. (4) as weighting coefficients, with the same
limits on the summation.
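Both windows of Eqs. (3) and (4) are one-liners in NumPy. This sketch just constructs them and confirms the Hamming endpoint value 0.54 − 0.46 = 0.08 and its symmetry:

```python
import numpy as np

L = 21
n = np.arange(L)
w_rect = np.ones(L)                                      # Eq. (3)
w_hamm = 0.54 - 0.46 * np.cos(2 * np.pi * n / (L - 1))   # Eq. (4)
```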
• The frequency response of an L-point rectangular window is,

  WR(e^jω) = [sin(ωL/2) / sin(ω/2)] · e^(−jω(L−1)/2)   ---(6)

• The log magnitude (in dB) of WR(e^jω) is shown in Figure 6(a) for a 51-point
window (L = 51).

Fig. 6 Fourier transform of (a) 51-point rectangular window; (b) 51-point Hamming window.

• Note that the first zero of Eq. (6) occurs at ω = 2π/L, corresponding to the
analog frequency F = Fs/L Hz, where Fs = 1/T is the sampling frequency.
Analysis of Rectangular and Hamming windows, Fs = 10 kHz

Rectangular window:
• Nominal cutoff frequency = 2π/L
• F = Fs/L = 10,000/51 ≈ 196 Hz (for L = 51)
• Bandwidth is half that of the Hamming window
• Modest attenuation (> 14 dB) outside the passband
• Would seem to require only half the sampling rate for Qnˆ: nominal cutoff of Fs/L, or R ≤ L/2

Hamming window:
• Nominal cutoff frequency = 4π/L
• F = 2Fs/L = 2 × 10,000/51 ≈ 392 Hz (for L = 51)
• Bandwidth is twice that of the rectangular window
• Much greater attenuation (> 40 dB) outside the passband
• Sampling rate of Qnˆ should be greater than or equal to 4Fs/L, or R ≤ L/4
Analysis of Rectangular and Hamming window
• Increasing the window length, L, simply decreases the bandwidth.
• Attenuation of both these windows is essentially independent of the window
duration
• For typical finite-duration analysis windows, the short-time representation
Qnˆ has a restricted lowpass bandwidth that is inversely proportional to the
window length L.
• Poor frequency selectivity (as with the rectangular window) generally results
in significant aliasing when Qnˆ is sampled at this reduced rate.
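The bandwidth and attenuation figures above can be verified numerically with an FFT. The 4096-point zero-padded transform and the normalization to W(e^j0) = 1 are implementation choices for this sketch:

```python
import numpy as np

L, NFFT = 51, 4096
n = np.arange(L)
w_rect = np.ones(L)
w_hamm = 0.54 - 0.46 * np.cos(2 * np.pi * n / (L - 1))

# Normalize so that W(e^j0) = 1, as in Eq. (2), then sample |W(e^jw)|:
W_rect = np.abs(np.fft.rfft(w_rect / w_rect.sum(), NFFT))
W_hamm = np.abs(np.fft.rfft(w_hamm / w_hamm.sum(), NFFT))
f = np.arange(NFFT // 2 + 1) / NFFT        # frequency in cycles/sample

rect_stop = W_rect[f > 1.0 / L].max()      # beyond nominal cutoff Fs/L
hamm_stop = W_hamm[f > 2.0 / L].max()      # beyond nominal cutoff 2Fs/L

print(20 * np.log10(rect_stop))   # only ~13-14 dB down: modest attenuation
print(20 * np.log10(hamm_stop))   # about -40 dB or better
```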
SHORT-TIME ENERGY AND SHORT-TIME MAGNITUDE
• The energy of a discrete-time signal is defined as,

  E = Σ (from m = −∞ to ∞) (x[m])^2

Fig. 8 Block diagram representation of computation of the short-time energy.

• The energy of a sequence is a single number that has little meaning or utility
as a representation for speech, because it gives no information about the
time-dependent properties of the speech signal.
SHORT-TIME ENERGY AND SHORT-TIME MAGNITUDE
• What is needed is something that is sensitive to the time-varying changes of
signal amplitude with time.
• For this, the short-time energy is much more useful.
• The short-time energy is,

  Enˆ = Σm (x[m] · w[nˆ − m])^2 = Σm (x[m])^2 · w˜[nˆ − m]   ---(5)

• where w[nˆ − m] is a window that is applied directly to the speech samples
before squaring,
• and w˜[nˆ − m] = w^2[nˆ − m] is a corresponding window that can be applied
equivalently after squaring.
• Equation (5) can be seen to be in the form of Eq. (1) with T(x[m]) = (x[m])^2
and w˜[m] = w^2[m].
SHORT-TIME ENERGY AND SHORT-TIME MAGNITUDE
• The computation of the short-time energy representation is depicted in Figure
8.
• It can be seen that the sampling rate of the signal changes from Fs at the input
and output of the squaring box to Fs/R at the output of the lowpass filter, w˜
[n]
• For an L-point rectangular window, the effective window is
w˜[m] = w^2[m] = w[m], and it therefore follows that Enˆ is,

  Enˆ = Σ (from m = nˆ − L + 1 to nˆ) (x[m])^2

SHORT-TIME ENERGY AND SHORT-TIME MAGNITUDE

Fig. 9 Illustration of the computation of short-time energy using a rectangular window.


• Note that as nˆ varies, the window literally slides along the sequence of
squared values [in general, T(x[m])] selecting the interval to be involved in the
computation.
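The sliding computation described above can be written in a few lines. This is a direct (non-recursive) sketch; the frame placement, with windows fully inside the signal, is an illustrative choice:

```python
import numpy as np

def short_time_energy(x, w, R):
    """Short-time energy, Eq. (5): E[n^] = sum_m (x[m] * w[n^ - m])**2,
    evaluated every R samples at window positions fully inside the signal."""
    L = len(w)
    return np.array([np.sum((x[n - L + 1:n + 1] * w[::-1]) ** 2)
                     for n in range(L - 1, len(x), R)])

# With a rectangular window this is just the sum of squares in each frame:
x = np.arange(6, dtype=float)
E = short_time_energy(x, np.ones(3), R=1)
```

Passing a Hamming window for `w` gives the smoother contours of Figure 11.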
SHORT-TIME ENERGY AND SHORT-TIME MAGNITUDE
• Figures 10 and 11 show the effects of varying the window length L (for the
rectangular and Hamming windows, respectively) on the short-time energy
representation for an utterance of the text “What she said”, spoken by a male
speaker.
SHORT-TIME ENERGY AND SHORT-TIME MAGNITUDE
• As L increases, the energy contour becomes smoother for both windows.
• This is because the bandwidth of the effective lowpass filter is inversely
proportional to L.
• These figures illustrate that for a given window length, the Hamming window
produces a somewhat smoother curve than does the rectangular window.
• This is because, even though the “nominal” cutoff is lower for the rectangular
window, the high-frequency attenuation is much greater for the Hamming
window filter.
• The major significance of Enˆ is that it provides a basis for distinguishing
voiced speech segments from unvoiced speech segments.
• As can be seen in Figures 10 and 11, the values of Enˆ for the unvoiced
segments are significantly smaller than for voiced segments.
SHORT-TIME ENERGY AND SHORT-TIME MAGNITUDE
• This difference in amplitude is enhanced by the squaring operation.
• The energy function can also be used to locate approximately the time at
which voiced speech becomes unvoiced, and vice versa, and, for very high
quality speech (high signal-to-noise ratio), the energy can be used to
distinguish speech from silence (or background signal).
Automatic Gain Control Based on Short-Time Energy
• An application of the short-time energy representation is a simple automatic
gain control (AGC) mechanism for speech waveform coding.
• The purpose of an AGC is to keep the signal amplitude as large as possible
without saturating or overflowing the allowable dynamic range of a digital
representation of the speech samples.
• For sample-by-sample gain control, the short-time energy is computed at every
sample of the input so that the AGC can be applied to individual samples.
• While finite-duration windows can be used for this purpose, it can be more
efficient to use an infinite-duration window (impulse response) so that the
computation can be recursive.
Automatic Gain Control Based on Short-Time Energy
• As a simple example, consider the exponential window sequence,

  w˜[n] = (1 − α) α^(n−1), n ≥ 1;  w˜[n] = 0, otherwise   ---(6)

• so that Eq. (5) becomes,

  Enˆ = Σ (from m = −∞ to nˆ − 1) (x[m])^2 (1 − α) α^(nˆ − 1 − m)   ---(7)

• The z-transform of the window in Eq. (6) is,

  W˜(z) = (1 − α) z^(−1) / (1 − α z^(−1))   ---(8)
Automatic Gain Control Based on Short-Time Energy
• The discrete-time Fourier transform (DTFT) (frequency response of the
analysis filter) is,

  W˜(e^jω) = (1 − α) e^(−jω) / (1 − α e^(−jω))

• Figure 12(a) shows 51 samples of the exponential window (for a value of
α = 0.9), and the corresponding DTFT (log magnitude response) is shown in
Figure 12(b).

Fig. 12 Exponential window for short-time energy computation using a value of α = 0.9: (a) w˜[n] (impulse
response of analysis filter) and (b) 20 log10 |W˜(e^jω)| (log magnitude frequency response of the analysis filter).
Automatic Gain Control Based on Short-Time Energy
• These are the impulse response and frequency response, respectively, of the
recursive short-time energy analysis filter.
• Note that by including the scale factor (1 − α) in the numerator, we ensure
that W˜ (ej0) = 1; i.e., the low frequency gain is around unity (0 dB)
irrespective of the value of α.
• By increasing or decreasing the parameter α, we can make the effective
window longer or shorter respectively.
• The effect on the corresponding frequency response is, of course, opposite.
Increasing α makes the filter more lowpass, and vice versa.
• To anticipate the need for a more convenient notation, we will denote the
short-time energy as En = σ2[n], where we have used the index [n] instead of
the subscript nˆ.
Automatic Gain Control Based on Short-Time Energy
• we use the notation σ^2 to denote the fact that the short-time energy is an
estimate of the variance of x[m].
• Now since σ^2[n] is the output of the filter with impulse response w˜[n] in
Eq. (6), it follows that it satisfies the recursive difference equation,

  σ^2[n] = α σ^2[n − 1] + (1 − α) (x[n − 1])^2

Fig. 13 Block diagram of recursive computation of the short-time energy for an exponential window.
Automatic Gain Control Based on Short-Time Energy
• Now we can define an AGC of the form,

  G[n] = G0 / σ[n]   ---(9)

• where G0 is a constant gain level to which we attempt to equalize the level of
all frames.
• The capability of the AGC control of Eq. (9) to equalize the variance (or more
precisely the standard deviation) of a speech waveform is illustrated in Figure
14.
• Larger values of α would introduce more smoothing so that the AGC would act
over a longer (e.g., syllabic) time scale.
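The recursion for σ^2[n] and the gain of Eq. (9) combine into a few lines. The 1e-12 floor that avoids division by zero during initial silence is an added safeguard for this sketch, not part of the source:

```python
import numpy as np

def agc(x, alpha=0.9, G0=1.0):
    """Recursive short-time energy sigma^2[n] = alpha*sigma^2[n-1]
    + (1-alpha)*x[n-1]**2, and AGC gain G[n] = G0/sigma[n] (Eq. (9))."""
    sigma2 = np.zeros(len(x))
    for n in range(1, len(x)):
        sigma2[n] = alpha * sigma2[n - 1] + (1 - alpha) * x[n - 1] ** 2
    G = G0 / np.sqrt(np.maximum(sigma2, 1e-12))   # floor avoids divide-by-zero
    return x * G, sigma2

# For a constant-level input, sigma^2[n] settles to the mean-square value
# and the equalized output settles to G0:
y, s2 = agc(np.ones(200))
```

A larger α lengthens the effective window, so the gain tracks the signal level more slowly (syllabic-rate AGC), exactly as the slide notes.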
Automatic Gain Control Based on Short-Time Energy

FIGURE 14 AGC of speech waveform: (a) waveform and computed standard deviation, σ[n];
(b) equalized waveform, x[n] · G[n], for a value of α = 0.9.
Short-Time Magnitude
• One difficulty with the short-time energy function, as defined by Eq. (5), is
that it is very sensitive to large signal levels, since they enter the computation
as a square.
• The short-time magnitude is therefore defined as,

  Mnˆ = Σm |x[m]| · w˜[nˆ − m]   ---(10)

• where a weighted sum of absolute values of the signal is computed instead of
the sum of squares.
• Note that a simplification in arithmetic is achieved by eliminating the squaring
operation in the short-time energy computation.
• Figure 15 shows that Eq. (10) can be implemented as a linear filtering
operation on |x[n]|.
Block diagram representation of computation of the
short-time magnitude function

FIGURE 15 Block diagram representation of computation of the short-time magnitude function.
Short-time magnitude functions

FIGURE 16 Short-time magnitude functions for rectangular windows of length L = 51, 101, 201, and 401.
FIGURE 17 Short-time magnitude functions for Hamming windows of length L = 51, 101, 201, and 401.
Comparing short time energy and magnitude
Short-Time Magnitude
• For the short-time magnitude computation of Eq. (10), the dynamic range
(ratio of maximum to minimum) is approximately the square-root of the
dynamic range for the standard energy computation.
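The square-root relationship between the two dynamic ranges is easy to confirm on two constant-amplitude frames 40 dB apart (the frame length and amplitudes are arbitrary illustrative values):

```python
import numpy as np

x_small, x_large = 0.01, 1.0            # two frame amplitudes, 40 dB apart
L = 100
frame_small = x_small * np.ones(L)
frame_large = x_large * np.ones(L)

# Energy ratio vs. magnitude ratio between the loud and quiet frames:
E_ratio = np.sum(frame_large**2) / np.sum(frame_small**2)
M_ratio = np.sum(np.abs(frame_large)) / np.sum(np.abs(frame_small))

print(E_ratio)   # 10000 (80 dB)
print(M_ratio)   # 100 (40 dB), the square root of the energy ratio
```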
• To conclude our comments on the properties of short-time energy and short-
time magnitude, it is instructive to point out that the window need not be
restricted to rectangular or Hamming form, or indeed to any function
commonly used as a window in spectrum analysis or digital filter design.
• The filter can be either a finite duration impulse response (FIR) or an infinite
duration impulse response (IIR) filter as in the case of the exponential
windows.
• There is an advantage in having the impulse response (window) be always
positive since this guarantees that the short-time energy or short-time
magnitude will always be positive.
Short-Time Magnitude
• FIR filters (such as the rectangular or Hamming impulse responses) have the
advantage that the output can easily be computed at a lower sampling rate.
• If we use the exponential window of Eq. (6), the short-time magnitude
would be,

  M[n] = α M[n − 1] + (1 − α) |x[n − 1]|

• where we have again used the appropriate filter normalization factor (1 − α)
and included the delay of one sample.
SHORT-TIME ZERO-CROSSING RATE
• In the context of discrete-time signals, a zero-crossing is said to occur if
successive waveform samples have different algebraic signs.
FIGURE 18 Plots of waveform showing locations of zero-crossings of the signal.

• The rate (number of crossings per some unit of time) at which zero-crossings
occur is a simple (and often highly reliable) measure of the frequency content
of a signal.
• This is particularly true of narrowband signals.
SHORT-TIME ZERO-CROSSING RATE – CONTD…
• For example, a sinusoidal signal of frequency F0, sampled at a rate Fs, has
Fs/F0 samples per cycle of the sine wave.
• Each cycle has two zero-crossings, so that the average rate of zero-crossings
per sample is,

  Z(1) = 2 F0 / Fs  crossings/sample   ---(11)

• and the number of crossings in an interval of M samples is,

  Z(M) = M · 2 F0 / Fs  crossings per M samples   ---(12)

• where we use the notation Z(M) to denote the number of crossings per M
samples of the waveform.
SHORT-TIME ZERO-CROSSING RATE – CONTD…
• The short-time zero-crossing rate gives a reasonable way to estimate the
frequency of a sine wave.
• An alternative form of Eq. (12) is,

  Fe = Fs · Z(1) / 2   ---(13)

• where Fe is the equivalent sinusoidal frequency corresponding to a given
zero-crossing rate, Z(1), per sample.
• If the signal is a single sinusoid of frequency F0, then Fe = F0.
• Thus, Eq. (13) can be used to estimate the frequency of a sinusoid, and if the
signal is not a sinusoid, Eq. (13) can be thought of as an equivalent sinusoidal
frequency for the signal.
SHORT-TIME ZERO-CROSSING RATE – CONTD…
• Consider the following examples of zero-crossing rates of sinusoidal
waveforms; (assume a sampling rate of Fs = 10,000 samples/sec)
• For a 100 Hz sinusoid (F0 = 100 Hz), with Fs/F0 = 10,000/100 = 100 samples
per cycle, we get Z(1) = 2/100 = 0.02 crossings/sample, or Z(100) = (2/100) ×
100 = 2 crossings per 10 msec interval (100 samples).
• As F0 increases, the zero-crossing rate increases proportionally.
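The 100 Hz example above can be checked numerically. The sign convention sgn(0) = +1 follows Eq. (15); small end effects keep the estimate from being exactly 0.02:

```python
import numpy as np

Fs, F0, M = 10_000, 100, 100                 # sampling rate, sine freq, block
n = np.arange(Fs)                            # one second of signal
x = np.sin(2 * np.pi * F0 * n / Fs)

s = np.where(x >= 0, 1, -1)                  # sgn per Eq. (15)
Z1 = np.mean(np.abs(np.diff(s))) / 2         # crossings per sample

print(Z1)       # close to Z(1) = 2*F0/Fs = 0.02
print(Z1 * M)   # close to 2 crossings per 10 msec (100 samples)
```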
• There are a number of practical considerations in implementing a
representation based on the short-time zero-crossing rate.
• Although the basic algorithm for detecting a zero-crossing requires only a
comparison of signs of pairs of successive samples, special care must be
taken in the sampling process.
SHORT-TIME ZERO-CROSSING RATE – CONTD…
• The zero-crossing rate is strongly affected by any DC offset introduced by the
analog-to-digital converter, as well as by 60 Hz hum or other noise present in
the digitizing system.
• If the DC offset is greater than the peak value of a small signal, no
zero-crossings will be detected at all.
• Therefore, care must be taken in the analog processing prior to sampling to
minimize these effects.
• A key question about the use and measurement of zero-crossing rates is the
effect of DC offsets on the measurements.
SHORT-TIME ZERO-CROSSING RATE – CONTD…
• Figures 19 and 20 show the effects of severe offset (much more than might be
anticipated in real systems) on the waveforms and the resulting locations and
counts of zero-crossings.

FIGURE 19 Plots of waveform for sinusoid with no DC offset (top panel) and DC offset of
0.75 times the peak amplitude (bottom panel).
SHORT-TIME ZERO-CROSSING RATE – CONTD…
• Figure 20 shows the waveforms for a Gaussian white noise sequence
(zero-mean, unit variance, flat spectrum) with no DC offset (top panel), and
with an offset of 0.75 (bottom panel).
• The locations of the zero-crossings have changed with the DC offset.
• The count of zero-crossings over the 251-sample interval has changed from
124 (for no DC offset) to 82 (for the 0.75 DC offset).

FIGURE 20 Plots of waveform for zero-mean, unit-variance Gaussian white noise signal
with no offset (top panel), and 0.75 DC offset (bottom panel).
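The offset experiment of Figure 20 is easy to reproduce; the seed and signal length here are arbitrary choices, so the exact counts differ from the figure, but the drop in crossing rate is the same effect:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)                # zero-mean, unit-variance noise

def crossings(v):
    """Count sign changes between successive samples (sgn(0) = +1)."""
    s = np.where(v >= 0, 1, -1)
    return int(np.sum(np.abs(np.diff(s))) // 2)

zc_clean = crossings(x)                      # roughly half the sample pairs
zc_offset = crossings(x + 0.75)              # offset biases samples positive
```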
SHORT-TIME ZERO-CROSSING RATE – CONTD…
• This shows that, for zero-crossing counts to be a useful measure of frequency
content, the waveform must be highpass filtered, prior to calculation of
zero-crossing rates, to ensure that no such DC component exists.
• Speech signals are broadband signals and the interpretation of the short-time
zero-crossing rate is therefore much less precise.
• However, rough estimates of spectral properties of the speech signal can be
obtained using a representation based on the short-time average zero-
crossing rate defined as simply the average number of zero-crossings in a
block of L samples.
• If we select a block of L samples, all that is required is to check samples in
pairs to count the number of times the samples change sign within the block
and then compute the average by dividing by L.
SHORT-TIME ZERO-CROSSING RATE – CONTD…
• This would give the average number of zero-crossings/sample.
• As in the case of the short-time energy, the window can be moved by R
samples and the process repeated, thus giving the short-time zero-crossing
representation of the speech signal.
• Defining the short-time average zero-crossing rate (per sample) as,

  Znˆ = (1 / (2 Leff)) Σm |sgn(x[m]) − sgn(x[m − 1])| · w˜[nˆ − m]   ---(14)

• where Leff is the effective window length, and where the signum (sgn)
operator, defined as,

  sgn(x[n]) = 1, x[n] ≥ 0;  sgn(x[n]) = −1, x[n] < 0   ---(15)
SHORT-TIME ZERO-CROSSING RATE – CONTD…
• Transforms x[n] into a signal that retains only the sign of the samples.
• The terms |sgn(x[m]) − sgn(x[m − 1])| are equal to 2 when the pair of samples
have opposite sign, and 0 when they have the same sign.
• Thus, each zero-crossing would be represented by a sample of amplitude 2 at
the time of the zero-crossing.
• This factor of 2 is taken into account by the factor of 2 in the averaging factor
1/(2Leff) in Eq. (14).
• Typically, the window used to compute the average zero-crossing rate is the
rectangular window,

  w˜[m] = 1, 0 ≤ m ≤ L − 1;  w˜[m] = 0, otherwise   ---(16)

• for which Leff = L.
SHORT-TIME ZERO-CROSSING RATE – CONTD…
• Thus, Eq. (14) becomes,

  Znˆ = (1 / (2L)) Σ (from m = nˆ − L + 1 to nˆ) |sgn(x[m]) − sgn(x[m − 1])|

• The operations involved in the computation of Eq. (14) are represented in
block diagram form in Figure 21.

FIGURE 21 Block diagram representation of computation of short-time zero-crossings.
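Eq. (14) with the rectangular window of Eq. (16) reduces to a short function. Prepending the first sample so that the sgn differences are defined at the frame edge is an implementation choice in this sketch:

```python
import numpy as np

def short_time_zcr(x, L, R):
    """Short-time average zero-crossing rate, Eq. (14), with an L-point
    rectangular window (Leff = L), evaluated every R samples."""
    s = np.where(x >= 0, 1, -1)               # sgn, Eq. (15)
    d = np.abs(np.diff(s, prepend=s[0]))      # 2 at each crossing, else 0
    return np.array([d[n - L + 1:n + 1].sum() / (2 * L)
                     for n in range(L - 1, len(x), R)])

# A fully alternating signal crosses zero at (almost) every sample:
Z = short_time_zcr(np.array([1.0, -1.0] * 50), L=10, R=10)
```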
SHORT-TIME ZERO-CROSSING RATE – CONTD…
• This representation shows that the short-time zero-crossing rate has the same
general properties as the short-time energy and the short-time magnitude;
i.e., it is a low-pass signal whose bandwidth depends on the shape and length
of the window.
• Sometimes, instead of the rate per sample, Z(1), we need the rate of
zero-crossings per fixed interval of M samples.
• For a sampling rate of Fs = 1/T and a segment duration of τ seconds,
corresponding to an interval of M = Fs · τ = τ/T samples, all we need to do
to convert from Z(1) to Z(M) is multiply by M, that is,

  Z(M) = M · Z(1)
SHORT-TIME ZERO-CROSSING RATE – CONTD…
• We see that the Gaussian curve provides a reasonably good fit to the
zero-crossing rate distribution for both unvoiced and voiced regions.
• Since the two distributions overlap, an unequivocal voiced/unvoiced decision
is not possible based on the short-time zero-crossing rate alone.

FIGURE 23 Distribution of zero-crossings for unvoiced and voiced speech.
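Because the two distributions overlap, practical systems combine the zero-crossing rate with the short-time energy to separate silence, unvoiced, and voiced frames. The following is a toy sketch of such a three-way classifier; the frame length and the thresholds e_sil and z_uv are illustrative assumptions (real systems train them on background statistics), not values from the source:

```python
import numpy as np

def classify_frames(x, L=400, R=200, e_sil=1e-4, z_uv=0.12):
    """Toy silence/unvoiced/voiced frame classifier using short-time
    energy (mean square per frame) and zero-crossing rate per sample."""
    labels = []
    s = np.where(x >= 0, 1, -1)
    for n in range(L - 1, len(x), R):
        frame = x[n - L + 1:n + 1]
        E = np.mean(frame ** 2)                            # normalized energy
        Z = np.sum(np.abs(np.diff(s[n - L + 1:n + 1]))) / (2 * L)
        if E < e_sil:
            labels.append("silence")                       # negligible energy
        elif Z > z_uv:
            labels.append("unvoiced")                      # high ZCR
        else:
            labels.append("voiced")                        # low ZCR, high energy
    return labels

# Synthetic test signal: silence, then a 100 Hz "voiced" tone at Fs = 10 kHz,
# then low-level noise standing in for unvoiced speech:
rng = np.random.default_rng(1)
Fs = 10_000
x = np.concatenate([np.zeros(2000),
                    0.5 * np.sin(2 * np.pi * 100 * np.arange(4000) / Fs),
                    0.05 * rng.standard_normal(4000)])
labels = classify_frames(x)
```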
Pitch period estimation using parallel processing
approach
• Dudley described the speech signal in terms of an excitation function and the
shape of the short time spectrum.
• In this formulation, an important excitation parameter is fundamental voice
frequency, or its inverse, which is called the pitch period.
• A major obstacle to the use of vocoders in practical systems has been the
accurate estimation of pitch period.
• Precise estimation of the voice fundamental frequency appears to be
necessary to synthesize speech of acceptable quality.
• Work at Bell Telephone Laboratories led to the development of voice-excited
vocoder techniques, in which improved excitation was attained at the cost of
increased bandwidth.
Pitch period estimation using parallel processing
approach
• In the early 1960's, the idea of designing a pitch period estimator based on
parallel processing techniques was developed at Lincoln Laboratory.
• The basic idea of parallel processing is that combining the outputs of several
simple estimators yields an improvement in accuracy.
• Parallel processing also seemed appropriate because it appeared similar to
the human processing associated with estimation of pitch period from visual
inspection of the speech wave.
• First parallel processing scheme:
• The first parallel processing scheme, developed by Gold, was a computer
program using three parallel pitch-period estimators.
• This scheme processed a full-band speech waveform, and used peakedness
and regularity tests to estimate pitch periods.
• A relatively elementary combiner algorithm was used to determine the final
pitch-period estimate.
Pitch period estimation using parallel processing
approach

Pitch Period Estimators:
• Following this first attempt, a pitch-period estimator based on combinations
of six simple pitch-period estimators and suitable majority logic was
developed at Lincoln Laboratory.
• This scheme was both simulated on a digital computer, and built into a
hardware device.
• However, it is generally considered too intricate and costly a device for
general usage.
Pitch period estimation using parallel processing
approach
Original Parallel Processing Algorithm:
• The algorithm can be conveniently divided into four parts, as shown in Fig. 24.
• Filtering of the speech signal.
• Generation of six functions of the peaks of the filtered speech signal.
• Six identical "simple" pitch-period estimators, each working on one of the
above six functions.
• Final pitch-period computation, based on examination of the results from each
"simple" pitch-period estimator.
Pitch period estimation using parallel processing
approach
Fig. 24 Block diagram of the pitch period estimation algorithm (six individual pitch period estimators).
1. Filtering
• The primary purpose of the filter is to select approximately the first formant
region.
• No other information is necessary, and peaks caused by higher formants tend
to reduce the accuracy of subsequent pitch detection.
• If the input speech contains the fundamental frequency, a lowpass filter (LPF)
is used.
• Care should be taken to eliminate 60- and 120-Hz hum, for which a highpass
filter (HPF) is used.
• In the event that the input contains no fundamental frequency, a bandpass
filter (BPF) is used.
2. Block 2 – Generation of six functions

Fig. 25 Basic measurements made on the filtered speech.

2. Block 2 – Generation of six functions – Contd…
• Pulses of height m1, m2, and m3 are generated at every positive peak of the
filtered speech while pulses of height m4, m5 and m6 are generated at each
negative peak.
• Measurements m1 and m4 are simple peak (positive and negative)
measurements, Whereas measurements m2, m3, m5 and m6 depend on
previous peaks.
• Measurement m2 is the peak-to-valley difference, and m5 is the
valley-to-peak difference.
• m3 is the difference between the current peak and the previous peak.
• m6 is the difference between the current valley and the previous valley.
• All the m's are converted into positive pulse trains.
• Measurements m3 and m6 are not permitted to become negative.
• Hence, if a current peak (or valley) is not as large as the previous peak (or
valley), measurement m3 (or m6) is set to zero.
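Given pre-detected peak and valley amplitudes, the six measurements can be sketched in a few lines. This is an illustrative reading of the definitions above: storing valley magnitudes as positive numbers, and assuming valleys[i] follows peaks[i] in time, are conventions chosen for this sketch:

```python
import numpy as np

def six_measurements(peaks, valleys):
    """Sketch of the six measurements m1..m6. `peaks` holds positive-peak
    amplitudes; `valleys` holds negative-peak magnitudes (positive numbers),
    with valleys[i] assumed to follow peaks[i] in time."""
    peaks, valleys = np.asarray(peaks), np.asarray(valleys)
    m1 = peaks                                      # positive-peak amplitude
    m4 = valleys                                    # |negative-peak| amplitude
    m2 = peaks[1:] + valleys[:-1]                   # peak-to-(preceding)-valley
    m5 = valleys + peaks                            # valley-to-(preceding)-peak
    m3 = np.maximum(peaks[1:] - peaks[:-1], 0)      # vs. previous peak, clipped
    m6 = np.maximum(valleys[1:] - valleys[:-1], 0)  # vs. previous valley, clipped
    return m1, m2, m3, m4, m5, m6

m1, m2, m3, m4, m5, m6 = six_measurements([1.0, 0.8, 1.2], [0.5, 0.6, 0.4])
```

Note how m3 and m6 come out zero whenever the current peak (or valley) is smaller than the previous one, matching the clipping rule above.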
3. Block 3- 6 PPE
• The six sets of pulse trains are applied to the six individual pitch detectors as
shown in Fig. 24.
• The operation is illustrated in Fig. 26.
• In essence, each simple pitch-period estimator is a peak-detecting rundown
circuit.
• It should be noted that both the rundown time constant and the blanking
time of each detector are functions of the smoothed estimate of the pitch
period (Pav).

Fig. 26 Final estimation of pitch period.
3. Block 3- 6 PPE- Contd…
• Pav is derived from,

  Pav(n) = [Pav(n − 1) + Pnew] / 2

• where Pnew is the most recent estimate of pitch period,
• Pav(n) is the current smoothed estimate of pitch period,
• and Pav(n − 1) is the previous smoothed estimate of pitch period.
• To prevent extremes of blanking time or rundown time constant, Pav is
limited to be greater than 4 msec and less than 10 msec.
• Within these limits, the dependence of blanking time and rundown time
constant on Pav is given by,
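The blanking/rundown relations themselves are not reproduced in the slide, but the smoothing of Pav can be sketched as follows. The two-term averaging formula and the clamp to the 4–10 msec range follow the wording above; treat the exact form as an assumption about the original hardware:

```python
def update_pav(pav_prev, p_new, lo=4.0, hi=10.0):
    """Smoothed pitch-period update (msec): average the previous smoothed
    estimate with the newest raw estimate, then clamp to [lo, hi] so the
    blanking time and rundown time constant stay in a sensible range."""
    pav = 0.5 * (pav_prev + p_new)
    return min(max(pav, lo), hi)
```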
4. Block 4 – Final pitch calculation
• The final computation of pitch period is performed by Block 4 of Fig. 24, which
may be thought of as a special purpose computer, with a memory, an
arithmetic algorithm and control hardware to steer all the incoming signals.
• At any time t0, an estimate of pitch period is made by:
• 1. Forming a matrix of estimates of pitch period. The columns of the matrix
represent the individual detectors and the rows are estimates of period.

Fig. 27 Matrix of estimates of pitch period.
4. Block 4 – Final pitch calculation- Contd…
• 2. Comparing each of the entries in the first row of the matrix to the other 35
entries of the matrix and counting the number of coincidences.

Fig. 28 Table of coincidence windows.
