Chapter6 - SPEECH SIGNAL PROCESSING

COURSE:
DIGITAL SIGNAL PROCESSING
Instructor: Ninh Khanh Duy

CHAPTER 6:
SPEECH SIGNAL PROCESSING
Lecture 6.1: Introduction to speech signals

Lecture 6.2: Time-domain features and applications
Lecture 6.3: Frequency-domain features and applications
Duration: 6 periods
Lecture 6.1
Introduction to speech signals
! Outline:
1. Overview of speech signals
2. Basic properties of speech signals
Overview of speech signals
! Speech signals are obtained by a digital recording process

(sampling, quantizing, coding) of acoustic waves
“Các bạn trẻ …”
! Speech signals encode messages of speakers, which include

linguistic information such as phonemes, sentence types, etc
Overview of speech signals
! Acoustic wave at mouth and nose is the output of the air low
going from lung through human vocal tract
Mechanisms of phones and voicing
Air flow
/s/ /a/
Vocal
cords/folds
" Speech (Output signal): include different phones and voicing

" Resonance cavities (System) ⇒ diff. phones: /a/, /m/, /s/, /z/
" Air flow after vocal cords (Input signal) ⇒ diff. voicing:
• Vocal cords vibrate: Quasi-periodic pulses ⇒ voiced phones: /a/, /m/
• Vocal cords close: Turbulence ⇒ unvoiced phones: /s/, /z/, /p/, /k/
Lecture 6.1
Introduction to speech signals
! Outline:
1. Overview of speech signals
2. Basic properties of speech signals
Basic properties of speech signals
! Randomness
" Speech (like most real-world signals) is random: impossible to
predict with certainty their future values from past values
# Deterministic signal: for each value of time we have a rule which
enables us to determine the precise value of the signal
" The value of a signal at any instant of time x(t) is a random

variable
# The actual value of a signal is only known after observation
" A signal is assumed to be generated by a random process with a

structure that can be characterized and described
! Variability
" Depend on different microphones
! Variability
" Depend on different speakers (voices)
! Variability
" Depend on dif. physical/emotional states of the same speaker
! Characteristics are slowly varying in time

" Time/Frequency related features are quite stable within short
segments of 10-50 ms (duration to pronounce a phoneme)
Short-time processing technique
! Divide a signal into consecutive frames, each having a fixed duration

(e.g., 25 ms)
! Extract features frame-by-frame
! Combine extracted features into feature sequence (time axis is now

frame index)
Homework
1. Read Section 2 & 3 of “CS425 Audio and Speech Processing_Hodgkinson_2012”
2. Write a program to compute the energy and power of a recorded signal

following the formulas (2.1) & (2.2) in page 25 of the textbook
“Applied Digital Signal Processing -Theory and Practice_Manolakis-Ingle_2011”
CHAPTER 6:

Duration: 6 periods
Lecture 6.2
Time-domain features and applications
! Outline:
1. Voiced/Unvoiced/Silence segmentation
2. Time-domain pitch estimation
Introduction to
Voiced/Unvoiced/Silence classification
! Recorded signal include speech & silence regions

" Speech: regions exhibit voice activities (producing phones)
" Silence: regions exhibit no phone except environmental noise

Introduction to
Voiced/Unvoiced/Silence classification
! A speech region is divided into voiced & unvoiced segments

" Voiced: exhibit strong periodicity, resulted by vibration of vocal folds
" Unvoiced: exhibit weak/no periodicity, resulted by closed vocal folds

Speech/Silence discrimination
! Problem statement
" Input: a signal
" Output: the signal with vertical boundaries between speech and
silence regions
! Constraint
" The minimum length of silence region is 300ms to exclude very
short pauses when speaking
! Observation
Level of silence is mostly lesser than that of speech segments,

except when
" Environmental noise may has level higer than that of unvoiced
fricatives (e.g., /s/, /z/)
" Recording environment has a high noise level (or low Signal-to-
Noise Ratio (SNR))
$ Use signal level as the discrimination criterion

! Candidate attribute functions

" Short-Time Energy (STE): sum of square of the waveform values
over a finite number of samples belonging to a frame (20-25 ms)
n: frame index
m: sample index
N: frame length (samples)

" Magnitude Average (MA): sum of absolute of the waveform values
over a finite number of samples belonging to a frame
n: frame index
m: sample index
" For practical uses, we rather use the N values centered around n,
from n−N/2 to n+N/2−1


" Short-Time Energy (STE) vs. Magnitude Average (MA)
Both functions reflect the waveform envelope, but STE emphasizes large values
! Algorithm in general
" Based on some threshold of the attribute function to discriminate a
frame as speech or silence
" This threshold is to be found based on given training signals with

different environmental noise levels
! Algorithm to find the threshold

" Can be set manually or automatically
" Should be based on the distribution (histogram) of feature data

(STE/MA) of frames belong to speech or silence (no label needed),
or based on a binary search (label needed)
" Or should be based on simple statistics (mean & standard

deviation) (label needed) (assuming normal distribution)
Voiced/Unvoiced discrimination
! Problem statement
" Input: a signal including only speech region (assuming no silence)
" Output: the signal with vertical boundaries between voiced and
unvoiced segments
! If input signal includes some silence $ no problem because

silence is non-periodic & could be considered as unvoiced
! Same idea as previous task

" Look for attributes that characterise contrastingly the states to
discriminate
" Setting for each state a threshold based on training signals
! Different point
" Combine several features to discriminate voiced vs. unvoiced
! Discriminatory attributes and functions

" STE or MA: unvoiced segments has level generally lesser than
voiced segments

" Zero-Crossing Rate (ZCR): the rate at which the waveform crosses
the zero-axis
" Unvoiced segments exhibit a denser waveform, more turbulent

than voiced segments $ UV has significantly higher ZCR than V

" Zero-Crossing Rate (ZCR): the rate at which the waveform crosses
the zero-axis
n: frame index
m: sample index
N: frame length
! Normalisation of attribute functions

" Useful when combine (e.g., adding) multiple attribute functions into
one
" Then a voicing threshold can be set for the composite function
" Otherwise, must set various thresholds for dif. attribute functions
Lecture 6.2
Time-domain features and applications
! Outline:
1. Voiced/Unvoiced/Silence discrimination
2. Time-domain pitch estimation
Pitch or Fundamental frequency (F0)
! A feature dedicated only for periodic signals (e.g., voiced segments)

! Definition
" Fundamental frequency (F0), inverse of the fundamental period, is

the number of signal cycles per seconds
• For speech: F0 is actually the vibration frequency of vocal cords
" Pitch is the perceptual counterpart of F0 (e.g, high/low-pitched

voice)
! Importance
" Pitch contour conveys the intonation of an utterance (rising/falling)
" For Vietnamese: 06 tones (ngang, huyền, ngã, hỏi, sắc, nặng)
Pitch/F0 estimation
! Problem statement
" Input: a signal (may including silence/voiced/unvoiced segments)
" Output: F0 contour of the signal (a F0 value for each frame)
! Constraint
" Valid F0 values for adult voices is from 70Hz to 400 Hz
Pitch/F0 estimation
! An example F0 contour extracted from signal

Pitch/F0 estimation
! Two time-domain methods

" Short-Time Autocorrelation function (ACF)
" Short-Time Average Magnitude Difference Function (AMDF)
! Both based on the following property of periodic signal
NT : pitch period/fundamental period (in samples)
! Voiced segments of speech are quasi-periodic
$ “=“ never occurs

Autocorrelation function (ACF)
! The ACF of a signal gives an indication of how alike itself a

signal is when shifted
! Definition
n: lag/shift
m: sample index
! Application: for a periodic signal x, the ACF is globally

maximal at every lag that is an integer multiple of the period
" For quasi-periodic signal$ local maximal (peak)
Autocorrelation function (ACF)
! Short-time ACF of a frame:
n: lag (samples)
m: sample index
! The ACF should be normalized to obtain maximum value of 1
by dividing by largest autocorrelation value at lag zero xx[0]
! Complexity per frame: O(N2)

Short-Time Autocorrelation function
(Kondoz, 2004)
The normalized height of highest local peak is propotional

to degree of voicing $ can be used for V/U decision
Algorithm
(for a frame)
(Trần Văn Tâm, 2019)

! Autocorrelation peak detection
! Determine a suitable threshod for V/U decision
! Reducing the scope of the search

" F0 is from 70Hz to 400 Hz $ searching range of maximum lag
! Be careful with virtual pitch values
Lucky frame $ correct F0

! Be careful with virtual pitch values
Unlucky frame $ incorrect F0

Average Magnitude Difference Function
! The AMDF of a signal gives an indication of how different a

signal itself is compared to its shifted version
! Definition
(n: lag, m: sample index)
! Application: for a periodic signal x, the AMDF is zero at every

lag that is an integer multiple of the period of the waveform
" For quasi-periodic signal$ local minimal (dip)
(Ex. w/ 4 frames)
(Kondoz, 2004)
! Short-time AMDF of a frame
n: lag (samples)
! Computationally much cheaper than the ACF
! Have similar algorithm & problems to the ACF

Homework
Các thành viên mỗi nhóm thảo luận và phân công nhiệm vụ,
ghi rõ SV nào làm task nào (ko được trùng nhau):
- 1a (phân đoạn speech vs. silence)
- 1b (phân đoạn voiced vs. unvoiced)
- 2a (tính F0 dùng hàm tự tương quan)
- 2b (tính F0 dùng hàm AMDF).
Nhập task (1a/1b/2a/2b) vào link danh sách nhóm.
Hạn cuối: trước buổi học tuần sau.
Sau hạn này SV nào ko nhập coi như ko tham gia làm BT
nhóm và nhận 0 điểm thi GK.
CHAPTER 6:

Duration: 6 periods
Lecture 6.3
Frequency-domain features & applications
! Outline:
1. Frequency-domain pitch (F0) estimation
Theory of CTFS
A periodic signal x(t) has a line spectrum with uniform spacing

F0 = 1/T0 (F0: fundamental frequency of x(t))
F0 = 1/T0
Main idea
Spectrum of a periodic signal has a harmonic structure with the

distance between harmonics being the F0
$ The frame-based solution includes 2 steps:
! Estimate the spectrum using FFT (fast computation of DFT)
! Detect the spacing of adjacent harmonics (i.e., spectral lines)

Spectrum estimation using FFT
! Important parameters when using function fft(x,N)

" Window function to reduce spectral leakage (Hamm/Hann)
" # of FFT points (# of frequency-domain sampling points)

% Spectral resolution = Sampling frequency / N
% larger N to have better resolution $ more accurate F0 estimates
% But too large $ over-detailed spectrum $ harder to detect harmonics
% Should be chosen with high care
! Log magnitude spectrum should be used for low dynamic

range between spectral peaks
Harmonics spacing detection
! Detect all of harmonic peaks based on estimated spectrum
! Measure the F0 as either the common divisor of these

harmonics or the spacing of adjacent harmonics
! Note:
" Harmonic peaks appear clearer in low-frequency range (<2 kHz)
! Algorithm:
" Self-proposed (searching for spectral peaks in low-frequency range)
" Harmonic product spectrum (HPS)

Chapter6 - SPEECH SIGNAL PROCESSING

Uploaded by

Copyright:

Available Formats

Chapter6 - SPEECH SIGNAL PROCESSING

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Chapter6 - SPEECH SIGNAL PROCESSING

Uploaded by

Copyright:

Available Formats

COURSE:

DIGITAL SIGNAL PROCESSING

Instructor: Ninh Khanh Duy

Lecture 6.1: Introduction to speech signals

! Speech signals are obtained by a digital recording process

“Các bạn trẻ …”

! Speech signals encode messages of speakers, which include

" Speech (Output signal): include different phones and voicing

" The value of a signal at any instant of time x(t) is a random

" A signal is assumed to be generated by a random process with a

! Characteristics are slowly varying in time

! Divide a signal into consecutive frames, each having a fixed duration

! Extract features frame-by-frame

! Combine extracted features into feature sequence (time axis is now

1. Read Section 2 & 3 of “CS425 Audio and Speech Processing_Hodgkinson_2012”

2. Write a program to compute the energy and power of a recorded signal

Lecture 6.1: Introduction to speech signals

! Recorded signal include speech & silence regions

" Silence: regions exhibit no phone except environmental noise

! A speech region is divided into voiced & unvoiced segments

" Unvoiced: exhibit weak/no periodicity, resulted by closed vocal folds

Level of silence is mostly lesser than that of speech segments,

$ Use signal level as the discrimination criterion

! Candidate attribute functions

! Candidate attribute functions

from n−N/2 to n+N/2−1

! Candidate attribute functions

" This threshold is to be found based on given training signals with

! Algorithm to find the threshold

" Should be based on the distribution (histogram) of feature data

" Or should be based on simple statistics (mean & standard

! If input signal includes some silence $ no problem because

! Same idea as previous task

" Setting for each state a threshold based on training signals

! Discriminatory attributes and functions

! Discriminatory attributes and functions

" Unvoiced segments exhibit a denser waveform, more turbulent

! Discriminatory attributes and functions

! Normalisation of attribute functions

! A feature dedicated only for periodic signals (e.g., voiced segments)

" Fundamental frequency (F0), inverse of the fundamental period, is

" Pitch is the perceptual counterpart of F0 (e.g, high/low-pitched

" Pitch contour conveys the intonation of an utterance (rising/falling)

" Output: F0 contour of the signal (a F0 value for each frame)

! An example F0 contour extracted from signal

! Two time-domain methods

" Short-Time Average Magnitude Difference Function (AMDF)

! Both based on the following property of periodic signal

NT : pitch period/fundamental period (in samples)

! Voiced segments of speech are quasi-periodic

$ “=“ never occurs

! The ACF of a signal gives an indication of how alike itself a

! Application: for a periodic signal x, the ACF is globally

! Short-time ACF of a frame:

! Complexity per frame: O(N2)

The normalized height of highest local peak is propotional

(Trần Văn Tâm, 2019)

! Autocorrelation peak detection

! Determine a suitable threshod for V/U decision