! Outline:
1. Overview of speech signals
2. Basic properties of speech signals
Overview of speech signals
! The acoustic wave at the mouth and nose is the output of the air flow
going from the lungs through the human vocal tract
[Figure: mechanisms of phones and voicing; air flow for /s/ and /a/]
Basic properties of speech signals
! Randomness
" Speech (like most real-world signals) is random: its future values
cannot be predicted with certainty from its past values
# Deterministic signal, by contrast: for each value of time there is a rule
that determines the precise value of the signal
! Variability
" Depends on the microphone
" Depends on the speaker (voice)
" Depends on the physical/emotional state of the same speaker
! Outline:
1. Voiced/Unvoiced/Silence segmentation
2. Time-domain pitch estimation
Introduction to Voiced/Unvoiced/Silence classification
! Problem statement
" Input: a signal
" Output: the signal with vertical boundaries between speech and
silence regions
! Constraint
" A silence region must last at least 300 ms, so that very short pauses
during speech are not treated as silence
Speech/Silence discrimination
! Observation
" The recording environment has a high noise level (i.e., a low
Signal-to-Noise Ratio, SNR)
n: frame index
m: sample index
N: frame length (samples)
Speech/Silence discrimination
" In practice, the N samples centered around n are used
Both functions reflect the waveform envelope, but STE (short-time energy) emphasizes large values
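As a rough illustration of the two attribute functions, the sketch below computes a short-time energy and a short-time magnitude over the N samples centered around n. The exact definitions are an assumption here (sum of squares and sum of absolute values with a rectangular window), n is treated as the sample position at the center of the frame, and all function names are hypothetical.

```python
def frame_around(x, n, N):
    """Return the N samples of x centered around index n (clipped at the start)."""
    start = max(0, n - N // 2)
    end = min(len(x), start + N)
    return x[start:end]


def short_time_energy(x, n, N):
    """Assumed definition: sum of squared samples in the frame around n."""
    return sum(v * v for v in frame_around(x, n, N))


def short_time_magnitude(x, n, N):
    """Assumed definition: sum of absolute sample values in the frame around n."""
    return sum(abs(v) for v in frame_around(x, n, N))


# Toy signal: a quiet stretch (amplitude 0.1) followed by a loud one (amplitude 1.0).
signal = [0.1, -0.1] * 4 + [1.0, -1.0] * 4
N = 4
quiet_ste = short_time_energy(signal, 4, N)
loud_ste = short_time_energy(signal, 12, N)
quiet_mag = short_time_magnitude(signal, 4, N)
loud_mag = short_time_magnitude(signal, 12, N)
```

On this toy signal the loud-to-quiet ratio is larger for the energy than for the magnitude, which is one way to see that squaring emphasizes large values.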
Speech/Silence discrimination
! Algorithm in general
" Compare the attribute function against a threshold to classify each
frame as speech or silence
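The general idea can be sketched as follows, combined with the 300 ms minimum silence length from the constraint. This is a minimal sketch under assumptions: the threshold value, the frame shift, and the function names are hypothetical, and only silence runs lying between two speech regions are treated as short pauses.

```python
def classify_frames(values, threshold):
    """Label a frame True (speech) if its attribute value exceeds the threshold."""
    return [v > threshold for v in values]


def remove_short_silences(is_speech, frame_shift_ms, min_silence_ms=300):
    """Relabel silence runs between speech regions that are shorter than min_silence_ms."""
    labels = list(is_speech)
    min_frames = min_silence_ms / frame_shift_ms
    i = 0
    while i < len(labels):
        if labels[i]:
            i += 1
            continue
        # Find the end of the current silence run labels[i:j].
        j = i
        while j < len(labels) and not labels[j]:
            j += 1
        # Only interior runs (speech on both sides) count as pauses in speech.
        if i > 0 and j < len(labels) and (j - i) < min_frames:
            for k in range(i, j):
                labels[k] = True
        i = j
    return labels


# Hypothetical per-frame attribute values, one frame every 100 ms.
values = [0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.1, 2.0, 0.0, 0.0, 0.0, 0.0]
raw = classify_frames(values, threshold=0.5)
smoothed = remove_short_silences(raw, frame_shift_ms=100)
```

Here the single low-energy frame inside the speech region lasts only 100 ms, so it is merged into the surrounding speech, while the 400 ms silences at both ends are kept.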
Voiced/Unvoiced discrimination
! Problem statement
" Input: a signal containing only speech regions (assuming no silence)
" Output: the signal with vertical boundaries between voiced and
unvoiced segments
! Difference from speech/silence discrimination
" Several features are combined to discriminate voiced vs. unvoiced
Voiced/Unvoiced discrimination
n: frame index
m: sample index
N: frame length
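One way several frame-level features might be merged into a single composite score is sketched below. The choice of features is an assumption, since they are not listed here: frame energy (typically higher for voiced speech) and zero-crossing rate (typically higher for unvoiced speech). The composite formula and all function names are hypothetical.

```python
import math


def energy(frame):
    """Mean squared sample value of the frame."""
    return sum(v * v for v in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def composite_voicing_score(frame, max_energy):
    """Hypothetical composite: normalized energy minus zero-crossing rate."""
    return energy(frame) / max_energy - zero_crossing_rate(frame)


fs = 8000
# Voiced-like frame: strong 100 Hz sinusoid (20 ms).
voiced = [math.sin(2 * math.pi * 100 * m / fs) for m in range(160)]
# Unvoiced-like frame: weak signal that changes sign at every sample.
unvoiced = [0.1 * (-1) ** m for m in range(160)]

max_energy = max(energy(voiced), energy(unvoiced))
voiced_score = composite_voicing_score(voiced, max_energy)
unvoiced_score = composite_voicing_score(unvoiced, max_energy)
```

With a score of this kind, one threshold (0 in this toy case) separates the two frames, instead of one threshold per feature.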
Voiced/Unvoiced discrimination
" A single voicing threshold can then be set for the composite function
" Otherwise, a separate threshold must be set for each attribute function
Lecture 6.2
Time-domain features and applications
! Outline:
1. Voiced/Unvoiced/Silence discrimination
2. Time-domain pitch estimation
Pitch or Fundamental frequency (F0)
! Importance
" For Vietnamese: pitch distinguishes the 6 tones (ngang, huyền, ngã, hỏi, sắc, nặng)
Pitch/F0 estimation
! Problem statement
" Input: a signal (may include silence, voiced, and unvoiced segments)
! Constraint
" Valid F0 values for adult voices range from 70 Hz to 400 Hz
Pitch/F0 estimation
! Definition
n: lag/shift (samples)
m: sample index
N: frame length (samples)
! The ACF should be normalized to a maximum value of 1 by dividing it
by its largest value, the autocorrelation at lag zero
(Kondoz, 2004)
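A minimal sketch of pitch estimation with a normalized short-time autocorrelation is given below. It assumes the common form in which the frame is multiplied by a copy of itself shifted by n samples and summed over the overlap. The peak-picking rule (largest value within the 70 Hz to 400 Hz range) and the function names are illustrative assumptions, not necessarily the method of Kondoz (2004).

```python
import math


def normalized_acf(frame, max_lag):
    """Short-time autocorrelation for lags 0..max_lag, divided by the lag-0 value."""
    N = len(frame)
    r0 = sum(v * v for v in frame)
    acf = []
    for n in range(max_lag + 1):
        r = sum(frame[m] * frame[m + n] for m in range(N - n))
        acf.append(r / r0 if r0 > 0 else 0.0)
    return acf


def estimate_f0_acf(frame, fs, f0_min=70.0, f0_max=400.0):
    """Pick the lag with the largest normalized ACF inside the valid F0 range."""
    lag_min = int(fs / f0_max)
    lag_max = int(fs / f0_min)
    acf = normalized_acf(frame, lag_max)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda n: acf[n])
    return fs / best_lag, acf


fs = 8000
# Synthetic voiced-like frame (30 ms): 200 Hz fundamental plus one harmonic.
frame = [
    math.sin(2 * math.pi * 200 * m / fs) + 0.5 * math.sin(2 * math.pi * 400 * m / fs)
    for m in range(240)
]
f0, acf = estimate_f0_acf(frame, fs)
```

For this synthetic frame the strongest peak in the search range falls at a lag of 40 samples, i.e. 200 Hz. Real speech would additionally need a voicing decision before the estimate is trusted.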
Short-Time Autocorrelation function
! Definition
(Kondoz, 2004)
Average Magnitude Difference Function
n: lag (samples)
N: frame length (samples)
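The AMDF can be sketched in the same spirit. The block below assumes the usual form, the mean absolute difference between the frame and a copy of itself shifted by n samples, and picks the deepest valley within the 70 Hz to 400 Hz range. The averaging over the overlap and the function names are assumptions for illustration.

```python
import math


def amdf(frame, max_lag):
    """Mean absolute difference between the frame and its shifted copy, lags 0..max_lag."""
    N = len(frame)
    return [
        sum(abs(frame[m] - frame[m + n]) for m in range(N - n)) / (N - n)
        for n in range(max_lag + 1)
    ]


def estimate_f0_amdf(frame, fs, f0_min=70.0, f0_max=400.0):
    """Pick the lag with the smallest AMDF value inside the valid F0 range."""
    lag_min = int(fs / f0_max)
    lag_max = int(fs / f0_min)
    d = amdf(frame, lag_max)
    best_lag = min(range(lag_min, lag_max + 1), key=lambda n: d[n])
    return fs / best_lag


fs = 8000
# Synthetic voiced-like frame (30 ms): 100 Hz fundamental plus one harmonic.
frame = [
    math.sin(2 * math.pi * 100 * m / fs) + 0.5 * math.sin(2 * math.pi * 200 * m / fs)
    for m in range(240)
]
f0 = estimate_f0_amdf(frame, fs)
```

Unlike the ACF, the AMDF shows valleys rather than peaks at the pitch period and needs no multiplications. For a perfectly periodic frame, every multiple of the period that falls inside the search range gives an equally deep valley, which this sketch does not try to resolve.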
! Outline:
1. Frequency-domain pitch (F0) estimation
Theory of CTFS (Continuous-Time Fourier Series)
! Note:
" Harmonic peaks appear more clearly in the low-frequency range (< 2 kHz)
! Algorithm:
" Self-proposed (search for spectral peaks in the low-frequency range)