DSP Project 2
Question 1: Play each sound file in the TRAIN folder. Can you distinguish the voices of the eight
speakers in the database? Now play each sound in the TEST folder in a random order without
looking at the file name (pretending that you do not know the speaker) and try to identify the
speaker using your knowledge of their voices that you just learned from the TRAIN folder. This
is exactly what the computer will do in our system. What is your (human performance)
recognition rate? Record this result so that it could be later on compared against the computer
performance of our system.
Answer:
In the TRAIN folder, all of the speakers appear to be women, and each says the word "zero"
with a different intonation.
s1: The speaker has a neutral voice and says the word somewhat snappily.
s2: The speaker's pitch rises in the latter part of the word.
s3: The speaker says the word softly.
s4: The speaker says the word in a deep tone.
s5: The speaker prolongs the whole word.
s6: The speaker says the word in a low tone.
s7: The speaker prolongs the last syllable of the word.
s8: The speaker uses a high intonation on the first syllable.
I compared the audio files in the TRAIN folder with those in the TEST folder, and I could not
match s1 and s3. I therefore identified 6 out of 8 audio files correctly, a recognition rate of 75%.
Question 2: Read a sound file into Matlab. Check it by playing the sound file in Matlab using
the function: sound. What is the sampling rate? What is the highest frequency that the
recorded sound can capture with fidelity? With that sampling rate, how many msecs of actual
speech are contained in a block of 256 samples?
Plot the signal to view it in the time domain. It should be obvious that the raw data in the
time domain has a very high amount of data and it is difficult for analyzing the voice
characteristic. So the motivation for this step (speech feature extraction) should be clear now!
Answer:
The sampling rate can be determined with the following Matlab command at the command window:
[y, Fs] = audioread('file address')
To hear the audio file s1.wav, the sound command is used:
sound(y, Fs)
The sampling rate of all the audio files is 12500 Hz. The highest frequency that the recording can
capture with fidelity follows from the Nyquist-Shannon sampling theorem: fmax = Fs/2. Thus the highest
frequency is 6250 Hz.
The duration of speech in a block of 256 samples at 12500 Hz is 256/12500 sec = 20.48 msec. The plot
of s1.wav in the time domain is shown in Figure 1.
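As a quick check of the arithmetic above, the following Matlab sketch recomputes the highest representable frequency and the duration of a 256-sample block. The value Fs = 12500 is taken from the text; the variable names fMax and blockMs are illustrative only.

```matlab
Fs = 12500;                 % sampling rate reported by audioread (Hz)
N  = 256;                   % block length in samples
fMax    = Fs / 2;           % Nyquist frequency (Hz)
blockMs = 1000 * N / Fs;    % duration of one block (msec)
fprintf('fMax = %g Hz, block = %.2f msec\n', fMax, blockMs);
```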
Figure 1 (x-axis: Time (sec), y-axis: Amplitude)
The Matlab syntax used to plot the data in s1.wav file is:
clear all
% Read the samples and the sampling rate
[y,Fs]=audioread('s1.wav');
% Time axis in seconds: sample index divided by the sampling rate
t=(0:length(y)-1)/Fs;
plot(t,y)
xlabel('Time (sec)');
ylabel('Amplitude');
The raw time-domain signal in Figure 1 contains a large number of samples, which makes it difficult
to analyze the voice characteristics directly. Because speech changes over time, the signal is analyzed in
short frames: each frame is multiplied by a window and then transformed to the frequency domain with the
Fast Fourier Transform (FFT). Windowing tapers the samples at both ends of each frame, which reduces the
spectral distortion caused by cutting the signal at the frame boundaries.
To perform the windowing and compute the FFT, the speech signal is blocked into frames of N
samples, with adjacent frames being separated by M samples, where M < N. The assigned value for M
is 100 samples and for N is 256 samples.
a) The first frame consists of the first N samples. The second frame begins M samples after the
first frame, and overlaps it by N - M samples and so on.
b) This process continues until all the speech is accounted for within one or more frames.
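The frame-blocking steps above can be sketched without loops. This is a minimal illustration on a synthetic ramp signal (an assumption, used so that each sample value equals its index); the names x, idx and frames are hypothetical and are not part of the project code.

```matlab
x = (1:1000)';                          % synthetic stand-in for the speech samples
N = 256;                                % frame length
M = 100;                                % frame increment
numFrames = floor((length(x) - N) / M) + 1;
% Row i, column j of idx is the sample index of element i in frame j
idx = repmat((1:N)', 1, numFrames) + repmat((0:numFrames-1) * M, N, 1);
frames = x(idx);                        % N-by-numFrames matrix, one frame per column
overlap = N - M;                        % samples shared by adjacent frames
```

With this ramp signal, the last N - M samples of frame 1 are identical to the first N - M samples of frame 2, which is the overlap described in step a).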
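The syntax used to block the speech signal into frames, window them, and compute the FFT is:
clear all
[y,Fs]=audioread('s1.wav');
audio1=y;
l = length(audio1);
n = 256; % frame length N
m = 100; % frame increment M
blockFrames = floor((l-n)/m) + 1;
% Frame blocking: column j of M1 holds frame j
M1 = zeros(n,blockFrames);
for i = 1:n
for j = 1:blockFrames
M1(i,j) = audio1(((j-1) * m) + i);
end
end
% Windowing: multiply every frame by a Hamming window
h=hamming(n);
M2= diag(h)*M1;
% FFT of each windowed frame, stored column by column
M3 = zeros(n,blockFrames);
for i = 1:blockFrames
M3(:,i) = fft(M2(:,i));
end
% Magnitude spectrum of every frame against frequency in Hz
f=(0:n-1)*Fs/n;
plot(f,abs(M3),'-ok','linewidth',2,'MarkerFaceColor','black');
xlabel('Frequency (Hz)');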
The discrete-time representation of s1.wav with 256 samples at 12500 Hz sampling rate is shown
in Figure 2.
Figure 2
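To illustrate how an FFT bin maps to a frequency in Hz, the sketch below windows a synthetic sine wave and locates its spectral peak. The test tone, the choice of bin 20, and the explicit Hamming formula (written out so that no toolbox function is needed) are assumptions made for this example only.

```matlab
Fs = 12500;                                   % sampling rate (Hz)
N  = 256;                                     % frame length
k0 = 20;                                      % chosen FFT bin (0-based)
f0 = k0 * Fs / N;                             % tone frequency that falls exactly on that bin
t  = (0:N-1)' / Fs;
x  = sin(2*pi*f0*t);                          % synthetic frame
w  = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));    % Hamming window formula
X  = fft(w .* x);
f  = (0:N-1)' * Fs / N;                       % frequency of each FFT bin (Hz)
[~, iPeak] = max(abs(X(1:N/2)));              % search the non-negative frequencies
peakHz = f(iPeak);
fprintf('Peak at %.4f Hz\n', peakHz);
```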
Question 3: After successfully running the preceding process, what is the interpretation of the result?
Compute the power spectrum and plot it out using the imagesc command. Note that it is better to view
the power spectrum on the log scale. Locate the region in the plot that contains most of the energy.
Translate this location into the actual ranges in time (msec) and frequency (in Hz) of the input speech
signal.
Answer:
The source code used to view the power spectrum of s1.wav using the imagesc command:
% Uses Fs, l, n and M3 from the previous listing
r = n/2; % keep the non-negative frequencies only
rt = l/Fs; % duration of the signal in seconds
P = abs(M3(1:r, :)).^2; % power spectrum of each frame
subplot(121);
imagesc([0 rt],[0 Fs/2],P),axis xy;
title('Power Spectrum on Linear Scale');
xlabel('Time [s]');
ylabel('Frequency [Hz]');
colorbar;
%log scale (20*log10 of the power gives twice the usual dB value, 10*log10)
subplot(122);
imagesc([0 rt],[0 Fs/2],20*log10(P)), axis xy;
title('Power Spectrum on Log Scale');
xlabel('Time [s]');
ylabel('Frequency [Hz]');
colorbar;
The power spectrum of the s1.wav audio file is shown in Figure 3. In the figure, it is evident that
most of the energy is concentrated between 0.3 sec and 0.7 sec, that is, from 300 msec to 700 msec. The
power spectrum is easier to read on the log scale, where the peak intensity on the color scale is about 50.
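Translating a location in the power spectrum into time and frequency can be sketched as follows. The matrix P here is synthetic (a single nonzero entry placed by hand), so the printed numbers do not describe s1.wav; the conversion formulas assume the frame length N = 256 and increment M = 100 used earlier.

```matlab
Fs = 12500; N = 256; M = 100;
P = zeros(N/2, 10);                 % synthetic power spectrum: 128 bins x 10 frames
P(21, 4) = 1;                       % place all energy at bin 21, frame 4
[~, iMax] = max(P(:));
[kf, kt] = ind2sub(size(P), iMax);  % row = frequency bin, column = frame
peakHz       = (kf - 1) * Fs / N;               % bin index to Hz
frameStartMs = 1000 * (kt - 1) * M / Fs;        % first sample of the frame (msec)
frameEndMs   = 1000 * ((kt - 1) * M + N) / Fs;  % last sample of the frame (msec)
fprintf('%.2f Hz, %.2f to %.2f msec\n', peakHz, frameStartMs, frameEndMs);
```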
Question 4: Compute and plot the power spectrum of a speech file using different frame size: for example
N = 128, 256 and 512. In each case, set the frame increment M to be about N/3. Can you describe and
explain the differences among those spectra?
Answer:
Figure 3
Figure 4
Figure 5
In Figure 3, the power intensity is 30; in Figure 4 the intensity is 220; and in Figure 5 the intensity
is 60. The power intensity indicates the loudness of the sound, and the number of frames determines how
finely the sound is resolved in time. Among the figures shown above, the power spectrum with the largest
N/M ratio is Figure 4, which has a ratio of 3.0117. This means that at these values of the M and N
parameters there is a good balance between the power intensity of the sound at a particular sampling rate
and the number of frames. In Figures 3 and 5, the N/M ratios are 2.977 and 2.994, respectively. In
general, a larger N gives finer frequency resolution (the spacing between FFT bins is Fs/N) but coarser
time resolution, because each frame spans more of the signal; a smaller N does the opposite.
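The frame parameters discussed above can be tabulated with a short sketch. It assumes that M was chosen as round(N/3), which reproduces the ratios quoted in the text, and that Fs = 12500 Hz; the variable names are illustrative.

```matlab
Fs = 12500;
N = [128 256 512];          % frame sizes
M = round(N / 3);           % frame increments, about N/3 (assumed rounding)
ratio   = N ./ M;           % N/M ratio for each frame size
frameMs = 1000 * N / Fs;    % duration of one frame (msec)
binHz   = Fs ./ N;          % spacing between FFT bins (Hz)
% Columns: N, M, N/M, frame duration (msec), bin spacing (Hz)
disp([N; M; ratio; frameMs; binHz]');
```

Each doubling of N doubles the frame duration and halves the bin spacing, which is the time-frequency trade-off visible across the three spectra.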
Question 5: Type help melfb at the Matlab prompt for more information about this function. Follow the
guidelines to plot out the mel-spaced filter bank. What is the behavior of this filter bank? Compare it
with the theoretical part.
Answer:
The mel-spaced filter bank is a set of band-pass filters whose spacing follows the frequency bands
perceived by the human ear. In Figure 6, each triangle is the frequency response of one filter of the
bank at the given sampling rate.
Based on the plot, the filter bank is limited to 20 filters (bands), which means that every input
spectrum is reduced to 20 mel-spectrum coefficients before it is processed further in the system. The
filters are narrow and closely spaced at low frequencies and become wider at high frequencies, which
agrees with the theory that the mel scale is approximately linear below 1 kHz and logarithmic above it.
When the number of bands is increased, each triangle becomes narrower, so the spectrum is represented
in finer detail.
The codes used to plot the Mel-spaced Filter bank are:
function m = melfb(p, n, Fs)
% MELFB Determine matrix for a mel-spaced filterbank
% Inputs: p number of filters in filterbank
% n length of fft
% Fs sample rate in Hz
% Outputs: m a (sparse) matrix containing the filter bank amplitudes
% size(m) = [p, 1+ floor (n/2)]
%
% The values used in this project are p = 20, n = 256 and Fs = 12500,
% passed as arguments: m = melfb(20, 256, 12500)
f0 = 700 / Fs;
fn2 = floor(n/2);
lr = log(1 + 0.5/f0) / (p+1);
% convert to fft bin numbers with 0 for DC term
bl = n * (f0 * (exp([0 1 p p+1] * lr) - 1));
b1 = floor(bl(1)) + 1;
b2 = ceil(bl(2));
b3 = floor(bl(3));
b4 = min(fn2, ceil(bl(4))) - 1;
pf = log(1 + (b1:b4)/n/f0) / lr;
fp = floor(pf);
pm = pf - fp;
r = [fp(b2:b4) 1+fp(1:b3)];
c = [b2:b4 1:b3] + 1;
v = 2 * [1-pm(b2:b4) pm(1:b3)];
m = sparse(r, c, v, p, 1+fn2);
end
Figure 6
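The spacing of the filters can be examined with the sketch below. It recomputes the band edges from the same 700 Hz constant and logarithmic step (lr) that appear in melfb, and plots the filter bank only if a melfb function is found on the Matlab path (an assumption about the local setup). The names edgesHz and widthHz are illustrative.

```matlab
Fs = 12500; p = 20; n = 256;
lr = log(1 + (Fs/2)/700) / (p + 1);        % equal step on the mel-like log axis
edgesHz = 700 * (exp((0:p+1) * lr) - 1);   % p+2 band edges from 0 Hz to Fs/2
widthHz = diff(edgesHz);                   % distance between neighbouring edges
fprintf('First spacing: %.1f Hz, last spacing: %.1f Hz\n', widthHz(1), widthHz(end));
if exist('melfb', 'file') == 2
    % One curve per filter, against frequency in Hz
    plot(linspace(0, Fs/2, 1 + floor(n/2)), full(melfb(p, n, Fs))');
    xlabel('Frequency (Hz)'); ylabel('Filter amplitude');
end
```

The edge spacing grows steadily with frequency, which is why the triangles in the plot widen toward Fs/2.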
Question 6: Compute and plot the spectrum of a speech file before and after the mel-frequency
wrapping step. Describe and explain the impact of the melfb program.
Answer:
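Before the melfb program is applied, the energy of the s1.wav sound signal lies within the range of
lower frequencies, as shown in Figure 7. After the melfb program is applied, the spectrum is summarized
by 20 mel-spaced bands, each of which sums the power of the FFT bins under its triangular filter. The
displayed energy level in each band is therefore higher, and the low-frequency region, where most of the
speech energy lies, is represented in more detail than the high-frequency region.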
Figure 7
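The before and after comparison can be sketched on a single frame. The two-tone test signal is synthetic (an assumption standing in for a frame of s1.wav), and the mel-wrapping step runs only if a melfb function is available on the Matlab path.

```matlab
Fs = 12500; n = 256; p = 20;
t = (0:n-1)' / Fs;
s = sin(2*pi*500*t) + 0.5*sin(2*pi*3000*t);   % synthetic frame with two tones
f = fft(s);
n2 = 1 + floor(n/2);
powerSpec = abs(f(1:n2)).^2;                  % before: 129 linearly spaced bins
if exist('melfb', 'file') == 2
    z = melfb(p, n, Fs) * powerSpec;          % after: 20 mel-spaced bands
    subplot(211); plot(linspace(0, Fs/2, n2), powerSpec);
    title('Before mel-frequency wrapping'); xlabel('Frequency (Hz)');
    subplot(212); stem(1:p, full(z));
    title('After mel-frequency wrapping'); xlabel('Mel band index');
end
```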
Question 7: To inspect the acoustic space (MFCC vectors) we can pick any two dimensions (say the 5th
and the 6th) and plot the data points in a 2D plane. Use acoustic vectors of two different speakers and
plot data points in two different colors. Do the data regions from the two speakers overlap each other?
Are they in clusters?
Answer:
Based on the resulting plot, the data points of the two speakers overlap each other, which is
expected because both speakers utter the same word, although s1 and s3 probably say the word zero with
different intonation. The plot also shows that the points form neither clear clusters nor a pattern in
these two dimensions, hence the s1.wav and s3.wav files do not match each other. In other words, the two
wav files have different speakers. Here we chose to plot the acoustic vectors of Speaker 1 and Speaker 3.
It is important to note that this is only a two-dimensional plot; the actual vectors contain 20
dimensions. The plot is shown in Figure 8.
The code used to plot the data points is:
[ya1,Fa1]=audioread('s1.wav');
[ya3,Fa3]=audioread('s3.wav');
vec1 = mfcc(ya1,Fa1);
vec2 = mfcc(ya3,Fa3);
plot(vec1(5, :), vec1(6, :), '*g');
hold on;
plot(vec2(5, :), vec2(6, :), '*r');
title('2D Plot of Acoustic Vectors');
xlabel('5th Dimension');
ylabel('6th Dimension');
Figure 8
Question 8: Plot the resulting VQ codewords after function vqlbg using the same two dimensions over the
plot of the previous question. Compare the result with Figure 5.
Answer:
Figure 9 is the resulting plot of the acoustic vectors of the s1.wav and s3.wav audio files with their
distinct codewords. The green o markers are the acoustic vectors from s1.wav and the blue + markers are
from s3.wav. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by
clustering his/her training acoustic vectors. The resulting codewords (centroids) are shown in the figure
by red triangles for s1.wav and black * markers for s3.wav. Comparing Figure 9 with Figure 8, it is
obvious that the latter is an ideal case for teaching, since well-formed centroids make clustering easy.
In practice, however, the clusters cannot be distinguished as easily as shown.
Figure 9
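The clustering idea behind the codebook can be sketched as one split followed by a few nearest-neighbour and centroid-update passes. This is a minimal illustration of an LBG-style procedure, not the vqlbg function used in the project; the seven two-dimensional points and the splitting parameter e = 0.01 are assumptions chosen for the example.

```matlab
% Synthetic 2-D "acoustic vectors", one point per column
d = [-3 -2.5 -3.5  3  2.5  3.5  4;
     -3 -3.5 -2.5  3  3.5  2.5  4];
e = 0.01;                              % splitting parameter (assumed value)
c = mean(d, 2);                        % start from a single centroid
c = [c*(1+e), c*(1-e)];                % split it into two codewords
for iter = 1:10
    K = size(c, 2);
    D = zeros(K, size(d, 2));
    for k = 1:K
        diffs = d - repmat(c(:, k), 1, size(d, 2));
        D(k, :) = sum(diffs.^2, 1);    % squared Euclidean distance to codeword k
    end
    [~, nearest] = min(D, [], 1);      % nearest codeword for each vector
    for k = 1:K
        if any(nearest == k)
            c(:, k) = mean(d(:, nearest == k), 2);   % centroid update
        end
    end
end
disp(c);                               % one codeword per column
```

In a larger codebook the split and update steps would be repeated until the desired number of codewords is reached.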
Question 9: What is recognition rate our system can perform? Compare this with the human
performance. For the cases that the system makes errors, re-listen to the speech files and try to come
up with some explanations.
Answer:
The following command will run the system to analyze audio files:
code = train('D:\Documents\MATLAB\train\', 8);
test('D:\Documents\MATLAB\test\', 8, code);
The results are:
Speaker 1 matches with Speaker 1
Speaker 2 matches with Speaker 2
Speaker 3 matches with Speaker 7
Speaker 4 matches with Speaker 4
Speaker 5 matches with Speaker 5
Speaker 6 matches with Speaker 6
Speaker 7 matches with Speaker 7
Speaker 8 matches with Speaker 8
Based on the result, only the speaker in the s3.wav audio file was not matched correctly: the system
matched Speaker 3 with Speaker 7. By ear, however, I misidentified two audio files, and I could not hear
any similarity between s7 and s3 in the given files. The computer achieves a recognition rate of 87.5%
(7 out of 8), compared with 75% (6 out of 8) for my own hearing.
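The two recognition rates can be recomputed from the results listed above. The vector predicted copies the system output; the variable names are illustrative.

```matlab
actual     = 1:8;                               % true speaker of each test file
predicted  = [1 2 7 4 5 6 7 8];                 % system output listed above
systemRate = 100 * mean(predicted == actual);   % percentage of correct matches
humanRate  = 100 * 6 / 8;                       % 6 of 8 files identified by ear
fprintf('System: %.1f%%, human: %.1f%%\n', systemRate, humanRate);
```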
Question 10: You can also test the system with your own speech files. Use the Windows program Sound
Recorder to record more voices from yourself and your friends. Each new speaker needs to provide one
speech file for training and one for testing. Can the system recognize your voice? Enjoy!
Answer:
I used Audacity to edit the audio files I recorded so that they have a 12500-Hz sampling rate.
The word I recorded with 5 different persons is hey. When the files were run through the previous
code, the results were:
Speaker 1 matches with Speaker 1
Speaker 2 matches with Speaker 2
Speaker 3 matches with Speaker 3
Speaker 4 matches with Speaker 4
Speaker 5 matches with Speaker 5
The computer achieved a 100% recognition rate in analyzing the 5 different audio files I gathered.