Voice Classification
Abstract: Gender classification is one of the most important processes in speech processing. Usually gender
classification is based on pitch as the feature, since the pitch value of female speakers is generally higher than that of
male speakers. Most recent research works perform gender classification using this condition. In some cases, however,
the pitch of a male speaker is high or the pitch of a female speaker is low, and then this classification does not produce
the required result. To address this problem, we propose a new gender classification method that considers three
features: energy entropy, short time energy and zero crossing rate. The new method uses fuzzy logic and a neural network
to identify the gender of the speaker. To train the fuzzy logic and the neural network, a training dataset is generated
from these three features. The mean of the outputs obtained from the fuzzy logic and the neural network is then
calculated. Using this mean as a threshold value, the proposed method identifies the gender to which the speaker belongs.
The implementation results show the performance of the proposed technique in gender classification.
Keywords: Gender classification, fuzzy logic, neural network, energy entropy, short time energy, zero crossing rate.
Received July 16, 2011; accepted December 30, 2011; published online August 5, 2012
The rest of the paper is structured as follows: related works are briefly reviewed in section 2, and the proposed technique, with adequate mathematical models and illustrations, is detailed in section 3. The implementation results are discussed in section 4, and section 5 concludes the paper.
2. Related Works
Some of the recent research works related to speech classification are discussed as follows.
Rakesh et al. [16] have proposed two different models built from several speech processing techniques and algorithms: one model produces the formant values of the voice sample and the other produces its pitch value. The gender-biased features and the pitch value of a speaker were extracted by employing these two models. The mean of the formants and pitch over all the samples of a speaker was calculated by a model with loops and counters, which generates the mean of formant 1 and the pitch value of the speaker. The speaker is classified as male or female with a nearest neighbor technique, by computing the Euclidean distance between the speaker's mean formant 1 and pitch values and the mean values for males and for females. The algorithm is implemented in real time using NI LabVIEW.
Rao and Prasad [17] have used the different time-varying glottal excitation components of speech for text-independent gender recognition studies. The excitation information in speech was represented by a Linear Prediction (LP) residual. They have used Hidden Markov Models (HMMs) to capture the gender-specific information in the excitation of different voiced speech. The decrease in error during training and the close to 100% precision in identifying genders during the testing phase show that the continuous ergodic HMM can effectively capture the gender-specific information in the excitation component of speech. In their gender identification study, they have also examined the effect of the size of the testing data on gender recognition performance, using gender-specific features in various HMM states and mixture components. They have also performed gender recognition studies on the Texas Instruments and Massachusetts Institute of Technology (TIMIT) database.
Devi et al. [1] have discussed that background noise from noisy environments, for example car, bus, babble, factory, helicopter and street noise, reduces the performance of speech-processing systems such as speech coding and speech recognition. Thus the classification of noise is necessary to improve the performance of the speech recognition system. The selection of a good set of features that can efficiently separate the signals in the feature space is an important step in the design of a signal classification system. Noise classification is a crucial process for reducing the effect of environmental noise on speech processing tasks. They have proposed a fuzzy ARTMAP network and a modified fuzzy ARTMAP network to classify the various background noise signals. In addition, their experimental results were compared with both back propagation networks and the Radial Basis Function Network (RBFN).
Sedaaghi [19] has presented a comparative study of gender and age classification algorithms applied to speech signals. Experiments were carried out on the Danish Emotional Speech database (DES) and the English Language Speech Database for Speaker Recognition (ELSDSR). The best classifier for gender and age classification was identified by experimentally comparing the Bayes classifier using Sequential Floating Forward Selection (SFFS) for feature selection, Probabilistic Neural Networks (PNNs), Support Vector Machines (SVMs), K-Nearest Neighbor (K-NN) and the Gaussian Mixture Model (GMM). The study shows that gender classification can be carried out with a precision of approximately 95% using speech signals either from both genders or from male and female speakers individually.
Sigmund [21] has proposed an approach for automatic identification of gender from a short segment of normally spoken continuous speech. All the vowels were studied separately to observe which phonemes are useful for gender recognition, and two simple identifiers based on selected Mel-Frequency Cepstral Coefficients (MFCCs) were evaluated. An accuracy of more than 90% is achieved for gender identification in short-time analysis (20 msec) using the vowel phonemes; in particular, there was no error for the vowel “a”. A speech duration of 500 msec is enough to recognize male and female speakers with an accuracy of more than 93% in text-independent analysis. Automatic assessment of a speaker’s gender from her/his voice has been an important aspect of achieving high-quality dialogue systems.
Mahdi and Jafer [11] have suggested a wavelet-based algorithm for voiced and unvoiced classification of speech segments. The classification process involves two steps:
1. Statistical analysis of the energy-frequency distribution of the different speech signals by means of the wavelet transform.
2. Evaluation of the short-time zero-crossing rate of the signal.
For each time segment of the pre-emphasized speech, they have also calculated the ratio of the average energy in the low-frequency wavelet sub bands in
Gender Classification in Speech Recognition using Fuzzy Logic and Neural Network 479
comparison to that of the highest-frequency wavelet sub band by using a 4-level dyadic wavelet transform, and have then compared it to a pre-determined threshold. An experimentally confirmed criterion that depends upon the result of this comparison was used to obtain the classification decision.
Silovsky and Nouza [22] have presented a set of methods to categorize various audio segments in a system for automatic transcription of broadcast programs. Their task is to decide:
1. Whether the segment should be labeled as speech or as non-speech, and, in the former case,
2. Whether the talking person is one of the speakers in the database,
3. Or else, which gender the speaker belongs to.
The result of the classification was used to extend the information obtained from the transcription system and to improve the performance of the speech recognition module. Like all other modern speaker recognition systems, their proposed method is based on GMM. Since the number of database speakers can be large, they have developed a method that significantly accelerates the recognition process.
Reviewing these recent research works, which discuss the same problem, we find that most of them consider pitch as the feature, while some others consider other statistical features. For testing, some works use emotional speech, some use continuous or real-time speech data, and others use speech datasets of arbitrary words.
3. Gender Classification using Fuzzy Logic and Neural Network
Gender classification plays a major role in speech processing. This technique is used to identify the gender of the speaker. There are various methods used for gender classification, but the major problem is that most of these works depend on the pitch value. The pitch mainly depends on the frequency of the sound. Normally the pitch of a female voice is high and the pitch of a male voice is low. In some cases, however, the pitch of a male speaker is as high as that of a female speaker, and the pitch of a female speaker is as low as that of a male speaker. In this situation, classification using pitch will not produce appropriate results. Considering this drawback, we propose a new method for speech classification using three features, namely energy entropy, short time energy and zero crossing rate. Initially the three feature values are computed and given as input to the fuzzy logic and the neural network individually, and each gives the percentage of male and female features as output. Then the mean value is taken, and gender classification is done using this value. The process of the proposed method is explained briefly in the sections below. Initially, we have discussed the features which are used in our method.
3.1. Feature Analysis for Speech Signals
Feature selection plays an important role in gender classification. The gender classification depends fully on the features selected in the proposed method. The three features used in our method are as follows:
• Short Time Energy (STE).
• Zero Crossing Rate (ZCR).
• Energy Entropy (EE).
Among these three features, the most important feature is ZCR. These features are explained briefly in [2]. The basic operation of these three features is now described one by one.
3.1.1. STE
The STE of a speech signal is described as the sudden increase in the energy of the signal. To compute STE, initially the signal is split into s windows and then the window function is calculated for each window. The STE is calculated using the equation given below.
$$S = \sum_{r=-\infty}^{\infty} y(r)^{2}\, h(s - r) \qquad (1)$$
By using the above equation the STE is calculated. From the testing results we have observed that the short time energy output for males is low, whereas for females it is high and continuous.
3.1.2. ZCR
The ZCR is the most important feature considered in our method. The ZCR is defined as the ratio of the number of time-domain zero crossings to the frame length. Equation 2 shows the formula to calculate the zero crossing rate.
$$Z = \frac{1}{2N} \sum_{i=1}^{N-1} \left| \operatorname{sgn}\{x(i)\} - \operatorname{sgn}\{x(i-1)\} \right| \qquad (2)$$
where sgn{x(i)} stands for the sign function, i.e.
$$\operatorname{sgn}\{x(i)\} = \begin{cases} 1, & x(i) > 0 \\ 0, & x(i) = 0 \\ -1, & x(i) < 0 \end{cases} \qquad (3)$$
By using the above equation the ZCR for each signal is calculated. From the testing results we observed that the ZCR for female speech is higher than that of male speech.
3.1.3. EE
EE in a speech signal is defined as the sudden changes in the energy level of the speech signal. To calculate EE, initially the speech signal is split into k frames and then the normalized energy for each frame
480 The International Arab Journal of Information Technology, Vol. 10, No. 5, September 2013
is evaluated. The formula to calculate energy entropy is given below:
$$E = -\sum_{i=0}^{k-1} \sigma^{2} \log_{2}\left(\sigma^{2}\right) \qquad (4)$$
where σ² is the normalized energy.
By using the above equation the EE is computed. From the testing results we have observed that the energy entropy for males is low and distributed, while for females it is high and remains for a short period.
The features used in our method are explained in the above sections. The next process is to identify the percentage of male and female features present in the given speech signal using fuzzy logic and a neural network.
3.2. Identifying Male and Female Feature using Fuzzy Logic
Fuzzy logic offers several unique features that let it produce better results in many control problems [5]. Here fuzzy logic is used to calculate the percentage of male and female features present in the given speech signal. Generally fuzzy logic consists of three important steps: fuzzification, generation of fuzzy rules and defuzzification. In the fuzzification process the system data is converted into fuzzy data; a triangular membership function is used for this. The next process is generating the fuzzy rules. Figure 1 shows the structure of the fuzzy logic used in the proposed method, with 3 input variables and one output variable.
After generation of the fuzzy rules, the next step is to train the fuzzy logic. The fuzzy logic is trained using the rules shown in Table 1. To train the fuzzy logic, training datasets have to be generated. The input training dataset is generated as {[Emax,Emin], [Smax,Smin], [Zmax,Zmin]}. After training is complete, the fuzzy logic is ready for practical operation. In testing, when the E, S and Z values are given as input, the fuzzy logic outputs whether the features belong to male or female.
Table 1. Fuzzy rules.
S. No  Fuzzy Rules for Gender Classification
1   if E=high and S=low and Z=low, then Male
2   if E=high and S=low and Z=medium, then Female/Male
3   if E=high and S=low and Z=high, then Female
4   if E=high and S=medium and Z=low, then Female/Male
5   if E=high and S=medium and Z=medium, then Female
6   if E=high and S=medium and Z=high, then Female
7   if E=high and S=high and Z=low, then Female
8   if E=high and S=high and Z=medium, then Female
9   if E=high and S=high and Z=high, then Female
10  if E=medium and S=low and Z=low, then Male
11  if E=medium and S=low and Z=medium, then Male
12  if E=medium and S=low and Z=high, then Female
13  if E=medium and S=medium and Z=low, then Female/Male
14  if E=medium and S=medium and Z=medium, then Female/Male
15  if E=medium and S=medium and Z=high, then Female/Male
16  if E=medium and S=high and Z=low, then Female/Male
17  if E=medium and S=high and Z=medium, then Female/Male
18  if E=medium and S=high and Z=high, then Female
19  if E=low and S=low and Z=low, then Male
20  if E=low and S=low and Z=medium, then Male
21  if E=low and S=low and Z=high, then Male
22  if E=low and S=medium and Z=low, then Male
23  if E=low and S=medium and Z=medium, then Male
24  if E=low and S=medium and Z=high, then Female/Male
25  if E=low and S=high and Z=low, then Female/Male
26  if E=low and S=high and Z=medium, then Female/Male
27  if E=low and S=high and Z=high, then Female
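As an illustration only, the feature definitions in Equations (1), (2) and (4) and the rule base of Table 1 can be sketched in Python with NumPy. This is not the authors' implementation. All function names are hypothetical. The window h passed to the STE function, the near-equal frame split for EE, the placement of the triangular membership functions (peaks at the minimum, midpoint and maximum of each feature range) and the weighted-vote defuzzification are assumptions made for the sketch, since the paper does not specify them.

```python
from itertools import product

import numpy as np


def short_time_energy(y, h):
    """Eq. (1): S(s) = sum_r y(r)^2 * h(s - r), a convolution of y^2 with h."""
    return np.convolve(np.asarray(y, dtype=float) ** 2, h, mode="full")


def zero_crossing_rate(x):
    """Eq. (2): Z = (1 / 2N) * sum_{i=1}^{N-1} |sgn x(i) - sgn x(i-1)|."""
    signs = np.sign(np.asarray(x, dtype=float))  # np.sign behaves like Eq. (3)
    return float(np.abs(np.diff(signs)).sum() / (2 * len(signs)))


def energy_entropy(x, k):
    """Eq. (4): E = -sum sigma^2 * log2(sigma^2) over k frames."""
    frames = np.array_split(np.asarray(x, dtype=float), k)
    energy = np.array([np.sum(f ** 2) for f in frames])
    sigma2 = energy / energy.sum()   # normalized frame energy (signal must not be silent)
    sigma2 = sigma2[sigma2 > 0]      # a zero-energy frame contributes nothing
    return float(-np.sum(sigma2 * np.log2(sigma2)))


LEVELS = ("low", "medium", "high")
# Table 1 in row order: E = high, medium, low; inside each E block, S and then Z
# run low, medium, high. M = Male, F = Female, B = Female/Male.
OUTCOMES = "MBFBFFFFF" + "MMFBBBBBF" + "MMMMMBBBF"
RULES = dict(zip(product(LEVELS[::-1], LEVELS, LEVELS), OUTCOMES))


def triangular(x, a, b, c):
    """Triangular membership: 0 at the feet a and c, 1 at the peak b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


def fuzzify(v, vmin, vmax):
    """Degrees of low/medium/high for v on [vmin, vmax] (assumed layout)."""
    mid = (vmin + vmax) / 2
    v = min(max(v, vmin), vmax)  # clamp out-of-range inputs
    return {
        "low": triangular(v, vmin, vmin, mid),
        "medium": triangular(v, vmin, mid, vmax),
        "high": triangular(v, mid, vmax, vmax),
    }


def male_share(e, s, z, ranges):
    """Assumed defuzzification: rule-strength-weighted vote in [0, 1]."""
    mu_e, mu_s, mu_z = (fuzzify(v, *ranges[n]) for v, n in zip((e, s, z), "ESZ"))
    vote = {"M": 1.0, "B": 0.5, "F": 0.0}
    num = den = 0.0
    for (le, ls, lz), outcome in RULES.items():
        strength = min(mu_e[le], mu_s[ls], mu_z[lz])  # AND of the three conditions
        num += strength * vote[outcome]
        den += strength
    return num / den


# Placeholder (min, max) pairs; real values would come from the training data.
ranges = {"E": (0.0, 1.0), "S": (0.0, 1.0), "Z": (0.0, 1.0)}
share = male_share(0.0, 0.0, 0.0, ranges)  # all three inputs "low": rule 19, Male
```

The `ranges` dictionary plays the role of the {[Emax,Emin], [Smax,Smin], [Zmax,Zmin]} training description, stored here as (min, max) pairs. A share near 1 indicates male-like features and a share near 0 female-like features under Table 1; how the authors map rule outputs to a percentage is not stated, so the 1, 0.5, 0 votes are a placeholder choice.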
datasets with 20 signals each.
Table 2. Performance analysis.
Metric  Data Set  Proposed Method  FL    NN     NB    Using Pitch
SP      1         0.8              1     0.8    0     0.1
SP      2         0.8              1     0.8    0     0.2
SP      3         0.8              1     0.8    0     0.1
SP      4         0.7              1     0.7    1     0.2
SE      1         0.5              0     0.5    0.5   1
SE      2         0.4              0     0.4    1     1
SE      3         0.3              0     0.3    1     1
SE      4         0.3              0     0.3    0     1
TP      1         5                0     5      5     10
TP      2         4                0     4      10    10
TP      3         3                0     3      10    10
TP      4         3                0     3      0     10
TN      1         8                10    8      0     1
TN      2         8                10    8      0     2
TN      3         8                10    8      0     1
TN      4         7                10    7      10    2
FP      1         2                0     2      10    9
FP      2         2                0     2      10    8
FP      3         2                0     2      10    9
FP      4         3                0     3      0     8
FN      1         5                10    5      5     0
FN      2         6                10    6      0     0
FN      3         7                10    7      0     0
FN      4         7                10    7      10    0
α       1         0.2              0     0.2    1     0.9
α       2         0.2              0     0.2    1     0.8
α       3         0.2              0     0.2    1     0.9
α       4         0.3              0     0.3    0     0.8
β       1         0.5              1     0.5    0.5   0
β       2         0.6              1     0.6    0     0
β       3         0.7              1     0.7    0     0
β       4         0.7              1     0.7    1     0
LRP     1         2.5              0     2.5    0.5   1.11
LRP     2         2                0     2      1     1.25
LRP     3         1.5              0     1.5    1     1.11
LRP     4         1                0     1      0     1.25
LRN     1         0.625            1     0.625  0     0
LRN     2         0.75             1     0.75   0     0
LRN     3         0.875            1     0.875  0     0
LRN     4         1                1     1      1     0
Acc     1         0.65             0.5   0.55   0.25  0.5
Acc     2         0.6              0.5   0.6    0.5   0.55
Acc     3         0.55             0.5   0.5    0.5   0.5
Acc     4         0.5              0.5   0.5    0.5   0.6
Pre     1         0.714            0     0.714  0.33  0.526
Pre     2         0.667            0     0.667  0.5   0.55
Pre     3         0.6              0     0.6    0.5   0.526
Pre     4         0.5              0     0.5    0     0.55
FL: fuzzy logic; NN: neural network; NB: Naive Bayes.
Table 2 shows the performance of the proposed method and of the other methods, namely fuzzy logic, neural network, Naive Bayes and the pitch-based method, for various performance parameters. From the table, the accuracy of the proposed method on datasets 1 to 3 (0.65, 0.6 and 0.55) is at least as high as that of fuzzy logic, neural network, Naive Bayes and the pitch-based method; on dataset 4 the pitch-based method scores 0.6 against 0.5 for the others. The graph of accuracy, specificity, sensitivity and precision values obtained
Figure 3. Comparison graph for accuracy vs dataset.
Figure 4. Comparison graph for specificity vs dataset.
Figure 5. Comparison graph for sensitivity vs dataset.
Figure 6. Comparison graph for precision vs dataset.
Figures 3, 4, 5 and 6 show the accuracy, specificity, sensitivity and precision vs dataset graphs, respectively, for the proposed method, fuzzy logic, neural network, Naive Bayes and the pitch-based method. From the above graphs it is clear that the proposed method is better than the other methods.
Figure 7 shows the membership function used for training the fuzzy logic, and Figures 8, 9 and 10 show the performance, regression and training graphs, respectively, obtained during the training of the neural network.
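The derived rows of Table 2 appear to follow from the four counts TP, TN, FP and FN. The sketch below is a hedged reading of the table rather than the authors' code: the function name is hypothetical, α and β are interpreted as the false positive and false negative rates, LRP and LRN as the positive and negative likelihood ratios, and a ratio with a zero denominator is reported as 0 because that is what the zero entries in the table suggest.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Recompute the ratio measures of Table 2 from the four counts."""

    def ratio(num, den):
        # Assumption inferred from Table 2: undefined ratios are reported as 0.
        return num / den if den else 0.0

    sp = ratio(tn, tn + fp)  # specificity
    se = ratio(tp, tp + fn)  # sensitivity
    return {
        "SP": sp,
        "SE": se,
        "alpha": ratio(fp, fp + tn),   # false positive rate
        "beta": ratio(fn, fn + tp),    # false negative rate
        "LRP": ratio(se, 1 - sp),      # positive likelihood ratio
        "LRN": ratio(1 - se, sp),      # negative likelihood ratio
        "Acc": ratio(tp + tn, tp + tn + fp + fn),
        "Pre": ratio(tp, tp + fp),     # precision
    }


# Proposed method, dataset 1 in Table 2: TP = 5, TN = 8, FP = 2, FN = 5.
proposed_ds1 = confusion_metrics(5, 8, 2, 5)
```

With these counts the sketch gives SP 0.8, SE 0.5, α 0.2, β 0.5, LRP 2.5, LRN 0.625, Acc 0.65 and Pre 0.714 (rounded), which agrees with the proposed-method column for dataset 1. Recomputing a column this way is a quick consistency check on the reported values.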
Figure 8. Performance graph obtained during neural network training.
Figure 9. Regression graph obtained during neural network training.
Figure 10. Training graph obtained during neural network training.
5. Conclusions
In this paper, a novel gender classification technique in speech processing using neural network and fuzzy logic was proposed. In this technique gender classification is performed by considering three different features: energy entropy, short time energy and zero crossing rate.
References
[1] Devi M., Kasthuri N., and Natarajan A., “Performance Comparison of Noise Classification using Intelligent Networks,” International Journal of Electronics Engineering, vol. 2, no. 1, pp. 49-54, 2010.
[2] Gomathy M., Meena K., and Subramaniam K., “Gender Grouping in Speech Recognition using Statistical Metrics of Pitch Strength,” European Journal of Scientific Research, vol. 61, no. 4, pp. 524, 2011.
[3] Gudi A., Shreedhar H., and Nagaraj H., “Signal Processing Techniques to Estimate the Speech Disability in Children,” IACSIT International Journal of Engineering and Technology, vol. 2, no. 2, pp. 169-176, 2010.
[4] Gudi A. and Nagaraj H., “Optimal Curve Fitting of Speech Signal for Disabled Children,” International Journal of Computer Science & Information Technology, vol. 1, no. 2, pp. 99-107, 2009.
[5] Haider T. and Yusuf M., “A Fuzzy Approach to Energy Optimized Routing for Wireless Sensor Network,” The International Arab Journal of Information Technology, vol. 6, no. 2, pp. 179-188, 2009.
[6] Haraty R. and Ariss O., “CASRA+: A Colloquial Arabic Speech Recognition Application,” American Journal of Applied Sciences, vol. 4, no. 1, pp. 23-32, 2007.
[7] Haraty H. and Ghaddar C., “Arabic Text Recognition,” The International Arab Journal of Information Technology, vol. 1, no. 2, pp. 156-163, 2004.
[8] Hasegawa Y. and Hata K., “Non-Physiological Differences between Male and Female Speech: Evidence from the Delayed F0 fall Phenomenon in Japanese,” in Proceedings of the International Conference on Spoken Language Processing, Japan, pp. 1179-82, 1994.
[9] Hasegawa Y. and Hata K., “The Function of F0-Peak Delay in Japanese,” in Proceedings of the 21st Annual Meeting of the Berkeley Linguistics Society, pp. 141-151, 1995.
[10] Kotti M. and Kotropoulos C., “Gender Classification in Two Emotional Speech Databases,” in Proceedings of the 19th International Conference on Pattern Recognition, Tampa, pp. 1-4, 2008.
[11] Mahdi A. and Jafer E., “Two-Feature Voiced/Unvoiced Classifier Using Wavelet Transform,” The Open Electrical and Electronic Engineering Journal, vol. 2, no. 1874-3005, pp. 8-13, 2008.
[12] McAulay R. and Quatieri T., “Speech Processing Based on a Sinusoidal Model,” The Lincoln Laboratory Journal, vol. 1, no. 2, pp. 153-168, 1988.
[13] Othman A. and Riadh M., “Speech Recognition using Scaly Neural Networks,” in Proceedings of World Academy of Science, Engineering and Technology, vol. 38, pp. 253-258, 2008.
[14] Patel I. and Rao S., “Speech Recognition using HMM with MFCC- an Analysis using Frequency Specral Decomposion Technique,” Signal & Image Processing : An International Journal, vol. 1, no. 2, pp. 101-110, 2010.
[15] Qi Y. and Hunt B., “Voiced-Unvoiced-Silence Classifications of Speech using Hybrid Features and a Network Classifier,” IEEE Transactions on Speech and Audio Processing, vol. 1, no. 2, pp. 250-255, 1993.
[16] Rakesh K., Dutta S., and Shama K., “Gender Recognition using Speech Processing Techniques in LABVIEW,” International Journal of Advances in Engineering & Technology, vol. 1, no. 2, pp. 51-63, 2011.
[17] Rao R. and Prasad A., “Glottal Excitation Feature Based Gender Identification System using Ergodic HMM,” International Journal of Computer Applications, vol. 17, no. 3, pp. 31-36, 2011.
[18] Rodger J. and Pendharkar P., “A Field Study of the Impact of Gender and User’s Technical Experience on the Performance of Voice-Activated Medical Tracking Application,” International Journal of Human-Computer Studies, vol. 60, no. 2, pp. 529-544, 2004.
[19] Sedaaghi M., “A Comparative Study of Gender and Age Classification in Speech Signals,” Iranian Journal of Electrical & Electronic Engineering, vol. 5, no. 1, pp. 1-12, 2009.
[20] Shue Y. and Iseli M., “The Role of Voice Source Measures on Automatic Gender Classification,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, pp. 4493-4496, 2008.
[21] Sigmund M., “Gender Distinction using Short Segments of Speech Signal,” International Journal of Computer Science and Network Security, vol. 8, no. 10, pp. 159-162, 2008.
[22] Silovsky J. and Nouza J., “Speech, Speaker and Speaker’s Gender Identification in Automatically Processed Broadcast Stream,” Radio Engineering Journal, vol. 15, no. 3, pp. 42-48, 2006.
[23] Singh G., Junghare A., and Chokhani P., “Multi Utility E-Controlled Cum Voice Operated Farm Vehicle,” International Journal of Computer Applications, vol. 1, no. 13, pp. 109-113, 2010.
[24] Vesicle, available at: http://vesicle.nsi.edu/users/patel/download.html, last visited 2002.
[25] Zanuy F., McLaughlin S., Esposito A., Hussain A., Schoentgen J., Kubin G., Kleijn W., and Maragos P., “Non-Linear Speech Processing: Overview and Applications, Control & Intelligent Systems,” ACTA Press, vol. 30, no. 1, pp. 1-10, 2002.
[26] Zengi Y., Wu Z., Falk T., and Chan W., “Robust GMM Based Gender Classification using Pitch and Rasta-PLP Parameters of Speech,” in Proceedings of the 5th International Conference on Machine Learning and Cybernetics, Dalian, pp. 13-16, 2006.
Kunjithapatham Meena received her MSc, M.Phil, ME in computer science and engineering, MIE, PhD. She is the vice-chancellor of Bharathidhasan University. She is the principal and director MBA and MCA of Shrimathi Indira Gandhi College, Trichirapalli. She has rich experience in the development of software tools for the assessment of specially abled children. She also provides consultancy for organizing specific programmes for creating awareness/literacy about the computer and information technology among specific cross-sections of the society (Co-ordinator of the novel project IT ON WHEELS-from Lab to Land), and provides counseling for higher education, career placement and training.
Kulumani Subramaniam received his BSc, MSc degree in maths, MA (English), MEd, MSc (IT) and PhD degree in maths and computer applications from Madras, Annamalai University, Madurai and Bharathidhasan Universities, Tamil nadu, India in the years 1966, 1969, 1982, 1977, 1983, 2009 and 2003 respectively. From 1969 to 2007 he has been an educationist for mathematics, English, educational technology and computer applications as lecturer and professor. He has headed the Department of Master of Computer Applications, Shrimathi Indira Gandhi College, Trichy-2, from 2007 to 2010.