O-COCOSDA 2021, Singapore, Nov 18-20, 2021

WHICH PHONEMES WILL DISTINGUISH THE DIFFERENT REGIONS WITHIN THE SAME DIALECT?
Xuefei Liu (1), Jianhua Tao (1, 2, 3), Yurong Han (1, 4), Chenglong Wang (1, 5), Xueying Zheng (1), Zhengqi Wen (1)
(1) National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing
(2) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing
(3) CAS Center for Excellence in Brain Science and Intelligence Technology, Beijing
(4) Northwest University, Xi’an
(5) University of Science and Technology of China, Hefei
{xuefei.liu, jhtao, chenglong.wang, xueying.zheng, zqwen}@nlpr.ia.ac.cn

ABSTRACT

Finding which phonemes distinguish the different regions within the same dialect is of great significance to the improvement of recognition technology, to national and information security, and to dialect protection. To address this issue, this paper first investigates the fine-grained recognition of different regions of the same dialect, based on the corpus designed and recorded by CASIA (CASIA Dialect Corpus, CASIA DC), and then finds the distinguishing phonemes by the probabilistic accumulation of phonemes method. Based on the i-vector model, the recognition results indicate that the recognition rates of different regions of the same dialect differ from one dialect to another. The recognition rate of the Mandarin dialects is lower than that of other dialects. Through the probabilistic accumulation of phonemes, we find that the phonemes with significant differences can distinguish different regions of the same dialect, which will be valuable for the synthesis and recognition of dialects in the future.

Index Terms: distinguishing phonemes, language recognition, regions of dialects, i-vector model, probabilistic accumulation of phonemes

1. INTRODUCTION

To find which phonemes distinguish the different regions within the same dialect, the first step is language recognition, that is, the recognition of different regions within the same dialect.

The main task of language recognition is to quickly and accurately identify the type of language from a given speech segment [1]. It has numerous applications in the fields of speech recognition, voiceprint recognition, machine translation, communication and information retrieval, which not only brings convenience to our lives but also builds a bridge for communication between different peoples and countries [2, 21]. Current language recognition mainly focuses on the recognition of different languages, as in the Oriental Language Recognition Competition (OLR Challenge) and the Language Recognition Evaluation (LRE) [3]. The OLR Challenge is held every year, while LRE has an evaluation every two years. The main tasks of these competitions are to recognize several languages, such as Oriental languages (Cantonese, Indonesian, Japanese, Chinese…), Arabic, English, Slavic, Iberian, etc. Even when the task is specific to a certain language, such as Chinese, it is to identify different dialects within Chinese. At the same time, the differences in phonetics, vocabulary, grammar, etc. are smaller between different regions of the same dialect than between different dialects, which poses great challenges to the recognition of different regions within the same dialect [4].

Recognition of different types of languages / dialects is important, but the recognition of different areas within the same dialect is also significant. From the perspective of improving technology, the fact that the differences between regions of the same dialect are smaller than those between different dialects poses a challenge to traditional language / dialect recognition technology. To identify different areas of the same dialect accurately, scholars have to propose more refined systems and techniques, which will effectively promote the continuous optimization of language recognition technology and systems.

Recognizing different regions within the same dialect and finding the distinguishing phonemes is also important for national security. By recognizing different areas within a dialect, the dialect can be strongly associated with a geographic location. The finer the dialect classification is, the more precise the position is. Through the investigation of the distribution of Chinese dialects, it is

978-1-6654-0870-7/21/$31.00 ©2021 IEEE 152


found that the same dialect can be distributed across different provinces and cities. For example, the Gan dialect is distributed in Jiangxi, Hubei, Hunan, Fujian and other provinces. But for specific areas of the same dialect, the distribution is clearer. For example, the Changdu region of the Gan dialect is mainly distributed in Jiangxi Province, while the Datong region is mainly distributed in Huangshi City and Xianning City of Hubei Province, and Yueyang City of Hunan Province. By identifying different areas of a dialect, the dialect region can be directly connected with counties and cities, which has greater significance for national and information security.

Refinement of dialect research is also helpful to dialect protection. “How to protect dialects, especially some dialects without words” is an important issue related to national development and culture, in a context where Mandarin Chinese as a common language challenges the survival of dialects. Southern dialects in particular are very complicated: “the pronunciation differs from place to place”. Recognizing different regions within the same dialect and accurately extracting the effective distinguishing characteristics will provide useful clues for speech synthesis and recognition of low-resource dialects.

The rest of this paper is organized as follows: Section 2 introduces related work on dialect recognition, Section 3 summarizes the corpus used in this experiment, Section 4 shows the recognition results with the i-vector model, and Section 5 describes which phonemes are effective in identifying different regions of dialects. Concluding remarks are given in the final section.

2. RELATED WORKS

In the field of language identification, because of a lack of effective data, susceptibility to multiple noises, and less language information, most current recognition systems find it hard to recognize shorter (3-10 s) samples [1]. So current research on language recognition focuses on how to continuously optimize the system to improve recognition performance on shorter speech [7-9,12,14,16], with approaches such as Deep Neural Networks (DNNs) [10,11,15,17], Long Short-Term Memory (LSTM) [5,6], Recurrent Neural Networks (RNNs) [13], Convolutional Deep Neural Networks (CDNNs) [22] and so on.

For dialect recognition, the situation is more complicated: it faces not only the problem of short duration but also the problem of small differences within the same dialect. Examples include the Oriental Language Recognition Challenge (OLR Challenge), jointly organized by Tsinghua University and Speechocean, and the Chinese dialect recognition AI challenge, held by iFLYTEK based on their open dialect database. In these competitions, the sentences are divided into different duration types (1 s, 3 s and the whole sentence), and the task is the recognition of different types of languages. When it comes to dialect recognition, the main task is to identify different dialects, not to recognize their regions. Given that recognition technology for different languages is relatively mature, and that the different regions of the same dialect have strong correlations in phonetics, vocabulary and grammar, this paper first recognizes the different regions within the same dialect based on i-vectors, and then tries to find the phonemes that cause the differences between dialect regions.

3. CORPUS

The corpus used in this paper, the CASIA Dialect Corpus (CASIA DC), was designed and recorded by the State Key Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences (CASIA). The database includes 10 dialects: 4 Mandarin dialects (Jiaoliao, Jilu, Dongbei and Jianghuai Mandarin), Jin dialect, Wu dialect, Hui dialect, Cantonese, Min dialect, and Xiang dialect. The recordings in our corpus are still increasing. To compare the differences between Mandarin dialects and other dialects, 4 dialects (Jilu Mandarin, Jianghuai Mandarin, Wu and Jin) were chosen as our database, consisting of 17 datasets, one per dialect region. To balance the data, 1,000 utterances from 4 speakers (2 males and 2 females) were chosen for each dataset. So the total number of sentences is 17,000. The recording duration was 14.64 hours. The signals were recorded by mobile phones in freestyle, with a sampling rate of 16 kHz and a sample size of 16 bits.

Table 1: Data profile.

Male and female speakers are balanced.
Some regions of dialects have insufficient sentences, so their data is deleted from the overall data.

The sentences selected in this experiment are naturally spoken sentences. After reading the scripts, the speakers are required to express the meaning of the sentence in their own

2021 24th Conference of the Oriental COCOSDA (O-COCOSDA) 153


dialect instead of reading the scripts. If some words are not commonly used in the dialect, they can be replaced, for example:

Bu gua feng bu xia yu de, feng he ri li de jiu ba ren gei zhuo nong le. (It is not windy or raining, people were fooled in such a peaceful, sunny day.)

Because “Feng he ri li (a peaceful, sunny day)” is a written expression, it is not used in some dialects. So some dialects expressed this as:

Ye bu gua feng ye bu xia yu de, jiu zhe me ba ren gei zhuo nong la. (It is neither windy nor rainy, it makes fun of people.)

4. EXPERIMENTS

4.1. Experiment setup

Since the i-vector based language recognition system was proposed, it has been the mainstream method in the field of language recognition because of its lower computational complexity, accuracy, and improved cross-channel language recognition performance [18, 19]. Combined with PLDA as a discriminant model, it has consistently performed well in language recognition tasks. So we adopt the i-vector model with a PLDA classifier in our experiment. The acoustic features that we used are 13 MFCCs with a frame length of 25 ms. Each dataset was split into a training set consisting of 800 utterances and a test set consisting of 200 utterances.

4.2. Baseline Results

The performance is evaluated by the equal error rate (EER) and the recognition rate (IDR) of 4 dialects and 17 regions, which are presented in Table 2.

Table 2: IDR and EER results.

IDR is formally defined as follows:

$$\mathrm{IDR} = \frac{T_c}{T_c + T_i}$$

where $T_c$ and $T_i$ are the numbers of correctly and incorrectly identified utterances, respectively. Specifically, during the training stage, we trained four i-vector models for the four dialects: Jilu Mandarin, Jianghuai Mandarin, Wu dialect and Jin dialect. In the test stage, test data is only tested within a single dialect. For example, Jilu Mandarin is a three-class problem and the Wu dialect is a five-class problem. Table 2 presents the IDR results of each region in the same dialect and the Equal Error Rate of each dialect.

As can be seen from the table, the EER of the Wu and Jin dialects is lower than that of Jilu and Jianghuai Mandarin, which indicates that the performance on these two dialects (85%-90%) is better than on Jilu and Jianghuai Mandarin (about 80%). At the same time, this result also reflects that the differences between the regions of Mandarin are smaller than those in other dialects, which makes these regions harder to tell apart.

To find out which phonemes are important in distinguishing different dialect areas within the same dialect, we conducted an experiment of accumulating phoneme probabilities.

5. PHONEME PROBABILITIES ACCUMULATION

The probabilistic accumulation of phonemes in a language is the probability of the occurrence of each phoneme in the language [20]. From this, the distribution of each phoneme in a certain space can be derived. By accumulating the probabilities of phonemes in different languages, the distribution of the same phoneme in different dialects can be obtained, and this will help find effective phonemes that distinguish different languages.

For the 1,000 sentences of each dialect region, we first calculate the cumulative probability of each phoneme (66 phonemes in total) in each sentence. A normality test showed that these values do not conform to the normal distribution. So the Wilcoxon rank-sum test was adopted to analyze the significant differences between phonemes in different regions of the same dialect. At the same time, the formula (x - min) / (max - min) was used to normalize the accumulated values of phoneme probabilities in each dialect region. The following figures show the phonemes within each dialect that have a significant difference in distinguishing different regions of the same dialect. This test is a pairwise comparison.
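
The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that the per-sentence accumulated value of a phoneme is the sum of frame-level phoneme posteriors, it uses a rank-sum test with a normal approximation written from scratch, the 0.05 significance threshold is hypothetical (the text does not state one), and the toy values stand in for CASIA DC data. All function names are invented for this example.

```python
import math
from itertools import combinations


def accumulate(frames, phonemes):
    """Sum frame-level posteriors per phoneme for one sentence (assumed definition)."""
    return {p: sum(f.get(p, 0.0) for f in frames) for p in phonemes}


def min_max(values):
    """Normalize with (x - min) / (max - min), the formula given in the text."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    # Rank sum of the first sample, compared with its mean under the null
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean = n1 * (n + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 0.0, 1.0
    z = (w - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


def pairwise_significance(regions, phoneme, alpha=0.05):
    """Test one phoneme for every pair of regions; alpha is an assumed threshold."""
    result = {}
    for a, b in combinations(sorted(regions), 2):
        x = [sent[phoneme] for sent in regions[a]]
        y = [sent[phoneme] for sent in regions[b]]
        _, p = rank_sum_test(x, y)
        result[(a, b)] = p < alpha
    return result


def toy_region(s_level, n=30):
    """Hypothetical per-sentence accumulated values for two phonemes."""
    return [{"s": s_level + 0.01 * (i % 7), "a": 0.5 + 0.01 * (i % 5)} for i in range(n)]


regions = {"A": toy_region(0.2), "B": toy_region(0.6), "C": toy_region(0.2)}
for phoneme in ("s", "a"):
    print(phoneme, pairwise_significance(regions, phoneme))
```

In this toy setup, phoneme "s" separates region B from regions A and C but not A from C, while phoneme "a" separates no pair. This mirrors the pairwise reading of the test: a phoneme may distinguish some pairs of regions within a dialect and not others.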



Fig. 1: The phonemes which have significant differences in Jilu Mandarin.

Fig. 2: The phonemes which have significant differences in Jianghuai Mandarin.

Fig. 3: The phonemes which have significant differences in Jin dialect.

Fig. 4: The phonemes which have significant differences in Wu dialect.

In the above figures, the x-axis represents phonemes with significant differences, and the y-axis is the cumulative probability of each phoneme after normalization. There are significant differences between Jilu Mandarin and Jianghuai Mandarin in the comparison between the dialect regions within each dialect. In Jilu Mandarin, the probabilities of “g, h, l” and “x” are higher than those of “c, iz, o, vv”, which indicates that in dialect recognition and synthesis, we could focus on “g, h, l” and “x”. As for Jianghuai Mandarin, the probabilities of all phonemes other than the three phonemes “iz, o, oo” are relatively high, which means more emphasis should be placed on these phonemes in dialect recognition.

The situation in the Jin and Wu dialects is more complicated and needs to be explained separately:

(1) The distribution of phonemes with no significant differences in Jin dialect is as follows:

Table 3: The distribution of phonemes in Jin Dialect without significant difference.



Table 4: The distribution of phonemes in Wu Dialect without significant difference.

A phoneme listed in the table has no significant difference between the corresponding two dialect regions, which indicates that these two dialect areas cannot be distinguished by that phoneme, but the other regions can be identified. In Jin dialect, phoneme “s” has a significant difference in all dialect regions, while in Wu dialect, phoneme “ing” has a significant difference in all regions. So “s” and “ing” would distinguish all the regions in the Jin and Wu dialects, respectively. Otherwise, a combination of phonemes is required to recognize these regions within the same dialect. Comparing these two tables, we can conclude that there are more phonemes with no significant difference in Wu dialect than in Jin dialect, which indicates that Wu dialect is more complicated than Jin dialect.

6. CONCLUSIONS

The CASIA Dialect Corpus (CASIA DC) is a large-scale Chinese dialect corpus that was constructed in 2019. Its original purpose was to provide effective data resource support for dialect recognition and speaker recognition. Based on CASIA DC, this paper conducted fine-grained dialect recognition. An i-vector model with a PLDA classifier was used in the experiments, and the accuracy on different regions of the same dialect differs between dialects. The accuracy on the Wu and Jin dialects is about 85%, and in some cases reaches 90%, while that on Jilu and Jianghuai Mandarin is only about 80%. This indicates that the internal differences in Mandarin dialects are smaller than those in other dialects.

By accumulating the probabilities of phonemes of different dialect regions within each dialect, we can conclude that some phonemes can effectively distinguish different regions within the same dialect, but for a specific dialect, the contributing phonemes should be analyzed individually. This will provide useful clues for future speech synthesis and speech recognition.

This work provides valuable clues for national and information security and dialect protection. In the future, we will continue to expand the CASIA Dialect Corpus and extend the recognition of different regions to other dialects.

7. ACKNOWLEDGEMENTS

This work is supported by the National Key Research & Development Plan of China (No. 2017YFC0820602), the National Natural Science Foundation of China (NSFC) (No.61831022, No.61901473, No.61771472, No.61773379) and Inria-CAS Joint Research Project (No.173211KYSB20170061 and No.173211KYSB20190049).

8. REFERENCES

[1] E. Ambikairajah, H. Li, L. Wang, B. Yin, and V. Sethu, “Language Identification: A Tutorial,” IEEE Circuits and Systems Magazine, pp. 82-108, 2011.
[2] Y. Muthusamy, E. Barnard, and R. Cole, “Reviewing automatic language identification,” Signal Processing Magazine, IEEE, vol. 11, no. 4, pp. 33-41, 1994.
[3] https://www.nist.gov/itl/iad/mig/language-recognition.
[4] H. Qi, and J. Xiao, “The Phonological Features of Yunnan Yongren Dialect,” Science and Innovation, vol. 7, no. 1, pp. 26-30, 2019.
[5] W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, “Insights into end-to-end learning scheme for language identification,” in ICASSP 2018 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, April 15-20, Calgary, Alberta, Canada, Proceedings, 2018.
[6] W. Cai, D. Cai, H. Shen, and M. Li, “Utterance-Level End-to-End Language Identification Using Attention-Based CNN-BLSTM,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, May 12-17, Brighton, UK, Proceedings, 2019.
[7] A. Lozano-Diez, R. Zazo-Candil, J. Gonzalez-Dominguez, D. T. Toledano, and J. Gonzalez-Rodriguez, “An End-to-end Approach to Language Identification in Short Utterances using Convolutional Neural Networks,” in INTERSPEECH 2015 - 16th Annual Conference of the International Speech Communication Association, September 6-10, Dresden, Germany, Proceedings, 2015, pp. 403–407.
[8] P. A. Torres-Carrasquillo, E. Singer, M. A. Kohler, R. J. Greene, D. A. Reynolds, and J. R. Deller, “Approaches to language identification using Gaussian mixture models and Shifted Delta Cepstral features,” in Proceedings of ICSLP, 2002, pp. 89–92.
[9] S. Fernando, V. Sethu, E. Ambikairajah, and J. Epps, “Bidirectional Modelling for Short Duration Language Identification,” in INTERSPEECH 2017 - 18th Annual Conference of the International Speech Communication Association, August 20-24, Stockholm, Sweden, Proceedings, 2017, pp. 2809–2813.
[10] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. J. Moreno, “Automatic Language Identification using Deep Neural Networks,” Acoustics, Speech, and Signal Processing, IEEE International Conference, 2014.
[11] J. Gonzalez-Dominguez, I. Lopez-Moreno, P. J. Moreno, and J. Gonzalez-Rodriguez, “Frame-by-frame language identification in short utterances using deep neural networks,” Neural Networks, vol. 64, pp. 49-58, 2015.
[12] P. Shen, X. Lu, Sh. Li, and H. Kawai, “Feature Representation of Short Utterances based on Knowledge Distillation for Spoken Language Identification,” in INTERSPEECH 2018 - 19th Annual Conference of the International Speech Communication Association, September 2-6, Hyderabad, India, Proceedings, 2018, pp. 1813–1817.
[13] W. Geng, W. Wang, Y. Zhao, X. Cai, and B. Xu, “End-to-end Language Identification using Attention-based Recurrent Neural



Networks,” in INTERSPEECH 2016 - 17th Annual Conference of the International Speech Communication Association, September 8-12, San Francisco, USA, Proceedings, 2016, pp. 2944–2948.
[14] M. Penagarikano, A. Varona, M. Diez, L. J. Rodriguez-Fuentes, and G. Bordel, “Study of Different Backends in a State-Of-the-Art Language Recognition System,” in INTERSPEECH 2012 - 13th Annual Conference of the International Speech Communication Association, September 9-13, Portland, OR, USA, Proceedings, 2012, pp. 2049–2052.
[15] F. Richardson, D. Reynolds, and N. Dehak, “A unified deep neural network for speaker and language recognition,” arXiv preprint arXiv:1504.00923, 2015.
[16] Zhuo. Li, Zh. Gao, H. Wang, J. Liu, and G. Zhu, “An Improvement System for Short-term and Confusing Language Recognition,” Journal of Chinese Information Processing, vol. 33, no. 10, pp. 135-142, 2019.
[17] R. Cui, Y. Song, B. Jiang, and L. Dai, “Language Identification Based on Deep Neural Network,” Pattern Recognition and Artificial Intelligence, vol. 28, no. 12, pp. 1093-1099, 2015.
[18] D. Martinez, O. Plchot, L. Burget, O. Glembek, and P. Matejka, “Language Recognition in iVectors Space,” in INTERSPEECH 2011 - 12th Annual Conference of the International Speech Communication Association, August 28-31, Florence, Italy, 2011, pp. 861–864.
[19] E. Singer, P. A. Torres-Carrasquillo, T. P. Gleason, W. M. Campbell, and D. A. Reynolds, “Acoustic, phonetic and discriminative approaches to automatic language identification,” in Proceedings of Eurospeech (Interspeech), Geneva, Switzerland, 2003, pp. 1345–1348.
[20] P. Schwarz, “Phoneme recognition based on long temporal context,” Ph.D. dissertation, Faculty of Information Technology BUT, http://www.fit.vutbr.cz, Brno, CZ, 2008.
[21] H. Li, B. Ma, and K. A. Lee, “Spoken language recognition: From fundamentals to practice,” in Proceedings of the IEEE, vol. 101, no. 5, pp. 1136-1159, 2013.
[22] Ch. Bartz, T. Herold, H. Yang, and Ch. Meinel, “Language Identification Using Deep Convolutional Recurrent Neural Networks,” in ICONIP 2018 - 25th International Conference on Neural Information Processing, December 13-16, Siem Reap, Cambodia, pp. 880-889, 2017.
