O-COCOSDA 2021, Singapore, Nov 18-20, 2021


Xuefei Liu (1), Jianhua Tao (1, 2, 3), Yurong Han (1, 4), Chenglong Wang (1, 5), Xueying Zheng (1), Zhengqi Wen (1)
(1) National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing
(2) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing
(3) CAS Center for Excellence in Brain Science and Intelligence Technology, Beijing
(4) Northwest University, Xi’an
(5) University of Science and Technology of China, Hefei
{xuefei.liu, jhtao, chenglong.wang, xueying.zheng, zqwen}@nlpr.ia.ac.cn

ABSTRACT

Finding which phonemes distinguish the different regions within the same
dialect is of great significance for improving recognition technology, for
national and information security, and for dialect protection. To address
this issue, this paper first investigates the fine-grained recognition of
different regions of the same dialect, based on a corpus designed and
recorded by CASIA (CASIA Dialect Corpus, CASIA DC), and then finds the
distinguishing phonemes with the probabilistic accumulation of phonemes
method. Based on the i-vector model, the recognition results indicate that
the recognition rates of different regions of the same dialect vary from one
dialect to another. The recognition rate of the Mandarin dialects is lower
than that of the other dialects. Through the probabilistic accumulation of
phonemes, we find that phonemes with significant differences can distinguish
different regions of the same dialect, which will be useful for the
synthesis and recognition of dialects in the future.

Index Terms: distinguishing phonemes, language recognition, regions of
dialects, i-vector model, probabilistic accumulation of phonemes

1. INTRODUCTION

In order to find which phonemes distinguish the different regions within the
same dialect, the first step is language recognition, that is, the
recognition of different regions within the same dialect.

The main task of language recognition is to quickly and accurately identify
the type of language from a given speech segment [1]. It has numerous
applications in speech recognition, voiceprint recognition, machine
translation, communication and information retrieval, which not only brings
convenience to our lives but also builds a bridge for communication between
different peoples and countries [2, 21]. Current language recognition mainly
focuses on the recognition of different languages, as in the Oriental
Language Recognition Challenge (OLR Challenge) and the Language Recognition
Evaluation (LRE) [3]. The OLR Challenge is held every year, while LRE has an
evaluation every two years. The main tasks of these competitions are to
recognize several languages such as Oriental languages (Cantonese,
Indonesian, Japanese, Chinese…), Arabic, English, Slavic, Iberian, etc. Even
when restricted to a certain language, such as Chinese, the recognition task
is still to identify different dialects within Chinese. At the same time,
the differences in phonetics, vocabulary, grammar, etc. between different
regions of the same dialect are smaller than those between different
dialects, which poses great challenges to the recognition of different
regions within the same dialect [4].

Recognition of different types of languages / dialects is important, but the
recognition of different areas within the same dialect is also significant.
From the perspective of improving technology, the fact that the differences
between regions of the same dialect are smaller than those between different
dialects poses a challenge to traditional language / dialect recognition
technology. In order to identify different areas of the same dialect
accurately, scholars have to propose more refined systems and techniques,
which will effectively promote the continuous optimization of language
recognition technology and systems.

Recognizing different regions within the same dialect and finding the
distinguishing phonemes is also important for national security. By
recognizing different areas within a dialect, the dialect can be strongly
associated with a geographic location. The finer the dialect division is,
the more precise the position is. Through the investigation of the
distribution of Chinese dialects, it is
found that the same dialect can be distributed across different provinces
and cities. For example, the Gan dialect is distributed in Jiangxi, Hubei,
Hunan, Fujian and other provinces. For specific areas of the same dialect,
however, the distribution is clearer. For example, the Changdu region of the
Gan dialect is mainly distributed in Jiangxi Province, and the Datong region
is mainly distributed in Huangshi City and Xianning City of Hubei Province
and Yueyang City of Hunan Province. By identifying different areas of a
dialect, a dialect region can be directly connected with counties and
cities, which is of great significance for national and information
security.

Refining dialect research is also helpful for dialect protection. "How to
protect dialects, especially some dialects without words" is an important
issue related to national development and culture, in a context where
Mandarin Chinese as a common language challenges the survival of dialects.
Southern dialects in particular are very complicated: "the pronunciation is
different from place to place". Recognizing different regions within the
same dialect and accurately extracting the effective distinguishing
characteristics will provide useful clues for the speech synthesis and
recognition of low-resource dialects.

The rest of this paper is organized as follows: Section 2 introduces related
work on dialect recognition, Section 3 summarizes the corpus used in this
experiment, Section 4 shows the recognition results with the i-vector model,
and Section 5 describes which phonemes are effective in identifying
different regions of dialects. Concluding remarks are given in the final
section.

2. RELATED WORKS

In the field of language identification, because of a lack of effective
data, susceptibility to various kinds of noise, and limited language
information, most current recognition systems find it hard to recognize
short (3-10 s) samples [1]. Current research on language recognition
therefore focuses on continuously optimizing systems to improve recognition
performance on short utterances [7-9,12,14,16], using models such as Deep
Neural Networks (DNNs) [10,11,15,17], Long Short-Term Memory (LSTM) [5,6],
Recurrent Neural Networks (RNNs) [13], Convolutional Deep Neural Networks
(CDNNs) [22] and so on.

For dialect recognition, the situation is more complicated: systems face not
only the problem of short duration but also the problem of small differences
within the same dialect. Examples include the Oriental Language Recognition
Challenge (OLR Challenge) jointly organized by Tsinghua University and
Speechocean, and the Chinese dialect recognition AI challenge held by
iFLYTEK based on its open dialect database. In these competitions, the
sentences are divided by duration into different types (1 s, 3 s and the
whole sentence), and the tasks are the recognition of different types of
languages. When it comes to dialect recognition, the main task is to
identify different dialects, not to recognize their regions. Given that the
recognition technology for different languages is relatively mature, and
that the different regions of the same dialect are strongly correlated in
phonetics, vocabulary and grammar, this paper first recognizes the different
regions within the same dialect based on i-vectors, and then tries to find
the phonemes that cause the differences between dialect regions.

3. CORPUS

The corpus used in this paper, the CASIA Dialect Corpus (CASIA DC), was
designed and recorded by the State Key Laboratory of Pattern Recognition,
Institute of Automation, Chinese Academy of Sciences (CASIA). The database
includes 10 dialects: 4 Mandarin dialects (Jiaoliao, Jilu, Dongbei and
Jianghuai Mandarin), Jin dialect, Wu dialect, Hui dialect, Cantonese, Min
dialect and Xiang dialect. The number of recordings in the corpus is still
increasing. In order to compare the differences between Mandarin dialects
and other dialects, 4 dialects (Jilu Mandarin, Jianghuai Mandarin, Wu and
Jin) were chosen as our database, consisting of 17 datasets, one per dialect
region. To balance the data, 1,000 utterances from 4 speakers (2 males and 2
females) were chosen for each dataset, so the total number of sentences is
17,000. The total recording duration is 14.64 hours. The signals were
recorded by mobile phones in freestyle, with a sampling rate of 16 kHz and a
sample size of 16 bits.

Table 1: Data profile.

Male and female speakers are balanced.
Some dialect regions have too few sentences, so their data were removed from
the overall data.

The sentences selected in this experiment are naturally spoken sentences.
After reading the scripts, the speakers are required to express the meaning
of the sentence in their own
dialect instead of reading the scripts. If some words are not commonly used
in the dialect, they can be replaced. For example:

Bu gua feng bu xia yu de, feng he ri li de jiu ba ren gei zhuo nong le. (It
is not windy or raining, people were fooled in such a peaceful, sunny day.)

Because "Feng he ri li (a peaceful, sunny day)" is a written expression, it
is not used in some dialects. Speakers of some dialects therefore expressed
this as:

Ye bu gua feng ye bu xia yu de, jiu zhe me ba ren gei zhuo nong la. (It is
neither windy nor rainy, it makes fun of people.)

4. EXPERIMENTS

4.1. Experiment setup

Since the i-vector based language recognition system was proposed, it has
been the mainstream method in the field of language recognition because of
its low computational complexity, its accuracy, and its improved
cross-channel language recognition performance [18, 19]. Combined with PLDA
as a discriminant model, it has consistently performed well in language
recognition tasks. We therefore adopt the i-vector model with a PLDA
classifier in our experiment. The acoustic features are 13 MFCCs with a
frame length of 25 ms. Each dataset was split into a training set of 800
utterances and a test set of 200 utterances.

4.2. Baseline Results

The performance is evaluated by the equal error rate (EER) and the
recognition rate (IDR) of 4 dialects and 17 regions, which are presented in
Table 2.

Table 2: IDR and EER results.

IDR is formally defined as follows:

IDR = T_c / (T_c + T_i)

where T_c and T_i are the numbers of correctly and incorrectly identified
utterances, respectively. Specifically, during the training stage, we
trained four i-vector models for the four dialects: Jilu Mandarin, Jianghuai
Mandarin, Wu dialect and Jin dialect. In the test stage, test data is only
tested within a single dialect. For example, Jilu Mandarin is a three-class
problem and the Wu dialect is a five-class problem. Table 2 presents the IDR
results of each region in the same dialect and the equal error rate of each
dialect.

As can be seen from the table, the EER of the Wu and Jin dialects is lower
than that of Jilu and Jianghuai Mandarin, which indicates that the
performance on these two dialects (85%-90%) is better than on Jilu and
Jianghuai Mandarin (about 80%). At the same time, this result also reflects
that the differences between the different regions of Mandarin are smaller
than those in other dialects, which makes the Mandarin regions harder to
tell apart.

In order to find out which phonemes are important in distinguishing
different areas within the same dialect, we carried out an experiment of
accumulating phoneme probabilities.

5. PHONEME PROBABILITIES ACCUMULATION

The probabilistic accumulation of phonemes in a language is the probability
of occurrence of each phoneme in the language [20]. From this, the
distribution of each phoneme in a certain space can be derived. By
accumulating the probabilities of phonemes in different languages, the
distribution of the same phoneme in different dialects can be obtained,
which helps find effective phonemes that distinguish different languages.

For the 1,000 sentences of each dialect region, we first calculate the
cumulative probability of each phoneme (66 phonemes in total) in each
sentence. A normality test showed that the values do not follow a normal
distribution, so the Wilcoxon rank-sum test was adopted to analyze the
significant differences between phonemes in different regions of the same
dialect. In addition, the formula (x - min) / (max - min) was used to
normalize the accumulated phoneme probabilities in each dialect region.
Figs. 1-4 show the phonemes within each dialect that differ significantly
between different regions of the same dialect. This test is a pairwise
comparison.
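
The normalization and pairwise testing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy values, and the 0.05 significance level are all hypothetical, and the rank-sum test uses the usual normal approximation (average ranks for ties, no tie correction) so that only the Python standard library is needed.

```python
from itertools import combinations
from statistics import NormalDist


def min_max_normalize(values):
    """Scale values to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant input: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test, normal approximation.

    Returns (z, p_value) for the rank sum of the first sample.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


def pairwise_significant(region_values, alpha=0.05):
    """Test every pair of regions for one phoneme (alpha is an assumption)."""
    result = {}
    for a, b in combinations(sorted(region_values), 2):
        _, p = rank_sum_test(region_values[a], region_values[b])
        result[(a, b)] = p < alpha
    return result


# Made-up per-sentence accumulated probabilities of a single phoneme.
toy = {
    "region_A": [0.10, 0.12, 0.11, 0.13, 0.09, 0.14, 0.10, 0.12],
    "region_B": [0.30, 0.28, 0.33, 0.31, 0.29, 0.35, 0.32, 0.27],
    "region_C": [0.11, 0.10, 0.13, 0.12, 0.14, 0.09, 0.12, 0.10],
}
significant = pairwise_significant(toy)
```

Because the rank-sum test depends only on the ordering of the values, applying min-max normalization before or after the test does not change which pairs are flagged; in this sketch the normalization would only matter for plotting.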

Fig. 1: The phonemes which have significant differences in Jilu Mandarin.

Fig. 2: The phonemes which have significant differences in Jianghuai Mandarin.

Fig. 3: The phonemes which have significant differences in Jin dialect.

Fig. 4: The phonemes which have significant differences in Wu dialect.

In the above figures, the x-axis shows the phonemes with significant
differences, and the y-axis is the normalized cumulative probability of each
phoneme. In the comparison between the dialect regions within each dialect,
there are significant differences between Jilu Mandarin and Jianghuai
Mandarin. In Jilu Mandarin, the probabilities of "g, h, l" and "x" are
higher than those of "c, iz, o, vv", which indicates that dialect
recognition and synthesis could focus on "g, h, l" and "x". As for Jianghuai
Mandarin, the probabilities of the other phonemes are relatively high, apart
from the three phonemes "iz, o, oo", which means that much more emphasis
should be placed on these phonemes in dialect recognition.

The situation in the Jin and Wu dialects is more complicated and needs to be
explained separately:

(1) The distribution of phonemes with no significant differences in Jin
dialect is as follows:

Table 3: The distribution of phonemes in Jin Dialect without significant
difference.

Table 4: The distribution of phonemes in Wu Dialect without significant
difference.

will continue to expand the CASIA Dialect Corpus and extend the recognition
of different regions to other dialects.


This work is supported by the National Key Research & Development Plan of
China (No. 2017YFC0820602), the National Natural Science Foundation of China
(NSFC) (No. 61831022, No. 61901473, No. 61771472, No. 61773379) and the
Inria-CAS Joint Research Project (No. 173211KYSB20170061 and

