7, JULY 2015
Abstract—A critical challenge to automatic language identification (LID) is achieving accurate performance with the shortest possible speech segment in a rapid fashion. The accuracy with which the spoken language is identified is highly sensitive to the duration of speech and is bounded by the amount of information available. The proposed approach for rapid language identification transforms the utterances to a low dimensional i-vector representation upon which language classification methods are applied. In order to meet the challenges involved in rapidly making reliable decisions about the spoken language, a highly accurate and computationally efficient framework of i-vector extraction is proposed. The LID framework integrates the approach of universal background model (UBM) fused total variability modeling. UBM-fused modeling yields the estimation of a more discriminant, single i-vector space. This way, it is also a computationally more efficient alternative to system level fusion. A further reduction in equal error rate is achieved by training the i-vector model on long duration speech utterances and by the deployment of a robust feature extraction scheme that aims to capture the relevant language cues under various acoustic conditions. Evaluation results on the DARPA RATS data corpus suggest the potential of performing successful automated language identification at the level of one second of speech or even shorter duration.

Index Terms—I-vector, noise robustness, rapid language identification, short-duration speech, total variability modeling, universal background model (UBM) fusion.

I. INTRODUCTION

languages, a steep performance drop was observed when the duration of the audio excerpts was decreased from 6 to 1 seconds. An average human accuracy of 49.7%, 37.4% and 20.7% was reported for speech utterances with a duration of 6, 2 and 1 seconds, respectively.

Over the years, several machine learning approaches have been proposed to develop algorithms for automated language identification. Automated LID systems could have potential advantages over humans in that they can be trained much faster and on a larger number of languages simultaneously. Moreover, these systems can provide valuable support to humans in retrieving the spoken language, especially when the language is unknown to them. With the expanding internationalization and continuing growth of technology-driven public and commercial services, the implementation of robust, fast and accurate language identification systems can improve the customer experience in many ways. Examples include technologies for security and defense applications [2], and multi-language translation devices where LID serves as a front-end to discover the input language before the appropriate translation process can be initiated. When used in emergency call centers, the impact of LID systems could be crucial by rapidly dispatching calls in order to make the operator’s responses more effective. This paper aims to specifically contribute to the domain of Rapid Language Identification in which reliable decisions about the spoken language need to be made quickly with as few seconds of speech as possible.
session aspects and background noises in the data. This way, the JFA retains a subspace that ideally captures the variability of the desired factor of interest, e.g. in the present case, the spoken language identity. Although originally applied to the problem of speaker verification, the factor analysis formulation can be easily generalized to language identification.

The method of JFA has led to its successful variant, namely total variability or i-vector modeling, which was introduced in [13] and has since become popular due to its excellent performance, reduced complexity and small model size. The success of performing LID in the i-vector framework has been shown in [14]–[17]. A review of GMM-based LID will be provided in Section II.

This work contributes to the challenging problem of rapid language identification (RLID) where decisions about the spoken language need to be made on short duration utterances as well as in a real-time mode to enable online or interactive processing. Achieving a high accuracy on a small amount of speech data is the main goal of this paper. Therefore, we also attempt to address a fundamental question in the field of RLID: what is the minimum amount of data we need in order to make reliable decisions about the language? When the utterance duration is in the order of a few seconds or shorter, key informational cues such as the prosodic patterns, vocabulary or grammatical structure tend to become too faint to be algorithmically extractable. Furthermore, with a decreasing number of phones expected in a word-based context, the language models of phonotactic-based approaches do not always provide sufficient discriminating rules to distinguish between languages. Moreover, inconsistency in phone recognition due to mismatched conditions between training and testing will have a relatively greater impact on language identification accuracy. In GMM-based LID, the statistical probabilities derived from the acoustical representation of the utterance will be accumulated over time and propagated until the final classification stage. Although their performance also increases when more statistics can be accumulated, these systems tend to be more robust on short utterances as they do not rely on rule-based approaches applied on phonetic transcription. Various attempts at accurate language identification on short-duration sentences using the i-vector framework have been proposed in [15], [18]–[22].

Section III presents a novel algorithmic framework for rapid language identification and describes the proposed technological advances to improve robustness and performance. To deploy rapid LID in real-life applications, it is desirable that the computational time required for making the decision about the language is small. To address the computational requirements, the proposed RLID approach adopts the simplified i-vector framework proposed in [23] and exploits the computational benefits of UBM-fused total variability modeling [24]. In the simplified i-vector approach, the complexity of i-vector extraction is drastically reduced by defining a well-chosen prenormalization of the first order Baum-Welch statistics allowing fewer computations in the factor analysis. UBM-fused total variability modeling is a novel technique that combines multiple UBMs trained on diverse feature representations into a single combined UBM, with the goal of making the extracted Baum-Welch statistics more equally distributed along the UBM Gaussian components, assuring an estimation of the i-vector space that is more discriminant between the language classes. Therefore, a multi-feature extraction scheme is designed that is robust to changing acoustic conditions in the recording environment. The paper also contributes to reducing the computational complexity of conventional language identification systems where accuracy and robustness improvements are obtained by applying system level fusion. Instead of training, evaluating and combining the output probabilities of multiple systems, the proposed UBM-fused LID system only requires the estimation of a single i-vector space. We will show that UBM-fusion achieves a better accuracy and is computationally less complex as compared to conventional system level fusion.

The RLID system will be evaluated in Section V on the LID data corpus collected by the Linguistic Data Consortium (LDC) under the DARPA Robust Automatic Transcription of Speech (RATS) program. The main goal of the RATS program is to accurately separate the target speech from interfering background sources, to identify the language and the speaker, and to apply keyword detection on a data corpus that consists of highly degraded speech recordings. The RATS data collection contains conversational telephone recordings that were retransmitted through eight different noisy radio communication channels [25]. In this work, the challenging requirements of the RATS LID task were further compounded by the constraints of performing language identification on speech utterances with durations shorter than the 3 seconds utterance duration, while maintaining a high level of robustness and keeping the computational demands low. When tested on short speech utterances of 5, 3 and 1 second duration, an equal error rate of respectively 6.61%, 8.36% and 14.49% is achieved, demonstrating the potential of rapid automated language identification (as well as the current limits of performance).

This paper concludes in Section VI by summarizing the presented work and suggesting future directions in the domain of RLID.

II. GMM-BASED LANGUAGE IDENTIFICATION

This section provides a review of language identification based on acoustic representations of spoken languages. It requires the training of a prior model that forms an adequate basis for representing the variability present in speech signals, the adaptation of this model into a low-dimensional subspace to discriminate between the target languages, the projection of speech utterances onto this subspace, and finally a classification strategy applied on the parameter representation of the utterances.

A. Universal Background Modeling

The first step in training a language-specific model is to train a general prior model that represents a generic language-independent, statistical distribution of the underlying acoustical characteristics captured by the feature vectors extracted from various languages. Research on acoustic modeling has shown the success of Gaussian Mixture Models (GMM) as probabilistic models that are able to adequately represent the acoustic variability of speech. The prior model that is deployed in our language identification system is a GMM composed of a number of
mixture components, and trained on all available training data. This GMM is commonly referred to as the Universal Background Model (UBM). The UBM plays a fundamental role in achieving the desired language classification accuracy of the final system. Adaptation techniques against the target language data are applied to the UBM in order to discriminate between the languages of interest.

A well-designed UBM should take the following considerations into account. Firstly, the UBM should be trained independent of the spoken language. A training set with a balanced amount of data from each language prevents the UBM from being biased toward a specific language. The mixture components of the UBM need to accurately model the subtle acoustic differences between the languages of interest. Furthermore, ideally the UBM should be acoustically matched with the data expected to be observed in the testing phase; this requirement motivates the robust feature extraction proposed in Section III-A. Robust modeling also yields consistency in the set of Gaussian components that are dominant over time for utterances spoken in the same language. As will be shown in Section III-C, a uniform occupancy distribution of the Gaussian components will have a beneficial influence on the performance.

Let us define a UBM composed of Gaussian mixture components, where each mixture component is characterized by a mixture weight, a Gaussian mean and a (diagonal) covariance matrix. Given the training data, Maximum Likelihood Estimation (MLE) [26] is applied to estimate the GMM. Here, the model parameters are iteratively found by means of the Expectation-Maximization (EM) algorithm [27].

B. Total Variability Modeling

The system for Rapid Language Identification (RLID) used in this work uses the total variability or i-vector modeling approach, originally proposed in [13]. Total variability modeling is based on, and motivated by, the technique of Joint Factor Analysis (JFA) [11], [12] and was first applied to the task of speaker verification. The aim of JFA is to jointly capture the desired variability of the predefined signal factor of interest (e.g. gender, speaker, language, etc.), and the undesired session variability originating from other factors, such as the transmission channel, the recording environment or the affective speaker state, to name a few.

Given the Universal Background Model of Section II-A, applying JFA to model spoken languages implies representing each utterance as a language- and session-dependent supervector according to (1), in which a language- and session-independent supervector is constructed by stacking all mean vectors of the Gaussian mixture components of the UBM. Two matrices in (1) can be seen as the eigenspace of the language and the eigenspace of the session, modeling respectively the desired (between-language) and undesired (within-language) variability. A further diagonal matrix contains the residual of the language subspace that is not captured by the language eigenspace. The utterance representation of (1) allows extracting a low-dimensional vector for both the language and the session information that is present in each utterance. These vectors are all normally distributed and referred to as the language factors and the session factors, respectively.

The between-language and within-language variability are ideally captured in eigenspaces that are maximally decorrelated to prevent the loss of important language information in the session eigenspace. However, the assumption of zero mutual information between these subspaces does not hold in practice when real life speech signals are observed [28]. Therefore, an alternative factor analysis framework, namely total variability modeling, was presented in [13] to mitigate this type of information loss by the estimation of a single low-dimensional subspace, i.e., the identity or i-vector space, modeling all variability together.

In the total variability framework, the supervector of equation (1) is now reformulated as

(2)

where the total variability matrix spans a low-dimensional total variability subspace. The utterance is now represented by a normally distributed vector containing the corresponding total factors, commonly referred to as the identity- or i-vector. Note that the probability function of the feature vectors given the i-vector is a Gaussian mixture model with a mean supervector and a super covariance matrix that explains the residual variability not captured in the eigenspace defined by the column vectors of the total variability matrix.

The low-dimensional eigenspace spanned by the i-vectors, i.e., the i-vector eigenspace, yields an intermediate, suboptimal representation of language variability. Hence, post-processing techniques are required to compensate for the undesired variability of the session factors and will be briefly mentioned in Section II-C.

Consider the acoustic feature vector observed at each time frame of an utterance. The i-vectors are then estimated from the acoustic feature representation of the utterance, using the corresponding UBM as a prior. The zeroth order Baum-Welch statistics of each UBM mixture component for the utterance are then given as

(3)

where the sum of the occupancy probabilities is taken over all frames that are present in the utterance. The centralized first order Baum-Welch statistics are computed as

(4)

Rearranging the statistics (3)–(4) over all mixture components, we stack all vectors into a supervector, and we define a diagonal matrix composed of one diagonal block per mixture component, each built from the identity matrix.

The total variability framework (2) can now be restated as

(5)
and associated probability distributions

(6)

Note that the distribution is conditioned on both the i-vector and the UBM.

The total variability matrix is iteratively trained by the EM-algorithm described in [29] for only one factor in the JFA and by considering each training utterance as being produced by a new speaker. The Expectation-step involves the computation of the posterior probability using Bayes’ rule and the priors given in (6). The estimated i-vectors are explained as the expected values of the posterior distribution and given as

(7)

with

(8)

The Maximization-step of the EM algorithm updates the total variability matrix and the supervector covariance matrix such that the global likelihood defined over all training utterances,

(9)

is maximized. The updated matrices are found by linear regression of (9) using the estimated i-vectors (7) as explanatory variables. The total variability matrix is randomly initialized, while the supervector covariance matrix is initialized by the covariance matrices of the UBM. For further algorithmic details on the training procedure, we refer to [11].

Experimental evidence of the success of total variability modeling as compared to JFA was given in [13] for speaker verification, while [16] presents its benefits when applied on language identification. Despite the simplification of the JFA equations, the method of total variability modeling is still computationally expensive. The iterative EM procedure in the training stage and the i-vector extraction during testing are both dominated by the computationally expensive matrix products of equations (7) and (8). This yields a combined complexity that is incurred for each speech signal of any duration that needs to be evaluated to retrieve the spoken language.

C. Language Classification

Total variability modeling transforms each test utterance into a single low-dimensional i-vector representation of fixed dimension. To perform language identification on these utterance i-vectors, various classification techniques have been proposed. Generative modeling approaches [14] attempt to model the language classes by training a Gaussian distribution on the i-vectors, while discriminative methods seek to find the language decision boundaries in the i-vector space. For the latter, a distinction is further made between classifiers that require training, such as Support Vector Machine (SVM) or Neural Networks [21], and direct scoring approaches against a target i-vector for each language, e.g. Cosine Distance Scoring (CDS) [13] and sparse representation classification (SRC) [30].

As mentioned above, the presence of the undesired factors in the variance of the i-vector should be compensated prior to classifier training. Therefore, variability compensation methods such as Within-Class Covariance Normalization (WCCN) [31], Linear Discriminant Analysis (LDA) and Nuisance Attribute Projection (NAP) [10], are typically applied within the i-vector space.

Our previous research on LID [17], [23], also confirmed by the work of [15], [17], has shown that best results are obtained when WCCN is applied prior to training an SVM classifier with polynomial kernel of high order. In this paper, the WCCN feature transformation matrices and the SVM are both trained on the same training set that was used in total variability modeling.

Obtaining a high accuracy at a low computational cost is essential for making rapid and reliable decisions about the spoken language on utterances with short duration. This section gives an overview of recent advances that have been made in the context of the RATS LID project, all of which are steps to enable systems of rapid language identification (RLID). We start by explaining the importance of a robust front-end module, together with the proposal of an acoustic feature set capturing multiple discriminative speech characteristics. Next, we restate the modification to the i-vector modeling that was proposed in [17], [23] to improve LID performance in terms of computational load. This simplified i-vector system is further extended to the framework of UBM-fused total variability modeling [24]. It will be shown that significant improvements in accuracy, while maintaining the system’s complexity, are achieved when the i-vector space is estimated in this framework and by training on utterances with long duration.

A. Robust Feature Extraction

The feature vectors that are extracted from the audio data should capture the acoustic properties that are relevant to discriminate between the languages. Furthermore, deployment of LID in real-life scenarios requires the features to be relatively invariant for a wide range of adverse acoustic conditions, such as variations in the background noise environment and changes in the audio transmission channels or recording devices. To this end, an ideal front-end for language identification contains a Voice Activity Detection (VAD) module to prevent non-speech audio segments from interfering with the classification decision, a speech enhancement method [32] to compensate for noise distortions and a robust feature extraction module followed by a normalization step to further reduce the sensitivity of the features to the acoustic variability. A schematic overview of such a robust front-end for a LID system is shown in Fig. 1.

As discussed in Section II-A, language (identification) learning involves the adaptation of a Universal Background Model that is trained on the chosen feature representation. It is clear that the accuracy of the Baum-Welch statistics (3) and (4) which are derived from the UBM will have an important impact on the entire system performance. Since UBM components are assumed to be Gaussians with diagonal covariance, which keeps their evaluation computationally tractable, the feature components need to be sufficiently decorrelated to ensure accurate modeling.
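To make the role of the Baum-Welch statistics (3) and (4) concrete, the following is a minimal NumPy sketch of how they could be accumulated for one utterance from a diagonal-covariance GMM. The function name `baum_welch_stats` and all argument names are illustrative assumptions and do not come from the paper; the formulas follow the textual description in Section II-B (occupancy probabilities summed over frames, first order statistics centralized around the component means).

```python
import numpy as np


def baum_welch_stats(feats, weights, means, variances):
    """Zeroth order and centralized first order statistics of one utterance.

    feats:     (T, D) acoustic feature frames
    weights:   (C,)   UBM mixture weights
    means:     (C, D) UBM component means
    variances: (C, D) diagonals of the UBM covariance matrices
    """
    # Log-density of every frame under every diagonal Gaussian component.
    diff = feats[:, None, :] - means[None, :, :]                # (T, C, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances[None, :, :], axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
    )                                                           # (T, C)
    log_joint = np.log(weights)[None, :] + log_gauss

    # Occupancy probabilities, normalised per frame in the log domain.
    log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    post = np.exp(log_joint - log_norm)                         # (T, C)

    # Zeroth order: occupancies summed over all frames.
    n = post.sum(axis=0)                                        # (C,)
    # First order, centralized around the component means.
    f = post.T @ feats - n[:, None] * means                     # (C, D)
    return n, f
```

Because each frame distributes a total occupancy of one over the components, the zeroth order statistics sum to the number of frames. This is why short utterances yield sparse statistics and why the text stresses how evenly the occupancy is spread over the components.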
It was shown that a limited quantization error can be assured for a table size of the order of a few hundred entries, which is typically much smaller than the number of utterances in the training set. The table look-up strategy further reduces the complexity in training mode.

As shown in [17], the simplification of the i-vector system slightly reduces the performance of the conventional i-vector baseline. However, the sacrifice in performance is often negligible and tolerated given the measured computation speed increase of more than 100 times compared to the baseline.

C. Accurate Modeling

It was shown in [24] that accurate modeling of the i-vector space is highly related to the extracted UBM Baum-Welch statistics. An improved accuracy of Baum-Welch statistics can be achieved by (i) training on long duration speech utterances, i.e. containing multiple conversational sentences, since they activate more components per utterance and hence accumulate more statistically relevant acoustic language cues, and (ii) by fusion of multiple UBMs that are trained on various feature representations capturing diverse and complementary acoustic information. Hence, both approaches will be exploited in the training stage of the proposed RLID system.

The concept of UBM-fused total variability modeling was recently proposed in [24] where its efficiency, especially when evaluated on short-duration speech utterances, was demonstrated. The technique of UBM-fusion increases the number of dominant (and non-redundant) UBM components by combining various feature representations into a single i-vector model and yields significant gains in classification accuracy without increasing its computational complexity.

The prenormalization strategy of Section III-B allows a straightforward implementation of UBM-fused total variability modeling, which involves the following steps prior to i-vector training:

1) Define the total number of UBMs that will be jointly exploited in the fused i-vector training. In order to be effective, each UBM should be trained on a different acoustical feature representation.

2) Construct the supervector by combined stacking of the (unweighted) first order Baum-Welch statistics of all UBMs:

(11)

3) Reweight each UBM-specific component of (11) according to formula (10), with the summation taken over all UBM components, hence

(12)

4) Initialize the supervector covariance matrix from the covariance matrices of all Gaussian components in (11).

With the above steps, the UBM-fused i-vector modeling framework is then formulated as:

(13)

with probability distributions

(14)

Finally, the system is trained by the EM-procedure of Section II-B, where the i-vectors are found as posterior expectations and computed as:

(15)

(16)

Note that when the UBMs are sized so that the fused UBM has as many components as a single UBM, the computational complexity during i-vector training remains unchanged. Hence, UBM-fused i-vector modeling reduces the total complexity by a factor equal to the number of systems when compared to conventional system level fusion of the same number of systems. The latter requires the estimation of an i-vector space for each system in the fusion, which multiplies the total complexity of i-vector modeling accordingly. In a real time implementation, the multi-feature extraction scheme of the LID front-end extracts features simultaneously during the recording of the utterance, leaving only a few frames unprocessed when the utterance ends. Therefore, an efficiently implemented feature extraction module only adds a negligible computational overhead to the language identification operation that is dominated by the i-vector extraction. During the testing phase, the fused UBM is treated as if it was a single UBM of the same size and hence the total number of computations remains unchanged. Note that system level fusion of multiple systems would have required proportionally more computations during testing.

Experimental evidence for the beneficial impact of long duration training and UBM-fused total variability modeling is given in Section V-B and Section V-E, respectively.

IV. DATA CORPUS

The DARPA Robust Automatic Transcription of Speech (RATS) data corpus [25] was chosen as it allows assessing the robustness and performance of the proposed RLID system under various use conditions and comparing it with many state-of-the-art LID systems that have been recently reported on this corpus. The Linguistic Data Consortium (LDC) collected conversational recordings from public telephone networks of five target languages (Arabic Levantine, Dari, Farsi, Pashto, and Urdu) and 10 non-target languages. The recordings were about 2 minutes long and were retransmitted through eight different radio communication channels using different transmitter and receiver systems. The process of rebroadcasting introduces various aspects of heavy speech degradation such as nonlinear speech distortions and channel noise, frequency shift, band limitation and a variable signal-to-noise ratio (SNR) ranging from 30 dB to values lower than 0 dB. A training and development set of these retransmitted data was distributed by the LDC to all participants of the DARPA RATS program. The official development set, denoted by DEV2, was split into four test sets containing utterances with durations of 120, 30, 10 or 3 seconds.
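The stacking step 2) of the UBM-fusion procedure in Section III-C can be illustrated with a short NumPy sketch. It is only a sketch under stated assumptions: the function name `fuse_statistics` and its arguments are hypothetical, the per-UBM statistics are assumed to have been computed beforehand, and the reweighting of step 3) is deliberately left out because it depends on formula (10) of Section III-B.

```python
import numpy as np


def fuse_statistics(counts_per_ubm, first_order_per_ubm):
    """Stack the per-UBM statistics of one utterance into fused statistics.

    counts_per_ubm:      list of (C_k,) zeroth order statistics, one per UBM
    first_order_per_ubm: list of (C_k, D_k) first order statistics, one per UBM
                         (the feature dimension D_k may differ between UBMs)
    """
    if len(counts_per_ubm) != len(first_order_per_ubm):
        raise ValueError("need one count vector per first order matrix")
    # Occupancy counts of all UBMs, concatenated as if from a single UBM.
    fused_counts = np.concatenate(counts_per_ubm)
    # Flatten each UBM-specific block and stack the blocks into one supervector.
    fused_supervector = np.concatenate(
        [f.ravel() for f in first_order_per_ubm]
    )
    return fused_counts, fused_supervector
```

With four UBMs of 512 components each, as in the experiments of Section V-E, the fused count vector has 2048 entries, the same size as for a single UBM of 2048 components. This is what allows the fused UBM to be treated as one UBM of the same size during testing.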
Fig. 2. Impact of the number of Gaussian UBM components on the EER (in %) for language identification on RATS data. The LID system extracts MFCC features in the front-end, while training and testing is done in the simplified i-vector framework on utterances of 1, 3, 5 or 10 seconds duration.

Fig. 4. EER versus Cavg for the UBM-fused i-vector system evaluated on the 3 seconds TEST set. The system was trained on training utterances of 3, 10 and 30 seconds duration. The trend towards higher accuracies for long-duration training experimentally motivates the discussion related to Fig. 3. Different i-vector dimensions of 200, 400 and 600 were used to tune this value for reporting on the DEV-2 set.

Fig. 5. Demonstration of the positive effect of UBM-fusion on LID performance. The figure shows the improvements in error rates achieved by UBM-fusion of the four feature representations of Section III-A for the RLID system trained on 30 seconds duration utterances.
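The equal error rate (EER) used in the figure captions above is the operating point at which the miss rate on target trials equals the false alarm rate on non-target trials. The sketch below approximates it with a simple threshold sweep over the observed scores. It is an illustrative assumption rather than the scoring tool used for the reported results, and the name `equal_error_rate` is hypothetical.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.unique(np.concatenate([target_scores, nontarget_scores]))
    # Miss: a target trial scored below the threshold.
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False alarm: a non-target trial scored at or above the threshold.
    false_alarm = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # Pick the threshold where both error rates are closest to each other.
    idx = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[idx] + false_alarm[idx])
```

Here a trial is one (utterance, hypothesized language) pair and the score would be the classifier output for that language, for example an SVM output probability.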
with 2048 components trained on the features of Section III-A. The LID system trained on the FuSS features performs 3-6% better compared to the systems trained on standard features. This difference can be explained since the FuSS features capture, besides spectral shape, information about spectro-temporal modulations, voicing and long-term spectral variability.

Feature level fusion has been reported to be efficient when the number of target classes is large, such as in speaker verification. Here, concatenated feature vectors can be processed by a Linear Discriminant Analysis (LDA) step to discriminate the features along the class labels. Due to the low number of language classes, LDA discrimination is not effective for the RATS LID task. Feature fusion was applied by concatenating all four feature streams followed by PCA to decorrelate and reduce the dimensionality to 120 components retaining around 95% of the total variance. The evaluation metrics are shown in row 5 of Table III and illustrate the limitations of feature fusion in our LID system. Similarly to [51], exploiting the diversity of the features by means of stacking does not guarantee performance gains as it could introduce inconsistencies in the fusion that result in inaccurate UBM modeling. When multiple LID systems are available during testing, linear fusion of the individual systems is typically applied instead of feature fusion to further improve performance [14], [15], [17], [21], [32]. In this paper, the fusion is done by running four systems in parallel and computing the linear combination of the SVM output probabilities. The results of system level fusion are given in row 6 of Table III and show a 5-8% relative improvement compared to the best individual system, i.e. the FuSS feature LID system of row 4.

E. UBM-Fusion

The impact of the UBM Baum-Welch statistics on LID performance was experimentally shown in Section V-B by means of long utterance duration training. Another way to improve i-vector space modeling by manipulation of UBM counts is the UBM-fusion technique of Section III-C.

Fig. 5 shows how the DET-curves of the individual LID systems of row 2 of Table III are moved downward, lowering error rates. The dashed lines correspond to the DET-curves for the 3 and 10 seconds duration test set by the RLID system where the i-vector model was derived from the 30 seconds duration training set, using MFCC features and a UBM of 2048 components. The effect of UBM-fusion on the RLID error rate is illustrated by the solid DET-curves of Fig. 5. These curves are obtained by the RLID system where the i-vector training was performed by combining four UBMs of 512 components, each trained on one of the four feature representations of Section III-A, into a single UBM. The explanation of the positive effect of UBM-fusion lies in the estimation of a better, more discriminant i-vector space that is derived from Baum-Welch statistics extracted from an increased number of dominant and diverse UBM components.

The evaluation metrics on the 1, 3, 5 and 10 seconds duration tasks are given in row 7 of Table III. As mentioned in Section III-C, the computational complexity of the UBM-fused RLID system during testing is identical to that of the individual systems, while this is not the case for system level fusion, i.e. the complexity is multiplied by the number of systems in the fusion. It is interesting to observe that UBM-fusion also outperforms system level fusion in LID accuracy, with relative improvements obtained around 13-18% of the best individual LID
