Deep Learning for Spoken Language Identification
Grégoire Montavon
Machine Learning Group
Berlin Institute of Technology
Germany
gmontavon@cs.tu-berlin.de
Abstract

Empirical results have shown that many spoken language identification systems based on hand-coded features perform poorly on small speech samples where a human would be successful. A hypothesis for this low performance is that the set of extracted features is insufficient. A deep architecture that learns features automatically is implemented and evaluated on several datasets.
1 Introduction
Spoken language identification is the problem of mapping continuous speech to the language it
corresponds to. Applications of spoken language identification include front-ends for multilingual
speech recognition systems, web information retrieval, automatic customer routing in call centers or
monitoring.
Empirical results have shown that many systems based on the manual extraction of acoustic,
prosodic, phonotactic or lexical features have significantly lower performance on small speech sam-
ples than on large speech samples [3, 4], while a human would still be successful. A hypothesis for
this low performance is that the set of extracted features is insufficient [5].
Deep learning potentially addresses this issue by exploring the space of features automatically, by-
passing the traditional phoneme recognition layer and learning instead purely discriminative fea-
tures. A deep architecture is implemented and evaluated on several datasets.
VoxForge This dataset consists of multilingual speech samples available on the VoxForge [9] website. It contains 5-second speech samples associated with various metadata, including the language of the sample. Given that speech samples are recorded by users with their own microphones, quality varies significantly between samples. The dataset contains 25420 English samples, 4021 French samples and 2963 German samples.
RadioStream This dataset consists of samples ripped from several web radios. It has the advantage of containing a virtually infinite number of samples that are, moreover, of excellent quality. However, some samples are outliers, for example, music sequences or interviews in foreign languages. As a consequence, the classification error is lower-bounded by some constant ε ≈ 5%, also known as the Bayes error rate. A possible workaround is to remove outliers manually from the test set; we do not do so because the definition of an "outlier" can be ambiguous in certain cases. We use several web radios, among them KALW, France Info and HR Info.
[Transcripts of the two samples shown in figure 1: "they act as a prism and form a rainbow" and "after forty seven minutes before the home side scored".]
Figure 1: Spectrograms corresponding to a sample from the VoxForge dataset (left) and from the
RadioStream dataset (right). Spectrograms shown here are truncated to 2.25 seconds (270 pixels)
instead of 5 seconds (600 pixels). Spectrograms encode speech with 39 mel-frequencies between 0
and 5 kHz. Quality of spectrograms varies depending on the microphone, the voice of the speaker
and the environmental noise.
[Figure 2 diagram. Deep architecture: 39×600 input → C^{6×6}_{1→12} (12 kernels) → 34×595 → S^{2×2}_{12} → 17×297 → C^{6×6}_{12→12} (144 kernels) → 12×292 → S^{2×2}_{12} → 6×146 → C^{6×6}_{12→12} (144 kernels) → 1×141 → S^{1×141}_{12} → 12 variables → p(y = EN), p(y = FR), p(y = DE). Shallow architecture: 39×600 input → C^{39×39}_{1→12} (12 kernels) → 12 time series of size 1×562 → S^{1×562}_{12} → 12 variables → p(y = EN), p(y = FR), p(y = DE).]

Figure 2: Deep and shallow CNN-TDNN architectures. A convolutional layer C^{k×l}_{m→n} computes m·n convolutions between m input frames and n output frames with convolution kernels of size k × l and applies element-wise the nonlinearity max(min(x, 1), −1) to the output. A subsampling layer S^{k×l}_{m} subsamples m input frames by a factor k × l. The TDNN is implemented by the uppermost subsampling layer.
The classification problem consists of determining whether speech samples are English, French or
German. These languages are chosen because both datasets contain a sufficient number of samples
for each of them. We train and evaluate the classifier on balanced classes (33% English samples,
33% French samples and 33% German samples). Each sample corresponds to a speech signal of 5
seconds.
For each speech signal, a spectrogram of 39 × 600 pixels is constructed, where the y-axis represents 39 mel-frequencies between 0 and 5 kHz and the x-axis represents 600 time steps spaced by 8.33 milliseconds. Each frequency of the spectrogram is captured using a Hann window. Examples
of spectrograms are given in figure 1. The range 0–5 kHz is chosen because most of the spectral
power of speech falls into that range.
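As an illustration, the following sketch reconstructs such a spectrogram with the librosa library; this is an assumption, since the paper does not state its toolchain, and the sampling rate, FFT size and log scaling used below are plausible values rather than values from the paper.

import librosa

def speech_to_spectrogram(path, sr=16000):
    # Load 5 seconds of speech (the sampling rate is an assumption).
    signal, sr = librosa.load(path, sr=sr, duration=5.0)
    hop = int(round(0.00833 * sr))            # time steps spaced by 8.33 ms
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr,
        n_fft=2 * hop, hop_length=hop,        # window/FFT size: assumption
        window="hann",                        # Hann window, as in the text
        n_mels=39, fmin=0, fmax=5000)         # 39 mel-frequencies between 0 and 5 kHz
    return librosa.power_to_db(mel)[:, :600]  # 39 x 600 spectrogram (log scale: assumption)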
The classifier maps spectrograms into languages and is implemented as a time-delay neural network
(TDNN) with two-dimensional convolutional layers as feature extractors. Our implementation of the
TDNN performs a simple summation on the outputs of the convolutional layers. The architecture is
implemented with the Torch5 [8] machine learning library and is presented in figure 2.
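For concreteness, a minimal sketch of the deep architecture of figure 2 is given below in PyTorch notation; the original implementation uses Torch5, and the use of average pooling for the subsampling layers and of a final linear softmax classifier on the 12 output variables are assumptions where the figure is not explicit.

import torch
import torch.nn as nn

deep_net = nn.Sequential(
    nn.Conv2d(1, 12, kernel_size=6),    # C1->12, 6x6 kernels: 39x600 -> 34x595
    nn.Hardtanh(),                      # nonlinearity max(min(x, 1), -1)
    nn.AvgPool2d(2),                    # S12, 2x2 subsampling: -> 17x297
    nn.Conv2d(12, 12, kernel_size=6),   # C12->12, 144 kernels: -> 12x292
    nn.Hardtanh(),
    nn.AvgPool2d(2),                    # S12, 2x2 subsampling: -> 6x146
    nn.Conv2d(12, 12, kernel_size=6),   # C12->12, 144 kernels: -> 1x141
    nn.Hardtanh(),
    nn.AvgPool2d((1, 141)),             # TDNN: average over the 141 time steps -> 12 variables
    nn.Flatten(),
    nn.Linear(12, 3),                   # scores for EN / FR / DE (assumed linear classifier)
)

x = torch.randn(1, 1, 39, 600)          # one 39 x 600 spectrogram
p = torch.softmax(deep_net(x), dim=1)   # p(y = EN), p(y = FR), p(y = DE)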
Using a TDNN is motivated by good results obtained for speech recognition [2, 7]. Using convolutional layers as feature extractors is motivated by good results obtained by convolution-based architectures such as convolutional neural networks (CNN) for various visual perception tasks such as handwritten digit recognition [6]. The classifier is trained with stochastic gradient descent [1].
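A minimal sketch of such a training loop, assuming the PyTorch model above and a hypothetical iterator training_stream over (spectrogram, label) pairs; the learning rate is illustrative and not taken from the paper:

optimizer = torch.optim.SGD(deep_net.parameters(), lr=0.01)  # learning rate: assumption
criterion = nn.CrossEntropyLoss()

for spectrogram, label in training_stream:   # hypothetical stream, one sample per update
    optimizer.zero_grad()
    loss = criterion(deep_net(spectrogram), label)
    loss.backward()
    optimizer.step()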
VoxForge dataset:

Deep architecture, known speakers:
        EN    FR    DE
  EN  33.4   0.6   0.3
  FR   1.9  30.8   0.6
  DE   4.5   0.9  26.9

Deep architecture, new speakers:
        EN    FR    DE
  EN  33.0   0.8   1.4
  FR   2.8  27.4   1.4
  DE  12.0   1.6  19.6

Shallow architecture, new speakers:
        EN    FR    DE
  EN  28.3   1.2   4.4
  FR   3.7  26.7   2.5
  DE  10.5   3.1  19.7
Figure 3: Performance of the classifier on 5-second speech samples. Rows of the confusion matrices
represent the true label and columns represent the prediction of the classifier. Accuracy is computed
as the trace of the confusion matrix.
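For example, with numpy and the deep-architecture matrix for new speakers above (entries in percent):

import numpy as np

# Confusion matrix for the deep architecture on new VoxForge speakers (figure 3).
confusion = np.array([[33.0,  0.8,  1.4],    # true EN
                      [ 2.8, 27.4,  1.4],    # true FR
                      [12.0,  1.6, 19.6]])   # true DE
accuracy = np.trace(confusion)               # 80.0 (percent)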
Two evaluation settings are considered:

1. Classification for known speakers and known radios: speech samples are randomly assigned to the training set and the test set, each with probability 0.5.
2. Classification for new speakers and new radios: on VoxForge, speech samples coming from speakers with initials [A-P] are assigned to the training set and speakers with initials [Q-Z] to the test set (see the sketch after this list). On RadioStream, speech samples coming from KALW, France Info and HR Info are assigned to the test set and the remaining ones to the training set.
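A minimal sketch of the VoxForge split in setting 2, assuming a hypothetical list voxforge_samples of (speaker_name, spectrogram, language) triples:

train = [s for s in voxforge_samples if s[0][:1].upper() <= "P"]   # initials A-P
test  = [s for s in voxforge_samples if s[0][:1].upper() >= "Q"]   # initials Q-Z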
We compare the deep architecture with the shallow architecture also presented in figure 2. Choosing convolution kernels of size 39 × 39 for the shallow architecture is motivated by the fact that the resulting numbers of weights for both architectures have the same order of magnitude (∼ 10^4 weights) and that both architectures are then able to model 39 pixels of time dependence. Time dependence is measured as the time interval occupied by the subset of input nodes connected to a single hidden node located just before the uppermost subsampling layer. The deep architecture has 2.8·10^7 neural connections against 10^7 for the shallow architecture and consequently takes 2.8 times longer to train. We train the deep architecture for 0.75·10^6 iterations and the shallow architecture for 2.8·(0.75·10^6) = 2.1·10^6 iterations so that both architectures benefit from the same amount of computation time. Controlling the number of parameters, the amount of time dependence and the number of iterations makes it possible to measure the influence of depth on language identification.
Results are presented in figure 3. We observe the following:
1. The deep architecture is 5–10% more accurate than its shallow counterpart. Translation
invariances are not directly encoded by the structure of the shallow architecture and must
therefore be inferred from the data, slowing down the convergence time and leading to poor
generalization when the data is limited.
2. The neural network builds better discriminative features between French and non-French
samples than between English and German samples. A possible explanation is that German
and English are perceptually similar due to their common West-Germanic ancestor. It
shows that the overall accuracy of a system can vary considerably depending on the selected
subset of languages to identify.
3. On the VoxForge dataset, samples from new German speakers are often misclassified. It seems that the low number of German samples or the low number of German speakers prevents the classifier from creating good "German" features. The sensitivity to the number of samples or speakers is an argument for collecting more samples from more speakers.

4. Samples from known speakers are not classified perfectly. While figure 4 suggests that the number of frames in each hidden layer is sufficient, 39 pixels of time dependence might not be sufficient to create lexical or syntactic features. Solutions to increase time dependence are (1) to increase the size of the convolution kernels and control the subsequent risk of overfitting by using more samples or (2) to replace the last averaging module by a hierarchy of convolutional layers and, if necessary, handle the subsequent depth increase by training the new architecture greedily layer-wise.

Figure 4: Convolution kernels obtained on the VoxForge dataset. On the left: the 12 + 144 + 144 convolution kernels of size 6 × 6 of the deep architecture. On the right: the 12 convolution kernels of size 39 × 39 of the shallow architecture. In both cases, not all convolution kernels are used, which means that the capacity of the neural network is not fully used and that the performance bottleneck is not the number of frames in the hidden layers but rather the distance between train and test data, the presence of local minima in the loss function or the structure of the neural network.
4 Conclusion
A deep architecture for spoken language identification is presented and evaluated. Results show that
it can identify three different languages with 83.5% accuracy on 5-second speech samples coming from radio streams and with 80.1% accuracy on 5-second speech samples coming from VoxForge.
The deep architecture improves accuracy by 5–10% compared to its shallow counterpart. It indicates
that depth is important to encode invariances required to learn fast and generalize well on new data.
While we emphasize the superiority of deep architectures over shallow ones for this problem, it
remains to determine how deep learning compares to techniques based on hand-coded features. We
suggest that accuracy can be improved by (1) collecting more samples from more speakers and (2)
extending time dependence in order to learn higher level language features.
References
[1] L. Bottou, Stochastic Gradient Learning in Neural Networks, 1991
[2] L. Bottou, Une Approche théorique de l’Apprentissage Connexionniste: Applications à la Reconnaissance
de la Parole, 1991
[3] J. Hieronymous and S. Kadambe, Spoken Language Identification Using Large Vocabulary Speech
Recognition, 1996
[4] R. Tong, B. Ma, D. Zhu, H. Li and E.-S. Chng, Integrating Acoustic, Prosodic and Phonotactic Features
for Spoken Language Identification, 2006
[5] R. Cole, Survey of the State of the Art in Human Language Technology, 1997
[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition,
1998
[7] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang, Phoneme recognition using time-delay neural networks, 1989
[8] R. Collobert, Torch5, www.torch5.sf.net
[9] VoxForge, Free Speech Recognition, www.voxforge.org