Martin Russell

Papers by Martin Russell

Overview of the 2017 Spoken CALL Shared Task

We present an overview of the second edition of the Spoken CALL Shared Task. Groups competed on a prompt-response task using English-language data collected, through an online CALL game, from Swiss German teens in their second and third years of learning English. Each item consists of a written German prompt and an audio file containing a spoken response. The task is to accept linguistically correct responses and reject linguistically incorrect ones, with "linguistically correct" defined by a gold standard derived from human annotations. Scoring was performed using a metric defined as the ratio of the relative rejection rates on incorrect and correct responses. The second edition received eighteen entries and showed very substantial improvement on the first edition; all entries were better than the best entry from the first edition, and the best score was about four times higher. We present the task, the resources, the results, a discussion of the metrics used, and an analysis of what makes items challenging. In particular, we present quantitative evidence suggesting that incorrect responses are much more difficult to process than correct responses, and that the most significant factor in making a response challenging is its distance from the closest training example.
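As a rough illustration of the scoring idea mentioned in the abstract, the sketch below computes the ratio of the rejection rate on incorrect responses to the rejection rate on correct responses. The function name `rejection_ratio` and the boolean input encoding are assumptions made for this example; the official task scoring may include refinements that the abstract does not state.

```python
def rejection_ratio(is_correct, is_rejected):
    """Ratio of rejection rates: incorrect responses over correct responses.

    is_correct[i]  -- True if response i is linguistically correct (gold standard)
    is_rejected[i] -- True if the system rejected response i
    Both argument names are hypothetical, chosen for this sketch.
    """
    pairs = list(zip(is_correct, is_rejected))
    total_correct = sum(1 for c, _ in pairs if c)
    total_incorrect = sum(1 for c, _ in pairs if not c)
    if total_correct == 0 or total_incorrect == 0:
        raise ValueError("need at least one correct and one incorrect response")

    rejected_correct = sum(1 for c, r in pairs if c and r)
    rejected_incorrect = sum(1 for c, r in pairs if not c and r)

    rate_on_incorrect = rejected_incorrect / total_incorrect
    rate_on_correct = rejected_correct / total_correct
    if rate_on_correct == 0:
        # No correct response was rejected: the ratio is unbounded.
        return float("inf")
    return rate_on_incorrect / rate_on_correct


# Four correct responses (one wrongly rejected), two incorrect (both rejected).
gold = [True, True, True, True, False, False]
decisions = [True, False, False, False, True, True]
print(rejection_ratio(gold, decisions))  # 1.0 / 0.25 = 4.0
```

A higher value means the system rejects incorrect responses far more often than correct ones, which is the behaviour the task rewards.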

Phone Classification Using a Non-Linear Manifold with Broad Phone Class Dependent DNNs

Most state-of-the-art automatic speech recognition (ASR) systems use a single deep neural network (DNN) to map the acoustic space to the decision space. However, different phonetic classes employ different production mechanisms and are best described by different types of features. Hence it may be advantageous to replace this single DNN with several phone class dependent DNNs. The appropriate mathematical formalism for this is a manifold. This paper assesses the use of a nonlinear manifold structure with multiple DNNs for phone classification. The system has two levels. The first comprises a set of broad phone class (BPC) dependent DNN-based mappings and the second level is a fusion network. Various ways of designing and training the networks in both levels are assessed, including varying the size of hidden layers, the use of the bottleneck or softmax outputs as input to the fusion network, and the use of different broad class definitions. Phone classification experiments are performed on TIMIT. The results show that using the BPC-dependent DNNs provides small but significant improvements in phone classification accuracy relative to a single global DNN. The paper concludes with visualisations of the structures learned by the local and global DNNs and discussion of their interpretations.

Analysis of Phone Errors Attributable to Phonological Effects Associated With Language Acquisition Through Bottleneck Feature Visualisations

Previous work aimed to investigate the extent to which errors attributable to phonological effects associated with language acquisition (PEALA) contribute to the output of children's ASR. Contrary to what was intuitively expected, the proportion of errors predictable from PEALA was positively correlated with recognition accuracy, and therefore increased with age. In order to interpret this finding, the present paper employs a DNN-HMM automatic speech recognition system, built on the CSLU children's speech corpus, to produce bottleneck feature (BNF) visualisations of phones and examine how these relate to PEALA. The focus is particularly on ASR errors caused by phone confusions, which are compared against phone substitution pairs indicated by PEALA. The ASR results confirm the previously observed interaction between errors predictable from PEALA and rising accuracy, but also suggest that these errors only account for a small percentage of the total phone substitution error. The BNF visualisations for the most part outline the age progression smoothly and consistently demonstrate clear clusters of neighbouring phones. The distances between PEALA-related phones can be partitioned into four sets: two that increase with age (at a higher or lower rate), one that remains roughly constant and one that decreases with age.

Speech Recognition on an FPGA Using Discrete and Continuous Hidden Markov Models

Lecture Notes in Computer Science, Mar 29, 2010

The Development of the Speaker Independent ARM Continuous Speech Recognition System

STIN, 1992

This memorandum describes the development of a speaker independent continuous speech recognition system based on phoneme level hidden Markov models. The system is configured for spoken airborne reconnaissance reports, a task which involves a vocabulary of approximately 500 words. On a test set of speech from 80 male subjects, the final system achieves a word accuracy of 74.1% with no explicit syntactic constraints.

Modelling speech signals using formant frequencies as an intermediate representation

IET Signal Processing, Mar 1, 2007

This paper concerns Multiple-level Segmental Hidden Markov Models (M-SHMMs) in which the relationship between symbolic and acoustic representations of speech is regulated by a formant-based intermediate representation. New TIMIT phone recognition results are presented, confirming that the theoretical upper-bound on performance is achieved provided that either the intermediate representation or the formant-to-acoustic mapping is sufficiently rich. The way in which M-SHMMs exploit formant-based information is also investigated, using singular value decomposition of the formant-to-acoustic mappings and linear discriminant analysis. The analysis shows that if the intermediate layer contains information which is linearly related to the spectral representation, that information is used in preference to explicit formant frequencies, even though the latter are useful for phone discrimination. In summary, while these results confirm the utility of M-SHMMs for automatic speech recognition, they provide empirical evidence of the value of non-linear formant-to-acoustic mappings.

Development of articulatory-based multilevel segmental HMMs for phonetic classification in ASR

A simple multiple-level HMM is presented in which speech dynamics are modelled as linear trajectories in an intermediate, formant-based representation and the mapping between the intermediate and acoustic data is achieved using one or more linear transformations. An upper-bound on the performance of such a system is established. Experimental results on the TIMIT corpus demonstrate that, if the dimension of the intermediate space is sufficiently high or the number of articulatory-to-acoustic mappings is sufficiently large, then this upper-bound can be achieved.

The efficacy of a task model approach to ADL rehabilitation in stroke apraxia and action disorganisation syndrome: A randomised controlled trial

PLOS ONE, Mar 3, 2022

Apraxia and action disorganization syndrome (AADS) after stroke can disrupt activities of daily living (ADL). Occupational therapy has been effective in improving ADL performance; however, inclusion of multiple tasks means it is unclear which therapy elements contribute to improvement. We evaluated the efficacy of a task model approach to ADL rehabilitation, comparing training in making a cup of tea with a stepping training control condition. Of the 29 stroke survivors with AADS who participated in this cross-over randomized controlled feasibility trial, 25 were included in analysis [44% females; mean(SD) age = 71.1(7.8) years; years post-stroke = 4.6(3.3)]. Participants attended five 1-hour weekly tea making training sessions in which progress was monitored and feedback given using a computer-based system which implemented a Markov Decision Process (MDP) task model. In a control condition, participants received five 1-hour weekly stepping sessions. Compared to stepping training, tea making training reduced errors across 4 different tea types. The time taken to make a cup of tea was reduced, so the improvement in accuracy was not due to a speed-accuracy trade-off. No improvement linked to tea making training was evident in a complex tea preparation task (making two different cups of tea simultaneously), indicating a lack of generalisation in the training.

Implementing a simple continuous speech recognition system on an FPGA

Speech recognition is a computationally demanding task, particularly the stage which uses Viterbi decoding for converting pre-processed speech data into words or sub-word units. We present an FPGA implementation of the decoder based on continuous hidden Markov models (HMMs) representing monophones, and demonstrate that it can process speech at 75 times real time, using 45% of the slices of a Xilinx Virtex XCV1000.
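For readers unfamiliar with the decoding stage named in the abstract, the following is a minimal software sketch of log-domain Viterbi decoding for a small HMM. It is not the FPGA design described in the paper: the function name `viterbi`, the two-state toy model and the per-frame likelihoods are all assumptions chosen for illustration, and a continuous-density system would compute the frame scores from Gaussian mixtures rather than take them as given.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path and its log score.

    log_init[s]     -- log probability of starting in state s
    log_trans[p][s] -- log probability of moving from state p to state s
    log_emit[t][s]  -- log likelihood of frame t under state s
    """
    n_states = len(log_init)
    scores = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backpointers = []

    for frame in log_emit[1:]:
        new_scores, pointers = [], []
        for s in range(n_states):
            # Best predecessor for state s at this frame.
            best_prev = max(range(n_states),
                            key=lambda p: scores[p] + log_trans[p][s])
            new_scores.append(scores[best_prev] + log_trans[best_prev][s] + frame[s])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    best_score = scores[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score


def to_log(rows):
    """Convert a nested list of probabilities to log probabilities."""
    return [[math.log(x) for x in row] for row in rows]


# Toy two-state model: the first two frames favour state 0, the last two state 1.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = to_log([[0.9, 0.1], [0.1, 0.9]])
log_emit = to_log([[0.9, 0.1], [0.9, 0.1], [0.05, 0.95], [0.05, 0.95]])

path, score = viterbi(log_init, log_trans, log_emit)
print(path)  # [0, 0, 1, 1]
```

The inner loop over predecessor states is the part that dominates the computation, which is one reason the decoder is a natural candidate for hardware acceleration.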

The effect of an intermediate articulatory layer on the performance of a segmental HMM

We present a novel multi-level HMM in which an intermediate 'articulatory' representation is included between the state and surface-acoustic levels. A potential difficulty with such a model is that advantages gained by the introduction of an articulatory layer might be compromised by limitations due to an insufficiently rich articulatory representation, or by compromises made for mathematical or computational expediency. This paper describes a simple model in which speech dynamics are modelled as linear trajectories in a formant-based 'articulatory' layer, and the articulatory-to-acoustic mappings are linear. Phone classification results for TIMIT are presented for monophone and triphone systems with a phone-level syntax. The results demonstrate that provided the intermediate representation is sufficiently rich, or a sufficiently large number of phone-class-dependent articulatory-to-acoustic mappings are employed, classification performance is not compromised.

Compensating for Object Variability in DNN–HMM Object-Centered Human Activity Recognition

This paper describes a deep neural network-hidden Markov model (DNN-HMM) human activity recognition system based on instrumented objects and studies compensation strategies to deal with object variability. The sensors, comprising an accelerometer, gyroscope, magnetometer and force-sensitive resistors (FSRs), are packaged in a coaster attached to the base of an object, here a mug. Results are presented for recognition of actions involved in manipulating a mug. Evaluations are performed using over 24 hours of data recordings containing sequences of actions, labelled without time-stamp information. We demonstrate the importance of data alignments. While the DNN-HMM system achieved an error rate below 0.1% for matched train-test conditions, this increased up to 26.5% for highly mismatched conditions. The error rate averaged over all conditions was 1.4% when using multi-condition training and decreased to 0.8% by employing feature augmentation. The use of FSR feature compensation, specific to weight variability, resulted in a 0.24% error rate.

Explicit Modelling of State Duration Correlations in Hidden Markov Models

Memorandum 4152. M J Russell and L Siine, September 1988. In recent years considerable effort has been directed towards improving the treatment of durational structure in hidden Markov model (HMM) based approaches to speech pattern modelling. In general these studies have been concerned with more accurate modelling of the variations in segment duration which occur when words are spoken at a nominally constant speaking rate. However, recent work has shown that some of the performance gains which can be achieved by improved duration modelling are lost when the words in the test set are spoken at a different rate from those in the training set. This memorandum presents an approach to solving this problem based on the capture and use of information about state duration correlations. A method for measuring correlations between the durations of adjacent states in a HMM is described. The method involves expanding a standard HMM into a special type of hidden semi-Markov model (HSMM), called a Correlated Duration HMM (CDHMM), in which each state of the original HMM is expanded into a set of fixed-duration HSMM states. The probabilities associated with transitions between these states are measures of state duration correlation. Experiments are described in which the CDHMM method is applied to a set of sentences spoken at four different speaking rates. Note: This memorandum is an expanded version of a paper presented at SPEECH '88, Seventh FASE Symposium, 22-26 August 1988, Edinburgh. Copyright Controller HMSO, London, 1988.

Evaluation of hidden Markov models robustness in uncovering focus of visual attention from noisy eye-tracker data

Eye Tracking Research and Applications Symposium 2004 (ETRA 2004), 2004

Assistive System for People with Apraxia Using A Markov Decision Process

Studies in health technology and informatics, 2014

CogWatch is an assistive system to re-train stroke survivors suffering from Apraxia or Action Disorganization Syndrome (AADS) to complete activities of daily living (ADLs). This paper describes the approach to real-time planning based on a Markov Decision Process (MDP), and demonstrates its ability to improve task performance via user simulation. The paper concludes with a discussion of the remaining challenges and future enhancements.
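To make the idea of MDP-based planning concrete, here is a small sketch that solves a toy MDP with value iteration, a standard textbook method. The abstract does not say which solution method CogWatch uses, so this is not the system's algorithm: the states, actions, rewards and the function name `value_iteration` are all hypothetical and exist only for this example.

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8):
    """Solve a small MDP by value iteration.

    transitions[s][a] -- list of (probability, next_state) pairs
    rewards[s][a]     -- immediate reward for taking action a in state s
    States with no actions are treated as terminal (value 0).
    """
    def action_value(s, a, values):
        return rewards[s][a] + gamma * sum(p * values[n] for p, n in transitions[s][a])

    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:
                continue
            best = max(action_value(s, a, values) for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break

    # Greedy policy with respect to the converged values.
    policy = {}
    for s, actions in transitions.items():
        if actions:
            policy[s] = max(actions, key=lambda a: action_value(s, a, values))
    return values, policy


# Hypothetical two-step tea task: add the teabag, then pour the water.
transitions = {
    "empty_cup": {"add_teabag": [(1.0, "teabag_in")],
                  "pour_water": [(1.0, "empty_cup")]},   # wrong order, no progress
    "teabag_in": {"pour_water": [(1.0, "done")],
                  "add_teabag": [(1.0, "teabag_in")]},   # repeated step
    "done": {},
}
rewards = {
    "empty_cup": {"add_teabag": 0.0, "pour_water": -1.0},
    "teabag_in": {"pour_water": 10.0, "add_teabag": -1.0},
    "done": {},
}

values, policy = value_iteration(transitions, rewards)
print(policy)  # {'empty_cup': 'add_teabag', 'teabag_in': 'pour_water'}
```

In an assistive setting, the recommended action in each state could be used as the next prompt, and a user action that differs from it could be flagged as a possible error.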

Unsupervised model selection for recognition of regional accented speech

Interspeech 2014, 2014

Automatic speaker, age-group and gender identification from children’s speech

Computer Speech & Language, 2018

A speech signal contains important paralinguistic information, such as the identity, age, gender, language, accent, and the emotional state of the speaker. Automatic recognition of these types of information in adults' speech has received considerable attention; however, there has been little work on children's speech. This paper focuses on speaker, gender, and age-group recognition from children's speech. The performances of several classification methods are compared, including Gaussian Mixture Model-Universal Background Model (GMM-UBM), GMM-Support Vector Machine (GMM-SVM) and i-vector based approaches. For speaker recognition, error rate decreases as age increases, as one might expect. However, for gender and age-group recognition the effect of age is more complex, due mainly to consequences of the onset of puberty. Finally, the utility of different frequency bands for speaker, age-group and gender recognition from children's speech is assessed.

Identification of British English regional accents using fusion of i-vector and multi-accent phonotactic systems

The Speaker and Language Recognition Workshop (Odyssey 2016), 2016

The para-linguistic information in a speech signal includes clues to the geographical and social background of the speaker. This paper is concerned with recognition of the 14 regional accents of British English. For Accent Identification (AID), acoustic methods exploit differences between the distributions of sounds, while phonotactic approaches exploit the sequences in which these sounds occur. We demonstrate that these methods complement each other well and use their confusion matrices for further analysis. Our relatively simple i-vector and phonotactic fused system, with recognition accuracy of 84.87%, outperforms the i-vector fused results reported in the literature by 4.7%. Further analysis of the distribution of British English accents has been carried out by analyzing the low dimensional representation of the i-vector AID feature space.

Object-Centred Recognition of Human Activity

2015 International Conference on Healthcare Informatics, 2015

This paper describes an approach to real-time human activity recognition using hidden Markov models (HMMs) and sensorised objects, and its application to rehabilitation of stroke patients with apraxia or action disorganisation syndrome (AADS). Results are presented for the task of making a cup of tea. Unlike speech or other sequential decoding problems where HMMs have previously been successfully applied, human actions can occur simultaneously or at least in overlapping time. The solution proposed in this paper is based on a parallel, asynchronous set of detectors, each responsible for the detection of one of the component sub-goals of the tea-making task. The inputs to these detectors are formed from the outputs of sensors attached to the objects involved in that sub-goal, plus hand coordinate data. The sensors, comprising an accelerometer and three force-sensitive resistors, are packaged in a coaster which can be easily attached to the base of a mug or jug. In tests on complete tea-making trials, error rates range from less than 5% for sub-goals where all of the objects involved are sensorised, to up to 30% for detectors that rely on hand-coordinate data alone. The complete set of detectors runs in real-time. It is concluded that a set of parallel HMM-based sub-goal detectors combined with fully sensorised objects is a viable, accurate and easily deployable approach to real-time object-centred human activity recognition.

Recent results from the ARM continuous speech recognition project

Proceedings of the workshop on Speech and Natural Language - HLT '90, 1990

POMDP Based Action Planning and Human Error Detection

IFIP Advances in Information and Communication Technology, 2015

This paper presents a Partially Observable Markov Decision Process (POMDP) model for action planning and human error detection during Activities of Daily Living (ADLs). This model is integrated into a sub-component of an assistive system designed for stroke survivors; it is called the Artificial Intelligent Planning System (AIPS). Its main goal is to monitor the user's history of actions during a specific task, and to provide meaningful assistance when an error is detected in his/her sequence of actions. To do so, the AIPS must cope with the ambiguity in the outputs of the other system's components. In this paper, we first give an overview of the global assistive system where the AIPS is implemented, and explain how it interacts with the user to guide him/her during tea-making. We then define the POMDP models and the Monte Carlo Algorithm used to learn how to retrieve optimal prompts, and detect human errors under uncertainty.

Overview of the 2017 Spoken CALL Shared Task

We present an overview of the second edition of the Spoken CALL Shared Task. Groups competed on a... more We present an overview of the second edition of the Spoken CALL Shared Task. Groups competed on a prompt-response task using English-language data collected, through an online CALL game, from Swiss German teens in their second and third years of learning English. Each item consists of a written German prompt and an audio file containing a spoken response. The task is to accept linguistically correct responses and reject linguistically incorrect ones, with "linguistically correct" defined by a gold standard derived from human annotations. Scoring was performed using a metric defined as the ratio of the relative rejection rates on incorrect and correct responses. The second edition received eighteen entries and showed very substantial improvement on the first edition; all entries were better than the best entry from the first edition, and the best score was about four times higher. We present the task, the resources, the results, a discussion of the metrics used, and an analysis of what makes items challenging. In particular, we present quantitative evidence suggesting that incorrect responses are much more difficult to process than correct responses, and that the most significant factor in making a response challenging is its distance from the closest training example.

Phone Classification Using a Non-Linear Manifold with Broad Phone Class Dependent DNNs

Most state-of-the-art automatic speech recognition (ASR) systems use a single deep neural network... more Most state-of-the-art automatic speech recognition (ASR) systems use a single deep neural network (DNN) to map the acoustic space to the decision space. However, different phonetic classes employ different production mechanisms and are best described by different types of features. Hence it may be advantageous to replace this single DNN with several phone class dependent DNNs. The appropriate mathematical formalism for this is a manifold. This paper assesses the use of a nonlinear manifold structure with multiple DNNs for phone classification. The system has two levels. The first comprises a set of broad phone class (BPC) dependent DNN-based mappings and the second level is a fusion network. Various ways of designing and training the networks in both levels are assessed, including varying the size of hidden layers, the use of the bottleneck or softmax outputs as input to the fusion network, and the use of different broad class definitions. Phone classification experiments are performed on TIMIT. The results show that using the BPC-dependent DNNs provides small but significant improvements in phone classification accuracy relative to a single global DNN. The paper concludes with visualisations of the structures learned by the local and global DNNs and discussion of their interpretations.

Analysis of Phone Errors Attributable to Phonological Effects Associated With Language Acquisition Through Bottleneck Feature Visualisations

Previous work aimed to investigate the extent to which errors attributable to phonological effect... more Previous work aimed to investigate the extent to which errors attributable to phonological effects associated with language acquisition (PEALA) contribute to the output of children's ASR. Opposite to what was intuitively expected, the proportion of errors predictable from PEALA was positively correlated with recognition accuracy, therefore increased across ages. In order to interpret this finding, the present paper employs a DNN-HMM automatic speech recognition system, built on the CSLU children's speech corpus, to produce bottleneck feature (BNF) visualisations of phones and examine how these relate with respect to PEALA. The focus is drawn particularly on ASR errors caused by phone confusions, which are compared against phone substitution pairs indicated by PEALA. The ASR results confirm the previously observed interaction between errors predictable from PEALA and rising accuracy, but also suggest that these errors only account for a small percentage of the total phone substitution error. The BNF visualisations for the most part outline the age progression smoothly and demonstrate clear clusters of neighbouring phones consistently. The distance between PEALA related phones can be partitioned in four sets; two that increase with age (at a higher or lower rate), one that roughly remains constant and one that decreases with age.

Speech Recognition on an FPGA Using Discrete and Continuous Hidden Markov Models

Lecture Notes in Computer Science, Mar 29, 2010

Where a licence is displayed above, please note the terms and conditions of the licence govern yo... more Where a licence is displayed above, please note the terms and conditions of the licence govern your use of this document. When citing, please reference the published version. While the University of Birmingham exercises care and attention in making items available there are rare occasions when an item has been uploaded in error or has been deemed to be commercially or otherwise sensitive.

The Development of the Speaker Independent ARM Continuous Speech Recognition System

STIN, 1992

This memorandum describes the development of a speaker independent con--il, YFech recornition svs... more This memorandum describes the development of a speaker independent con--il, YFech recornition svstem based on phoneme level hidden Markov fkouc,~ 'i Itt SY\Ste. lb CO~i...21;d I, ~ spokia!i rb, ix reconnaissance reports, a task which involves a vocabulary of approimately 500 words . On a test set of speech from 80 male subjects, the final system achieves a word accuracy of 74.1% with no explicit syntactic constraints.

Modelling speech signals using formant frequencies as an intermediate representation

Iet Signal Processing, Mar 1, 2007

This paper concerns Multiple-level Segmental Hidden Markov Models (M-SHMMs) in which the relation... more This paper concerns Multiple-level Segmental Hidden Markov Models (M-SHMMs) in which the relationship between symbolic and acoustic representations of speech is regulated by a formant-based intermediate representation. New TIMIT phone recognition results are presented, confirming that the theoretical upper-bound on performance is achieved provided that either the intermediate representation or the formant-to-acoustic mapping is sufficiently rich. The way in which M-SHMMs exploit formant-based information is also investigated, using singular value decomposition of the formant-to-acoustic mappings and linear discriminant analysis. The analysis shows that if the intermediate layer contains information which is linearly related to the spectral representation, that information is used in preference to explicit formant frequencies, even though the latter are useful for phone discrimination. In summary, while these results confirm the utility of M-SHMMs for automatic speech recognition, they provide empirical evidence of the value of non-linear formant-to-acoustic mappings.

Development of articulatory-based multilevel segmental HMMs for phonetic classification in ASR

A simple multiple-level HMM is presented in which speech dynamics are modelled as linear trajecto... more A simple multiple-level HMM is presented in which speech dynamics are modelled as linear trajectories in an intermediate, formant-based representation and the mapping between the intermediate and acoustic data is achieved using one or more linear transformations. An upper-bound on the performance of such a system is established. Experimental results on the TIMIT corpus demonstrate that, if the dimension of the intermediate space is sufficiently high or the number of articulatory-to-acoustic mappings is sufficiently large, then this upper-bound can be achieved.

The efficacy of a task model approach to ADL rehabilitation in stroke apraxia and action disorganisation syndrome: A randomised controlled trial

PLOS ONE, Mar 3, 2022

Apraxia and action disorganization syndrome (AADS) after stroke can disrupt activities of daily l... more Apraxia and action disorganization syndrome (AADS) after stroke can disrupt activities of daily living (ADL). Occupational therapy has been effective in improving ADL performance, however, inclusion of multiple tasks means it is unclear which therapy elements contribute to improvement. We evaluated the efficacy of a task model approach to ADL rehabilitation, comparing training in making a cup of tea with a stepping training control condition. Of the 29 stroke survivors with AADS who participated in this cross-over randomized controlled feasibility trial, 25 were included in analysis [44% females; mean(SD) age = 71.1(7.8) years; years post-stroke = 4.6(3.3)]. Participants attended five 1-hour weekly tea making training sessions in which progress was monitored and feedback given using a computerbased system which implemented a Markov Decision Process (MDP) task model. In a control condition, participants received five 1-hour weekly stepping sessions. Compared to stepping training, tea making training reduced errors across 4 different tea types. The time taken to make a cup of tea was reduced so the improvement in accuracy was not due to a speed-accuracy trade-off. No improvement linked to tea making training was evident in a complex tea preparation task (making two different cups of tea simultaneously), indicating a lack of generalisation in the training.

Implementing a simple continuous speech recognition system on an FPGA

Speech recognition is a computationally demanding task, particularly the stage which uses Viterbi... more Speech recognition is a computationally demanding task, particularly the stage which uses Viterbi decoding for converting pre-processed speech data into words or sub-word units. We present an FPGA implementations of the decoder based on continuous hidden Markov models (HMMs) representing monophones, and demonstrate that it can process speech 75 times real time, using 45% of the slices of a Xilinx Virtex XCV1000.

The effect of an intermediate articulatory layer on the performance of a segmental HMM

We present a novel multi-level HMM in which an intermediate 'articulatory' representation is incl... more We present a novel multi-level HMM in which an intermediate 'articulatory' representation is included between the state and surface-acoustic levels. A potential difficulty with such a model is that advantages gained by the introduction of an articulatory layer might be compromised by limitations due to an insufficiently rich articulatory representation, or by compromises made for mathematical or computational expediency. This paper decribes a simple model in which speech dynamics are modelled as linear trajectories in a formant-based 'articulatory' layer, and the articulatory-to-acoustic mappings are linear. Phone classification results for TIMIT are presented for monophone and triphone systems with a phone-level syntax. The results demonstrate that provided the intermediate representation is sufficiently rich, or a sufficiently large number of phone-class-dependent articulatory-to-acoustic mapping are employed, classification performance is not compromised.

Compensating for Object Variability in DNN–HMM Object-Centered Human Activity Recognition

This paper describes a deep neural network-hidden Markov model (DNN-HMM) human activity recognition system based on instrumented objects and studies compensation strategies to deal with object variability. The sensors, comprising an accelerometer, gyroscope, magnetometer and force-sensitive resistors (FSRs), are packaged in a coaster attached to the base of an object, here a mug. Results are presented for recognition of actions involved in manipulating a mug. Evaluations are performed using over 24 hours of data recordings containing sequences of actions, labelled without time-stamp information. We demonstrate the importance of data alignments. While the DNN-HMM system achieved an error rate below 0.1% for matched train-test conditions, this increased up to 26.5% for highly mismatched conditions. The error rate averaged over all conditions was 1.4% when using multi-condition training and decreased to 0.8% by employing feature augmentation. The use of FSR feature compensation, specific to weight variability, resulted in an error rate of 0.24%.

Explicit Modelling of State Duration Correlations in Hidden Markov Models

Memorandum 4152. Explicit Modelling of State Duration Correlations in Hidden Markov Models. M J Russell and L Siine, September 1988. Abstract: In recent years considerable effort has been directed towards improving the treatment of durational structure in hidden Markov model (HMM) based approaches to speech pattern modelling. In general these studies have been concerned with more accurate modelling of the variations in segment duration which occur when words are spoken at a nominally constant speaking rate. However, recent work has shown that some of the performance gains which can be achieved by improved duration modelling are lost when the words in the test set are spoken at a different rate from those in the training set. This memorandum presents an approach to solving this problem based on the capture and use of information about state duration correlations. A method for measuring correlations between the durations of adjacent states in a HMM is described. The method involves expanding a standard HMM into a special type of hidden semi-Markov model (HSMM), called a Correlated Duration HMM (CDHMM), in which each state of the original HMM is expanded into a set of fixed-duration HSMM states. The probabilities associated with transitions between these states are measures of state duration correlation. Experiments are described in which the CDHMM method is applied to a set of sentences spoken at four different speaking rates. Note: This memorandum is an expanded version of a paper presented at SPEECH '88, Seventh FASE Symposium, 22-26 August 1988, Edinburgh. Copyright © Controller HMSO, London, 1988.

Evaluation of hidden Markov models robustness in uncovering focus of visual attention from noisy eye-tracker data

Eye Tracking Research and Applications Symposium 2004 (ETRA 2004), 2004

Assistive System for People with Apraxia Using A Markov Decision Process

Studies in health technology and informatics, 2014

CogWatch is an assistive system to re-train stroke survivors suffering from Apraxia or Action Disorganization Syndrome (AADS) to complete activities of daily living (ADLs). This paper describes the approach to real-time planning based on a Markov Decision Process (MDP), and demonstrates its ability to improve task performance via user simulation. The paper concludes with a discussion of the remaining challenges and future enhancements.

Unsupervised model selection for recognition of regional accented speech

Interspeech 2014, 2014

Automatic speaker, age-group and gender identification from children’s speech

Computer Speech & Language, 2018

A speech signal contains important paralinguistic information, such as the identity, age, gender, language, accent, and the emotional state of the speaker. Automatic recognition of these types of information in adults' speech has received considerable attention; however, there has been little work on children's speech. This paper focuses on speaker, gender, and age-group recognition from children's speech. The performances of several classification methods are compared, including Gaussian Mixture Model-Universal Background Model (GMM-UBM), GMM-Support Vector Machine (GMM-SVM) and i-vector based approaches. For speaker recognition, error rate decreases as age increases, as one might expect. However, for gender and age-group recognition the effect of age is more complex, due mainly to consequences of the onset of puberty. Finally, the utility of different frequency bands for speaker, age-group and gender recognition from children's speech is assessed.

Identification of British English regional accents using fusion of i-vector and multi-accent phonotactic systems

The Speaker and Language Recognition Workshop (Odyssey 2016), 2016

The para-linguistic information in a speech signal includes clues to the geographical and social background of the speaker. This paper is concerned with recognition of the 14 regional accents of British English. For Accent Identification (AID), acoustic methods exploit differences between the distributions of sounds, while phonotactic approaches exploit the sequences in which these sounds occur. We demonstrate that these methods complement each other well and use their confusion matrices for further analysis. Our relatively simple fused i-vector and phonotactic system, with a recognition accuracy of 84.87%, outperforms the i-vector fused results reported in the literature by 4.7%. Further analysis of the distribution of British English accents has been carried out by analyzing the low-dimensional representation of the i-vector AID feature space.

Object-Centred Recognition of Human Activity

2015 International Conference on Healthcare Informatics, 2015

This paper describes an approach to real-time human activity recognition using hidden Markov models (HMMs) and sensorised objects, and its application to rehabilitation of stroke patients with apraxia or action disorganisation syndrome (AADS). Results are presented for the task of making a cup of tea. Unlike speech or other sequential decoding problems where HMMs have previously been successfully applied, human actions can occur simultaneously or at least in overlapping time. The solution proposed in this paper is based on a parallel, asynchronous set of detectors, each responsible for the detection of one of the component sub-goals of the tea-making task. The inputs to these detectors are formed from the outputs of sensors attached to the objects involved in that sub-goal, plus hand coordinate data. The sensors, comprising an accelerometer and three force-sensitive resistors, are packaged in a coaster which can be easily attached to the base of a mug or jug. In tests on complete tea-making trials, error rates range from less than 5% for sub-goals where all of the objects involved are sensorised, to up to 30% for detectors that rely on hand-coordinate data alone. The complete set of detectors runs in real-time. It is concluded that a set of parallel HMM-based sub-goal detectors combined with fully sensorised objects is a viable, accurate and easily deployable approach to real-time object-centred human activity recognition.

Recent results from the ARM continuous speech recognition project

Proceedings of the workshop on Speech and Natural Language - HLT '90, 1990

POMDP Based Action Planning and Human Error Detection

IFIP Advances in Information and Communication Technology, 2015

This paper presents a Partially Observable Markov Decision Process (POMDP) model for action planning and human error detection during Activities of Daily Living (ADLs). This model is integrated into a sub-component of an assistive system designed for stroke survivors; it is called the Artificial Intelligent Planning System (AIPS). Its main goal is to monitor the user's history of actions during a specific task, and to provide meaningful assistance when an error is detected in his/her sequence of actions. To do so, the AIPS must cope with the ambiguity in the outputs of the system's other components. In this paper, we first give an overview of the global assistive system where the AIPS is implemented, and explain how it interacts with the user to guide him/her during tea-making. We then define the POMDP models and the Monte Carlo algorithm used to learn how to retrieve optimal prompts and detect human errors under uncertainty.