VR For Smarthome
{Michel.Vacher,Benjamin.Lecouteux,Pedro.Chahuara, Francois.Portet}@imag.fr
dan.istrate@esigetel.fr, mohamed.sehili@esigetel.fr, thierry.joubert@theoris.fr
Abstract
recognition. Furthermore, in contrast with the current triggered-by-button ASR systems commonly found in smartphones, this voice control should be able to work in a hands-free manner in case the person is not able to move. Another important aspect is the respect of privacy: the system should not disseminate any raw personal data outside the home without the user's consent. Our approach, called PATSH, is a step toward these goals. The originality of the approach is to consider these problems together, while they have mostly been studied separately.
To the best of our knowledge, the main trends in audio technology in Smart Homes are related to augmented human-machine interaction (e.g., voice command, conversation) and security (mainly fall detection and distress situation recognition). Regarding security, the main application is fall detection using the signal of a wearable microphone, which is often fused with other modalities (e.g., accelerometer) [4, 3]. However, the person is constrained to wear these sensors at all times. To address this constraint, the dialogue system developed by [6] was proposed to replace traditional emergency systems that require too much change in the lifestyle of the elders. However, the prototype had a limited vocabulary (yes/no dialogue), was not tested with aged users, and there is no mention of how noise was taken into account.

Most speech-related research and industrial projects in AAL are actually highly focused on dialogue to build communicative agents (e.g., see the EU-funded Companions or CompanionAble projects, or the Semvox system¹). These systems are often composed of ASR, NLU, dialogue management and TTS parts, giving the user the ability to communicate with the system in an interactive fashion. However, it is generally the dialogue module (management, modelling, architecture, personalization, etc.) that is the main focus of these projects (e.g., see Companions, OwlSpeak or Jaspis). Moreover, this setting is different from the Smart Home one, as the user must be close to the avatar to speak (i.e., not a distant speech setting). In [7], a communicative avatar was designed to interact with a person in a smart office. In this research, enhanced speech recognition is performed using beamforming and a geometric area of recording, but this promising research has yet to be tested in a realistic multi-room, multi-source home.
Designing and applying speech interfaces in Smart Homes to provide security reassurance and natural man-machine interaction is the aim of the SWEET-HOME² project. With respect
1. Introduction
Due to the demographic change and ageing in developed countries, the number of older persons is steadily increasing. In this
situation, society must find solutions to allow these people to live in their homes as comfortably and safely as possible by assisting them in their daily life. This concept, known as Ambient Assisted Living (AAL), aims at anticipating and responding to the special needs of these persons. In this domain, the development of Smart Homes and intelligent companions is seen as a
promising way of achieving in-home daily assistance [1]. However, given the diverse profiles of the senior population (e.g.,
low/high technical skill, disabilities, etc.), complex interfaces
should be avoided. Nowadays, one of the best candidates seems to be the speech interface, which makes interaction possible using natural language, so that the user does not have to learn complex computing procedures or jargon. Moreover, it is well adapted to people with reduced mobility and to some emergency situations, because the user doesn't need to be close to a switch (hands-free system). Despite all this, very few Smart Home
projects have seriously considered speech recognition in their
design [2, 3, 4, 5, 6, 7, 8]. Part of this can be attributed to the
complexity of setting up this technology in a real environment
and to important challenges that still need to be overcome [9].
In order to make in-home voice control a success and a
benefit for people with special needs, we argue that a complete
framework for audio analysis in Smart Home must be designed.
This framework should be able to provide real-time response,
to analyse concurrently several audio channels, to detect audio
events, to filter out noise and to perform robust distant speech
1 http://www.semvox.de
2 http://sweet-home.imag.fr
1. Multichannel data acquisition through the NIDAQ6220E card: seven channels are acquired at 16 kHz with 16-bit quantization;
Figure 1: Architecture of the PATSH framework. Audio from N channels (audio sources) is captured and segmented into sound objects, which are passed between the successive processing stages (acquisition, detection, sound/speech discrimination, sound classification, ASR, presentation) as sound and work objects through FIFO queues (create, fill, transfer, duplicate, write back and destroy operations).
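The figure suggests a producer/consumer organisation in which each processing stage communicates with the next through FIFO queues. Purely as an illustration (the names SoundObject, capture_frame, detect_event and is_speech below are ours, not part of PATSH), a minimal sketch of such a multichannel pipeline in Python could look as follows:

    import queue
    import threading
    from dataclasses import dataclass

    import numpy as np

    SAMPLE_RATE = 16_000      # 16 kHz, 16-bit samples, as in the acquisition step
    N_CHANNELS = 7            # one microphone per channel

    @dataclass
    class SoundObject:
        """Hypothetical container for one detected audio event."""
        channel: int
        samples: np.ndarray   # int16 mono signal of the event
        snr_db: float = 0.0

    detected = queue.Queue()       # FIFO between detection and discrimination
    to_asr = queue.Queue()         # FIFO feeding the speech recogniser
    to_classifier = queue.Queue()  # FIFO feeding the sound classifier

    def capture_frame(channel, n=SAMPLE_RATE):
        """Stub: the real system reads frames from the acquisition card."""
        return np.zeros(n, dtype=np.int16)

    def detect_event(frame, threshold=500.0):
        """Crude energy-based detector: return the frame if it is loud enough."""
        return frame if np.abs(frame.astype(np.float32)).mean() > threshold else None

    def is_speech(samples):
        """Stub for the sound/speech discrimination stage."""
        return True

    def detection_worker(channel):
        """Pull frames from one channel, detect events, push SoundObjects."""
        while True:
            event = detect_event(capture_frame(channel))
            if event is not None:
                detected.put(SoundObject(channel, event))

    def discrimination_worker():
        """Route each detected event to the ASR (speech) or to sound classification."""
        while True:
            obj = detected.get()
            (to_asr if is_speech(obj.samples) else to_classifier).put(obj)

    # One detection thread per channel, one shared discrimination thread.
    for ch in range(N_CHANNELS):
        threading.Thread(target=detection_worker, args=(ch,), daemon=True).start()
    threading.Thread(target=discrimination_worker, daemon=True).start()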
Figure 2: Plan of the experimental flat (bathroom, kitchen, study, etc.) showing the positions of the seven microphones (M1 to M7), the windows, the switches, the door switches, the PID IR sensors and the technical closet.
(Laboratoire d'Informatique d'Avignon). Indeed, its 1xRT configuration allows a decoding time similar to the signal duration. Speeral relies on an A* decoder with HMM-based context-dependent acoustic models and trigram language models. HMMs are classical three-state left-right models and state
tying is achieved by using decision trees. Acoustic vectors are
composed of 12 PLP (Perceptual Linear Predictive) coefficients,
the energy, and the first and second order derivatives of these 13
parameters.
The acoustic models of the ASR system were trained on
about 80 hours of annotated speech. Furthermore, acoustic
models were adapted to the speech of 23 speakers recorded in
the same flat during previous experiments by using Maximum
Likelihood Linear Regression (MLLR) [8]. A 3-gram Language
Model (LM) with a 10K lexicon was used. It results from the
interpolation of a generic LM (weight 10%) and a domain LM
(weight 90%). The generic LM was estimated on about 1,000M words from the French newspaper Le Monde and from the Gigaword corpus.
The domain LM was trained on the sentences generated using
the grammar of the application (see Fig. 3). The LM combination biases the decoding towards the domain LM but still allows
decoding of out-of-domain sentences. A probabilistic model was preferred over a strict use of the grammar because it makes it possible to use uncertain hypotheses in a fusion process for more robustness.
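Concretely, for a word w and history h, the interpolated model used for decoding can be written as

    P(w | h) = 0.9 * P_domain(w | h) + 0.1 * P_generic(w | h),

so that in-domain vocal orders receive most of the probability mass while out-of-domain word sequences keep a small but non-zero probability.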
The grammar was built after a user study that showed that
targeted users would prefer precise short sentences over more
natural long sentences [11]. In this study, although most of the older people spontaneously controlled the home by uttering sentences, the majority said they wanted to control the home using keywords. They believed that this mode of interaction would be the quickest and the most efficient. This study also showed that they tended to prefer, or at least to accept, the informal French 'tu' form to communicate with the system, given that the system would be their property.
Figure 3: Excerpt of the grammar of the voice orders, e.g. the basicCmd rule (terminal symbols are in French).
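The full grammar is not reproduced here, but the following toy sketch illustrates how such a keyword-based grammar can also be expanded to generate the training sentences of the domain LM mentioned above (the rules and French terminals below are illustrative examples of ours, not the actual SWEET-HOME grammar):

    from itertools import product

    # Hypothetical simplified rule in the spirit of basicCmd:
    #   basicCmd -> key action object
    KEYS = ["Nestor"]                                    # keyword preceding every order
    ACTIONS = ["allume", "éteins"]                       # turn on / turn off
    OBJECTS = ["la lumière", "la radio", "la télévision"]

    def generate_basic_commands():
        """Expand the toy grammar into every sentence it accepts, e.g. to build
        the training corpus of the domain language model."""
        return [" ".join(parts) for parts in product(KEYS, ACTIONS, OBJECTS)]

    if __name__ == "__main__":
        for sentence in generate_basic_commands():
            print(sentence)       # e.g. "Nestor allume la lumière"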
Given that the bedroom had two lights (the ceiling and the bedside one), as did the kitchen (above the dining table and above the sink), the following four situations were planned:
Table: Audio events detected per speaker and results of the sound/speech discrimination (number of occurrences).

Speaker ID | Speech and sound | Sound | Speech | Misclassified speech | Misclassified sound
S01        |  213             |  184  |   29   |    8                 |    1
S02        |  285             |  212  |   73   |   10                 |    6
S03        |  211             |  150  |   61   |    8                 |    6
S04        |  302             |  211  |   91   |   10                 |   11
S05        |  247             |  100  |   48   |   11                 |    4
S06        |  234             |  189  |   45   |   17                 |    6
S07        |  289             |  216  |   72   |   21                 |    6
S08        |  249             |  190  |   59   |   25                 |    3
S09        |  374             |  283  |   91   |   19                 |    7
S10        |  216             |  163  |   53   |   10                 |    4
S11        |  211             |  155  |   56   |   18                 |    2
S12        |  401             |  346  |   55   |   13                 |   13
S13        |  225             |  184  |   41   |    4                 |    7
S14        |  235             |  173  |   62   |    9                 |   10
S15        |  641             |  531  |  111   |   39                 |   17
S16        |  262             |  216  |   46   |   10                 |    5
ALL        | 4595             | 3503  |  993   |  232                 |  108
In this study, we are only interested in recognizing vocal orders or distress sentences; all other spontaneous sentences and system messages are not relevant. Therefore, the global audio records were annotated using Transcriber in order to extract the syntactically correct vocal orders; the results are shown in Table 2. The average SNR and duration are 15.8 dB and 1 s; this SNR value is low compared to studio conditions (SNR around 35 dB). As the home automation system needs only one correct sentence to interact, only the least noisy channel was kept. The number of vocal orders differs between speakers because, when a vocal order was not correctly recognized, the requested action was not carried out by the intelligent controller (light on or off, curtains up or down, etc.) and the speaker thus often uttered the order two or three times. Thanks to this annotation, an oracle corpus was extracted. The comparison between the experimental real-time results and those obtained with the same ASR on the oracle corpus makes it possible to analyse the performance of the PATSH system.
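The way the per-event SNR is computed is not restated here; a usual estimate, assumed in this discussion, compares the power of the detected event with the power of the surrounding background noise:

    SNR (dB) = 10 * log10(P_event / P_noise),

where P_event is the mean power of the samples inside the detected segment and P_noise the mean power of the signal just before the detection.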
Table 2: Number of syntactically correct vocal orders and average SNR (dB) for each speaker. In total, 443 vocal orders were kept; per-speaker counts range from 19 to 40 orders and per-speaker average SNRs from 11 dB to 25 dB, for an overall average SNR of 15.8 dB.

Table: Word Error Rate (%) for each speaker, in the real-time experiment (Expe.) and on the oracle corpus (Oracle).

Speaker ID | Expe. (%) | Oracle (%)      Speaker ID | Expe. (%) | Oracle (%)
S01        |  35       |  20             S02        |  12.5     |   6.2
S03        |  22.7     |  22.7           S04        |  23       |   7.7
S05        |  15       |   3.8           S06        |  21       |   8.3
S07        |  79       |  52.6           S08        |  30       |  33.3
S09        |  40       |  22.5           S10        |  67       |  47.5
S11        |  46       |  27             S12        |  21       |   7.7
S13        |  43       |  19             S14        |  48       |  29.6
S15        |  71       |  55.5           S16        |  18       |  13.6
Average (all speakers): Expe. 38%, Oracle 23.9%
low performance because they were not able to follow the given instructions: large parts of silence mixed with noise between the words were analysed as phonemes and therefore increased the error rate.
Part of the errors was due to the way PATSH managed simultaneous detections of one sound event. At this stage of the process, the SNR is not known with sufficient precision and the choice of channel is not perfect. Then, in some cases, a part of the speech signal is missed (beginning or end of the order), which leads to poor recognition. Moreover, the detections are very often not perfectly simultaneous, so that more than one channel is analysed by the ASR. Therefore, an improvement was introduced in PATSH for future experiments: it consists in making the decision after the end of detection on the 7 threads (each thread corresponding to one channel), using a 500 ms filtering window. The disadvantage is that the system is slowed down by a delay of 500 ms, but this avoids the recognition of badly extracted sentences and is compensated by the analysis of the signal of the best channel only.
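To make the improvement concrete (all names below are ours; the actual PATSH implementation may differ), the decision can be sketched as follows: detections whose end times fall within 500 ms of each other on the seven channels are assumed to be the same acoustic event, and only the channel with the best SNR is forwarded to the ASR:

    from dataclasses import dataclass
    from typing import List

    WINDOW_S = 0.5   # 500 ms grouping window after the end of a detection

    @dataclass
    class Detection:
        channel: int
        end_time: float   # end of the detected event, in seconds
        snr_db: float
        samples: object   # audio of the event

    def select_best_detections(detections: List[Detection]) -> List[Detection]:
        """Group detections whose end times fall within 500 ms of each other
        (the same acoustic event picked up by several microphones) and keep
        only the channel with the highest SNR in each group."""
        best = []
        for det in sorted(detections, key=lambda d: d.end_time):
            if best and det.end_time - best[-1].end_time <= WINDOW_S:
                if det.snr_db > best[-1].snr_db:   # better channel for the same event
                    best[-1] = det
            else:
                best.append(det)                   # new event
        return best

    # Example: the same order heard on three microphones; only channel 3 is kept.
    dets = [Detection(1, 10.20, 12.0, None),
            Detection(3, 10.25, 19.5, None),
            Detection(6, 10.40, 15.1, None)]
    assert [d.channel for d in select_best_detections(dets)] == [3]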
An important aspect is the decoding time, because the device must be activated with as short a delay as possible. In this experiment, the decoding times reached up to 4 seconds, which was a clear obstacle to usage in real conditions. Fortunately, this has since been reduced.
4. Results
4.1. Discrimination between speech and sounds
The detection part of the system is not specifically evaluated because of the lack of time to label all the sound events on the 7 channels. However, all the results presented take into account the performance of the detection, because the signals are extracted automatically by the system. The sound/speech discrimination misclassified 108 sound and 232 speech occurrences, which gives a total error rate of about 7.4%, in line with other results in the literature [13]. 23.4% of speech occurrences were classified as sound. These poor performances are explained by the fact that PATSH was not successful in selecting the best audio event among the set of simultaneous events, so events with low SNR introduced errors and were not properly discriminated. For the sounds, 3.1% of sound occurrences were classified as speech. Sounds such as dishes, water flow or an electric motor were often confused with speech. For instance, when certain persons stirred their coffee and knocked the spoon against the cup, or when they knocked plates and cutlery together, the emitted sounds had resonant frequencies very close to those of speech. This emphasizes the difficulty of the task, and the models must be improved to handle these problematic samples.
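The discrimination method itself is not detailed in this section; purely for illustration, a minimal GMM-based speech/sound discriminator of the kind commonly used for this task (our own sketch, not the PATSH implementation) could be trained on simple frame-level features:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    FRAME = 512  # samples per analysis frame (32 ms at 16 kHz)

    def frame_features(signal: np.ndarray) -> np.ndarray:
        """Log-energy and zero-crossing rate for each frame of a mono int16 signal."""
        n = len(signal) // FRAME
        frames = signal[: n * FRAME].astype(np.float64).reshape(n, FRAME)
        log_energy = np.log(np.mean(frames ** 2, axis=1) + 1e-9)
        zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
        return np.column_stack([log_energy, zcr])

    def train(speech_signals, sound_signals, n_components=8):
        """Fit one GMM on speech frames and one on everyday-sound frames."""
        gmm_speech = GaussianMixture(n_components=n_components).fit(
            np.vstack([frame_features(s) for s in speech_signals]))
        gmm_sound = GaussianMixture(n_components=n_components).fit(
            np.vstack([frame_features(s) for s in sound_signals]))
        return gmm_speech, gmm_sound

    def is_speech(event: np.ndarray, gmm_speech, gmm_sound) -> bool:
        """Classify a detected event by comparing average log-likelihoods."""
        feats = frame_features(event)
        return gmm_speech.score(feats) > gmm_sound.score(feats)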
The method has also been applied in the same context but with
aged and visually impaired people. The aim was both to validate
the technology with this specific population and to perform a
user study to assess the adequacy of this technology with the
targeted users and to compare it with other user studies in the literature [14, 11].
Between the two experiments, several corrections were applied to PATSH, so that both the sound/speech discrimination and the speech decoding time were greatly improved. The measured decoding time was 1.47 times the sentence duration; as
the average duration of a vocal order was 1.048s, the delay between the end of the utterance and the execution of the order
was 1.55s. This is still not a satisfactory delay but this does not
prevent usage in real conditions.
5.1. Experimental setup
In this experiment, eleven participants, either aged (6 women) or visually impaired (2 women, 3 men), were recruited. The average age was 72 years (min-max: 49-91). The aged persons were
For instance, [15] emphasized that elder Germans tend to utter longer and politer commands than their fellow countrymen, which contrasts with our findings. Although longitudinal studies are required to understand human preferences regarding voice orders, methods to adapt the grammar on-line to the user must be developed.
The acquired corpus made it possible to evaluate the performance of the audio analysis software, but its interest goes far beyond this experiment because it constitutes a precious resource for future work. Indeed, one of the main problems that impede research in this domain is the need for a large amount of annotated data (for analysis, machine learning and reference for comparison). The acquisition of such datasets is highly expensive, both in terms of material and of human resources. For instance, in a previous experiment involving 21 participants in the DOMUS smart home, the acquisition and the annotation of a 33-hour corpus cost approximately 70 k€. Thus, making these datasets available to the research community is highly desirable. This is why we are studying the possibility of making part of it available to the community, as we did in our previous project [16].
2. The participant comes back from shopping and is going to have a nap. She asks the same kind of commands, but in this case a warning alerts her that the front door is not locked.
3. The participant goes to the study to communicate with a relative through the dedicated e-lio system. After the communication, the participant simulates a sudden weakness and calls for help.
4. The participant is waiting in the study for friends who are going to visit her. She tests various voice orders with the radio, the lights and the blinds.
During this experiment, 4 hours and 39 minutes of data were collected from the same sensors as those previously described in Section 3.4.
6. Discussion
Overall, the performance of the system was still low, but the results showed there is room for improvement. Sound/speech discrimination has been improved since the beginning of the experiment and continues to be improved. The biggest problems were the response time, which was unsatisfactory (for 6 participants out of 16), and the misunderstandings by the system, which forced the participants to repeat the order (8/16). These technical limitations were reduced when we improved the ASR memory management and reduced the search space. After this improvement, only one participant with special needs complained about the response time. None of the encountered problems challenged the PATSH architecture. That is why we are studying the possibility of releasing the code publicly.
The grammar was not the focus of the project, but it has been built to be easily adaptable at the word level (for instance, if someone wants to replace 'Nestor' with another word). All 16 participants found the grammar easy to learn. Only four of them found the keyword 'Nestor' unnatural, while the others found it natural and funny. However, this approach suffers from a lack of natural adaptivity to the user's preferences, capacities and culture, as any change would require technical intervention.
7. Conclusion
8. Acknowledgements
9. References
[17] M. Vacher, P. Chahuara, B. Lecouteux, D. Istrate, F. Portet, T. Joubert, M. Sehili, B. Meillon, N. Bonnefond, S. Fabre, C. Roux, and S. Caffiau, "The SWEET-HOME project: Audio processing and decision making in smart home to improve well-being and reliance," in 34th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'13), 2013.