SoK: A Study of the Security of Voice Processing Systems
Robert Chang, Logan Kuo, Arthur Liu, and Nader Sehatbakhsh
S2ysArch Lab, UCLA
Abstract—As the use of Voice Processing Systems (VPS) continues to become more prevalent in our daily lives through the increased reliance on applications such as commercial voice recognition devices and major text-to-speech software, the attacks on these systems are increasingly complex, varied, and constantly evolving. With the use cases for VPS rapidly growing into new spaces and purposes, the potential consequences for privacy are increasingly dangerous. In addition, the growing number and increased practicality of over-the-air attacks have made system failures much more probable. In this paper, we identify and classify an arrangement of unique attacks on voice processing systems. Over the years, research has been moving from specialized, untargeted attacks that result in the malfunction of systems and the denial of services to more general, targeted attacks that can force an outcome controlled by an adversary. The machine learning systems and deep neural networks at the core of modern voice processing systems were built with a focus on performance and scalability rather than security. Therefore, it is critical to reassess the developing voice processing landscape and to identify the state of current attacks and defenses so that we may suggest future developments and theoretical improvements.

Keywords—Adversarial Attacks, Automatic Speaker Identification, Automatic Speech Recognition, Machine Learning, Voice Processing Systems
I. INTRODUCTION

5G is around the corner and the Internet of Things is becoming a reality. Voice assistants like Siri and Alexa can now turn on your lights and answer your questions in the blink of an eye. The assistants are beginning to understand you with higher accuracy and might just stop telling you to repeat yourself. This is all propelled by the advancement of Voice Processing Systems (VPS). However, like any new technology, it might be wise to assess just how safe it is before deciding whether to fully embrace the convenience of executing every command through your voice.

As technologies around Voice Processing Systems rapidly improve in practicality and accuracy, society has gravitated towards smart devices that implement convenient features to assist people's daily lives. A VPS aims to classify, describe, or generate audio from input audio samples, which requires the use of machine learning. Many intelligent devices built on these machine learning models provide features such as voice control and voice recognition that deeply influence our daily lives. Examples include voice recognition passcodes that unlock private environments, as well as commanding the intelligent device to act for us (e.g., making a purchase or playing music). Voice processing systems encourage users to initiate voice commands as a quality-of-life improvement. As we recognize the growing capability and popularity of VPSes and how deeply they can change our lives for the better, the rapid growth of this field has also raised red flags in terms of adversarial attacks and privacy concerns.

With the recent focus on methods to break device security and/or privacy, ranging from password cracking and malware installation to physical side-channels and leakage, VPSes have presented themselves as an interesting new area of focus. Unlike direct defense mechanisms such as multi-factor authentication and application permissions, which were constructed with a focus on security and therefore have effective safeguards in place to slow down attackers, VPSes were designed with a focus on usability and simplicity for their users and therefore do not offer much resistance against new cyber-attacks. In 2017, hidden voice attacks were used to successfully break certain white-box systems by injecting voice commands that were properly processed and understood by VPS machine learning models while remaining inaudible to humans [10]. Since then, new attacks and improvements have been developed which instead target the entirety of the signal processing phase of VPSes, thus generalizing the attack from specific white-box models to black-box systems [6, 8, 13, 14, 18, 19, 20, 21].

As recent works have built upon these attacks by increasing their impact range and accuracy, it has become imperative to better understand the privacy dangers that lie in the use of various VPS technologies [4, 9, 11, 14, 17].

In this work, we present a systematic analysis of the current threats on VPS and study the proposed defense mechanisms against those attacks. Specifically, this paper makes the following contributions:
• A systematic analysis of attack strategies on VPS: To fully understand the existing threats on VPS, we break down the threat model into five different categories and analyze how each category can impact the security and/or privacy of the overall system.
• A taxonomy of proposed defense mechanisms for VPS: By providing a thorough analysis of the state-of-the-art, we systematically categorize the existing defense mechanisms for securing VPS.
• Providing insights on future attack and defense trends in VPS: Using our observations, we share our insights on what the future may look like for VPS security.
This paper is organized such that all relevant information needed to understand the general trends in both the attacks and defenses of voice processing systems is contained in Section II. Section II also highlights the major use cases of voice processing, demonstrates the input flow for a given sound up to the machine learning algorithm stage, and provides supplemental information on general machine learning processing techniques and algorithms that are often used by voice processing systems. Sections III and IV respectively traverse and explain the categories that we used to group related patterns observed in the voice system landscape. Section V then takes the assigned groupings from Sections III and IV and provides additional commentary and new suggestions for future areas of focus and critical sections for improvement. The conclusion is presented in Section VI.

II. BACKGROUND

A. Primary Subsets of Voice Processing Systems (VPS)

VPS are used in two main scenarios, one being Automatic Speech Recognition and the other being Automatic Speaker Identification.
1) Automatic Speech Recognition (ASR): The purpose of ASR is to achieve speech-to-text with a high level of accuracy. The system takes in audio input and translates it into text output for a certain language. ASR is an important aspect of Internet of Things (IoT) systems in which voice assistants, such as Apple Siri, Amazon Alexa, and Google Assistant, are at the center of controlling smart devices. ASR is also commonly used in captioning of live broadcasts on the news, during conferences, and in virtual meetings. Some popular productized ASR services include Google Cloud Speech-to-Text, Dragon NaturallySpeaking, Amazon Transcribe, and Microsoft Azure Speech to Text. Many ASR services rely on a VPS to allow users to fully interact with smart devices.
2) Automatic Speaker Identification (ASI): The purpose of ASI is to match an audio source to its speaker. In addition, ASI is often extended by automatic speaker verification (ASV) systems that decide whether an identity claim on an audio source is true or false. ASI is used for biometric recognition and can be placed in a security system similarly to other biometrics such as fingerprint and iris scanning.

Figure 1: Pipeline diagram demonstrating the VPS functional process. It includes the Audio and Transmission (which we grouped together as Audio Input), Pre-processing, Signal Processing/Feature Extraction, and Machine Learning Inference stages.

B. Primary Functional Blocks of VPS

Modern VPS are split into four functional blocks that convert the audio input to the desired output, whether it be speech-to-text or speaker identification. The first block is the microphone, which converts audio into electric signals. The front-end applies pre-processing functions such as noise filtering, sampling, and digitizing. The main signal processing blocks are designed based on the application of the VPS. The most popular ones operate in the frequency domain and apply the Fast Fourier Transform algorithm to form spectrograms. The spectrogram typically contains information about the frequency, timescale, acoustic frame, and the energy used to produce the sound, and is injected into classifiers for inference. In the following, we describe these four blocks in more detail.
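As a concrete illustration of this front-end, the following is a minimal sketch (not drawn from any of the surveyed systems) of framing a digitized waveform and forming a log-magnitude spectrogram with off-the-shelf scientific Python; the file name, window size, and hop size are illustrative assumptions.

```python
# Minimal front-end sketch: digitized waveform -> log-magnitude spectrogram.
# The spectrogram is the representation fed to downstream classifiers.
import numpy as np
from scipy import signal
from scipy.io import wavfile

rate, audio = wavfile.read("command.wav")      # hypothetical recorded command
audio = audio.astype(np.float32)
if audio.ndim > 1:                             # mix stereo down to mono
    audio = audio.mean(axis=1)

# Short-time Fourier transform: 25 ms windows with a 10 ms hop are
# common (but not universal) choices for speech.
freqs, times, stft = signal.stft(
    audio, fs=rate,
    nperseg=int(0.025 * rate),
    noverlap=int(0.015 * rate))
spectrogram = 20 * np.log10(np.abs(stft) + 1e-10)  # magnitude in dB
print(spectrogram.shape)                           # (freq bins, time frames)
```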
1) Audio Input: The first part of the system is the audio input. Raw audio is injected into the voice processing system either through a microphone or directly into the audio input system. The microphone serves as a tool to convert sound from an analog environment to a digital form. Injecting audio directly into the system makes the input replayable and less subject to environmental volatility. Audio input through the microphone will have its quality impacted by the range of the audio source, noise in the background, echo from the surroundings, and the sensitivity of the microphone hardware.

If adversaries are able to obtain the raw audio input that is supposed to be run through the voice processing system, then they can directly manipulate the input of the system. Such manipulation can be categorized as adversarial attacks, including feature injection and data poisoning. By obtaining information about the input and observing what the system outputs, adversaries can infer the processing applied to the input signal that produces a given output. This gives attackers plenty of room to explore even more vulnerabilities in the system, since they can access the inputs and feed in many attacks and modified raw audio inputs for information gain or malicious activity.

The front-end is also vulnerable to audio jamming through the addition of audio features undetectable by the human ear. The jamming signal significantly raises the noise floor, reducing the signal-to-noise ratio below what is required for proper signal processing.
2) Audio Pre-processing: The second part of the system is the audio pre-processing stage, which aims to extract the important audio information from the raw input. The process involves the removal of unwanted partitions, such as background noise, that lower the resolution quality of the audio input. The result is a cleaner audio file without major signal changes. The block applies analog filtering and audio signal amplification before relaying the signal onward. In addition, noise reduction, harmonic enhancement, and other processes are used to remove unnecessary information from the audio input.

By understanding which parts of the audio signal are deleted or removed, attackers can avoid detection, since they know what is processed or filtered out of the signal before it enters the next processing block. On the flip side, adversaries can also sabotage accurate or important partitions of the audio input if the pre-processing block of the system is compromised. Information can be stolen and even corrupted if adversaries are able to control the algorithms that delete and filter the "unwanted" parts.
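To make the stage concrete, below is a minimal sketch of a pre-processing step, assuming a simple Butterworth high-pass filter for background-noise removal plus peak normalization; production front-ends use far more elaborate noise reduction, and the cutoff frequency here is an illustrative choice.

```python
# Sketch of the pre-processing block: filter out low-frequency background
# noise, then normalize amplitude before the signal-processing stage.
import numpy as np
from scipy.signal import butter, sosfilt

def preprocess(audio: np.ndarray, rate: int) -> np.ndarray:
    # 4th-order high-pass at 100 Hz suppresses rumble and mains hum
    sos = butter(4, 100, btype="highpass", fs=rate, output="sos")
    filtered = sosfilt(sos, audio.astype(np.float32))
    # Peak-normalize so later blocks see a consistent amplitude range
    peak = float(np.max(np.abs(filtered))) or 1.0
    return filtered / peak
```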
3) Signal Processing: The third part of the system is the signal processing block. This block prepares the time-domain signal into features that are ready to be analyzed and inferred upon by the machine learning algorithm. The best signal processing aims to retain and best approximate the audio features that human ears can capture.

One of the most common processing algorithms is the Mel-Frequency Cepstral Coefficient (MFCC). There are also many other techniques, including Mel-Frequency Spectral Coefficients (MFSC), Linear Predictive Coding, Perceptual Linear Prediction, Constant-Q Cepstral Coefficients, and Cochlear Filter Cepstral Coefficients, to extract various aspects of the audio features.
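As an illustration, the following is a minimal sketch of MFCC extraction using the librosa library; the sample rate, FFT size, hop length, and coefficient count are common but illustrative parameter choices, not values prescribed by any particular VPS.

```python
# Sketch of the signal-processing stage: extract MFCC features that the
# machine learning stage will consume.
import librosa

audio, rate = librosa.load("command.wav", sr=16000)  # hypothetical input
mfcc = librosa.feature.mfcc(y=audio, sr=rate,
                            n_mfcc=13,        # 13 coefficients is typical
                            n_fft=512,
                            hop_length=160)   # 10 ms hop at 16 kHz
print(mfcc.shape)                             # (13, number of frames)
```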
This block is also the second most flexible and diverse portion of the pipeline, next to the machine learning algorithm. Defenses rely on this stage to insert detection mechanisms that attackers then try to evade.

4) Machine Learning Inference: The final part of the system is machine learning inference. The machine learning (ML) algorithm depends on the application of the voice processing system. For ASR, the ML aims to use the audio features and correctly assign word labels to the audio input. Popular algorithms often implement a convolutional neural network trained with supervised learning as the inference system. ASI systems use this stage to match the input against features in a database of speakers to infer the identity. An alternative approach is the Hidden Markov Model (HMM), a statistical model structured with numerous layers. Although the number of layers can vary, HMMs typically have at least three layers, each with its own unique goal. The first is focused on the acoustic level, and its job is to determine whether the guessed phoneme, the smallest unit of distinct sound, is actually the correct phoneme that was heard. The second layer is responsible for checking whether multiple phonemes are probabilistically likely to be placed next to one another in an actual word. The third layer then operates at the word level to verify whether multiple predicted words from the second layer make sense placed together in a sentence (both logically and grammatically). If at any point a layer determines that it made a mistake with the previous predictions, it backtracks to the previous layer to select a different choice. Both HMMs and neural networks have their pros and cons, so in some cases hybrid approaches are adopted which utilize both kinds of models to make their predictions.

There is a diverse number of ML algorithms that target different features of the audio signal. Some target temporal dependency [1, 7, 11, 12], high-order frequency behavior [1, 6, 10, 13, 18], baseband aliases [1, 14], sound perturbation [2, 4, 5, 9, 17, 18], mechanical vibrations [8], etc. Additionally, ML may also extract different parameters from these features, including duration and power signatures.

Attackers have since moved from requiring white-box information to creating algorithms that are effective with black-box information. In the next section, we provide our detailed analysis of the existing attack vectors on a VPS.

Figure 2: Iterative attack model consisting of a feedback loop. If the output of the target VPS is not the one wanted by the adversary, the audio signal is reprocessed and the VPS is attacked again.

III. ATTACK MODEL

Attacks on VPS are best understood in relation to the context of the victim system: what exactly is being done and where the vulnerabilities are. To break down the details of each attack for fair analysis, it is important to define categories to be used in comparisons. Specifically, we define five different categories that distinguish the existing attacks. In the following, we describe them in more detail.
A. Consequences

The consequence of a successful attack depends on what the original system was meant to do versus what the compromised system actually does. There are a few main types of attacks, which we will call denial of service, targeted control of service, and leakage of information from the service.

1) Denial of Service: Denial of service can happen in many forms, but all symptoms of the attack culminate in the scenario where a legitimate user is attempting to use the VPS, but the result is incorrect. For example, the input is not captured due to jamming, and therefore the system simply cannot output the correct inference [6, 21].

2) Targeted Control of Service: Targeted control of service is the scenario where the attacker is able to manipulate the input to the VPS and obtain a targeted output from the system. This can happen with or without the presence of legitimate user commands. A popular attack is known as the hidden voice attack, which broadcasts sound incomprehensible to humans but recognized as a proper command by the VPS [6, 8, 10, 13, 14, 16, 18, 19, 20].

3) Leakage of Information: Leakage of information is best described as dependent on the implementation of the VPS. The purpose of this attack is to collect voice information that is being processed by the system and to fingerprint the usage data and behavior of the legitimate user. This type of attack is most feasible when the model inference needs to happen in the cloud and audio information may be intercepted or reconstructed with side-channel attacks at any point in the VPS [3].
B. Domain

Attacks vary based on their purpose and malicious intent, as seen in Section III-A, whether it be denial of service, system control override, or leakage of information. Due to the large number of varying attacks with different purposes, there exist many ranges and domains that each attack traverses. We classify the traversal domains into two main groups.

1) Physical Attacks: When attackers use physical attacks, the attack from the physical audio source is first transmitted through an open channel in the environmental air, which is known as an over-the-air attack, before reaching the device side. Because the attack is transmitted through the air into the device, environmental factors in the surrounding perimeter, such as background noise, may affect the transmission of the audio signal. For physical attacks, it is important to note the distance the audio signal travels: as the transmission distance increases, the strength of the signal decreases. This imposes a constraint on the maximum distance at which a device can still capture enough energy from the signal. Another important factor is the surrounding environment itself. The input signal is reflected across many surfaces, so the captured signal on the device is a composition of signals instead of the original raw signal. Because of this mixed signal composition, adversaries can exploit this fact and broadcast their attacks while hiding them within human voice by overlaying the signals. Such an attack can compromise VPS, especially systems with ASR functionality, since they would misinterpret voice commands and function as the attacker intended. Therefore, many factors must be considered for physical, over-the-air attacks against voice processing systems in terms of the physical space and environment.

2) Digital Attacks: Digital attacks can start as early as the pre-processing stage, where attacks can be injected into the module. If attackers can figure out signal properties and patterns, as well as the desired commands to use in an attack, then they can construct a raw audio signal that preserves the features and commands while being incomprehensible to the human ear. This is usually done by decoding parts of the signal based on the loss that the pre-processing module introduces. From the losses that the pre-processing module has extracted, attackers can map these losses to the actual acoustic features and thus get an idea of the architecture and structure of the voice processing module. Other digital attack strategies include learning the voice activity detection threshold, which determines whether an audio segment is considered a command or not. If attackers can find the voice activity detection value, they can optimize their attacks to compromise the VPS, as the sketch below illustrates. Many other digital attack models can be explored, since signal processing itself is very broad and there is a high chance that attacks can pass through this module.
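To ground this, here is a minimal sketch of an energy-based voice activity detector of the kind whose threshold an attacker might probe; real systems use more sophisticated detectors, and the threshold value is purely illustrative.

```python
# Sketch of an energy-based voice activity detection (VAD) check.
import numpy as np

def is_voice_active(frame: np.ndarray, threshold_db: float = -40.0) -> bool:
    """Treat a frame as a command candidate if its level exceeds the threshold."""
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2)) + 1e-12
    level_db = 20 * np.log10(rms)   # RMS level relative to full scale (1.0)
    return level_db > threshold_db

# An attacker who learns threshold_db can shape a signal that stays just
# above it (so the VPS processes it) while minimizing audibility.
```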
C. Generality

Generality defines how well the attack applies, or could be applied, to various VPSes and system configurations.

1) Universal: The attack is shown to work in both ASR and ASI systems. The type of attack is also likely extendable to all commercial datasets and machine learning models.

2) Specific: The attack is targeted at a particular group of individual VPSes and is likely not effective against systems outside of the targeted group.

D. Knowledge of the Victim System

Knowledge of the victim system describes the amount of information about the architecture and VPS implementation that the attacker needs to implement his/her attack.

1) Black Box: The attacker does not need information about the system beyond knowing that it is a voice processing system.

2) White Box: The attacker requires information about the system components, setup, ML algorithm, and potentially the model parameters as well.

E. Attack Mechanism

Attack mechanism describes the actual implementation of each attack and what area of vulnerability the attack exploits.
TABLE I
Attack Model Labeling

Ref. | Year | Consequences | Domain | Generality | Knowledge | Attack Mechanism
[1] | 2020 | Targeted Control of Service | Digital | Specific | Black Box | Speaker Impersonation
[2] | 2019 | Denial of Service | Digital | Specific | Black Box | Synthetic Speech
[3] | 2020 | Information Leakage | Physical | Specific | Black Box | VPS Side-Channel
[4] | 2020 | Denial of Service | Physical | Specific | White Box | Hidden Voice Command
[5] | 2018 | Denial of Service | Digital | Specific | White Box | Synthetic Speech
[6] | 2017 | Targeted Control of Service, Denial of Service | Physical | Universal | Black Box | Hidden Voice Command
[8] | 2019 | Information Leakage, Targeted Control of Service | Physical | Universal | Black Box | Hidden Voice Command
[9] | 2020 | Targeted Control of Service | Physical, Digital | Universal | Black Box | Synthetic Speech
[10] | 2017 | Targeted Control of Service, Denial of Service | Physical | Universal | White Box | Hidden Voice Command
[11] | 2019 | Denial of Service | Digital | Universal | Black Box | Synthetic Speech
[13] | 2017 | Targeted Control of Service, Denial of Service | Physical | Universal | Black Box | Hidden Voice Command
[14] | 2018 | Targeted Control of Service | Physical | Universal | Black Box | Hidden Voice Command
[16] | 2019 | Targeted Control of Service | Physical | Specific | White Box | Hidden Voice Command
[17] | 2020 | Targeted Control of Service | Physical, Digital | Specific | White Box | Synthetic Speech
[18] | 2019 | Targeted Control of Service | Physical, Digital | Universal | Black Box | Hidden Voice Command
[19] | 2020 | Denial of Service, Targeted Control of Service | Digital | Universal | Black Box, White Box | Hidden Voice Command, Synthetic Speech
[20] | 2021 | Targeted Control of Service, Denial of Service | Physical, Digital | Universal, Specific | Black Box, White Box | Hidden Voice Command, Synthetic Speech
[21] | 2019 | Denial of Service | Physical | Universal | Black Box | Hidden Voice Command
1) Hidden Voice Command: Hidden Voice Command is a family of attack mechanisms against VPS that exploits sounds not recognized by humans but recognized by the VPS. This is achieved through various means, including Time Domain Inversion [18], Random Phase Generation [18], High Frequency (ultrasonic) carriers [6, 10, 13, 18], Time Scaling [18], etc. Hidden voice commands are usually manufactured from scratch without any base sample voice input. This typically targets the signal pre-processing and processing stages.
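As a conceptual sketch of the high-frequency (ultrasonic) mechanism, the snippet below amplitude-modulates a baseband command onto an ultrasonic carrier, in the spirit of DolphinAttack-style attacks [10]; microphone nonlinearity demodulates the signal while humans cannot hear the carrier. The carrier frequency is illustrative, and actually transmitting such a signal requires ultrasonic-capable hardware.

```python
# Conceptual sketch: hide a voice command on an ultrasonic carrier via
# classic amplitude modulation (AM).
import numpy as np

def modulate_ultrasonic(command: np.ndarray, rate: int,
                        carrier_hz: float = 30_000.0) -> np.ndarray:
    """AM-modulate a baseband command onto an ultrasonic carrier.

    `rate` must be well above 2 * carrier_hz (e.g., 96 kHz) for the
    result to be representable.
    """
    t = np.arange(len(command)) / rate
    carrier = np.cos(2 * np.pi * carrier_hz * t)
    normalized = command / np.max(np.abs(command))
    # Carrier plus command-scaled carrier (modulation depth of 1)
    return (1.0 + normalized) * carrier
```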
2) Synthetic Speech: Synthetic Speech, also known as the adversarial-sample family of attack mechanisms, tries to generate modified audio samples from valid voice command input samples that would fool the VPS or transfer its control to the attacker. A common theme is digitally manipulating input samples and then digitally transmitting them so that the VPS makes errors and outputs different results. This typically attacks the machine learning stage.
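To illustrate the mechanism, here is a minimal sketch of a gradient-based perturbation loop in the spirit of [5]; `model` stands in for any differentiable audio classifier taking a `(1, samples)` waveform, the attack assumes white-box gradient access, and the step count and perturbation bound are illustrative.

```python
# Sketch of synthetic-speech (adversarial-sample) generation: iteratively
# nudge a valid command toward an attacker-chosen label while keeping the
# perturbation small enough to sound benign.
import torch

def adversarial_audio(model, audio: torch.Tensor, target_label: int,
                      epsilon: float = 0.002, steps: int = 100) -> torch.Tensor:
    delta = torch.zeros_like(audio, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=1e-3)
    for _ in range(steps):
        logits = model(audio + delta)            # (1, num_classes)
        loss = torch.nn.functional.cross_entropy(
            logits, torch.tensor([target_label]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():                    # keep perturbation inaudible
            delta.clamp_(-epsilon, epsilon)
    return (audio + delta).detach()
```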
3) Speaker Impersonation: Speaker impersonation is a family of attack mechanisms that tries to fool ASI systems by mimicking audio features present in the legitimate owner's voice. There are a few types of attack. The first is human impersonation, where a malicious attacker pretends to sound like a legitimate speaker. The second is synthetic speech, which generates voice samples that mimic the original speaker. The third is voice conversion, which records an attacker's audio sample and attempts to convert it so its features are similar to the victim's voice. The fourth type is a replay attack, which records audio samples from the victim and plays them against the ASI system.

4) VPS Side-Channel: VPS Side-Channel is a family of attack mechanisms that ultimately tries to obtain potentially sensitive information. The leakage is based on the implementation of a computer system. By observing and gathering data using side-channels, an adversary can combine this intel with unhidden voice attacks, exploiting the gained information to compromise system security. For example, voice data leaked from side-channels can be used to redirect access control of a system to an adversary's node, enabling malicious unauthorized commands from the attacker.
IV. DEFENSE MODEL

Defense models are usually built from previously created adversarial attacks on certain open-source datasets, meaning that each is specialized in preventing a known type of attack. This means that there is no single defense system that can withstand every known attack based on one particular model and framework. This part of the paper gathers specialized defense mechanisms and describes their capabilities against certain adversarial attacks.

A. Mitigated Attacks

An important trait for analyzing and comparing different defense models is the type of attack mechanisms that the defense is able to mitigate or was engineered to counteract. Such attack mechanisms for emerging VPS attacks can vary greatly; therefore, we apply the same attack mechanism categories that we used previously in the paper. For definitions and explanations of the following labels, see Section III-E.

1) Hidden Voice Command
2) Synthetic Speech
3) Speaker Impersonation
4) VPS Side-Channel
B. Domain of Defense Implementation

Since attacks can be very situational and instantaneous, whether physical attacks transmitted over the air or digital attacks through signal processing, there are many different implementations of defense mechanisms. This paper discusses (1) hardware implementation and (2) software implementation defense mechanisms that are prevalent in protecting against certain attacks.

1) Hardware Implementation: These implementations tend to be more secure and robust against adversarial attacks, since dedicated hardware security components in devices carry many attack prevention mechanisms. Hardware implementations are also almost always more efficient and faster performing. However, hardware-implemented defense mechanisms are more financially costly, resource intensive, and time demanding due to their tangibility and specialized functionality. Some examples of hardware defenses include microphones with the capability to recognize processed malicious signals, or microphone jamming hardware that prevents audio input when voice commands are not wanted in certain situations [10, 21].

2) Software Implementation: The majority of defense mechanisms are software implementations. Software-implemented defense mechanisms have some universality to them, since there are only a handful of ways to exploit software through software. Software-based defenses are usually slower performing than hardware-based ones, but faster to develop. Software defense mechanisms tend to focus on increasing the robustness of machine learning models, as well as introducing algorithms that can detect differences and anomalies in the data. One example of a software-implemented defense is pre-training an observed neural network on datasets that include adversarial attacks as well as random clean samples, so that the neural network becomes wary of such adversarial attacks, much like a vaccine; a sketch of this idea follows below. Software mechanisms, as we can see, are focused on pattern recognition and attack identification with algorithms and programs preset before system operation [1, 7, 10, 12, 14, 15].
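The following is a minimal sketch of that vaccine-style idea, assuming a PyTorch classifier and a `make_adversarial` helper that stands in for any attack generator (such as the gradient loop sketched in Section III-E); it is an illustration of adversarial training, not a drop-in defense.

```python
# Sketch of adversarial ("vaccine") pre-training: train on clean samples
# and attack-perturbed samples so the model tolerates both.
import torch

def adversarial_training_step(model, optimizer, clean_batch, labels,
                              make_adversarial):
    adv_batch = make_adversarial(model, clean_batch, labels)  # hypothetical helper
    for batch in (clean_batch, adv_batch):
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(batch), labels)
        loss.backward()
        optimizer.step()
```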
C. Generality

This section classifies defense mechanisms based on how wide a variety of platforms they apply to or can potentially be applied to. These classifications are based on experimental results against certain adversarial attacks.

1) Universal: These defense mechanisms appear implementable across a common factor of products, meaning that they can protect many differing datasets with a universal hardware component, a universal algorithm, or a machine learning model.

2) Specific: Specific defense mechanisms report results on certain datasets and have not yet had those results replicated on other datasets. More tests are needed to increase either the generality of the mechanism or the practicality of its implementation.

Figure 3: The architecture analyzes the clean input and adversarial input through a neural network and compares their temporal dependency to find discrepancies.

TABLE II
Defense Model Labeling

Ref. | Year | Domain | Generality | Mitigated Attacks | Detected Behavior | Defense Mechanism
[7] | 2019 | Software | Universal | Synthetic Speech, Hidden Voice Command, Speaker Impersonation | Detectable ML Misbehavior – Temporal Dependency | Anomaly Pattern Detection – ML Classification, Input Transformation
[10] | 2017 | Hardware | Specific | Hidden Voice Command | Detectable Frequency Corruption – High Frequency | Input Transformation
[10] | 2017 | Software | Universal | Hidden Voice Command | Detectable Frequency Corruption – High Frequency | Anomaly Pattern Detection – ML Classification
[15] | 2019 | Software | Universal | Synthetic Speech, Hidden Voice Command, Speaker Impersonation | Detectable ML Misbehavior – ML Activation | Anomaly Pattern Detection – ML Classification
D. Defense Mechanism

1) Anomaly Pattern Detection: This technique first profiles the expected behavior of the VPS on clean inputs to build a reference distribution. At runtime, the defense compares the VPS behavior on the current input with the expected behavior from the prebuilt distribution. The monitored properties often come in the form of frequency spectrums or ML activations. In terms of frequency spectrums, the defense typically analyzes power or frequency patterns that fall outside normal audio ranges.
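A minimal sketch of such a frequency-spectrum check appears below: it measures the fraction of signal power outside the normal speech band and flags inputs that deviate from a profile built on clean samples. The band edge and deviation threshold are illustrative assumptions.

```python
# Sketch of anomaly pattern detection on frequency spectrums: compare
# out-of-band power against a distribution profiled from clean inputs.
import numpy as np

def out_of_band_power(audio: np.ndarray, rate: int,
                      band_hz: float = 8000.0) -> float:
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / rate)
    return spectrum[freqs > band_hz].sum() / (spectrum.sum() + 1e-12)

def is_anomalous(audio, rate, clean_mean, clean_std, k: float = 3.0) -> bool:
    """Flag inputs whose out-of-band power deviates from the clean profile."""
    return abs(out_of_band_power(audio, rate) - clean_mean) > k * clean_std
```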
2) Input Transformation: This technique was first explored in the image domain, where input transformation is a common method to prevent image adversarial attacks. The main strategy is to remove the adversarial and hidden contents of audio samples. A few techniques include quantization, downsampling, autoencoding, local smoothing, and frequency noise canceling. All of these techniques aim to remove noise from the audio sample and retain only the semantic portion of the input. These methods assume that attacks will thus be removed directly, omitting the need for detection.
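Below is a minimal sketch of two of the transformations named above, quantization and a downsample/upsample round trip, which discard fine-grained perturbations while keeping the semantic content; the bit depth and rates are illustrative, and the input is assumed to be normalized to [-1, 1].

```python
# Sketch of input-transformation defenses: quantization and resampling.
import numpy as np
from scipy.signal import resample_poly

def quantize(audio: np.ndarray, bits: int = 8) -> np.ndarray:
    """Round samples (assumed in [-1, 1]) to a coarse amplitude grid."""
    half_levels = 2 ** (bits - 1)
    return np.round(audio * half_levels) / half_levels

def downsample_upsample(audio: np.ndarray, rate: int = 16000,
                        low_rate: int = 8000) -> np.ndarray:
    """Round-trip through a lower sample rate to strip fine perturbations."""
    down = resample_poly(audio, low_rate, rate)
    return resample_poly(down, rate, low_rate)
```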
V. DISCUSSION

A. Trends of Attack

We have classified the main consequences of recent attacks into denial of service, targeted control of service, and information leakage attacks.

1) Denial of Service Attacks: Firstly, we have seen early papers describe attacks that aim to deny the service of some audio device or machine learning network. Denial of service attacks seem to be more easily conducted than other attacks, since the goal is simply to induce an error that denies the service of the targeted device. The error that the attack aims to return can be regarded as a garbage value, which is sufficient for attacks that only require denial of service.

2) Targeted Control of Service Attacks: On the other hand, targeted control of service attacks require much more precision. Instead of only returning a dummy value to deny the service, targeted control of service attacks need to find some way to gain exclusive control, meaning some transfer of control needs to occur in the victim system. Attacks that result in sensitive information leaks enable targeted control of service attacks. Since targeted control of service attacks require a way to gain access to or control of the victim system, these adversarial attacks must first acquire some prior knowledge in order to succeed effectively.

3) Information Leakage Attacks: Information leakage attacks are the first step of more complicated and complex attacks, since knowing more about the victim system results in a better understanding of the defenses that adversarial attacks must avoid and bypass. For attacks to be more realistic and practical, they should also remain hidden from normal device users, which adds another layer of complexity to any type of adversarial attack.

B. Practicality of Attacks

The practicality of attacks has also been improving over the years. Many of the early attacks aimed to demonstrate proofs of concept for vulnerability exploits. As research on these types of attacks matures, researchers have begun to expand their capabilities. For example, early synthetic speech attacks required white-box knowledge [5, 10], but recently more attacks have been conducted in a black-box setting [2, 9]. In addition, early attacks abstracted away the real-world setting by digitally injecting attacks directly into the VPS [2, 5, 10]. In contrast, more recent papers [9, 17] have demonstrated successful synthetic speech attacks on VPS over the air. Similar trends can be seen in research on hidden voice attacks, speaker impersonation attacks, and VPS side-channel attacks.
C. Trends of Defense

Having identified trending attacks, we also discuss some defense mechanisms that are currently being developed. We see trends in defending against synthetic speech adversarial attacks through pattern recognition and anomaly detection. What we have concluded is that no matter how small a feature injected by an attack, or how subtly a signal is changed to form an attack, there is still a discrepancy to be found within the original audio signal. Adversarial attacks and defense protocols are always competing against each other, with adversarial attacks always having the first move due to their freedom in choosing what type of system to attack and the methodology.

This is compounded by the fact that security and defense mechanisms have always been a second priority when developing a system or device. Some specific defense mechanisms we have gathered include input transformation techniques used to observe temporal dependency, as well as pre-training machine learning systems to account for adversarial attacks in order to detect extreme anomalies; a sketch of the temporal-dependency check appears below.

As discussed above, a sufficient number of existing adversarial attacks is needed to properly train a machine learning system to counter these attacks through anomaly detection. Input classification methods may be more costly, since they would transform all audio signal data and then rely on observing the temporal dependency of the data set in order to reach a consensus between attacks and clean audio input. The methods we have outlined focus on pattern recognition and always compare an adversarial attack using some metric in order to differentiate the two.

Some future defense mechanisms that we encourage involve embedded systems, since these types of defense mechanisms can be more secure. Even though hardware implementations like an embedded chip can be more costly, they may be able to prevent some types of attacks by providing a layer of security that adversarial attacks are not able to reach.
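As an illustration, here is a minimal sketch of the temporal-dependency check from [7]: transcribe a prefix of the audio and the whole audio, then compare the partial transcription against the corresponding prefix of the full one; adversarial audio tends to lose this consistency. `transcribe` stands in for any ASR system, and the split fraction and similarity threshold are illustrative.

```python
# Sketch of the temporal-dependency defense: clean audio transcribes
# consistently on prefixes; adversarial audio usually does not.
from difflib import SequenceMatcher

def temporal_dependency_check(audio, transcribe, k: float = 0.5,
                              threshold: float = 0.8) -> bool:
    """Return True if the input looks clean, False if likely adversarial."""
    split = int(len(audio) * k)
    partial_text = transcribe(audio[:split])
    full_prefix = transcribe(audio)[:len(partial_text)]
    similarity = SequenceMatcher(None, partial_text, full_prefix).ratio()
    return similarity >= threshold
```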
D. Practicality of Defenses

As of now, a single universal defense mechanism that can prevent a large general field of attacks, such as audio attacks, does not seem possible, since adversarial attacks in a given field can use many different mechanisms and techniques, as discussed above. We can now see that the reason defenses always come after attacks is that adversarial attacks need to be prevalent in order for defense mechanisms to prevent those specific attacks. Generalizing and grouping defense mechanisms is still a work in progress. In addition, there is a lack of evidence regarding how easily defense mechanisms can be deployed on commercial systems, since the authors of defense papers often implement a custom system to verify performance.
E. Comparison to Other Machine Learning Systems

As mentioned throughout this paper, machine learning, and especially neural networks, is a key part of many existing VPSs. While ML generally provides a significant improvement for these speech recognition systems, it also makes them vulnerable to the existing threats on ML systems, collectively referred to as adversarial machine learning. The attacks in this domain are typically categorized into (1) poisoning attacks, where the adversary is able to inject malicious data into the model's training pool, causing it to learn something it shouldn't [22-25]; (2) evasion attacks, where the adversary, by crafting noisy samples (aka adversarial samples), fools the ML model into incorrectly labeling the given input [25-29]; and (3) privacy attacks, where the adversary steals some sensitive information from the model. This information could be obtained by extracting the model [30-32] or inverting it [33], its input data [35], or some aspects of it [34].
Compared to other applications of machine learning, such as image recognition and vision, to the best of our knowledge there are no known privacy and/or poisoning attacks on VPSs. However, we believe this is a very promising research direction for the future, as VPSs share a similar threat model with other ML models such as vision. Possible attack scenarios include reverse-engineering the VPS in order to extract its ML model, poisoning datasets (e.g., voice command samples) to create backdoors, etc.

In contrast, as described in this paper, there are a number of evasion attacks on VPS. Most of the literature on evasion attacks, though, has focused on computer vision. In that context, adversarial samples arise from introducing perturbations to an image that do not alter its semantics (as determined by human observers) but lead machine learning models to misclassify the image.

In the context of audio, however, evasion attacks can happen by manipulating the input sound through added perturbations (e.g., inaudible signals, silence, or noise), some of which are also common in other ML systems (e.g., adding noise), but some of which are unique to VPS (e.g., inaudible signals).

Besides requiring different methods for creating adversarial samples, successful evasion attacks are typically more difficult to carry out on VPSs. Although sending adversarial samples to a VPS is relatively simple (since VPS are typically more accessible than other ML systems), conducting successful attacks is much more challenging, because these ML systems normally rely on features extracted by the signal processing stage as inputs rather than raw data.

VI. CONCLUSION

In this paper we have systematically analyzed state-of-the-art attacks against VPS. Existing attacks are becoming more practical and more dangerous. In addition, new avenues of attack that leak information are gradually being explored. Research on new areas of attack starts by being creative and then looks to become more efficient, effective, and general. On the other hand, many of the defenses have been tightly coupled with the specific attacks presented in the research. More generic defense mechanisms are mainly being explored to target the more popular attack mechanisms. All in all, further advancement in both VPS attacks and defenses is likely to be propelled by better machine learning algorithms. With the improvements over the years, state-of-the-art defenses and attacks are increasingly practical on a much wider range of VPSes. It will be important for manufacturers to keep up with the latest developments to implement adequate countermeasures to protect end users.

REFERENCES

[1] M. R. Kamble, H. B. Sailor, H. A. Patil, and H. Li, "Advances in anti-spoofing: from the perspective of ASVspoof challenges," APSIPA Transactions on Signal and Information Processing, vol. 9, p. e2, 2020.
[2] S. Khare, R. Aralikatte, and S. Mani, "Adversarial Black-Box Attacks on Automatic Speech Recognition Systems using Multi-Objective Evolutionary Optimization," arXiv:1811.01312 [cs], Jul. 2019. [Online]. Available: http://arxiv.org/abs/1811.01312.
[3] D. Caputo, L. Verderame, A. Merlo, A. Ranieri, and L. Caviglione, "Are you (Google) Home? Detecting Users' Presence through Traffic Analysis of Smart Speakers," p. 14.
[4] K.-H. Chang, P.-H. Huang, H. Yu, Y. Jin, and T.-C. Wang, "Audio Adversarial Examples Generation with Recurrent Neural Networks," in 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan. 2020, pp. 488–493, doi: 10.1109/ASP-DAC47756.2020.9045597.
[5] N. Carlini and D. Wagner, "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text," in 2018 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, May 2018, pp. 1–7, doi: 10.1109/SPW.2018.00009.
[6] N. Roy, H. Hassanieh, and R. Roy Choudhury, "BackDoor: Making Microphones Hear Inaudible Sounds," in Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, Niagara Falls, New York, USA, Jun. 2017, pp. 2–14, doi: 10.1145/3081333.3081366.
[7] Z. Yang, B. Li, P.-Y. Chen, and D. Song, "Characterizing Audio Adversarial Examples Using Temporal Dependency," arXiv:1809.10875 [cs, eess, stat], Jun. 2019. [Online]. Available: http://arxiv.org/abs/1809.10875.
[8] C. Wang, S. A. Anand, J. Liu, P. Walker, Y. Chen, and N. Saxena, "Defeating hidden audio channel attacks on voice assistants via audio-induced surface vibrations," in Proceedings of the 35th Annual Computer Security Applications Conference, San Juan, Puerto Rico, USA, Dec. 2019, pp. 42–56, doi: 10.1145/3359789.3359830.
[9] Y. Chen et al., "Devil's Whisper: A General Approach for Physical Adversarial Attacks against Commercial Black-box Speech Recognition Devices," p. 19.
[10] G. Zhang, C. Yan, X. Ji, T. Zhang, T. Zhang, and W. Xu, "DolphinAttack: Inaudible Voice Commands," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Oct. 2017, pp. 103–117, doi: 10.1145/3133956.3134052.
[11] H. Abdullah et al., "Hear 'No Evil', See 'Kenansville': Efficient and Transferable Black-Box Attacks on Speech Recognition and Voice Identification Systems," arXiv:1910.05262 [cs, eess], Oct. 2019. [Online]. Available: http://arxiv.org/abs/1910.05262.
[12] V. Akinwande, C. Cintas, S. Speakman, and S. Sridharan, "Identifying Audio Adversarial Examples via Anomalous Pattern Detection," arXiv:2002.05463 [cs, eess, stat], Jul. 2020. [Online]. Available: http://arxiv.org/abs/2002.05463.
[13] L. Song and P. Mittal, "Inaudible Voice Commands," arXiv:1708.07238 [cs], Aug. 2017. [Online]. Available: http://arxiv.org/abs/1708.07238.
[14] N. Roy, S. Shen, H. Hassanieh, and R. R. Choudhury, "Inaudible Voice Commands: The Long-Range Attack and Defense," p. 14.
[15] Z. Xu, F. Yu, and X. Chen, "LanCe: A Comprehensive and Lightweight CNN Defense Methodology against Physical Adversarial Attacks on Embedded Multimedia Applications," arXiv:1910.08536 [cs], Oct. 2019. [Online]. Available: http://arxiv.org/abs/1910.08536.
[16] J. Szurley and J. Z. Kolter, "Perceptual Based Adversarial Audio Attacks," arXiv:1906.06355 [cs, eess], Jun. 2019. [Online]. Available: http://arxiv.org/abs/1906.06355.
[17] Z. Li, C. Shi, Y. Xie, J. Liu, B. Yuan, and Y. Chen, "Practical Adversarial Attacks Against Speaker Recognition Systems," in Proceedings of the 21st International Workshop on Mobile Computing Systems and Applications, Austin, TX, USA, Mar. 2020, pp. 9–14, doi: 10.1145/3376897.3377856.
[18] H. Abdullah, W. Garcia, C. Peeters, P. Traynor, K. R. B. Butler, and J. Wilson, "Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems," arXiv:1904.05734 [cs, eess], Mar. 2019. [Online]. Available: http://arxiv.org/abs/1904.05734.
[19] T. Du, S. Ji, J. Li, Q. Gu, T. Wang, and R. Beyah, "SirenAttack: Generating Adversarial Audio for End-to-End Acoustic Systems," in Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, Taipei, Taiwan, Oct. 2020, pp. 357–369, doi: 10.1145/3320269.3384733.
[20] Y. Chen et al., "SoK: A Modularized Approach to Study the Security of Automatic Speech Recognition Systems," arXiv:2103.10651 [cs, eess], Mar. 2021. [Online]. Available: http://arxiv.org/abs/2103.10651.
[21] Y. Chen et al., "Understanding the Effectiveness of Ultrasonic Microphone Jammer," arXiv:1904.08490 [cs, eess], Apr. 2019. [Online]. Available: http://arxiv.org/abs/1904.08490.
[22] M. Jagielski, A. Oprea, B. Biggio, C. Liu, C. Nita-Rotaru, and B. Li, "Manipulating machine learning: Poisoning attacks and countermeasures for regression learning," in 2018 IEEE Symposium on Security and Privacy (SP), 2018, pp. 19–35.
[23] O. Suciu, R. Marginean, Y. Kaya, H. Daume III, and T. Dumitras, "When does machine learning FAIL? Generalized transferability for evasion and poisoning attacks," in 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 1299–1316.
[24] S. Alfeld, X. Zhu, and P. Barford, "Data poisoning attacks against autoregressive models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
[25] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in 2017 IEEE Symposium on Security and Privacy (SP), 2017, pp. 39–57.
[26] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli, "Evasion attacks against machine learning at test time," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2013, pp. 387–402.
[27] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," arXiv preprint arXiv:1706.06083, 2017.
[28] J. Su, D. V. Vargas, and K. Sakurai, "One pixel attack for fooling deep neural networks," IEEE Transactions on Evolutionary Computation, vol. 23, no. 5, pp. 828–841, 2019.
[29] N. Carlini and D. Wagner, "Adversarial examples are not easily detected: Bypassing ten detection methods," in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, 2017, pp. 3–14.
[30] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, "Stealing machine learning models via prediction APIs," in 25th USENIX Security Symposium (USENIX Security 16), 2016, pp. 601–618.
[31] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, "Practical black-box attacks against machine learning," in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 2017, pp. 506–519.
[32] B. Wang and N. Z. Gong, "Stealing hyperparameters in machine learning," in 2018 IEEE Symposium on Security and Privacy (SP), 2018, pp. 36–52.
[33] M. Fredrikson, S. Jha, and T. Ristenpart, "Model inversion attacks that exploit confidence information and basic countermeasures," in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 2015, pp. 1322–1333.
[34] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, "Membership inference attacks against machine learning models," in 2017 IEEE Symposium on Security and Privacy (SP), 2017, pp. 3–18.
[35] C. Juvekar, V. Vaikuntanathan, and A. Chandrakasan, "GAZELLE: A low latency framework for secure neural network inference," in 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 1651–1669.