
Neural NILM: Deep Neural Networks Applied to Energy Disaggregation

Jack Kelly (jack.kelly@imperial.ac.uk) and William Knottenbelt (w.knottenbelt@imperial.ac.uk)
Department of Computing, Imperial College London, 180 Queen's Gate, London, SW7 2RH, UK

This is the authors' version of the work. Copyright is held by the authors. The definitive version was published in ACM BuildSys'15, November 4–5, 2015, Seoul. DOI: 10.1145/2821650.2821672. (arXiv:1507.06594v3 [cs.NE], 28 Sep 2015.)

ABSTRACT

Energy disaggregation estimates appliance-by-appliance electricity consumption from a single meter that measures the whole home's electricity demand. Recently, deep neural networks have driven remarkable improvements in classification performance in neighbouring machine learning fields such as image classification and automatic speech recognition. In this paper, we adapt three deep neural network architectures to energy disaggregation: 1) a form of recurrent neural network called 'long short-term memory' (LSTM); 2) denoising autoencoders; and 3) a network which regresses the start time, end time and average power demand of each appliance activation. We use seven metrics to test the performance of these algorithms on real aggregate power data from five appliances. Tests are performed against a house not seen during training and against houses seen during training. We find that all three neural nets achieve better F1 scores (averaged over all five appliances) than either combinatorial optimisation or factorial hidden Markov models and that our neural net algorithms generalise well to an unseen house.

Categories and Subject Descriptors

I.2.6 [Artificial Intelligence]: Learning - Connectionism and neural nets; I.5.2 [Pattern Recognition]: Design Methodology - Pattern analysis, Classifier design and evaluation

Keywords

Energy disaggregation; neural networks; feature learning; NILM; energy conservation; deep learning

1. INTRODUCTION

Energy disaggregation (also called non-intrusive load monitoring or NILM) is a computational technique for estimating the power demand of individual appliances from a single meter which measures the combined demand of multiple appliances. One use-case is the production of itemised electricity bills from a single, whole-home smart meter. The ultimate aim might be to help users reduce their energy consumption; or to help operators to manage the grid; or to identify faulty appliances; or to survey appliance usage behaviour.

[Figure 1: Example power demand during one activation of the washing machine in UK-DALE House 1. The plot shows power (kW) against time (minutes).]

Research on NILM started with the seminal work of George Hart [1, 2] in the mid-1980s. Hart described a 'signature taxonomy' of features [2] and his earliest work from 1984 described experiments on extracting more detailed features (this claim is taken from Hart 1992 [2] because no copy of George Hart's 1984 technical report was available). However, Hart decided to focus on extracting only transitions between steady-states. Many NILM algorithms designed for low frequency data (1 Hz or slower) follow Hart's lead and only extract a small number of features. In contrast, in high frequency NILM (sampling at kHz or even MHz), there are numerous examples in the literature of manually engineering rich feature extractors (e.g. [3, 4]).

Humans can learn to detect appliances in aggregate data by eye, especially appliances with feature-rich signatures such as the washing machine signature shown in Figure 1. Humans almost certainly make use of a variety of features such as the rapid on-off cycling of the motor (which produces the rapid ~200 watt oscillations) and the ramps towards the end as the washer starts to rapidly spin the clothes. We could consider hand-engineering feature extractors for these rich features. But this would be time consuming and the resulting feature detectors may not be robust to noise and artefacts. Two key research questions emerge: Could an algorithm automatically learn to detect these features? Can we learn anything from neighbouring machine learning fields such as image classification?

Before 2012, the dominant approach to extracting features for image classification was to hand-engineer feature detectors such as the scale-invariant feature transform [5] (SIFT) and difference of Gaussians (DoG).
Then, in 2012, Krizhevsky et al.'s winning algorithm [6] in the ImageNet Large Scale Visual Recognition Challenge achieved a substantially lower error score (15%) than the second-best approach (26%). Krizhevsky et al.'s approach did not use hand-engineered feature detectors. Instead they used a deep neural network which automatically learnt to extract a hierarchy of features from the raw image. Deep learning is now a dominant approach not only in image classification but also in fields such as automatic speech recognition [7], machine translation [8], and even learning to play computer games from scratch [9]!

In this paper, we investigate whether deep neural nets can be applied to energy disaggregation. The use of 'small' neural nets on NILM dates back at least to Roos et al. 1994 [10] (although that paper was just a proposal) and continued with [11, 12, 13, 14], but these small nets do not appear to learn a hierarchy of feature detectors. A big breakthrough in image classification came when the compute power (courtesy of GPUs) became available to train deep neural networks on large amounts of data. In the present research, we want to see if deep neural nets can deliver good performance on energy disaggregation.

Our main contribution is to adapt three deep neural network architectures to NILM. For each architecture, we train one network per target appliance. We compare two benchmark disaggregation algorithms (combinatorial optimisation and factorial hidden Markov models) to the disaggregation performance of our three deep neural nets using seven metrics. We also examine how well our neural nets generalise to appliances in houses not seen during training because, ultimately, when NILM is used 'in the field' we very rarely have ground truth appliance data for the houses we want to disaggregate. So it is essential that NILM algorithms can generalise to unseen houses.

Please note that, once trained, our neural nets do not need ground truth appliance data from each house! End-users would only need to provide aggregate data. This is because each neural network should learn the 'essence' of its target appliance such that it can generalise to unseen instances of that appliance. In a similar fashion, neural networks trained to do image classification are trained on many examples of each category (dogs, cats, etc.) and generalise to unseen examples of each category.

To provide more context, we will briefly sketch how our neural networks could be deployed at scale, in the wild. Each net would undergo supervised training on many examples of its target appliance type so each network learns to generalise well to unseen appliances. Training is computationally expensive (days of processing on a fast GPU). But training does not have to be performed often. Once these networks are trained, inference is much cheaper (around a second of processing per network on a fast GPU for a week of aggregate data). Aggregate data from unseen houses would be fed through each network. Each network should filter out the power demand for its target appliance. This processing would probably be too computationally expensive to run on an embedded processor inside a smart meter or in-home display. Instead, the aggregate data could be sent from the smart meter to the cloud. The storage requirement for one 16 bit integer sample (0-64 kW in 1 watt steps) every ten seconds is 17 kilobytes per day uncompressed. This signal should be easily compressible because there are numerous periods in domestic aggregate power demand with little or no change. With a compression ratio of 5:1, and ignoring the datetime index, the total storage requirement for a year of data from 10 million users would be 13 terabytes (which could fit on two 8 TB disks). If one week of aggregate data can be processed in one second per home (which should be possible given further optimisation) then data from 10 million users could be processed by 16 GPU compute nodes. Alternatively, disaggregation could be performed on a compute device within each home (a modern laptop, mobile phone or a dedicated 'disaggregation hub' could handle the disaggregation). A GPU is not required for disaggregation, although it makes it faster.

This paper is structured as follows: in Section 2 we provide a very brief introduction to artificial neural nets. In Section 3 we describe how we prepare the training data for our nets and how we 'augment' the training data by synthesising additional data. In Section 4 we describe how we adapted three neural net architectures to NILM. In Section 5 we describe how we do disaggregation with our nets. In Section 6 we present the disaggregation results of our three neural nets and two benchmark NILM algorithms. Finally, in Section 7 we discuss our results, offer our conclusions and describe some possible future directions for research.

2. INTRODUCTION TO NEURAL NETS

An artificial neural network (ANN) is a directed graph where the nodes are artificial neurons and the edges allow information from one neuron to pass to another neuron (or the same neuron in a future time step). Neurons are typically arranged into layers such that each neuron in layer l connects to every neuron in layer l + 1. Connections are weighted and it is through modification of these weights that ANNs learn. ANNs have an input layer and an output layer. Any layers in between are called hidden layers. The forward pass of an ANN is where information flows from the input layer, through any hidden layers, to the output. Learning (updating the weights) happens during the backwards pass.

2.1 Forwards pass

Each artificial neuron calculates a weighted sum of its inputs, adds a learnt bias and passes this sum through an activation function. Consider a neuron which receives I inputs. The value of each input is represented by the input vector x. The weight on the connection from input i to neuron h is denoted by w_ih (so w is the 'weights matrix'). The weighted sum (also called the 'network input') of the inputs into neuron h can be written a_h = Σ_{i=1}^{I} x_i w_ih. The network input a_h is then passed through an activation function θ to produce the neuron's final output b_h, where b_h = θ(a_h). In this paper, we use the following activation functions: linear: θ(x) = x; rectified linear (ReLU): θ(x) = max(0, x); hyperbolic tangent (tanh): θ(x) = sinh(x) / cosh(x) = (e^x − e^−x) / (e^x + e^−x). Multiple nonlinear hidden layers can be used to re-represent the input data (hopefully by learning a hierarchy of feature detectors), which gives deep nonlinear networks a great deal of expressive power [15, 16].
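To make the forward pass concrete, here is a minimal NumPy sketch of one fully connected layer of artificial neurons. It is illustrative only (it is not the implementation used for the experiments reported here); the names x, w and a follow the notation above.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def forward_layer(x, w, bias, activation=np.tanh):
        # x:    input vector, shape (I,)
        # w:    weights matrix, shape (I, H); w[i, h] is the weight from
        #       input i to neuron h
        # bias: learnt bias, shape (H,)
        a = x @ w + bias      # network input: a_h = sum_i x_i * w_ih + bias_h
        b = activation(a)     # neuron output: b_h = theta(a_h)
        return b

    x = np.array([0.2, -1.0, 0.5])
    w = np.random.randn(3, 4) * 0.1
    bias = np.zeros(4)
    print(forward_layer(x, w, bias, activation=relu))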
2.2 Backwards pass

The basic idea of the backwards pass is to first do a forwards pass through the entire network to get the network's output for a specific network input, then compute the error of the output relative to the target (in all our experiments we use the mean squared error (MSE) as the objective function), and then modify the weights in the direction which should reduce the error.

In practice, the forward pass is often computed over a batch of randomly selected input vectors. In our work, we use a batch size of 64 sequences per batch for all but the largest recurrent neural network (RNN) experiments. In our largest RNNs we use a batch size of 16 (to allow the network to fit into the 3 GB of RAM on our GPU).

How do we modify each weight to reduce the error? It would be computationally intractable to enumerate the entire error surface. MSE gives a smooth error surface and the activation functions are differentiable, hence we can use gradient descent. The first step is to compute the gradient of the error surface at the position of the current batch by calculating the derivative of the objective function with respect to each weight. Then we modify each weight by adding the negative of the gradient multiplied by a 'learning rate' scalar parameter. To efficiently compute the gradient (in O(W) time) we use the backpropagation algorithm [17, 18, 19]. In all our experiments we use stochastic gradient descent (SGD) with Nesterov momentum of 0.9.
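As a sketch of the update rule described above, the following shows one SGD step with Nesterov momentum on a toy objective. It assumes the common 'look-ahead' formulation of Nesterov momentum used by many deep learning libraries; the paper does not spell out the exact variant used, so treat this as an illustration.

    import numpy as np

    def sgd_nesterov_step(weights, grad_fn, velocity, learning_rate=0.01,
                          momentum=0.9):
        # grad_fn returns dE/dw for the current mini-batch (computed by
        # backpropagation in practice).
        lookahead = weights + momentum * velocity      # 'look-ahead' position
        grad = grad_fn(lookahead)
        velocity = momentum * velocity - learning_rate * grad
        weights = weights + velocity                   # step downhill
        return weights, velocity

    # Toy example: minimise E(w) = 0.5 * ||w||^2, whose gradient is w itself.
    w, v = np.array([1.0, -2.0]), np.zeros(2)
    for _ in range(100):
        w, v = sgd_nesterov_step(w, lambda w_: w_, v, learning_rate=0.1)
    print(w)   # approaches [0, 0]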
2.3 Convolutional neural nets

Consider the task of identifying objects in a photograph. Whether we hand-engineer feature detectors or learn feature detectors from the data, it turns out that useful 'low level' features concern small patches of the image and include features such as edges of different orientations, corners, blobs, etc. To extract these features, we want to build a small number of feature detectors (one for horizontal lines, one for blobs, etc.) with small receptive fields (overlapping sub-regions of the input image) and slide these feature detectors across the entire image. Convolutional neural nets (CNNs) [20, 21, 22] build a small number of filters, each with a small receptive field, and these filters are duplicated (with shared weights) across the entire input.

Similarly to computer vision tasks, in time series problems we often want to extract a small number of low level features with small receptive fields across the entire input. All of our nets use at least one 1D convolutional layer at the input.
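The following NumPy sketch illustrates the core idea: a small filter with shared weights slid along a 1D input ('valid' mode, stride 1). The filter values and the example sequence are made up for illustration.

    import numpy as np

    def conv1d_valid(sequence, filters, bias):
        # sequence: shape (T,), e.g. a window of aggregate power
        # filters:  shape (num_filters, filter_size), shared across positions
        # returns:  shape (num_filters, T - filter_size + 1)
        num_filters, filter_size = filters.shape
        out_len = len(sequence) - filter_size + 1
        out = np.empty((num_filters, out_len))
        for t in range(out_len):
            out[:, t] = filters @ sequence[t:t + filter_size] + bias
        return out

    # A single filter that responds to an upward step change in power.
    step_detector = np.array([[-1.0, -1.0, 1.0, 1.0]])
    aggregate = np.array([100.0, 100.0, 100.0, 1100.0, 1100.0, 1100.0])
    print(conv1d_valid(aggregate, step_detector, np.zeros(1)))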
3. TRAINING DATA

Deep neural nets need a lot of training data because they have a large number of trainable parameters (the network weights and biases). The nets described in this paper have between 1 million and 150 million trainable parameters. Large training datasets are important. It is also common practice in deep learning to increase the effective size of the training set by duplicating the training data many times and applying realistic transformations to each copy. For example, in image classification, we might flip the image horizontally or apply slight affine transformations.

A related approach to creating a large training dataset is to generate simulated data. For example, Google DeepMind train their algorithms [9] on computer games because they can generate an effectively infinite amount of training data. Realistic synthetic speech audio data or natural images are harder to produce.

In energy disaggregation, we have the advantage that generating effectively infinite amounts of synthetic aggregate data is relatively easy by randomly combining real appliance activations. (We define an 'appliance activation' to be the power drawn by a single appliance over one complete cycle of that appliance. For example, Figure 1 shows a single activation for a washing machine.) We trained our nets on both synthetic aggregate data and real aggregate data in a 50:50 ratio. We found that synthetic data acts as a regulariser. In other words, training on a mix of synthetic and real aggregate data rather than just real data appears to improve the net's ability to generalise to unseen houses. For validation and testing we use only real data (not synthetic).

We used UK-DALE [23] as our source dataset. Each submeter in UK-DALE samples once every 6 seconds. All houses record aggregate apparent mains power once every 6 seconds. Houses 1, 2 and 5 also record active and reactive mains power once a second. In these houses, we downsampled the 1 second active mains power to 6 seconds to align with the submetered data and used this as the real aggregate data from these houses. Any gaps in appliance data shorter than 3 minutes are assumed to be due to RF issues and so are filled by forward-filling. Any gaps longer than 3 minutes are assumed to be due to the appliance and meter being switched off and so are filled with zeros.

We manually checked a random selection of appliance activations from every house. The UK-DALE metadata shows that House 4's microwave and washing machine share a single meter (a fact that we manually verified) and hence these appliances from House 4 are not used in our training data.

We train one network per target appliance. The target (i.e. the desired output of the net) is the power demand of the target appliance. The input to every net we describe in this paper is a window of aggregate power demand. The window width is decided on an appliance-by-appliance basis and varies from 128 samples (13 minutes) for the kettle to 1536 samples (2.5 hours) for the dish washer. We found that increasing the window size hurts disaggregation performance for short-duration appliances (for example, using a sequence length of 1024 for the fridge resulted in the autoencoder (AE) failing to learn anything useful and the 'rectangles' net achieving an F1 score of 0.68; reducing the sequence length to 512 allowed the AE to get an F1 score of 0.87 and the 'rectangles' net a score of 0.82). On the other hand, it is important to ensure that the window width is long enough to capture the majority of the appliance activations.

For each house, we reserved the last week of data for testing and used the rest of the data for training. The number of appliance training activations is shown in Table 1 and the number of testing activations is shown in Table 2. The specific houses used for training and testing are shown in Table 3.

3.1 Choice of appliances

We used five target appliances in all our experiments: the fridge, washing machine, dish washer, kettle and microwave. We chose these appliances because each is present in at least three houses in UK-DALE. This means that, for each appliance, we can train our nets on at least two houses and test on a different house. These five appliances consume a significant proportion of energy and represent a range of different power 'signatures', from the simple on/off of the kettle to the complex pattern shown by the washing machine (Figure 1).
'Small' appliances such as games consoles and phone chargers are problematic for many NILM algorithms because the effect of small appliances on aggregate power demand tends to get lost in the noise. By definition, small appliances do not consume much energy individually, but modern homes tend to have a large number of such appliances so their combined consumption can be significant. Hence it would be useful to detect small appliances using NILM. We have not explored whether our neural nets perform well on 'small' appliances but we plan to in the future.

Table 1: Number of training activations per house.

                      House 1   House 2   House 3   House 4   House 5
    Kettle               2836       543        44       716       176
    Fridge              16336      3526         0      4681      1488
    Washing machine       530        53         0         0        51
    Microwave            3266       387         0         0        28
    Dish washer           197        98         0        23         0

Table 2: Number of testing activations per house.

                      House 1   House 2   House 3   House 4   House 5
    Kettle                 54        29        40        50        18
    Fridge                168       277         0       145       140
    Washing machine        10         4         0         0         2
    Microwave              90         9         0         0         4
    Dish washer             3         7         0         3

3.2 Extract activations

Appliance activations are extracted using NILMTK's [24] Electric.get_activations() method. The arguments we passed to get_activations() for each appliance are shown in Table 4. On simple appliances such as toasters, we extract activations by finding strictly consecutive samples above some threshold power. We then throw away any activations shorter than some threshold duration (to ignore spurious spikes). For more complex appliances such as washing machines, whose power demand can drop below threshold for short periods during a cycle, NILMTK ignores short periods of sub-threshold power demand.
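For readers without NILMTK to hand, the following standalone sketch shows the kind of threshold-based extraction described above. It is a simplification (NILMTK's Electric.get_activations() is more careful, and the 'max power' argument of Table 4 is omitted here); the parameter names mirror Table 4 but the function itself is ours.

    import numpy as np

    def extract_activations(power, on_power_threshold, min_on_duration,
                            min_off_duration, sample_period=6):
        # power: 1D array of watts sampled every `sample_period` seconds.
        # Returns a list of (start_index, end_index) pairs.
        above = power >= on_power_threshold
        activations, start, off_run = [], None, 0
        for i, is_on in enumerate(above):
            if is_on:
                if start is None:
                    start = i
                off_run = 0
            elif start is not None:
                off_run += 1
                # Tolerate short sub-threshold dips (e.g. inside a wash cycle).
                if off_run * sample_period > min_off_duration:
                    end = i - off_run
                    if (end - start + 1) * sample_period >= min_on_duration:
                        activations.append((start, end))
                    start, off_run = None, 0
        if start is not None:
            end = len(above) - 1 - off_run
            if (end - start + 1) * sample_period >= min_on_duration:
                activations.append((start, end))
        return activations

    kettle = np.array([0, 0, 2800, 2900, 2850, 0, 0, 0], dtype=float)
    print(extract_activations(kettle, on_power_threshold=2000,
                              min_on_duration=12, min_off_duration=0))
    # [(2, 4)]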

3.3 Select windows of real aggregate data

First we locate all the activations of the target appliance in the home's submeter data for the target appliance. Then, for each training example, the code decides with 50% probability whether this example should include the target appliance or not. If the code decides not to include the target appliance then it finds a random window of aggregate data in which there are no activations of the target appliance. Otherwise, the code randomly selects a target appliance activation and randomly positions this activation within the window of data that will be shown to the net as the target (with the constraint that the activation must be captured completely in the window of data shown to the net, unless the window is too short to contain the entire activation). The corresponding time window of real aggregate data is also loaded and shown to the net as its input. If other activations of the target appliance happen to appear in the aggregate data then these are not included in the target sequence; the net is trained to focus on the first complete target appliance activation in the aggregate data.
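A sketch of this window-selection logic, assuming the activations have already been located as (start, end) index pairs in the submeter data. The function and variable names are ours, and the masking of any later target activations inside the window is omitted for brevity.

    import numpy as np

    def select_real_window(aggregate, target, activations, seq_length, rng):
        # Returns one (net input, net target) pair of length seq_length.
        include_target = rng.random() < 0.5
        if include_target and activations:
            start, end = activations[rng.integers(len(activations))]
            duration = end - start + 1
            # Position the activation randomly but keep it entirely inside
            # the window whenever the window is long enough to hold it.
            max_offset = max(seq_length - duration, 0)
            window_start = start - rng.integers(max_offset + 1)
        else:
            # Draw random windows until one contains no target activation
            # (real code would bound the number of attempts).
            while True:
                window_start = rng.integers(len(aggregate) - seq_length)
                window_end = window_start + seq_length
                if not any(s < window_end and e >= window_start
                           for s, e in activations):
                    break
        window_start = int(np.clip(window_start, 0, len(aggregate) - seq_length))
        sl = slice(window_start, window_start + seq_length)
        net_target = target[sl].copy() if include_target else np.zeros(seq_length)
        return aggregate[sl].copy(), net_target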
3.4 Synthetic aggregate data

To create synthetic aggregate data we start by extracting a set of appliance activations for five appliances across all training houses: kettle, washing machine, dish washer, microwave and fridge. To create a single sequence of synthetic data, we start with two vectors of zeros: one vector will become the input to the net; the other will become the target. The length of each vector defines the 'window width' of data that the network sees. We go through the five appliance classes and decide whether or not to add an activation of that class to the training sequence. There is a 50% chance that the target appliance will appear in the sequence and a 25% chance for each other 'distractor' appliance. For each selected appliance class, we randomly select an appliance activation and then randomly pick where to add that activation on the input vector. Distractor appliances can appear anywhere in the sequence (even if this means that only part of the activation will be included in the sequence). The target appliance activation must be completely contained within the sequence (unless it is too large to fit).

Of course, this relatively naive approach to synthesising aggregate data ignores a lot of structure that appears in real aggregate data. For example, the kettle and toaster might often appear within a few minutes of each other in real data, but our simple 'simulator' is completely unaware of this sort of structure. We expect that a more realistic simulator might increase the performance of deep neural nets on energy disaggregation.
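A condensed sketch of this generator. The 50% / 25% probabilities and the 'target must fit completely' rule follow the description above; everything else (names, data structures) is ours.

    import numpy as np

    def make_synthetic_example(activations_per_class, target_class,
                               seq_length, rng):
        # activations_per_class: dict mapping appliance name to a list of 1D
        # arrays of real activation power; target_class names the appliance
        # this network is being trained on.
        net_input = np.zeros(seq_length)
        net_target = np.zeros(seq_length)
        for appliance, activations in activations_per_class.items():
            is_target = appliance == target_class
            prob = 0.5 if is_target else 0.25
            if rng.random() >= prob or not activations:
                continue
            activation = activations[rng.integers(len(activations))]
            if is_target:
                # The target activation must fit completely (when possible).
                start = rng.integers(max(seq_length - len(activation), 0) + 1)
            else:
                # Distractors may be only partially inside the window.
                start = rng.integers(seq_length) - len(activation) // 2
            end = start + len(activation)
            src = activation[max(0, -start):
                             len(activation) - max(0, end - seq_length)]
            dst = slice(max(0, start), min(end, seq_length))
            net_input[dst] += src
            if is_target:
                net_target[dst] += src
        return net_input, net_target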
3.5 Implementation of data processing

All our code is written in Python and we make use of Pandas, Numpy and NILMTK for data preparation. Each network receives data in mini-batches of 64 sequences (except for the largest RNNs, in which case we use a batch size of 16 sequences). The code is multi-threaded so the CPU can be busy preparing one batch of data on the fly whilst the GPU is busy training on the previous batch.
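The overlap of CPU data preparation with GPU training can be sketched with a standard producer-consumer queue from the Python standard library (an illustration of the idea, not the pipeline used for the experiments):

    import queue
    import threading

    def start_batch_pipeline(make_batch, num_batches, max_prefetch=3):
        # make_batch: zero-argument callable returning one mini-batch
        # (e.g. 64 input/target sequence pairs).
        batches = queue.Queue(maxsize=max_prefetch)

        def producer():
            for _ in range(num_batches):
                batches.put(make_batch())   # blocks while the queue is full
            batches.put(None)               # sentinel: no more batches

        threading.Thread(target=producer, daemon=True).start()
        return batches

    # Usage: the training loop pulls ready-made batches off the queue while
    # the producer thread prepares the next one.
    #   batches = start_batch_pipeline(make_batch, num_batches=10000)
    #   batch = batches.get()
    #   while batch is not None:
    #       train_step(batch)        # e.g. one GPU weight update
    #       batch = batches.get()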
3.6 Standardisation

In general, neural nets learn most efficiently if the input data has zero mean. First, the mean of each sequence is subtracted from the sequence to give each sequence a mean of zero. Every input sequence is then divided by the standard deviation of a random sample of the training set. We do not divide each sequence by its own standard deviation because that would change the scaling, and the scaling is likely to be important for NILM.

Forcing each sequence to have zero mean throws away information, information that NILM algorithms such as combinatorial optimisation and factorial hidden Markov models rely on. We have done some preliminary experiments and found that neural nets appear to be able to generalise better if we independently centre each sequence. But there are likely to be ways to have the best of both worlds, i.e. to give the network information about the absolute power whilst also allowing the network to generalise well.

One big advantage of training our nets on sequences which have been independently centred is that our nets do not need to consider vampire (always-on) loads.

Targets are divided by a hand-coded 'maximum power demand' for each appliance to put the target power demand into the range [0, 1].
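A minimal sketch of this standardisation, assuming the training-set standard deviation and the per-appliance maximum power have been computed beforehand (the clipping of rare readings above the hand-coded maximum is our addition):

    import numpy as np

    def standardise_input(seq, train_std):
        # Centre each input sequence independently, then divide by a standard
        # deviation estimated from a random sample of the training set
        # (not by the sequence's own standard deviation).
        return (seq - seq.mean()) / train_std

    def scale_target(seq, max_power):
        # Divide by the hand-coded maximum power demand for the appliance
        # (e.g. 3100 W for the kettle) to put the target into [0, 1].
        return np.clip(seq / max_power, 0.0, 1.0)

    window = np.array([2400.0, 2450.0, 300.0, 310.0])
    print(standardise_input(window, train_std=800.0))
    print(scale_target(np.array([0.0, 2800.0, 2900.0, 0.0]), max_power=3100))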
4. NEURAL NETWORK ARCHITECTURES

In this section we describe how we adapted three different neural net architectures to do NILM.

Table 3: Houses used for training and testing.

                      Training      Testing
    Kettle            1, 2, 3, 4    5
    Fridge            1, 2, 4       5
    Washing machine   1, 5          2
    Microwave         1, 2          5
    Dish washer       1, 2          5

Table 4: Arguments passed to get_activations().

    Appliance          Max power   On power           Min. on          Min. off
                       (watts)     threshold (watts)  duration (secs)  duration (secs)
    Kettle             3100        2000               12               0
    Fridge             300         50                 60               12
    Washing machine    2500        20                 1800             160
    Microwave          3000        200                12               30
    Dish washer        2500        10                 1800             1800

4.1 Recurrent Neural Networks

In Section 2 we described feed forward neural networks which map from a single input vector to a single output vector. When the network is shown a second input vector, it has no memory of the previous input.

Recurrent neural networks (RNNs) allow cycles in the network graph such that the output from neuron i in layer l at time step t is fed via weighted connections to every neuron in layer l (including neuron i) at time step t + 1. This allows RNNs, in principle, to map from the entire history of the inputs to an output vector. This makes RNNs especially well suited to sequential data. In our work, we train RNNs using backpropagation through time (BPTT) [25].

In practice, RNNs can suffer from the 'vanishing gradient' problem [26] where gradient information disappears or explodes as it is propagated back through time. This can limit an RNN's memory. One solution to this problem is the 'long short-term memory' (LSTM) architecture [26] which uses a 'memory cell' with a gated input, gated output and gated feedback loop. The intuition behind LSTM is that it is a differentiable latch (where a 'latch' is the fundamental unit of a digital computer's RAM). LSTMs have been used with success on a wide variety of sequence tasks including automatic speech recognition [7, 27] and machine translation [8].

An additional enhancement to RNNs is to use bidirectional layers. In a bidirectional RNN, there are effectively two parallel RNNs: one reads the input sequence forwards and the other reads the input sequence backwards. The outputs from the forwards and backwards halves of the network are combined either by concatenating them or by doing an element-wise sum (we experimented with both and settled on concatenation, although element-wise sum appeared to work almost as well and is computationally cheaper).

We should note that bidirectional RNNs are not naturally suited to doing online disaggregation. Bidirectional RNNs could still be used for online disaggregation if we frame 'online disaggregation' as doing frequent, small batches of offline disaggregation.

We experimented with both RNNs and LSTMs and settled on the following architecture for energy disaggregation:

1. Input (length determined by appliance duration)
2. 1D conv (filter size=4, stride=1, number of filters=16, activation function=linear, border mode=same)
3. Bidirectional LSTM (N=128, with peepholes)
4. Bidirectional LSTM (N=256, with peepholes)
5. Fully connected (N=128, activation function=TanH)
6. Fully connected (N=1, activation function=linear)

At each time step, the network sees a single sample of aggregate power data and outputs a single sample of power data for the target appliance.

In principle, the convolutional layer should not be necessary (because the LSTMs should be able to remember all the context). But we found the addition of a convolutional layer to slightly increase performance (the conv. layer convolves over the time axis). We also experimented with adding a conv. layer between the two LSTM layers with a stride > 1 to implement hierarchical subsampling [28]. This showed promise but we did not use it for our final experiments.

On the backwards pass, we clip the gradient at [-10, 10] as per Alex Graves in [29]. To speed up computation, we propagate the gradient backwards a maximum of 500 time steps. Figure 2 shows an example output of our LSTM network in the two 'RNN' rows.

4.2 Denoising Autoencoders

In this section, we frame energy disaggregation as a 'denoising' task. Typical denoising tasks include removing grain from an old photograph; or removing reverb from an audio recording; or even in-filling a masked part of an image. Energy disaggregation can be viewed as an attempt to recover the 'clean' power demand signal of the target appliance from the background 'noise' produced by the other appliances. A successful neural network architecture for denoising tasks is the 'denoising autoencoder'.

An autoencoder (AE) is simply a network which tries to reconstruct the input. Described like this, AEs might not sound very useful! The key is that AEs first encode the input to a compact vector representation (in the 'code layer') and then decode to reconstruct the input. The simplest way of forcing the net to discover a compact representation of the data is to have a code layer with fewer dimensions than the input. In this case, the AE is doing dimensionality reduction. Indeed, a linear AE with a single hidden layer is almost equivalent to PCA. But AEs can be deep and non-linear.

A denoising autoencoder (dAE) [30] is an autoencoder which attempts to reconstruct a clean target from a noisy input. dAEs are typically trained by artificially corrupting a signal before it goes into the net's input, and using the clean signal as the net's target. In NILM, we consider the corruption as being the power demand from the other appliances. So we do not add noise artificially. Instead we use the aggregate power demand as the (noisy) input to the net and ask the net to reconstruct the clean power demand of the target appliance.

The first and last layers of our NILM dAEs are 1D convolutional layers. We use convolutional layers because we want the network to learn low level feature detectors which are applied equally across the entire input window (for example, a step change of 1000 watts might be a useful feature to extract, no matter where it is found in the input).
The aim is to provide some invariance to where exactly the activation is positioned within the input window. The last layer does a 'deconvolution'. The exact architecture is as follows:

1. Input (length determined by appliance duration)
2. 1D conv (filter size=4, stride=1, number of filters=8, activation function=linear, border mode=valid)
3. Fully connected (N=(sequence length - 3) × 8, activation function=ReLU)
4. Fully connected (N=128, activation function=ReLU)
5. Fully connected (N=(sequence length - 3) × 8, activation function=ReLU)
6. 1D conv (filter size=4, stride=1, number of filters=1, activation function=linear, border mode=valid)

Layer 4 is the middle, code layer. The entire dAE is trained end-to-end in one go (we do not do layer-wise pre-training as we found it did not increase performance). We do not tie the weights as we found this also appears not to enhance NILM performance. An example output of our NILM dAE is shown in Figure 2 in the two 'Autoencoder' rows.
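To see where the (sequence length − 3) × 8 figure comes from: a 'valid' convolution with filter size 4 and stride 1 shortens the sequence by 3 samples, and the 8 filters give 8 output channels, which are flattened before the fully connected layers. A quick sketch of the shape arithmetic (illustrative only):

    def dae_layer_sizes(seq_length, filter_size=4, num_filters=8, code_size=128):
        conv_len = seq_length - (filter_size - 1)   # 'valid' conv, stride 1
        flat = conv_len * num_filters               # = (sequence length - 3) * 8
        # input -> flattened conv output -> code layer -> mirror of the above
        return {"input": seq_length, "conv_flat": flat,
                "code": code_size, "expand": flat}

    # e.g. for a 512-sample input window:
    print(dae_layer_sizes(512))
    # {'input': 512, 'conv_flat': 4072, 'code': 128, 'expand': 4072}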
4.3 Regress Start Time, End Time & Power

Many applications of energy disaggregation do not require a detailed second-by-second reconstruction of the appliance power demand. Instead, most energy disaggregation use-cases require, for each appliance activation, the identification of the start time, end time and energy consumed. In other words, we want to draw a rectangle around each appliance activation in the aggregate data where the left side of the rectangle is the start time, the right side is the end time and the height is the average power demand of the appliance between the start and end times.

Deep neural networks have been used with great success on related tasks. For example, Nouri used deep neural networks to estimate the 2D location of 'facial keypoints' in images of faces [31]. Example 'keypoints' are 'left eye centre' or 'mouth centre top lip'. The input to Nouri's neural net is the raw image of a face. The output of the network is a set of x, y coordinates for each keypoint.

Our idea was to train a neural network to estimate three scalar, real-valued outputs: the start time, the end time and the mean power demand of the first appliance activation to appear in the aggregate power signal. If there is no target appliance in the aggregate data then all three outputs should be zero. If there is more than one activation in the aggregate signal then the network should ignore all but the first activation. All outputs are in the range [0, 1]. The start and end times are encoded as a proportion of the input's time window. For example, the start of the time window is encoded as 0, the end is encoded as 1 and half way through the time window is encoded as 0.5. For example, consider a scenario where the input window width is 10 minutes and an appliance activation starts 1 minute into the window and ends 1 minute before the end of the window. This activation would be encoded as having a start location of 0.1 and an end location of 0.9. Example output is shown in Figure 2 in the two 'Rectangles' rows.
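A sketch of how such a target could be computed from a window of target-appliance power (a helper of ours, shown for clarity; the paper computes these targets during data pre-processing):

    import numpy as np

    def encode_rectangle(target_power, max_power):
        # Returns (start, end, mean power), each scaled to [0, 1], for the
        # first activation in the window; zeros if the appliance is absent.
        on = np.flatnonzero(target_power > 0)
        if len(on) == 0:
            return np.zeros(3)
        start = on[0]
        # End of the first contiguous run of 'on' samples (later activations
        # in the same window are ignored).
        breaks = np.flatnonzero(np.diff(on) > 1)
        end = on[breaks[0]] if len(breaks) else on[-1]
        T = len(target_power)
        mean_power = target_power[start:end + 1].mean() / max_power
        return np.array([start / T, (end + 1) / T, mean_power])

    # 10-minute window at 6-second samples = 100 samples; an activation from
    # 1 minute in until 1 minute before the end encodes as start 0.1, end 0.9.
    window = np.zeros(100)
    window[10:90] = 2500.0
    print(encode_rectangle(window, max_power=2500))   # [0.1, 0.9, 1.0]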
The three target values for each sequence are calculated during data pre-processing. As for all of our other networks, the network's objective is to minimise the mean squared error. The exact architecture is as follows:

1. Input (length determined by appliance duration)
2. 1D conv (filter size=4, stride=1, number of filters=16, activation function=linear, border mode=valid)
3. 1D conv (filter size=4, stride=1, number of filters=16, activation function=linear, border mode=valid)
4. Fully connected (N=4096, activation function=ReLU)
5. Fully connected (N=3072, activation function=ReLU)
6. Fully connected (N=2048, activation function=ReLU)
7. Fully connected (N=512, activation function=ReLU)
8. Fully connected (N=3, activation function=linear)

4.4 Neural net implementation

We implemented our neural nets in Python using the Lasagne library (github.com/Lasagne/Lasagne). Lasagne is built on top of Theano [32, 33]. We trained our nets on an nVidia GTX 780Ti GPU with 3 GB of RAM (but note that Theano also allows code to be run on the CPU without requiring any changes to the user's code). On this GPU, our nets typically took between 1 and 12 hours to train per appliance. The exact code used to create the results in this paper is available in our 'NeuralNILM Prototype' repository (github.com/JackKelly/neuralnilm_prototype) and a more elegant (hopefully!) rewrite is available in our 'NeuralNILM' repository (github.com/JackKelly/neuralnilm).

We manually defined the number of weight updates to perform during training for each experiment. For the RNNs we performed 10,000 updates, for the denoising autoencoders we performed 100,000 and for the regression network we performed 300,000 updates. Neither the RNNs nor the AEs appeared to continue learning past this number of updates. The regression networks appear to keep learning no matter how many updates we perform!

The nets have a wide variation in the number of trainable parameters. The largest dAE nets range from 1M to 150M (depending on the input size); the RNNs all had 1M parameters and the regression nets varied from 28M to 120M parameters (depending on the input size).

All our network weights were initialised randomly using Lasagne's default initialisation. All of the experiments presented in this paper trained end-to-end from random initialisation (no layerwise pre-training).

5. DISAGGREGATION

How do we disaggregate arbitrarily long sequences of aggregate data given that each net has an input window duration of, at most, a few hours? We first pad the beginning and end of the input with zeros. Then we slide the net along the input sequence. As such, the first sequence we show to the network will be all zeros. Then we shift the input window STRIDE samples to the right, where STRIDE is a manually defined positive, non-zero integer. If STRIDE is less than the length of the net's input window then the net will see overlapping input sequences. This allows the network to have multiple attempts at processing each appliance activation in the aggregate signal, and on each attempt each activation will be shifted to the left by STRIDE samples.
Over the course of disaggregation, the network produces multiple estimated values for each time step because we give the network overlapping segments of the input. For our first two network architectures, we combine the multiple values per timestep simply by taking the mean.

Combining the output from our third network is a little more complex. We layer every predicted 'appliance rectangle' on top of each other. We measure the overlap and normalise the overlap to [0, 1]. This gives a probabilistic output for each appliance's power demand. To convert this to a single vector per appliance, we threshold the power and probability.
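A sketch of the sliding-window procedure for the first two architectures, with zero-padding, a manually chosen STRIDE and a simple mean over the overlapping estimates (predict stands in for one forward pass of a trained network; this is an illustration, not the code used for the experiments):

    import numpy as np

    def disaggregate(aggregate, predict, seq_length, stride):
        # predict: maps one input window of length seq_length to one window
        # of estimated target-appliance power (same length).
        padded = np.concatenate([np.zeros(seq_length), aggregate,
                                 np.zeros(seq_length)])
        total = np.zeros(len(padded))
        counts = np.zeros(len(padded))
        for start in range(0, len(padded) - seq_length + 1, stride):
            window = padded[start:start + seq_length]
            total[start:start + seq_length] += predict(window)
            counts[start:start + seq_length] += 1
        estimate = total / np.maximum(counts, 1)   # mean of overlapping outputs
        return estimate[seq_length:seq_length + len(aggregate)]

    # e.g. a 128-sample (13 minute) kettle window with STRIDE=16:
    #   kettle_estimate = disaggregate(mains, kettle_net_predict, 128, 16)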
6. RESULTS

The disaggregation results on an unseen house are shown in Figure 3. The results on houses seen during training are shown in Figure 4. We used benchmark implementations from NILMTK [24] of the combinatorial optimisation (CO) and factorial hidden Markov model (FHMM) algorithms.

[Figure 3: Disaggregation performance on a house not seen during training. Grouped bar charts of F1 score, precision score, recall score, accuracy score, relative error in total energy, proportion of total energy correctly assigned and mean absolute error (watts) for the kettle, dish washer, fridge, microwave and washing machine and across all appliances, comparing Combinatorial Opt., Factorial HMM, Autoencoder, Rectangles and LSTM.]

On the unseen house (Figure 3), both the denoising autoencoder and the net which regresses the start time, end time and power demand (the 'rectangles' architecture) outperform CO and FHMM on every appliance on F1 score, precision score, proportion of total energy correctly assigned and mean absolute error. The LSTM outperforms CO and FHMM on two-state appliances (kettle, fridge and microwave) but falls behind CO and FHMM on multi-state appliances (dish washer and washing machine).

On the houses seen during training (Figure 4), the dAE outperforms CO and FHMM on every appliance on every metric except relative error in total energy. The 'rectangles' architecture outperforms CO and FHMM on every appliance (except the microwave) on F1, precision, accuracy, proportion of total energy correctly assigned and mean absolute error.

The full disaggregated time series for all our algorithms and the aggregate data and appliance ground truth data are available at www.doc.ic.ac.uk/~dk3810/neuralnilm.

The metrics we used are:

    TP = number of true positives  (1)
    FP = number of false positives  (2)
    FN = number of false negatives  (3)
    P = number of positives in ground truth  (4)
    N = number of negatives in ground truth  (5)
    E = total actual energy  (6)
    Ê = total predicted energy  (7)
    y_t^(i) = appliance i actual power at time t  (8)
    ŷ_t^(i) = appliance i estimated power at time t  (9)
    ȳ_t = aggregate actual power at time t  (10)

    recall = TP / (TP + FN)  (11)
    precision = TP / (TP + FP)  (12)
    F1 = 2 × precision × recall / (precision + recall)  (13)
    accuracy = (TP + TN) / (P + N)  (14)
    relative error in total energy = |Ê − E| / max(E, Ê)  (15)
    mean absolute error = (1/T) Σ_{t=1}^{T} |ŷ_t − y_t|  (16)
    proportion of total energy correctly assigned =
        1 − (Σ_{t=1}^{T} Σ_{i=1}^{n} |ŷ_t^(i) − y_t^(i)|) / (2 Σ_{t=1}^{T} ȳ_t)  (17)

The proportion of total energy correctly assigned is taken from [34].
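One straightforward way to compute these metrics, assumed here, is to binarise both the estimated and the ground truth power series with a per-appliance on-power threshold (as in Table 4) for the classification metrics, and to use the raw power series for the energy-based metrics. A NumPy sketch (illustrative, not the evaluation code used to produce the figures):

    import numpy as np

    def nilm_metrics(y_true, y_pred, mains, on_threshold):
        # y_true, y_pred: arrays of shape (n_appliances, T) of actual and
        # estimated power; mains: (T,) aggregate power.  Returns the metrics
        # for appliance 0 plus the across-appliance energy assignment score.
        true_on = y_true[0] >= on_threshold
        pred_on = y_pred[0] >= on_threshold
        tp = np.sum(true_on & pred_on)
        fp = np.sum(~true_on & pred_on)
        fn = np.sum(true_on & ~pred_on)
        tn = np.sum(~true_on & ~pred_on)
        recall = tp / (tp + fn)
        precision = tp / (tp + fp)
        f1 = 2 * precision * recall / (precision + recall)
        accuracy = (tp + tn) / len(true_on)
        E, E_hat = y_true[0].sum(), y_pred[0].sum()
        return {
            "recall": recall,
            "precision": precision,
            "F1": f1,
            "accuracy": accuracy,
            "relative error in total energy": abs(E_hat - E) / max(E, E_hat),
            "mean absolute error": np.mean(np.abs(y_pred[0] - y_true[0])),
            "proportion of total energy correctly assigned":
                1 - np.sum(np.abs(y_pred - y_true)) / (2 * mains.sum()),
        }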
7. CONCLUSIONS & FUTURE WORK

We have adapted three neural network architectures to NILM. The denoising autoencoder and the 'rectangles' architectures perform well, especially on unseen houses. We believe that deep neural nets show great promise for NILM. But there is plenty of work still to do!

It is worth noting that our comparison between each architecture is not entirely fair because the architectures have a wide range of trainable parameters. For example, every LSTM we used had 1M parameters whilst the larger dAE and rectangles nets had over 150M parameters (we did try training an LSTM with more parameters but it did not appear to improve performance).
[Figure 2 appears here: three columns of plots (Kettle, Washing Machine, Fridge) and three row groups (measured data from House 1; raw output from the LSTM, Autoencoder and Rectangles nets; overlapping output from the same three nets), with time in number of samples on the x-axis. The caption follows.]
Figure 2: Example outputs produced by all three neural network architectures for three appliances. Each
column shows data for a different appliance. The rows are in three groups (the tall grey rectangles on the far
left). The top group shows measured data from House 1. The top row shows the measured aggregate power
data from House 1 (the input to the neural nets). The Y-axis scale for the aggregate data is standardised
such that its mean is 0 and its standard deviation is 1 across the data set. The Y-axis range for all other
subplots is [0, 1]. The second row shows the single-appliance power demand (i.e. what the neural nets are
trying to estimate). The middle group of rows shows the raw output from each neural network (just a single
pass through each network). The bottom group of rows shows the result of sliding the network over the
aggregate data with STRIDE=16 and overlapping the output. Please note that the ‘rectangles’ net is trained
such that the height of the output rectangle should be the mean power demand over the duration of the
identified activation.
[Figure 4: Disaggregation performance on houses seen during training (the time window used for testing is different to that used for training). Grouped bar charts of the same seven metrics as Figure 3 for each appliance and across all appliances, comparing Combinatorial Opt., Factorial HMM, Autoencoder, Rectangles and LSTM.]

Our LSTM results suggest that LSTMs work best for two-state appliances but do not perform well on multi-state appliances such as the dish washer and washing machine. One possible reason is that, for these appliances, informative 'events' in the power signal can be many time steps apart (e.g. for the washing machine there might be over 1,000 time steps between the first heater activation and the spin cycle). In principle, LSTMs have an arbitrarily long memory. But these long gaps between informative events may present a challenge for LSTMs. Further work is required to understand exactly why LSTMs struggle on multi-state appliances. One aspect of our LSTM results that we did expect was that processing overlapping windows of aggregate data would not be necessary for LSTMs because they always output the same estimates, no matter what the offset of the input window (see Figure 2).

We must also note that the FHMM implementation used in this work is not 'state of the art' and neither is it especially tuned. Other FHMM implementations are likely to perform better. We encourage other researchers to download our disaggregation estimates and ground truth data (available from www.doc.ic.ac.uk/~dk3810/neuralnilm) and directly compare against our algorithms!

This work represents just a first step towards adapting the vast number of techniques from the deep learning community to NILM, for example:

7.1 Train on more data

UK-DALE has many hundreds of days of data but only from five houses. Any machine learning algorithm is only able to generalise if given enough variety in the training set. For example, House 5's dish washer sometimes has four activations of its heater but the dish washers in the two training houses (1 and 2) only ever have two peaks. Hence the autoencoder completely ignores the first two peaks of House 5's dish washer! If neural nets are to learn to generalise well then we must train on much larger numbers of appliances (hundreds or thousands). This should help the networks to generalise across the wide variation seen in some classes of appliance.

7.2 Unsupervised pre-training

In NILM, we generally have access to much more unlabelled data than labelled data. One advantage of neural nets is that they could, in principle, be 'pre-trained' on unlabelled data before being fine-tuned on labelled data. 'Pre-training' should allow the networks to start to identify useful features from the data but does not allow the nets to learn to label appliances. (Pre-training is rarely used in modern image classification tasks because very large labelled datasets are available for image classification. But in NILM we have much more unlabelled data than labelled data, so pre-training is likely to be useful.) After unsupervised pre-training, each net would undergo supervised training. Instead of (or as well as) pre-training on all available unlabelled data, it may also be interesting to try pre-training largely on unlabelled data from each house that we wish to disaggregate.

8. ACKNOWLEDGMENTS

Jack Kelly's PhD is funded by the EPSRC and by Intel via their EU Doctoral Student Fellowship Programme. The authors would like to thank Pedro Nascimento for his comments on a draft of this manuscript.

9. REFERENCES

[1] G. W. Hart. Prototype nonintrusive appliance load monitor. Technical report, MIT Energy Laboratory and Electric Power Research Institute, Sept. 1985.
[2] G. W. Hart. Nonintrusive appliance load monitoring. Proceedings of the IEEE, 80(12):1870–1891, Dec. 1992. doi:10.1109/5.192069.
[3] S. B. Leeb, S. R. Shaw, and J. L. Kirtley Jr. Transient event detection in spectral envelope estimates for nonintrusive load monitoring. Power Delivery, IEEE Transactions on, 10(3):1200–1210, 1995. doi:10.1109/61.400897.
[4] N. Amirach, B. Xerri, B. Borloz, and C. Jauffret. A new approach for event detection and feature extraction for NILM. In Electronics, Circuits and Systems (ICECS), 2014 21st IEEE International Conference on, pages 287–290. IEEE, 2014.
[5] D. G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157. IEEE, 1999.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[7] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1764–1772, 2014.
[8] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[10] J. Roos, I. Lane, E. Botha, and G. P. Hancke. Using neural networks for non-intrusive monitoring of industrial electrical loads. In Instrumentation and Measurement Technology Conference, 1994. IMTC/94. Conference Proceedings. 10th Anniversary. Advanced Technologies in I & M., 1994 IEEE, pages 1115–1118. IEEE, 1994. doi:10.1109/IMTC.1994.351862.
[11] H.-T. Yang, H.-H. Chang, and C.-L. Lin. Design a neural network for features selection in non-intrusive monitoring of industrial electrical loads. In Computer Supported Cooperative Work in Design, 2007. CSCWD 2007. 11th International Conference on, pages 1022–1027. IEEE, 2007. doi:10.1109/CSCWD.2007.4281579.
[12] Y.-H. Lin and M.-S. Tsai. A novel feature extraction method for the development of nonintrusive load monitoring system based on BP-ANN. In 2010 International Symposium on Computer Communication Control and Automation (3CA), volume 2, pages 215–218. IEEE, 2010. doi:10.1109/3CA.2010.5533571.
[13] A. G. Ruzzelli, C. Nicolas, A. Schoofs, and G. M. O'Hare. Real-time recognition and profiling of appliances through a single electricity sensor. In Sensor Mesh and Ad Hoc Communications and Networks (SECON), 2010 7th Annual IEEE Communications Society Conference on, pages 1–9. IEEE, 2010. doi:10.1109/SECON.2010.5508244.
[14] H.-H. Chang, P.-C. Chien, L.-S. Lin, and N. Chen. Feature extraction of non-intrusive load-monitoring system using genetic algorithm in smart meters. In e-Business Engineering (ICEBE), 2011 IEEE 8th International Conference on, pages 299–304. IEEE, 2011.
[15] Y. Bengio, Y. LeCun, et al. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 34(5), 2007.
[16] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[17] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, DTIC Document, 1985.
[18] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. Back-propagation: Theory, Architectures and Applications, pages 433–486, 1995.
[19] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.
[20] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.
[21] L. E. Atlas, T. Homma, and R. J. Marks II. An artificial neural network for spatio-temporal bipolar patterns: Application to phoneme classification. In Proc. Neural Information Processing Systems (NIPS), page 31, 1988.
[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[23] J. Kelly and W. Knottenbelt. The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes. Scientific Data, 2(150007), 2015. doi:10.1038/sdata.2015.7.
[24] N. Batra, J. Kelly, O. Parson, H. Dutta, W. Knottenbelt, A. Rogers, A. Singh, and M. Srivastava. NILMTK: An open source toolkit for non-intrusive load monitoring. In Fifth International Conference on Future Energy Systems (ACM e-Energy), Cambridge, UK, 2014. doi:10.1145/2602044.2602051.
[25] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[26] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi:10.1162/neco.1997.9.8.1735.
[27] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio. End-to-end continuous speech recognition using attention-based recurrent NN: First results. 2014.
[28] A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385. Springer, 2012. http://www.cs.toronto.edu/~graves/preprint.pdf.
[29] A. Graves. Generating sequences with recurrent neural networks. 2013.
[30] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.
[31] D. Nouri. Using convolutional neural nets to detect facial keypoints tutorial, 2014. http://bit.ly/1OduG83.
[32] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral presentation.
[33] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
[34] J. Z. Kolter and M. J. Johnson. REDD: A public data set for energy disaggregation research. In Workshop on Data Mining Applications in Sustainability (SIGKDD), San Diego, CA, volume 25, pages 59–62. Citeseer, 2011.
