Neural NILM: Deep Neural Networks Applied To Energy Disaggregation
ABSTRACT
Energy disaggregation estimates appliance-by-appliance electricity consumption from a single meter that measures the whole home's electricity demand. Recently, deep neural networks have driven remarkable improvements in classification performance in neighbouring machine learning fields such as image classification and automatic speech recognition. In this paper, we adapt three deep neural network architectures to energy disaggregation: 1) a form of recurrent neural network called 'long short-term memory' (LSTM); 2) denoising autoencoders; and 3) a network which regresses the start time, end time and average power demand of each appliance activation. We use seven metrics to test the performance of these algorithms on real aggregate power data from five appliances. Tests are performed against a house not seen during training and against houses seen during training. We find that all three neural nets achieve better F1 scores (averaged over all five appliances) than either combinatorial optimisation or factorial hidden Markov models, and that our neural net algorithms generalise well to an unseen house.

Figure 1: Example power demand during one activation of the washing machine in UK-DALE House 1. (Axes: power (kW) against time (minutes).)

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning—Connectionism and neural nets; I.5.2 [Pattern Recognition]: Design Methodology—Pattern analysis, Classifier design and evaluation

Keywords
Energy disaggregation; neural networks; feature learning; NILM; energy conservation; deep learning

This is the authors' version of the work. Copyright is held by the authors. The definitive version was published in ACM BuildSys'15, November 4–5, 2015, Seoul. DOI: 10.1145/2821650.2821672

1. INTRODUCTION

Energy disaggregation (also called non-intrusive load monitoring or NILM) is a computational technique for estimating the power demand of individual appliances from a single meter which measures the combined demand of multiple appliances. One use-case is the production of itemised electricity bills from a single, whole-home smart meter. The ultimate aim might be to help users reduce their energy consumption; or to help operators to manage the grid; or to identify faulty appliances; or to survey appliance usage behaviour.

Research on NILM started with the seminal work of George Hart [1, 2] in the mid-1980s. Hart described a 'signature taxonomy' of features [2] and his earliest work from 1984 described experiments on extracting more detailed features¹. However, Hart decided to focus on extracting only transitions between steady-states. Many NILM algorithms designed for low frequency data (1 Hz or slower) follow Hart's lead and only extract a small number of features. In contrast, in high frequency NILM (sampling at kHz or even MHz), there are numerous examples in the literature of manually engineering rich feature extractors (e.g. [3, 4]).

¹This claim is taken from Hart 1992 [2] because no copy of George Hart's 1984 technical report was available.

Humans can learn to detect appliances in aggregate data by eye, especially appliances with feature-rich signatures such as the washing machine signature shown in Figure 1. Humans almost certainly make use of a variety of features such as the rapid on-off cycling of the motor (which produces the rapid ∼200 watt oscillations) and the ramps towards the end of the cycle as the washer starts to rapidly spin the clothes. We could consider hand-engineering feature extractors for these rich features. But this would be time consuming and the resulting feature detectors may not be robust to noise and artefacts. Two key research questions emerge: Could an algorithm automatically learn to detect these features? Can we learn anything from neighbouring machine learning fields such as image classification?

Before 2012, the dominant approach to extracting features for image classification was to hand-engineer feature detectors such as the scale-invariant feature transform [5] (SIFT) and difference of Gaussians (DoG).
Then, in 2012, Krizhevsky et al.'s winning algorithm [6] in the ImageNet Large Scale Visual Recognition Challenge achieved a substantially lower error score (15%) than the second-best approach (26%). Krizhevsky et al.'s approach did not use hand-engineered feature detectors. Instead they used a deep neural network which automatically learnt to extract a hierarchy of features from the raw image. Deep learning is now a dominant approach not only in image classification but also in fields such as automatic speech recognition [7], machine translation [8], and even learning to play computer games from scratch [9]!

In this paper, we investigate whether deep neural nets can be applied to energy disaggregation. The use of 'small' neural nets on NILM dates back at least to Roos et al. 1994 [10] (although that paper was just a proposal) and continued with [11, 12, 13, 14], but these small nets do not appear to learn a hierarchy of feature detectors. A big breakthrough in image classification came when the compute power (courtesy of GPUs) became available to train deep neural networks on large amounts of data. In the present research, we want to see if deep neural nets can deliver good performance on energy disaggregation.

Our main contribution is to adapt three deep neural network architectures to NILM. For each architecture, we train one network per target appliance. We compare two benchmark disaggregation algorithms (combinatorial optimisation and factorial hidden Markov models) to the disaggregation performance of our three deep neural nets using seven metrics. We also examine how well our neural nets generalise to appliances in houses not seen during training because, ultimately, when NILM is used 'in the field' we very rarely have ground truth appliance data for the houses for which we want to disaggregate. So it is essential that NILM algorithms can generalise to unseen houses.

Please note that, once trained, our neural nets do not need ground truth appliance data from each house! End-users would only need to provide aggregate data. This is because each neural network should learn the 'essence' of its target appliance such that it can generalise to unseen instances of that appliance. In a similar fashion, neural networks trained to do image classification are trained on many examples of each category (dogs, cats, etc.) and generalise to unseen examples of each category.

To provide more context, we will briefly sketch how our neural networks could be deployed at scale, in the wild. Each net would undergo supervised training on many examples of its target appliance type so that each network learns to generalise well to unseen appliances. Training is computationally expensive (days of processing on a fast GPU), but training does not have to be performed often. Once these networks are trained, inference is much cheaper (around a second of processing per network on a fast GPU for a week of aggregate data). Aggregate data from unseen houses would be fed through each network. Each network should filter out the power demand for its target appliance. This processing would probably be too computationally expensive to run on an embedded processor inside a smart meter or in-home display. Instead, the aggregate data could be sent from the smart meter to the cloud. The storage requirement for one 16 bit integer sample (0-64 kW in 1 watt steps) every ten seconds is 17 kilobytes per day uncompressed. This signal should be easily compressible because there are numerous periods in domestic aggregate power demand with little or no change. With a compression ratio of 5:1, and ignoring the datetime index, the total storage requirement for a year of data from 10 million users would be 13 terabytes (which could fit on two 8 TB disks). If one week of aggregate data can be processed in one second per home (which should be possible given further optimisation) then data from 10 million users could be processed by 16 GPU compute nodes. Alternatively, disaggregation could be performed on a compute device within each home (a modern laptop, mobile phone or dedicated 'disaggregation hub' could handle the disaggregation). A GPU is not required for disaggregation, although it makes it faster.

This paper is structured as follows: In Section 2 we provide a very brief introduction to artificial neural nets. In Section 3 we describe how we prepare the training data for our nets and how we 'augment' the training data by synthesising additional data. In Section 4 we describe how we adapted three neural net architectures to NILM. In Section 5 we describe how we do disaggregation with our nets. In Section 6 we present the disaggregation results of our three neural nets and two benchmark NILM algorithms. Finally, in Section 7 we discuss our results, offer our conclusions and describe some possible future directions for research.

2. INTRODUCTION TO NEURAL NETS

An artificial neural network (ANN) is a directed graph where the nodes are artificial neurons and the edges allow information from one neuron to pass to another neuron (or to the same neuron in a future time step). Neurons are typically arranged into layers such that each neuron in layer l connects to every neuron in layer l + 1. Connections are weighted and it is through modification of these weights that ANNs learn. ANNs have an input layer and an output layer. Any layers in between are called hidden layers. The forward pass of an ANN is where information flows from the input layer, through any hidden layers, to the output. Learning (updating the weights) happens during the backwards pass.

2.1 Forwards pass

Each artificial neuron calculates a weighted sum of its inputs, adds a learnt bias and passes this sum through an activation function. Consider a neuron which receives I inputs. The value of each input is represented by the input vector x. The weight on the connection from input i to neuron h is denoted by w_{ih} (so w is the 'weights matrix'). The weighted sum (also called the 'network input') of the inputs into neuron h can be written a_h = \sum_{i=1}^{I} x_i w_{ih}. The network input a_h is then passed through an activation function θ to produce the neuron's final output b_h, where b_h = θ(a_h). In this paper, we use the following activation functions: linear: θ(x) = x; rectified linear (ReLU): θ(x) = max(0, x); hyperbolic tangent (tanh): θ(x) = sinh(x) / cosh(x) = (e^x − e^{−x}) / (e^x + e^{−x}).

Multiple nonlinear hidden layers can be used to re-represent the input data (hopefully by learning a hierarchy of feature detectors), which gives deep nonlinear networks a great deal of expressive power [15, 16].
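For concreteness, here is a minimal NumPy sketch of the forwards pass through one fully-connected layer using the notation above. The function and variable names are ours and this is purely illustrative, not the code used in our experiments:

```python
import numpy as np

def relu(x):
    # Rectified linear activation: theta(x) = max(0, x)
    return np.maximum(0.0, x)

def forward_layer(x, w, bias, theta=relu):
    """Forwards pass for one fully-connected layer.

    x:     input vector, shape (I,)
    w:     weights matrix, shape (I, H); w[i, h] is the weight from input i to neuron h
    bias:  learnt biases, shape (H,)
    theta: activation function
    """
    a = x @ w + bias      # network input: a_h = sum_i x_i * w_ih + bias_h
    b = theta(a)          # neuron output: b_h = theta(a_h)
    return b

# Example: 5 inputs feeding 3 hidden neurons (random weights for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=5)
w = rng.normal(size=(5, 3))
bias = np.zeros(3)
print(forward_layer(x, w, bias))
```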
2.2 Backwards pass

The basic idea of the backwards pass is to first do a forwards pass through the entire network to get the network's output for a specific network input. Then compute the error of the output relative to the target (in all our experiments we use the mean squared error (MSE) as the objective function). Then modify the weights in the direction which should reduce the error.
In practice, the forward pass is often computed over a batch of randomly selected input vectors. In our work, we use a batch size of 64 sequences per batch for all but the largest recurrent neural network (RNN) experiments. In our largest RNNs we use a batch size of 16 (to allow the network to fit into the 3 GB of RAM on our GPU).

How do we modify each weight to reduce the error? It would be computationally intractable to enumerate the entire error surface. But MSE gives a smooth error surface and the activation functions are differentiable, hence we can use gradient descent. The first step is to compute the gradient of the error surface at the position of the current batch by calculating the derivative of the objective function with respect to each weight. Then we modify each weight by taking a step along the negative gradient, scaled by a 'learning rate' parameter. To efficiently compute the gradient (in O(W) time, where W is the number of weights) we use the backpropagation algorithm [17, 18, 19]. In all our experiments we use stochastic gradient descent (SGD) with Nesterov momentum of 0.9.
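To make the update rule concrete, the following sketch performs SGD steps with Nesterov momentum of 0.9 on the MSE objective for a toy linear model. The gradient is computed analytically for this one-layer model rather than by backpropagation, and the learning rate and toy data are illustrative assumptions:

```python
import numpy as np

def mse(pred, target):
    # Mean squared error, the objective used in all our experiments.
    return np.mean((pred - target) ** 2)

def mse_grad_linear(w, x, target):
    # Analytic gradient of the MSE w.r.t. w for a toy linear model pred = x @ w.
    pred = x @ w
    return 2.0 * x.T @ (pred - target) / len(target)

def sgd_nesterov_step(w, velocity, x, target, learning_rate=0.01, momentum=0.9):
    """One SGD update with Nesterov momentum on a mini-batch (x, target)."""
    lookahead = w + momentum * velocity          # gradient is evaluated at the look-ahead point
    grad = mse_grad_linear(lookahead, x, target)
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity                # step in the direction that reduces the error

# Toy usage: learn y = 2x from a mini-batch of 64 examples.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 1))
target = 2.0 * x[:, 0]
w, velocity = np.zeros(1), np.zeros(1)
for _ in range(200):
    w, velocity = sgd_nesterov_step(w, velocity, x, target)
print(w, mse(x @ w, target))   # w ends up close to [2.0]
```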
2.3 Convolutional neural nets

Consider the task of identifying objects in a photograph. Whether we hand-engineer feature detectors or learn feature detectors from the data, it turns out that useful 'low level' features concern small patches of the image and include features such as edges of different orientations, corners, blobs etc. To extract these features, we want to build a small number of feature detectors (one for horizontal lines, one for blobs etc.) with small receptive fields (overlapping sub-regions of the input image) and slide these feature detectors across the entire image. Convolutional neural nets (CNNs) [20, 21, 22] build a small number of filters, each with a small receptive field, and these filters are duplicated (with shared weights) across the entire input.

Similarly to computer vision tasks, in time series problems we often want to extract a small number of low level features with small receptive fields across the entire input. All of our nets use at least one 1D convolutional layer at the input.
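The following sketch illustrates what a 1D convolutional layer at the input does: a small set of filters, each with a small receptive field, is slid across the whole window of aggregate power with shared weights. The filter values here are random for illustration; in our nets they are learnt:

```python
import numpy as np

def conv1d(signal, filters, bias):
    """Valid 1D convolution (strictly, cross-correlation, as in most deep learning libraries).

    signal:  aggregate power window, shape (T,)
    filters: shape (n_filters, receptive_field); each filter is shared across the whole input
    bias:    shape (n_filters,)
    Returns ReLU feature maps of shape (n_filters, T - receptive_field + 1).
    """
    n_filters, receptive_field = filters.shape
    n_out = len(signal) - receptive_field + 1
    out = np.empty((n_filters, n_out))
    for t in range(n_out):
        patch = signal[t:t + receptive_field]
        out[:, t] = filters @ patch + bias
    return np.maximum(0.0, out)   # ReLU

# Example: 16 filters with a receptive field of 3 samples over a 128-sample window.
rng = np.random.default_rng(0)
window = rng.random(128)
filters = rng.normal(size=(16, 3))
feature_maps = conv1d(window, filters, bias=np.zeros(16))
print(feature_maps.shape)   # (16, 126)
```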
3. TRAINING DATA

Deep neural nets need a lot of training data because they have a large number of trainable parameters (the network weights and biases). The nets described in this paper have between 1 million and 150 million trainable parameters, so large training datasets are important. It is also common practice in deep learning to increase the effective size of the training set by duplicating the training data many times and applying realistic transformations to each copy. For example, in image classification, we might flip the image horizontally or apply slight affine transformations.

A related approach to creating a large training dataset is to generate simulated data. For example, Google DeepMind train their algorithms [9] on computer games because they can generate an effectively infinite amount of training data. Realistic synthetic speech audio data or natural images are harder to produce.

In energy disaggregation, we have the advantage that generating effectively infinite amounts of synthetic aggregate data is relatively easy by randomly combining real appliance activations. (We define an 'appliance activation' to be the power drawn by a single appliance over one complete cycle of that appliance. For example, Figure 1 shows a single activation for a washing machine.) We trained our nets on both synthetic aggregate data and real aggregate data in a 50:50 ratio. We found that synthetic data acts as a regulariser. In other words, training on a mix of synthetic and real aggregate data rather than just real data appears to improve the net's ability to generalise to unseen houses. For validation and testing we use only real data (not synthetic).
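A minimal sketch of the kind of procedure we mean by randomly combining real appliance activations is shown below. The 50% inclusion probability, the number of distractor activations and the function name are illustrative assumptions rather than the exact scheme used in our experiments:

```python
import numpy as np

def make_synthetic_window(target_activations, other_activations, seq_length, rng):
    """Create one synthetic (aggregate, target) training pair.

    target_activations: list of 1D arrays, activations of the target appliance
    other_activations:  list of 1D arrays, activations of 'distractor' appliances
    seq_length:         window width in samples
    """
    aggregate = np.zeros(seq_length)
    target = np.zeros(seq_length)

    # Paste a target activation into the window at a random offset.  Sometimes omit it
    # entirely, so the net also sees windows without the target appliance.
    if rng.random() < 0.5:
        activation = target_activations[rng.integers(len(target_activations))]
        start = rng.integers(0, max(1, seq_length - len(activation)))
        end = min(seq_length, start + len(activation))
        aggregate[start:end] += activation[:end - start]
        target[start:end] += activation[:end - start]

    # Add a few distractor activations to the aggregate only.
    for _ in range(rng.integers(0, 3)):
        activation = other_activations[rng.integers(len(other_activations))]
        start = rng.integers(0, max(1, seq_length - len(activation)))
        end = min(seq_length, start + len(activation))
        aggregate[start:end] += activation[:end - start]

    return aggregate, target
```

Windows built this way would then be mixed 50:50 with windows cut from the real aggregate data, as described above.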
We used UK-DALE [23] as our source dataset. Each submeter in UK-DALE samples once every 6 seconds. All houses record aggregate apparent mains power once every 6 seconds. Houses 1, 2 and 5 also record active and reactive mains power once a second. In these houses, we downsampled the 1 second active mains power to 6 seconds to align with the submetered data and used this as the real aggregate data from these houses. Any gaps in appliance data shorter than 3 minutes are assumed to be due to RF issues and so are filled by forward-filling. Any gaps longer than 3 minutes are assumed to be due to the appliance and meter being switched off and so are filled with zeros.
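A sketch of this alignment and gap-filling with pandas is shown below. Both series are assumed to have a DatetimeIndex; the function is illustrative and is not the preprocessing code shipped with UK-DALE or NILMTK:

```python
import pandas as pd

def preprocess(mains_1s: pd.Series, appliance_6s: pd.Series):
    """Align 1 s active mains power with 6 s submeter data and fill gaps.

    Both series are assumed to be indexed by timestamp (DatetimeIndex).
    """
    # Downsample the 1 second active mains power to 6 seconds to align with the submeter.
    mains_6s = mains_1s.resample('6s').mean()

    # Put the appliance data on the same regular 6 second grid; missing samples become NaN.
    appliance = appliance_6s.resample('6s').mean()

    # Gaps shorter than 3 minutes (30 samples at 6 s) are assumed to be RF dropouts, so
    # forward-fill them.  Remaining (longer) gaps are assumed to mean the appliance and
    # meter were switched off, so fill them with zeros.  (This simple version also
    # forward-fills the first 30 samples of longer gaps; a faithful implementation would
    # measure each gap's length first.)
    appliance = appliance.ffill(limit=30).fillna(0.0)

    return mains_6s, appliance
```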
We manually checked a random selection of appliance activations from every house. The UK-DALE metadata shows that House 4's microwave and washing machine share a single meter (a fact that we manually verified) and hence these appliances from House 4 are not used in our training data.

We train one network per target appliance. The target (i.e. the desired output of the net) is the power demand of the target appliance. The input to every net we describe in this paper is a window of aggregate power demand. The window width is decided on an appliance-by-appliance basis and varies from 128 samples (13 minutes) for the kettle to 1536 samples (2.5 hours) for the dish washer. We found that increasing the window size hurts disaggregation performance for short-duration appliances (for example, using a sequence length of 1024 for the fridge resulted in the autoencoder (AE) failing to learn anything useful and the 'rectangles' net achieving an F1 score of 0.68; reducing the sequence length to 512 allowed the AE to get an F1 score of 0.87 and the 'rectangles' net a score of 0.82). On the other hand, it is important to ensure that the window width is long enough to capture the majority of the appliance activations.

For each house, we reserved the last week of data for testing and used the rest of the data for training. The number of appliance training activations is shown in Table 1 and the number of testing activations is shown in Table 2. The specific houses used for training and testing are shown in Table 3.

Table 1: Number of training activations per house.

                     House 1   House 2   House 3   House 4   House 5
  Kettle                2836       543        44       716       176
  Fridge               16336      3526         0      4681      1488
  Washing machine        530        53         0         0        51
  Microwave             3266       387         0         0        28
  Dish washer            197        98         0        23         0

Table 2: Number of testing activations per house.

                     House 1   House 2   House 3   House 4   House 5
  Kettle                  54        29        40        50        18
  Fridge                 168       277         0       145       140
  Washing machine         10         4         0         0         2
  Microwave               90         9         0         0         4
  Dish washer              3         7         0         3

3.1 Choice of appliances

We used five target appliances in all our experiments: the fridge, washing machine, dish washer, kettle and microwave. We chose these appliances because each is present in at least three houses in UK-DALE. This means that, for each appliance, we can train our nets on at least two houses and test on a different house. These five appliances consume a significant proportion of energy and represent a range of different power 'signatures', from the simple on/off of the kettle to the complex pattern shown by the washing machine (Figure 1).

'Small' appliances such as games consoles and phone chargers are problematic for many NILM algorithms because the effect of small appliances on aggregate power demand tends to get lost in the noise. By definition, small appliances do not consume much energy individually, but modern homes tend to have a large number of such appliances, so their combined consumption can be significant. Hence it would be useful to detect small appliances using NILM. We have not explored whether our neural nets perform well on 'small' appliances but we plan to in the future.

3.2 Extract activations

Appliance activations are extracted using NILMTK's [24] Electric.get_activations() method. The arguments we passed to get_activations() for each appliance are shown in Table 4. On simple appliances such as toasters, we extract activations by finding strictly consecutive samples above some threshold power. We then throw away any activations shorter than some threshold duration (to ignore spurious spikes). For more complex appliances such as washing machines, whose power demand can drop below threshold for short periods during a cycle, NILMTK ignores short periods of sub-threshold power demand.
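The sketch below shows the general idea in simplified form; it is not NILMTK's implementation, and the parameter names and units (samples rather than seconds) are illustrative:

```python
import numpy as np

def get_activations_sketch(power, on_power_threshold, min_on_duration, min_off_duration):
    """Extract activations from one appliance's power series (simplified sketch).

    power:              1D array of power samples
    on_power_threshold: watts above which the appliance is considered 'on'
    min_on_duration:    discard activations shorter than this many samples (spurious spikes)
    min_off_duration:   ignore sub-threshold dips shorter than this many samples
                        (e.g. a washing machine pausing mid-cycle)
    """
    is_on = power >= on_power_threshold
    activations = []
    start = None
    off_run = 0
    for t, on in enumerate(is_on):
        if on:
            if start is None:
                start = t
            off_run = 0
        elif start is not None:
            off_run += 1
            if off_run > min_off_duration:
                end = t - off_run + 1            # last sample of the activation + 1
                if end - start >= min_on_duration:
                    activations.append(power[start:end])
                start, off_run = None, 0
    if start is not None and len(power) - start >= min_on_duration:
        activations.append(power[start:])        # series ended while the appliance was on
    return activations
```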
… complex. We layer every predicted 'appliance rectangle' on top of each other. We measure the overlap and normalise the overlap to [0, 1]. This gives a probabilistic output for each appliance's power demand. To convert this to a single vector per appliance, we threshold the power and probability.
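A sketch of how this overlapping procedure can be implemented is shown below. The normalisation by the maximum possible overlap count and the two thresholds are our illustrative reading of the procedure, not the exact constants used in our experiments:

```python
import numpy as np

def overlap_rectangles(rectangles, total_length, max_overlap,
                       power_threshold=10.0, probability_threshold=0.5):
    """Combine per-window 'rectangle' predictions into one power series.

    rectangles:   list of (start, end, mean_power) tuples in absolute sample indices,
                  one per sliding-window position
    total_length: number of samples in the aggregate series
    max_overlap:  maximum number of windows that can cover any one sample
                  (used to normalise the overlap count to [0, 1])
    """
    overlap_count = np.zeros(total_length)
    power_sum = np.zeros(total_length)
    for start, end, mean_power in rectangles:
        overlap_count[start:end] += 1
        power_sum[start:end] += mean_power

    probability = overlap_count / max_overlap                 # probabilistic output in [0, 1]
    mean_power = np.divide(power_sum, overlap_count,
                           out=np.zeros(total_length), where=overlap_count > 0)

    # Threshold both the probability and the power to get a single vector per appliance.
    on = (probability >= probability_threshold) & (mean_power >= power_threshold)
    return np.where(on, mean_power, 0.0)
```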
6. RESULTS

The disaggregation results on an unseen house are shown in Figure 3. The results on houses seen during training are shown in Figure 4. Both figures also show scores for the combinatorial optimisation (CO) and factorial hidden Markov model (FHMM) algorithms.

On the unseen house (Figure 3), both the denoising autoencoder (dAE) and the net which regresses the start time, end time and power demand (the 'rectangles' architecture) outperform CO and FHMM on every appliance on F1 score, precision score, proportion of total energy correctly assigned and mean absolute error. The LSTM outperforms CO and FHMM on two-state appliances (kettle, fridge and microwave) but falls behind CO and FHMM on multi-state appliances (dish washer and washing machine).

On the houses seen during training (Figure 4), the dAE outperforms CO and FHMM on every appliance on every metric except relative error in total energy. The 'rectangles' architecture outperforms CO and FHMM on every appliance (except the microwave) on F1, precision, accuracy, proportion of total energy correctly assigned and mean absolute error.

The full disaggregated time series for all our algorithms, the aggregate data and the appliance ground truth data are available at www.doc.ic.ac.uk/~dk3810/neuralnilm

Figure 3: Disaggregation performance on a house not seen during training. (One panel per metric: F1 score, precision score, recall score, accuracy score, relative error in total energy, proportion of total energy correctly assigned, and mean absolute error in watts; algorithms compared: Combinatorial Opt., Factorial HMM, Autoencoder, Rectangles and LSTM.)

The metrics we used are:

TP = number of true positives (1)
FP = number of false positives (2)
FN = number of false negatives (3)
P = number of positives in ground truth (4)
N = number of negatives in ground truth (5)
E = total actual energy (6)
Ê = total predicted energy (7)
y_t^{(i)} = appliance i actual power at time t (8)
ŷ_t^{(i)} = appliance i estimated power at time t (9)
ȳ_t = aggregate actual power at time t (10)

recall = TP / (TP + FN) (11)
precision = TP / (TP + FP) (12)
F1 = 2 × (precision × recall) / (precision + recall) (13)
accuracy = (TP + TN) / (P + N) (14)
relative error in total energy = |Ê − E| / max(E, Ê) (15)
mean absolute error = (1/T) ∑_{t=1}^{T} |ŷ_t − y_t| (16)
proportion of total energy correctly assigned = 1 − (∑_{t=1}^{T} ∑_{i=1}^{n} |ŷ_t^{(i)} − y_t^{(i)}|) / (2 ∑_{t=1}^{T} ȳ_t) (17)

The proportion of total energy correctly assigned is taken from [34].
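For reference, the sketch below computes these metrics for a single appliance from an estimated and an actual power series. The 10 W on-power threshold used to derive TP, FP, FN and TN is an illustrative assumption, not the threshold used in our experiments:

```python
import numpy as np

def metrics(y_hat, y, on_threshold=10.0):
    """Classification and regression metrics for one appliance.

    y_hat, y: 1D arrays of estimated and actual appliance power at each time step.
    (Assumes the appliance is both predicted and present at least once,
    otherwise the precision/recall denominators would be zero.)
    """
    pred_on = y_hat >= on_threshold
    true_on = y >= on_threshold

    tp = np.sum(pred_on & true_on)
    fp = np.sum(pred_on & ~true_on)
    fn = np.sum(~pred_on & true_on)
    tn = np.sum(~pred_on & ~true_on)

    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / len(y)                       # len(y) == P + N

    e, e_hat = y.sum(), y_hat.sum()                     # total actual and predicted energy
    relative_error_in_total_energy = abs(e_hat - e) / max(e, e_hat)
    mean_absolute_error = np.mean(np.abs(y_hat - y))
    return f1, precision, recall, accuracy, relative_error_in_total_energy, mean_absolute_error

def proportion_of_energy_correctly_assigned(y_hat_all, y_all, mains):
    """Equation (17): y_hat_all and y_all have shape (n_appliances, T); mains has shape (T,)."""
    return 1 - np.abs(y_hat_all - y_all).sum() / (2 * mains.sum())
```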
7. CONCLUSIONS & FUTURE WORK

We have adapted three neural network architectures to NILM. The denoising autoencoder and the 'rectangles' architectures perform well, especially on unseen houses. We believe that deep neural nets show great promise for NILM. But there is plenty of work still to do!

It is worth noting that our comparison between each architecture is not entirely fair because the architectures have a wide range of trainable parameters. For example, every LSTM we used had 1M parameters whilst the larger dAE and rectangles nets had over 150M parameters (we did try training an LSTM with more parameters but it did not appear to improve performance).

Our LSTM results suggest that LSTMs work best for two-state appliances but do not perform well on multi-state appliances such as the dish washer and washing machine. One possible reason is that, for these appliances, informative 'events' in the power signal can be many time steps apart (e.g. for the washing machine there might be over 1,000 time steps between the first heater activation and the spin cycle). In principle, LSTMs have an arbitrarily long memory. But these long gaps between informative events may …
[Figure 2 plot panels (see caption below). Columns: Kettle, Washing Machine, Fridge; x-axis: time (number of samples).]
Figure 2: Example outputs produced by all three neural network architectures for three appliances. Each
column shows data for a different appliance. The rows are in three groups (the tall grey rectangles on the far
left). The top group shows measured data from House 1. The top row shows the measured aggregate power
data from House 1 (the input to the neural nets). The Y-axis scale for the aggregate data is standardised
such that its mean is 0 and its standard deviation is 1 across the data set. The Y-axis range for all other
subplots is [0, 1]. The second row shows the single-appliance power demand (i.e. what the neural nets are
trying to estimate). The middle group of rows shows the raw output from each neural network (just a single
pass through each network). The bottom group of rows shows the result of sliding the network over the
aggregate data with STRIDE=16 and overlapping the output. Please note that the ‘rectangles’ net is trained
such that the height of the output rectangle should be the mean power demand over the duration of the
identified activation.
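A sketch of how such overlapping outputs can be produced is shown below: slide the network along the aggregate signal with a fixed stride and combine the overlapping window outputs. Averaging is one simple way to combine them, and `net` stands for a single forwards pass through any of the trained networks; both are illustrative assumptions:

```python
import numpy as np

def disaggregate_overlapping(aggregate, net, seq_length, stride=16):
    """Slide `net` over `aggregate` and average the overlapping window outputs.

    aggregate: 1D array of aggregate power samples
    net:       callable mapping a window of length seq_length to an output of the same length
    """
    n = len(aggregate)
    output_sum = np.zeros(n)
    output_count = np.zeros(n)
    for start in range(0, n - seq_length + 1, stride):
        window = aggregate[start:start + seq_length]
        output = net(window)                              # single pass through the network
        output_sum[start:start + seq_length] += output
        output_count[start:start + seq_length] += 1
    # Average wherever at least one window covered the sample.
    return np.divide(output_sum, output_count,
                     out=np.zeros(n), where=output_count > 0)
```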
[Figure 4: Disaggregation performance on houses seen during training (see Section 6). Panels: Kettle, Dish washer, Fridge, Microwave, Washing machine, Across all appliances. One row of panels per metric: F1 score, precision score, recall score, accuracy score, relative error in total energy, proportion of total energy correctly assigned, and mean absolute error in watts.]

7.2 Unsupervised pre-training

In NILM, we generally have access to much more unlabelled data than labelled data. One advantage of neural nets is that they could, in principle, be 'pre-trained' on unlabelled data before being fine-tuned on labelled data. Pre-training should allow the networks to start to identify useful features from the data, but it does not allow the nets to learn to label appliances. (Pre-training is rarely used in modern image classification tasks because very large labelled datasets are available for image classification. But in NILM we have much more unlabelled data than labelled data, so pre-training is likely to be useful.) After unsupervised pre-training, each net would undergo supervised training. Instead of (or as well as) pre-training on all available unlabelled data, it may also be interesting to try pre-training largely on unlabelled data from each house that we wish to disaggregate.

8. ACKNOWLEDGMENTS

Jack Kelly's PhD is funded by the EPSRC and by Intel via their EU Doctoral Student Fellowship Programme. The authors would like to thank Pedro Nascimento for his comments on a draft of this manuscript.