MSC Proj
Christopher Sipola
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2017
Abstract
This project explores whether a neural network is capable of predicting summary statistics of electricity usage for five common household appliances given only the aggregate signal of a smart meter. These appliance-level statistics are needed for many kinds of feedback and analytics provided to energy consumers so they can reduce electricity consumption and save on energy bills. An example of such a statistic is the total energy used by a washing machine in a day. Currently the state-of-the-art approach is to use non-intrusive load monitoring (NILM) to predict appliance-level signals timepoint-by-timepoint, and then compute the statistics using these predictions. However, this is indirect, computationally expensive and generally a harder learning problem.
The statistics can also be used as input into one of the most successful NILM models, the additive factorial hidden Markov model (AFHMM) with latent Bayesian melding (LBM). This model uses these appliance-level statistics as priors to significantly improve timepoint-by-timepoint predictions. However, the model is currently limited to using national averages, since so far there have been no methods for inferring day- and house-specific statistics. Improved statistics can therefore lead to more effective NILM models.
Since this type of learning problem is unexplored, we use a dynamic architecture generation process to find networks that perform well. We also implement a new process for generating more realistic synthetic data that preserves some cross-appliance usage information. Results show that a neural network is in fact capable of predicting appliance-level summary statistics. More importantly, most models generalize successfully to houses that were not used in training and validation, with the best-performing models having an error that is less than half the baseline.
Acknowledgements
I would like to thank my supervisor Nigel Goddard, as well as Mingjun Zhong and
Chaoyun Zhang, for their guidance throughout this project.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Christopher Sipola)
Table of Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Achieved results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Background 7
2.1 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . . 8
2.3 Counting with neural networks . . . . . . . . . . . . . . . . . . . . . 10
2.4 Non-intrusive load monitoring . . . . . . . . . . . . . . . . . . . . . 11
2.5 Multi-task learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Methods 13
3.1 Resources and tools . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Overview of the dataset . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Calculation of target variables . . . . . . . . . . . . . . . . . . . . . 15
3.3.1 Number of activations . . . . . . . . . . . . . . . . . . . . . 15
3.3.2 Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Data exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4.1 Target appliance signatures . . . . . . . . . . . . . . . . . . . 17
3.4.2 Daily statistics . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 Data cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5.1 Overview of data issues . . . . . . . . . . . . . . . . . . . . 23
3.5.2 Unused data . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5.3 Data only used to create synthetic training data . . . . . . . . 24
3.6 Dataset creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6.1 Data split: training, validation and testing . . . . . . . . . . . 24
3.6.2 Structure of the datasets . . . . . . . . . . . . . . . . . . . . 28
3.6.3 Timestep standardization . . . . . . . . . . . . . . . . . . . . 28
3.6.4 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . 29
3.6.5 Data preprocessing: inputs . . . . . . . . . . . . . . . . . . . 31
3.6.6 Data preprocessing: targets . . . . . . . . . . . . . . . . . . . 33
3.7 Hyperparameter and architecture selection . . . . . . . . . . . . . . . 33
3.7.1 Architecture design overview . . . . . . . . . . . . . . . . . 33
3.7.2 Hyperparameter selection procedure . . . . . . . . . . . . . . 36
3.7.3 Random hyperparameter choices . . . . . . . . . . . . . . . . 37
3.7.4 Static hyperparameters . . . . . . . . . . . . . . . . . . . . . 39
3.7.5 Other network and training elements . . . . . . . . . . . . . . 40
3.7.6 Example architecture . . . . . . . . . . . . . . . . . . . . . . 41
3.7.7 Hyperparameter selection results . . . . . . . . . . . . . . . . 43
4 Evaluation 47
4.1 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Performance results . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Single-target models . . . . . . . . . . . . . . . . . . . . . . 48
4.2.2 Multi-target models . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.3 Accuracy of activations feedback . . . . . . . . . . . . . . . 51
4.2.4 Results by house . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Importance of synthetic data . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Exploration of convolutional layer activations . . . . . . . . . . . . . 56
5 Conclusion 63
5.1 Remarks and observations . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Drawbacks and limitations . . . . . . . . . . . . . . . . . . . . . . . 64
5.3 Further work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A Data issues 67
A.1 Known data issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.2 Discovered data issues . . . . . . . . . . . . . . . . . . . . . . . . . 68
B Additional algorithms 71
Bibliography 81
List of Figures
3.6 Relationship between energy used per day and number of activations
per day. The points are jittered along the x-axis to reduce overlap. . . 21
3.8 Relationship between energy used per day and number of activations
per day, with House 20 highlighted in orange. A line of least squares is
fitted to emphasize correlation. The points are jittered along the x-axis
to reduce overlap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.9 Correlation between target appliances for the specified target variable.
For example, the correlation between daily number of microwave and
kettle activations is 0.48. . . . . . . . . . . . . . . . . . . . . . . . . 22
3.10 Split between training, validation and test datasets. Real training data
is in blue, while the appliance signals from the blue and orange data
are used to make the synthetic data. The horizontal orange stripes
represent the houses whose aggregate signals are made unusable due
to solar panel interference (1, 11 and 21). The horizontal purple stripes
are for the houses that are held out for validation (3, 9 and 20), while
the green stripes are held out for testing (2, 5 and 15) (both “unseen”).
The thin vertical yellow stripes indicate the dates that are held out for
validation and testing (“seen”). The red data is not used at all due to
errors in the appliance signal or gaps in the data, such as the outages
in February 2014. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.11 Process for creating synthetic data for an example house that has two
washing machines but no microwave. . . . . . . . . . . . . . . . . . 31
3.14 Results of the random search process. The curves represent the cumu-
lative minimum of the minimum smoothed validation error of the mod-
els for the specified appliance and target variable combination. The
validation error is computed on the preprocessed targets. . . . . . . . 43
3.15 Training and validation error for the chosen models, where the error is
computed on the preprocessed targets. The depicted validation error
curves have not been smoothed, although the choice of the models was
based on smoothed validation error. The number of epochs differ due
to early stopping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1 Test error relative to the baseline for single-target models, with 95%
confidence intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Test error relative to the baseline for unseen houses, with 95% confi-
dence intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Confusion matrix for washing machine activations on unseen houses.
Model predictions are rounded to the nearest integer. . . . . . . . . . 51
4.4 Confusion matrix for dishwasher activations on unseen houses. Model
predictions are rounded to the nearest integer. . . . . . . . . . . . . . 52
4.5 Confusion matrix for kettle activations on unseen houses. Model pre-
dictions are rounded to the nearest integer. . . . . . . . . . . . . . . . 52
4.6 Predictions vs. targets for washing machine activations. The points are
jittered along the x-axis to reduce overlap. . . . . . . . . . . . . . . . 53
4.7 Predictions vs. targets for fridge activations. . . . . . . . . . . . . . . 54
4.8 Test error for best-performing single-target models when models were
not trained with synthetic data. The error bars represent 95% confi-
dence intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.9 Example day of data where all target appliances are used. . . . . . . . 58
4.10 Dimensionality-reduced activations of the final convolutional layer of
the washing machine model. The target energy is 1.19 kWh and the
prediction is 1.11. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.11 Dimensionality-reduced activations of the final convolutional layer of
the dishwasher model. The target energy is 1.64 kWh and the predic-
tion is 1.47. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.12 Dimensionality-reduced activations of the final convolutional layer of
the kettle model. The target energy is 0.83 kWh and the prediction is
0.51. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.13 Dimensionality-reduced activations of the final convolutional layer of
the microwave model. The target energy is 0.12 kWh and the predic-
tion is 0.15. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.14 Dimensionality-reduced activations of the final convolutional layer of
the fridge model. The target energy is 0.75 kWh and the prediction is
0.45. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.15 Dimensionality-reduced activations of the final convolutional layer of
the multi-target model. . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Tables
3.1 Monitored appliances in each house. There is no House 14. Table data
comes from [3]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Table from [6] showing arguments passed to NILMTK’s Electric.get_activations() function. . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Final architectures for models predicting energy. . . . . . . . . . . . . 45
3.4 Final architectures for models predicting the number of activations. . . 45
4.1 Test error relative to the baseline for single-target models, with 95%
confidence intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Test MSE relative to the baseline (with 95% confidence intervals) when
models were not trained with synthetic data. . . . . . . . . . . . . . . 56
C.1 Test MAE loss with 95% confidence intervals. The units for energy
are kWh, and the units for number of activations are counts. . . . . . . 75
C.2 Test MAE loss when trained only on synthetic data, with 95% confi-
dence intervals. The units for energy are kWh, and the units for number
of activations are counts. . . . . . . . . . . . . . . . . . . . . . . . . 76
List of Algorithms
1 Bare-bones description of how the number of filters and the size of the
dense layers are determined. . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 1
Introduction
1.1 Motivation
tion, which is the use of software to disaggregate the signal from a smart meter into
the signals of the appliances.
NILM is often framed as a supervised machine learning problem, so training the
disaggregation models requires appliance-level data. There are a number of studies
that have collected such data using IAMs. Arguably the best dataset for aggregate-
and appliance-level data is REFIT, which has a combination of a long study duration,
a large number of participating houses and a high sampling rate [10].
While disaggregation is a powerful tool for demystifying energy usage, some chal-
lenges remain. One challenge pertains to a notable state-of-the-art model, the addi-
tive factorial hidden Markov model (AFHMM), which uses latent Bayesian melding
(LBM) to incorporate appliance-level statistics for more accurate predictions. Cur-
rently this model is limited to using national averages, and it would likely see improved
performance if the statistics were inferred specifically for each house and day [11]. A
second challenge is that computing summary statistics by first disaggregating into the
appliance series timepoint-by-timepoint is computationally expensive. It is likely that a
more straightforward approach would be more efficient. A third challenge is that com-
puting certain summary statistics on noisy timepoint-by-timepoint predictions may be
difficult. For example, simple methods that count appliance cycles are meant to be
used on clean IAM data, so using them on noisy predictions can introduce another
source of error.3 And a fourth challenge is that using NILM for summarization is in-
herently a difficult learning problem. This is because NILM models are incapable of,
for example, using cross-appliance information across the day to make predictions, or
trading off a positive error early in the day against a negative error later in the day.
The goal of this dissertation project is to support IDEAL by exploring whether a
model can summarize appliance-level statistics directly, given the aggregate signal of
a smart meter. The two statistics that will be predicted are the number of activations—
that is, the number of times an appliance is used—and energy used by the appliance.
The five appliances that will be modeled are the fridge, kettle, washing machine, dish-
washer, and microwave. In Figure 1.1, this would mean a model takes the aggregate
power series (blue) as input, and returns, for example, the energy used by the washing
machine series (pink) as output.
The project uses neural networks since they are generally good at modeling data with complex structure between input features and have already proved successful in NILM.2

2 NILM is also known as nonintrusive appliance load monitoring, abbreviated NALM or NAILM.
3 An example of such a method is the Electric.get_activations() function. See Section 3.3.1.
1.2 Objectives
There will be two primary objectives for this project, which have direct relevance to the application domain, along with two secondary objectives:
1. Determine whether neural networks can, in principle, predict energy and the
number of activations given the aggregate signal of a day of data. This is achieved
by evaluating the models on houses that were seen during the training and vali-
dation process, but for days that were not seen.
2. Assess how well the models generalize. This is achieved by evaluating the mod-
els on houses that were not seen during training and validation. This is a more
difficult problem, but also a more important one since it explores how well the
models perform when used within the context of the application domain.
3. Explore whether models that predict a target variable for all target appliances (as opposed to just one) perform comparably to models that are trained specifically for certain appliances.4
4. Explore the results in a way that demystifies the neural network performance
beyond the headline evaluation metrics. This means exploring differences in
performance between houses that are seen and unseen during training and vali-
dation, performing diagnostics of the predictions, and visualizing the inner work-
ings of the neural networks to determine what the models are “looking at” when
making predictions.
1.3 Achieved results
We find that neural networks are in principle capable of predicting summarized electricity usage information. More importantly, nine of the twelve appliance models generalize to houses not seen during training, performing better than the baseline. The data augmentation process can be credited in part for the generalization results.
The models perform best on appliances with complex signatures, like the washing
machine and the dishwasher. These results are arguably strong enough to provide
directly to consumers. The fridge model, on the other hand, generalizes poorly. The
multi-target model also does not generalize as well as the single-target models.
4 The idea exploited by these models, called multi-task learning, is discussed in Section 2.5.
1.4 Dissertation outline
Chapter 1 (this chapter) provided an overview of the motivation, objectives and results of the project. Chapter 2 discusses prior research that is relevant to this project, which is mostly research related to neural networks and NILM. Chapter 3 describes the methods used in the project, including a description and exploration of the REFIT dataset, the data cleaning process, the creation of the input data and targets, the selection process for the network architectures, and the results of the architecture selection process. Chapter 4 evaluates the models chosen in the selection process to determine whether the objectives of the project have been met. Chapter 5 summarizes the findings of the paper, describes the limitations of the methods, and suggests paths for further research.
Chapter 2
Background
2.1 Neural networks
In recent years, the revival of the neural network has revolutionized fields such as
image recognition, speech recognition, natural language processing, and other areas
of machine learning. A neural network—sometimes referred to as an artificial neural
network (ANN) or a multilayer perceptron (MLP)—is a powerful supervised machine
learning model that has the ability to represent complex nonlinear relationships in the
input data. The model was originally developed to model the biological brain, but has
since diverged to be optimized for classification and regression tasks.
The basic unit of the neural network is the neuron, although it is often simply called a “unit.” A unit takes a weighted average of input values, applies a nonlinear “activation function,” and returns a scalar value (Figure 2.1). The classic activation function is the sigmoid, defined as f(x) = 1/(1 + e^{-x}), which compresses values to be between zero and one [1].
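As a concrete illustration, a single unit can be written in a few lines of numpy. This is a minimal sketch for exposition (the function names and values are ours, not code from this project):

    import numpy as np

    def sigmoid(z):
        # Compresses any real value into the interval (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    def unit(x, w, b):
        # A single neuron: a weighted sum of the inputs plus a bias,
        # passed through the nonlinear activation function.
        return sigmoid(np.dot(w, x) + b)

    x = np.array([0.5, -1.2, 3.0])  # input values
    w = np.array([0.1, 0.4, -0.2])  # learned weights
    print(unit(x, w, b=0.05))       # a scalar in (0, 1)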
Neural networks are often composed of layers, which are themselves composed of
units (Figure 2.2). Layers between the input layer and the output layer are called hidden
layers. When all the units of a layer are connected to all units of the previous layer, the
layer is called “dense” or “fully connected.” Neural networks with at least one hidden
layer with a nonlinear activation function are universal approximators, meaning that
they can, in principle, approximate any continuous function of arbitrary complexity.
In practice, more layers often result in more flexibility and representational power,
allowing the network to represent increasingly abstract relationships in the data [1].
Figure 2.2: Basic neural network architecture with two hidden layers, from [1].
For a neural network to learn, the practitioner must define a loss function to eval-
uate the performance of the network’s predictions. During training, the neural net-
work iteratively searches for a set of weights that minimize this loss function through
a gradient-based optimization algorithm such as gradient descent. A process called
backpropagation is used to determine the error contribution of each weight and there-
fore how much each weight should change in each iteration of the optimization.
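In Keras (the library used for all modeling in this project; see Section 3.1), the loss function, optimizer and training loop come together as follows. This is a minimal sketch on toy data using the Keras 2 API; the layer sizes are illustrative and are not the architectures chosen in Chapter 3:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    # Toy regression data: 100 examples with 20 input features.
    X = np.random.rand(100, 20)
    y = np.random.rand(100, 1)

    model = Sequential()
    model.add(Dense(32, activation='relu', input_shape=(20,)))  # hidden layer
    model.add(Dense(1))                                         # linear output

    # The loss (mean squared error) is minimized by gradient descent;
    # Keras obtains the weight gradients via backpropagation.
    model.compile(loss='mse', optimizer='sgd')
    model.fit(X, y, epochs=5, batch_size=10, verbose=0)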
2.2 Convolutional neural networks
The field of image recognition has seen particular gains from a type of neural network
called a convolutional neural network (CNN). A CNN is different from the usual neural
network in that it assumes local correlation in input features such as the pixel values
in image data. A CNN works by sliding small feature detectors—called “kernels” or
“filters”—over the image, identifying pixel patterns such as edges and curves. The
outputs of these kernels create “feature maps,” which are usually used as input values
in the next hidden layer (Figure 2.3). A different set of kernels can then use these
feature maps to identify more abstract patterns, such as textures. With enough hidden
layers, a CNN can represent complex abstract patterns such as human faces. Notably,
the features identified by the kernels are learned through training, not hand-crafted.
Since each kernel has a single set of weights, CNNs have a much smaller number of
trainable parameters than would otherwise be the case [12].
Figure 2.3: Three 5 × 5 kernels (red) start at the top left of the 28 × 28 set of pixels, calculating the top-left value in three feature maps. These red kernels then slide over columns to define the first row in the feature maps. Then the process is repeated for the remaining rows. Image from [2].
In order to reduce the dimensionality of the network and to reduce overfitting, pooling layers are commonly placed between convolutional layers (Figure 2.4). The most common type of pooling layer is max pooling, which takes the maximum activation value over some area in a feature map (Figure 2.5). At the end of the network, the final convolutional layer is often “flattened”—that is, rolled out into a 1D vector—so that the final feature maps can be processed by dense layers.
Figure 2.4: Stacked convolutional and pooling layers. Image from [2].
When designing a new CNN architecture, practitioners often take inspiration from
state-of-the-art neural networks. One popular state-of-the-art CNN is called VGGNet [13],
Figure 2.5: Effect of max pooling on a feature map. Image from [1].
which was the runner-up in ILSVRC 2014, a well-known image recognition competi-
tion. This network is popular because of its relatively straightforward architecture.1
Time series often exhibit autocorrelation, where each timestep is correlated with the
previous timesteps in the series. Because autocorrelation is a type of local correlation,
time series can be modeled as if they were 1D images, allowing researchers to adapt
CNNs for time series classification problems. Instead of edges, the kernels might learn
to identify spikes or dips. And instead of human faces, the network might learn to
identify heartbeat patterns or household appliance signatures [14, 15, 16, 5, 6].2
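A minimal Keras sketch of such a 1D CNN is given below. The input length, filter counts and kernel sizes are illustrative assumptions, not the architectures selected in Section 3.7:

    from keras.models import Sequential
    from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

    SERIES_LENGTH = 14400  # e.g., one day at 6-second sampling

    model = Sequential()
    # Kernels slide along the time axis, detecting local patterns such
    # as spikes and dips rather than image edges.
    model.add(Conv1D(16, kernel_size=5, activation='relu',
                     input_shape=(SERIES_LENGTH, 1)))
    model.add(MaxPooling1D(pool_size=2))  # downsample the feature maps
    model.add(Conv1D(32, kernel_size=5, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())                  # roll feature maps into a 1D vector
    model.add(Dense(64, activation='relu'))
    model.add(Dense(1))                   # e.g., one summary statistic
    model.compile(loss='mse', optimizer='adam')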
2.3 Counting with neural networks
The prediction of the number of activations is, at its core, a counting problem.3 Counting is most explored in computer vision, where the methods often require: (1) hand-crafted object detectors; and (2) images with many instances of the desired objects (often hundreds), allowing false positives and false negatives to cancel in the estimated statistics [19]. Unfortunately, most of our appliances do not have anything close to hundreds of activations per day, and we generally want to avoid hand-crafting features.
Research has shown that CNNs can predict the number of objects in an image using supervised learning [21]. More generally, neuroscience research has shown that neural networks can learn “visual numerosity”—the capacity to estimate the number of objects seen. Networks from this research are trained using unsupervised learning (specifically, generative models), but the hidden layer activations can be used to classify images as having more or fewer target objects than some specified number. These models are robust to differences in cumulative area, density and features between the target objects [20].

1 Other state-of-the-art networks often use exotic architecture elements, such as ResNet’s heavy use of skip connections [1].
2 This neural network approach is preferred to using k-nearest neighbor classification (k-NN) with dynamic time warping (DTW) to identify appliance signatures. k-NN with DTW works by stretching or compressing multiple time series so that they align, and then performing k-NN (usually 1-NN) to find the labeled time series that has the most similar pattern [17, 18]. It is unclear how to implement this method for this project given that the input data is aggregated, making matching to labels ineffective.
3 The prediction of energy is more akin to integration. However, energy and the number of activations are highly correlated, suggesting that neural networks may learn similar features to predict both. See Section 3.4.2.
2.4 Non-intrusive load monitoring

4 Mathematically, the problem can be defined as p_{agg}(t) = \sum_{i=1}^{N} p_i(t), where p_{agg}(t) is the aggregate signal at time t and p_i(t) is the signal of an appliance. The goal of NILM would be to find one or more p_i(t) given only p_{agg}(t), for all values of t.
Chapter 3
Methods
3.1 Resources and tools
This project is implemented in Python 2.7. The numpy and pandas Python libraries
are used heavily throughout the project for their convenient scientific computing op-
erations and data structures. Scikit-learn, the most popular Python machine learning
library, is used for preprocessing. Keras, a high-level neural networks API [24], is
used for all neural network modeling, with TensorFlow as the back-end.1 Matplotlib
and seaborn are used for data visualization. Jupyter notebooks are used for prototyp-
ing, data exploration and executing code for data visualization. Git is used for version
control.
1 Keras is chosen over using TensorFlow directly due to its simplified syntax, which better allowed for dynamic architecture creation.
3.2 Overview of the dataset
There are many datasets with smart meter and appliance-level power data that are popular within the NILM community, including UK-DALE [25], HES [26], BLUED [27], and REDD [28]. However, REFIT (Personalised Retrofit Decision Support Tools for UK Homes Using Smart Home Technology) stands out from the others in that it has the combination of a long study duration (21 months; see Figure 3.1), more than just a few houses (20), a large number of appliances recorded with IAMs (9), and a high sampling rate (one reading every 6–8 seconds).2 In comparison, BLUED and UK-DALE have just 1 and 5 houses (respectively), HES a 10-minute sampling rate, and REDD a study duration of only 3–19 days [28].
Figure 3.1: Data availability for each house. Figure from [3].
There are two versions of the REFIT dataset: raw and cleaned. This project uses the
cleaned data, which corrects some appliance labeling errors in the raw data,3 forward-
fills null values, and removes spikes of over 4000 watts [29]. This cleaned dataset is
stored as a set of CSVs, one for each house. Each file has 7–8 million rows and is
roughly 300–400 MB in size. There are two datetime columns (one for a datetime
string and the other for a Unix datetime), one column for the aggregate power signal,
and nine columns for the power signals of the appliances. There is an additional col-
umn called “Issues” which has a binary indicator for whether the sum of the recorded
appliances exceeds the aggregate signal.
2 Although the mean and median sampling intervals are around 6–8 seconds, there is great variability in the difference between timestamps. The 10th percentile in the distribution of this difference is 2 seconds and the 90th percentile is 13 seconds. This is because the IAMs only record data when there is a change in load [3], which is discussed in Section 3.3.2.
3 For example, if the appliance plugged into an IAM changed at some point in the study, then this is corrected in the cleaned data.
Of the 20 houses, 18 have at least one fridge (or fridge-freezer), 16 have a kettle,
19 have at least one washing machine, 15 have a dishwasher, and 16 have a microwave
(Table 3.1).
Table 3.1: Monitored appliances in each house. There is no House 14. Table data
comes from [3].
3.3 Calculation of target variables
3.3.1 Number of activations
The activations—that is, the periods when an appliance is being used—can be extracted from an appliance’s power series using code adapted from the NILMTK Python package [6]. The process is described below:
than some threshold duration (to ignore spurious spikes). For more com-
plex appliances such as washing machines whose power demand can drop
below threshold for short periods during a cycle, NILMTK ignores short
periods of sub-threshold power demand.
Table 3.2: Table from [6] showing arguments passed to NILMTK’s Electric.get_activations() function.
The function arguments in this project are kept the same as in the original paper
(Table 3.2).4
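The following is a rough sketch of the thresholding logic described above, written for exposition; it is a simplification, not NILMTK’s actual implementation, and the function name and default durations (in samples) are assumptions:

    import numpy as np

    def count_activations(power, on_threshold, min_on=3, min_off=3):
        # Samples above on_threshold count as "on".
        on = (np.asarray(power) > on_threshold).astype(int)

        # +1 marks an off->on edge; -1 marks an on->off edge.
        edges = np.diff(np.concatenate(([0], on, [0])))
        starts = np.where(edges == 1)[0]
        ends = np.where(edges == -1)[0]
        if len(starts) == 0:
            return 0

        # Bridge sub-threshold gaps shorter than min_off (e.g., a washing
        # machine briefly dropping below threshold mid-cycle).
        keep = [0]
        for i in range(1, len(starts)):
            if starts[i] - ends[keep[-1]] < min_off:
                ends[keep[-1]] = ends[i]  # merge run i into the previous run
            else:
                keep.append(i)
        starts, ends = starts[keep], ends[keep]

        # Ignore spurious spikes shorter than min_on.
        return int(np.sum((ends - starts) >= min_on))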
3.3.2 Energy
The basic formula for energy is E = Pt, where E is energy, P is power and t is time. Since the IAMs only record data when there is a change in load [3], we can assume that the power level at timestamp t_n remains constant between timestamps t_n and t_{n+1}, where n is the index of the recording such that t_n < t_{n+1}.5 We can therefore calculate total energy in our data as

E = \sum_{n=0}^{N-1} P_n (t_{n+1} - t_n)    (3.1)

where E is the total energy used in the day, P_n is the power level at recording n, and N is the total number of recordings in a day. Energy will be reported in kilowatt-hours (kWh), which is equivalent to 1,000 watts of power for a period of one hour.
4 The one exception is the on-power threshold of the kettle, which is reduced from 2,000 to 1,500 watts to capture the kettle activations in House 3.
5 Here, t is not elapsed time but rather the amount of time elapsed since some fixed datetime in the
past. This is a common way of representing timestamps in software, such as with Unix time. Elapsed
time can be calculated by taking the difference between timestamps.
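Equation 3.1 translates directly into numpy. The function below is a minimal sketch (the name is ours); the example checks that 1,000 watts held for one hour yields 1 kWh:

    import numpy as np

    def daily_energy_kwh(timestamps, power):
        # Equation 3.1: each power value is held constant until the next
        # recording. `timestamps` are Unix times in seconds; `power` is
        # in watts.
        dt = np.diff(timestamps)             # seconds between recordings
        watt_seconds = np.sum(power[:-1] * dt)
        return watt_seconds / 3.6e6          # 1 kWh = 3,600,000 watt-seconds

    ts = np.array([0, 1800, 3600])           # two half-hour intervals
    p = np.array([1000.0, 1000.0, 0.0])
    print(daily_energy_kwh(ts, p))           # -> 1.0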
3.4 Data exploration
3.4.1 Target appliance signatures
In order to be able to diagnose the performance of our models later in the project, it is useful to explore the signatures of the target appliances. To start, the fridge’s signature is somewhat rectangular and low-power, with a characteristic spike at the start (Figure 3.2). There is also a great deal of variety in both its time duration and energy per activation, with some skewness toward high values (Figure 3.3). The kettle has a rectangular, high-power signature usually lasting for one or two minutes. Prior research has shown that this simple rectangular shape makes the kettle easier to disaggregate timepoint-by-timepoint than other appliances [5]. The washing machine has a complex, noisy, long-duration signature that alternates between high-power and low-power modes. The dishwasher signature is also complex and long-duration, but it is less noisy than the washing machine and exhibits more switching between high-power and low-power modes. The microwave, like the kettle, is rectangular and also lasts for a few minutes, but its duration distribution is more strongly skewed to the right.
Figure 3.2: Five random signatures for each of the five target appliances for a single
activation. Each target appliance is represented as a different color.
Figure 3.3: Distribution of time duration per activation (first row) and energy per activa-
tion (second row) for each target appliance. The largest 2% of values were not plotted.
Appliances must be grouped into types in a way that is useful for generalization and does not needlessly hinder the model.
In particular, if two appliances have similar functions and similar signatures, then we
want to consider them to be the same appliance to avoid systematic false positives.
Also, requiring a model to distinguish between very similar signatures could force it to
over-rely on other house-specific appliance signals, which would hurt generalization.
This grouping problem is hardest for the fridge. There are 7 fridges, 14 fridge-
freezers and 1 IAM with both a fridge and a freezer plugged into it.6 While the details of how and why these signatures differ are complicated,7 inspection of their
signals shows that they are similar enough that we can reasonably categorize them
as “fridges.” A similar decision has to be made for the washer and the washer-dryer,
but in this case the two appliances have signatures that are quite different from one
another. Therefore we can safely exclude washer-dryers from the category of “washing
machine.”
Using this appliance classification, we find that the fridge has the largest number
of activations across all houses at 204,774, while the kettle has 41,719, the microwave
17,080, the washing machine 6,577, and the dishwasher 4,725.
6 The house with both a fridge and a freezer plugged into the same IAM is House 19, and the IAM is
Appliance 1. The power series has the label “Fridge & Freezer” in the README.
7 A thorough analysis of the fridge and freezer signatures requires some knowledge of how com-
pressors and fans operate within fridges and fridge-freezers, how many there are in each fridge(-freezer)
model, and their power. For example, a fridge-freezer with one compressor often has a simple signature,
but a fridge-freezer with two compressors may have a signature comprised of two stacked compressor
signatures.
3.4.2 Daily statistics
Since our models will be training and testing on daily data (Section 3.6.2), it is useful
to explore energy and activations data summarized at the daily level. In doing so, we
see there is a great deal of variety in the amount of energy used by each house (Figure
3.4). For example, House 15 uses a median 5.6 kWh of energy per day, while House 8
uses around three times that much at a median 16.7 kWh per day. Some houses, such
as House 11, also exhibit more variance across days than other houses.
Figure 3.4: Distribution of daily energy used per house as recorded by the aggregate
signal. The distribution for all houses combined is highlighted in orange.
Breaking this energy usage down by target appliance (Figure 3.5) lets us compare
between-house and within-house variance of the target appliances. We see that the
fridge has a great deal of between-house variance, with one house using a median 0.2 kWh per day and another using nearly ten times as much at a median 1.9 kWh per day.
The fridge also seems to have the least amount of within-house variance. As a result,
the fridge’s distribution for all houses has much more variance than the distribution for
any individual house. The lack of within-house variance makes sense since the fridge
is the only target appliance that is activated without the express action of the user, and
it is therefore much more invariant to day-to-day changes in behavior.
Figure 3.6 explores the relationship between energy and the number of activations
in more detail. As expected, energy is positively correlated with the number of acti-
vations for all of the target variables. This is good news: It means that if one target
variable is more difficult to predict—likely energy, since it is a more nuanced and com-
plicated calculation than counting—then at the very least our models should be able to
get most of the way toward an accurate prediction by predicting the easier target vari-
able times some constant. However, we should hope that our models can learn how to
predict each target variable in a way that is not simply a linear transformation of the
other.
Figure 3.5: Distribution of daily energy used by house, broken down by appliance. The distribution for all houses combined is highlighted in orange. Horizontal lines indicate that the house did not have that appliance hooked up to an IAM (e.g., the fridge for House 6). If a house had more than one of the target appliance hooked up to IAMs (e.g., the two washing machines in House 4), then the daily energies for those appliances were summed.

There are two main sources of variance in energy for a particular number of activations: (1) different appliance models across houses; and (2) different usage behaviors/settings for any one appliance model within a house.8 To separate these sources of variance, we can calculate the correlations between the two target variables for each house (Figure 3.7).
8 There are instances where the number of activations is zero and the energy is positive. In each instance, this could be due to one of two things: (1) Electric.get_activations fails to catch the activation; or (2) there is a data error giving positive energy without an accompanying signature that can be caught by Electric.get_activations. Given the possibility of these two scenarios, we cannot rely on one metric to correct the other. These errors can also add to the variability of energy in cases where the number of activations is positive.
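The distinction between pooled and within-house correlation is easy to reproduce with pandas. The toy table below is hypothetical, not REFIT data; the within-house correlations can disagree with the pooled value, as they do for the fridge in Figure 3.7:

    import pandas as pd

    df = pd.DataFrame({
        'house':       [3, 3, 3, 20, 20, 20],
        'energy_kwh':  [0.9, 1.1, 1.3, 1.2, 1.0, 0.8],
        'activations': [30, 35, 40, 28, 33, 38],
    })

    # Pooled correlation across all houses (mildly positive here) ...
    print(df['energy_kwh'].corr(df['activations']))

    # ... versus the per-house correlations (+1 for house 3, -1 for house 20).
    print(df.groupby('house').apply(
        lambda g: g['energy_kwh'].corr(g['activations'])))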
Figure 3.6: Relationship between energy used per day and number of activations per
day. The points are jittered along the x-axis to reduce overlap.
Figure 3.7: Correlation between daily energy and number of activations, by appliance
and house. A house without an appliance is represented as a gray square with no
correlation number—not to be confused with zero correlation.
We find that, for the most part, the correlations between the target variables are
strongly positive within houses, reflecting the positive correlation for all houses to-
gether. The exception is the fridge, where the correlation for all houses is higher than
the correlation for any individual house. Some of these within-house correlations are
actually strongly negative.
To visualize the negative correlation for one house, we can take Figure 3.6 and
highlight one of the houses that has a negative correlation—say, House 20. The result
is Figure 3.8. As expected, the fridge pattern of House 20 does not reflect the overall
pattern. The other appliance patterns, on the other hand, do reflect the overall pattern.
In general, we want our models to use the target appliance signatures to make
their predictions. But we also want them to use cross-appliance patterns to the extent
that they are generalizable across houses. An example of a potentially useful cross-
appliance pattern is the positive correlation between kettle and microwave usage (Fig-
ure 3.9). Another example is the correlation between washing machine and dishwasher
Figure 3.8: Relationship between energy used per day and number of activations per
day, with House 20 highlighted in orange. A line of least squares is fitted to emphasize
correlation. The points are jittered along the x-axis to reduce overlap.
usage. This agrees with intuition: The first correlation represents days where there is
more cooking, and the second represents days where domestic chores are being done.
However, such conclusions must be made with caution, since aggregate patterns
may disappear when observing within-house patterns. For example, the fridge activa-
tions are strongly negatively correlated with kettle activations. This is clearly a pattern
that exists between houses as opposed to within houses. That is, houses with fridges
that have many activations per day tend to be houses where the kettle is not used very
often—at least in our dataset.
Figure 3.9: Correlation between target appliances for the specified target variable. For
example, the correlation between daily number of microwave and kettle activations is
0.48.
Our models may be led astray by such correlations. Appliances with high between-
house variance but low within-house variance—such as the fridge—are more likely to
suffer from such issues, since the models will be encouraged simply to predict the
average target value for the house using house-specific appliance patterns. We will see
how this affects generalization performance in Section 4.
3.5 Data cleaning
3.5.1 Overview of data issues
The data-cleaning process is based on previously known and newly discovered data issues, which include gaps in the data, the effect of solar panels on the aggregate signal, inconsistencies between recordings, and mislabeled data. For a full description of these issues and more, see Appendix A.
3.5.2 Unused data
Some observations are completely discarded. These discarded observations suffer from at least one of the following:
• Errors in the appliance series or major errors in the aggregate series, meaning one or more of the following is true: at least one appliance has a power value that is unchanging and above 25 watts for more than 10% of the day (781 observations, 7.2%); the sum of the appliance series is greater than the aggregate series for more than 10% of the day (246, 2.3%); or there is a negative power value in at least one power series (9, 0.083%).
In total, 2,663 observations (24% of the total) are unused because of these errors.
This is an unfortunate amount of data to discard, but pilot tests showed that substantial
cleaning is necessary in order to get reasonable modeling results.
9 The expected number of recordings in a day is around 14,400. See Appendix A.2 and Section 3.6.3.
10 The daylight saving change is not an error, but removing these observations simplifies the data
creation process.
3.5.3 Data only used to create synthetic training data
There are cases where only the aggregate signal has issues but the appliance signals are otherwise clean. It would be wasteful to throw away this data altogether, so it is used for the creation of the synthetic data. This includes the following cases:
• Solar panel interference. Three houses (1, 11 and 21) suffer solar panel interference, causing a large amount of noise in the aggregate signal. This represents roughly 15% of the data. See Appendix A.1 for a description of this problem.
• Low correlation between IAMs and aggregate. Instances where the correlation is below 0.1 capture 15 observations, or 0.21% of the total.
• Strings of large, repeating aggregate power values. Instances where the aggregate series has a power value that is unchanging and above 25 watts for more than 10% of the day represent 86 observations, or 1.2% of the total.
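A sketch of the last rule, under one reading of the “more than 10% of the day” criterion (interpreting it as a single long run; the function name and this interpretation are assumptions):

    import numpy as np

    def has_stuck_readings(power, min_watts=25.0, max_frac=0.10):
        # Flag a day whose series contains a run of identical readings
        # above min_watts that is longer than max_frac of the day.
        power = np.asarray(power, dtype=float)
        repeating = (np.diff(power) == 0) & (power[1:] > min_watts)
        longest = run = 0
        for flag in repeating:
            run = run + 1 if flag else 0
            longest = max(longest, run)
        return longest > max_frac * len(power)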
3.6 Dataset creation
3.6.1 Data split: training, validation and testing
It is standard in machine learning to split data into three sets: one to train the model (the training set), another to help choose between models (the validation set), and another to report results (the test set). It is common to simply take all available observations
and split them into three sets, usually 60% for training, 20% for validation and 20%
for testing [30]. However, in this project we want to have two test sets: one containing
houses that the model has seen in training and validation (to explore the difficulty of
the learning problem in principle), and one with houses the model has not seen (to
test generalization performance). Therefore, a simple split of the dataset would not
work since we have two test sets and need to be especially careful about information
leakage.
To start, we hold out a set of houses explicitly until test time. Additionally, since we
want to validate our models in a way that encourages generalization, we also hold out
a different set of houses that we will use for validation. Three houses in each holdout
set should (hopefully) provide enough diversity of appliance models. Any fewer might
make the validation process and the test results very sensitive to the holdout houses,
while any more might take away too much data from the training set.
We want each house in these two holdout sets to have at least one of each of the
five target appliances. The only houses that meet this criterion are Houses 2, 3, 5, 9,
15, and 20, which conveniently gives us exactly the number of houses we need. There
is little reason to choose some houses for one holdout set and other houses for another
holdout set, so we will randomly assign Houses 3, 9 and 20 to the validation holdout
set and 2, 5 and 15 to the test holdout set. These houses will be referred to throughout
this paper as “unseen” houses—that is, houses that were not seen by the model during
training. The unseen validation houses are represented as purple in Figure 3.10, while
unseen test houses are represented as green.
In addition to testing separately on “seen” houses, we will also include some diversity of appliance signals in the validation set by adding some “seen” house data, which
can help reduce sensitivity to holdout houses. This seen house data for the validation
and test sets are from dates that are unseen during the training process.
To create this “seen” validation and test data, we simply split the data for houses
that we are not holding out for validation and testing into two groups: 70% for training
and 30% for the combined “seen” testing and validation sets (later split in half so that
each has 15%). This split between the training data and the seen validation/test data is
performed randomly across all dates to minimize the influence of seasonality—which
is preferred to simply splitting the dataset into three parts using date ranges. In Figure
3.10, the training data (70%) is represented as blue, and the collective validation and
testing data for seen houses (30%) is represented as yellow.11 Since the largest source
of variation in performance is likely across houses, the split from the combined 30%
into the 15% for the seen validation set and 15% for the seen test set is stratified by
house, meaning each house is represented roughly evenly in each group.
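A sketch of this splitting scheme (the helper and its inputs are hypothetical; stratifying by house simply means performing the random split house by house):

    import random

    random.seed(0)

    # Whole houses held out, as in Figure 3.10.
    unseen_val_houses = {3, 9, 20}
    unseen_test_houses = {2, 5, 15}

    def split_days(days_by_house, train_frac=0.7):
        # `days_by_house` maps house id -> list of dates for the houses
        # that are not held out. Dates are split 70/30 within each house;
        # the 30% is later halved into the seen validation and test sets.
        train, heldout = [], []
        for house, days in days_by_house.items():
            days = list(days)
            random.shuffle(days)
            cut = int(train_frac * len(days))
            train += [(house, d) for d in days[:cut]]
            heldout += [(house, d) for d in days[cut:]]
        return train, heldout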
Although this validation method results in models that generalize well to unseen
houses (Section 4), it would have been more effective (and elegant) to validate through
k-fold cross-validation (CV), where each fold is the data of one or more houses.12 This
way we would have been able to validate for generalization on all houses (except for
those held out for testing) and also not have to use seen days to provide diversity of
signatures. However, if we were to do 5-fold CV (that is, validation five times with
five different holdout sets), then the number of models that need to be trained would
11 One color is used for the collective 30% both because it creates simplicity in the graph and because
inspecting the breakdown between the two is not very informative.
12 In k-fold CV, the dataset is split into k “folds,” or subsets of the original data. In one iteration, the model is trained on k − 1 folds and validated on the remaining fold. This procedure is repeated until every fold is used as the validation set. The validation error of each fold is then averaged.
increase by a factor of five. This is not feasible since there are already a large number
of neural networks that need to be trained (Section 3.7). We can rely on other methods
to encourage generalization, discussed later in this paper.13
The individual appliance signals of the training data—but not the validation or test
data—are used to create synthetic data. This prevents information from the validation
or test data from leaking into the training process. Additionally, the individual appli-
ance signals of some observations with poor-quality aggregate signals (which includes
houses with solar interference) are used to create synthetic training data, since in this
case we only care about the quality of the appliance signals (orange in Figure 3.10).
In total, 3,146 examples—or 45% of all cleaned, real data—are in the “real” part
of the training set (as opposed to the synthetic part), 686 (10%) are in the validation set
for seen houses, 686 (10%) are in the test set for seen houses, 1,159 (17%) are in the
validation set for unseen houses, and 1,301 (19%) are in the test set for unseen houses.
There are also 39,800 synthetic training examples, increasing the size of the training
set by a factor of nearly 14.14
When training a model for NILM, care is taken to balance “positive” and “negative”
examples (i.e., windows where the target appliance is activated and when it is not) by
sampling them explicitly [6]. This is because most target appliances are activated
infrequently, and therefore the model would see more negative examples than positive
ones if sampling naively over time. However, for our summarization task, balancing is not necessary: examples where the target appliances have zero energy or activations are not common enough to cause this issue.
13 However, these methods will still rely on a set of holdout houses for validation as opposed to using
CV.
14 This is still not many training examples for a neural network, but one consolation is that an obser-
vation often has multiple activations—so any model often updates its weights using information from
multiple signatures per example. This is assuming the model learns, at least in part, by using the signa-
tures of the target appliance.
Figure 3.10: Split between training, validation and test datasets. Real training data is in blue, while the appliance signals from the blue and
orange data are used to make the synthetic data. The horizontal orange stripes represent the houses whose aggregate signals are made
unusable due to solar panel interference (1, 11 and 21). The horizontal purple stripes are for the houses that are held out for validation (3, 9
and 20), while the green stripes are held out for testing (2, 5 and 15) (both “unseen”). The thin vertical yellow stripes indicate the dates that
are held out for validation and testing (“seen”). The red data is not used at all due to errors in the appliance signal or gaps in the data, such as
the outages in February 2014.
3.6.2 Structure of the datasets
The goal is to create a feature matrix X_{N×D} and a target matrix Y_{N×K}, where N is the number of aggregate signals, D the number of features (which is equivalent to the number of timesteps in a day), and K the number of targets for a particular target variable. K is equal to 1 for the single-target models and 5 for the multi-target models.
The window is set to be a 24-hour day, midnight-to-midnight, since it is a natural
unit of time for generating summary statistics.15 Another option for the window length
is a full week, but there are several downsides to this: (1) more recordings per example
when there are already quite a lot; (2) fewer independent observations (by a factor of
seven); and (3) potentially poorer performance for short-duration appliances [6]. Using
a day of data also does not result in a loss of granularity: If need be, the daily summary
statistics can be aggregated into weekly statistics.16
3.6.3 Timestep standardization
Since the number of recorded measurements varies by day (Appendix A.2), the number of timesteps in each aggregate series will need to be standardized so that they can be used as model input. To do so, we create a standardized series that assumes consistent 6-second sampling, and then align the aggregate power series to this standardized timestamp series. For a day of interest, the standardized series would have the times 00:00:06, 00:00:12, 00:00:18, . . . , 23:59:54, and 00:00:00 (following day) in hh:mm:ss format.
In order to align an unaligned power series p with the standardized timestamp series t̃, we take p’s unaligned timestamp series t and find indices i, where each i_j gives the maximum value of t_{i_j} such that t_{i_j} ≤ t̃_j, for all j.17 Indices i can then be used to find the power value p_{i_j} that is associated with standardized timestamp t̃_j.
15 Some NILM researchers find it useful to train on randomly selected time windows [6]. While
this may increase robustness for NILM modeling by serving as a form of data augmentation, it would
mean at least some loss of time-of-day information for our problem. This may hurt model performance
because some time-based usage patterns may be generalizable across households. For example, if many
households use washing machines in the afternoon, then it could be beneficial to allow our model to use
this information for learning. This project introduces robustness through other means, namely through
other forms of data augmentation and the use of regularization in the architecture of the neural network
models.
16 One potential downside to using a day instead of a week as the input window is that appliance
signatures are more likely to be split at the dayline. However, an inspection of the appliance data shows
that long-duration appliances tend not to be used around midnight. It is also conceivable that the models
would be able to learn using partial signatures, but there are likely too few samples for this to happen.
17 The size of i is equal to the size of t̃, but it is generally not equal to the size of t.
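This alignment is a direct use of numpy’s searchsorted. The sketch below assumes sorted Unix timestamps in seconds:

    import numpy as np

    def align_to_grid(t, p, t_grid):
        # For each standardized timestamp, take the most recent recorded
        # power value (the IAMs hold their last reading until the load
        # changes). idx[j] is the largest i such that t[i] <= t_grid[j].
        idx = np.searchsorted(t, t_grid, side='right') - 1
        idx = np.clip(idx, 0, len(t) - 1)  # guard the start of the day
        return p[idx]

    # A 6-second grid covering one day: 00:00:06, ..., 24:00:00 (14,400 steps).
    day_seconds = 24 * 60 * 60
    t_grid = np.arange(6, day_seconds + 1, 6)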
3.6.4 Data augmentation
Neural networks require a large amount of training data since there are so many train-
able parameters. To increase the amount of training data, researchers commonly use
a technique called data augmentation, which is the expansion of the training set by
applying realistic distortions to the input data [31, 32]. In image recognition, this is of-
ten done by blurring, rotating or cropping the images. Data augmentation also has the
benefit of adding robustness to the model by preventing it from memorizing specific
observations.
In NILM, it is possible to perform data augmentation by adding appliance signals
to synthesize aggregate signals. This is the approach in [6], where each target appli-
ance is included in the synthetic aggregate with 50% probability and each non-target
appliance—called a “distractor” appliance—is included with 25% probability.19 The
models are then trained on data that is half real, half synthetic. The authors concede
that this method “ignores a lot of structure that appears in real aggregate data,” and that “a more realistic simulator might increase the performance of deep neural nets on energy disaggregation” [6].

18 Another option would have been simply to extend or reduce the number of points in the power series by evenly duplicating or deleting observations throughout the series. This would also result in less information loss. However, this does not standardize the differences between consecutive timestamps, which is important in principle since energy is the product of power and time. Arbitrarily stretching time would, in turn, arbitrarily stretch energy values. Zero-padding is yet another option which would also result in less information loss, but it also does not standardize timestamp differences—and it is unclear how the neural network model would respond to so many repeated zeros. Zero-padding is routinely used in convolutional layers in neural networks in order to preserve dimensionality between certain layers, but the networks in these cases do not observe block after block of zero matrices.
19 The appliance signals are chosen randomly from a bank and could be from any house or date.
This project aims to build a more realistic simulator for three main reasons:
1. We have fewer training observations given that we are summarizing at the daily
level, so the quality of the augmented data is important.
2. Since we are looking at a full day of data, there are more cross-appliance usage
patterns that the models can learn when compared to NILM.
3. While a naive simulator may improve generalization and robustness to unseen data points, a careful implementation can also improve the model’s ability to generalize in a way that is specific to our application domain.
Each synthetic example is built from the appliance signals of a given house using two loops (a code sketch follows below):
1. A loop over the target appliances in that house, where there is a 50% probability of either adding the appliance signal to the synthetic aggregate or swapping it with the same-appliance signal of a random house and date. This loop affects the first five appliances of the example house in Figure 3.11 and is represented by Line 9 of Algorithm 4. If there are multiple power series of the same target appliance type, then each is addressed separately.20
2. A loop over the distractor appliances of that house, where there is a 50% prob-
ability of either adding the appliance to the synthetic aggregate or excluding it.
This affects the last four appliances in Figure 3.11 and is represented by Line 26
of Algorithm 4.
During this process, the appropriate target metrics are computed based on the cho-
sen series and added to the matrix of targets (Line 18 of Algorithm 4).21
20 This is represented by Line 12 of Algorithm 4, which loops once for each washing machine in
Figure 3.11 and zero times for the microwave.
21 One alternative approach to data augmentation would be to start with the real aggregate signal of
each house and date combination, and then subtract off appliance signals in order to either swap them
with other signals or to remove them altogether. Yet another approach would be to subtract off all
IAMs to create a “residual” series that could be used to create even more realistic data. However, these
approaches are not possible due to the misalignment of timestamps between power series, which would
have created spiky artifacts (Appendix A.1).
Figure 3.11: Process for creating synthetic data for an example house that has two
washing machines but no microwave.
Standardizing the combined real and synthetic input data raises three issues:
1. The synthetic data has a systematically lower mean than the real data. The reason
for this is that the synthetic data only consists of data from the IAMs, which do
not account for all of the electricity usage (Appendix A.1).
2. We are dealing with time series data where each “feature” is a timestep. Stan-
dardizing each feature separately would simply not make sense, because each
value would then become relative to the cross-series timestep values instead of
the other values in the series.
3. Dividing the synthetic data and real data by their respective standard deviations
would destroy information about relative power levels across the datasets, even
though this is an important factor in distinguishing otherwise similar signatures.
This matters in particular when distinguishing between the kettle and the microwave,
whose signatures have very similar shapes but different power levels.
To address (1) and (2), we first demean the real and synthetic training data sep-
arately by subtracting off the sample means of all values of the respective matrices
as

Ẍ^(r) = X^(r) − µ̂^(r)   and   Ẍ^(s) = X^(s) − µ̂^(s),

where Ẍ is the demeaned dataset, X is the dataset with the structure described in Section
3.6.2, µ̂ is the scalar matrix-wide mean,22 and (r) and (s) signify real and synthetic
respectively.
Now we do not have to worry about (3) when we vertically concatenate the real
and synthetic matrices,

Ẍ = [Ẍ^(r) ; Ẍ^(s)],

and shrink the scale,

X̃ = Ẍ / s̈ˆ,

where X̃ is the standardized input dataset and s̈ˆ is the scalar sample standard deviation
computed over all values of Ẍ.23
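A minimal sketch of this input standardization follows; computing s̈ˆ as the root of the equally weighted average of the two datasets’ variances is one way to satisfy footnote 23, though the exact implementation may differ:

import numpy as np

def standardize_inputs(X_real, X_syn):
    # Demean each dataset separately using its scalar matrix-wide mean
    Xr = X_real - X_real.mean()
    Xs = X_syn - X_syn.mean()
    # One shared scale that weights the two datasets equally (footnote 23),
    # rather than pooling all values and overweighting the larger synthetic set
    s = np.sqrt((Xr.var() + Xs.var()) / 2.0)
    return np.vstack([Xr, Xs]) / s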
It was tested whether first-differencing the input series improved learning.24 A
neural network is in principle capable of learning first-differencing, but it is best to
exploit this information beforehand if it is known to be a better representation of the
22 We compute the matrix-wide mean for a given matrix X as µ̂ = (1/(ND)) ∑_{i=1}^{N} ∑_{d=1}^{D} X_{i,d}, where N is the
number of observations and D is the number of features or timesteps.
23 In practice, the synthetic dataset has more observations than the real dataset. Since training is
performed with data that was half synthetic and half real, we calculate s̈ˆ to account for each dataset
equally instead of naturally giving more weight to the synthetic dataset.
24 To first-difference a series x = {x_1, x_2, ..., x_D} is to subtract off the previous value (the “lag”) from
every element of the series, creating the series {x_2 − x_1, x_3 − x_2, ..., x_D − x_{D−1}}. Note that the first-
differenced series has D − 1 rather than D elements.
data. The motivation for this is that the signatures would no longer stack like they do
in the usual aggregate series, potentially relieving the network of learning power levels
and allowing it to focus on power differences. This also naturally standardizes the data
so it has a mean of zero. However, it was not found to be helpful in pilot tests, resulting
in higher error.
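For illustration, first-differencing a toy power series (the values are made up):

import numpy as np

x = np.array([0.0, 0.0, 2000.0, 2000.0, 0.0])  # watts
np.diff(x)  # array([0., 2000., 0., -2000.]): D - 1 elements, mean near zero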
In this project we will standardize the outputs, which is uncommon in machine learn-
ing. We do so because we do not want to favor the error of one appliance over another
when training the multi-target models. Otherwise these multi-target models, which
compute loss as a mean across N observations and across K targets, would have a
strong bias for minimizing the loss of high-energy appliances (like the washing machine)
at the expense of low-energy appliances (like the kettle). While this bias may be
desirable in some domain applications, it is most interesting to treat all appliances
equally, so that we can compare the outputs of the multi-target model with those of
the single-target models.
Each target variable for each appliance is therefore standardized separately by sim-
ply dividing by its standard deviation. We do not subtract off the mean because we
will find it useful to have the minimum possible value equal to zero.25 The standard
deviations of the target variables are saved so that it is possible to recover the original
targets later.
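A sketch of this output standardization (per-target scaling only, so zero values stay at zero):

import numpy as np

def standardize_targets(Y):
    # Divide each target column by its standard deviation; no demeaning,
    # so the minimum possible value remains zero
    scales = Y.std(axis=0)
    return Y / scales, scales  # keep the scales to recover the original units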
It is common practice to use well-known architectures as a basis for related learning
problems, since building architectures from scratch is challenging [1]. However, the
learning problem for this project is unexplored, so we will have to work harder to
build the architectures.
We can take some design inspiration from two areas:
• Image recognition. One major benefit of using the insights from image recog-
nition is that the field is well-explored. However, the downside is that it does
not apply directly to our problem, namely because images are 2D (our series
are 1D), the signatures are quite rich (ours are comparatively simple) and image
recognition tends to be about classification (ours is regression). Much of our
inspiration from image recognition will come from VGGNet [13].
• NILM. Neural NILM architectures such as those in [5] (Figure 3.12) operate
directly on 1D aggregate power series, making them closer to our learning problem.
Figure 3.12: NILM network architectures as depicted in [5]. PointNet is the first row
(A), following the arrows to the PointNet dense layer. The second row (B) describes a
competing network that predicts an entire sequence of power levels of an appliance
instead of just the midpoint.
A typical convolutional architecture can be summarized by the pattern

INPUT -> [[CONV w/ RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
where * denotes repetition, INPUT is the input layer, CONV is a convolutional layer,
RELU is a ReLU activation layer26 (often simply defined in the previous layer as the
activation function), POOL is an optional max pooling layer, and FC is a fully connected
(“dense”) layer. Typically, 0 <= N <= 3 and K < 3 [1]. In this project we will use
N = 1 to include more pooling layers to handle the high dimensionality of our input
space. We will stick to the usual K < 3.
As demonstrated by state-of-the-art architectures—including VGGNet—it is effec-
tive to increase (usually double) the number of filters with each convolutional layer or
every few convolutional layers, and to decrease the size of consecutive dense layers.
We will therefore follow this pattern in our architectures.
Algorithm 1 gives a high-level overview of the process to dynamically create the
architectures used in this project. In particular, it describes the relationship between
the number of convolutional layers, the number of filters in the convolutional layers,
the number of dense layers, and the number of units in the dense layers. Other major
elements of the network that are not discussed in Algorithm 1, such as the kernel size, are
described in Sections 3.7.3 and 3.7.4.
Data: (1) number of convolutional layers; (2) number of filters in the first
convolutional layer; (3) number of hidden dense layers; (4) size of the last
dense layer
Result: model
Initialize empty model
for i in 0 to (number of convolutional layers − 1) do
    Add convolutional layer with 2^i × (num. first conv. layer filters) filters
    Add pooling layer
end
Add dropout layer with dropout rate = 0.5
Add flattening layer
for i in 0 to (number of dense layers − 1) do
    Add dense layer with 2^(num. dense layers − i − 1) × (last dense layer size) hidden units
    Add dropout layer with dropout rate = 0.25
end
Add dense layer with K output units
return model
Algorithm 1: Bare-bones description of how the number of filters and the size of
the dense layers are determined.
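As a concrete (and simplified) sketch of Algorithm 1, assuming the Keras Sequential API; strides, padding, the L2 penalty and everything else not shown in Algorithm 1 are omitted or left at defaults:

from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_conv, start_filters, n_dense, last_dense_size,
                kernel_size=3, pool_size=2, input_len=14400, n_targets=1):
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_len, 1)))
    for i in range(n_conv):
        # the number of filters doubles with each convolutional layer
        model.add(layers.Conv1D(2**i * start_filters, kernel_size,
                                padding="same", activation="relu"))
        model.add(layers.MaxPooling1D(pool_size))
    model.add(layers.Dropout(0.5))
    model.add(layers.Flatten())
    for i in range(n_dense):
        # dense layers halve in size down to the last dense layer size
        model.add(layers.Dense(2**(n_dense - i - 1) * last_dense_size,
                               activation="relu"))
        model.add(layers.Dropout(0.25))
    # ReLU on the output turns negative energy/activation predictions into zeros
    model.add(layers.Dense(n_targets, activation="relu"))
    return model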
In the multi-target models, every layer except the output layer—that is, all hidden
layers—will be shared, which is the extreme case of “hard” parameter sharing. This
is unlike the usual case of hard parameter sharing, where at least one layer is reserved
for each task (Figure 3.13) [4]. This is done for simplicity in the architecture selection
process.
Figure 3.13: Typical representation of hard parameter sharing with multi-task learning.
Image from [4].
The primary difficulty of designing neural network models for a new learning problem
is selecting the appropriate hyperparameters. (For the sake of brevity, the term “hyper-
parameters” in the rest of this paper will also include aspects of the architecture, such as
the number of convolutional layers.) Among the most used and most straightforward
strategies for finding appropriate hyperparameters are grid search and manual search.27
Given that the learning problem in this project is unexplored, and that each model is
quite computationally expensive to train, we will instead use random search. Random
search is a simple process where each hyperparameter is chosen randomly from some
distribution, the model is trained, and the validation error calculated. The process is
repeated as many times as the practitioner would like, and typically the model with the
best validation error is selected. Empirically and theoretically, “randomly chosen trials
are more efficient for hyper-parameter optimization than trials on a grid,” often finding
better models in a small fraction of the computation time.
27 Manual search is a time-intensive, unstructured process in which the practitioner tests sets of hyperpa-
rameters one by one, relying on the performance results of each run (and a good deal of intuition) to get
increasingly close to a set of hyperparameters that performs well. Grid search is a computationally inten-
sive, structured process that tries every possible combination of hyperparameter values. In both cases,
validation error is usually used to select among models.
Before defining how hyperparameter values are randomly selected in each iteration of
random search, it is useful to first define some (simplified) notation. Let Uniform(a, b)
be a uniform distribution between a and b, Geom(a, b) be geometric sampling between
a and b,29 Choice(S) be the uniform choice between discrete values in set S, ⌊x⌋ be
the greatest integer that is less than or equal to x (often referred to as the “floor” of
x), and x ∼ y be short for “x is distributed as y.” Note that when we sample from, say,
⌊Geom(1, 10)⌋, the highest possible value that can be chosen is 9, since the probability
of selecting exactly 10 is zero under the uniform distribution.
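To make the sampling notation concrete, a minimal sketch, assuming Geom(a, b) means exponentiating a uniform draw in log space; the ranges shown are illustrative, not the exact ones used:

import math
import random

def geom(a, b):
    # uniform in log space, so values are biased toward the lower end
    return math.exp(random.uniform(math.log(a), math.log(b)))

# one hypothetical random-search draw
config = {
    "num_conv_layers": math.floor(geom(3, 8)),   # floor(Geom(3, 8)); max value is 7
    "start_filters": math.floor(geom(4, 8)),     # biased toward 4
    "kernel_size": random.choice([3, 4, 5, 6]),  # Choice(S)
    "learning_rate": geom(1e-4, 1e-2),
}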
Given these definitions, the bullet points below describe the hyperparameters, their
variable names in the project (for reference later in this paper), their sampling distri-
butions, and the reasoning behind the distribution choices.
29 Geometric sampling between a and b can be thought of as exponentiating a uniform draw between
ln a and ln b. This is good for sampling values for which the range of possible values is effectively all positive
numbers, or when we want to give a preference to numbers at the lower end of the distribution.
30 A check was put in place to select a different set of hyperparameters when this occurred, but choos-
ing a reasonable value for the number of convolutional layers was still worth doing.
• Number of filters in the first convolutional layer (start_filters). If 64 filters
is a good starting number for rich 2D image data, as in VGGNet, then it is
reasonable to believe that √64 = 8 filters might be a good number of
filters for our first convolutional layer. However, given that our data is sparser
and contains signals that are less rich when compared to image data, half of
this number may also be reasonable. Sampling geometrically between 4 and 8,
we have a bias for values closer to 4 for the sake of model simplicity. PointNet
has a starting filter size of 6, supporting our range of choices [5].
While some hyperparameters were chosen dynamically, others were set to a static value,
either because pilot tests showed a clear preference for one value over another, or to
simplify the random search process, or both. These static hyperparameters are:
• Use pooling (do_pool). This is a binary variable for whether max pooling layers
should be used. As mentioned in Section 3.7.1, we set this to True to handle the
high dimensionality of the input space. There is precedent for this in VGGNet.
It also resulted in better performance in pilot tests.
33 ELU is defined as f(a, x) = a(e^x − 1) for x < 0, and f(a, x) = x for x ≥ 0.
34 Dilated convolutions use large but sparse feature detectors that increase the receptive field without
increasing the number of parameters.
• Activation of the final dense layer. This is set to be linear in most regression
problems, but we know beforehand that energy or activation predictions below
zero are incorrect. Therefore we use a ReLU activation function to turn negative
predictions into zeros. Given the number of target values that are exactly zero,
ReLU is preferred to smooth approximations of ReLU—such as softplus39—
which would rarely output exactly zero.
35 Dropout is a technique where some proportion of hidden units of a layer are set to zero during the
training process. This makes the network behave as if it were an average of multiple networks, usually
leading to better generalization [37].
36 Batch normalization is a technique that addresses the problem of “internal covariate shift,” which
is the shifting of the distribution of layer activations during training [39]. It does so through specialized
layers that normalize activations and then sometimes apply further adjustments to the normalized values.
Reported benefits include regularization and improved speed of convergence.
37 The optimizer is the algorithm that handles how the weights are updated every iteration of the
parameter optimization.
38 Adam is an optimizer with an adaptive learning rate. With standard gradient descent, the gradient of the loss
function is used directly to update the weights. With Adam, the gradient is “smoothed” using the momentum
of previous weight updates, usually leading to faster convergence [40, 2].
39 The softplus function is defined as f (x) = ln[1 + exp(x)].
• Loss function. This is chosen to be mean square error (MSE), since it is the
most straightforward loss function for regression problems, and there is not a
strong reason to use something else. We will also use this metric to evaluate our
models (Section 4).
• Weight and bias initialization. The weights are initialized using the Glorot
uniform initialization.40 The biases are initialized with zeros.
• Epochs, samples per epoch, batch size and early stopping. The models are
run for a maximum of 100 epochs, with 8192 examples per epoch.41 Half of
the samples are real and half are synthetic. Each batch has 32 examples.42 The
training of the model is terminated if the validation error does not improve in the
previous few epochs, and the weights at the epoch with the best validation error
are saved.43
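In Keras, this training setup corresponds roughly to the sketch below. Here model, batch_generator, X_val and y_val are hypothetical; the callbacks shown are the standard Keras ones.

from tensorflow import keras

callbacks = [
    # stop when validation error has not improved for `patience` epochs
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=10),
    # keep the weights from the epoch with the best validation error
    keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_loss",
                                    save_best_only=True),
]
model.compile(optimizer="adam", loss="mse")
model.fit(batch_generator,             # yields half-real, half-synthetic batches
          steps_per_epoch=8192 // 32,  # 8192 examples per epoch, batch size 32
          epochs=100,
          validation_data=(X_val, y_val),
          callbacks=callbacks)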
To give an example of how the random hyperparameter choices affect the architec-
ture, we can select some hyperparameters that might be sampled in an iteration of the
random search process.
• num_conv_layers = 5
• num_dense_layers = 2
• start_filters = 4
• kernel_size = 3
• strides = 1
40 The Glorot uniform initialization draws samples from Uniform(−a, a) where a = √(6/(x_in + x_out)),
x_in is the size of the input layer and x_out is the size of the output layer.
41 An epoch usually means a pass through the training data. However, this is not a strict rule, and the
number of examples per epoch can vary based on the data, the learning problem, the volatility of the
validation error, and other factors.
42 A batch is a group of training examples. The size of the batch refers to the number of observations
that are used to compute the gradient, which is then used to update the weights during an iteration of the
optimization algorithm.
43 This is commonly called “early stopping.” In the first 15 iterations of the random search process, the
“patience” was set to 5, where patience is the number of epochs the model would wait without seeing
an improvement in validation error before stopping the training process. However, after observing the
volatility of the validation error curves at around 15 iterations of the random search process, patience
was increased to 10 for the remaining 10 iterations. Longer patience did not seem to result in markedly
lower validation errors.
• pool_size = 2
• last_dense_layer_size = 16
• learning_rate = 0.001
• l2_penalty = 1e-7
The resulting architecture for a single-target model using the hyperparameters above
would produce a network of 18 layers and 930,473 parameters.44 That architecture is
depicted below (in the style of [6]) with its elements colored by the hyperparameters
that determine their value. Note that the L2 regularization penalty is not included
(for concision), and neither is the learning rate. The “output shape” is in the format
“(timestep dimension, number of filters)” for layers with filters, and “timestep dimen-
sion” for all other layers. Also specified in italics are the number of parameters in the
associated layer.
1. Input (output shape = 14,400)
2. 1D conv (kernel size = 3, stride = 1, number of filters = 4, activation function =
ReLU, output shape = (14,400, 4)) (16 parameters)
3. 1D max pool (pool size = 2, stride = 2, output shape = (7,200, 4)) (0 parameters)
4. 1D conv (kernel size = 3, stride = 1, number of filters = 8, activation function =
ReLU, output shape = (7,200, 8)) (104 parameters)
5. 1D max pool (pool size = 2, stride = 2, output shape = (3,600, 8)) (0 parameters)
6. 1D conv (kernel size = 3, stride = 1, number of filters = 16, activation function
= ReLU, output shape = (3,600, 16)) (400 parameters)
7. 1D max pool (pool size = 2, stride = 2, output shape = (1,800, 16)) (0 parameters)
8. 1D conv (kernel size = 3, stride = 1, number of filters = 32, activation function
= ReLU, output shape = (1,800, 32)) (1,568 parameters)
9. 1D max pool (pool size = 2, stride = 2, output shape = (900, 32)) (0 parameters)
10. 1D conv (kernel size = 3, stride = 1, number of filters = 64, activation function
= ReLU, output shape = (900, 64)) (6,208 parameters)
11. 1D max pool (pool size = 2, stride = 2, output shape = (450, 64)) (0 parameters)
12. Dropout (dropout rate = 0.5, output shape = (450, 64)) (0 parameters)
13. Flatten (output shape = 28,800) (0 parameters)
14. Fully connected (activation function = ReLU, output shape = 32) (921,632
parameters)
15. Dropout (dropout rate = 0.25, output shape = 32) (0 parameters)
44 Parameters in the context of neural networks means trainable weights, not to be confused with
hyperparameters.
16. Fully connected (activation function = ReLU, output shape = 16) (528 parame-
ters)
17. Dropout (dropout rate = 0.25, output shape = 16) (0 parameters)
18. Fully connected (activation function = ReLU, output shape = 1) (17 parameters)
In total, the random search procedure trains 300 networks, or 25 random search itera-
tions for each of the 12 models. To compare models within each appliance and target
variable combination, we first use a moving average with a window size of 3 to smooth
the validation error over epochs for each model. The model with the lowest smoothed
validation error is considered the “best” model.45 When a new best model is found in
the random search process, it is represented as a step change in Figure 3.14. The best
models at the end of the 25 random search iterations are the models we choose to use
for the remainder of this paper.
Figure 3.14: Results of the random search process. The curves represent the cumula-
tive minimum of the minimum smoothed validation error of the models for the specified
appliance and target variable combination. The validation error is computed on the
preprocessed targets.
We see that the validation error curves for the chosen models are quite erratic (Fig-
ure 3.15). The reason for this volatility is the choice of the validation set, which
mostly consists of unseen houses. Because there is limited variety in the appliance
signatures, a change in the weights of the model sometimes leads to very different
predictions.46 The multi-target models have the smoothest validation curves, likely
for two related reasons: (1) their errors are an average across target variables, which
makes them more stable; and (2) their error signals are richer in that they come from
five target variables.
45 The smoothing prevents us from choosing models with dramatic downward spikes.
Figure 3.15: Training and validation error for the chosen models, where the error is
computed on the preprocessed targets. The depicted validation error curves have not
been smoothed, although the choice of the models was based on smoothed validation
error. The number of epochs differs across models due to early stopping.
The architectures and hyperparameters of the chosen models are described in Ta-
bles 3.3 and 3.4. It is difficult to say why there is such variety in the architectures across
networks. One explanation is that the learning problem is different for each
model, and that this is reflected in the difference in architectures. Another is that
the models are insensitive to the hyperparameters within the ranges they were chosen
from—which would imply that the choices of the distributions were good. Therefore,
while we might be tempted to say that (for example) the washing machine models
ended up having more filters so that they can learn to identify more complex signa-
tures, we cannot rule out that this is the result of chance. However, two patterns that
do seem robust across models are that larger pooling sizes and smaller strides perform
best.
46 To ensure that this was the reason for the volatility, a test was performed that validated the models
on only the “seen” portion of the validation set, which resulted in much smoother validation curves
(results not shown here).
appliance | num_conv_layers | num_dense_layers | start_filters | kernel_size | strides | pool_size | last_dense_layer_size | learning_rate | l2_penalty | number of parameters
fridge | 7 | 1 | 4 | 5 | 1 | 3 | 13 | 1.6e-3 | 1.0e-8 | 2.4e+05
kettle | 6 | 2 | 4 | 6 | 1 | 4 | 9 | 2.4e-3 | 2.7e-7 | 7.5e+04
washing machine | 7 | 1 | 7 | 4 | 1 | 3 | 23 | 1.2e-3 | 0 | 6.1e+05
dishwasher | 7 | 2 | 6 | 3 | 1 | 3 | 17 | 4.4e-4 | 6.2e-7 | 3.9e+05
microwave | 5 | 2 | 6 | 3 | 1 | 4 | 26 | 2.8e-3 | 7.2e-8 | 9.5e+04
all target appliances | 5 | 2 | 5 | 3 | 1 | 4 | 26 | 2.1e-3 | 3.7e-8 | 7.7e+04
Table 3.3: Final architectures for models predicting energy.
Table 3.4: Final architectures for models predicting the number of activations.
Chapter 4
Evaluation
To evaluate our models, we define one or more loss functions L^(i) = L(y^(i), f(x^(i))),
where L^(i) is the loss for test example i, y^(i) is the target for test example i, x^(i) is the
vector of input data for test example i, and f is the neural network model that takes
an input vector and outputs a prediction. The two loss functions used as evaluation
metrics in this paper are:
1. Mean square error (MSE): L_MSE^(i) = (y^(i) − f(x^(i)))^2
2. Mean absolute error (MAE): L_MAE^(i) = |y^(i) − f(x^(i))|
MSE will be the primary evaluation metric reported.1 We use MSE simply because
it is a standard metric for regression problems. We also used it as the loss function in
the training of our networks. To make interpretation easier, we divide MSE by the
MSE of the “naive baseline,” which always predicts the average of the target values
from the real part of the test set. So if a model’s MSE relative to the baseline is 0.7,
then the model has an MSE that is 30% lower than the baseline. This baseline was chosen
because average statistics for energy and number of activations are currently used as priors in
1 It may seem that a more intuitive performance metric would be mean absolute percentage error
(MAPE), L_MAPE^(i) = 100 × |y^(i) − f(x^(i))| / y^(i), which is easily interpretable and naturally standardized for each
metric. For example, one could say that our predictions for some target appliance are, on average, within
X% of the actual daily value. Furthermore, it does seem as though a prediction error for a small target
value should incur a more severe penalty than a same-sized error for a large target value. However,
since our target data has many energy and activation values that are exactly zero (y^(i) = 0), MAPE is
undefined for many values and is generally unstable when y^(i) is close to zero. Therefore this metric is
not used for this project, and neither are other metrics that rely on ratios or percentages.
the AFHMM with LBM model.2 MAE was chosen because it has the benefit of being
in the units of the target variable.
Total loss over the test set when using loss function L is computed as L_test =
(1/N) ∑_{i=1}^{N} L^(i), where N is the number of examples in the test set. Since L_test is itself a
random variable, we can compute its standard error as σ_{L_test} = √(s_L² / N), where s_L is
the standard deviation of L. Metrics are reported with a range of two standard errors,
which represents a confidence interval of roughly 95%.
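A minimal sketch of how the reported numbers can be computed, assuming y_true holds the real test targets; treating the baseline MSE as a fixed normalizer is a simplification of this sketch:

import numpy as np

def relative_mse(y_true, y_pred):
    # per-example squared losses for the model, and the MSE of the naive
    # baseline that always predicts the mean of the real test targets
    losses = (y_true - y_pred) ** 2
    baseline = ((y_true - y_true.mean()) ** 2).mean()
    ratio = losses.mean() / baseline
    # two standard errors of the mean loss, scaled by the baseline MSE
    two_se = 2 * losses.std(ddof=1) / np.sqrt(len(losses)) / baseline
    return ratio, two_se  # e.g. 0.70 +/- 0.05 relative to the baseline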
The results for MSE, presented in Table 4.1 and visualized in Figure 4.1, show that
the models are successful at predicting both the daily energy and the daily number of
activations on unseen days for houses the model saw during training (“seen”). Most
target appliance models achieve around half the error of the baseline or better.
The washing machine and the dishwasher perform particularly well on seen houses,
with errors roughly 60–80% lower than the baseline for the prediction of both energy
and number of activations. It is not surprising that these appliances perform well, since
the richness and distinctness of the signatures make the models better able to pick
them out from the noise of the aggregate signal and less likely to confuse them with
the signatures of other appliances. This contrasts with NILM, where neural network
models find the rectangular signature of the kettle easiest to predict [5]. (The fridge
also performs well on seen houses, but we will explore why this is the case later.)
Most models also generalize quite well to unseen houses. The models that perform
best on unseen houses are again the washing machine and dishwasher models. For
example, the model that predicts washing machine activations has an error that is 57–
71% lower than the baseline, while the dishwasher activations model has an error that is
54–66% lower. In the original units of the appliances, the washing machine activations
model is off by 0.34–0.40 activations on average, while the dishwasher is off by 0.31–
0.36 activations (Appendix C).
The model for fridge activations generalizes very poorly. In fact, it performs much
worse than the baseline. This is likely because there was so much more variation
between houses than within houses (reminder: Section 3.4.2), encouraging the model
to focus on house-specific signals instead of the fridge signatures when making predictions.
2 There is also not a simple model that would perform well on this sort of complex problem outside
of other neural networks.
Figure 4.1: Test error relative to the baseline for single-target models, with 95% confi-
dence intervals.
Table 4.1: Test error relative to the baseline for single-target models, with 95% confi-
dence intervals.
For both seen and unseen houses, it does not seem that predicting the number
of activations is markedly easier than predicting energy—or vice versa. This is not
surprising given the high correlation between energy and the number of activations.
The multi-target models do not generalize as well as the single-target models (Figure
4.2). In fact, all multi-target models have a generalization error that is either equal to
or greater than that of the corresponding single-target models. It is unclear whether this
is because the models need to be more flexible than the single-target models (e.g., by
making the neural networks larger or including task-specific hidden layers), or because
multi-task learning is simply not beneficial for this type of learning problem.
Figure 4.2: Test error relative to the baseline for unseen houses, with 95% confidence
intervals.
If the statistics are to be reported directly to consumers, then it is helpful to explore the
accuracy of these statistics. Perhaps the most useful feedback would be for the washing
machine and the dishwasher, since they are high-energy appliances where feedback is
actionable.
We see that when there are zero washing machine activations a day, the model
gives the correct prediction 89% of the time (Figure 4.3). When there is one activation,
the model predicts this with 78% accuracy. And with two or more, 61% accuracy.3
The dishwasher has a tighter range for the number of activations per day, so the model
predictions are even more accurate (Figure 4.4).
Figure 4.3: Confusion matrix for washing machine activations on unseen houses. Model
predictions are rounded to the nearest integer.
It is a bit difficult to predict the exact number of kettle activations since the kettle
can have quite a few activations per day (Figure 4.5). To improve accuracy, it could
be helpful to provide feedback as ranges of activations, such as: “You used your kettle
3 to 5 times yesterday.” The kettle predictions tend to have a downward bias that gets
more pronounced as the number of target activations increases.
A description of the microwave activations can be found in Appendix C. The fridge
was not plotted since feedback would not be actionable, and there were too many
activations for a sensible plot.
3 To determine the cutoff for the number of activations in the confusion matrices, the 95th percentiles
of the targets and predictions are calculated. The maximum of these two values is then used as the
upper limit for both axes in the confusion matrix.
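A sketch of how such a confusion matrix can be built, following the cutoff rule in the footnote above:

import numpy as np

def activation_confusion(y_true, y_pred):
    # round predictions to the nearest integer and cap both axes at the
    # larger of the two 95th percentiles
    y_hat = np.rint(y_pred).astype(int)
    hi = int(max(np.percentile(y_true, 95), np.percentile(y_hat, 95)))
    matrix = np.zeros((hi + 1, hi + 1), dtype=int)
    for t, p in zip(np.clip(y_true.astype(int), 0, hi), np.clip(y_hat, 0, hi)):
        matrix[t, p] += 1
    return matrix  # rows: true counts, columns: predicted counts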
Figure 4.4: Confusion matrix for dishwasher activations on unseen houses. Model
predictions are rounded to the nearest integer.
Figure 4.5: Confusion matrix for kettle activations on unseen houses. Model predictions
are rounded to the nearest integer.
As we saw in Section 3.4.2, it helps to explore the data by house to uncover patterns
that may be obscured by simply looking at the headline statistics. Starting with the
washing machine activations model, we see that the predictions are correlated with
the target values, with the predictions for seen houses being a little bit tighter than for
unseen houses (Figure 4.6). The predictions for washing machine energy are similar
(Appendix C).
When visualizing the predictions of the fridge activations model by house, the
problem of high between-house variance and low within-house variance becomes
apparent (Figure 4.7). The model struggles to predict the number of activations for indi-
vidual houses. When it is unsure of the house, it has a tendency to predict a constant
number of activations of around 20 or 25.
Figure 4.6: Predictions vs. targets for washing machine activations. The points are
jittered along the x-axis to reduce overlap.
In order to test the importance of the synthetic data, we retrain the networks with the
best-performing architectures on only real data and recompute the evaluation metrics.
The results show that the synthetic data is extremely important for generalization (Fig-
ure 4.8 and Table 4.2 for MSE; Appendix C for MAE).
Figure 4.8: Test error for best-performing single-target models when models were not
trained with synthetic data. The error bars represent 95% confidence intervals.
Outside of the fridge and the microwave, the generalization error is higher for all
appliances and for both target variables. The most dramatic difference is with the
kettle, where the error increases from 0.76 to 1.34 (around +76%) for energy and from 0.69
to 1.28 (+86%) for the number of activations—meaning the kettle models lose their
ability to generalize to the houses held out for the test set in this project.4 This contrasts
with performance on seen houses, where error decreases from 0.72 to 0.33 (−54%) for
energy and from 0.50 to 0.27 (−46%) for the number of activations. While less dramatic,
the dishwasher models show a similar pattern for seen and unseen houses.
This shows that the inclusion of the synthetic data helps performance on unseen
houses—but sometimes at the expense of performance on seen houses. However, this
4 Here we ignore error bars and just use the point estimates for simplicity.
Table 4.2: Test MSE relative to the baseline (with 95% confidence intervals) when mod-
els were not trained with synthetic data.
hit to performance on seen houses is not an issue since we care about generalization
performance for the application domain.
Although neural networks are complex and often considered to be black boxes, there
are some steps we can take to explore the networks’ internal representations. One
such step is visualizing the “activations”—or the values taken by the units—of the
convolutional layers.5
The activations can be visualized in two main ways. The first is to define a loss
function that maximizes the activations of the convolutional layer, and then use gra-
dient ascent to generate patterns that cause the activations to fire most strongly [41].
The second is to feed the convolutional neural network an input, and then visualize the
activations of the convolutional layer caused by that input.6 We do the second, since it
gives a better sense of how the networks perform on real input data.
Visualizing the activations of the first convolutional layers is likely not very infor-
mative, since many models are sure to learn similar low-level features: an increase
in the time series, a decrease, a spike, and so on. We therefore visualize the highest-
level activations, which are part of the last convolutional layer. Since there are often
many filters in this final convolutional layer7—with many of them sparse and some-
what redundant—we use principal component analysis (PCA)8 to reduce the number
of dimensions of the activations to ten.9
5 This is unfortunate terminology. The activations of hidden layers relate to the activation functions
described in Section 2.1, not the activations of appliances.
6 This is equivalent to removing all layers after the convolutional layer of interest and then predicting
on the input with this truncated network.
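A sketch of this step, assuming a trained Keras model and scikit-learn (the model and layer name here are hypothetical):

import numpy as np
from sklearn.decomposition import PCA
from tensorflow import keras

# truncate the trained model at its last convolutional layer (cf. footnote 6)
extractor = keras.Model(inputs=model.input,
                        outputs=model.get_layer("last_conv").output)

# activations for one day of aggregate data, shape (timesteps, filters)
acts = extractor.predict(x_day[np.newaxis, :, np.newaxis])[0]
acts_10 = PCA(n_components=10).fit_transform(acts)  # ten components to plot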
To start, we choose a house and day where all the target appliances are used (Figure
4.9). Outside of large spikes caused by the electric shower at around 7:00 or 8:00,10 it is
a fairly clean day of data with few anomalies. We then iterate through each appliance,
load the energy prediction model for that appliance, input the aggregate signal for the
day into the model, and visualize the activations of the final convolutional layer as the
model makes its prediction.11 Note again that the models do not see the signals of the
appliances when making predictions.
We find that the washing machine energy model seems to be “looking” in the right
places, correctly identifying the times of the day where the washing machine is used
(Figure 4.10). The dishwasher energy model also correctly identifies times of day
where its target appliance is used (Figure 4.11). The kettle model appears to fire for
spikes of any kind, including spikes created by the kettle (Figure 4.12). The larger the
spike, the larger the activations.
Meanwhile, it is unclear what information is being used by the microwave, fridge
and multi-target models to make predictions (Figures 4.13, 4.14 and 4.15). In par-
ticular, the activations of the microwave model do not seem to fire when the microwave
is used. (Regardless, its prediction for the day is accurate.) Since the hidden layers
of the multi-target model are shared by all target appliances, it makes sense that the
activations do not have obvious associations with particular target appliances.
7 Because the number of filters doubles with each convolutional layer, there can be up to
512 = 8 × 2^(7−1) filters in the final convolutional layer if there are 7 convolutional layers and 8 starting
filters.
8 Roughly speaking, principal component analysis is a linear dimensionality reduction technique
where the data is projected onto a lower-dimensional space in a way that retains as much of the original
variance as possible.
9 The top row in each visualization will represent the first principal component, meaning that it
accounts for the most variance in the activations. The second row is orthogonal to the first and accounts
for the second-largest amount of variance. And so on for the third through the tenth rows.
10 Electric showers are not captured using IAMs, so we do not know for sure that this is its signature.
However, it is a safe assumption given the time of day, the signature pattern and the usual power range
for electric showers (roughly 9.5–11.0 kW).
11 Energy is chosen over the number of activations as the target variable simply because the activa-
tions models have too few dimensions in the final convolutional layer, making the visualization of the
activations less informative.
Figure 4.9: Example day of data where all target appliances are used.
It should be made clear that this type of analysis helps to understand where in the
day the models are focusing their attention, but it does not say how the neural net-
work processes these activations downstream in the dense layers. Furthermore, when
a model activates during parts of the day when the target appliance is not being used,
we cannot say whether the model is “distracted” or if the model is incorporating cross-
appliance information into its predictions. An example of this is when the dishwasher
model activates in the morning and at noon when the washing machine is being used.
Chapter 5
Conclusion
In this project, neural networks are used to predict two target variables—energy and the
number of activations—for five appliances. The results are quite promising. We find
that neural networks are not only capable of summarizing energy usage in principle,
but can also generalize to houses that were not seen during training and validation.
This generalization is crucial for the application domain.
Nine of the twelve models have better generalization performance than the base-
line. This means that the output statistics from these models would be better priors than
national statistics for the AFHMM with LBM models. The neural networks perform
best on appliances with complex signatures like the washing machine and the dish-
washer, where the generalization error when predicting the number of activations is
less than half the baseline error. The fridge models, on the other hand, generalize very
poorly, having a generalization error that is worse than the baseline. The multi-target
models also do not generalize as well as the single-target models.
The data augmentation process proves important to the success of the models for
all appliances except for the fridge,1 improving generalization performance by quite a
bit. Without training on synthetic data, the kettle models do not generalize to unseen
houses at all.
The predictions of some models are arguably strong enough to provide directly to
consumers. For example, accuracies for the number of activations are 83% for the
washing machine and 88% for the dishwasher.2 However, there needs to be more
1 The predictions for the fridge were poorer than baseline performance whether or not the models were trained with synthetic data.
• Small training set. Although REFIT is large when compared to similar datasets,
it is not as large as datasets typically used to train neural networks—especially
since we are aggregating at the daily level. This means we only have around
3,100 real training examples and 40,000 synthetic training examples. Addition-
ally, there are only 15–22 series for each type of target appliance.
• Limited validation process. Given the relative lack of data and the large number
of models that needed to be trained through the random search process, we do not
use cross-validation. Ideally, we would iteratively validate on each house while
training on the others. The small number of houses in the validation holdout set
leads to sensitivity in the validation error—and likely poorer results in general.
• Sensitive test results. Like with the validation set, we hold out only a few houses
to test generalization error. This leads to sensitivity in the reported results.
• Some models not using target appliance signature information. The fridge
models struggled to use the fridge signatures themselves, since between-house
variance dominates within-house variance. It is possible that this could be remedied
with more data and with improved methods that encourage generalization (e.g.,
further improving the data augmentation process so that the models do not learn
to guess the house).
• Address limitations discussed in Section 5.2. This can mean using cross-
validation, incorporating more data (perhaps from other studies), making the
multi-target models more flexible, or further improving the data augmentation
process.
• Test results of AFHMM with LBM using improved priors. We can run
AFHMM with LBM using the improved priors and see to what degree the re-
sults are improved. Given that simply using national averages as priors reduces
errors by 50% in some cases, it is possible that improved priors can lead to ad-
ditional significant gains.
Appendix A
Data issues
The data issues known before the start of this project are described below:
• Aggregate signal measurement error. The device used to measure the aggre-
gate power signal (a CurrentCost transmitter) has a relative error of around 6%.
• Meters not synchronized. The readings of IAMs are not synchronized in time
with those of the aggregate signal. Specifically, the timestamp of any change
in value in an IAM is seconds before the associated change in the aggregate signal.
This means that subtracting an appliance signal from the aggregate signal would
leave spiky artifacts.
• Disagreeing power step changes. Step changes in value in the IAMs and in the
aggregate signal do not always correspond.
• Disagreeing power levels. The sum of the IAM readings is sometimes higher
than the aggregate reading, as indicated by the Issues column in the CSV files.
1 See Appendix A.2 on a discrepancy regarding solar panels on House 3.
• Gaps in the data. There are large gaps in the data due to internet outages and
other technical problems. This is most notable in February 2014 (reminder:
Figure 3.1). The average uptime across houses is 88% [3].
Figure A.1: Houses with solar panels had a characteristic bell-shaped curve.
In addition to the known data issues, some were discovered through additional data
exploration:
2 Lighting itself is 16% of electricity use [42].
• Incorrect appliance labels. There are discrepancies between the appliance col-
umn names in the raw and cleaned README files. Visualizing the signals seems
to suggest that the labels from the raw README are correct.3 In total, there are
ten labeling discrepancies between the raw and cleaned README files that af-
fected our target variables.
• Strings of large, repeating power values (Figure A.4). This bizarre data error
is found in both aggregate series and appliance series.
• Low correlation between IAMs and aggregate (Figure A.2). This is caused
by appliances that are not connected to IAMs, IAM signals that do not seem
to register in the aggregate, and other issues.
• Varying daily recordings. That is, the number of recordings vary quite a bit
across houses and days (Figure A.3). This is not necessarily a mistake, but it
needs to be taken into account when preparing the data for modeling.
Figure A.2: Distribution of daily correlation between sum of IAMs and aggregate.
Figure A.3: Distribution of number of daily recordings.
3 For example, Appliance 4 for House 12 is labeled as a computer site when the signature resembles
a kettle—which is the label in the raw README.
Figure A.4: An example of a day with large, repeated power values for one of the power
series.
Appendix B
Additional algorithms
Data: (1) unaligned, the list of ordered unaligned values; (2) standard, the
list of ordered values to align to; (3) padder, the index value to use when
the starting value(s) of unaligned are greater than the first value of standard
(default 0 in this project)
Result: idx, the list of indices of unaligned that align it with the values of
standard
u = 0
idx = []
for s in range(len(standard)) do
    while u < len(unaligned) and unaligned[u] <= standard[s] do
        u += 1
    end
    idx.append(u - 1)
end
for i in range(len(standard)) do
    if idx[i] == -1 then
        idx[i] = padder
    else
        break
    end
end
return idx
Algorithm 2: Python-style pseudo-code that describes how timestamps are stan-
dardized. It returns the indices idx of unaligned such that unaligned[idx[i]]
is the maximum value of unaligned for which unaligned[idx[i]] <=
standard[i], for all i. idx can then be used to standardize power series that
have unaligned timestamps.
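Algorithm 2 translates directly into runnable Python; the timestamps in the usage example below are made up:

def align_indices(unaligned, standard, padder=0):
    # for each value in `standard`, find the index of the largest value in
    # `unaligned` that is <= that value; use `padder` where none exists
    u = 0
    idx = []
    for s in range(len(standard)):
        while u < len(unaligned) and unaligned[u] <= standard[s]:
            u += 1
        idx.append(u - 1)
    for i in range(len(idx)):
        if idx[i] == -1:
            idx[i] = padder
        else:
            break
    return idx

# illustrative timestamps (seconds); a power series can then be resampled
# onto the standard grid with power[align_indices(power_times, standard)]
print(align_indices([5, 12, 19], [0, 8, 16, 24]))  # [0, 0, 1, 2]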
Data: (1) List of houses that have not been held out for validation and testing;
(2) list of target appliances; (3) list of dates that have not been held out
for validation and testing.
Result: (1) XN⇥D feature matrix where N is the number of training samples and
D is the standardized number of timesteps in a day; (2) YN⇥K feature
matrix where K is the number of target variables (one for each target
appliance).
1 Initialize X and Y as empty matrices
2 foreach house do
3 foreach date do
4 if date is not viable for house then
5 continue to next date
6 end
7 Initialize x , synthetic aggregate series, with zeros
8 Initialize y , vector of target variables, as empty list
9 foreach target appliance do
10 Initialize x a , power series for the target appliance, with zeros
11 Initialize ya , target variable used by target appliance, as zero
12 foreach power series in house with target appliance do
13 if swap signal (with p = 0.5) then
14 Load x s , a power series for the target appliance for a random
house/date
15 else
16 Load x s , the power series for the target appliance for the
current house/date
17 end
18 Calculate ys , energy used in the target appliance power series,
from x s
19 Standardize number of timesteps of x s
20 xa xa + xs
21 ya ya + ys
22 end
23 x x + xa
24 Append ya to y
25 end
26 foreach distractor appliance do
27 if include distractor appliance (with p = 0.5) then
28 Load x d , the power series for the distractor appliance for the
current house/date
29 Standardize number of timesteps of x d
30 x x + xd
31 end
32 end
33 Append x to X as row
34 Append y to Y as row
35 end
36 end
37 return X, Y
Algorithm 4: Creation of synthetic training data. For simplicity, this algorithm
only shows the calculation of the energy target variables. In the actual code,
there are two Y matrices (and consequently two y , ya , ys , and yd )—one for energy
and one for number of activations. Also excluded are additional vectors that are
returned that specify the target house and date of each observation.
Appendix C
Additional evaluation data
Loss (MAE)
Appliance House type Energy Num. activations
fridge seen 0.24 ± 0.020 6.7 ± 0.52
unseen 0.40 ± 0.019 14 ± 0.65
kettle seen 0.20 ± 0.011 2.0 ± 0.15
unseen 0.21 ± 9.4e-3 2.3 ± 0.15
w. machine seen 0.17 ± 0.018 0.29 ± 0.026
unseen 0.30 ± 0.025 0.37 ± 0.030
dishwasher seen 0.16 ± 0.021 0.16 ± 0.022
unseen 0.40 ± 0.028 0.33 ± 0.025
microwave seen 0.070 ± 5.7e-3 1.6 ± 0.12
unseen 0.068 ± 3.3e-3 1.7 ± 0.097
Table C.1: Test MAE loss with 95% confidence intervals. The units for energy are kWh,
and the units for number of activations are counts.
Loss (MAE)
Appliance House type Energy Num. activations
fridge seen 0.20 ± 0.018 5.6 ± 0.47
unseen 0.38 ± 0.019 13 ± 0.63
kettle seen 0.11 ± 9.9e-3 1.3 ± 0.12
unseen 0.25 ± 0.014 3.3 ± 0.19
w. machine seen 0.19 ± 0.020 0.29 ± 0.032
unseen 0.37 ± 0.028 0.45 ± 0.037
dishwasher seen 0.16 ± 0.020 0.14 ± 0.020
unseen 0.56 ± 0.036 0.35 ± 0.026
microwave seen 0.072 ± 6.0e-3 1.6 ± 0.13
unseen 0.067 ± 3.0e-3 1.5 ± 0.094
Table C.2: Test MAE loss when trained only on synthetic data, with 95% confidence
intervals. The units for energy are kWh, and the units for number of activations are
counts.
Figure C.1: Confusion matrix for microwave activations on unseen houses. Model pre-
dictions are rounded to the nearest integer.
Figure C.2: Predictions vs. targets for washing machine energy. The points are jittered
along the x-axis to reduce overlap.
Figure C.4: Predictions vs. targets for dishwasher activations. The points are jittered
along the x-axis to reduce overlap.
Figure C.5: Predictions vs. targets for microwave activations. The points are jittered
along the x-axis to reduce overlap.
Bibliography
[4] S. Ruder, “An overview of multi-task learning in deep neural networks,” arXiv
preprint arXiv:1706.05098, 2017.
[6] J. Kelly and W. Knottenbelt, “Neural NILM: Deep neural networks applied to
energy disaggregation,” in Proceedings of the 2nd ACM International Conference
on Embedded Systems for Energy-Efficient Built Environments, pp. 55–64, ACM,
2015.
[11] M. Zhong, N. Goddard, and C. Sutton, “Latent Bayesian melding for integrating
individual and population models,” in Advances in Neural Information Process-
ing Systems, pp. 3618–3626, 2015.
[13] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-
scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[14] Y. Zheng, Q. Liu, E. Chen, Y. Ge, and J. L. Zhao, “Time series classification using
multi-channels deep convolutional neural networks,” in International Conference
on Web-Age Information Management, pp. 298–310, Springer, 2014.
[15] Z. Wang and T. Oates, “Encoding time series as images for visual inspection
and classification using tiled convolutional neural networks,” in Workshops at the
Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[21] S. Seguı́, O. Pujol, and J. Vitria, “Learning to count with deep object features,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition Workshops, pp. 90–96, 2015.
[27] A. Filip, “BLUED: A fully labeled public dataset for event-based non-intrusive load
monitoring research,” in 2nd Workshop on Data Mining Applications in Sustain-
ability (SustKDD), 2011.
[28] J. Z. Kolter and M. J. Johnson, “REDD: A public data set for energy disaggregation
research,” in Workshop on Data Mining Applications in Sustainability (SIGKDD),
San Diego, CA, vol. 25, pp. 59–62, 2011.
[29] “REFIT: Smart homes and energy demand reduction.” Website available at
http://www.refitsmarthomes.org/. Accessed 2 Apr 2017.
[35] D. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network
learning by exponential linear units (ELUs),” CoRR, vol. abs/1511.07289, 2015.
[39] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015.
[40] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv
preprint arXiv:1412.6980, 2014.
[41] F. Chollet, “How convolutional neural networks see the world.” Keras blog.
Website available at https://blog.keras.io/how-convolutional-neural-networks-
see-the-world.html. Accessed 1 Jul 2017.
[42] L. Stankovic, V. Stankovic, J. Liao, and C. Wilson, “Measuring the energy in-
tensity of domestic activities from smart meter data,” Applied Energy, vol. 183,
pp. 1565–1580, 2016.