Convolutional Neural Networks For Human Activity Recognition Using Mobile Sensors
Ming Zeng, Le T. Nguyen, Bo Yu, Ole J. Mengshoel, Jiang Zhu, Pang Wu, Joy Zhang
1 Introduction
In recent years, the rapid spread of mobile devices with sensing capabilities has created a huge demand for human activity recognition (AR). Applications that can benefit from AR include daily lifelogging, healthcare, senior care, personal fitness, etc. [7, 32, 31, 9]. As a result, many approaches have been proposed for recognizing a wide range of activities [8, 15, 10, 23].
Feature extraction is one of the key steps in AR, since it captures the relevant information needed to differentiate among activities. The accuracy of AR approaches depends greatly on the features extracted from raw signals such as accelerometer readings [34]. Many existing AR approaches rely on statistical features such as mean, variance, entropy, or correlation coefficients [3]. Frequency-domain features computed with the FFT have also been proposed [17]. Prior work has shown that some of these heuristically defined features perform well in recognizing one activity but poorly for others [15]. Therefore, given an application scenario and a set of target activities, one can select a subset of features to optimize activity recognition performance [34, 15].
Designing hand-crafted features for a specific application requires domain knowledge [23]. This problem is not unique to activity recognition; it has been well studied in other research areas such as image recognition [22], where different types of features need to be extracted when recognizing handwriting as opposed to recognizing faces. In recent years, owing to advances in processing capabilities, many Deep Learning (DL) techniques have been developed and successfully applied in recognition tasks [2, 28]. These techniques allow features to be extracted automatically, without domain knowledge.
In this work, we propose an approach based on Convolutional Neural Networks
(CNN) [2] to recognize activities in various application domains. There are two key
advantages when applying CNN to AR:
– Local Dependency: CNN captures local dependencies of activity signals. In image recognition tasks, nearby pixels typically have strong relationships with each other. Similarly, in AR, the nearby acceleration readings of an activity are likely to be correlated.
– Scale Invariance: CNN preserves scale-invariant features. In image recognition, training images may appear at different scales; in AR, a person may walk at different paces (i.e., with different motion intensity).
We summarize the key contributions of this work as follows:
– We propose an approach based on CNN to extract human activity features without
any domain knowledge.
– The proposed approach can capture the local dependencies and scale-invariant features of activity signals, so that variations of the same activity are effectively captured by the extracted features.
– We present experimental results on three public datasets collected in different domains. The results show that the proposed approach outperforms state-of-the-art methods.
The rest of this paper is organized as follows: Section 2 presents related work; Section 3 describes our CNN-based method for activity recognition and its refinements; Section 4 presents our experimental results. Finally, we conclude the study in Section 5.
2 Related Work
2.1 Feature Extraction for Activity Recognition
AR can be considered a classification problem, where the inputs are time series signals and the output is an activity class label. Fig 1 shows the activity recognition process, which is divided into a training phase and a classification phase. In the training phase, we extract features from the raw time series data. These features are then used to train a classification model. In the classification phase, we first extract features from unseen raw data and then use the trained model to predict an activity label.
Feature extraction for AR is an important task that has been studied for many years. Statistical features such as mean, standard deviation, entropy, and correlation coefficients are the most widely used hand-crafted features in AR [8]. The Fourier transform and the wavelet transform [27] are two other commonly used hand-crafted feature types; the discrete cosine transform (DCT) has also been applied with promising results [11], as have auto-regressive model coefficients [12]. Recently, time-delay embeddings [10] have been applied to activity recognition. This approach adopts nonlinear time series analysis to extract features and shows a significant improvement in recognizing periodic activities (e.g., cycling, which involves a periodic, roughly two-dimensional leg movement). However, features from time-delay embeddings are less suitable for non-periodic activities.
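For concreteness, here is a minimal sketch of such hand-crafted feature extraction from one accelerometer window (the exact feature set below is illustrative, not the full set used in the cited works):

import numpy as np

def statistical_features(window):
    # window: one frame of tri-axial data, shape (T, 3)
    feats = []
    for axis in range(window.shape[1]):
        x = window[:, axis]
        feats += [x.mean(), x.std()]
        # spectral entropy of the axis signal (a common frequency-domain feature)
        p = np.abs(np.fft.rfft(x)) ** 2
        p = p / (p.sum() + 1e-12)
        feats.append(float(-(p * np.log(p + 1e-12)).sum()))
    c = np.corrcoef(window.T)                 # pairwise axis correlations
    feats += [c[0, 1], c[0, 2], c[1, 2]]
    return np.array(feats)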
Recently, approaches such as principal component analysis (PCA) [11] and the restricted Boltzmann machine (RBM) [23] have been applied to adapt feature extraction to the dataset; i.e., the mapping from raw signals to features is not predefined but is learned automatically from the training data. PCA is a well-established technique for discovering compact and meaningful representations of raw data without relying on domain knowledge. In [11], PCA feature extraction is conducted in the DCT domain; after applying PCA, the most invariant and discriminating information for recognition is retained. A PCA variant based on the empirical cumulative distribution function (ECDF) was proposed in [23] to preserve structural information of the signal.
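A rough sketch of dataset-adaptive feature learning in this spirit, pairing an ECDF-style resampling with PCA (the number of ECDF points and principal components are illustrative, and the exact ECDF construction of [23] may differ):

import numpy as np
from sklearn.decomposition import PCA

def ecdf_rep(frames, n_points=30):
    # inverse-ECDF representation: per axis, sample the value distribution
    # of each frame at fixed quantiles, preserving structural information
    qs = np.linspace(0, 100, n_points)
    per_axis = [np.percentile(frames[:, :, a], qs, axis=1).T
                for a in range(frames.shape[2])]
    return np.concatenate(per_axis, axis=1)

frames = np.random.randn(1000, 64, 3)        # placeholder frames (n, 64, 3)
pca = PCA(n_components=30).fit(ecdf_rep(frames))
learned = pca.transform(ecdf_rep(frames))    # learned, not predefined, mapping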
Although PCA can learn features in an unsupervised manner, its linear combination of raw features does not have sufficient capability to model complex non-linear dependencies [4]. Therefore, deep neural networks (DNNs) have been proposed to extract more meaningful features [23]. One key difference between traditional neural networks and deep neural networks is that DNNs can have many layers, while traditional neural networks typically contain at most three. A key advantage of DNNs is their representation of input features: a DNN can model diverse activities with much less training data, because some hidden units can share similar portions of the input space while other units remain sensitive to the subset of input features that is significant for recognition.
DNNs have recently made breakthroughs in many research areas. Deep architectures can represent complex functions compactly and have been shown to outperform state-of-the-art machine learning algorithms in many applications, such as face detection and speech recognition [4]. Fig 2 compares a DNN model with existing approaches.
A statistical-feature model can be considered a model of depth 1, where the output nodes represent predefined functions such as mean and variance. PCA can also be considered a model of depth 1, where the output nodes represent the $k$ principal components, output as linear combinations of the input data. A DNN is a model with a depth of $n$ layers, where the complex dependencies of the input data are captured through hidden layers with non-linear mappings.
Fig. 2. (a): statistical feature computation, (b): PCA model, (c): DNN model
A CNN is a deep network whose lower layers are composed of alternating convolution and pooling layers (these terms are defined and discussed in Section 3). The convolutional layer captures small local parts of the input with a set of local filters, and the pooling layer preserves the invariant features. A top fully connected layer finally combines the inputs from all features to classify the overall input. This hierarchical organization generates good results in image processing [18, 16] and speech recognition [1] tasks. In the next section, we present the details of CNN and describe our proposed CNN-based AR approach.
3 CNN-Based Activity Recognition
Our $L$-layer CNN-based model has three kinds of layers: 1) an input layer (with units $h_i^0$) whose values are fixed by the input data; 2) hidden layers (with units $h_i^l$) whose values are derived from the previous layer $l-1$; and 3) an output layer (with units $h_i^L$) whose values are derived from the last hidden layer. The network learns by adjusting a set of weights $w_{i,j}^l$, where $w_{i,j}^l$ is the weight from the output of unit $h_i^l$ to unit $h_j^{l+1}$. We use $x_i^l$ to denote the total input to unit $u_i^l$ (the $i$th unit in layer $l$) and $y_i^l$ to denote its output.
In the following, we describe how CNN captures the local dependencies and the scale-invariant characteristics of activity signals. To capture the local dependencies of the data, one can enforce a local connectivity constraint between units of adjacent layers. For example, in Fig 4 the units (neurons) in the middle layer are connected only to a local subset of units in the input layer. From biology, we know that the visual cortex contains complex arrangements of cells that are sensitive to small regions of the input, called receptive fields, which are tiled to cover the entire visual field. Such filters are local in the input space and are thus well suited to exploiting local correlations hidden in the data; we therefore call them local filters. With local filters, the edge weight $w_{i,j}$ connecting the $i$th input unit to the $j$th hidden unit reduces to a shared weight $w_a$ that depends only on the relative offset $a$ within a filter window of width $m$. In Fig 4, the 1D vector $[w_1, w_2, w_3]$ represents the weights of the local filters, denoted by different line styles, where $w_i$ is the weight of an edge connecting units in the two layers. The convolution operation is conducted over these local subsets. This topological constraint corresponds to learning a weight matrix with a sparsity constraint, which is not only good for extracting local dependencies but also reduces the computational complexity. The outputs of such a set of local filters constitute a feature map (Fig 5). At each temporal position, units in different feature maps compute different types of features.
Fig. 3. Structure of the CNN for human activity recognition. The dimension of the input data is 64, the dimension of the convolutional output is 12, and the dimension of the max-pooling output is 4. The two hidden layers have dimensions 1024 and 30, respectively. The top layer is a softmax classifier.
Fig. 4. (Left) Traditional weight-sharing CNN; (Right) partial weight-sharing CNN. Weights denoted by the same line style are shared.
The output of a convolution layer with full weight sharing is
$$x_i^{l,j} = \sigma\Big(b_j + \sum_{a=1}^{m} w_a^{j}\, x_{i+a-1}^{l-1,j}\Big) \qquad (1)$$
where $x_i^{l,j}$ is the output of the $i$th unit of the $j$th feature map in the $l$th convolution layer, $m$ is the width of the local filter, and $\sigma$ is a non-linear mapping, usually the hyperbolic tangent function $\tanh(\cdot)$. Taking Fig 4 as an example, the first hidden unit of the first local filter is
$$x_1^{1,1} = \tanh\big(w_1^{1,1} x_1^{0,1} + w_2^{1,1} x_2^{0,1} + w_3^{1,1} x_3^{0,1} + b_1\big)$$
Fig. 5. Feature maps: three feature maps at the input layer and two at the output layer.
In the traditional CNN model [18], each local filter is additionally replicated across the entire input space; that is, the weights of the local filters are tied and shared by all positions within the input. For example, in Fig 4 (Left), weights denoted by the same line style are shared (forced to be identical). The replicated weights allow features to be detected regardless of the position of the unit, which also helps preserve scale invariance.
In image processing tasks, full weight sharing is suitable because the same image pattern can appear at any position in an image. In AR, however, different patterns appear in different parts of a frame, so the signal at different units may behave quite differently. It may therefore be better to relax the weight sharing constraint so that only weights of the same line style within the same section are constrained to be identical, as in Fig 4 (Right). This weight sharing strategy was first described in [1]; we call it partial weight sharing in our application. With partial weight sharing, the activation function of the convolution layer becomes
activation the function in convolution layer is changed as below:
m
!
X
xl,j
i,k = σ bj +
j
wa,k xl−1,j
i+(k−1)×s+a−1 (2)
a=1
where xl,j
i,k is one of the ith unit of j feature map of the kth section in the lth layer, and
s is the range of section. The difference between Equation 1 and Equation 2 is the range
of weight sharing, using the window (k − 1) × s + i + a instead of i + a to conduct the
convolution operation.
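To make the indexing concrete, here is a minimal NumPy sketch of this partially shared convolution for a single feature map (function names and array shapes are illustrative, not from the paper):

import numpy as np

def partial_shared_conv(x, W, b, s):
    # x: 1-D input signal (one feature map of the previous layer)
    # W: (n_sections, m) local filter weights, one filter per shared-weight section
    # b: scalar bias for this feature map; s: section length, as in Eq. 2
    n_sections, m = W.shape
    out = []
    for k in range(n_sections):              # the k-th shared-weight section
        for i in range(s):                   # positions inside section k
            lo = k * s + i                   # 0-indexed version of (k-1)*s + i + a - 1
            if lo + m <= len(x):
                out.append(np.tanh(b + W[k] @ x[lo:lo + m]))
    return np.array(out)

# e.g., a 64-sample frame, filter width 20, three sections of length 15
y = partial_shared_conv(np.random.randn(64), 0.1 * np.random.randn(3, 20), 0.0, s=15)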
Due to the partial weight sharing structure, only local filters that are close to each other share weights, and their outputs are aggregated together in the max-pooling layer. The pooling function differs slightly from the traditional pooling function, since the max operation is carried out only within the same shared-weight section:
$$x_i^{l,j} = \max_{k=1,\dots,r} x_k^{l-1,j} \qquad (4)$$
where $r$ is the pooling size and the max is taken over the $i$th shared-weight section.
Thus, filters that work on local time windows provide an efficient way to represent these local structures, and their combinations along the whole time axis can eventually be used to recognize different activities.
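Continuing the sketch above, the section-wise max-pooling of Eq. 4 can be written as follows (the default pooling size r = 3 matches the best setting reported in Section 4):

def section_max_pool(x, r=3):
    # max-pool only within consecutive shared-weight sections of length r (Eq. 4)
    return np.array([x[k * r:(k + 1) * r].max() for k in range(len(x) // r)])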
A CNN can contain one or more pairs of convolution and max-pooling layers, where higher layers use broader filters to process more complex parts of the input. The top layers of the CNN are one or more stacked fully connected layers, which are expected to combine the different local structures from the lower layers for the final classification.
In this work, we use only one pair of convolution and max-pooling layers and two fully connected layers: for each tri-axial frame $\{((x_i, y_i, z_i), a_i)\}$, convolution and max-pooling are applied to the signal, a fully connected layer integrates the pooling results of the three axes, and a softmax layer performs the classification. In the training stage, the CNN parameters are estimated by the standard forward and backward propagation algorithms, repeated until the weights converge, to minimize the objective function.
The top softmax layer produces class posterior probabilities
$$p(c \mid x) = \frac{\exp(w_c^{\top} x)}{\sum_{k=1}^{K} \exp(w_k^{\top} x)}$$
and the objective function is the negative log-likelihood of the training labels, $L = -\sum \log p(y \mid x)$, where $c$ is a class label, $x$ is a sample feature vector, $y$ is the label variable, $w$ is a weight vector, and $K$ is the number of classes. The gradient of the objective with respect to a local filter weight is
$$\frac{\partial L}{\partial w_{a,b}} = \sum_{i=1}^{N-m-1} y_{(i+a)}^{l-1}\, \sigma'(x_i^{l})\, \frac{\partial L}{\partial y_i^{l}} \qquad (8)$$
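As a minimal sketch of this objective for one sample (the function name and the stability shift are ours, not from the paper):

import numpy as np

def softmax_nll(scores, y):
    # scores: (K,) class scores w_c^T x for one sample; y: true class index
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    p = e / e.sum()                     # softmax posterior p(c | x)
    return p, -np.log(p[y] + 1e-12)    # posterior and negative log-likelihood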
Despite CNN's many successes, weaknesses remain, such as the local-optimum problem and overfitting. In the following, we discuss regularization terms used to train a more robust CNN model. In the training stage, we used stochastic gradient descent with a batch size of 200 examples and a learning rate of 0.05. The weights of weight decay, momentum, and dropout must be set carefully.
Weight Decay The motivation for weight decay is to avoid over-fitting. The learning rate determines how much an update step influences the current value of the weights, while weight decay is an additional term in the weight update rule that causes the weights to decay exponentially to zero if no other update is scheduled. Assume that the cost function we want to minimize is $L(w)$. Gradient descent modifies the weights $w$ in the direction of steepest descent of $L$:
$$w_i = w_i - \eta \frac{\partial L}{\partial w_i} \qquad (9)$$
where $\eta$ is the learning rate; if it is set too large, each step makes a correspondingly large modification to the weights $w_i$.
To effectively limit the number of free parameters in the model and avoid over-fitting, one can regularize the cost function. An easy way is to introduce a zero-mean Gaussian prior over the weights, which is equivalent to changing the cost function to $\hat{L}(w) = L(w) + \frac{\lambda}{2} w^2$. In practice, this penalizes large weights and effectively limits the freedom of the model. The regularization parameter $\lambda$ determines how the original cost $L$ is traded off against the penalty on large weights.
Applying gradient descent to this new cost function, we obtain:
$$w_i = w_i - \eta \frac{\partial L}{\partial w_i} - \eta\lambda w_i \qquad (10)$$
The new term $-\eta\lambda w_i$ coming from the regularization causes each weight to decay in proportion to its size. Intuitively, the weight-decay term favors solutions that fit the data while remaining "smooth" (having small weights).
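In code, one SGD step with weight decay (Eq. 10) is simply the following (the default values follow the settings reported in this paper: learning rate 0.05, weight decay around 0.25):

def sgd_weight_decay_step(w, grad, eta=0.05, lam=0.25):
    # Eq. 10: gradient step plus decay of each weight in proportion to its size
    return w - eta * grad - eta * lam * w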
Momentum The momentum method [24] accumulates past gradients to smooth out oscillations of the cost function and allow larger steps in favorable directions. With appropriate parameters, the rate of convergence is increased while local optima may be overstepped. For the cost function $L(w)$ to be minimized, classical momentum is given by
$$v_{t+1} = \mu v_t - \eta \nabla L(w_t), \qquad w_{t+1} = w_t + v_{t+1} \qquad (11)$$
where $\eta > 0$ is the learning rate, $\mu \in [0, 1]$ is the momentum coefficient, and $\nabla L(w_t)$ is the gradient at $w_t$. The application of momentum to DNNs is discussed in [26]. Note that $\mu = 0$ gives standard gradient descent, $v_{t+1} = -\eta \nabla L(w_t)$, while $\mu = 1$ gives "infinite momentum", in which the accumulated velocity never decays.
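The corresponding momentum update (Eq. 11) is sketched below; mu = 0.5 matches the value around which accuracy peaks in Section 4:

def momentum_step(w, v, grad, eta=0.05, mu=0.5):
    # Eq. 11: the velocity v accumulates past gradients
    v = mu * v - eta * grad
    return w + v, v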
4 Experimental Analysis
4.1 Dataset and Preprocessing
We selected three publicly available datasets for our evaluation. All datasets concern human activities in different contexts and have been recorded using tri-axial accelerometers. Sensors were either worn or embedded into objects that subjects manipulated. The sensor data was segmented using a sliding window of 64 consecutive samples with 50% overlap. For the CNN-based approach, the acceleration values were normalized to zero mean and unit standard variance. All deep learning based algorithms (CNN-based and RBM) were run on a server equipped with a Tesla K20c GPU and 48 GB of memory. The traditional learning algorithms (PCA, statistical) were run on the same server on an Intel Xeon E5 CPU.
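A minimal sketch of this preprocessing (window length 64 and 50% overlap as stated above; the per-axis normalization details are an assumption):

import numpy as np

def segment(stream, win=64, overlap=0.5):
    # stream: (T, 3) tri-axial accelerometer signal -> (n_frames, win, 3)
    step = int(win * (1 - overlap))   # a 32-sample step gives 50% overlap
    frames = np.stack([stream[i:i + win]
                       for i in range(0, len(stream) - win + 1, step)])
    mu = frames.mean(axis=(0, 1))     # normalize to zero mean and
    sd = frames.std(axis=(0, 1))      # unit standard variance per axis
    return (frames - mu) / (sd + 1e-12)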
– Opportunity (Opp) [25, 6] This dataset contains activities performed in a home environment (kitchen) using multiple worn sensors. It records the activities of multiple subjects on different days at 64 Hz. The activities include "open then close the fridge", "open then close the dishwasher", "drink while standing", "clean the table", etc. Our settings on this dataset are the same as in [23]: we use only one sensor on the right arm and consider 11 activity categories, comprising 10 low-level activities and 1 unknown activity. The dataset contains around 4,200 frames.
– Skoda [33] The Skoda Mini Checkpoint dataset describes the activities of assembly-line workers in a car maintenance scenario. The dataset records a worker wearing 20 accelerometers on both arms while performing 46 activities in the factory at one of the quality control checkpoints. The activities include "open hood", "close left hand door", "check steering wheel", etc. The sampling frequency is 96 Hz, yielding around 15,000 frames. Our CNN settings on this dataset follow [23]: we use only one accelerometer on the right arm, identify the 10 activities related to the right arm, and perform 4-fold cross validation.
– Actitracker [21] This dataset contains six daily activities collected in a controlled laboratory environment. The activities include "jogging", "walking", "ascending stairs", and "descending stairs", etc. They were recorded from 36 users carrying a cell phone in their pocket, at a 20 Hz sampling rate, yielding around 29,000 frames. We conduct 10-fold cross validation on this dataset.
4.2 Classification Results
Fig. 6. Accuracy of classification for the experimental evaluation of learned features. The statistical, RBM, and PCA-ECDF features do not consider local dependency or scale invariance, whereas the CNN-based models take both into account.
In the first experiment, we evaluate the activity recognition results presented in Fig 6. The CNN is composed of a convolution layer with partial weight sharing, with the filter size set to 20 and the max-pooling size set to 3. The top two fully connected hidden layers have 1024 and 30 nodes, respectively. An additional softmax top layer is used to generate state posterior probabilities. All the other compared algorithms use the same settings as [23]: a 23-dimensional statistical feature vector (mean, standard deviation, energy, etc.); PCA (with ECDF preprocessing) with 30 principal components; and an RBM with structure 192-1024-1024-30. KNN is used as the label predictor. To show the general applicability of the methods, the learning parameters and the network layout were tuned on the Skoda dataset via cross-validation and then applied as-is to the remaining datasets.
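A rough sketch of this evaluation protocol with scikit-learn (the features and labels below are random placeholders, and k = 1 for the KNN predictor is our assumption, since k is not stated above):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

feats = np.random.randn(500, 30)              # placeholder learned features
labels = np.random.randint(0, 10, size=500)   # placeholder activity labels
knn = KNeighborsClassifier(n_neighbors=1)     # k = 1 is an assumption
print(cross_val_score(knn, feats, labels, cv=4).mean())  # 4-fold CV, as on Skoda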
From Fig 6 we can observe that CNN with partial weight sharing improves the classification accuracy (with 95% confidence) on all three datasets. This CNN-based model achieves classification accuracies of 88.19%, 76.83%, and 96.88% on Skoda, Opp, and Actitracker respectively, which is 4.41%, 1.2%, and 9.02% higher than the best competing algorithm (PCA-ECDF) [23].
To analyze the results in more detail, we show the confusion matrices for the Actitracker dataset using PCA-ECDF (Table 1) and CNN (Table 2). The two confusion matrices indicate that many of the prediction errors are due to confusion among three activities: "walking", "walking down", and "walking up". This is because these three activities are relatively similar [19]. Nevertheless, the results show that the CNN with partial weight sharing outperforms PCA-ECDF, owing to the two characteristics discussed above. Note that in the PCA-ECDF confusion matrix, the confusion for (up, walk) and (down, walk) is high, because the signal vibrations of "walking up" and "walking down" resemble those of "walking". The CNN-based model performs well in these two cases, which indicates that CNN extracts more representative features for "walking down" and "walking up".
              Jog   Walk    Up   Down   Sit   Stand
     Jog      649     13     8      3     0       7
     Walk       2   1146     7      1     2       5
     Up         5     42   187     30     2      48
     Down       0     44    65    101     3      42
     Sit        0      0     0      0   166       0
     Stand      0      0     0      0     0     133
Table 1. Confusion matrix for PCA-ECDF on the Actitracker dataset. Rows are the actual class; columns are the predicted class.
              Jog   Walk    Up   Down   Sit   Stand
     Jog      667      5     1      3     0       0
     Walk       1   1145     8      5     0       0
     Up         5     13   274     17     1       1
     Down       2      9    13    231     0       0
     Sit        0      0     0      0   166       0
     Stand      0      0     0      0     0     133
Table 2. Confusion matrix for CNN on the Actitracker dataset. Rows are the actual class; columns are the predicted class.
4.3 Parameter Sensitivity
We evaluate the sensitivity of the model to the pooling window size, weight decay, momentum, and dropout. In the following, we vary each of these in turn while keeping the other parameters at their best settings.
Pooling Size We first evaluate the effect of different pooling sizes in the CNN configuration. The CNN is composed of a convolution layer with partial weight sharing and a filter size of 20 units, a max-pooling layer (sub-sampling factor 3 by default), and two top fully connected hidden layers with 1024 and 30 nodes, respectively, plus an additional softmax top layer to generate state posterior probabilities. We tested the CNN with pooling sizes from 1 to 5, where 1 corresponds to the case of no max-pooling.
The recognition results are shown in Fig 7. The results with max-pooling are better than those without, because max-pooling helps preserve scale invariance. The best results are consistently achieved with a pooling size of 3; in this case, the recognition accuracy increases from 85.68% to 88.19% on the Skoda dataset and from 71.94% to 76.77% on the Opp dataset.
Fig. 7. Recognition accuracy as a function of the pooling size.
Weight Decay We evaluate the sensitivity to the weight decay coefficient over the values {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1}. The general trend is that the accuracy of the CNN steadily improves over [0.0001, 0.25] and then decreases as the coefficient grows further, showing that a small amount of weight decay is important for the model to learn well (Fig 8).
Momentum We evaluate the sensitivity to the momentum coefficient over the values {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}. The general trend is that the accuracy of the CNN steadily improves over [0.0, 0.5] and then drops quickly as the coefficient grows further: with a large momentum coefficient, the search direction is increasingly dominated by the direction of the preceding step (Fig 9).
Fig. 8. Recognition accuracy as a function of the weight decay coefficient.
Fig. 9. Recognition accuracy as a function of the momentum coefficient.
Fig. 10. Recognition accuracy as a function of the probability of retaining a hidden unit (dropout).
5 Conclusion
In this paper, we have proposed a CNN-based feature extraction approach that captures the local dependency and scale-invariant characteristics of acceleration time series. The experimental results show that, by extracting these characteristics, the CNN-based approach outperforms the state-of-the-art approaches.
Experiments with larger datasets are needed to further study the robustness of the
proposed technique. Further improvements may be achieved by using unsupervised pre-
training and repeating pooling operations in multiple layers of the CNN model.
6 Acknowledgement
This work is supported in part by the National Science Foundation through the Smart
and Connected Health program under the award IIS1344768.
A Backward Propagation for the Convolutional Layer
Given the error propagated from the layer above, we need to compute, for the previous layer, the partial derivative of $L$ with respect to each neuron output, $\frac{\partial L}{\partial y_i^l}$. According to the chain rule, the gradient with respect to $w$ is computed by:
$$\frac{\partial L}{\partial w_{a,b}} = \sum_{i=1}^{N-m-1} \frac{\partial L}{\partial x_{i,j}^{l}} \frac{\partial x_{i,j}^{l}}{\partial w_{a,b}} = \sum_{i=1}^{N-m-1} \frac{\partial L}{\partial x_{i,j}^{l}}\, y_{(i+a)}^{(l-1)} \qquad (12)$$
The first term on the rightmost side of (12) is straightforward to compute using the chain rule:
$$\frac{\partial L}{\partial x_{i,j}^{l}} = \frac{\partial L}{\partial y_{i,j}^{l}} \frac{\partial y_{i,j}^{l}}{\partial x_{i,j}^{l}} = \frac{\partial L}{\partial y_{i,j}^{l}} \frac{\partial}{\partial x_{i,j}^{l}} \sigma(x_{i,j}^{l}) = \frac{\partial L}{\partial y_{i,j}^{l}}\, \sigma'(x_{i,j}^{l}) \qquad (13)$$
As we can see, since we already know the error $\frac{\partial L}{\partial y_i^l}$ at the current layer, we can easily compute $\frac{\partial L}{\partial x_i^l}$ by just using the derivative of the mapping function, $\sigma'(x)$.
In addition to computing the weights for this convolutional layer, we need to propagate the errors back to the previous layer:
$$\frac{\partial L}{\partial y_{i,j}^{l-1}} = \sum_{a=0}^{m-1} \frac{\partial L}{\partial x_{(i-a)}^{l}} \frac{\partial x_{(i-a)}^{l}}{\partial y_{i,j}^{l-1}} = \sum_{a=0}^{m-1} \frac{\partial L}{\partial x_{(i-a)}^{l}}\, w_{a,b} \qquad (14)$$
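To sanity-check these derivatives, here is a minimal NumPy sketch of Eqs. (12)-(13) for a single fully shared 1-D filter with a tanh non-linearity, together with a finite-difference check (the toy loss and sizes are illustrative):

import numpy as np

def conv_forward(x, w, b):
    # y_i = tanh(b + sum_a w_a x_{i+a}) for a fully shared 1-D filter
    m = len(w)
    return np.tanh(np.array([b + w @ x[i:i + m] for i in range(len(x) - m + 1)]))

def conv_grad_w(x, w, b, dL_dy):
    # Eqs. (12)-(13): dL/dw_a = sum_i dL/dy_i * sigma'(net_i) * x_{i+a};
    # for sigma = tanh, sigma'(net_i) = 1 - y_i^2
    y = conv_forward(x, w, b)
    delta = dL_dy * (1 - y ** 2)
    return np.array([delta @ x[a:a + len(y)] for a in range(len(w))])

# finite-difference check of the first weight, using the toy loss L = sum(y^2)
x, w, b = np.random.randn(64), 0.1 * np.random.randn(5), 0.0
g = conv_grad_w(x, w, b, dL_dy=2 * conv_forward(x, w, b))
eps = 1e-6
e0 = np.zeros(5)
e0[0] = 1.0
num = (np.sum(conv_forward(x, w + eps * e0, b) ** 2)
       - np.sum(conv_forward(x, w - eps * e0, b) ** 2)) / (2 * eps)
print(abs(g[0] - num))   # should be close to zero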
References
1. O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn. Applying convolutional neural
networks concepts to hybrid nn-hmm model for speech recognition. In Acoustics, Speech and
Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4277–4280.
IEEE, 2012.
2. U. Bagci and L. Bai. A comparison of daubechies and gabor wavelets for classification of mr
images. In Signal Processing and Communications, 2007. ICSPC 2007. IEEE International
Conference on, pages 676–679. IEEE, 2007.
3. L. Bao and S. S. Intille. Activity recognition from user-annotated acceleration data. In
Pervasive computing, pages 1–17. Springer, 2004.
4. Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
5. S. Bhattacharya, P. Nurmi, N. Hammerla, and T. Plötz. Using unlabeled data in a sparse-
coding framework for human activity recognition. arXiv preprint arXiv:1312.6995, 2013.
6. R. Chavarriaga, H. Sagha, A. Calatroni, S. T. Digumarti, G. Tröster, J. d. R. Millán, and
D. Roggen. The opportunity challenge: A benchmark database for on-body sensor-based
activity recognition. Pattern Recognition Letters, 34(15):2033–2042, 2013.
7. S. Chennuru, P.-W. Chen, J. Zhu, and J. Y. Zhang. Mobile lifelogger–recording, indexing,
and understanding a mobile users life. In Mobile Computing, Applications, and Services,
pages 263–281. Springer, 2012.
8. D. Figo, P. C. Diniz, D. R. Ferreira, and J. M. Cardoso. Preprocessing techniques for context
recognition from accelerometer data. Personal and Ubiquitous Computing, 14(7):645–662,
2010.
9. K. Forster, D. Roggen, and G. Troster. Unsupervised classifier self-calibration through re-
peated context occurences: is there robustness against sensor displacement to gain? In Wear-
able Computers, 2009. ISWC’09. International Symposium on, pages 77–84. IEEE, 2009.
10. J. Frank, S. Mannor, and D. Precup. Activity and gait recognition with time-delay embed-
dings. In AAAI, 2010.
11. Z. He and L. Jin. Activity recognition from acceleration data based on discrete consine trans-
form and svm. In Systems, Man and Cybernetics. SMC 2009. IEEE International Conference
on, pages 5041–5044. IEEE, 2009.
12. Z.-Y. He and L.-W. Jin. Activity recognition from acceleration data using ar model represen-
tation and svm. In Machine Learning and Cybernetics, 2008 International Conference on,
volume 4, pages 2245–2250. IEEE, 2008.
13. G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural
computation, 18(7):1527–1554, 2006.
14. G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural
networks. Science, 313(5786):504–507, 2006.
15. T. Huynh and B. Schiele. Analyzing features for activity recognition. In Proceedings of the
2005 joint conference on Smart objects and ambient intelligence: innovative context-aware
services: usages and technologies, pages 159–163. ACM, 2005.
16. S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action
recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):221–
231, 2013.
17. A. Krause, J. Farringdon, D. P. Siewiorek, and A. Smailagic. Unsupervised, dynamic identification of physiological and activity context in wearable computing. In Seventh IEEE International Symposium on Wearable Computers (ISWC 2003), pages 88–88. IEEE Computer Society, 2003.
18. A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolu-
tional neural networks. In NIPS, volume 1, page 4, 2012.
19. J. R. Kwapisz, G. M. Weiss, and S. A. Moore. Activity recognition using cell phone ac-
celerometers. ACM SigKDD Explorations Newsletter, 12(2):74–82, 2011.
20. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
21. J. W. Lockhart, G. M. Weiss, J. C. Xue, S. T. Gallagher, A. B. Grosner, and T. T. Pulickal.
Design considerations for the wisdm smart phone-based sensor mining architecture. In Pro-
ceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data,
pages 25–33. ACM, 2011.
22. D. G. Lowe. Object recognition from local scale-invariant features. In Computer vision,
1999. The proceedings of the seventh IEEE international conference on, volume 2, pages
1150–1157. Ieee, 1999.
23. T. Plötz, N. Y. Hammerla, and P. Olivier. Feature learning for activity recognition in ubiqui-
tous computing. In Proceedings of the Twenty-Second IJCAI Volume Two, pages 1729–1734.
AAAI Press, 2011.
24. B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
25. D. Roggen, A. Calatroni, M. Rossi, T. Holleczek, K. Förster, G. Tröster, et al. Collecting complex activity datasets in highly rich networked sensor environments. In Networked Sensing Systems (INSS), 2010 Seventh International Conference on, pages 233–240. IEEE, 2010.
26. I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and mo-
mentum in deep learning. In Proceedings of the 30th International Conference on Machine
Learning (ICML-13), pages 1139–1147, 2013.
27. T. Tamura, M. Sekine, M. Ogawa, T. Togawa, and Y. Fukui. Classification of acceleration
waveforms during walking by wavelet transform. Methods of information in medicine, 36(4-
5):356–359, 1997.
28. Y. Tang, R. Salakhutdinov, and G. Hinton. Robust boltzmann machines for recognition and
denoising. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on,
pages 2264–2271. IEEE, 2012.
29. C. Vollmer, H.-M. Gross, and J. P. Eggert. Learning features for activity recognition with
shift-invariant sparse coding. In Artificial Neural Networks and Machine Learning–ICANN
2013, pages 367–374. Springer, 2013.
30. S. Wang and C. Manning. Fast dropout training. In Proceedings of the 30th International
Conference on Machine Learning (ICML-13), pages 118–126, 2013.
31. P. Wu, H.-K. Peng, J. Zhu, and Y. Zhang. Senscare: Semi-automatic activity summariza-
tion system for elderly care. In Mobile Computing, Applications, and Services, pages 1–19.
Springer, 2012.
32. P. Wu, J. Zhu, and J. Y. Zhang. Mobisens: A versatile mobile sensing platform for real-world
applications. Mobile Networks and Applications, 18(1):60–80, 2013.
33. P. Zappi, C. Lombriser, T. Stiefmeier, E. Farella, D. Roggen, L. Benini, and G. Tröster. Activ-
ity recognition from on-body sensors: accuracy-power trade-off by dynamic sensor selection.
In Wireless Sensor Networks, pages 17–33. Springer, 2008.
34. M. Zhang and A. A. Sawchuk. A feature selection-based framework for human activity
recognition using wearable multimodal sensors. In Proceedings of the 6th International
Conference on Body Area Networks, pages 92–98. ICST, 2011.