NN MTH404
You Liang
2023-03-22
where 𝑔(𝑧) is a nonlinear activation function that is specified in advance. We can think of each 𝐴𝑘 as a different transformation ℎ𝑘(𝑋) of the original features. These 𝐾 activations from the hidden layer then feed into the output layer, resulting in

$$f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k A_k.$$
Two commonly used activation functions are the sigmoid activation function

$$g(z) = \frac{e^z}{1 + e^z}$$

and the ReLU (rectified linear unit) activation function

$$g(z) = (z)_+ = \max(0, z).$$
A ReLU activation can be computed and stored more efficiently than a sigmoid activation, and for this reason it is now often preferred. The nonlinearity in the activation function 𝑔(·) is essential: without it, the model 𝑓(𝑋) would collapse into a simple linear model in 𝑋1 , … , 𝑋𝑝 . Moreover, having a nonlinear activation function allows the model to capture complex nonlinearities and interaction effects.
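As a quick illustration (this snippet is ours, not part of the lab code), the two activation functions can be written and compared directly in R:
sigmoid <- function(z) exp(z) / (1 + exp(z))   # sigmoid activation
relu <- function(z) pmax(z, 0)                 # ReLU activation
z <- seq(-4, 4, by = 2)
cbind(z, sigmoid = sigmoid(z), relu = relu(z))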
We start by fitting the models in Section 10.6. We set up the data, and separate out a
training and test set.
library(ISLR2)
Gitters <- na.omit(Hitters)
n <- nrow(Gitters)
set.seed(13)
ntest <- trunc(n / 3)
testid <- sample(1:n, ntest)
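The output below is the test-set mean absolute error of a linear least-squares model fit on the training observations; the code that produced it is not shown above. A sketch, following the corresponding ISLR2 lab (an assumption here):
lfit <- lm(Salary ~ ., data = Gitters[-testid, ])   # linear model on the training data
lpred <- predict(lfit, Gitters[testid, ])           # predictions on the test data
with(Gitters[testid, ], mean(abs(lpred - Salary)))  # test mean absolute error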
## [1] 254.6687
Next we fit the lasso using glmnet. Since this package does not use formulas, we create 𝑥
and 𝑦 first.
x <- scale(model.matrix(Salary ~ . - 1, data = Gitters))
y <- Gitters$Salary
library(glmnet)
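The test MAE reported below comes from the lasso fit itself, whose code is not shown above. A sketch that chooses the penalty by cross-validation, following the ISLR2 lab (the choices type.measure = "mae" and s = "lambda.min" are assumptions here):
cvfit <- cv.glmnet(x[-testid, ], y[-testid], type.measure = "mae")  # cross-validated lasso
cpred <- predict(cvfit, x[testid, ], s = "lambda.min")              # predict at the best lambda
mean(abs(y[testid] - cpred))                                        # test mean absolute error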
## [1] 252.2994
To fit the neural network, we first set up a model structure that describes the network.
library(keras)
modnn <- keras_model_sequential() %>%
  layer_dense(units = 50, activation = "relu",
              input_shape = ncol(x)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 1)
We have created a vanilla model object called modnn, and have added details about the successive layers in a sequential manner, using the function keras_model_sequential(). The pipe operator %>% passes the object on its left as the first argument to the function on its right. As an illustration, the earlier computation of x can be rewritten as a pipe: we first make the model matrix, and then standardize each of the variables.
x <- model.matrix(Salary ~ . - 1, data = Gitters) %>% scale()
We now return to our neural network. The object modnn has a single hidden layer with 50
hidden units, and a ReLU activation function. It then has a dropout layer, in which a random
40% of the 50 activations from the previous layer are set to zero during each iteration of
the stochastic gradient descent algorithm. Finally, the output layer has just one unit with no
activation function, indicating that the model provides a single quantitative output.
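As a rough stand-alone illustration of what the dropout layer does during training (ignoring the rescaling of the retained activations that keras also applies; this snippet is ours, not part of the lab code):
set.seed(1)
a <- runif(50)                            # pretend these are the 50 hidden-layer activations
keep <- rbinom(50, size = 1, prob = 0.6)  # each unit is kept with probability 0.6
dropped <- a * keep                       # about 40% of the activations are set to zero
mean(dropped == 0)                        # fraction of activations dropped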
We minimize squared-error loss. The algorithm tracks the mean absolute error on the
training data, and on validation data if it is supplied.
modnn %>% compile(loss = "mse",
                  optimizer = optimizer_rmsprop(),
                  metrics = list("mean_absolute_error"))
Now we fit the model. We supply the training data and two fitting parameters, epochs and batch_size. Using 4 for the latter means that at each step of SGD, the algorithm randomly selects 4 training observations for the computation of the gradient. Since the training set has n = 176, an epoch corresponds to 176/4 = 44 SGD steps. The fit() function has an argument validation_data; these data are not used in the fitting, but can be used to track the progress of the model (in this case reporting the mean absolute error). Here we actually supply the test data so we can see the mean absolute error of both the training data and test data as the epochs proceed.
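A quick check of that arithmetic in R (this snippet is ours):
n_train <- n - ntest   # 263 - 87 = 176 training observations
n_train / 4            # 44 SGD steps per epoch with batch_size = 4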
history <- modnn %>% fit(
  x[-testid, ], y[-testid], epochs = 50, batch_size = 4,
  validation_data = list(x[testid, ], y[testid])
)
plot(history)
Finally, we predict from the final model, and evaluate its performance on the test data. Due to the use of SGD, the results vary slightly with each fit. Unfortunately the set.seed() function does not ensure identical results (since the fitting is done in Python), so your results will differ slightly.
npred <- predict(modnn, x[testid, ])
mean(abs(y[testid] - npred))
## [1] 330.0103
We will illustrate a large dense network on the famous and publicly available MNIST
handwritten digit dataset. The idea is to build a model to classify the images into their
correct digit class 0–9. Every image has p = 28 × 28 = 784 pixels, each of which is an eight-
bit grayscale value between 0 and 255 representing the relative amount of the written digit
in that tiny square. These pixels are stored in the input vector X (in, say, column order). The
output is the class label, represented by a vector 𝑌 = (𝑌0 , 𝑌1 , … , 𝑌9 ) of 10 dummy variables,
with a one in the position corresponding to the label, and zeros elsewhere. In the machine
learning community, this is known as one-hot encoding. There are 60,000 training images,
and 10,000 test images.
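As a small illustration of one-hot encoding, the keras helper to_categorical() converts integer labels into this dummy-variable representation (the example labels are ours):
library(keras)
to_categorical(c(3, 0, 9), num_classes = 10)  # three labels, each becomes a row of 10 dummies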
mnist <- dataset_mnist()
x_train <- mnist$train$x
g_train <- mnist$train$y   # also extract the integer labels and the test set, used below
x_test <- mnist$test$x
g_test <- mnist$test$y
dim(x_train)
## [1] 60000 28 28
There are 60,000 images in the training data and 10,000 in the test data. The images are 28 × 28 = 784 pixels, and stored as a three-dimensional array, so we need to reshape them into a matrix. Also, we need to "one-hot" encode the class label. Luckily keras has a lot of built-in functions that do this for us.
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))   # reshape the test images too
y_train <- to_categorical(g_train, 10)                  # one-hot encode the training labels
dim(x_train)
Neural networks are somewhat sensitive to the scale of the inputs. For example, ridge and lasso regularization are affected by scaling. Here the inputs are eight-bit grayscale values between 0 and 255, so we rescale to the unit interval.
x_train <- x_train / 255
x_test <- x_test / 255
The first layer goes from 28 × 28 = 784 input units to a hidden layer of 256 units, which uses the ReLU activation function. This is specified by a call to layer_dense(), which takes as input a modelnn object, and returns a modified modelnn object. This is then piped through layer_dropout() to perform dropout regularization. The second hidden layer comes next, with 128 hidden units, followed by a dropout layer. The final layer is the output layer, with activation "softmax" for the 10-class classification problem, which defines the map from the second hidden layer to class probabilities.
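The code that builds modelnn is not shown above; a sketch consistent with the layer sizes in the summary() output below (the dropout rates of 0.4 and 0.3 follow the ISLR2 lab and are assumptions here):
modelnn <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu",
              input_shape = c(784)) %>%          # 784 inputs -> 256 hidden units
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")  # 10 class probabilities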
summary(modelnn)
## Model: "sequential_1"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #
## ================================================================================
##  dense_4 (Dense)                    (None, 256)                     200960
##  dropout_2 (Dropout)                (None, 256)                     0
##  dense_3 (Dense)                    (None, 128)                     32896
##  dropout_1 (Dropout)                (None, 128)                     0
##  dense_2 (Dense)                    (None, 10)                      1290
## ================================================================================
## Total params: 235,146
## Trainable params: 235,146
## Non-trainable params: 0
## ________________________________________________________________________________
The parameter count for each layer includes one bias term per unit, which results in a total of 235,146 parameters. For example, the first hidden layer involves (784 + 1) × 256 = 200,960 parameters.
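These counts can be checked directly in R (this snippet is ours):
(784 + 1) * 256   # first hidden layer: 200,960
(256 + 1) * 128   # second hidden layer: 32,896
(128 + 1) * 10    # output layer: 1,290
(784 + 1) * 256 + (256 + 1) * 128 + (128 + 1) * 10   # total: 235,146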
Notice that the layer names such as dropout_1 and dense_2 have subscripts. These may appear somewhat random; in fact, if you fit the same model again, they will change. They are of no consequence: they vary because the model specification code is run in Python, and the subscripts are incremented every time keras_model_sequential() is called.
Next, we add details to the model to specify the fitting algorithm. We fit the model by
minimizing the cross-entropy function.
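For reference, the categorical cross-entropy being minimized is

$$-\sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log f_m(x_i),$$

where $y_{im}$ is 1 if the $i$th image belongs to class $m$ and 0 otherwise, and $f_m(x_i)$ is the model's fitted probability for class $m$.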
modelnn %>% compile(loss = "categorical_crossentropy",
                    optimizer = optimizer_rmsprop(),
                    metrics = c("accuracy"))
Now we are ready to go. The final step is to supply training data, and fit the model.
system.time(
  history <- modelnn %>%
    fit(x_train, y_train, epochs = 30, batch_size = 128,
        validation_split = 0.2)
)
We have suppressed the output here, which is a progress report on the fitting of the model,
grouped by epoch. This is very useful, since on large datasets fitting can take time.
accuracy <- function(pred, truth)
  mean(drop(pred) == drop(truth))
pred <- modelnn %>% predict(x_test) %>% k_argmax()
accuracy(as.numeric(pred), g_test)
## [1] 0.9809
Although packages such as glmnet can handle multiclass logistic regression, they are quite
slow on this large dataset. It is much faster and quite easy to fit such a model using the
keras software. We just have an input layer and output layer, and omit the hidden layers!
modellr <- keras_model_sequential() %>%
  layer_dense(input_shape = 784, units = 10,
              activation = "softmax")
summary(modellr)
## Model: "sequential_2"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #
## ================================================================================
##  dense_5 (Dense)                    (None, 10)                      7850
## ================================================================================
## Total params: 7,850
## Trainable params: 7,850
## Non-trainable params: 0
## ________________________________________________________________________________
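The accuracy reported below requires compiling and fitting modellr, and that code is not shown above. A sketch that proceeds just as for modelnn, reusing the same fitting settings (an assumption here, following the ISLR2 lab):
modellr %>% compile(loss = "categorical_crossentropy",
                    optimizer = optimizer_rmsprop(),
                    metrics = c("accuracy"))
modellr %>% fit(x_train, y_train, epochs = 30,
                batch_size = 128, validation_split = 0.2)
pred_lr <- modellr %>% predict(x_test) %>% k_argmax()  # predicted class labels
accuracy(as.numeric(pred_lr), g_test)                  # test-set accuracy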
## [1] 0.9282