NN MTH404


Chapter 10 – NN

You Liang

2023-03-22

Notes made based on Unit 10 of Introduction to Statistical Learning.

Single Layer Neural Networks


A neural network takes an input vector of 𝑝 variables 𝑋 = (𝑋1 , 𝑋2 , … , 𝑋𝑝 ) and builds a
nonlinear function 𝑓(𝑋) to predict the response 𝑌. Figure 10.1 shows a simple feed-forward
neural network for modeling a quantitative response using 𝑝 = 4 predictors.

The neural network model with 𝐾 hidden units has the form


$$f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k h_k(X) = \beta_0 + \sum_{k=1}^{K} \beta_k\, g\!\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\right).$$

It is built up here in two steps. First the 𝐾 activations 𝐴𝑘 , 𝑘 = 1, … , 𝐾, in the hidden layer are computed as functions of the input features 𝑋1 , 𝑋2 , … , 𝑋𝑝 ,
$$A_k = h_k(X) = g\!\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\right),$$

where 𝑔(𝑧) is a nonlinear activation function that is specified in advance. We can think of
each 𝐴𝑘 as a different transformation ℎ𝑘 (𝑋) of the original features. These 𝐾
activations from the hidden layer then feed into the output layer, resulting in
$$f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k A_k,$$

a linear regression model in the K activations (K = 5 in Figure 10.1).


The name neural network originally derived from thinking of these hidden units as
analogous to neurons in the brain — values of the activations 𝐴𝑘 = ℎ𝑘 (𝑋) close to one are
firing, while those close to zero are silent (using the sigmoid activation function). All the
parameters 𝛽0 , … , 𝛽𝐾 and 𝑤10 , … , 𝑤𝐾𝑝 need to be estimated from data. For a quantitative
response, typically squared-error loss is used, so that the parameters are chosen to
minimize
$$\sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 .$$

Two commonly used activation functions are the sigmoid activation function
$$g(z) = \frac{e^z}{1 + e^z},$$
and the ReLU (rectified linear unit) activation function
$$g(z) = (z)_+ = \max(0, z).$$
A ReLU activation can be computed and stored more efficiently than a sigmoid activation,
and is now generally preferred. The nonlinearity in the activation function 𝑔(·) is essential, since
without it the model f(X) would collapse into a simple linear model in 𝑋1 , … , 𝑋𝑝 . Moreover,
having a nonlinear activation function allows the model to capture complex nonlinearities
and interaction effects.
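To make the formulas above concrete, here is a small sketch (not part of the ISLR lab) that codes the two activation functions and evaluates 𝑓(𝑋) for a single observation; the weights and the input are randomly generated purely for illustration.

sigmoid <- function(z) exp(z) / (1 + exp(z))
relu <- function(z) pmax(z, 0)

set.seed(1)
p <- 4; K <- 5
X <- rnorm(p)                        # one input observation
w0 <- rnorm(K)                       # biases w_k0
W <- matrix(rnorm(K * p), K, p)      # weights w_kj
beta <- rnorm(K + 1)                 # beta_0, beta_1, ..., beta_K

A <- relu(w0 + W %*% X)              # K activations A_k = g(w_k0 + sum_j w_kj X_j)
f_X <- beta[1] + sum(beta[-1] * A)   # f(X) = beta_0 + sum_k beta_k A_k

Replacing relu() with sigmoid() in the line that computes A gives the sigmoid version; over a training set, the squared-error loss is then just the sum of (y_i - f(x_i))^2 across observations.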
We start by fitting the models in Section 10.6. We set up the data, and separate out a
training and test set.
library(ISLR2)
Gitters <- na.omit(Hitters)
n <- nrow(Gitters)
set.seed(13)
ntest <- trunc(n / 3)
testid <- sample(1:n, ntest)

The linear model should be familiar, but we present it anyway.


lfit <- lm(Salary ~ ., data = Gitters[-testid , ])
lpred <- predict(lfit , Gitters[testid , ])
with(Gitters[testid , ], mean(abs(lpred - Salary)))

## [1] 254.6687

Next we fit the lasso using glmnet. Since this package does not use formulas, we create 𝑥
and 𝑦 first.
x <- scale(model.matrix(Salary ~ . - 1, data = Gitters))
y <- Gitters$Salary

library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-8

cvfit <- cv.glmnet(x[-testid, ], y[-testid],
                   type.measure = "mae")
cpred <- predict(cvfit , x[testid , ], s = "lambda.min")
mean(abs(y[testid] - cpred))

## [1] 252.2994

To fit the neural network, we first set up a model structure that describes the network.
library(keras)
modnn <- keras_model_sequential () %>%
layer_dense(units = 50, activation = "relu",
input_shape = ncol(x)) %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 1)

We have created a vanilla model object called modnn, and have added details about the
successive layers in a sequential manner, using the function keras_model_sequential().
We now make the model matrix, and then standardize each of the variables.
x <- model.matrix(Salary ~ . - 1, data = Gitters) %>% scale ()

We now return to our neural network. The object modnn has a single hidden layer with 50
hidden units, and a ReLU activation function. It then has a dropout layer, in which a random
40% of the 50 activations from the previous layer are set to zero during each iteration of
the stochastic gradient descent algorithm. Finally, the output layer has just one unit with no
activation function, indicating that the model provides a single quantitative output.
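As a rough conceptual sketch of what the dropout layer does during training (keras handles this internally; standard "inverted" dropout also rescales the surviving activations by 1/(1 - rate)), one SGD iteration looks like this:

a <- runif(50)                # pretend these are the 50 activations from the hidden layer
keep <- rbinom(50, 1, 0.6)    # each unit kept with probability 0.6, dropped with probability 0.4
a_drop <- a * keep / 0.6      # surviving activations rescaled (inverted dropout)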
We minimize squared-error loss. The algorithm tracks the mean absolute error on the
training data, and on validation data if it is supplied.
modnn %>% compile(loss = "mse",
                  optimizer = optimizer_rmsprop(),
                  metrics = list("mean_absolute_error"))

Now we fit the model. We supply the training data and two fitting parameters, epochs and
batch_size. Using a batch size of 4 means that at each step of SGD, the algorithm randomly
selects 4 training observations for the computation of the gradient. Since the training set
has n = 176, an epoch corresponds to 176/4 = 44 SGD steps. The fit() function has an argument
validation_data; these data are not used in the fitting, but can be used to track the progress
of the model (in this case reporting the mean absolute error). Here we actually supply the
test data so we can see the mean absolute error of both the training data and the test data as
the epochs proceed.
history <- modnn %>% fit(
x[-testid , ], y[-testid], epochs = 50, batch_size = 4,
validation_data = list(x[testid , ], y[testid ])
)

plot(history)
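As a quick sanity check of the epoch arithmetic above, re-derived from the objects created earlier:

n_train <- n - ntest    # 263 - 87 = 176 training observations
n_train / 4             # 44 SGD steps per epoch with batch_size = 4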

Finally, we predict from the final model, and evaluate its performance on the test data. Due
to the use of SGD, the results vary slightly with each fit. Unfortunately the set.seed()
function does not ensure identical results (since the fitting is done in Python), so your
results will differ slightly.
npred <- predict(modnn , x[testid , ])
mean(abs(y[testid] - npred))

## [1] 330.0103

Multilayer Neural Networks


Modern neural networks typically have more than one hidden layer, and often many units
per layer. In theory a single hidden layer with a large number of units has the ability to
approximate most functions. However, the learning task of discovering a good solution is
made much easier with multiple layers each of modest size.

We will illustrate a large dense network on the famous and publicly available MNIST
handwritten digit dataset. The idea is to build a model to classify the images into their
correct digit class 0–9. Every image has p = 28 × 28 = 784 pixels, each of which is an eight-
bit grayscale value between 0 and 255 representing the relative amount of the written digit
in that tiny square. These pixels are stored in the input vector X (in, say, column order). The
output is the class label, represented by a vector 𝑌 = (𝑌0 , 𝑌1 , … , 𝑌9 ) of 10 dummy variables,
with a one in the position corresponding to the label, and zeros elsewhere. In the machine
learning community, this is known as one-hot encoding. There are 60,000 training images,
and 10,000 test images.
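As a tiny illustration of one-hot encoding, using the keras helper that the lab also relies on below, three hypothetical labels become rows of a 10-column indicator matrix:

to_categorical(c(3, 0, 7), 10)   # 3 x 10 matrix; each row has a single 1 in the column for its label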
mnist <- dataset_mnist ()
x_train <- mnist$train$x
dim(x_train)

## [1] 60000 28 28

g_train <- mnist$train$y


x_test <- mnist$test$x
g_test <- mnist$test$y

There are 60,000 images in the training data and 10,000 in the test data. The images are
28×28 = 784 pixels, and stored as a three-dimensional array, so we need to reshape them
into a matrix. Also, we need to "one-hot" encode the class label. Luckily keras has a lot of
built-in functions that do this for us.
x_train <- array_reshape(x_train , c(nrow(x_train), 784))
dim(x_train)

## [1] 60000 784

x_test <- array_reshape(x_test , c(nrow(x_test), 784))


y_train <- to_categorical(g_train , 10)
y_test <- to_categorical(g_test , 10)

Neural networks are somewhat sensitive to the scale of the inputs. For example, ridge and
lasso regularization are affected by scaling. Here the inputs are eight-bit grayscale values
between 0 and 255, so we rescale to the unit interval.
x_train <- x_train / 255
x_test <- x_test / 255

Now we are ready to fit our neural network.


modelnn <- keras_model_sequential ()
modelnn %>%
layer_dense(units = 256, activation = "relu",
input_shape = c(784)) %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 128, activation = "relu") %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 10, activation = "softmax")

The first layer goes from 28 × 28 = 784 input units to a hidden layer of 256 units, which
uses the ReLU activation function. This is specified by a call to layer_dense(), which takes as
input a modelnn object and returns a modified modelnn object. This is then
piped through layer_dropout() to perform dropout regularization. The
second hidden layer comes next, with 128 hidden units, followed by a dropout layer. The
final layer is the output layer, with activation "softmax" for the 10-class classification
problem, which defines the map from the second hidden layer to class probabilities.
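For reference, the softmax activation converts the ten outputs of the final layer, say $Z_0, Z_1, \ldots, Z_9$, into class probabilities via

$$f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{\ell=0}^{9} e^{Z_\ell}}, \qquad m = 0, 1, \ldots, 9,$$

so the ten estimated probabilities are nonnegative and sum to one.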
summary(modelnn)
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                     Output Shape                  Param #
## ================================================================================
## dense_4 (Dense)                  (None, 256)                   200960
## dropout_2 (Dropout)              (None, 256)                   0
## dense_3 (Dense)                  (None, 128)                   32896
## dropout_1 (Dropout)              (None, 128)                   0
## dense_2 (Dense)                  (None, 10)                    1290
## ================================================================================
## Total params: 235,146
## Trainable params: 235,146
## Non-trainable params: 0
## ________________________________________________________________________________

The parameters for each layer include a bias term, which results in a parameter count of
235,146. For example, the first hidden layer involves (784 + 1) × 256 = 200,960
parameters.
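These counts can be verified with simple arithmetic, independent of keras: each dense layer has (inputs + 1) × units parameters, and the dropout layers add none.

c((784 + 1) * 256, (256 + 1) * 128, (128 + 1) * 10)      # 200960 32896 1290
sum((784 + 1) * 256, (256 + 1) * 128, (128 + 1) * 10)    # 235146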
Notice that the layer names such as dropout_1 and dense_2 have subscripts. These may
appear somewhat random; in fact, if you fit the same model again, they will change. They
are of no consequence: they vary because the model specification code is run in Python,
and these subscripts are incremented every time keras_model_sequential() is called.
Next, we add details to the model to specify the fitting algorithm. We fit the model by
minimizing the cross-entropy function.
modelnn %>% compile(loss = "categorical_crossentropy",
                    optimizer = optimizer_rmsprop(),
                    metrics = c("accuracy"))
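For completeness, the categorical cross-entropy being minimized is the negative multinomial log-likelihood

$$-\sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log\big(f_m(x_i)\big),$$

where $y_{im}$ equals 1 if observation $i$ belongs to class $m$ and 0 otherwise, and $f_m(x_i)$ is the model's fitted probability for class $m$.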

Now we are ready to go. The final step is to supply training data, and fit the model.
system.time(
history <- modelnn %>%
fit(x_train , y_train , epochs = 30, batch_size = 128,
validation_split = 0.2)
)

##    user  system elapsed
##  90.843  39.359  34.318

plot(history, smooth = FALSE)

We have suppressed the output here, which is a progress report on the fitting of the model,
grouped by epoch. This is very useful, since on large datasets fitting can take time.
accuracy <- function(pred, truth)
  mean(drop(pred) == drop(truth))
pred <- modelnn %>% predict(x_test) %>% k_argmax()
accuracy(as.numeric(pred), g_test)

## [1] 0.9809

Although packages such as glmnet can handle multiclass logistic regression, they are quite
slow on this large dataset. It is much faster and quite easy to fit such a model using the
keras software. We just have an input layer and output layer, and omit the hidden layers!
modellr <- keras_model_sequential () %>%
layer_dense(input_shape = 784, units = 10,
activation = "softmax")
summary(modellr)

## Model: "sequential_2"
## ________________________________________________________________________________
## Layer (type)                     Output Shape                  Param #
## ================================================================================
## dense_5 (Dense)                  (None, 10)                    7850
## ================================================================================
## Total params: 7,850
## Trainable params: 7,850
## Non-trainable params: 0
## ________________________________________________________________________________

We fit the model just as before.


modellr %>% compile(loss = "categorical_crossentropy",
optimizer = optimizer_rmsprop (), metrics = c("accuracy"))
modellr %>% fit(x_train , y_train , epochs = 30,
batch_size = 128, validation_split = 0.2)

pred <- modellr %>% predict(x_test) %>% k_argmax()


accuracy(as.numeric(pred), g_test)

## [1] 0.9282
