
Repeat before Forgetting: Spaced Repetition for Efficient and Effective Training of Neural Networks

Hadi Amiri, Timothy A. Miller, Guergana Savova


Boston Children’s Hospital Informatics Program, Harvard Medical School
{firstname.lastname}@childrens.harvard.edu

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2401-2410, Copenhagen, Denmark, September 7-11, 2017. © 2017 Association for Computational Linguistics.

Abstract

We present a novel approach for training artificial neural networks. Our approach is inspired by broad evidence in psychology that shows human learners can learn efficiently and effectively by increasing intervals of time between subsequent reviews of previously learned materials (spaced repetition). We investigate the analogy between training neural models and findings in psychology about the human memory model, and develop an efficient and effective algorithm to train neural models. The core part of our algorithm is a cognitively-motivated scheduler according to which training instances and their "reviews" are spaced over time. Our algorithm uses only 34-50% of data per epoch, is 2.9-4.8 times faster than standard training, and outperforms competing state-of-the-art baselines.1

1 Our code is available at scholar.harvard.edu/hadi/RbF/

1 Introduction

Deep neural models are known to be computationally expensive to train even with fast hardware (Sutskever et al., 2014; Wu et al., 2016). For example, it takes three weeks to train a deep neural machine translation system on 100 Graphics Processing Units (GPUs) (Wu et al., 2016). Furthermore, a large amount of data is usually required to train effective neural models (Goodfellow et al., 2016; Hirschberg and Manning, 2015).

Bengio et al. (2009) and Kumar et al. (2010) developed training paradigms which are inspired by the learning principle that humans can learn more effectively when training starts with easier concepts and gradually proceeds with more difficult concepts. Since these approaches are motivated by a "starting small" strategy, they are called curriculum or self-paced learning.

In this paper, we present a novel training paradigm which is inspired by the broad evidence in psychology that human ability to retain information improves with repeated exposure and exponentially decays with delay since last exposure (Cepeda et al., 2006; Averell and Heathcote, 2011). Spaced repetition was presented in psychology (Dempster, 1989) and forms the building block of many educational devices, including flashcards, in which small pieces of information are repeatedly presented to a learner on a schedule determined by a spaced repetition algorithm. Such algorithms show that human learners can learn efficiently and effectively by increasing intervals of time between subsequent reviews of previously learned materials (Dempster, 1989; Novikoff et al., 2012).

We investigate the analogy between training neural models and findings in psychology about the human memory model, and develop a spaced repetition algorithm (named Repeat before Forgetting, RbF) to efficiently and effectively train neural models. The core part of our algorithm is a scheduler that ensures a given neural network spends more time working on difficult training instances and less time on easier ones. Our scheduler is inspired by factors that affect human memory retention, namely, the difficulty of learning materials, the delay since their last review, and the strength of memory. The scheduler uses these factors to lengthen or shorten review intervals with respect to individual learners and training instances. We evaluate schedulers based on their scheduling accuracy, i.e., accuracy in estimating network memory retention with respect to previously-seen instances, as well as their effect on the efficiency and effectiveness of downstream neural networks.2

2 In this paper, we use the terms memory retention, recall, and learning interchangeably.

The contributions of this paper are: (1) we show that memory retention in neural networks is affected by the same (known) factors that affect memory retention in humans, (2) we present a novel training paradigm for neural networks based on spaced repetition, and (3) our approach can be applied without modification to any neural network. Our best RbF algorithm uses 34-50% of training data per epoch while producing similar results to state-of-the-art systems on three tasks, namely sentiment classification, image categorization, and arithmetic addition.3 It also runs 2.9-4.8 times faster than standard training, and outperforms competing state-of-the-art baselines.

3 We obtained similar results on QA tasks (Weston et al., 2016) but they are excluded due to the space limit.

2 Neural and Brain Memory Models

Research in psychology describes the following memory model for human learning: the probability that a human recalls a previously-seen item (e.g., the Korean translation of a given English word) depends on the difficulty of the item, the delay since the last review of the item, and the strength of the human memory. The relation between these indicators and memory retention has the following functional form (Reddy et al., 2016; Ebbinghaus, 1913):

    Pr(recall) = exp(-(difficulty × delay) / strength).    (1)

An accurate memory model enables estimating the time by which an item might be forgotten by a learner so that a review can be scheduled for the learner before that time.
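For illustration (this rearrangement is ours, not part of the original formulation), taking logarithms of both sides of Equation (1) makes the scheduling use of the model explicit: recall stays above a confidence threshold η as long as delay ≤ -strength × ln(η) / difficulty. Harder items must therefore be reviewed sooner, while stronger memories can wait longer.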
We investigate the analogy between the above memory model and the memory model of artificial neural networks. Our intuition is that if the probability that a network recalls an item (e.g., correctly predicts its category) depends on the same factors (difficulty of the item, delay since last review of the item, or strength of the network), then we can develop spaced repetition algorithms to efficiently and effectively train neural networks.

2.1 Recall Indicators

We design a set of preliminary experiments to directly evaluate the effect of the aforementioned factors (recall indicators) on memory retention in neural networks. For this purpose, we use a set of training instances that are partially made available to the network during training. This scheme will allow us to intrinsically examine the effect of recall indicators on memory retention in isolation from external effects such as the size of training data, number of training epochs, etc.

We first define the following concepts to ease understanding of the experiments (see Figure 1):

• First and Last review points (fRev and lRev) of a training instance are the first and last epochs in which the instance is used to train the network respectively,

• Recall point (Rec) is the epoch in which network retention is computed against some training instances; network retention is the probability that a neural network recalls (i.e. correctly classifies) a previously-seen training instance, and

• Delay since last review of a training instance is the difference between the recall point and the last review point of the training instance.

Figure 1: Effect of recall indicators on network retention. Training data is uniformly at random divided into three disjoint sets A, B, and C that respectively contain 80%, 10%, and 10% of the data. Network retention is computed against set B instances at recall point.

Given training data and a neural network, we uniformly at random divide the data into three disjoint sets: a base set A, a review set B, and a replacement set C that respectively contain 80%, 10%, and 10% of the data. As depicted in Figure 1, instances of A are used for training at every epoch, while those in B and C are partially used for training. The network initially starts to train with {A ∪ C} instances. Then, starting from the first review point, we inject the review set B and remove C, training with {A ∪ B} instances at every epoch until the last review point. The network then continues training with {A ∪ C} instances until the recall point. At this point, network retention is computed against set B instances, with delay defined as the number of epochs since the last review point. The intuition behind using review and replacement sets, B and C respectively, is to avoid external effects (e.g. size of data or network generalization and learning capability) for our intrinsic evaluation purpose.
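For concreteness, the following minimal Python sketch mirrors this protocol (an illustration, not our released code); train_one_epoch and accuracy stand in for the network-specific training and evaluation routines, and f_rev, l_rev, and rec are the review-window and recall epochs:

import random

# Sketch of the recall-indicator experiment (illustrative, not the released
# code). train_one_epoch(instances) trains the network for one epoch;
# accuracy(instances) returns the network's accuracy on those instances.
def recall_experiment(data, train_one_epoch, accuracy, f_rev, l_rev, rec):
    random.shuffle(data)
    n = len(data)
    a = data[:int(0.8 * n)]              # base set A: trained on at every epoch
    b = data[int(0.8 * n):int(0.9 * n)]  # review set B
    c = data[int(0.9 * n):]              # replacement set C
    for epoch in range(rec):
        if f_rev <= epoch <= l_rev:
            train_one_epoch(a + b)       # review window: train on A ∪ B
        else:
            train_one_epoch(a + c)       # otherwise train on A ∪ C
    return accuracy(b)                   # network retention on B at the recall point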

Figure 2: (a) Delay since last review vs. average network retention (accuracy) on set B instances at recall point. Recall point is fixed and set to the epoch in which networks obtain their best performance based on rote training. (b) The same as (a) except that recall point is set to the epoch in which networks obtain half of their best performance based on rote training. (c) Item difficulty (normalized loss at last review point) vs. average network retention at recall point on set B instances. (d) Network strength (network accuracy on validation data at recall point) vs. average network retention at recall point on set B instances. Length of sliding window is fixed throughout experiments and set to 5 epochs.

To conduct these experiments, we identify different neural models designed for different tasks.4 For each network, we fix the recall point to either the epoch in which the network is fully trained (i.e., obtains its best performance based on standard or "rote" training, in which all instances are used for training at every iteration), or partially trained (i.e., obtains half of its best performance based on rote training). We report average results across these networks for each experiment.

4 See Section 4; we use the Addition and CIFAR10 datasets and their corresponding neural networks for these experiments.

2.1.1 Delay since Last Review

As aforementioned, the delay since last review of a training instance is the difference between the recall point (Rec) and the last review point (lRev) of the training instance. We evaluate the effect of delay on network retention (against set B instances) by keeping the recall point fixed while moving the sliding window in Figure 1. Figures 2(a) and 2(b) show average network retention across networks for the fully and partially trained recall points respectively. The results show an inverse relationship between network retention and delay since last review in neural networks.

2.1.2 Item Difficulty

We define the difficulty of training instances by the loss values generated by a network for the instances. Figure 2(c) shows the difficulty of set B instances at the last review point against average network retention on these instances at recall point. We normalize loss values to unit vectors (to make them comparable across networks) and then average them across networks for both fully and partially trained recall points. As the results show, network retention decreases as item difficulty increases.

2.1.3 Network Strength

We define the strength of a network by its performance on validation data. To understand the effect of network strength on its retention, we use the same experimental setup as before, except that we keep the delay (difference between recall point and last review point) fixed while gradually increasing the recall point; this makes the networks stronger by training them for more epochs. Then, at every recall point, we record network retention on set B instances and network accuracy on validation data. Average results across networks for two sets of 10 consecutive recall points (before fully and partially trained recall points) are shown in Figure 2(d). As the results show, network retention increases as memory strength increases.

The above experiments show that memory retention in neural networks is affected by the same factors that affect memory retention in humans: (a) neural networks forget training examples after a certain period of intervening training data, (b) the period of recall is shorter for more difficult examples, and (c) recall improves as networks achieve better overall performance. We conclude that delay since last review, item difficulty (loss values of training instances), and memory strength (network performance on validation data) are key indicators that affect network retention, and propose to design spaced repetition algorithms that take such indicators into account in training neural networks.

3 Spaced Repetition

We present two spaced repetition-based algorithms: a modified version of the Leitner system developed in Reddy et al. (2016), and our Repeat before Forgetting (RbF) model.

3.1 Leitner System

Suppose we have n queues {q0, q1, ..., qn-1}. The Leitner system initially places all training instances in the first queue, q0. As Algorithm 1 shows, at each training iteration, the Leitner scheduler chooses some queues to train a downstream neural network. Only instances in the selected queues will be used for training the network. During training, if an instance from qi is recalled (e.g. correctly classified) by the network, the instance will be "promoted" to qi+1, otherwise it will be "demoted" to the first queue, q0.5

The Leitner scheduler reviews instances of qi at every 2^i iterations. Therefore, instances in lower queues (difficult/forgotten instances) are reviewed more frequently than those in higher queues (easy/recalled ones). Figure 3 (bottom) provides examples of queues and their processing epochs. Note that the overhead imposed on training by the Leitner system is O(|current_batch|) at every epoch for moving instances between queues.

5 Note that in Reddy et al. (2016) demoted instances are moved to qi-1. We observed significant improvement in the Leitner system by moving such instances to q0 instead of qi-1.

Algorithm 1. Leitner System
Input: H: training data, V: validation data, k: number of iterations, n: number of queues
Output: trained model
0  Q = [q0, q1, ..., qn-1]
1  q0 = [H], qi = [] for i in [1, n-1]
2  For epoch = 1 to k:
3      current_batch = []
4      For i = 0 to n-1:
5          If epoch % 2^i == 0:
6              current_batch = current_batch + qi
7      End For
8      pmos, dmos, model = train(current_batch, V)
9      update_queue(Q, pmos, dmos)
10 End For
11 return model

q0 epochs = {1, 2, 3, 4, 5, ...}
q1 epochs = {2, 4, 6, 8, 10, ...}
q2 epochs = {4, 8, 12, 16, 20, ...}
...

Figure 3: Leitner System. The train(.) function trains the network for one epoch using instances in the current batch, and the update_queue(.) function promotes the recalled (correctly classified) instances, pmos, to the next queue and demotes the forgotten ones, dmos, to q0.
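For concreteness, a minimal runnable rendering of this scheduler in Python (our paraphrase of Algorithm 1; train_epoch is an assumed helper that trains the network for one epoch on the given instance indices and returns the set of indices it classified correctly):

# Python sketch of the Leitner scheduler (paraphrase of Algorithm 1).
# train_epoch(indices) is an assumed helper: it trains the network for one
# epoch on those training instances and returns the set of indices that
# were recalled (correctly classified).
def leitner_train(num_instances, train_epoch, k, n=5):
    queue_of = [0] * num_instances                 # every instance starts in q0
    for epoch in range(1, k + 1):
        # qi is reviewed at every 2^i-th epoch
        current = [i for i in range(num_instances)
                   if epoch % (2 ** queue_of[i]) == 0]
        recalled = train_epoch(current)
        for i in current:
            if i in recalled:
                queue_of[i] = min(queue_of[i] + 1, n - 1)  # promote to q_{i+1}
            else:
                queue_of[i] = 0                            # demote to q0 (Footnote 5)

Capping promotion at qn-1 keeps the longest review interval at 2^(n-1) epochs.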
3.2 RbF Model

3.2.1 RbF Memory Models

The challenge in developing memory models is to estimate the time by which a training instance should be reviewed before it is forgotten by the network. Accurate estimation of the review time leads to efficient and effective training. However, a heuristic scheduler such as the Leitner system is suboptimal, as its hard review schedules (i.e. only 2^i-iteration delays) may lead to early or late reviews.

We develop flexible schedulers that take recall indicators into account in the scheduling process. Our schedulers lengthen or shorten inter-repetition intervals with respect to individual training instances. In particular, we propose using density kernel functions to estimate the latest epoch in which a given training instance can be recalled. We aim to investigate how much improvement (in terms of efficiency and effectiveness) can be achieved using more flexible schedulers that utilize the recall indicators.

We propose considering density kernels as schedulers that favor (i.e., more confidently delay) less difficult training instances in stronger networks. As a kernel we can use any non-increasing function of the following quantity:

    x_i = (d_i × t_i) / s_e,    (2)

where d_i indicates the loss of the network for a training instance h_i ∈ H, t_i indicates the number of epochs to the next review of h_i, and s_e indicates the performance of the network (on validation data) at epoch e. We investigate the Gaussian, Laplace, Linear, Cosine, Quadratic, and Secant kernels as described below respectively:

    f_gau(x, τ) = exp(-τ x²),    (3)

    f_lap(x, τ) = exp(-τ x),    (4)

    f_lin(x, τ) = 1 - τ x if x < 1/τ, and 0 otherwise,    (5)

    f_cos(x, τ) = (cos(τ π x) + 1) / 2 if x < 1/τ, and 0 otherwise,    (6)

    f_qua(x, τ) = 1 - τ x² if x² < 1/τ, and 0 otherwise,    (7)

    f_sec(x, τ) = 2 / (exp(-τ x²) + exp(τ x²)),    (8)

where τ is a learning parameter. Figure 4 depicts these kernels with τ = 1. As we will discuss in the next section, we use these kernels to optimize delay with respect to item difficulty and network strength for each training instance.

Figure 4: RbF kernel functions with τ = 1.
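These kernels translate directly into code; the following small NumPy sketch of Equations (3)-(8) is ours, for illustration:

import numpy as np

# The six RbF kernels of Equations (3)-(8); x is the quantity defined in
# Equation (2) and tau > 0 is the learning parameter.
def f_gau(x, tau): return np.exp(-tau * x ** 2)
def f_lap(x, tau): return np.exp(-tau * x)
def f_lin(x, tau): return np.where(x < 1 / tau, 1 - tau * x, 0.0)
def f_cos(x, tau): return np.where(x < 1 / tau, (np.cos(tau * np.pi * x) + 1) / 2, 0.0)
def f_qua(x, tau): return np.where(x ** 2 < 1 / tau, 1 - tau * x ** 2, 0.0)
def f_sec(x, tau): return 2 / (np.exp(-tau * x ** 2) + np.exp(tau * x ** 2))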
3.2.2 RbF Algorithm

Our Repeat before Forgetting (RbF) model is a spaced repetition algorithm that takes into account the previously validated recall indicators to train neural networks, see Algorithm 2. RbF divides training instances into current and delayed batches based on their delay values at each iteration. Instances in the current batch are those whose recall RbF is less confident about, and they are therefore reviewed (used to re-train the network) at the current iteration. On the other hand, instances in the delayed batch are those that are likely to be recalled by the network in the future and therefore are not reviewed at the current epoch. At each iteration, the RbF scheduler estimates the optimum delay (number of epochs to next review) for each training instance in the current batch. RbF makes such item-specific estimations as follows:

Given the difficulty of a training instance d_i, the memory strength of the neural network at epoch e, s_e, and an RbF memory model f (see Section 3.2.1), the RbF scheduler estimates the maximum delay t̂_i for the instance such that it can be recalled with a confidence greater than the given threshold η ∈ (0, 1) at time e + t̂_i. As described before, d_i and s_e can be represented by the current loss of the network for the instance and the current performance of the network on validation data respectively. Therefore, the maximum delay between the current (epoch e) and next reviews of the instance can be estimated as follows:

    t̂_i = arg min_{t_i} |f(x_i, τ̂) - η|²,    s.t. 1 ≤ t_i ≤ k - e,    (9)

where τ̂ is the optimum value for the learning parameter obtained from validation data, see Equation (10). In principle, reviewing instances could be delayed for any number of epochs; in practice however, delay is bounded both below and above (e.g., by queues in the Leitner system). Thus, we assume that, at each epoch e, instances can be delayed for at least one iteration and at most k - e iterations, where k is the total number of training epochs. We also note that the estimated t̂_i is a lower bound of the maximum delay, as s_e is expected to increase and d_i is expected to decrease as the network trains in subsequent iterations.

Algorithm 2 shows the outline of the proposed RbF model. We estimate the optimum value of τ (line 5 of Algorithm 2) for RbF memory models using validation data. In particular, RbF uses the loss values of validation instances and the strength of the network obtained at the previous epoch to estimate network retention for validation instances at the current epoch (therefore t_i = 1 for every validation instance). The parameter τ for each memory model is computed as follows:

    τ̂ = arg min_τ |f(x_j, τ) - a_j|²,    ∀h_j ∈ V, a_j ≥ η,    (10)

where a_j ∈ (0, 1) is the current accuracy of the model for the validation instance h_j. RbF then predicts the delay for current batch instances and reduces the delay for those in the delayed batch by one epoch. The overhead of RbF is O(|H|) to compute delays and O(|V|) to compute τ̂. Note that (9) and (10) have closed-form solutions.
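For instance (this worked example is ours), with the Gaussian kernel, setting f_gau(x_i, τ̂) = η and substituting Equation (2) gives exp(-τ̂ (d_i t_i / s_e)²) = η, so the unconstrained optimum of (9) is t̂_i = (s_e / d_i) √(ln(1/η) / τ̂), which is then clipped to [1, k - e]; analogous expressions follow for the other kernels by inverting each f.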
Algorithm 2. RbF Training Model
Input: H: training data, V: validation data, k: number of iterations, f: RbF kernel, η: recall confidence
Output: trained model
0  t_i = 1 for h_i ∈ H
1  For epoch = 1 to k:
2      current_batch = {h_i : t_i <= 1}
3      delayed_batch = {h_i : t_i > 1}
4      s_epoch, model = train(current_batch, V)
5      τ̂ = arg min_τ |f(x_i, τ) - a_i|²  ∀h_i ∈ V, a_i ≥ η
6      t̂_i = arg min_{t_i} |f(x_i, τ̂) - η|²  ∀h_i ∈ current_batch
7      t_i = t_i - 1  ∀h_i ∈ delayed_batch
8  End For
9  return model

Figure 5: RbF training model. The train(.) function at line 4 trains the network for one epoch using instances in the current batch. Note that at each iteration epoch, x_i is computed using Equation (2) and the strength of the current model, s_epoch.
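For concreteness, a compact Python sketch of this loop using the Gaussian kernel and its closed-form delay (an illustration, not our released implementation; train_epoch and fit_tau stand in for the epoch-level training routine and the fit of Equation (10) on validation data):

import numpy as np

# Python sketch of RbF training (paraphrase of Algorithm 2, Gaussian kernel).
# train_epoch(indices) is an assumed helper that trains for one epoch and
# returns (per-instance losses d_i, validation accuracy s_e); fit_tau(s_e)
# is an assumed helper solving Equation (10) on validation data.
def rbf_train(num_instances, train_epoch, fit_tau, k, eta=0.5):
    t = np.ones(num_instances)            # every instance is due at epoch 1
    for epoch in range(1, k + 1):
        current = np.where(t <= 1)[0]     # instances whose delay has run out
        losses, s_e = train_epoch(current)
        tau = fit_tau(s_e)
        # Equation (9), Gaussian kernel, closed form:
        # exp(-tau * (d_i * t_i / s_e)^2) = eta => t_i = (s_e/d_i) * sqrt(ln(1/eta)/tau)
        t_hat = (s_e / np.maximum(losses, 1e-8)) * np.sqrt(np.log(1.0 / eta) / tau)
        t[current] = np.clip(np.floor(t_hat), 1, max(k - epoch, 1))
        delayed = np.setdiff1d(np.arange(num_instances), current)
        t[delayed] -= 1                   # count down toward the next review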

Table 1: Datasets, models, and tasks.

Dataset  | train, dev, test | Network                                          | Task
IMDb     | 20K, 5K, 25K     | MLP/fastText (Joulin et al., 2017), best epoch=8 | sentiment analysis
CIFAR10  | 45K, 5K, 10K     | CNN (Chan et al., 2015), best epoch=64           | image classification
Addition | 40K, 5K, 10K     | LSTM (Sutskever et al., 2014), best epoch=32     | arithmetic addition

4 Experiments

Table 1 describes the tasks, datasets, and models that we consider in our experiments. It also reports the training epochs for which the models produce their best performance on validation data (based on rote training). We note that the Addition dataset is randomly generated and contains numbers with at most 4 digits.6

6 https://github.com/fchollet/keras/blob/master/examples/addition_rnn.py
We consider three schedulers as baselines: a slightly modified version of the Leitner scheduler (Lit) developed in Reddy et al. (2016) for human learners (see Footnote 5), curriculum learning (CL) in which training instances are scheduled with respect to their easiness (Jiang et al., 2015), and the uniform scheduler of rote training (Rote) in which all instances are used for training at every epoch. For Lit, we experimented with different queue lengths, n = {3, 5, 7}, and set n = 5 in the experiments as this value led to the best performance of this scheduler across all datasets.

Curriculum learning starts training with easy instances and gradually introduces more complex instances for training. Since easiness information is not readily available in most datasets, previous approaches have used heuristic techniques (Spitkovsky et al., 2010; Basu and Christensen, 2013) or optimization algorithms (Jiang et al., 2015, 2014) to quantify the easiness of training instances. These approaches consider an instance as easy if its loss is smaller than a threshold (λ). We adopt this technique as follows: at each iteration e, we divide the entire training data into easy and hard sets using an iteration-specific λ_e and the loss values of instances, obtained from the current partially-trained network. All easy instances, in conjunction with an α_e ∈ [0, 1] fraction of the easiest hard instances (those with the smallest loss values greater than λ_e), are used for training at iteration e. We set each λ_e to the average loss of training instances that are correctly classified by the current partially-trained network. Furthermore, at each iteration e, we set α_e = e/k to gradually introduce complex instances at every new iteration.7 Note that we treat all instances as easy at e = 0.

7 k is the total number of iterations.
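A small NumPy sketch (ours, for illustration) of this per-iteration selection step:

import numpy as np

# Sketch of the curriculum-learning selection step described above.
# losses[i] is the current loss of training instance i; lam is the
# iteration-specific threshold lambda_e and alpha = e / k.
def cl_select(losses, lam, alpha):
    easy = np.where(losses <= lam)[0]
    hard = np.where(losses > lam)[0]
    hard = hard[np.argsort(losses[hard])]  # easiest hard instances first
    n_admit = int(alpha * len(hard))       # admit an alpha_e fraction of them
    return np.concatenate([easy, hard[:n_admit]])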
Performance values reported in experiments are averaged over 10 runs of the systems, and the confidence parameter η is always set to 0.5 unless otherwise stated.

4.1 Evaluation of Memory Models

In these experiments, we evaluate memory schedulers with respect to their accuracy in predicting network retention for delayed instances. Since curriculum learning does not estimate delay for training instances, we only consider the Leitner and RbF schedulers in these experiments.

For this evaluation, if a scheduler predicts a delay t for a training instance h at epoch e, we evaluate network retention with respect to h at epoch e + t. If the network recalls (correctly classifies) the instance at epoch e + t, the scheduler has correctly predicted network retention for h; otherwise, it has made a wrong prediction. We use this binary outcome to evaluate the accuracy of each scheduler. Note that the performance of schedulers on instances that have not been delayed is not a major concern. Although failing to delay an item inversely affects efficiency, it makes the network stronger by providing more instances to train from. Therefore, we consider a good scheduler to be one that accurately delays more items.
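This bookkeeping is straightforward to implement; a Python sketch (ours, for illustration):

# Sketch of the scheduling-accuracy bookkeeping described above. When a
# scheduler delays instance h at epoch e by t epochs, we record a prediction
# due at epoch e + t; when that epoch ends, the prediction is scored against
# whether the network classified h correctly.
pending = {}   # instance id -> epoch at which recall is predicted

def record_delay(h, epoch, t):
    pending[h] = epoch + t

def score_epoch(epoch, correctly_classified):
    hits, misses = 0, 0
    for h in [h for h, due in pending.items() if due == epoch]:
        if h in correctly_classified:
            hits += 1      # retention correctly predicted
        else:
            misses += 1
        del pending[h]
    return hits, misses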

Figure 6: Accuracy of schedulers in predicting network retention. For these experiments, recall confidence is set to its default value, η = 0.5.

Figure 6 depicts the average accuracy of schedulers in predicting networks' retention versus the average fraction of training instances that they delayed per epoch. As the results show, all schedulers delay a substantial amount of instances per epoch. In particular, Cos and Qua outperform Lit in both predicting network retention and delaying items, delaying around 50% of training instances per epoch, while Gau and Sec show accuracy comparable to Lit but delay more instances. On the other hand, Lap, which has been found effective in psychology, and Lin are less accurate in predicting network retention. This is because of the trade-off between delaying more instances and creating stronger networks: since these schedulers are more flexible in delaying a greater amount of instances, they might not provide networks with enough data to fully train.

Figure 7: Effect of recall confidence η on the accuracy of different schedulers in predicting network retention (best seen in color).

Figure 7 shows the performance of RbF schedulers with respect to the recall confidence parameter η, see Equation (9). As the results show, schedulers have poor performance with smaller values of η. This is because smaller values of η make schedulers very flexible in delaying instances. However, the performance of schedulers is not dramatically low even with very small ηs. Our further analyses of the delay patterns show that although a smaller η leads to more delayed instances, the delays are significantly shorter. Therefore, most delayed instances will be "reviewed" shortly in the next epochs. These bulk reviews make the network stronger and help it to recall most delayed instances in future iterations.

On the other hand, greater ηs lead to more accurate schedulers at the cost of using more training data. In fact, we found that larger ηs do not delay most training instances in the first few iterations. However, once the network obtains a reasonably high performance, schedulers start delaying instances for longer durations. We will further study this effect in the next section.

4.2 Efficiency and Effectiveness

We compare RbF against Leitner and curriculum learning in terms of the efficiency of training and the effectiveness of trained models. We define effectiveness as the accuracy of a trained network on balanced test data, and efficiency as (a) the fraction of instances used for training per epoch, and (b) the time required for training the networks. For RbF schedulers, we set η to 0.5 and also consider the best performing kernel, Cosine with η = 0.9, based on the results in Figure 7.

Table 2: Comparison of schedulers in terms of average network accuracy, average fraction of instances used for training per epoch (TIPE), and the extent to which a model runs faster than Rote training (X Times Faster). The Gain column indicates the Accuracy/TIPE improvement over rote training.

Model     | Accuracy | TIPE | X Faster | Gain
CL        | 0.868    | 0.71 | 0.93     | 1.40
Lit       | 0.859    | 0.67 | 2.87     | 1.85
Gau       | 0.874    | 0.48 | 3.02     | 2.16
Lap       | 0.857    | 0.34 | 4.66     | 3.15
Lin       | 0.864    | 0.36 | 4.78     | 3.07
Cos       | 0.868    | 0.49 | 2.90     | 2.10
Qua       | 0.871    | 0.50 | 2.95     | 2.08
Sec       | 0.866    | 0.44 | 3.09     | 2.33
Cos η=0.9 | 0.880    | 0.76 | 2.36     | 1.42
Rote      | 0.887    | 1.00 | 1.00     | 1.00

The results in Table 2 show that all training paradigms have comparable effectiveness (Accuracy) to that of rote training (Rote). Our RbF schedulers use less data per epoch (34-50% of data) and run considerably faster than Rote (2.90-4.78 times faster for η = 0.5). The results also show that Lit is slightly less accurate but runs 2.87 times faster than Rote; note that, as a scheduler, Lit is less accurate than the RbF models, see Figures 6 and 7.

In addition, CL leads to comparable performance to RbF but is considerably slower than the other schedulers. This is because this scheduler has to identify easier instances and sort the harder ones to sample training data at each iteration. Overall, the accuracies of Lit, CL, Cos η = .5, and Cos η = .9 are only 2.76, 1.90, 1.88, and 0.67 absolute points lower than that of Rote respectively. Considering the achieved efficiency, these differences are negligible (see the overall gain in Table 2).

Figure 8 reports detailed efficiency and effectiveness results across datasets and networks. For clear illustration, we report accuracy at iterations 2^i ∀i, in which Lit is trained on the entire data, and consider Cos η = .5 as the RbF scheduler. In terms of efficiency (first row of Figure 8), CL starts with a small set of easier instances and gradually increases the amount of training data by adding slightly harder instances into its training set.

Figure 8: Efficiency and effectiveness of schedulers across three datasets and networks: (a) Efficiency IMDb, (b) Efficiency Addition, (c) Efficiency CIFAR10, (d) Effectiveness IMDb, (e) Effectiveness Addition, (f) Effectiveness CIFAR10. RbF uses Cos η = .5 as kernel. CL starts with a small set of easier instances and gradually incorporates slightly harder instances at each iteration. Lit and RbF start big and gradually delay reviewing easy instances.

On the other hand, Lit and RbF start big and gradually delay reviewing (easy) instances that the networks have learned. The difference between these two training paradigms is apparent in Figures 8(a)-8(c).

The results also show that the efficiency of a training paradigm depends on the initial effectiveness of the downstream neural network. For CL to be efficient, the neural network needs to initially have low performance (accuracy) so that the scheduler works on a smaller set of easy instances. For example, in the case of Addition, Figures 8(b) and 8(e), the initial network accuracy is only 35%, therefore most instances are expected to be initially treated as hard instances and not used for training. On the other hand, CL shows considerably lower efficiency for networks with relatively high initial accuracy, e.g. in the case of IMDb or CIFAR10, where the initial network accuracy is above 56%, see Figures 8(a) and 8(d), and 8(c) and 8(f) respectively.

In contrast to CL, Lit and RbF are more efficient when the network has a relatively higher initial performance. A higher initial performance helps the schedulers to more confidently delay "reviewing" most instances and therefore train with a much smaller set of instances. For example, since the initial network accuracy on IMDb or CIFAR10 is above 56%, Lit and RbF are considerably more efficient from the beginning of the training process. However, in the case of low initial performance, Lit and RbF tend to avoid delaying instances at lower iterations, which leads to poor efficiency at the beginning. This is the case for the Addition dataset, in which instances are only gradually delayed by these two schedulers even at epoch 8, when the performance of the network reaches above 65%, see Figures 8(e) and 8(b). However, Lit gains its true efficiency after iteration 12, see Figure 8(b), while RbF still gradually improves its efficiency. This might be because of the lower-bound delays that RbF estimates, see Equation (9).

Furthermore, the effectiveness results in Figure 8 (bottom) show that all schedulers produce comparable accuracy to the Rote scheduler throughout the training process, not just at specific iterations. This indicates that these training paradigms can achieve the same generalizability as standard training much faster, see Figures 8(b) and 8(e).

Figure 9: Robustness against overtraining.

4.3 Robustness against Overtraining

We investigate the effect of spaced repetition on overtraining. The optimal number of training epochs required to train fastText on the IMDb dataset is 8 epochs (see Table 1). In this experiment, we run fastText on IMDb for a greater number of iterations to investigate the robustness of different schedulers against overtraining. The results in Figure 9 show that Lit and RbF (Cos η = 0.5) are more robust against overtraining. In fact, the performance of Lit and RbF further improves at epoch 16, while CL and Rote overfit at epoch 16 (note that CL and Rote also require a considerably greater amount of time to reach higher iterations). We attribute the robustness of Lit and RbF to the scheduling mechanism, which helps the networks avoid retraining with easy instances. On the other hand, overtraining affects Lit and RbF at higher training iterations; compare the performance of each scheduler at epochs 8 and 32. This might be because these training paradigms overfit the network by paying too much training attention to very hard instances, which might introduce noise into the model.

5 Related Work

Ebbinghaus (1913, 2013), and recently Murre and Dros (2015), studied the hypothesis of the exponential nature of forgetting, i.e. how information is lost over time when there is no attempt to retain it. Previous research identified three critical indicators that affect the probability of recall: repeated exposure to learning materials, elapsed time since their last review (Ebbinghaus, 1913; Wixted, 1990; Dempster, 1989), and more recently item difficulty (Reddy et al., 2016). We based our investigation on these findings and validated that these indicators indeed affect memory retention in neural networks. We then developed training paradigms that utilize the above indicators to train networks.

Bengio et al. (2009) and Kumar et al. (2010) also developed cognitively-motivated training paradigms which are inspired by the principle that learning can be more effective when training starts with easier concepts and gradually proceeds with more difficult ones. Our idea is motivated by the spaced repetition principle, which indicates that learning improves with repeated exposure and decays with delay since last exposure (Ebbinghaus, 1913; Dempster, 1989). Based on this principle, we developed schedulers that space the reviews of training instances over time for efficient and effective training of neural networks.

6 Conclusion and Future Work

We developed a cognitively-motivated training paradigm (scheduler) that spaces instances over time for efficient and effective training of neural networks. Our scheduler only uses a small fraction of training data per epoch but still effectively trains neural networks. It achieves this by estimating the time (number of epochs) by which training could be delayed for each instance. Our work was inspired by three recall indicators that affect memory retention in humans, namely the difficulty of learning materials, the delay since their last review, and the memory strength of the learner, which we validated in the context of neural networks.

There are several avenues for future work, including the extent to which our RbF model and its kernels could be combined with curriculum learning or the Leitner system, to either predict the easiness of novel training instances to inform curriculum learning or incorporate Leitner's queueing mechanism into the RbF model. Other directions include extending RbF to dynamically learn the recall confidence parameter with respect to network behavior, or developing more flexible delay functions with theoretical analysis of their lower and upper bounds.

Acknowledgments

We thank Mitra Mohtarami for her constructive feedback during the development of this paper and the anonymous reviewers for their thoughtful comments. This work was supported by National Institutes of Health (NIH) grant R01GM114355 from the National Institute of General Medical Sciences (NIGMS). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

Lee Averell and Andrew Heathcote. 2011. The form of the forgetting curve and the fate of memories. Journal of Mathematical Psychology, 55(1):25-35.

Sumit Basu and Janara Christensen. 2013. Teaching classification boundaries to humans. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pages 109-115. AAAI Press.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41-48. ACM.

Nicholas J Cepeda, Harold Pashler, Edward Vul, John T Wixted, and Doug Rohrer. 2006. Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132(3):354.

Tsung-Han Chan, Kui Jia, Shenghua Gao, Jiwen Lu, Zinan Zeng, and Yi Ma. 2015. PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing, 24(12):5017-5032.

Frank N Dempster. 1989. Spacing effects and their implications for theory and practice. Educational Psychology Review, 1(4):309-330.

Hermann Ebbinghaus. 1913. Memory: A contribution to experimental psychology. 3. University Microfilms.

Hermann Ebbinghaus. 2013. Memory: A contribution to experimental psychology. Annals of Neurosciences, 20(4):155.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.

Julia Hirschberg and Christopher D Manning. 2015. Advances in natural language processing. Science, 349(6245):261-266.

Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. 2014. Self-paced learning with diversity. In Advances in Neural Information Processing Systems 27, pages 2078-2086. Curran Associates, Inc.

Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G. Hauptmann. 2015. Self-paced curriculum learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 2694-2700.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, page 427.

M Pawan Kumar, Benjamin Packer, and Daphne Koller. 2010. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189-1197.

Jaap MJ Murre and Joeri Dros. 2015. Replication and analysis of Ebbinghaus' forgetting curve. PLoS ONE, 10(7):e0120644.

Timothy P Novikoff, Jon M Kleinberg, and Steven H Strogatz. 2012. Education of a model student. Proceedings of the National Academy of Sciences, 109(6):1868-1873.

Siddharth Reddy, Igor Labutov, Siddhartha Banerjee, and Thorsten Joachims. 2016. Unbounded human learning: Optimal scheduling for spaced repetition. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1815-1824. ACM.

Valentin I Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2010. From baby steps to leapfrog: How less is more in unsupervised dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 751-759. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2016. Towards AI-complete question answering: A set of prerequisite toy tasks. 5th International Conference on Learning Representations.

John T Wixted. 1990. Analyzing the empirical course of forgetting. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(5):927.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
