Cognitive Systems Research 65 (2021) 23–39
Adaptive sampling for active learning with genetic programming
Sana Ben Hamida a,⇑, Hmida Hmida a,b, Amel Borgi c, Marta Rukoz a,d
a
Université Paris Dauphine, PSL Research University, CNRS, UMR [7243], LAMSADE, Paris 75016, France
b
Université de Tunis El Manar, Faculté des Sciences de Tunis, LR11ES14 LIPAH, Tunis 2092, Tunisia
c
Université de Tunis El Manar, Institut Supérieur d’Informatique et Faculté des Sciences de Tunis, LR11ES14 LIPAH, Tunis 2092, Tunisia
d
Université Paris Nanterre, Nanterre Cedex 92001, France
Received 30 October 2019; received in revised form 3 July 2020; accepted 25 August 2020
Available online 19 September 2020
Abstract
Active learning is a machine learning paradigm that allows the learner to decide which inputs to use for training. It is introduced to Genetic Programming (GP) essentially through dynamic data sampling, which is used to address some known issues such as the computational cost, the over-fitting problem and imbalanced databases. Traditional dynamic sampling for GP gives the algorithm a new sample periodically, often each generation, without considering the state of the evolution. In so doing, individuals do not have enough time to extract the hidden knowledge. An alternative approach is to use some information about the learning state to adapt the periodicity of the training data change. In this work, we propose an adaptive sampling strategy for classification tasks based on the state of solved fitness cases throughout learning. It is a flexible approach that can be applied with any dynamic sampling method. We implemented several sampling algorithms extended with deterministic and adaptive control of the re-sampling frequency, and applied them with GP to the KDD intrusion detection and the Adult income prediction problems. The experimental study demonstrates how sampling frequency control preserves the power of dynamic sampling, with possible improvements in learning time and quality. We also demonstrate that adaptive sampling can be an alternative to multi-level sampling. This work opens many relevant extension paths.
© 2020 Elsevier B.V. All rights reserved.
Keywords: Genetic programming; Machine learning; Active learning; Training data sampling; Adaptive sampling; Sampling frequency control
1. Introduction
Evolutionary Algorithms (EA) (Pétrowski & Ben Hamida, 2017; Simon, 2013; Yu & Gen, 2010) are metaheuristics that apply to a wide range of problems, such as complex optimization, identification, machine learning, and adaptation problems. Applied to machine learning, Evolutionary Algorithms, especially Genetic Programming
⇑ Corresponding author.
E-mail addresses:
[email protected] (S. Ben Hamida),
[email protected] (A. Borgi),
[email protected]
(M. Rukoz).
https://doi.org/10.1016/j.cogsys.2020.08.008
1389-0417/© 2020 Elsevier B.V. All rights reserved.
(GP) (Koza, 1992), have proven very effective in a wide range of supervised and unsupervised learning problems. However, their flexibility and expressiveness come with two major flaws: an excessive computational cost and a problematic parameter setting.
In the supervised learning field, the lack of data may lead to unsatisfactory learners. This is no longer an issue with the numerous data sources and high data volumes we witness in the era of Big Data. Nonetheless, this abundance toughens the computation problem of GP and precludes its application to data-intensive problems. There have been various research efforts on improving GP when applied to large
datasets. These research efforts include hardware solutions, such as parallelization, and algorithmic solutions. The most affordable are software-based solutions that do not require any specific hardware configuration. Sampling is the mainstream approach in this category: it reduces processing time by reducing the data while keeping the relevant records.
A complete review of sampling methods used with GP is published in Hmida, Ben Hamida, Borgi, and Rukoz (2016b), extended with a discussion of their ability to deal with large datasets. Sampling methods can be classified with regard to three properties: re-sampling frequency, sampling scheme (or strategy), and sampling quantity. The sampling strategy defines how to select records from the input database. The sampling quantity defines how many samples are needed by the algorithm. The sampling frequency defines when the sampling technique is applied throughout the training process. This last property is the focus of this study.
According to the re-sampling frequency, machine learning algorithms use either a unique sample or a renewable one; these settings are called static and dynamic sampling respectively. On the one hand, static sampling for GP, like the Historical Subset Selection (Gathercole & Ross, 1994) and bagging/boosting (Iba, 1999; Paris, Robilliard, & Fonlupt, 2003), requires selecting a representative training set. With large datasets, this poses the problem of combining the downsizing and data-coverage objectives. On the other hand, dynamic sampling creates a sample per generation according to its selection strategy. Consequently, GP individuals do not have enough time to learn from the sampled data, and the population might waste some good resources for solving difficult cases in the current training set. Moreover, re-sampling at each GP iteration might be computationally expensive, especially when using sophisticated sampling strategies.
We propose, in this paper, an extension to dynamic sampling techniques in which the sample renewal is controlled by a parameter that adapts the sampling to the learning process. This extension aims to preserve the original sampling strategy while enhancing learning robustness and/or learning time.
After studying the effect of the re-sampling frequency on training quality and learning time, we propose two predicates to implement adaptive sampling based on the status of resolved fitness cases. These predicates are tested and compared with two deterministic variation rules defined by functions with increasing and decreasing patterns. The objective of this study is to demonstrate that controlling the sampling frequency with deterministic or adaptive functions does not degrade the results; on the contrary, in some cases it improves quality and learning time.
This paper is organized as follows. The next section gives an overview of adaptive sampling in active machine learning. In Section 3, we expose the background of this work in GP and the design decisions needed to add dynamic sampling to the GP engine. Section 4 reviews the sampling methods for active learning with GP that are involved in the experimental study. In Section 5, we study the effect of varying the sampling frequency on the genetic learners. Section 6 introduces the novel sampling approach and explains how it can extend dynamic sampling methods. Then, in Section 7, an experimental study gives a proof of concept of adaptive sampling, and its effect on the learning process is traced through the discussion of the registered results in Section 8. The main results in this section are compared to the results of three multi-level dynamic sampling methods published in Hmida, Ben Hamida, Borgi, and Rukoz (2016a), to demonstrate how adaptive sampling can be an alternative to hierarchical sampling. Finally, we give some conclusions and propose further developments.
2. Related works: adaptive sampling
In this paper, we are mainly interested in sampling methods that reduce the original training dataset size by substituting it with a much smaller representative subset, thus reducing the evaluation cost of the learning algorithm. Two major classes of sampling techniques can be laid out: static sampling, where the training set is selected independently from the training process and remains unmodified along the evolution, and active sampling, also known as active learning, which can be defined as (Atlas, Cohn, & Ladner, 1990; Cohn, Atlas, & Ladner, 1994): 'any form of learning in which the learning program has some control over the inputs on which it trains' (Cohn et al., 1994).
With active sampling, the training subsets are periodically (often at each iteration of the learning algorithm) built and modified using a special technique associated with the learning algorithm along the evolution. In the machine learning field, 'the key hypothesis is that if the learning algorithm is allowed to choose the data from which it learns–to be curious, if you will–it will perform better with less training' (Settles, 2010). When the active sampling depends on some component of the machine learning engine, such as data information or solution quality, it becomes adaptive.
In the past two decades, several adaptive sampling approaches have been proposed to deal with large data sets in several domains. Xiao-Bai Li and Varghese S. Jacob use adaptive sampling for data reduction, based on a chi-square statistic measuring the goodness-of-fit between the distributions of the reduced and full datasets (Li, 2002; Li & Jacob, 2008). Iyengar et al. apply adaptive resampling to the active learning task for classification problems (Iyengar, Apté, & Zhang, 2000). Reviews of active learning approaches, mainly for classification problems, are presented in Fu, Zhu, and Li (2013) and Settles (2010). More recently, Luo et al. have proposed an adaptive bounding evolutionary algorithm based on adaptive sampling for continuous optimization problems
(Luo, Hou, Zhong, Cai, & Ma, 2017). Their algorithm starts to update the boundaries of the variables after n0 generations of the evolution process, and then updates the boundaries every ng generations by using a fitness-based bounding selection strategy over multiple previous generations. In Balkanski and Singer (2018a, 2018b), an adaptive sampling technique for maximizing monotone sub-modular functions under a cardinality constraint is presented. Various adaptive sampling criteria for the development of meta-models based on non-uniform rational B-splines (NURBs) are presented in Pickett and Turner (2011).
Adaptive sampling has also been used in global meta-modeling for computer simulation models. Indeed, simulation models can approximate detailed information of real-world physical problems, but they require huge computational resources. An adaptive sequential sampling strategy thus allows constructing accurate global meta-models with fewer points than other sampling strategies, such as space-filling sequential sampling (Haitao, Yew-Soon, & Jianfei, 2018). An adaptive sampling for parametric macro-modeling of a microwave antenna is proposed in Deschrijver, Crombecq, Nguyen, and Dhaene (2011). A balance strategy that performs adaptive sampling by circularly looping through a search pattern containing several weights, from global to local, is presented in Liu, Xu, Ma, Chen, and Wang (2015). A survey of adaptive sampling for global meta-modeling can be found in Haitao et al. (2018), which considers four categories of adaptive sampling: variance-based, query-by-committee-based, cross-validation-based, and gradient-based adaptive sampling.
Our work tackles the adaptive sampling approach for active learning, particularly for Genetic Programming engines. We present it in the next section.
3. Background: GP and active learning
3.1. Genetic programming engine
Like any EA, GP evolves a population of individuals throughout a number of generations. A generation is in fact an iteration of the main loop described in Fig. 1. Each individual represents a complete mathematical expression or a small computer program. Standard GP uses a tree representation of individuals, built from a function set for the nodes and a terminal set for the leaves. When GP is applied to a classification problem, each individual is a candidate classifier, and the objective is to find the best classifier. The terminal set is composed of the dataset features and some randomly generated constants, while the function set contains mostly arithmetic and logic functions. As for the fitness function, it is often based on some learning performance measure such as accuracy.
The main steps of GP with dynamic or active sampling
are:
1. Randomly create a population of individuals whose tree nodes are taken from given function and terminal sets. Then evaluate their fitness by executing each program tree against the initial training subset.
2. According to a fixed probability, cross or mutate individuals to create new offspring.
3. Select a new training subset with a given sampling algorithm.
4. Evaluate the offspring solutions against the new sample, and build a new population by selecting the best individuals from parents and offspring according to their fitness values.
5. Loop steps 2 to 4 until a stop criterion is met.
Fig. 1. Genetic Learning Evolutionary loop. Steps 1, 2 and 4 concern the traditional GP loop, step 3 deals with the dynamic sampling.
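The loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the constant-valued "individuals", the mutation-only variation and the fitness function are stand-ins for real GP program trees and operators.

```python
import random

def evolve(database, fitness, sample_fn, pop_size=20, generations=10, sample_size=5):
    """Minimal GP-style loop with dynamic (per-generation) re-sampling.
    Individuals are stubbed as random constants; real GP evolves program trees."""
    population = [random.uniform(-1, 1) for _ in range(pop_size)]
    sample = sample_fn(database, sample_size)                 # step 1: initial sample
    scores = [fitness(ind, sample) for ind in population]     # step 1: initial evaluation
    for g in range(generations):
        # step 2: variation (mutation only, for brevity)
        offspring = [ind + random.gauss(0, 0.1) for ind in population]
        # step 3: select a new training subset
        sample = sample_fn(database, sample_size)
        # step 4: evaluate parents and offspring on the new sample, keep the best
        merged = population + offspring
        merged_scores = [fitness(ind, sample) for ind in merged]
        ranked = sorted(zip(merged_scores, merged), reverse=True)
        population = [ind for _, ind in ranked[:pop_size]]
    return population

# Usage: a toy classification task; fitness counts correctly classified cases.
data = [(x, x >= 0) for x in range(-10, 11)]
fit = lambda ind, s: sum((x * ind >= 0) == y for x, y in s)
best = evolve(data, fit, lambda db, k: random.sample(db, k))[0]
```

Here `sample_fn` is the pluggable sampling algorithm (step 3); the strategies of Section 4 differ only in how this function selects records.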
Evaluation is the prevailing step with regard to the overall computation cost; it depends simultaneously on the sample size, the population size and the complexity of each individual. With active sampling, the training subset is changed regularly before the evaluation step, so that only the best individuals, fitting the different provided datasets, survive along the evolution.
3.2. Active learning for GP
In a GP engine implementing active learning, the underlying sampling techniques are tightly related to the evolutionary mechanism. They allow a dynamic change of the training dataset along the learning process.
Sampling the training database was first used to boost the learning process, to avoid over-fitting, or to handle imbalanced data classes in classification problems (Iba, 1999; Liu & Khoshgoftaar, 2004; Paris et al., 2003). Later, it was introduced to genetic learners as a strategy for handling large input databases. For genetic learning, several sampling techniques have been proposed as solutions to one or several of these problems. They can be classified into two categories: static sampling methods and dynamic sampling methods. With static sampling, the sampling is performed only once for each genetic learning run. In contrast, with dynamic sampling, the training subset is changed periodically across the learning process, with a data selection strategy based on some dynamic criterion, such as random selection, weighted selection, or incremental selection. We distinguish one-level sampling methods, using a single selection strategy, from multi-level (hierarchical) sampling methods, using multiple selection strategies associated in a hierarchical way.
3.2.1. Designing a dynamic sampling technique for GP
A sampling technique can be formulated as follows. If we consider B as the database storing all the available records for the training process, sampling consists of selecting a subset S of records from B such that S ⊂ B and |S| ≪ |B|.
To introduce a sampling technique for GP, in addition
to the sampling strategy defining how to select fitness cases
from the database, two important parameters have to be
designed:
- sampling frequency: how often the training subsets are
changed across the learning process;
- sampling quantity: how many subsets are needed for the
evaluation step.
For evolutionary machine learning techniques, the sampling quantity can be individual-wise, population-wise or sub-population-wise. In the individual-wise case, a new data sample S_j is extracted for each individual in the population and each solution is evaluated independently, which might drastically raise the computational cost. The sub-population-wise case can be used only if the genetic learner evolves sub-populations with a co-evolutionary mechanism. These two cases are not included in our study; only the population-wise case, using one sample for the whole population, is considered.
With genetic learning, it is possible to mine a single subset S throughout an evolutionary run, used to evaluate all the individuals in the population. This is the run-wise sampling frequency (Freitas, 2002). This sampling approach is also known as static sampling: the learner obtains all the input training data at once, and the data are kept unchanged across the learning process. All the methods in this category use the run-wise sampling frequency and are population-wise, like the Historical Subset Selection (Gathercole & Ross, 1994), or sub-population-wise, like the bagging/boosting methods (Iba, 1999; Paris et al., 2003).
When the sampling technique is called by the learner to change the training sample along the evolution, the method uses the g-generation-wise sampling frequency. In this case, every g generations, a new subset S is extracted using the designed sampling strategy and is used for the evaluation step. Methods in this category are known as active sampling techniques. When g = 1, the population is evaluated on a different data subset each generation and the sampling frequency is generation-wise. All the dynamic sampling techniques introduced for GP use this frequency, where g = 1. When using a complex sampling strategy and a relatively large sample size, the computational cost of the learning process might be very high. Moreover, with the generation-wise frequency, the population does not have enough time to adapt in order to extract the hidden knowledge in the current training sample.
4. Active sampling with GP
To select a training subset S from the database B, many approaches have been proposed, either for static or for active sampling. For static sampling, the database is partitioned before the learning process, based essentially on some criteria or some features in the data; this sampling strategy is not discussed in this paper. For active sampling, we identify basically five main approaches used with GP: stochastic sampling, weighted sampling, data-topology based sampling, balanced sampling and incremental sampling. With stochastic and incremental sampling, fitness cases are selected randomly, with respectively a fixed and an increasing size. With weighted sampling, a weight is computed for each fitness case based on some features and/or the difficulty of solving the corresponding record. Balanced sampling was first introduced to deal with the problem of class imbalance in training databases; like stochastic and weighted sampling, it can be applied to deal with large datasets or to decrease the training computational cost. Data-topology based sampling uses some information about the features to measure similarity and connections between fitness cases; these measures help to create heterogeneous samples for a better training. Each approach was introduced to provide solutions to a specific machine learning problem, such as over-fitting or data imbalance, but all of them can be applied to deal with large datasets and decrease the training computational cost.
Another approach consists in combining techniques from the five above approaches in a hierarchical way. Methods in this category are proposed especially to deal with very large databases. They combine two or three sampling techniques on two or three levels. In the first level (and the second level for the 3-level methods), the corresponding sampling technique is applied less frequently than the sampling technique in the last level.
To study the effect of the sampling frequency parameter on dynamic sampling efficiency, we selected four methods, presented in the following subsections: Random Subset Selection (RSS), Dynamic Subset Selection (DSS), and Balanced Sampling in two variants (BRSS, BUSS). Additionally, four hierarchical sampling approaches are selected for comparison purposes: RSS-DSS, DSS-DSS, RSS-TBS and BUSS-RSS-TBS.
Note that other sampling techniques exist that have been experimented with GP to improve its performance or to deal with some learning difficulties. For example, Interleaved Sampling (Gonçalves & Silva, 2013) was introduced to address over-fitting problems: it alternates, from one generation to another, between two training sets composed either of all the instances in the training database or of only one selected fitness case.
4.1. Random Subset Selection (RSS)

Random Subset Selection (Gathercole & Ross, 1994) is a simple algorithm that selects, at every generation g, each record i among the T records of the initial dataset B with a uniform probability P_i(g):

P_i(g) = S / T    (1)

where S is the target subset size. Its steps in the GP engine are summarized in Algorithm 1.
Algorithm 1. RSS
1: Select instances from B with a uniform probability to create a subset S(0)
2: g ← 0
3: for all generations g < gmax do
4:   Evaluate programs using the subset S(g)
5:   Evolve parents
6:   Randomly generate a new dataset S(g + 1)
7: end for
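The selection step of Algorithm 1 can be sketched as follows, assuming each record is kept independently with probability S/T as in Eq. (1); names are illustrative.

```python
import random

def rss_sample(B, S):
    """Random Subset Selection: keep each of the T records of B with
    uniform probability P_i(g) = S / T (Eq. (1)), so the subset size
    fluctuates around the target size S."""
    T = len(B)
    p = S / T
    return [record for record in B if random.random() < p]

# Usage: draw a subset of expected size 100 from 1000 records.
random.seed(1)
subset = rss_sample(list(range(1000)), S=100)
```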
Note that other variants based on the same data selection strategy have been proposed; they differ from RSS in some small details. For example, the Stochastic Sampling introduced in Nordin and Banzhaf (1997) samples a training subset for each individual in the GP population and at each generation; it is an individual-wise technique, as described in Section 3. A second method, called Incremental Random Selection (Zhang & Cho, 1999; Zhang & Joung, 1999), constructs subsets of growing size by adding an identical number of fitness cases at every generation until the whole training database is used. These variants of the RSS technique are not part of our experimental study.
4.2. Dynamic Subset Selection (DSS)
The DSS algorithm (Gathercole & Ross, 1994, 1997; Gathercole, 1998) is inspired by boosting techniques and aims to bias the selection so as to keep difficult cases (i.e. fitness cases frequently unsolved by the best solutions) and fitness cases that have not been selected for several generations. DSS computes two measures for each record i: a difficulty degree D_i(g) and an age A_i(g), both starting at 0 in the first generation and updated at every generation g. The difficulty is incremented for each classification error and reset to 0 if the fitness case is solved. The age is equal to the number of generations since the last selection: it is incremented when the fitness case has not been selected and reset to 0 otherwise.
The selection probability P_i(g) in Eq. (3) depends on each fitness-case weight W_i(g) (Eq. (2)):

∀i : 1 ≤ i ≤ T,   W_i(g) = D_i(g)^d + A_i(g)^a    (2)

where d and a are given parameters denoting respectively the difficulty exponent and the age exponent.

∀i : 1 ≤ i ≤ T,   P_i(g) = (W_i(g) · S) / (Σ_{j=1..T} W_j(g))    (3)
Algorithm 2 describes how the DSS technique is included
in the GP engine.
Algorithm 2. DSS
1: Initialize, for each record i in B, the difficulty degree D_i(0) and the age A_i(0)
2: g ← 0
3: for all generations g < gmax do
4:   initialize an empty subset S(g)
5:   for all records i in B do
6:     if g = 0 then
7:       P_i(g) = S / |B|
8:     else
9:       compute P_i(g) using Eq. (3)
10:    end if
11:    add record i to S(g) with probability P_i(g)
12:    if i is selected then
13:      A_i(g + 1) = 0
14:    else
15:      A_i(g + 1) = A_i(g) + 1
16:    end if
17:  end for
18:  Evaluate programs using the subset S(g)
19:  Update, for each record i in B, the difficulty degree D_i(g + 1)
20:  Apply genetic operators
21: end for
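One iteration of the inner loop of Algorithm 2 (Eqs. (2) and (3)) can be sketched as follows. The exponent values d and a are illustrative choices, not the authors' settings, and the probability is capped at 1 for the sketch.

```python
import random

def dss_sample(difficulty, age, S, d=1.0, a=3.5):
    """One Dynamic Subset Selection step:
    W_i(g) = D_i(g)**d + A_i(g)**a           (Eq. (2))
    P_i(g) = W_i(g) * S / sum_j W_j(g)       (Eq. (3))
    Returns the indices selected for the next training subset and
    updates the ages in place (reset on selection, else incremented)."""
    T = len(difficulty)
    weights = [difficulty[i] ** d + age[i] ** a for i in range(T)]
    total = sum(weights)
    selected = []
    for i in range(T):
        p = min(weights[i] * S / total, 1.0)
        if random.random() < p:
            selected.append(i)
            age[i] = 0          # selected: age resets
        else:
            age[i] += 1         # not selected: age grows
    return selected

# Usage: 500 records with random difficulties and ages, target size 50.
random.seed(2)
difficulty = [random.randint(0, 5) for _ in range(500)]
age = [random.randint(0, 3) for _ in range(500)]
picked = dss_sample(difficulty, age, S=50)
```

Records with a high difficulty or a long time since last selection get a larger weight and hence a higher selection probability, which is the bias DSS is designed to introduce.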
4.3. Balanced sampling
Balanced sampling (Hunt, Johnston, Browne, & Zhang, 2010) aims to improve classifier accuracy by correcting the imbalance between majority- and minority-class instances in the original data set. Some methods are based on the minority class size (N_min) and thus reduce the number of instances, like the methods studied in this paper. Several approaches have been proposed; we summarize hereafter three sampling techniques used with GP: first, Static Balanced Sampling, which selects cases with uniform probability from each class, without replacement, until a balanced subset of the desired size is obtained; then, Basic Under-Sampling (BUSS) (resp. Basic Over-Sampling), which selects all the minority (resp. majority) class instances and then an equal number of randomly chosen instances from the majority (resp. minority) class. With BUSS, the sample size is equal to 2·N_min, where N_min is the minority class size.
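BUSS can be sketched as follows for a two-class problem (the helper name and the label accessor are illustrative):

```python
import random

def buss_sample(records, label_of):
    """Basic Under-Sampling: keep all minority-class instances and draw
    an equal number at random from the majority class, giving a balanced
    sample of size 2 * N_min."""
    classes = {}
    for r in records:
        classes.setdefault(label_of(r), []).append(r)
    minority, majority = sorted(classes.values(), key=len)
    return minority + random.sample(majority, len(minority))

# Usage: 90 majority-class records (label 0) and 10 minority (label 1).
random.seed(3)
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(10)]
sample = buss_sample(data, label_of=lambda r: r[1])
```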
4.4. Multi-level or hierarchical active sampling
Hierarchical sampling is based on multiple levels of sampling methods, inspired by the concept of a memory hierarchy. It combines several sampling algorithms applied at different levels. Its objective is to deal with large datasets that do not fit in memory, while simultaneously providing the opportunity to find solutions with a greater generalization ability than those given by one-level sampling techniques. The data subset selections at each level are independent.
Fig. 2 shows the main steps performed to obtain the final training subset. The usual schema is made up of three levels. The first one consists in creating blocks of a given size from the original data set, which are stored on the hard disk. The remaining two levels are a combination of two active sampling methods.
Fig. 2. Main steps of the Hierarchical Sampling: case of three-level sampling.
Curry et al. extended the DSS algorithm into a 3-level hierarchy (Curry & Heywood, 2004). At level 0, the database is partitioned into blocks that are sufficiently small to reside within RAM alone. Then, at level 1, one block is chosen from these partitions based on the RSS or DSS sampling technique. Finally, at level 2, the selected block is considered as the full data set on which DSS is applied for several generations. Depending on the level-1 algorithm, two approaches are possible: the RSS-DSS hierarchy or the DSS-DSS hierarchy.
Based on the same idea, Hmida et al. proposed two new variants of hierarchical sampling: RSS-TBS and BUSS-RSS-TBS (Hmida et al., 2016a). RSS-TBS uses Topology-Based Subset Selection (Lasarczyk, Dittrich, & Banzhaf, 2004) at level 2, instead of RSS or DSS. For the sampling process, TBS uses an undirected weighted graph representing the relationships between the fitness cases in the database. The vertices of the graph are fitness cases, and each edge carries a weight measuring a similarity or a distance induced from the individuals' performance. Cases having a tight relationship cannot be selected together in the same subset, under the assumption that they have an equivalent difficulty for the population.
The second variant, BUSS-RSS-TBS, extends the first one with Basic Under-Sampling at the level-0 block creation. BUSS favors the minority class by calculating the block size according to its cardinality; for the majority class, an equal number of instances is selected randomly.
5. Controlling sampling frequency with GP
5.1. The sampling frequency feature
The sampling frequency (f) is a main parameter of any active sampling technique. It defines how often the training subset is changed across the learning process. When f = 1, a training sample is extracted at each generation and the sampling approach is considered a generation-wise sampling technique. Most of the sampling techniques applied with GP belong to this category; this is the case of the techniques described in Section 4. When f is set to 1, the individuals in the current population have only one generation to adapt their genetic material to the current environment characterized by the training sample. For an evolutionary algorithm, it is very difficult, even impossible, for a population to solve all the cases of a training set in one generation. Conversely, a higher value of f corresponds to a lower number of generated samples and might not allow the population to see all the fitness cases available in the database. We think that the sampling frequency must be updated according to the evolution state and the difficulty of the current training set.
Note that for hierarchical sampling, described in Section 4.4, a sampling frequency value is needed for each level. For example, for the RSS-DSS method, a sampling frequency f1 is needed at level 1, defining when to change the training block with the RSS technique, and a second frequency value f2 is needed at level 2, defining the re-sampling frequency of the DSS technique.
5.2. State of the art
As detailed in Section 4, the active sampling techniques proposed for GP have essentially studied how to select examples for the training subset. The sampling frequency is a constant value set as a user parameter. Thus, for each sampling technique presented or cited in Section 4 (RSS, DSS, BUSS, etc.), the sampling frequency is set as a constant before the learning process. For the one-level techniques, the sampling frequency is usually set to f = 1. For the multi-level techniques, the sampling frequency at the low level is set to 1, as for the one-level methods, while at the higher levels it varies with the complexity of the problem, often from 40 to 100.
The generally used strategy to set the sampling frequency parameter is the following.
Fixed Sampling Frequency. The sampling frequency f is set before starting the GP run, like any other GP parameter. This value remains unchanged until the last generation and is usually equal to 1. This can be represented by the following algorithm:
Algorithm 3. Fixed Sampling Frequency
Require: f {sampling frequency}
1: for all generations g < gmax do
2:   if g mod f = 0 then
3:     re-sample
4:   end if
5: end for
6. The proposed sampling approach
Three main approaches are possible to control any EA parameter: deterministic, adaptive and self-adaptive (Eiben, Michalewicz, Schoenauer, & Smith, 2007). Deterministic control uses a deterministic rule to alter the EA parameter along the evolution. Adaptive control uses feedback from the search state to define a strategy for updating the parameter value. With self-adaptive control, the parameter is encoded within the chromosome and evolves with the population.
The sampling frequency can be considered as an EA parameter and can therefore be controlled using the same strategies. We propose in this section a deterministic and an adaptive approach to adjust this parameter along the evolutionary learning process. For the deterministic control, an increasing and a decreasing scheme are experimented. For the adaptive control, we propose an adaptive scheme based on feedback from the learning state, such as the proportion of solved fitness cases in the current sample or the improvement rate of the best/average fitness.
6.1. Deterministic sampling frequency
When the sampling frequency is updated with deterministic control, f takes different values throughout the GP run. These values are determined by a function that gives the same series of values at each run. Thus, the frequency may be increasing, decreasing, or may follow a more complex curve.
When f follows an increasing scheme, the training process starts with samples of short lifetime (in number of generations), giving the population the opportunity to see the maximum number of fitness cases in the first training iterations. By the end of the run, the samples are learned over a large number of generations, which might help the population to tune the genetic material of the current solutions.
To implement this approach, we use a deterministic function based on the generation number: f = (C·g)^a, where the coefficients C and a control the shape of the curve of f. Their values are set with the GP parameters. The following algorithm summarizes the corresponding steps:
Algorithm 4. Deterministic Sampling Frequency

1: for all generation g < gmax do
2:   f = (C·g)^a  {C, a ∈ R}
3:   if g mod f = 0 then
4:     re-sample
5:   end if
6: end for
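As an illustration, the deterministic schedule of Algorithm 4 (and its decreasing counterpart) can be sketched in Python. This is a hedged sketch, not the authors' implementation: the rounding of f to an integer period is our assumption, and the default coefficients C = 2 and a = 0.5 are the values used later in the experiments (Section 7.5).

```python
def deterministic_period(g, C=2.0, a=0.5):
    """Deterministic sampling period f = (C*g)^a, rounded to an integer >= 1."""
    return max(1, int(round((C * g) ** a)))

def should_resample(g, g_max, increasing=True, C=2.0, a=0.5):
    """Return True when a new sample must be drawn at generation g.

    Increasing scheme: f = (C*g)^a; decreasing scheme: f = (C*(g_max - g))^a.
    Following Algorithm 4, re-sampling occurs when g mod f == 0.
    """
    base = g if increasing else (g_max - g)
    f = max(1, int(round((C * base) ** a)))
    return g % f == 0
```

With an increasing scheme, early generations get short-lived samples (small f), and samples survive longer as the run progresses.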
The opposite process (i.e. a decreasing frequency) uses the
same steps but updates the frequency with a decreasing function such as f = (C·(gmax − g))^a. This scheme can be
useful when the data set contains fitness cases that are difficult to
solve. The GP engine first focuses on the current
sample in order to help the population reach the target
area in the search space. When the sampling becomes more
frequent, its focus shifts to tuning the solutions.
6.2. Adaptive sampling frequency
The fundamental idea behind adaptive sampling through sampling frequency control is to add an extra parameter to the sampling algorithms, acting as a moderator or re-sampling regulator. While dynamic methods use a fixed
renewal frequency equal to 1, adaptive sampling decides
whether to generate a new sample for the subsequent generations
according to a condition that must be satisfied by the learning state.
Fig. 3 depicts this approach. The GP-based learner interacts
with the sampling process by providing adequate
information about the learning state needed to perform
the underlying selection strategy. For example, DSS needs
to know the misclassified cases to update the difficulty values.
Then, the sampling algorithm delivers a new sample
generated according to the updated parameters. With
adaptive sampling, a predicate controls the re-sampling
decision. We assume that any input required to evaluate
this predicate is available within the data dispensed by
the GP engine.

Fig. 3. Adaptive vs Dynamic sampling.

To design a predicate for the adaptive sampling, various
information about the current state of the evolution and
of the training process can be retrieved from the GP engine,
such as the generation number, the population mean fitness, the mean fitness improvement rate, the best fitness
improvement rate, etc. This information can be used in
a Boolean condition or within a dynamic function.

The straightforward approach is to define a threshold
per measure; the predicate is then a comparison of
the current value with the corresponding threshold.
For example, if we define a threshold of 0.002 on the best fitness improvement rate, then GP will continue to use the
same sample as long as the best fitness of the current generation
is better than that of the previous generation by 0.2% or more.
Otherwise, a new sample must be created. In a more complex approach, the threshold can be auto-adapted to the learning
process. With adaptive sampling, the sampling frequency f is adjusted according to the general training performance to accommodate the current state of the
learning process. Therefore, f can increase or decrease by
a varying amount.

Our approach is based on the current state of the population. It uses either the evolution of the mean fitness or the
number of resolved cases to decide whether to create a new
sample or to carry on learning with the previous one.
We assume that less performing learners need more
time to improve their performance and, symmetrically, that learners that are efficient on a particular sample need to see different
data from a new sample.

As said above, adaptive sampling can rely on various learning performance indicators, which it retrieves
from the GP engine. Hereafter are two examples of adaptation
techniques.

The first uses a threshold on the population mean fitness
to detect whether the population is making improvements.
In the case of very small or no improvement, a new sample
is generated, since the old one is fully exploited (Algorithm
5).

Algorithm 5. Adaptive Sampling Frequency

Require: r {mean fitness variation rate: (mean_fitness(g) − mean_fitness(g−1)) / mean_fitness(g−1)}
Require: t {threshold}
1: for all generation g < gmax do
2:   if r < t then
3:     re-sample
4:   end if
5: end for

The second example is based on measuring the mean
number of individuals (learners) that have resolved each
record in the training sample. When this value reaches a
designated threshold, new records are selected in a new
sample.
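The two adaptive predicates above can be sketched in Python. This is a hedged illustration: the function names and the bookkeeping of solved cases are our assumptions, not the paper's implementation; the first predicate mirrors Algorithm 5 term by term.

```python
def mean_fitness_predicate(mean_fit_prev, mean_fit_curr, t=0.001):
    """First adaptive technique (Algorithm 5): re-sample when the relative
    mean fitness variation rate r falls below the threshold t."""
    if mean_fit_prev == 0:
        return False  # no reference value yet; keep the current sample
    r = (mean_fit_curr - mean_fit_prev) / mean_fit_prev
    return r < t

def min_resolved_predicate(solved_counts, population_size, threshold=0.5):
    """Second adaptive technique: re-sample when, on average over the sample
    records, at least a `threshold` proportion of the population resolves
    each record (the sample is considered exploited)."""
    mean_solvers = sum(solved_counts) / len(solved_counts)
    return mean_solvers / population_size >= threshold
```

Here `solved_counts[i]` would hold, for record i of the current sample, the number of individuals that classify it correctly, information assumed to be dispensed by the GP engine.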
In the following sections, we give details about the settings used for the conducted experiments and the implementation of adaptive sampling on top of some of the dynamic sampling
algorithms discussed in Section 4. Then we expose the
experimental results and discuss them to analyze the effect
of the sampling frequency and of adaptive sampling on GP performance in solving the considered problems.
7. Experimental settings
7.1. Cartesian Genetic Programming
Cartesian Genetic Programming (CGP) (Miller &
Thomson, 2000) is a GP variant where individuals represent graph-like programs. It is called "Cartesian" because
it uses a two-dimensional grid of computational nodes
implementing directed acyclic graphs. Each graph node
encodes a function from the function set. The arguments
of the encoded function are provided by the inputs of the
node, and its output carries the result.
CGP shows several advantages over other GP
approaches. Unlike trees, there can be more than one path
between a pair of nodes, which enables the reuse of intermediate results. A genotype can also have multiple outputs,
which makes CGP able to solve many types of problems, and
classification problems in particular (Harding & Banzhaf,
2011). Moreover, CGP has the great advantage of counteracting the bloating effect (genotype growth), a frequent phenomenon with other GP representations. CGP is easy to
implement, and it is highly competitive compared to other
GP methods.
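To make the representation concrete, here is a minimal sketch of how a CGP genotype can be decoded and evaluated. This is a hedged illustration with a tiny function set; the encoding details (tuple layout, addressing) are our assumptions and differ from the ECJ/Oranchak implementation used in this work.

```python
import operator

# Tiny illustrative function set; the actual set has 17 functions (Table 4).
FUNCTIONS = [operator.add, operator.sub, operator.mul]

def evaluate_cgp(genotype, inputs, n_outputs=1):
    """Evaluate a CGP genotype given as a list of (func_idx, in1, in2) nodes
    followed by output connection genes.

    Node inputs address either program inputs (0..len(inputs)-1) or previously
    computed nodes, which is what enables the reuse of intermediate results."""
    values = list(inputs)                 # addresses 0..n_inputs-1
    nodes, out_genes = genotype[:-n_outputs], genotype[-n_outputs:]
    for func_idx, a, b in nodes:          # feed-forward: only earlier addresses
        values.append(FUNCTIONS[func_idx](values[a], values[b]))
    return [values[g] for g in out_genes]

# Example: node 2 = x0 + x1, node 3 = node2 * node2; the output reads node 3.
genotype = [(0, 0, 1), (2, 2, 2), 3]
print(evaluate_cgp(genotype, [1.5, 2.5]))  # prints [16.0]
```

Note how node 2's result is addressed twice by node 3, something a tree representation could not express without duplication.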
7.2. Data sets

For the experimental study, we selected two databases from the UCI
Machine Learning repository: the KDD-99
database for the intrusion detection problem and the Adult
database for income prediction.

The KDD-99 base is widely used to validate the performance
of various machine learning classifiers. The corresponding
problem consists in classifying connections into normal
or attack classes. It uses a large data set called the 10%
KDD-99 data set (UCI, 1999). The data set is already
divided into training and test sets, which are presented in
Table 1.

Each record is described by 41 features. The original
data is preprocessed with the following steps:

– Transforming discrete nominal attributes to numeric
values,
– Scaling data using the MinMax scaler:
  X_sc = (X − X_min) / (X_max − X_min),
– Binarization of attack classes: the problem is converted to a binary classification problem with a 'Normal' class and an 'Attack' class. The original four attack
types (Dos, Probe, R2L and U2R) are fused into a single class.

The Adult data set is a UCI data set donated by Kohavi
(1996). It involves predicting whether income exceeds
50,000 dollars a year based on census data. The original
data set consists of 48,842 observations, each described
by six numerical and eight categorical attributes (see
Table 2).

The feature set contains 14 attributes that describe the
salary: age, gender, work class, education, sex, native-country, marital-status, race, occupation, relationship,
capital-gain, capital-loss and hours-per-week. All observations with missing values were removed from consideration. Otherwise, the data is preprocessed according to the
same steps described above. Two-thirds of the base are
used for training and the other third as test set.

As for the KDD99 data set, the problem is converted into a
binary classification problem where:

– Probability for the label '>50K': 23.93%
– Probability for the label '<=50K': 76.07%

Table 2
Adult dataset.

Class            Training Set    Test Set
Positive cases   7720            3896
Negative cases   24,542          12,374
Total examples   32,252          16,280

The imbalance between the classes is much higher
for the KDD-99 database than for the Adult database. The
main purpose of this choice is to study the utility of introducing adaptive sampling in both cases, where the database is imbalanced or not.
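The preprocessing steps described above can be sketched as follows. This is a hedged illustration: the helper names are ours, while the MinMax formula and the attack-type fusion come from the description above.

```python
def min_max_scale(column):
    """MinMax scaling of one feature column: X_sc = (X - X_min) / (X_max - X_min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]   # constant column: map to 0 (our choice)
    return [(x - lo) / (hi - lo) for x in column]

def binarize_class(label):
    """Fuse the four attack types (Dos, Probe, R2L, U2R) into one 'Attack' class."""
    return 'Normal' if label == 'Normal' else 'Attack'
```

Nominal attributes are assumed to have already been mapped to numeric codes before scaling, as stated in the first preprocessing step.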
7.2.1. CGP settings
The CGP parameters used in this work are summarized in Table 3. In this work, the parameter tuning is
not fully explored.
7.2.2. Terminal and function sets
The terminal set includes the 41 features of the benchmark
KDD-99 dataset and the 14 features of the Adult dataset. The
function set includes basic arithmetic, comparison and logical operators, totaling 17 functions (Table 4).
7.3. Performance metrics

We recorded, for each run, its accuracy (Eq. (4)) and
False Positive Rate (FPR) (Eq. (5)) to measure the learning
performance on both training and test sets. We also
recorded the learning time, measuring the computational
cost.

Accuracy = (True Positives + True Negatives) / Total patterns    (4)

FPR = False Positives / (False Positives + True Negatives)    (5)
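Eqs. (4) and (5) translate directly into code; a minimal sketch from confusion-matrix counts (the function names are ours):

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (4): proportion of correctly classified patterns."""
    return (tp + tn) / (tp + tn + fp + fn)

def false_positive_rate(fp, tn):
    """Eq. (5): proportion of actual negatives wrongly flagged as positive."""
    return fp / (fp + tn)
```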
Table 1
KDD-99 dataset.

Class            Number of instances
                 Training Set    Test Set
Normal           97,278          60,593
Dos              391,458         229,853
Probe            4107            4166
R2L              1126            16,347
U2R              52              70
Total Attacks    396,743         250,436
Total examples   494,021         311,029

Table 3
CGP parameters.

Parameter                   Value
Population size             256
Sub-populations number      1
Generations number          200
CGP nodes                   300
Inputs for a CGP node       49 (KDD) / 22 (Adult)
Outputs for a CGP node      1 (2 classes)
Tournament size             4
Crossover probability       0.9
Mutation probability        0.04
Fitness                     Minimize classification error
Table 4
Terminal and function sets for GP.

Function (node) set
Arithmetic operators:    +, −, *, %
Comparison operators:    <, >, <=, >=, =
Logic operators:         AND, OR, NOT, NOR, NAND
Other:                   NEGATE, IF (IF THEN ELSE), IFLEZE (IF <= 0 THEN ELSE)

Terminal set
KDD-99 Features          41
Adult Features           14
Random Constants         8 in [−2, 2[
7.4. Framework

Software framework: Among several evolutionary computation frameworks, Sean Luke's ECJ (Luke, 2017) was used
in this work to implement and test CGP. It is an open-source framework written in Java that benefits from many contributed packages, such as the one used here for implementing
Cartesian GP, developed by David Oranchak (CGP, 2009).
This framework provides a very flexible API using parameter
files that are well documented in the ECJ owner's manual.

Hardware framework: Experiments were performed on an
Intel i7-4810MQ (2.8 GHz) workstation with 8 GB RAM
running the Windows 8.1 64-bit operating system.

7.5. Sampling settings

In the first set of experiments, we tested six values for the
sampling frequency (1, 10, 20, 30, 40 and 50) with four sampling methods (BRSS, BUSS, DSS and RSS) on both the
KDD and Adult data sets. BRSS is Balanced RSS, an RSS variant where the random sample is balanced
according to a given ratio between problem classes.

In the second part, we implemented four different techniques to control the sampling frequency, two deterministic
and two adaptive, as follows:

– Deterministic+: deterministic control with the increasing function f = (2g)^0.5,
– Deterministic−: deterministic control with the decreasing function f = (2(gmax − g))^0.5,
– Average Fitness: adaptive control based on the evolution of the population average fitness, with a threshold of 0.001,
– Min Resolved: adaptive control based on the average
proportion of the population representing the individuals that resolved all sample records, with a minimum
threshold of 0.5.

The underlying active sampling algorithms have their
own parameters, described in Table 5.

Table 5
Common sampling parameters.

Method               Parameter              Value
All (except BUSS)    Target size            5000
BRSS                 Balancing method       Full dataset distribution
BUSS                 Target size            416 for KDD / 15,682 for Adult
DSS                  Difficulty exponent    1
                     Age exponent           3.5

8. Results and discussion

The experimental study is organised in two parts. The
aim of the first experiments is to study the impact of the
variation of the sampling frequency on the learning time
and the performance indicators; the set of fixed values chosen for f is
given in Section 7.5. The second set
of experiments studies the efficiency of the proposed sampling frequency controlling strategies. Results are discussed
and then compared with some hierarchical sampling results
published in previous works.

For each value of the re-sampling frequency and each controlling
technique, 21 runs of each sampling algorithm (RSS, DSS,
BUSS and BRSS) are conducted on each data set. We
report the mean learning time of each configuration and
the accuracy and FPR values of the best individual.

8.1. Effect of sampling frequency

To study the effect of the sampling frequency, we analyse first the mean learning time and then the performance
metrics defined in Eqs. (4) and (5). Fig. 4 illustrates the variation of the learning time with the re-sampling frequency for the
four studied algorithms. Fig. 5 shows the mean learning
time lag between the one-generation-wise sampling
approach and the g-generation-wise sampling approach
for g = f and f ∈ {10, 20, 30, 40, 50}.

Fig. 4. Variation of the mean learning time with the re-sampling frequency applied on the KDD-99 dataset (a) and the Adult dataset (b).

The shape of the curves in Fig. 4 reveals two distinct
behaviors when the sampling frequency increases. The first
concerns the BUSS, BRSS and RSS algorithms, which recorded an insignificant decrease in the average learning time. The second behavior is that of DSS, with
a more remarkable decrease for both data sets. Fig. 5 clearly illustrates the mean time saving for each sampling frequency f ∈ {10, 20, 30, 40, 50} with respect to the one-generation frequency (f = 1) on both the KDD99 and Adult
data sets, for the methods BRSS (Fig. 5 (a1) and
(a2)), BUSS (Fig. 5 (b1) and (b2)), RSS (Fig. 5 (c1) and
(c2)) and DSS (Fig. 5 (d1) and (d2)). The histograms of
the BUSS, BRSS and RSS methods have the same time
scale, whereas that of DSS has a different scale, given the
importance of the time saved compared to the three other
methods.

Time saving depends on the time needed to perform
sample creation with respect to the time spent on a whole
generation. This is why the decrease in time is not very
important for the BUSS, BRSS and RSS techniques. In
the case of KDD99, only 26 s of time saving were
recorded for BRSS with frequency f = 50 (Fig. 5 (a1)),
and the saving is reduced to near-zero for BUSS at all frequencies (Fig. 5 (b1)). Similarly, the maximum mean time lag for
the Adult data set is recorded with f = 50 (Fig. 5 (a2)), and
it is negative in some cases with BUSS sampling (Fig. 5
(b2)).

The same holds for RSS: the mean time saving is either
negative in several cases or reduced to near-zero (Fig. 5
(c1) and (c2)). Thus, when the fitness case selection, with
or without class balancing, is carried out randomly, the
computational cost is dominated by the population evaluation, since it is the predominant step in the learning time
for GP.
A general observation can be made with both the KDD-99
and Adult data sets. When the sampling method is not
computationally expensive, the variation of the learning time as the
sampling frequency increases is not significant. This is the case for
the BUSS, BRSS and RSS methods, which recorded very
small positive or negative variations (Fig. 5 (a1), (a2),
(b1), (b2), (c1) and (c2)). However, as for the KDD99 database, the learning time is significantly lower with the DSS
method when it is applied less frequently (Fig. 5 (d1) and
(d2)). The decrease in mean time is proportional
to the increase of the re-sampling frequency.
The DSS algorithm differs from the other algorithms by updating certain sampling parameters (age and difficulty). Thus,
with DSS, the selection of a fitness case requires the computation of a probability based on the age and difficulty values over the whole dataset. Therefore, this method needs
much more time than the other techniques, which explains
the difference in learning time saving.
As for the performance metrics, the same analysis is carried out. Figs. 6 and 7 show the effect of the sampling frequency on two learning quality measures: accuracy (Fig. 6
(a) and (b) for the KDD99 data set and Fig. 6 (c) and (d) for
the Adult data set) and FPR (Fig. 7 (a) and (b) for the KDD99
data set and Fig. 7 (c) and (d) for the Adult data set).
Fig. 8 illustrates the accuracy gap (computed on the test
data set) of the different re-sampling frequencies with respect
to the one-generation-wise sampling.
Figs. 6 and 7 show an irregular shape of the accuracy and FPR variation curves for the three methods using
random fitness case selection. Some improvements can be
seen with high frequency values (i.e. f = 50) in the case
of KDD99. Nevertheless, this remains irregular and cannot
be generalized. However, in the case of the Adult data set, a
high sampling frequency such as f = 50 decreases the quality
of the results for the four sampling techniques.
The most noticeable shift is that of the FPR (Fig. 7).
However, no empirical correlation with the variation of
the re-sampling frequency can be established for BRSS, DSS and RSS. In the case of KDD-99, only BUSS achieves a
decrease in the FPR value when the re-sampling frequency
increases, for both training and test sets. The best values are
recorded with f = 50 for KDD-99 and with f between
30 and 40 for the Adult data set. However, BUSS sampling
is not suitable for the income prediction problem, since
there is no class imbalance in the corresponding data
set. It is clear that introducing balanced sampling, such
as BUSS and BRSS, into the GP engine when it has to learn
from imbalanced data helps to improve the quality of
the derived models, as is the case for the KDD99 database.
This performance is even higher when GP has more learning time on each data subset, with a high sampling frequency. GP behavior is completely different with the
Adult database, which has different characteristics from the
KDD99 database. Hence the need to adapt the sampling
strategy and frequency according to the training data set.
Accuracy is a main metric for measuring the quality of a
classification model. Thus, as for the mean learning time, we
computed the gap between the accuracy values obtained
with f > 1 and the one-generation sampling (f = 1)
for each sampling strategy and for both the KDD99 and Adult
databases. Fig. 8 illustrates the obtained measures for f
varying between 10 and 50.
Fig. 5. The mean learning time lag with respect to the one-generation-wise sampling approach as the re-sampling frequency varies, for the KDD99 dataset (left) and the Adult dataset (right) (the X-axis of the DSS histogram has a larger scale).

Although the time saving is low or not significant for the
RSS, BUSS and BRSS methods (Fig. 5), the accuracy lag values illustrated in Fig. 8 show that there exists a sampling frequency able to improve the learning performance of each
of these sampling techniques. The appropriate sampling
frequency differs from one sampling method to another and
according to the database characteristics. In the case of
the KDD99 base, high frequencies have greatly improved
the results for DSS and BRSS. The nature of the data
implies that GP, with sampling techniques, needs
longer learning phases to properly adapt its models to the
training data. For example, BRSS and DSS performed
about 14% better with f = 50 than with f = 1 in terms
of accuracy, with a time saving of 26 s for BRSS and 161
s for DSS. This gap decreases to 0.9% and 0.63% as the best
improvements accomplished by BUSS and
RSS respectively. As for the Adult data set, although small, all the
improvements (relative to f = 1) are observed essentially
with frequencies below 50. With f = 50, the quality of
the derived models worsens for all sampling techniques. This proves that the sampling frequency must be
adapted to the sampling method, the training data set
and the evolution of the learning process.
This first study demonstrated the impact of the sampling
frequency on the performance of four sampling methods
implemented with GP. The results illustrated in the different figures show that, given a sampling strategy and a
training database, there is a frequency that allows GP
to achieve a certain optimality. However, it is difficult to
hand-tune its value. For these reasons, we propose some
solutions to control its value through the GP engine.

Fig. 6. Variation of the best individual accuracy according to the re-sampling frequency with the KDD-99 train and test sets ((a) and (b)) and with the
Adult train and test sets ((c) and (d)).

The following section presents the second experimental study,
applied to the adaptive frequency control.
8.2. Adaptive sampling
To study the efficiency of the sampling frequency controlling strategies introduced in Section 6, we
have extended the four dynamic sampling algorithms (BUSS,
BRSS, RSS and DSS) with the four sampling frequency
controlling techniques: deterministic control based on an increasing or decreasing function (Deterministic+ and
Deterministic−), and adaptive control based on the population average fitness value (Average Fitness) or the average
number of individuals that resolved the sample cases
(Min Resolved). Figs. 9 and 10 report the experimental
results of these extensions. The obtained results with both the
KDD99 and Adult data sets are compared to those
obtained with the original dynamic algorithms without frequency control (dynamic), where f = 1.

Fig. 7. Variation of the best individual FPR according to the re-sampling frequency on the KDD99 train and test sets ((a) and (b)) and on the Adult train
and test sets ((c) and (d)).

Fig. 8. Accuracy gap on the test data set obtained with different re-sampling frequencies with respect to the one-generation-wise sampling, for the KDD99 test data
set (left) and the Adult test data set (right).
With regard to the mean learning time, the results in
Fig. 9 confirm the general behaviour described in Section 8.1. It is essentially the DSS mean learning time that
is affected by the introduction of any controlling technique,
and this impact holds for both the deterministic and adaptive
approaches. Indeed, the learning time varies with the number of samples generated through a GP run. No
significant time saving is observed with the three other sampling methods.
However, this behavior changes with regard to the accuracy
and FPR metrics on the test set (Fig. 10). Let us consider
first the results obtained for the KDD99 data set. For each
dynamic sampling method, at least one frequency controlling approach is able to improve its learning performance, with a gap of up to 10% for the accuracy metric. For
example, for the KDD data set, the 'Average Fitness' controlling technique
helps the BRSS sampling algorithm
achieve an improvement of up to 12% (Fig. 10 (a)), but this is not
the case with the FPR metric. Likewise, for RSS, the
'Deterministic+' and 'Min Resolved' techniques have
allowed the accuracy to move from values around 80% to
values greater than 90%. Moreover, the two deterministic
methods (Deterministic+ and Deterministic−) and the
adaptive method 'Min Resolved' have significantly improved the accuracy values for DSS sampling, which
increased by 10% to 12%. However, a significant decrease
of the FPR quality has been recorded. An exception is
observed for the BUSS algorithm, where the accuracy and
FPR measures record a very small improvement with all
the controlled frequency techniques (Fig. 10).
For the Adult data set, the improvements are small or
missing. For example, for the DSS and RSS methods, the
accuracy value improves by around 2% with the 'Average Fitness' adaptive sampling. Similarly, no significant improvements are obtained for the FPR measures.
In fact, according to the second experimental study,
introducing sampling frequency control into the GP
engine, even when it does not improve the results,
does not degrade the performance,
except in some cases for the FPR measure.
To summarize, with adaptive sampling, the computational cost can be improved for the underlying
dynamic sampling algorithm only if the fitness case selection process is time consuming, as it is for DSS. Otherwise, the controlling predicates can well improve the
learning performance, especially the accuracy metric.
However, they do not have a proven, generalized positive effect on the learning quality and thus need to
be refined.
Fig. 9. Variation of the mean learning time according to the re-sampling frequency controlling strategy obtained for the KDD99 data set (case (a)) and
the Adult data set (case (b)).
Fig. 10. Variation of the accuracy and FPR measures according to the re-sampling frequency controlling strategy for KDD99 data set (a and c) and Adult
data set (b and d).
Fig. 11. Comparison of the accuracy (a), FPR (b) and Meantime (c) measures between hierarchical (blue) and adaptive (red) sampling (Min resolved
approach). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
8.3. Adaptive vs hierarchical sampling
Previous works published in Hmida et al. (2016a, 2016b)
demonstrated that hierarchical (or multi-level) sampling
can help GP achieve lower run-times while keeping
the same performance as one-level dynamic sampling
techniques. Multi-level sampling can provide a trade-off
between speed and generalization ability, especially for complex dynamic sampling techniques such as DSS and TBS
(Section 4).
We demonstrate in this section that adaptive sampling
can provide the same trade-off and can be faster in
some cases. We provide below a comparative study
between the two strategies applied to the KDD-99 database used for the experiments published in Hmida et al.
(2016a, 2016b). Fig. 11 reports the performance of the hierarchical sampling schemes described in Section 4.4: RSS-DSS implemented with two variants (the second, RSS-DSS2,
synchronizes the change of target size between the two
levels), RSS-TBS and BUSS-RSS-TBS. The corresponding
performance values are extracted from Hmida et al.
(2016a, 2016b). For comparison, each figure
also represents the performance of the adaptive
'Min Resolved' approach applied to the dynamic sampling methods
BRSS, BUSS, DSS and RSS. The corresponding values are
denoted 'A-BRSS', 'A-BUSS', 'A-DSS'
and 'A-RSS' respectively.
TBS is a powerful sampling technique with a high computational cost. When implemented in a multi-level sampling approach, the TBS cost disappears while its
performance is preserved (accuracy greater than 92%), especially with the application of BUSS at level 0. Fig. 11
clearly shows that the same accuracy performance can
be reached with adaptive sampling, with a
better learning mean time in some cases. Moreover, adaptive sampling (with the Min Resolved approach) shows a great
advantage over the hierarchical one.
Indeed, the comparative study published in Hmida et al.
(2016a, 2016b) reports how it is possible to reduce the computational cost of some complex sampling techniques with
the hierarchical implementation while keeping the same
performance, or even obtaining better performance according to the accuracy metric. However, an important increase of
the FPR measure was recorded for all the experiments. Fig. 11
(b) clearly shows that this problem can be handled with
adaptive sampling, where the FPR measures are largely
lower.
Hierarchical sampling has a g-generation-wise sampling
frequency, where g might be equal to 1 (g = 1) in the last
level and g ≥ 1 in level 1. To optimize the efficiency of
the GP learners when using this sampling strategy, g is
hand-tuned according to the database size and the fitness
case difficulty. Adaptive sampling can accomplish
the same purpose without the hand-tuning step
for the parameter g. Indeed, the re-sampling frequency is
computed and adjusted along the ongoing evolutionary process. Moreover, adaptive sampling has the advantage of
using a unique sampling method, while hierarchical
sampling needs to combine two or more methods at different levels. It is also possible to conceive a combination
of the two strategies: a research path to explore
would be to introduce frequency control at each level
of the hierarchical sampling.
9. Conclusion
This work proposes a new form of active learning with Genetic Programming based on adaptive sampling. Its main objective is to extend some known
dynamic sampling techniques with an adaptive frequency
control that takes into account the state of the learning process. After a study of the impact of the sampling frequency
variation on the performance of the derived models and on the learning mean time, we proposed increasing and decreasing
deterministic patterns and two adaptive patterns for sampling frequency control. The adaptive patterns are based on
information about the ongoing learning, such as the percentage of resolved cases or the average performance.
Experiments were conducted to test adaptive sampling by
controlling the sampling frequency with simple predicates.
The results showed a slight effect on the learning time without
impacting the learning accuracy. This effect is in the direction of a decrease, but to different degrees depending on
the sampling method.
Many new research paths emerge from this study that
are worthy of further investigation. A first path is the
exploration of other predicates that take into account the
characteristics of the training dataset and the underlying
problem, to find more relevant predicates for GP classifier
improvement. A second is to extend the scope of
adaptive sampling to other sample properties; for
instance, adaptive sampling could downsize or upsize
the sample instead of generating a new one. We may also
combine several sampling strategies and algorithms in a
single method. Then, according to the learning state, a
sample is generated using the suitable strategy in an interleaved way: a different algorithm is used each time a new sample must be created.
Declaration of Competing Interest
The authors declare that they have no known competing
financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
References
Atlas, L. E., Cohn, D., & Ladner, R. (1990) Training connectionist
networks with queries and selective sampling. In Advances in neural
information processing systems (Vol. 2, pp 566–573). MorganKaufmann.
Balkanski, E. & Singer, Y. (2018a). The adaptive complexity of maximizing a submodular function. In: I. Diakonikolas, D. Kempe, M.
Henzinger (eds.), Proceedings of the 50th annual ACM SIGACT
symposium on theory of computing, STOC 2018, Los Angeles, CA,
USA, June 25–29, 2018 (pp 1138–1151). ACM, doi:10.1145/
3188745.3188752.
Balkanski, E. & Singer, Y. (2018b). Approximation guarantees for
adaptive sampling. In: J.G. Dy, A. Krause (eds.) Proceedings of the
35th international conference on machine learning, ICML 2018,
Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, PMLR,
Proceedings of Machine Learning Research (vol 80, pp 393–402). http://
proceedings.mlr.press/v80/balkanski18a.html.
CGP. (2009). Cartesian gp website. http://www.cartesiangp.co.uk.
Cohn, D., Atlas, L., & Ladner, R. (1994). Improving generalization with
active learning. Machine Learning, 15, 201–221.
Curry, R. & Heywood, M. I. (2004). Towards efficient training on large
datasets for genetic programming. In Advances in artificial intelligence,
17th conference of the Canadian Society for Computational Studies of
Intelligence, Canadian AI 2004, Proc., Springer, lecture notes in
computer science (Vol. 3060, pp. 161–174), doi:10.1007/978-3-540-24840-8_12.
Deschrijver, D., Crombecq, K., Nguyen, H. M., & Dhaene, T. (2011).
Adaptive sampling algorithm for macromodeling of parameterized S-parameter responses. IEEE Transactions on Microwave Theory and
Techniques, 59(1), 39–45. https://doi.org/10.1109/TMTT.2010.2090407.
Eiben, A. E., Michalewicz, Z., Schoenauer, M., & Smith, J. E. (2007).
Parameter control in evolutionary algorithms. In Parameter setting in
evolutionary algorithms (pp. 19–46). Springer.
Freitas, A. A. (2002). Data mining and knowledge discovery with
evolutionary algorithms. Berlin, Heidelberg: Springer-Verlag.
Fu, Y., Zhu, X., & Li, B. (2013). A survey on instance selection for active
learning. Knowledge and Information Systems, 35(2), 249–283. https://
doi.org/10.1007/s10115-012-0507-8.
Gathercole, C. (1998). An investigation of supervised learning in genetic
programming. Thesis, University of Edinburgh.
Gathercole, C., & Ross, P. (1994). Dynamic training subset selection for
supervised learning in genetic programming. In Y. Davidor, H. P.
Schwefel, & R. Manner (Eds.). Parallel problem solving from nature III
(LNCS, Vol. 866, pp. 312–321). Jerusalem: Springer-Verlag. https://
doi.org/10.1007/3-540-58484-6_275.
Gathercole, C., & Ross, P. (1997). Small populations over many
generations can beat large populations over few generations in genetic
programming. In Genetic programming 1997: Proc. of the second
annual conf (pp. 111–118). San Francisco, CA: Morgan Kaufmann.
Gonçalves, I. & Silva, S. (2013). Balancing learning and overfitting in
genetic programming with interleaved sampling of training data. In K.
Krawiec, A. Moraglio, T. Hu, A.S. Etaner-Uyar, B. Hu (eds.), Genetic
programming – 16th European conference, EuroGP 2013, Vienna,
Austria, April 3–5, 2013. Proceedings, Springer, lecture notes in
computer science (Vol. 7831, pp. 73–84), doi:10.1007/978-3-642-37207-0_7.
Liu, H., Ong, Y.-S., & Cai, J. (2018). A survey of adaptive
sampling for global metamodeling in support of simulation-based
complex engineering design. Structural and Multidisciplinary Optimization, 57(1).
Harding, S., & Banzhaf, W. (2011). Implementing cartesian genetic
programming classifiers on graphics processing units using gpu.net. In
S. Harding, W. B. Langdon, M. L. Wong, G. Wilson, & T. Lewis
(Eds.), GECCO 2011 Computational intelligence on consumer games
and graphics hardware (CIGPU), ACM, Dublin, Ireland (pp. 463–470).
https://doi.org/10.1145/2001858.2002034.
Hmida, H., Ben Hamida, S., Borgi, A., & Rukoz, M. (2016a). Hierarchical
data topology based selection for large scale learning. In 2016 Intl
IEEE conferences on ubiquitous intelligence & computing, advanced and
trusted computing, scalable computing and communications, cloud and
big data computing, internet of people, and smart world congress,
Toulouse, France, July 18–21, 2016 (pp. 1221–1226). IEEE, doi:10.1109/
UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0186.
Hmida, H., Ben Hamida, S., Borgi, A., & Rukoz, M. (2016b). Sampling
methods in genetic programming learners from large datasets: A
comparative study. In P. Angelov, Y. Manolopoulos, L.S. Iliadis, A.
Roy, M.M.B.R. Vellasco (eds.), Advances in big data – proceedings of
the 2nd INNS conference on big data, October 23–25, 2016, Thessaloniki, Greece, advances in intelligent systems and computing (vol 529,
pp 50–60), doi:10.1007/978-3-319-47898-2_6.
Hunt, R., Johnston, M., Browne, W. N., & Zhang, M. (2010). Sampling
methods in genetic programming for classification with unbalanced
data. In AI 2010: Advances in artificial intelligence – 23rd Australasian
joint conference, proc., Springer, lecture notes in computer science (Vol.
6464, pp 273–282), doi:10.1007/978-3-642-17432-2_28.
Iba, H. (1999). Bagging, boosting, and bloating in genetic programming.
In The 1st annual conference on genetic and evolutionary computation,
Proc., Morgan Kaufmann, San Francisco, CA, USA, GECCO’99 (Vol. 2,
pp. 1053–1060).
Iyengar, V. S., Apté, C., & Zhang, T. (2000). Active learning using
adaptive resampling. In R. Ramakrishnan, S.J. Stolfo, R.J. Bayardo, I.
Parsa (eds.), Proceedings of the sixth ACM SIGKDD international
conference on knowledge discovery and data mining, Boston, MA, USA,
August 20–23, 2000 (pp 91–98). ACM, doi:10.1145/347090.347110.
Kohavi, R. (1996). Scaling up the accuracy of naive-bayes classifiers: A
decision-tree hybrid. In E. Simoudis, J. Han, & U. M. Fayyad (Eds.),
Proceedings of the second international conference on Knowledge
Discovery and Data Mining (KDD-96) (pp. 202–207). Portland,
Oregon, USA: AAAI Press, http://www.aaai.org/Library/KDD/
1996/kdd96-033.php.
Koza, J. R. (1992). Genetic programming – on the programming of
computers by means of natural selection. Complex adaptive systems.
MIT Press.
Lasarczyk, C., Dittrich, P., & Banzhaf, W. (2004). Dynamic subset
selection based on a fitness case topology. Evolutionary Computation,
12(2), 223–242. https://doi.org/10.1162/106365604773955157.
Li, X. B. (2002). Data reduction via adaptive sampling. Communications in
Information and Systems, 2(1), 53–68. https://doi.org/10.4310/
CIS.2002.v2.n1.a3,
https://www.intlpress.com/site/pub/pages/journals/items/cis/content/vols/0002/0001/a003/.
Li, X. B., & Jacob, V. S. (2008). Adaptive data reduction for large-scale
transaction data. European Journal of Operational Research, 188(3),
910–924. https://doi.org/10.1016/j.ejor.2007.08.008, http://www.sciencedirect.com/science/article/pii/S0377221707008867.
Liu, H., Xu, S., Ma, Y., Chen, X., & Wang, X. (2015). An adaptive
Bayesian sequential sampling approach for global metamodeling.
Journal of Mechanical Design, 138(1), 011404. doi:10.1115/1.4031905,
https://asmedigitalcollection.asme.org/mechanicaldesign/article-pdf/
138/1/011404/6227283/md_138_01_011404.pdf.
Liu, Y., & Khoshgoftaar, T. M. (2004). Reducing overfitting in genetic
programming models for software quality classification. In 8th IEEE
international symposium on high-assurance systems engineering
(pp. 56–65). Tampa, FL, USA: IEEE Computer Society. https://doi.
org/10.1109/HASE.2004.1281730.
Luke, S. (2017). ECJ homepage. http://cs.gmu.edu/~eclab/projects/ecj/.
Luo, L., Hou, X., Zhong, J., Cai, W., & Ma, J. (2017). Sampling-based
adaptive bounding evolutionary algorithm for continuous optimization problems. Information Sciences, 382–383, 216–233. https://doi.
org/10.1016/j.ins.2016.12.023.
Miller, J. F. & Thomson, P. (2000). Cartesian genetic programming. In
Genetic programming, European conference, proc., Springer, lecture
notes in computer science (Vol. 1802, pp 121–132).
Nordin, P., & Banzhaf, W. (1997). An on-line method to evolve behavior
and to control a miniature robot in real time with genetic programming. Adaptive Behaviour, 5(2), 107–140. https://doi.org/10.1177/
105971239700500201.
Paris, G., Robilliard, D., Fonlupt, C. (2003). Exploring overfitting in
genetic programming. In: P. Liardet, P. Collet, C. Fonlupt, E. Lutton,
M. Schoenauer (eds.), Artificial evolution, 6th international conference,
evolution Artificielle, EA 2003, Marseilles, France, October 27–30, 2003,
Springer, lecture notes in computer science (Vol. 2936, pp 267–277),
doi:10.1007/978-3-540-24621-3_22.
Pétrowski, A. & Ben Hamida, S. (2017). Evolutionary algorithms. John
Wiley & Sons, USA. doi:10.1002/9781119136378.
Pickett, B. & Turner, C. J. (2011). A review and evaluation of existing
adaptive sampling criteria and methods for the creation of NURBS-based metamodels. In 31st Computers and information in engineering
conference. (Vol. 2, Parts A and B, pp 609–618), doi:10.1115/
DETC2011-47288.
Settles, B. (2010). Active learning literature survey. Tech. Rep. 1648,
University of Wisconsin, Madison.
Simon, D. (2013). Evolutionary optimization algorithms. John Wiley &
Sons, USA.
UCI. (1999). KDD Cup 1999 data. https://archive.ics.uci.edu/ml/datasets/KDD+Cup
+1999+Data.
Yu, X., & Gen, M. (2010). Introduction to evolutionary algorithms.
Decision Engineering. London: Springer. https://doi.org/
10.1007/978-1-84996-129-5.
Zhang, B. T., & Cho, D. Y. (1999). Genetic programming with active data
selection. In B. McKay, X. Yao, C. S. Newton, J. H. Kim, & T.
Furuhashi (Eds.), Simulated evolution and learning (pp. 146–153).
Berlin, Heidelberg: Springer.
Zhang, B. T., & Joung, J. G. (1999). Genetic programming with
incremental data inheritance. The genetic and evolutionary computation
conference, proc (Vol. 2, pp. 1217–1224). Orlando, Florida, USA:
Morgan Kaufmann.