
Adaptive sampling for active learning with genetic programming

Cognitive Systems Research

Active learning is a machine learning paradigm that allows the learner to decide which inputs to use for training. It has been introduced to Genetic Programming (GP) essentially through dynamic data sampling, which is used to address some known issues such as the computational cost, over-fitting and imbalanced databases. Traditional dynamic sampling for GP gives the algorithm a new sample periodically, often at each generation, without considering the state of the evolution. As a result, individuals do not have enough time to extract the hidden knowledge. An alternative approach is to use information about the learning state to adapt the periodicity of the training data change. In this work, we propose an adaptive sampling strategy for classification tasks based on the state of solved fitness cases throughout learning. It is a flexible approach that can be combined with any dynamic sampling method. We implemented several sampling algorithms extended with dynamic and adaptive control of the re-sampling frequency, and tested them on the KDD intrusion detection and the Adult income prediction problems with GP. The experimental study demonstrates how controlling the sampling frequency preserves the power of dynamic sampling while allowing improvements in learning time and quality. We also show that adaptive sampling can be an alternative to multi-level sampling. This work opens many relevant extension paths.

Available online at www.sciencedirect.com (ScienceDirect). Cognitive Systems Research 65 (2021) 23–39. www.elsevier.com/locate/cogsys

Sana Ben Hamida a,*, Hmida Hmida a,b, Amel Borgi c, Marta Rukoz a,d

a Université Paris Dauphine, PSL Research University, CNRS, UMR 7243, LAMSADE, Paris 75016, France
b Université de Tunis El Manar, Faculté des Sciences de Tunis, LR11ES14 LIPAH, Tunis 2092, Tunisia
c Université de Tunis El Manar, Institut Supérieur d'Informatique et Faculté des Sciences de Tunis, LR11ES14 LIPAH, Tunis 2092, Tunisia
d Université Paris Nanterre, Nanterre Cedex 92001, France

Received 30 October 2019; received in revised form 3 July 2020; accepted 25 August 2020; available online 19 September 2020
© 2020 Elsevier B.V. All rights reserved.

Keywords: Genetic programming; Machine learning; Active learning; Training data sampling; Adaptive sampling; Sampling frequency control

* Corresponding author. E-mail addresses: [email protected] (S. Ben Hamida), [email protected] (A. Borgi), [email protected] (M. Rukoz). https://doi.org/10.1016/j.cogsys.2020.08.008

1. Introduction

Evolutionary Algorithms (EA) (Pétrowski & Ben Hamida, 2017; Simon, 2013; Yu & Gen, 2010) are metaheuristics that can address a wide range of problems, such as complex optimization, identification, machine learning and adaptation problems. Applied to machine learning, Evolutionary Algorithms, and especially Genetic Programming (GP) (Koza, 1992), have proven very effective in a wide range of supervised and unsupervised learning problems. However, their flexibility and expressiveness come with two major flaws: an excessive computational cost and a problematic parameter setting. In the supervised learning field, the lack of data may lead to unsatisfactory learners. This is no longer an issue given the numerous data sources and the high data volumes we witness in the era of Big Data. Nonetheless, this worsens the computational problem of GP and precludes its application in data-intensive problems. There have been various research efforts on improving GP when applied to large datasets. These efforts include hardware solutions, such as parallelization, and algorithmic solutions.
The most affordable are software-based solutions that do not require any specific hardware configuration. Sampling is the mainstream approach in this category. It reduces processing time by reducing the data while keeping the relevant records. A complete review of sampling methods used with GP is published in Hmida, Ben Hamida, Borgi, and Rukoz (2016b), extended with a discussion of their ability to deal with large datasets. Sampling methods can be classified with regard to three properties: re-sampling frequency, sampling scheme (or strategy), and sampling quantity. The sampling strategy defines how to select records from the input database. The sampling quantity defines how many samples are needed by the algorithm. The sampling frequency defines when the sampling technique is applied throughout the training process. This last property is the focus of this study. According to the re-sampling frequency, machine learning algorithms use either a unique or a renewable sample; they are called static and dynamic sampling respectively. On the one hand, in static sampling for GP, as in the Historical Subset Selection (Gathercole & Ross, 1994) and bagging/boosting (Iba, 1999; Paris, Robilliard, & Fonlupt, 2003), a representative training set must be selected beforehand. With large datasets, this poses the problem of combining the downsizing and data-coverage objectives. On the other hand, dynamic sampling creates a new sample each generation according to its selection strategy. Consequently, GP individuals do not have enough time to learn from the sampled data, and the population might waste some good resources for solving difficult cases in the current training set. Moreover, re-sampling at each GP iteration might be computationally expensive, especially when using a sophisticated sampling strategy. We propose, in this paper, an extension to dynamic sampling techniques in which sample renewal is controlled through a parameter that adapts the sampling to the learning process.
This extension aims to preserve the original sampling strategy while enhancing learning robustness and/or learning time. After studying the effect of the re-sampling frequency on the training quality and learning time, we propose two predicates to implement adaptive sampling based on the status of resolved fitness cases. These predicates are tested and compared with two deterministic variation rules defined by two functions with increasing and decreasing patterns. The objective of this study is to demonstrate that controlling the sampling frequency with deterministic or dynamic functions does not degrade the results; on the contrary, in some cases it allows an improvement in quality and learning time. This paper is organized as follows. The next section gives an overview of adaptive sampling in active machine learning. In Section 3, we present the background of this work in GP and the design decisions needed to add dynamic sampling to the GP engine. Section 4 reviews the sampling methods for active learning with GP that are involved in the experimental study. In Section 5, we study the effect of varying the sampling frequency on the genetic learners. Section 6 introduces the novel sampling approach and explains how it can extend dynamic sampling methods. Then, Section 7 presents an experimental study giving a proof of concept of adaptive sampling, and Section 8 traces its effect on the learning process through a discussion of the registered results. The main results are compared to the results of three multi-level dynamic sampling methods published in Hmida, Ben Hamida, Borgi, and Rukoz (2016a) to demonstrate how adaptive sampling can be an alternative to hierarchical sampling. Finally, we give some conclusions and propose further developments.

2. Related works: adaptive sampling

In this paper, we are mainly interested in sampling methods that aim at reducing the original training dataset size by substituting it with a much smaller representative subset, thus reducing the evaluation cost of the learning algorithm. Two major classes of sampling techniques can be laid out: static sampling, where the training set is selected independently from the training process and remains unmodified along the evolution, and active sampling, also known as active learning, which can be defined as 'any form of learning in which the learning program has some control over the inputs on which it trains' (Atlas, Cohn, & Ladner, 1990; Cohn, Atlas, & Ladner, 1994). With active sampling, the training subsets are periodically (often at each iteration of the learning algorithm) built and modified using a special technique associated with the learning algorithm along the evolution. In the machine learning field, 'the key hypothesis is that if the learning algorithm is allowed to choose the data from which it learns, to be curious, if you will, it will perform better with less training' (Settles, 2010). When the active sampling depends on some component of the machine learning engine, such as data information or solution quality, it becomes adaptive. In the past two decades, several approaches to adaptive sampling have been proposed to deal with large datasets in several domains. Xiao-Bai Li and Varghese S. Jacob use adaptive sampling for data reduction based on the chi-square statistic for measuring the goodness-of-fit between the distributions of the reduced and full datasets (Li, 2002; Li & Jacob, 2008). Iyengar et al. apply adaptive resampling to the active learning task for classification problems (Iyengar, Apté, & Zhang, 2000). Reviews of active learning approaches, mainly for classification problems, are presented in Fu, Zhu, and Li (2013) and Settles (2010). More recently, Luo et al.
have proposed an adaptive bounding evolutionary algorithm based on adaptive sampling for continuous optimization problems (Luo, Hou, Zhong, Cai, & Ma, 2017). Their algorithm starts updating the boundaries of the variables after n0 generations of the evolution process, and then updates the boundaries every ng generations afterwards, using a fitness-based bounding selection strategy over multiple previous generations. In Balkanski and Singer (2018a, 2018b), an adaptive sampling technique for maximizing monotone sub-modular functions under a cardinality constraint is presented. Various adaptive sampling criteria for the development of meta-models based on non-uniform rational B-splines (NURBS) are presented in Pickett and Turner (2011). Adaptive sampling has also been used in global meta-modeling for computer simulation models. Indeed, simulation models can approximate detailed information of real-world physical problems, but require huge computational resources. Thus, an adaptive sequential sampling strategy allows constructing accurate global meta-models with fewer points than other sampling strategies such as space-filling sequential sampling (Haitao, Yew-Soon, & Jianfei, 2018). An adaptive sampling for parametric macro-modeling of a microwave antenna is proposed in Deschrijver, Crombecq, Nguyen, and Dhaene (2011). A balance strategy performing adaptive sampling by circularly looping through a search pattern that contains several weights, from global to local, is presented in Liu, Xu, Ma, Chen, and Wang (2015). A survey of adaptive sampling for global meta-modeling can be found in Haitao et al. (2018), which considers four categories of adaptive sampling: variance-based, query-by-committee-based, cross-validation-based, and gradient-based adaptive sampling.
Our work tackles the adaptive sampling approach for active learning, particularly for Genetic Programming engines. We present it in the next section.

3. Background: GP and active learning

3.1. Genetic Programming engine

As any EA, GP evolves a population of individuals throughout a number of generations. A generation is in fact an iteration of the main loop, as described in Fig. 1. Each individual represents a complete mathematical expression or a small computer program. Standard GP uses a tree representation of individuals, built from a function set for nodes and a terminal set for leaves. When GP is applied to a classification problem, each individual is a candidate classifier; the objective is thus to find the best classifier. The terminal set is composed of the dataset features and some randomly generated constants, and the function set contains mostly arithmetic and logic functions. As for the fitness function, it is often based on some learning performance measure such as accuracy. The main steps of GP with dynamic or active sampling are:

1. Randomly create a population of individuals where tree nodes are taken from the given function and terminal sets, then evaluate their fitness by executing each program tree against the initial training subset.
2. According to a fixed probability, cross or mutate individuals to create new offspring.
3. Select a new training subset with the given sampling algorithm.
4. Evaluate the offspring against the new sample and build the new population by selecting the best individuals from parents and offspring according to their fitness values.
5. Repeat steps 2 to 4 until a stop criterion is met.

Fig. 1. Genetic Learning evolutionary loop. Steps 1, 2 and 4 concern the traditional GP loop; step 3 deals with the dynamic sampling.
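As an illustration only, the five steps above can be sketched with a toy learner. Real GP evolves program trees; here an individual is reduced to a single decision threshold, and `draw_sample` stands in for the sampling algorithm (all names are hypothetical, not from the paper):

```python
import random

def evolve(dataset, pop_size=20, generations=10, sample_size=50, seed=0):
    """Toy sketch of the GP loop with population-wise dynamic sampling."""
    rng = random.Random(seed)

    def draw_sample():
        # Step 3: select a new training subset (uniform random, RSS-style)
        return rng.sample(dataset, min(sample_size, len(dataset)))

    def fitness(ind, sample):
        # Accuracy of the threshold rule "x >= ind  =>  class 1"
        return sum((x >= ind) == y for x, y in sample) / len(sample)

    # Step 1: random initial population, evaluated on an initial sample
    population = [rng.uniform(0.0, 1.0) for _ in range(pop_size)]
    sample = draw_sample()
    scored = [(fitness(ind, sample), ind) for ind in population]

    for g in range(generations):
        # Step 2: variation (Gaussian mutation of each parent)
        offspring = [ind + rng.gauss(0.0, 0.1) for _, ind in scored]
        # Step 3: renew the training subset before evaluation
        sample = draw_sample()
        # Step 4: evaluate offspring on the new sample, keep the best
        candidates = scored + [(fitness(ind, sample), ind) for ind in offspring]
        scored = sorted(candidates, key=lambda p: p[0], reverse=True)[:pop_size]
        # Step 5: loop until the generation budget (the stop criterion) is met
    return scored[0]  # best (fitness, individual) pair
```

With a dynamic sampling method other than random selection, only `draw_sample` would change; the loop structure stays the same.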
The evaluation is the prevailing step with regard to the overall computation cost; it depends simultaneously on the sample size, the population size and the complexity of each individual. With active sampling, the training subset is changed regularly before the evaluation step, so that only the best individuals, those fitting the successively provided datasets, survive along the evolution.

3.2. Active learning for GP

In a GP engine implementing active learning, the underlying sampling techniques are tightly related to the evolutionary mechanism. They allow a dynamic change of the training dataset along the learning process. Sampling the training database was first used to boost the learning process, to avoid over-fitting or to handle imbalanced data classes in classification problems (Iba, 1999; Liu & Khoshgoftaar, 2004; Paris et al., 2003). Later, it was introduced for genetic learners as a strategy for handling large input databases. For genetic learning, several sampling techniques have been proposed as solutions to one or several of these problems. They can be classified into two categories: static sampling methods and dynamic sampling methods. With static sampling, the sampling method is performed only once for each genetic learning run. In contrast, with dynamic sampling, the training subset is changed periodically across the learning process. The data selection strategy is based on some dynamic criterion, such as random selection, weighted selection, incremental selection, etc. We distinguish one-level sampling methods, using a single selection strategy, from multi-level (hierarchical) sampling methods, using multiple selection strategies associated in a hierarchical way.

3.2.1. Designing a dynamic sampling technique for GP

A sampling technique can be formulated as follows.
If we consider B as the database storing all available records for the training process, sampling consists in selecting a subset S of records from B such that S ⊂ B and |S| ≪ |B|. To introduce a sampling technique for GP, in addition to the sampling strategy defining how to select fitness cases from the database, two important parameters have to be designed:

- sampling frequency: how often the training subsets are changed across the learning process;
- sampling quantity: how many subsets are needed for the evaluation step.

For evolutionary machine learning techniques, the sampling quantity can be individual-wise, population-wise or sub-population-wise. In the individual-wise case, a new data sample Sj is extracted for each individual in the population and each solution is evaluated independently, which might drastically raise the computational cost. The sub-population-wise case can be used only if the genetic learner evolves sub-populations with a co-evolutionary mechanism. These two cases are not included in our study; only the population-wise case, using one sample for the whole population, is considered. With genetic learning, it is possible to mine a single subset S throughout an evolutionary run, used to evaluate all the individuals in the population. This is the run-wise sampling frequency (Freitas, 2002). This sampling approach is also known as static sampling, where the learner obtains all the input training data at once and keeps it unchanged across the learning process. All methods in this category use the run-wise sampling frequency and are population-wise, like the Historical Subset Selection (Gathercole & Ross, 1994), or sub-population-wise, like the bagging/boosting methods (Iba, 1999; Paris et al., 2003). When the sampling technique is called by the learner to change the training sample along the evolution, the method uses the g-generation-wise sampling frequency.
In this case, every g generations, a new subset S is extracted using the designed sampling strategy and used for the evaluation step. Methods in this category are known as active sampling techniques. When g = 1, the population is evaluated on a different data subset each generation and the sampling frequency is generation-wise. All the dynamic sampling techniques introduced for GP use this frequency, where g = 1. When using a complex sampling strategy and a relatively large sample size, the computational cost of the learning process might be very high. Moreover, with the generation-wise frequency, the population does not have enough time to adapt in order to extract the hidden knowledge in the current training sample.

4. Active sampling with GP

To select a training subset S from the database B, many approaches have been proposed for either static or active sampling. For static sampling, the database is partitioned before the learning process, based essentially on some criteria or some features in the data. This sampling strategy is not discussed in this paper. For active sampling, we identify five main approaches used with GP: stochastic sampling, weighted sampling, data-topology based sampling, balanced sampling and incremental sampling. With stochastic and incremental sampling, fitness cases are selected randomly, with a fixed and an increasing size respectively. With weighted sampling, a weight is computed for each fitness case based on some features and/or the difficulty to solve the corresponding record. Balanced sampling was first introduced to deal with the problem of class imbalance in training databases; like stochastic and weighted sampling, it can be applied to deal with large datasets or to decrease the training computational cost. Data-topology based sampling uses some information about the features to measure similarity and connections between fitness cases.
These measures help to create heterogeneous samples for a better training. Each approach was introduced to provide solutions to a specific machine learning problem, like over-fitting or data imbalance, but all of them can be applied to deal with large datasets and decrease the training computational cost. Another approach consists in combining techniques from the five above approaches in a hierarchical way. Methods in this category are proposed especially to deal with very large databases. They combine two or three sampling techniques in two or three levels. In the first level (and the second level for the 3-level methods), the corresponding sampling technique is applied less frequently than the sampling technique in the last level. To study the effect of the sampling frequency parameter on the dynamic sampling efficiency, we selected four methods presented in the following subsections: Random Subset Selection (RSS), Dynamic Subset Selection (DSS) and Balanced Sampling in two variants (BRSS, BUSS). Additionally, four hierarchical sampling approaches are selected for comparison purposes: RSS-DSS, DSS-DSS, RSS-TBS and BUSS-RSS-TBS. All the selected techniques are presented in the following subsections. Note that other sampling techniques have been experimented with GP to improve its performance or to deal with some learning difficulties. For example, Interleaved Sampling (Gonçalves & Silva, 2013) was introduced to address over-fitting problems; it alternates, from one generation to another, between two training sets composed either of all the instances in the training database or of only one selected fitness case.

4.1. Random Subset Selection (RSS)

Random Subset Selection (Gathercole & Ross, 1994) is a simple algorithm that selects at every generation g a record i among the T records of the initial dataset B with a uniform probability P_i(g):

P_i(g) = S / T    (1)

where S is the target subset size. Its steps in the GP engine are summarized in Algorithm 1.

Algorithm 1. RSS
1: Select instances from B with a uniform probability to create a subset S(0)
2: g ← 0
3: for all generations g < gmax do
4:   Evaluate programs using the subset S(g)
5:   Evolve parents
6:   Generate randomly a new dataset S(g+1)
7: end for

Note that other variants have been proposed based on the same data selection strategy, differing from RSS only in some small details. For example, the Stochastic Sampling introduced in Nordin and Banzhaf (1997) samples a training subset for each individual in the GP population and at each generation. It is an individual-wise technique, as described in Section 3. A second method, called Incremental Random Selection (Zhang & Cho, 1999; Zhang & Joung, 1999), constructs subsets with a growing size by adding an identical number of fitness cases at every generation until the whole training database is used. Variants of the RSS technique are not subject to our experimental study.

4.2. Dynamic Subset Selection (DSS)

The DSS algorithm (Gathercole & Ross, 1994, 1997; Gathercole, 1998) is inspired by boosting techniques and aims to bias selection towards difficult cases (i.e. fitness cases frequently unsolved by the best solutions) and fitness cases that have not been selected for several generations. DSS computes two measures for each record i: a difficulty degree D_i(g) and an age A_i(g), starting at 0 in the first generation and updated at every generation g. The difficulty is incremented for each classification error and reset to 0 if the fitness case is solved. The age is equal to the number of generations since the last selection, so it is incremented when the fitness case has not been selected and reset to 0 otherwise. The selection probability P_i(g) in Eq. (3) depends on each fitness case's weight W_i(g) (Eq. (2)):

∀i : 1 ≤ i ≤ T,  W_i(g) = D_i(g)^d + A_i(g)^a    (2)

where d and a are given parameters denoting respectively the difficulty exponent and the age exponent.

∀i : 1 ≤ i ≤ T,  P_i(g) = (W_i(g) × S) / Σ_{j=1}^{T} W_j(g)    (3)

Algorithm 2 describes how the DSS technique is included in the GP engine.

Algorithm 2. DSS
1: initialize, for each record i in B, the difficulty degree D_i(0) and the age A_i(0)
2: g ← 0
3: for all generations g < gmax do
4:   initialize an empty subset S(g)
5:   for all records i in B do
6:     if g = 0 then
7:       P_i(g) = S / |B|
8:     else
9:       compute P_i(g) using Eq. (3)
10:    end if
11:    add record i to S(g) with probability P_i(g)
12:    if i is selected then
13:      A_i(g+1) = 0
14:    else
15:      A_i(g+1) = A_i(g) + 1
16:    end if
17:  end for
18:  Evaluate programs using the subset S(g)
19:  Update, for each record i in B, the difficulty degree D_i(g+1)
20:  Apply genetic operators
21: end for

4.3. Balanced sampling

Balanced sampling (Hunt, Johnston, Browne, & Zhang, 2010) aims to improve classifier accuracy by correcting the imbalance between majority and minority class instances in the original dataset. Some methods are based on the minority class size (N_min) and thus reduce the number of instances, like the methods studied in this paper. Several approaches have been proposed; we summarize hereafter three sampling techniques used with GP: first, Static Balanced Sampling, which selects cases with uniform probability from each class, without replacement, until obtaining a balanced subset of the desired size; then, Basic Under-Sampling (BUSS) (resp. Basic Over-Sampling), which selects all minority (resp. majority) class instances and then an equal number from the majority (resp. minority) class randomly. With BUSS, the sample size is equal to 2 N_min, where N_min is the minority class size.

4.4. Multi-level or hierarchical active sampling

Hierarchical sampling is based on multiple levels of sampling methods, inspired by the concept of a memory hierarchy.
It combines several sampling algorithms that are applied at different levels. Its objective is to deal with large datasets that do not fit in memory, and simultaneously to provide the opportunity to find solutions with a greater generalization ability than those given by one-level sampling techniques. The data subset selections at each level are independent. Fig. 2 shows the main steps performed to obtain the final training subset. The usual schema is made up of three levels. The first one consists in creating blocks of a given size from the original dataset, which are recorded on the hard disk. The remaining two levels are a combination of two active sampling methods.

Fig. 2. Main steps of hierarchical sampling: the case of three-level sampling.

Curry et al. extended the DSS algorithm into a 3-level hierarchy (Curry & Heywood, 2004). At level 0, the database is partitioned into blocks that are sufficiently small to reside within RAM alone. Then, at level 1, one block is chosen from these partitions based on the RSS or DSS sampling techniques. Finally, at level 2, the selected block is considered as the full dataset on which DSS is applied for several generations. Depending on the level 1 algorithm, two approaches are possible: the RSS-DSS hierarchy or the DSS-DSS hierarchy. Based on the same idea, Hmida et al. proposed two new variants of hierarchical sampling: RSS-TBS and BUSS-RSS-TBS (Hmida et al., 2016a). RSS-TBS uses the Topology Based Subset Selection (Lasarczyk, Dittrich, & Banzhaf, 2004) at level 2 instead of RSS or DSS. TBS uses, for the sampling process, an undirected weighted graph representing the relationship between fitness cases in the database. Vertices in the graph are fitness cases, and each edge carries a weight measuring a similarity or a distance induced from the individuals' performance.
Then, cases having a tight relationship cannot be selected together in the same subset, assuming that they have an equivalent difficulty for the population. The second variant, BUSS-RSS-TBS, extends the first with Basic Under-Sampling at the level 0 block creation. BUSS favors the minority class by computing the block size according to its cardinality; for the majority class, an equal number of instances is selected randomly.

5. Controlling sampling frequency with GP

5.1. The sampling frequency feature

The sampling frequency (f) is a main parameter of any active sampling technique. It defines how often the training subset is changed across the learning process. When f = 1, the training sample is extracted at each generation and the sampling approach is considered a generation-wise sampling technique. Most of the sampling techniques applied with GP belong to this category; this is the case for the techniques described in Section 4. When f is set to 1, the individuals in the current population have only one generation to adapt their genetic material to the current environment characterized by the training sample. For an evolutionary algorithm, it is very difficult, even impossible, for a population to solve all the cases in a training set in one generation. A higher value of f corresponds to a lower number of generated samples, and might not allow the population to see all the fitness cases available in the database. We think that the sampling frequency must be updated according to the evolution state and the difficulty of the current training set. Note that for hierarchical sampling, described in Section 4.4, a sampling frequency value is needed for each level. For example, the RSS-DSS method needs a sampling frequency f1 for level 1, which defines when to change the training block with the RSS technique.
A second frequency value f2 is needed for level 2, defining the re-sampling frequency of the DSS technique.

5.2. State of the art

As detailed in Section 4, the active sampling techniques proposed for GP essentially studied how to select examples for the training subset. The sampling frequency is a constant value set as a user parameter. Thus, for each sampling technique presented or cited in Section 4 (RSS, DSS, BUSS, etc.), the sampling frequency is set as a constant before the learning process. For the one-level techniques, the sampling frequency is usually set to f = 1. For the multi-level techniques, the sampling frequency at the low level is set to 1, as for one-level methods, and at the higher levels it depends on the complexity of the problem, often varying from 40 to 100. The general strategy used to set the sampling frequency parameter is reported in the following algorithm.

Fixed Sampling Frequency. The sampling frequency f is set before starting the GP run, like any GP parameter. This value remains unchanged until the last generation and is usually equal to 1. This can be represented by the following algorithm:

Algorithm 3. Fixed Sampling Frequency
Require: f {sampling frequency}
1: for all generations g < gmax do
2:   if g mod f = 0 then
3:     re-sample
4:   end if
5: end for

6. The proposed sampling approach

Three main approaches are possible to control any EA parameter: deterministic, adaptive and self-adaptive (Eiben, Michalewicz, Schoenauer, & Smith, 2007). Deterministic control uses a deterministic rule to alter the EA parameter along the evolution. Adaptive control uses feedback from the search state to define a strategy for updating the parameter value. With self-adaptive control, the parameter is encoded within the chromosome and evolves with the population. The sampling frequency can be considered an EA parameter, and can therefore be controlled using the same strategies.
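As a baseline, the fixed-frequency scheme of Algorithm 3 amounts to the following loop. This is a sketch: `resample` and `evolve_one_generation` are illustrative stand-ins for the sampling algorithm and the GP engine, not functions from the paper.

```python
def run_fixed_frequency(resample, evolve_one_generation, gmax, f):
    """Sketch of Algorithm 3: the sample is renewed every f generations,
    with f fixed before the run. Returns the number of samples drawn."""
    sample = None
    renewals = 0
    for g in range(gmax):
        if g % f == 0:          # the fixed-frequency predicate
            sample = resample()
            renewals += 1
        evolve_one_generation(sample)
    return renewals
```

With f = 1 this degenerates to the usual generation-wise re-sampling; larger f values simply stretch the lifetime of each sample.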
We propose in this section a deterministic and an adaptive approach to adjust this parameter along the evolutionary learning process. For the deterministic control, an increasing and a decreasing scheme are experimented. For the adaptive control, we propose an adaptive scheme based on feedback from the learning state, such as the proportion of solved fitness cases in the current sample or the improvement rate of the best/average fitness.

6.1. Deterministic sampling frequency

When the sampling frequency is updated with a deterministic control, f takes different values throughout the GP run. These values are determined by a function that gives the same series of values each run. Thus, the frequency may be increasing, decreasing or following a more complex curve. When f has an increasing scheme, the training process starts with short-lifetime (in number of generations) samples, giving the population the opportunity to see the maximum number of fitness cases in the first training iterations. By the end of the run, the samples are learned over a large number of generations, which might help the population to tune the genetic material of the current solutions. To achieve this approach, we use a deterministic function based on the generation number: f = (C·g)^α, where the coefficients C and α control the shape of the curve of f. Their values are set with the GP parameters. The following algorithm summarizes the corresponding steps:

Algorithm 4. Deterministic Sampling Frequency
1: for all generations g < gmax do
2:   f = (C·g)^α {C, α ∈ R}
3:   if g mod f = 0 then
4:     re-sample
5:   end if
6: end for

The opposite process (i.e. decreasing frequency) uses the same steps but updates the frequency with a decreasing function such as f = (C·(gmax − g))^α. This scheme can be useful when the data set contains fitness cases that are difficult to solve. The GP engine first focuses on the current sample in order to help the population reach the target area in the search space.
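The two deterministic schedules can be sketched as follows. The coefficient values are the ones used later in the experiments (C = 2, α = 0.5); rounding f to an integer period of at least 1 is our assumption, needed for the modulo test of Algorithm 4:

```python
# Deterministic schedules for the re-sampling period (Algorithm 4).

def increasing_f(g, C=2.0, alpha=0.5):
    """f = (C*g)^alpha, rounded to an integer period >= 1."""
    return max(1, int(round((C * g) ** alpha)))

def decreasing_f(g, g_max, C=2.0, alpha=0.5):
    """f = (C*(g_max - g))^alpha: re-sampling becomes frequent near the end."""
    return max(1, int(round((C * (g_max - g)) ** alpha)))

def should_resample(g, f):
    # Algorithm 4 re-samples whenever the generation index hits the current period.
    return g % f == 0
```

With the increasing scheme, the period grows from 1 at the start of the run (frequent sample changes) to about 20 at generation 200; the decreasing scheme follows the mirrored curve.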
When the sampling becomes more frequent, its focus shifts to fine-tuning the solutions.

6.2. Adaptive sampling frequency

The fundamental idea behind adaptive sampling by controlling the sampling frequency is to add an extra parameter to sampling algorithms, acting as a moderator or re-sampling regulator. While dynamic methods use a fixed renewal frequency equal to 1, adaptive sampling decides to generate a new sample for the subsequent generations according to a condition that must be satisfied by the learning state. Fig. 3 depicts this approach.

Fig. 3. Adaptive vs Dynamic sampling.

The GP-based learner interacts with the sampling process by providing adequate information about the learning process, needed to perform the underlying selection strategy. For example, DSS needs to know the misclassified cases to update the difficulty value. Then, the sampling algorithm delivers a new sample generated according to the updated parameters. With adaptive sampling, a predicate controls the re-sampling decision. We assume that any input required to evaluate this predicate is available within the data dispensed by the GP engine. Hereafter are two examples of adaptive techniques. The first uses a threshold on the population mean fitness to detect whether the population is making improvements. In the case of very small or no improvement, a new sample is generated since the old one is fully exploited (Algorithm 5).

Algorithm 5. Adaptive Sampling Frequency
Require: r {mean fitness variation rate: (mean fitness(g) − mean fitness(g−1))/mean fitness(g−1)}
Require: t {threshold}
1: for all generations g < gmax do
2:   if r < t then
3:     re-sample
4:   end if
5: end for
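A minimal Python sketch of this first predicate (Algorithm 5); the default threshold is the one used in the experiments (0.001), and the sign convention follows the algorithm's statement, which assumes a fitness to be maximized:

```python
# Algorithm 5: re-sample when the population mean fitness stalls.

def mean_fitness_predicate(prev_mean, curr_mean, t=0.001):
    """True -> draw a new sample; the old one is considered fully exploited."""
    r = (curr_mean - prev_mean) / prev_mean   # relative improvement rate
    return r < t
```

For instance, a mean fitness moving from 0.50 to 0.5001 (a 0.02% improvement, below the threshold) triggers re-sampling, whereas a move to 0.51 (2%) keeps the current sample.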
To design a predicate for the adaptive sampling, various information about the current state of the evolution and the training process can be retrieved from the GP engine, such as the generation number, the population mean fitness, the mean fitness improvement rate, the best fitness improvement rate, etc. This information can be used in a Boolean condition or within a dynamic function. The straightforward approach is to define a threshold per measure; the predicate is then a comparison of the current value to the corresponding threshold. For example, if we define a threshold of 0.002 on the best fitness improvement rate, then GP will continue to use the same sample as long as the best fitness of the current generation is better than that of the previous generation by 0.2% or more. Otherwise, a new sample must be created. In a more complex approach, the threshold can be auto-adapted to the learning process. With adaptive sampling, the sampling frequency f is adjusted according to the general training performance to accommodate the current state of the learning process. Therefore, f can increase or decrease by a varying amount. Our approach is based on the current state of the population. It uses either the evolution of the mean fitness or the number of resolved cases to decide whether to create a new sample or to carry on learning with the previous one. We assume that less performing learners need more time to improve their performance and, symmetrically, that learners that are efficient on a particular sample need to see different data from a new sample. As said above, adaptive sampling can rely on various learning performance indicators, which it retrieves from the GP engine. The second example is based on measuring the mean number of individuals (learners) that have resolved each record in the training sample. When this value reaches a designated threshold, new records are selected in a new sample.
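This second predicate (called 'Min Resolved' in the experiments) might look as follows. The per-record counts of successful individuals and the 0.5 default threshold mirror the description above; the function and argument names are illustrative:

```python
# 'Min Resolved' predicate: re-sample once, on average, a given fraction of
# the population resolves each record of the current sample.

def min_resolved_predicate(solved_counts, pop_size, threshold=0.5):
    """solved_counts[i] = number of individuals that resolved record i."""
    mean_fraction = sum(solved_counts) / (len(solved_counts) * pop_size)
    return mean_fraction >= threshold   # True -> draw a new sample
```

With a population of 200 and a three-record sample resolved by 50, 100 and 150 individuals respectively, the mean resolved fraction is 0.5 and a new sample is requested.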
In the following sections, we give details about the settings used for the conducted experiments and the implementation of adaptive sampling over some dynamic sampling algorithms discussed in Section 4. Then we expose the experimental results and discuss them to analyze the effect of sampling frequency and adaptive sampling on GP performance in solving the considered problems.

7. Experimental settings

7.1. Cartesian Genetic Programming

Cartesian Genetic Programming (CGP) (Miller & Thomson, 2000) is a GP variant where individuals represent graph-like programs. It is called "Cartesian" because it uses a two-dimensional grid of computational nodes implementing directed acyclic graphs. Each graph node encodes a function from the function set. The arguments of the encoded function are provided by the inputs of the node, and its output carries the result. CGP shows several advantages over other GP approaches. Unlike trees, there can be more than one path between any pair of nodes, which enables the reuse of intermediate results. A genotype can also have multiple outputs, which makes CGP able to solve many types of problems, and classification problems in particular (Harding & Banzhaf, 2011). Moreover, CGP has the great advantage of counteracting the bloating effect (genotype growth), a frequent phenomenon with other GP representations. CGP is easy to implement, and it is highly competitive compared to other GP methods.

7.2. Data sets

For the experimental study, we selected from the UCI Machine Learning repository two databases: the KDD-99 database for the intrusion detection problem and the Adult database for income prediction. The KDD-99 base is widely used to validate the performance of various machine learning classifiers. The corresponding problem consists in classifying connections into normal or attack classes. It uses a large data set called the 10% KDD-99 data set (UCI, 1999).
The data set is already divided into training and test sets, as presented in Table 1. Each record is described by 41 features. The original data is preprocessed with the following steps:

– Transforming discrete nominal attributes to numeric values,
– Scaling data using the MinMax scaler: X_sc = (X − X_min)/(X_max − X_min),
– Binarization of attack classes: the problem is converted into a binary classification problem with a 'Normal' class and an 'Attack' class. The original four attack types (Dos, Probe, R2L and U2R) are fused into a single class.

The Adult data set is a UCI data set donated by Kohavi (1996). It involves predicting whether income exceeds 50,000 dollars a year based on census data. The original data set consists of 48,842 observations, each described by six numerical and eight categorical attributes (see Table 2). The feature set contains 14 attributes describing the salary, age, gender, work class, education, native-country, marital-status, race, occupation, relationship, capital-gain, hours-per-week and capital-loss. All observations with missing values were removed from consideration. Otherwise, the data is preprocessed according to the same steps described above. Two-thirds of the base are used for training and the other third as a test set. As for the KDD-99 data set, the problem is converted into a binary classification problem where:

– Probability for the label '>50K': 23.93%
– Probability for the label '<=50K': 76.07%

Table 2. Adult dataset.

Class            Training Set   Test Set
Positive cases   7720           3896
Negative cases   24,542         12,374
Total examples   32,252         16,280

The imbalance between the classes is much higher for the KDD-99 database than for the Adult database. The main purpose of this choice is to study the utility of introducing adaptive sampling in both cases, where the database is imbalanced or not.

7.2.1. CGP settings

The design of the CGP parameters used in this work is summarized in Table 3.
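The preprocessing steps described above (min-max scaling and attack-class binarization) reduce to a few lines. A minimal sketch, assuming columns are plain Python lists; the function names are illustrative, not from the authors' code:

```python
# Preprocessing sketch for the two data sets.

def min_max_scale(column):
    """X_sc = (X - X_min) / (X_max - X_min), per feature column."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

def binarize_kdd_label(label):
    # Dos, Probe, R2L and U2R are fused into a single 'Attack' class.
    return 'Normal' if label == 'Normal' else 'Attack'
```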
In this work, the parameter tuning is not fully explored.

7.2.2. Terminal and function sets

The terminal set includes the 41 features of the benchmark KDD-99 dataset and the 14 features of the Adult dataset. The function set includes basic arithmetic, comparison and logical operators, reaching 17 functions (Table 4).

7.3. Performance metrics

We recorded, for each run, the accuracy (Eq. (4)) and the False Positive Rate (FPR) (Eq. (5)) to measure the learning performance on both training and test sets. We also recorded the learning time, measuring the computational cost.

Accuracy = (True Positives + True Negatives) / Total patterns    (4)
FPR = False Positives / (False Positives + True Negatives)    (5)

Table 1. KDD-99 dataset.

Class            Training Set   Test Set
Normal           97,278         60,593
Dos              391,458        229,853
Probe            4107           4166
R2L              1126           16,347
U2R              52             70
Total Attacks    396,743        250,436
Total examples   494,021        311,029

Table 3. CGP parameters.

Parameter                  Value
Population size            256
Sub-populations number     1
Generations number         200
CGP nodes                  300
Inputs for a CGP node      49 (KDD) / 22 (Adult)
Outputs for a CGP node     1 (2 classes)
Tournament size            4
Crossover probability      0.9
Mutation probability       0.04
Fitness                    Minimize classification error

Table 4. Terminal and function sets for GP.

Function (node) set
  Arithmetic operators: +, −, *, %
  Comparison operators: <, >, <=, >=, =
  Logic operators: AND, OR, NOT, NOR, NAND
  Other: NEGATE, IF (IF THEN ELSE), IFLEZE (IF <= 0 THEN ELSE)
Terminal set
  KDD-99 Features: 41
  Adult Features: 14
  Random Constants: 8 in [-2, 2[

The first experiments study the impact of the variation of the sampling frequency on the learning time and the performance indicators; a set of fixed values for f is chosen for this study, given in Section 7.5. The second set of experiments studies the efficiency of the proposed sampling frequency controlling strategies. The results are discussed and then compared with some hierarchical sampling results published in previous works.
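The two quality metrics defined in Section 7.3 (Eqs. (4) and (5)) are simple confusion-matrix ratios; a minimal sketch:

```python
# Performance metrics of Eqs. (4) and (5), from confusion-matrix counts.

def accuracy(tp, tn, fp, fn):
    """(TP + TN) / total patterns."""
    return (tp + tn) / (tp + tn + fp + fn)

def fpr(tn, fp):
    """False Positive Rate: FP / (FP + TN)."""
    return fp / (fp + tn)
```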
For each value of the re-sampling frequency and each controlling technique, 21 runs of each sampling algorithm (RSS, DSS, BUSS, and BRSS) are conducted on each data set. We report the mean learning time of each configuration and the accuracy and FPR values of the best individual.

7.4. Framework

Software framework: among several evolutionary computation frameworks, Sean Luke's ECJ (Luke, 2017) was used in this work to implement and test the CGP. It is an open-source framework written in Java that benefits from many contribution packages, such as the one used here for implementing Cartesian GP, developed by David Oranchak (CGP, 2009). This framework provides a very flexible API using parameter files, well documented in the ECJ owner's manual.

Hardware framework: experiments are performed on an Intel i7-4810MQ (2.8 GHz) workstation with 8 GB RAM running under the Windows 8.1 64-bit Operating System.

8.1. Effect of sampling frequency

To study the effect of the sampling frequency, we analyse first the mean learning time and then the performance metrics defined in Eqs. (4) and (5). Fig. 4 illustrates the variation of the learning time with the re-sampling frequency for the four studied algorithms. Fig. 5 shows the mean learning time lag between the one-generation-wise sampling approach and the g-generation-wise sampling approach for g = f and f ∈ {10, 20, 30, 40, 50}. The shape of the curves in Fig. 4 reveals two distinct behaviors when the sampling frequency increases. The first concerns the BUSS, BRSS and RSS algorithms, which recorded an insignificant decreasing variation of the average learning time. The second behavior is that of DSS, with a more remarkable decrease for both data sets. Fig. 5 clearly illustrates the mean time saving for each sampling frequency f ∈ {10, 20, 30, 40, 50} with respect to the one-generation frequency (f = 1) for both the KDD99 and Adult data sets, applied with the methods BRSS (Fig. 5 (a1) and (a2)), BUSS (Fig. 5 (b1) and (b2)), RSS (Fig. 5 (c1) and (c2)) and DSS (Fig.
5 (d1) and (d2)). The histograms of the BUSS, BRSS and RSS methods have the same time scale; that of DSS has a different scale, given the importance of the time saved compared to the three other methods. Time saving depends on the time needed to perform sample creation with respect to the time spent for a whole generation. This is why the decrease in time is not very important for the BUSS, BRSS and RSS techniques. In the case of KDD99, only 26 s of time saving have been recorded for BRSS with frequency f = 50 (Fig. 5 (a1)), and it is reduced to near-zero for BUSS for all the frequencies (Fig. 5 (b1)). Similarly, the maximum mean time lag for the Adult data set is recorded with f = 50 (Fig. 5 (a2)), and it is negative in some cases with the BUSS sampling (Fig. 5 (b2)). The same finding holds for RSS: the mean time saving is either negative in several cases or reduced to near-zero (Fig. 5 (c1) and (c2)). Thus, when the fitness case selection, with or without class balancing, is carried out randomly, the computation cost is correlated to the population evaluation, since it is the predominant step in the learning time for GP.

7.5. Sampling settings

In the first set of experiments, we tested six values for the sampling frequency: 1, 10, 20, 30, 40 and 50, on the four sampling methods BRSS, BUSS, DSS and RSS, on both the KDD and Adult data sets. BRSS is Balanced RSS, an RSS variant where the random sample is balanced according to a given ratio between the problem classes. In the second part, we implemented four different techniques to control the sampling frequency.
Two deterministic techniques and two adaptive ones, as follows:

– Deterministic+: deterministic control with the increasing function f = (2·g)^0.5,
– Deterministic-: deterministic control with the decreasing function f = (2·(gmax − g))^0.5,
– Average Fitness: adaptive control based on the evolution of the population average fitness, with a threshold of 0.001,
– Min Resolved: adaptive control based on the average proportion of the population representing the individuals that resolved all sample records. We use a minimum threshold of 0.5.

The underlying active sampling algorithms have their own parameters, described in Table 5.

Table 5. Common sampling parameters.

Method             Parameter            Value
All (except BUSS)  Target size          5000
BRSS               Balancing method     Full dataset distribution
BUSS               Target size          416 for KDD / 15,682 for Adult
DSS                Difficulty exponent  1
DSS                Age exponent         3.5

8. Results and discussion

The experimental study is organised in two parts. The aim of the first experiments is to study the impact of the variation of the sampling frequency.

Fig. 4. Variation of the mean learning time with the re-sampling frequency applied on the KDD-99 dataset (a) and the Adult dataset (b).

A general observation can be made with both the KDD-99 and Adult data sets. When the sampling method is not expensive in computational cost, the variation of the learning time with the increase of the sampling frequency is not significant. This is the case for the BUSS, BRSS and RSS methods, which recorded very small positive or negative variations (Fig. 5 (a1), (a2), (b1), (b2), (c1) and (c2)). However, as for the KDD99 database, the learning time is significantly lower with the DSS method when it is applied less frequently (Fig. 5 (d1) and (d2)). The decrease of the mean time lag is proportional to the increase of the re-sampling frequency. The DSS algorithm differs from the other algorithms by updating certain sampling parameters (age and difficulty).
Thus, with DSS, the selection of a fitness case requires the calculation of a probability based on the age and difficulty values over the whole dataset. Therefore, this method needs much more time than the other techniques, which explains the difference in learning time saving. As for the performance metrics, the same analysis is carried out. Figs. 6 and 7 show the effect of the sampling frequency on two learning quality measures: accuracy (Fig. 6 (a) and (b) for the KDD99 data set and Fig. 6 (c) and (d) for the Adult data set) and FPR (Fig. 7 (a) and (b) for the KDD99 data set and Fig. 7 (c) and (d) for the Adult data set). Fig. 8 illustrates the accuracy gap (computed on the test data set) of the different re-sampling frequencies with respect to the one-generation-wise sampling. Figs. 6 and 7 illustrate an irregular shape of the accuracy and FPR variation curves for the three methods using random fitness case selection. Some improvements can be seen with high frequency values (i.e. f = 50) in the case of KDD99. Nevertheless, this remains irregular and cannot be generalized. However, in the case of the Adult data set, a high sampling frequency such as f = 50 decreases the quality of the results for the four sampling techniques. The most noticeable shift is that of the FPR (Fig. 7). However, no empirical correlation with the variation of the re-sampling frequency can be established for BRSS, DSS, and RSS. In the case of KDD-99, only BUSS realizes a decrease in the FPR value when the re-sampling frequency increases, for both training and test sets. The best values are recorded with f = 50 for KDD-99 and with f between 30 and 40 for the Adult data set. However, the BUSS sampling is not suitable for the income prediction problem since there is no class imbalance in the corresponding data set. It is clear that introducing balanced sampling, such as BUSS and BRSS, into the GP engine when it has to learn from imbalanced data helps to improve the quality of the derived models.
It is the case of the KDD99 database. This performance is even higher when GP has more learning time on each data subset, with a high sampling frequency. GP behavior is completely different with the Adult database, which has different characteristics than the KDD99 database. Hence the need to adapt the sampling strategy and frequency to the training data set. Accuracy is a main metric to measure the quality of a classification model. Thus, as for the mean learning time, we computed the gap between the accuracy values obtained with f > 1 and the one-generation sampling (f = 1) for each sampling strategy and for both the KDD99 and Adult databases. Fig. 8 illustrates the obtained measures for f varying between 10 and 50. Although the time saving is low or not significant for the RSS, BUSS, and BRSS methods (Fig. 5), the accuracy lag values illustrated in Fig. 8 show that there exists a sampling frequency able to improve the learning performance of each of these sampling techniques.

Fig. 5. The mean learning time lag with respect to the one-generation-wise sampling approach, for varying re-sampling frequencies, with the KDD99 dataset (left) and the Adult dataset (right) (the X-axis of the DSS histogram has a larger scale).

The appropriate sampling frequency differs from one sampling method to another and according to the database characteristics. In the case of the KDD99 base, high frequencies have greatly improved the results for DSS and BRSS. The nature of the data implies that GP, with sampling techniques, needs longer learning phases to properly adapt its models to the training data. For example, BRSS and DSS performed about 14% better with f = 50 than with f = 1 in terms of accuracy, with a time saving of 26 s for BRSS and 161 s for DSS. This gap decreases to 0.9% and 0.63% as the best improvements accomplished by BUSS and RSS respectively.
As for the Adult data set, all the improvements (with respect to f = 1), although small, are observed essentially with frequencies below 50. With f = 50, the quality of the derived models gets worse for all sampling techniques. This proves that the sampling frequency must be adapted to the sampling method, the training data set and the evolution of the learning process. This first study demonstrated the impact of the sampling frequency on the performance of four sampling methods implemented with GP. The results illustrated in the different figures show that, given a sampling strategy and a training database, there is a frequency that allows GP to achieve a certain optimality. However, it is difficult to hand-tune its value. For these reasons, we propose some solutions to control its value through the GP engine. The following section presents the second experimental study, applied to the adaptive frequency control.

Fig. 6. Variation of the best individual accuracy according to the re-sampling frequency with the KDD-99 train and test sets ((a) and (b)) and with the Adult train and test sets ((c) and (d)).

8.2. Adaptive sampling

To study the efficiency of the sampling frequency controlling strategies introduced in Section 6, we have extended the four dynamic sampling algorithms (BUSS, BRSS, RSS, and DSS) with the four sampling frequency controlling techniques: deterministic control based on an increasing or decreasing function (Deterministic+ and Deterministic-) and adaptive control based on the population average fitness value (Average Fitness) or on the average number of individuals that resolved the sample cases (Min Resolved). Figs. 9 and 10 report the experimental results of these extensions. The obtained results with both
Fig. 7.
Variation of the best individual FPR according to the re-sampling frequency on the KDD99 train and test sets ((a) and (b)) and on the Adult train and test sets ((c) and (d)).

Fig. 8. Accuracy gap on the test data set obtained with different re-sampling frequencies with respect to the one-generation-wise sampling, for the KDD99 test data set (left) and the Adult test data set (right).

KDD99 and Adult data sets are compared to those obtained with the original dynamic algorithms without frequency control (dynamic), where f = 1. With regard to the mean learning time, the results in Fig. 9 confirm the general behaviour described in Section 8.1. It is essentially the DSS mean learning time that is affected by the introduction of any controlling technique. This impact is effective with both the deterministic and adaptive approaches. Indeed, the learning time varies with the number of samples generated through a GP run. There is no significant time saving observed with the three other sampling methods. However, this behavior changes regarding the accuracy and FPR metrics on the test set (Fig. 10). Let us consider first the results obtained for the KDD99 data set. For each dynamic sampling method, at least one frequency controlling approach is able to improve its learning performance, with a gap of up to 10% for the accuracy metric. For example, for the KDD data set, the controlling technique 'Average Fitness' helps the sampling algorithm BRSS achieve an improvement of up to 12% (Fig. 10 (a)), but this is not the case with the FPR metric. Likewise, for RSS, the 'Deterministic+' and 'Min Resolved' techniques have allowed the accuracy to move from values around 80% to values greater than 90%. Moreover, the two deterministic methods (Deterministic+ and Deterministic-) and the adaptive method 'Min Resolved' have significantly improved the accuracy values for DSS sampling, whose values increased by 10% to 12%.
However, a significant decrease of the FPR quality has been recorded. An exception is observed for the BUSS algorithm, where the accuracy and FPR measures record a very small improvement with all the controlled frequency techniques (Fig. 10). For the Adult data set, the improvements are small or missing. For example, for the DSS and RSS methods, the accuracy value improves by around 2% with the 'Average Fitness' adaptive sampling. Similarly, no significant improvements are obtained for the FPR measures. In fact, according to the second experimental study, the introduction of the sampling frequency control into the GP engine, even when it does not improve the results, does not generate a deterioration of the performance, except in some cases for the FPR measure. To summarize, with adaptive sampling, the computational cost can be improved, depending on the underlying dynamic sampling algorithm, only if the fitness case selection process is time consuming, as it is for DSS. Otherwise, the controlling predicates can well improve the learning performance, especially the accuracy metric. However, they do not have a proven positive and generalized effect on the learning quality. Thus, they need to be refined.

Fig. 9. Variation of the mean learning time according to the re-sampling frequency controlling strategy obtained for the KDD99 data set (a) and the Adult data set (b).

Fig. 10. Variation of the accuracy and FPR measures according to the re-sampling frequency controlling strategy for the KDD99 data set (a and c) and the Adult data set (b and d).

Fig. 11. Comparison of the accuracy (a), FPR (b) and mean time (c) measures between hierarchical (blue) and adaptive (red) sampling (Min Resolved approach). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

8.3.
Adaptive vs hierarchical sampling

Previous works published in Hmida et al. (2016a, 2016b) demonstrated that hierarchical (or multi-level) sampling can help GP achieve a lower run-time while keeping the same performance as when using a one-level dynamic sampling technique. Multi-level sampling can provide a trade-off between speed and generalization ability, especially for complex dynamic sampling techniques such as DSS and TBS (Section 4). We demonstrate in this section that adaptive sampling can provide the same trade-off and can be faster in some cases. We provide below a comparative study between both strategies applied on the KDD-99 database used for the experiments published in Hmida et al. (2016a, 2016b). Fig. 11 reports the performance of the hierarchical sampling described in Section 4.4: RSS-DSS implemented with two variants, where the second, RSS-DSS2, synchronizes the change of target size between the two levels, RSS-TBS and BUSS-RSS-TBS. The corresponding performance values are extracted from Hmida et al. (2016a, 2016b). For comparative purposes, each figure also represents the performance of the adaptive 'Min Resolved' approach applied to the dynamic sampling methods BRSS, BUSS, DSS and RSS. The corresponding values are denoted 'A-BRSS', 'A-BUSS', 'A-DSS' and 'A-RSS' respectively. TBS is a powerful sampling technique with a high computational cost. When implemented in a multi-level sampling approach, the TBS cost disappears while its performance is preserved (accuracy greater than 92%), especially with the application of BUSS at level 0. Fig. 11 shows clearly that the same performance in terms of accuracy can be reached with adaptive sampling, with a better learning mean time in some cases. Moreover, a great advantage of adaptive sampling (with the Min Resolved approach) appears compared to the hierarchical one. Indeed, the comparative study published in Hmida et al.
(2016a, 2016b) reports how it is possible to reduce the computational cost of some complex sampling techniques with the hierarchical implementation while keeping the same performance, or even obtaining better performance according to the accuracy metric. However, an important increase of the FPR measure is recorded for all the experiments. Fig. 11 (b) shows clearly that this problem can be handled with adaptive sampling, where the FPR measures are much lower. Hierarchical sampling has a g-generation-wise sampling frequency, where g might be equal to 1 (g = 1) in the last level and g ≥ 1 in level 1. To optimize the efficiency of the GP learners when using this sampling strategy, g is hand-tuned according to the database size and the fitness case difficulty. Adaptive sampling can accomplish the same purpose and does not need the hand-tuning step for the parameter g. Indeed, the re-sampling frequency is computed and adjusted along the ongoing evolutionary process. Moreover, adaptive sampling has the advantage of using a unique sampling method, while hierarchical sampling needs to combine two or more methods at different levels. It is also possible to conceive a combination of the two strategies: a research path to explore would be to introduce the frequency control at each level of the hierarchical sampling.

9. Conclusion

This work is a proposal for a new form of active learning with Genetic Programming based on adaptive sampling. Its main objective is to extend some known dynamic sampling techniques with an adaptive frequency control that takes into account the state of the learning process. After a study of the impact of the sampling frequency variation on the performance of the derived models and the learning mean time, we proposed an increasing and a decreasing deterministic pattern, and two adaptive patterns, for sampling frequency control.
Adaptive patterns are based on information about the ongoing learning, such as the percentage of resolved cases or the average performance. Experiments were led to test the adaptive sampling by controlling the sampling frequency with simple predicates. The results showed a slight effect on the learning time without impacting the learning accuracy. This effect is in the direction of a decrease, but with different degrees depending on the sampling method. Many new research paths emerge from this study that are worthy of further investigation. A first path is the exploration of other predicates that take into account the characteristics of the training dataset and the underlying problem, to find more relevant predicates for GP classifier improvements. A second one is to extend the scope of adaptive sampling to other sample properties. For instance, an adaptive sampling can downsize or upsize the sample instead of generating a new one. We may also combine several sampling strategies and algorithms in a single method. Then, according to the learning state, a sample is generated using the suitable strategy in an interleaved way: we use a different algorithm each time we need to create a new sample.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Atlas, L. E., Cohn, D., & Ladner, R. (1990). Training connectionist networks with queries and selective sampling. In Advances in neural information processing systems (Vol. 2, pp. 566–573). Morgan Kaufmann.
Balkanski, E., & Singer, Y. (2018a). The adaptive complexity of maximizing a submodular function. In I. Diakonikolas, D. Kempe, & M. Henzinger (Eds.), Proceedings of the 50th annual ACM SIGACT symposium on theory of computing, STOC 2018, Los Angeles, CA, USA, June 25–29, 2018 (pp. 1138–1151). ACM. doi:10.1145/3188745.3188752.
Balkanski, E., & Singer, Y. (2018b).
Approximation guarantees for adaptive sampling. In J. G. Dy & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, Proceedings of Machine Learning Research (Vol. 80, pp. 393–402). PMLR. http://proceedings.mlr.press/v80/balkanski18a.html.
CGP. (2009). Cartesian GP website. http://www.cartesiangp.co.uk.
Cohn, D., Atlas, L., & Ladner, R. (1994). Improving generalization with active learning. Machine Learning, 15, 201–221.
Curry, R., & Heywood, M. I. (2004). Towards efficient training on large datasets for genetic programming. In Advances in artificial intelligence, 17th conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI 2004, Proc., Lecture notes in computer science (Vol. 3060, pp. 161–174). Springer. https://doi.org/10.1007/978-3-540-24840-8_12.
Deschrijver, D., Crombecq, K., Nguyen, H. M., & Dhaene, T. (2011). Adaptive sampling algorithm for macromodeling of parameterized s-parameter responses. IEEE Transactions on Microwave Theory and Techniques, 59(1), 39–45. https://doi.org/10.1109/TMTT.2010.2090407.
Eiben, A. E., Michalewicz, Z., Schoenauer, M., & Smith, J. E. (2007). Parameter control in evolutionary algorithms. In Parameter setting in evolutionary algorithms (pp. 19–46). Springer.
Freitas, A. A. (2002). Data mining and knowledge discovery with evolutionary algorithms. Berlin, Heidelberg: Springer-Verlag.
Fu, Y., Zhu, X., & Li, B. (2013). A survey on instance selection for active learning. Knowledge and Information Systems, 35(2), 249–283. https://doi.org/10.1007/s10115-012-0507-8.
Gathercole, C. (1998). An investigation of supervised learning in genetic programming. Thesis, University of Edinburgh.
Gathercole, C., & Ross, P. (1994). Dynamic training subset selection for supervised learning in genetic programming. In Y. Davidor, H. P. Schwefel, & R. Manner (Eds.), Parallel problem solving from nature III (Vol. 866, pp. 312–321).
Jerusalem: Springer-Verlag, LNCS. https://doi.org/10.1007/3-540-58484-6_275.
Gathercole, C., & Ross, P. (1997). Small populations over many generations can beat large populations over few generations in genetic programming. In Genetic programming 1997: Proc. of the second annual conf. (pp. 111–118). San Francisco, CA: Morgan Kaufmann.
Gonçalves, I., & Silva, S. (2013). Balancing learning and overfitting in genetic programming with interleaved sampling of training data. In K. Krawiec, A. Moraglio, T. Hu, A. S. Etaner-Uyar, & B. Hu (Eds.), Genetic programming – 16th European conference, EuroGP 2013, Vienna, Austria, April 3–5, 2013, Proceedings, Lecture notes in computer science (Vol. 7831, pp. 73–84). Springer. https://doi.org/10.1007/978-3-642-37207-0_7.
Haitao, L., Yew-Soon, O., & Jianfei, C. (2018). A survey of adaptive sampling for global metamodeling in support of simulation-based complex engineering design. Structural and Multidisciplinary Optimization, 57(1).
Harding, S., & Banzhaf, W. (2011). Implementing cartesian genetic programming classifiers on graphics processing units using gpu.net. In S. Harding, W. B. Langdon, M. L. Wong, G. Wilson, & T. Lewis (Eds.), GECCO 2011 Computational intelligence on consumer games and graphics hardware (CIGPU) (pp. 463–470). Dublin, Ireland: ACM. https://doi.org/10.1145/2001858.2002034.
Hmida, H., Ben Hamida, S., Borgi, A., & Rukoz, M. (2016a). Hierarchical data topology based selection for large scale learning.
In 2016 Intl IEEE conferences on ubiquitous intelligence & computing, advanced and trusted computing, scalable computing and communications, cloud and big data computing, internet of people, and smart world congress, Toulouse, France, July 18–21, 2016 (pp. 1221–1226). IEEE. https://doi.org/10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0186.
Hmida, H., Ben Hamida, S., Borgi, A., & Rukoz, M. (2016b). Sampling methods in genetic programming learners from large datasets: A comparative study. In P. Angelov, Y. Manolopoulos, L. S. Iliadis, A. Roy, & M. M. B. R. Vellasco (Eds.), Advances in big data – proceedings of the 2nd INNS conference on big data, October 23–25, 2016, Thessaloniki, Greece, Advances in intelligent systems and computing (Vol. 529, pp. 50–60). https://doi.org/10.1007/978-3-319-47898-2_6.
Hunt, R., Johnston, M., Browne, W. N., & Zhang, M. (2010). Sampling methods in genetic programming for classification with unbalanced data. In AI 2010: Advances in artificial intelligence – 23rd Australasian joint conference, Proc., Lecture notes in computer science (Vol. 6464, pp. 273–282). Springer. https://doi.org/10.1007/978-3-642-17432-2_28.
Iba, H. (1999). Bagging, boosting, and bloating in genetic programming. In The 1st annual conference on genetic and evolutionary computation, Proc., GECCO'99 (Vol. 2, pp. 1053–1060). San Francisco, CA, USA: Morgan Kaufmann.
Iyengar, V. S., Apté, C., & Zhang, T. (2000). Active learning using adaptive resampling. In R. Ramakrishnan, S. J. Stolfo, R. J. Bayardo, & I. Parsa (Eds.), Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, Boston, MA, USA, August 20–23, 2000 (pp. 91–98). ACM. https://doi.org/10.1145/347090.347110.
Kohavi, R. (1996). Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In E. Simoudis, J. Han, & U. M.
Fayyad (Eds.), Proceedings of the second international conference on knowledge discovery and data mining (KDD-96) (pp. 202–207). Portland, Oregon, USA: AAAI Press. http://www.aaai.org/Library/KDD/1996/kdd96-033.php.
Koza, J. R. (1992). Genetic programming – on the programming of computers by means of natural selection. Complex adaptive systems. MIT Press.
Lasarczyk, C., Dittrich, P., & Banzhaf, W. (2004). Dynamic subset selection based on a fitness case topology. Evolutionary Computation, 12(2), 223–242. https://doi.org/10.1162/106365604773955157.
Li, X. B. (2002). Data reduction via adaptive sampling. Communications in Information and Systems, 2(1), 53–68. https://doi.org/10.4310/CIS.2002.v2.n1.a3.
Li, X. B., & Jacob, V. S. (2008). Adaptive data reduction for large-scale transaction data. European Journal of Operational Research, 188(3), 910–924. https://doi.org/10.1016/j.ejor.2007.08.008.
Liu, H., Xu, S., Ma, Y., Chen, X., & Wang, X. (2015). An adaptive bayesian sequential sampling approach for global metamodeling. Journal of Mechanical Design, 138(1), 011404. https://doi.org/10.1115/1.4031905.
Liu, Y., & Khoshgoftaar, T. M. (2004). Reducing overfitting in genetic programming models for software quality classification. In 8th IEEE international symposium on high-assurance systems engineering (pp. 56–65). Tampa, FL, USA: IEEE Computer Society. https://doi.org/10.1109/HASE.2004.1281730.
Luke, S. (2017). ECJ homepage. http://cs.gmu.edu/~eclab/projects/ecj/.
Luo, L., Hou, X., Zhong, J., Cai, W., & Ma, J. (2017). Sampling-based adaptive bounding evolutionary algorithm for continuous optimization problems. Information Sciences, 382–383, 216–233. https://doi.org/10.1016/j.ins.2016.12.023.
Miller, J. F.
& Thomson, P. (2000). Cartesian genetic programming. In Genetic programming, European conference, Proc., Lecture notes in computer science (Vol. 1802, pp. 121–132). Springer.
Nordin, P., & Banzhaf, W. (1997). An on-line method to evolve behavior and to control a miniature robot in real time with genetic programming. Adaptive Behaviour, 5(2), 107–140. https://doi.org/10.1177/105971239700500201.
Paris, G., Robilliard, D., & Fonlupt, C. (2003). Exploring overfitting in genetic programming. In P. Liardet, P. Collet, C. Fonlupt, E. Lutton, & M. Schoenauer (Eds.), Artificial evolution, 6th international conference, Evolution Artificielle, EA 2003, Marseilles, France, October 27–30, 2003, Lecture notes in computer science (Vol. 2936, pp. 267–277). Springer. https://doi.org/10.1007/978-3-540-24621-3_22.
Pétrowski, A., & Ben Hamida, S. (2017). Evolutionary algorithms. John Wiley & Sons, USA. https://doi.org/10.1002/9781119136378.
Pickett, B., & Turner, C. J. (2011). A review and evaluation of existing adaptive sampling criteria and methods for the creation of NURBS-based metamodels. In 31st Computers and information in engineering conference (Vol. 2, Parts A and B, pp. 609–618). https://doi.org/10.1115/DETC2011-47288.
Settles, B. (2010). Active learning literature survey. Tech. Rep. 1648, University of Wisconsin, Madison.
Simon, D. (2013). Evolutionary optimization algorithms. John Wiley & Sons, USA.
UCI. (1999). KDD Cup. https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data.
Yu, X., & Gen, M. (2010). Introduction to evolutionary algorithms. Decision Engineering. London: Springer. https://doi.org/10.1007/978-1-84996-129-5.
Zhang, B. T., & Cho, D. Y. (1999). Genetic programming with active data selection. In B. McKay, X. Yao, C. S. Newton, J. H. Kim, & T. Furuhashi (Eds.), Simulated evolution and learning (pp. 146–153). Berlin, Heidelberg: Springer.
Zhang, B. T., & Joung, J. G. (1999). Genetic programming with incremental data inheritance.
In The genetic and evolutionary computation conference, Proc. (Vol. 2, pp. 1217–1224). Orlando, Florida, USA: Morgan Kaufmann.