AI - Software - Testing - Syllabus (1.0)
Released
Version 1.0
© A4Q Copyright 2019 - Copyright notice
All contents of this work, in particular texts and graphics, are protected by copyright. The use and
exploitation of the work is reserved exclusively to the A4Q. In particular, the copying or
duplication of the work, or of parts of it, is prohibited. The A4Q reserves the right to pursue civil and
criminal remedies in case of infringement.
Revision History
Version 1.0
Table of Contents
0 Introduction 6
0.1 Purpose of this Syllabus 6
0.2 Examinable Learning Objectives and Cognitive Levels of Knowledge 6
0.3 The AI and Software Testing Foundation Exam 6
0.4 Accreditation 6
0.5 Level of Detail 7
0.6 How this Syllabus is Organized 7
0.7 Business Outcomes 7
0.8 Acronyms 7
1.0 Key Aspects of Artificial Intelligence 8
Keywords 8
Learning Objectives for Key Aspects of Artificial Intelligence 8
1.1 What are Human Intelligence and Artificial Intelligence? 8
1.2 History of AI 8
1.3 Symbolic AI 8
1.4 Sub-symbolic AI 8
1.5 Some ML Algorithms in More Detail 8
1.6 Applications and Limits of AI 9
1.1 What are Human Intelligence and Artificial Intelligence? 10
Types of Intelligence 10
Turing Test 11
1.2 History of AI 11
Main Periods of AI History 12
Difference between Symbolic and Sub-symbolic AI 13
1.3. Symbolic AI 13
Mathematical Logic and Inference 14
Knowledge-based Systems 14
Constraint-based Solving Systems for Problem Solving 15
1.4 Sub-symbolic AI 15
Types of Learning 16
Automation Bias 30
Adversarial Actors 31
2.2 Machine Learning Model Training and Testing 31
2.3 AI Test Environments 32
2.4 Strategies to Test AI-based Systems 34
Acceptance Criteria 34
Functional Testing 35
External and Internal Validity 35
Metamorphic Testing 35
A/B Testing 36
Evaluation of Real-world Outcomes 36
Expert Panels 37
Levels of Testing 37
Component Testing 38
System Integration Testing 38
System Testing 39
User Acceptance Testing 39
2.5 Metrics for Testing AI-based systems 40
Confusion Matrix 40
Statistical Significance 41
3.0 Using AI to Support Testing 42
Keywords 42
Learning Objectives for Using AI to Support Testing 42
3.1 AI in Testing 42
3.2 Applying AI to Testing Tasks and Quality Management 42
3.3 AI in Component Level Test Automation 42
3.4 AI in Integration Level or System Level Test Automation 42
3.5 AI-based Tool Support for Testing 43
3.1 AI in Testing 44
The Oracle Problem 44
Test Oracles 44
Testing versus Test Automation 45
3.2. Applying AI to Testing Tasks and Quality Management 45
0 Introduction
0.4 Accreditation
The A4Q training titled AI and Software Testing Foundation is the only accredited training course for
the content presented in this syllabus.
0.8 Acronyms
AI Artificial intelligence
ML Machine learning
Keywords
Artificial Intelligence, clustering, correlation, decision trees, Deep Learning, expert knowledge-based
system, data feature, hyperparameter, machine learning model, regression model, machine
learning, neural network, reinforcement learning, sub-symbolic AI, supervised learning, symbolic AI,
unsupervised learning
1.2 History of AI
AI-1.2.1 (K1) Recall the main periods of AI history.
AI-1.2.2 (K2) Explain the difference between symbolic and sub-symbolic AI.
1.3 Symbolic AI
AI-1.3.1 (K2) Discuss the difference between propositional logic, predicate logic and
many-valued logic.
AI-1.3.2 (K2) Describe how a knowledge-based system works.
AI-1.3.3 (K2) Explain what a constraint satisfaction problem is.
1.4 Sub-symbolic AI
AI-1.4.1 (K2) Distinguish several types of machine learning.
AI-1.4.2 (K2) Give examples of applications for a given machine learning type.
AI-1.4.3 (K2) Give examples of machine learning algorithms used for a given machine
learning type.
AI-1.4.4 (K1) Recall machine learning metrics.
Types of Intelligence
The theory of multiple intelligences highlights the existence of several types of human intelligence.
This theory was first proposed by Howard Gardner in 1983 and has since been expanded.
Gardner defines intelligence as “an ability or a set of abilities that permits an individual to solve
problems or fashion products that are of consequence in a particular cultural setting” [Gardner 1983].
The theory of multiple intelligences today defines nine types of intelligence modalities [Davis 2011]:
Musical intelligence refers to the capacity to discern pitch, rhythm, timbre, and tone;
Bodily/kinesthetic intelligence refers to the movement and ability to manipulate objects and use
a variety of physical skills;
Logical/mathematical intelligence refers to facility with numbers and figures and the ability to
carry out mathematical operations;
Visual/spatial intelligence refers to the ability to think in three dimensions, making charts and
diagrams, and estimating distances;
Linguistic intelligence refers to the ability to use words effectively, including convincing speech
and note taking;
Interpersonal intelligence refers to one’s ability with social skills, including having successful,
long-lasting friendships and understanding other people’s moods, needs, and behaviors;
Intrapersonal intelligence refers to one’s ability to understand one’s self, including being
productive when working and making appropriate decisions for oneself;
Naturalistic intelligence refers to the ability to relate to nature and the natural environment in
general, animals, flowers, or plants;
Existential intelligence is defined as the ability to be sensitive to, or have the capacity for,
conceptualizing or tackling deeper and fundamental questions about human existence.
Other modalities of intelligence have been discussed, such as teaching-pedagogical intelligence
or a sense of humor, but these do not fit the criteria proposed in Gardner's theory.
The theory of multiple intelligences highlights the very partial nature of intelligence quotient (IQ)
tests, which mainly measure linguistic and logical-mathematical abilities.
Turing Test
One of the fundamental problems with intelligence is that no consensual definition has yet been
agreed upon; the one given above is just one of many existing definitions. Defining artificial
intelligence without a definition of intelligence is therefore challenging. Alan Turing circumvented this
issue in 1950 when he devised his Turing Test. Although nobody can agree on what intelligence is, we
all agree that humans are intelligent. So if a machine cannot be distinguished from a human, that
machine must also be considered intelligent (whatever that means). Given the technology available
at the time, the test was devised as a text-based conversation.
The Turing Test consists of putting a human being in blind verbal conversation with a computer and
another human. If the person who initiates the conversation is not able to say which of their
interlocutors is a computer, the computer software can be considered to have successfully passed
the test. This implies that both the computer and the human will try to converse as humans
would. To maintain the simplicity and universality of the test, the conversation is limited to text
messages between the participants.
Today, the Turing Test is mainly of historical value, because the scenario it implements remains quite
simplistic. Instead, AI researchers have been more interested in board games that demand a high
cognitive level, such as chess and Go. In these games, artificial intelligence now outperforms
humans: since 1997 with IBM’s Deep Blue program for chess, and since 2017 for the game of Go with
AlphaGo from Google DeepMind.
1.2 History of AI
Artificial intelligence has an ancient history, because humanity was concerned very early on with
designing machines that simulate human reasoning and behavior. The myths and legends of
prehistory and antiquity are full of artificial creatures with consciousness and intelligence
[McCorduck 2004]. Artificial intelligence as we understand it today was initiated by classical
philosophers, including Gottfried Wilhelm Leibniz with his calculus ratiocinator, who tried to
describe the process of human thought as the mechanical manipulation of symbols.
The Dartmouth Conference of 1956 launched artificial intelligence as a specific and independent
field of research. The research community has been structured from this period onwards, benefiting
from the constant progress in computer processing capabilities, but also from the development of
specific theories mimicking the human brain. Animal behavior and other bio-inspired mechanisms
such as evolution, are also a source of inspiration for AI researchers.
1.3. Symbolic AI
The techniques of symbolic AI are characterized by an explicit formalization of knowledge and
general algorithms for reasoning on this formalized knowledge. Mathematical logic and associated
inference systems have played a major role in the development of symbolic AI, but search and
optimization algorithms are also used to reason about the formalized knowledge.
Knowledge-based Systems
A knowledge-based system is generally structured in two parts: the knowledge base and the
inference engine. This separation brings several advantages in the treatment of knowledge:
Knowledge can be formalized directly by the domain experts. This knowledge representation is
explicit and not hidden in the software code.
The results of the reasoning can be explained: which part of the knowledge base was used, and
which inference technique was used. If the results are incorrect, it is possible to analyze the
cause (e.g., by detecting inconsistencies in the knowledge base).
The inference engine is independent of the knowledge base and can be used in different
knowledge-based systems.
Formal logic-based representations and automated deduction techniques are generally used for
knowledge-based systems. They are successfully used in various fields such as expert systems for
diagnosis, intelligent tutoring systems and computer system fault diagnosis.
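To make the separation concrete, here is a minimal, hypothetical Python sketch (not part of the syllabus): the facts and rules stand in for an explicitly formalized knowledge base, while a generic forward-chaining function plays the role of the inference engine.

```python
# Hypothetical knowledge base: facts and rules are data, separate from the engine.
facts = {"engine_does_not_start", "battery_light_on"}
rules = [
    ({"engine_does_not_start", "battery_light_on"}, "battery_fault"),
    ({"battery_fault"}, "recommend_battery_check"),
]

def infer(facts, rules):
    """Generic inference engine: apply rules until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

print(infer(facts, rules))  # includes "recommend_battery_check"
```

Because the rules are explicit data, a domain expert can inspect them, and the chain of rules that produced a conclusion can be reported to explain the result.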
The limitations and difficulties of implementing knowledge-based systems stem precisely from the
explicit nature of knowledge and its formalization. It can be complex even for a domain expert to
formalize the knowledge used in his or her activity. When the knowledge base becomes large,
maintenance and inconsistency detection problems become equally difficult to solve.
1.4 Sub-symbolic AI
Sub-symbolic AI does not seek to formalize knowledge and reasoning. It identifies what can be found
in or learned from existing data to make decisions about unknown situations.
Machine learning (ML) is the main domain of sub-symbolic AI, and on which the rest of this syllabus
is focused. It relies heavily on statistical techniques to represent and reason with data using different
types of learning.
In the field of ML, techniques, algorithms and models are three different but related terms.
Techniques refer to mathematical representations and methods adapted to ML. Algorithms define
how these representations are processed. The notion of a model refers to the representation that
results from training an algorithm on training data.
Data science has been developing rapidly over the past decade due to the availability of large
amounts of data, the tremendous increase in storage capacity and data transfer speed and the
industry's need to optimize decision-making processes based on data. AI and data science have
common intersections, especially since machine learning is one of the methods used in data science
to perform data processing, predicting and automating decision-making activities.
Types of Learning
Different approaches can be used in the way a machine learning system is trained. The following
types are commonly used:
Supervised learning is based on labelled training data; i.e., providing the correspondence
between input and output. Once trained on labelled data, the machine learning model is used to
predict the results using unknown data.
Unsupervised learning uses ML algorithms to identify commonalities in the data, without
providing labels or classification. The objective is to highlight the structures of a dataset provided
to the model, and then automatically classify new data.
Reinforcement learning uses reward and punishment mechanisms to improve the performance
of the learning model. In this case, learning is iterative and interactive with the environment. It
consists of an exploration phase to learn, then an exploitation phase to use what has been
learned.
Semi-supervised learning combines both supervised and unsupervised learning by using a set of
labelled and unlabeled data. This technique reduces the amount of labelled data required for
training.
For some applications, such as the autonomous car, combinations of learning types can be used and
integrated into a complex system.
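As a brief illustration, the following hedged sketch (toy data invented for this discussion, assuming the scikit-learn library) shows supervised learning: a classifier is fitted on labelled examples and then asked to predict the label of unseen data. An unsupervised example appears with the k-means discussion later in this chapter.

```python
# Minimal supervised-learning sketch with scikit-learn (assumed library);
# the toy features and labels are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

X_train = [[22, 1], [35, 1], [58, 3], [64, 3]]   # inputs, e.g., [age, ticket class]
y_train = [1, 1, 0, 0]                           # labels supplied with the data
model = DecisionTreeClassifier().fit(X_train, y_train)

print(model.predict([[30, 1]]))                  # predict the label of unseen data
```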
Supervised learning: Bayes belief networks and naïve Bayes classifiers. A statistical model that
represents variables and conditional dependencies via a directed acyclic graph.
Unsupervised learning: K-means clustering. Given a dataset, it identifies which data points belong to
each one of the k clusters.
Reinforcement learning: Q-Learning. Identifies the optimal action an agent can take depending on
the circumstances.
As the ML domain is very active, new algorithms are frequently proposed, and combinations of
algorithms can also be used. The above table gives some common examples, but other combinations
can be used in some contexts or research work.
Deep learning is based on artificial neural networks using several layers of neurons to refine the
representation and processing of data. Because of the successes achieved with deep learning in
different fields (e.g., the AlphaGo program mentioned in Section 1.1), deep learning has become a
branch of ML in its own right.
True Negatives: The percentage of actual negatives which are correctly identified.
False Positives: The percentage of actual negatives which are incorrectly identified as positive. This
is also known as a Type I error.
False Negatives: The percentage of actual positives which are incorrectly identified as negative. This
is also known as a Type II error.
These metrics can be used to measure the performance of a ML model and also to drive its
improvement.
They can deal with large amounts of data and high-dimensional data, using compact and
efficient representations.
K-means Algorithm
K-means is a classification algorithm used in unsupervised ML. It consists of partitioning the data into
k clusters, by minimizing the distance of each data to the mean of its cluster. It is therefore an
optimization problem for which heuristics have been developed using an iterative refinement
technique. This does not guarantee optimality but allows efficient implementations.
K-means requires that the number of targeted clusters be defined initially, which sometimes
requires several successive attempts to find the number of clusters best suited to the application.
The results of data classification by the k-means algorithm will depend on the number of clusters
defined (the K value), and also on the initialization (i.e., the choice of the initial data to initialize the
clusters) and the distance function chosen to calculate the distance between the data.
K-means is a clustering algorithm commonly used in a wide variety of applications such as customer
behavioral segmentation (e.g., by purchase history, by activities on a platform), document
classification, fraud detection, or image compression.
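A hedged sketch of the above, assuming scikit-learn and synthetic data: k must be chosen up front, and the inertia (within-cluster sum of squared distances) can be compared across candidate values of k.

```python
# K-means with scikit-learn (assumed library) on synthetic 2-D data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two artificial groups of points, e.g., customers by spend and visit count.
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# The number of clusters k and the initialization both influence the result.
for k in (2, 3, 4):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, round(model.inertia_, 1))  # within-cluster sum of squared distances
```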
Typically, machine learning development work is conducted as an agile activity, moving through
iterations of business understanding, data understanding, data preparation, modelling, evaluation,
and deployment. This is reflected in the Cross-Industry Standard Process for Data Mining (CRISP-DM)
model:
1. Business understanding is about gaining an understanding of the business requirement and the
problem to be solved in data science terms.
2. Data understanding is performed with actual data; it helps to identify data quality issues and
requirements for preparation. Some initial insights and hypothesis may be produced at this
stage, and data items with the highest correlations to desired predictions will be identified.
3. Data preparation is usually a significant piece of work. Most learning models require that data is
converted into floating point (decimal) numbers before the model can work with it (see the
sketch after this list). Data quality issues need to be resolved or incomplete records removed.
4. Modelling involves applying different learning models with different parameters and
mathematically comparing them.
5. Evaluation is a thorough evaluation of the learning model and the process used to create it.
Typically, this involves validation of the results in the business context.
6. Deployment generally involves integrating the model with other software, to provide an
end-to-end system which can be integration, system, and acceptance tested.
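The data preparation step can be illustrated with a hedged pandas sketch (the column names and values are invented): missing values are resolved and a categorical column is converted into numeric features the model can consume.

```python
# Minimal data-preparation sketch using pandas (assumed library).
import pandas as pd

df = pd.DataFrame({"port": ["S", "C", "Q", "S"], "age": [22.0, 38.0, None, 35.0]})
df["age"] = df["age"].fillna(df["age"].median())   # resolve missing values
df = pd.get_dummies(df, columns=["port"])          # one-hot encode the category
print(df.dtypes)                                   # numeric columns only
```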
Sample bias can occur when the training data is not fully representative of the data space to
which ML is applied.
Inappropriate bias, like racial or gender bias, may be actual bias that is present in the dataset
and that was faithfully picked up by the ML approach. In some cases, it can also be reinforced by the
algorithm used, for example by overvaluing a more widely represented data class.
The ML system must therefore be evaluated against the different biases and then act on the data
and algorithm to correct the problem. Later in this syllabus we explore bias in more detail and its
effect on the quality of the system. Note that it is not sufficient to exclude inappropriate features
(e.g., race or gender) from the sample data to take care of inappropriate bias. These features may
still be represented indirectly through dependent (correlated) features.
Keywords
A/B testing, actuators, adversarial actor, agency, automation bias, bias, confusion matrix,
deterministic system, discreteness, drift, episodicness, external validity, human in the loop, internal
validity, metamorphic relations, metamorphic testing, model training, observability, overfitting,
probabilistic system, sensors, staticness, statistical significance, training data, underfitting, variance
AI-2.2.2 (K1) Recall the difference between model training and traditional unit
testing.
AI-2.2.3 (K1) Recall types of defects that can arise in machine learning model
implementation.
Real-world Inputs
In the modern systems landscape, and particularly in the field of AI, there are many examples where
systems have real-world sensors, which are receiving unpredictable inputs. This is a type of physical
non-deterministic system. However, it may not be a non-testable system, if the sensors can be
manipulated in order to achieve stable outputs.
Self-optimization
Systems which learn based on data they experience, or self-optimize based on inputs and outputs,
are non-deterministic. While it may be possible to create a model and observe the same inputs and
outputs, it is unlikely to be practical to do so. These are non-testable systems.
Expert Systems
Detailed requirements and designs are often absent or imprecise for expert knowledge-based
systems often associated with symbolic AI, and conversational interfaces. These are often built
through an iterative refinement process and user engagement; it is often the case that any
documented, detailed requirements and other test bases are not kept up to date.
These systems are not necessarily non-deterministic or non-testable, but instead can be considered
complex and may simply have a poorly defined test basis or lack a test oracle.
Perception of Intelligence
In many cases, the correctness of the behavior of the software is perceived differently by different
individual users. An example of this is in conversational interfaces, such as voice recognition and
response in smartphones or other devices, where different users have different expectations about
the variations of vocabulary that should be supported, and experience different results based on
their own choice of words and clarity of speech (in comparison to the dialects and speech the system
has been trained with).
These systems are technically both deterministic and testable, but the variety of possible inputs and
the expectations of intelligence from end users can make the testing quite complicated. For example,
it is challenging to list up-front all the requirements for replicating human intelligence. This
often leads to a shifting or weak test basis, as intelligence is judged differently by different users.
Model Optimization
Systems which use probability, such as machine learning systems and their learning models, are
often carefully optimized. As part of model optimization, the algorithms will be adapted to get the
most accurate results. There are trade-offs in this process, and it may be that optimization improves
overall results, but causes incorrect results in a small number of cases. These systems are
deterministic and probabilistic systems, and non-testable using traditional techniques.
Trustworthiness: The degree to which the system is trusted by stakeholders, for example a
health diagnostic.
Because machine learning works by identifying patterns in data and applying them to make
predictions, it can be compared to a human being’s ability to make decisions based on heuristics or
“rules-of-thumb.” The concept of equivalence partitioning, which is the assumption that subsets of
variables within a similar range will be treated in a similar way, does not always apply. Machine
learning algorithms typically consume multiple input data items with complex relationships, and
without building an exhaustive understanding of these mathematical relationships, it is necessary to
evaluate models using large datasets, to ensure defects are exposed.
Referring again to the Titanic, another passenger ship, the Lusitania, sank three years later. In this
example the correlation between survival and gender did not exist, and survival rates were much
more closely correlated with age. Therefore, when applying the model we built for the Titanic to
data from the Lusitania, we would see a very low accuracy rate.
Bias occurs when a learning model has limited opportunity to make correct predictions because the
training data was not representative, and variance refers to the model’s sensitivity to specific sets of
training data.
Bias is a key statistical concept that is important to understand: it represents the difference
between model predictions and reality. Bias occurs whenever the dataset that a model has been
trained on is not representative of real data. Bias can be injected without the awareness of the
developers of the model; it can even be based on “hidden” variables which are not present in the
input domain but are inferred by the algorithm. Bias results from incorrect assumptions in the
model, and a high level of bias in the data will lead to “underfitting,” where the predictions are not
sufficiently accurate.
By contrast, models with a low level of bias suffer from a greater variance based on small
fluctuations in the training data, which can lead to “overfitting.” For supervised learning, it is an
established principle that a model which is trained to give perfect results for a specific dataset will
achieve lower quality predictions with previously unseen inputs. Selecting the ideal balance
between the two is called the bias/variance trade-off, and this is a property of all supervised
machine learning models.
High bias, low variance models will be consistent, but inaccurate on average. High variance, low bias
systems are accurate on average, but inconsistent. Examples of low-bias machine learning models
include clustering. Examples of high-bias machine learning models include linear regression.
There is an inverse relationship between bias and variance, which is why it is referred to as a trade-off.
Specifically, increasing the bias will decrease the variance, and increasing the variance will decrease
the bias.
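The trade-off can be made tangible with a hedged sketch, assuming scikit-learn and synthetic data: a linear model (high bias) and an unpruned decision tree (high variance) are compared on training data and on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 3, (80, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, 80)      # noisy non-linear signal
X_train, X_test, y_train, y_test = X[:60], X[60:], y[:60], y[60:]

models = {
    "high bias (linear)": LinearRegression(),
    "high variance (deep tree)": DecisionTreeRegressor(random_state=0),
}
for name, m in models.items():
    m.fit(X_train, y_train)
    # A large gap between the two scores suggests overfitting;
    # two low scores suggest underfitting.
    print(name, round(m.score(X_train, y_train), 2), round(m.score(X_test, y_test), 2))
```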
Another similar area of complexity in machine learning is the concept of the “No Free Lunch
Theorem.” This is a set of mathematical proofs that show any general-purpose algorithm is of
equivalent complexity when its efficiency and accuracy are assessed across all possible problems.
Another way of looking at this is to state that all algorithms are accurate to the same degree when
averaged across all possible problems.
This is due to the huge potential problem space that a general-purpose algorithm can be applied to.
If an algorithm is optimized for a specific sub-domain, it is known that it will be sub-optimal for
others. This essentially also means that, when no assumptions at all can be made for a problem (e.g.,
heuristics to be applied), then a purely random approach will on average perform as well as the most
sophisticated ML approach. The reason why the human brain works well in real-life is that our world
allows many such assumptions; e.g., that history is a good predictor of the future.
This can be illustrated through the example of a search algorithm of the sort used to identify the
route out of a maze. Search algorithms typically enumerate possibilities until they determine a
correct answer, and the mechanism of enumeration is critical. For example, a common way to find
your way out of a maze is to keep turning in the same direction until you reach an exit; a key variable
for such an algorithm might be whether to turn left each time or to turn right. If the algorithm was
configured to always turn left, then given a maze with an exit immediately to the left, it would
perform exceptionally well. If the exit was actually immediately to the right, it would perform very
poorly. While this is obviously theoretical, it highlights how a model’s effectiveness can vary greatly
as the problem space changes.
The problems associated with model optimization are crucial to understand when testing machine
learning models. Examples of defects which can occur are listed below:
Models which are trained with a high variance on a data feature which does not vary much in
the training dataset, but varies much more in real life, will be inconsistent in making predictions.
Models which are trained with a high bias (e.g., models trained on data from a single country but
intended to be used globally) may overfit to data items which are not highly correlated with the
desired predictions.
Supervised learning datasets which are incorrectly labelled (e.g., by humans) will lead to models
which are incorrectly trained and will make incorrect predictions.
Integration issues which cause data features to be incorrect or missing can cause incorrect
predictions even when the model has been exposed to correct data.
Drift
Drift is the concept that the correlation between the inputs and the outputs of an AI system may
change over time. It is not possible to test for drift generally, but it is possible to test specific risks.
When systems are re-trained with the same training data but a new random initialization of a
learning model, or self-learn based on new data, then there is a risk that the correlations in the data
change and consequently the model’s behavior changes. In other words, drift means that the
average error rate is unchanged, but that some instances are labeled differently than they were
labeled before. These could be hidden correlations that are not necessarily even represented in the
input data domain for the AI system. These could also be seasonal or recurring changes, or abrupt
changes. They could also be cultural, moral or societal changes external to the system.
It is important to understand that many machine learning models are effectively built from scratch
whenever they are re-trained. Traditional software engineering approaches typically change a
limited amount of code in each release, whereas in many mathematical based approaches the model
is completely rebuilt. This increases the risk of regression significantly.
These correlations can also become less valid over time in the real world. For example, if the
popularity of specific products changes, or different groups of users start to use a website more, the
predictions and decisions made by machine learning will become less accurate.
Another aspect of drift in AI systems can be attributed directly to testing. For example, the
pesticide paradox referred to in the ISTQB Foundation syllabus states that if the same tests are
repeated over and over again, they find fewer defects. This is also true where machine learning
models are used and automated tests are repeated in a way that changes the model. The model
can in fact become overfitted to the automated tests, reducing the quality in problem spaces that fit
outside of the domain of the automated tests.
When testing for drift, it is necessary either to repeat tests over time, similar to regression testing, or
to include external data (external validity testing), such as information from the environment that
is not available to the algorithm.
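One hedged way to check for this kind of drift after re-training is to score a fixed reference dataset with both the previous and the new model and flag the instances whose predicted label changed; the sketch below uses hypothetical stub models to keep it self-contained.

```python
def changed_predictions(old_model, new_model, reference_rows):
    """Return the indices of reference rows whose predicted label changed."""
    old = old_model.predict(reference_rows)
    new = new_model.predict(reference_rows)
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]

class StubModel:
    """Stand-in for a trained classifier; real models would be loaded instead."""
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, rows):
        return [1 if r >= self.threshold else 0 for r in rows]

# Only index 5 changes label when the decision threshold moves from 5 to 6.
print(changed_predictions(StubModel(5), StubModel(6), list(range(10))))  # -> [5]
```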
Ethical Considerations
An early step in planning a testing activity should include gathering applicable regulatory standards
to determine if they impact the testing. Even if stakeholders do not have an explicit regulatory,
ethical, or procurement requirement, it is obvious that an error rate in an algorithm which affects a
sub-group of people (in an unintended way) can have an adverse effect on quality. This is an
important aspect when considering the trustworthiness, to stakeholders, of an AI system.
As demonstrated in the previous section, a model can be built that is highly dependent on personal
characteristics (e.g., gender and age), which are not accurate correlations in the real world. If our
Titanic model were used today in the real world to optimize a rescue operation for a sinking
passenger ship, consider the implications with regard to ethics.
In some jurisdictions, such as in the EU under the GDPR regulations, there are specific legal
requirements that apply where data about actual people is used with complex systems, particularly
those which make decisions and can affect people in a significant way (e.g., access to services).
These requirements may include:
Explicit informed consent to automated decision making using the user’s data.
Automation Bias
Automation bias, or complacency bias, is the tendency of humans to be more trusting of decisions
when they are recommended by AI systems. This can take the form of a human failing to take into
account information received from sources other than the system. Alternatively, it can take the form
of a failure of the system which goes undetected because the human is not properly monitoring the
system.
An example of this is semi-autonomous vehicles. Cars are becoming increasingly self-driving, but still
require a human in the loop to be ready to take over if the car is about to have an accident. If the
human gradually becomes trusting of the car’s ability to reason and direct the vehicle, they are
likely to pay less attention and monitor the car less, leading to an inability to react to an unexpected
situation.
In predictive systems, it is common to have a human validate a pre-populated machine
recommendation. For example, a procedure where a human keys data into a form might be
improved to use machine learning to pre-populate the form, and the human just validates the data.
In these scenarios, it is necessary to understand that the accuracy of the human decision may be
compromised, and test for that where possible.
Automation bias is most likely to be detected during acceptance testing.
Adversarial Actors
Hackers and other adversarial actors can attempt to disrupt or manipulate an AI system just like any
other modern system. One new attack vector, present for systems which learn and self-optimize, is
the risk that the model can be changed by manipulating its data inputs. This is particularly relevant
for any system which interacts with the real world through natural language or physical interactions.
Examples can include:
Manipulating intelligent bots on social media by feeding them inputs which promote a specific
viewpoint; e.g., political.
Moving objects around to make a robot perceive an obstacle where there is not one.
Making changes to objects in the real-world, which limits or changes the result of sensors.
has been gathered as training data, has this been done at a certain time of day? Is this
representative of average use? Is the dataset too broad or too narrow? If the sampling is done
incorrectly, or if the training data is not representative, the real-world outcomes will be wrong.
Typically, a data scientist would use a framework for automatically splitting the data available into a
mutually exclusive training dataset and a testing dataset. One common framework used to do this is
SciKit-Learn, which allows developers to split off the required size of dataset through random
selection. When evaluating different models, or retraining models, it is important to fix (explicitly
set) the random seed used to split the data. If this is not done, the results will not be consistent,
comparable, or reproducible.
Normally 70%-80% of the data is used for training the model, with the remainder reserved for
evaluating it. (There are advanced methods available for ensuring that the train/test split has been
done in a representative way; however, they are outside the scope of this syllabus.) This is, in fact,
exactly what we did with the Titanic example: we used a single line of code to create a random
split of the data.
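A minimal sketch of such a split, assuming scikit-learn; the stand-in data, the 80/20 ratio and the fixed random_state (which makes the split reproducible) are illustrative only.

```python
from sklearn.model_selection import train_test_split

features = [[x] for x in range(100)]     # stand-in for real feature rows
labels = [x % 2 for x in range(100)]     # stand-in for real labels

X_train, X_test, y_train, y_test = train_test_split(
    features, labels,
    test_size=0.2,      # reserve 20% of the data for evaluating the model
    random_state=42,    # fixed seed: the same split on every run
)
print(len(X_train), len(X_test))         # 80 20
```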
Traditional unit testing usually focuses on covering statements and conditions in code. However,
a whole machine learning model can often be represented in a very small number of lines of code,
and it is component tested in isolation. When considering the coverage of model testing, it should be
measured in terms of data rather than lines of code.
An important point for consideration is the design of regression testing activities. Whereas in
traditional software development the risk of functional regression is usually low unless very large
changes are made, almost any change to the algorithm, parameters or training data typically
requires rebuilding of the model from scratch, and the risk of regression for previously tested
functionality is quite high. This is because rather than changing a small percentage of the model
based on the required changes, 100% of the model can potentially change.
In summary:
The way that the initial data has been gathered is important to understand whether it is
representative.
It is important that the training data is not used to test the model otherwise testing cannot fail.
The data split for training and testing may need to be made reproducible.
scale training data. Manual annotation of data, for both training and testing sets, should be
validated through experimentation and manual review by subject matter experts.
The environment may be continuously self-optimizing or may be restarted to a known model
baseline prior to testing. This is an important consideration.
It is important, when considering the machine learning test environment, to consider whether the
population of data in use reflects the expected final population. For instance, cultural and societal
aspects may vary significantly between different sets of target users. Using data from one part of
the world to test a global system can lead to unpredictable results.
It is clear that AI systems to some degree perceive their environment, and to some degree take
action to influence it. We can refer to the system’s ability to perceive its environment through
sensors, and its ability to influence the environment through its actuators.
Where the AI system is characterized by an impact on the physical, real world, for instance an AI
system inside a robot, the test environment may need to be very carefully planned. It may need to
consider physical operating parameters such as exposure to weather, in order to fully evaluate it.
One model for describing AI environments is D-SOAKED:
Deterministicness. We have already discussed determinism in the context of an AI system, but is
the environment itself deterministic, probabilistic, or non-deterministic? Few AI systems which
interact with the real world can claim to have a deterministic environment. For example, if a
system is able to make a recommendation to a user searching for something, is the user
guaranteed to follow the recommendation? If not, then the environment is either probabilistic,
or non-deterministic.
Staticness. Static environments do not change until the AI system takes an action, whereas
dynamic environments continue to change.
Observability. A fully observable environment is one where the system has access to all
information in the environment relevant to its task.
Agency. If there is at least one other AI system in the environment, it can be described as a
multi-agent environment.
Knowledge. The degree to which the system knows how its environment will behave and
respond.
Episodicness. Episodic environments comprise series of independent events, whereas sequential
environments involve a sequence of events, and behavior will change based on prior events.
Discreteness. A discrete environment has fixed locations, zones or time intervals. A continuous
environment can be measured to any degree of precision.
Table 3 uses these characteristics to compare the intended environments for a machine learning
model that recommends purchases, and an assistance robot providing support (e.g., to senior
citizens or those less able).
Table 3 (excerpt). Staticness: the recommender's environment is not static (the user may do something else); the assistance robot's environment is not static (the world continues...).
Note that while these are all valid for the final intended usage, it may be that within a testing
context it is practical to make the environment more deterministic, static, observable and discrete.
Acceptance Criteria
System requirements and design processes are equally important with AI systems, but given the data
or rule driven nature of them, often the basis for the detail for how the system should behave is
composed differently than a procedural system.
It is very important that the standard of accuracy is agreed up-front between stakeholders. For
example, even if only 70% of the system’s responses are correct according to an expert, that may be
acceptable if the average human falls well short of that.
Ideally requirements for complex AI components should specify both desired and minimum quality
requirements and consider false positive and false negative errors. These are discussed further in
the metrics section.
Here are some examples of good acceptance criteria:
A classification algorithm for prediction is desired to achieve no more than 10% false positive
errors and no false negative errors (see Metrics), but up to 15%/5% will be acceptable.
An expert system using a conversational interface is desired to answer all human questions
relating to a domain accurately, but 80% is acceptable as long as a human evaluator cannot tell
the difference between that system’s responses and a human.
An algorithm that is predicting the right special offer to present to a website visitor should
convert to a sale 5% of the time, but as long as it is more accurate than the current algorithm it
is acceptable.
One model for describing the desired behavior of AI systems, particularly those that are considered
autonomous and able to make decisions based on prior experience, is the PEAS model. This stands
for Performance, Environments, Actuators and Sensors:
Performance in this context is an objective measure of the success of the system’s behavior. It
can also be phrased as defining how the agent itself knows whether it has been successful.
Examples of performance can include speed of execution of an action, profitability, or safety
outcomes. Note that the definition relating to speed and time is commonly used in other
testing contexts.
Environment is defined in Section 2.3 using the D-SOAKED model.
Actuators were referred to previously and include anything that allows the system to take action
which influences its performance.
Sensors were also referred to previously and include anything that allows the system to perceive
events relevant to its goals.
Functional Testing
This section details five functional testing techniques that can be useful in testing AI systems.
Metamorphic Testing
Metamorphic testing is a testing technique which requires multiple executions, and uses a test
design approach that creates a pseudo-oracle for the system, which is made up of heuristics called
metamorphic relations.
These relations are probabilistic in nature and define the likely correlation between multiple inputs
and outputs of the system. For example, in a system using machine learning to predict the likelihood
a person will have a disease that predominantly affects older people, there is likely to be a strong
correlation to age. There is therefore a metamorphic relationship between the input age and the
output prediction. A test design could therefore state that in general, a test with a higher input age
should generate a higher likelihood output, and vice-versa.
Once metamorphic relations have been established, they can be used in place of the test oracle to
generate test cases which verify individual sets of inputs and outputs. In addition, the successful
execution of metamorphic tests can lead to the identification of further tests, unlike conventional
testing.
In summary, metamorphic testing allows test analysts to predict the results of future tests, based on
past test executions. It is usually a kind of internal validity test.
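A hedged sketch of such a metamorphic test: `predict_disease_risk` is a hypothetical wrapper around a trained model, and the metamorphic relation states that increasing only the age should not decrease the predicted risk.

```python
def predict_disease_risk(age, other_feature):
    # Placeholder standing in for a call to the trained model.
    return min(1.0, 0.01 * age + 0.1 * other_feature)

def test_risk_does_not_decrease_with_age():
    younger = predict_disease_risk(age=40, other_feature=0.5)
    older = predict_disease_risk(age=70, other_feature=0.5)
    # Metamorphic relation: a higher input age should not yield a lower risk.
    assert older >= younger

test_risk_does_not_decrease_with_age()
```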
A/B Testing
A/B testing is another kind of testing designed to solve test oracle problems. Conceptually it is a test
that is run with two equivalent samples, changing one variable. This kind of testing is heavily used in
consumer-facing marketing and e-commerce, where it is not clearly known which kinds of system
behavior will lead to the best outcomes (in terms of customer activity).
A simple example of this type of test is where two promotional offers are emailed to a marketing list
divided into two sets. Half of the list gets offer A, half gets offer B, and the success of each offer
gives insights into the best offer. Many e-commerce and web-based companies use A/B testing in
production, diverting buckets of consumers to different functionality, to see which consumers
prefer, or achieves the best business outcomes.
This concept can be applied to other situations where the correct result is unknowable, such as
recommender and classifier systems. With this concept, an external validity check can be performed
on how the system is received in a production environment.
Another useful application of A/B testing is when it is difficult to test the external validity of the
system before reaching production. That is, testing conducted before release shows that the system
gives the correct results with sample inputs, but external validity is often only proven in the wild. By
using A/B testing, changes can be released to only a small population of users or use cases, which
enables external validity testing to be conducted in a similar way to beta testing.
Internal information about whether the action taken by the system has resulted in the desired
goal in the environment
Structured or unstructured feedback systematically gathered from users
External data sources that correlate with internally held system data
Additional sensors fitted to monitor the system or the environment
Expert Panels
When testing rule- and knowledge-based expert systems, which are aimed at replacing experts
themselves, it is clear that human opinion needs to be considered. This is a kind of external validity
test and can be achieved using the following process:
1. Establish a set of test inputs based on analysis of the input domain
2. Perform a Turing Test, using experts in the field of the system under test, to determine the
equivalent outputs (likely recommendations) of experts, and therefore the expected results
3. Evaluate the differences between results through a blind rating (again by experts) of the
outputs. The purpose of this step is to determine the natural variation in human expert opinion.
In this context, outputs are the results of the system, which could include recommendations or
classifications.
There are several considerations which are important while conducting such tests:
Human experts vary in competence, so the experts involved need to be representative, and the
number of experts needs to be statistically significant.
Experts may not agree with each other, even when presented with the same information.
Human experts may be biased either for or against automation, and as such the ratings given to
outputs should be double-blind (i.e., neither the experts nor the evaluators of the outputs
should know which ratings were automated).
Humans are more likely to caveat responses with phrases like “I’m not sure, but…” If this kind of
caveat is not available to the automated solution, this should be considered when comparing the
responses.
Levels of Testing
The purpose of different levels of testing for AI systems is to systematically eliminate variables that
may lead to quality failures in the system. Table 4 shows examples of testing objectives that could
be applied at different test levels, to different types of AI system.
Expert system (helping diagnose problems with your self-driving car):
Component Testing: expert verification of recommendations using virtualized sensors and metamorphic testing.
System Integration Testing: verifying that the sensors interact correctly with the expert system.
System Testing: verifying that the expert system behaves as expected when fully integrated.
Acceptance Testing: testing with non-expert users in real cars.

Computer vision (facial recognition and greeting of customers):
Component Testing: model training/testing based on pictures from files.
System Integration Testing: testing that the image from the sensors is of sufficient quality to achieve acceptable results.
System Testing: verifying that the recognition system sensors, models and greetings work as expected when fully integrated.
Acceptance Testing: testing with groups of people who look different, to a statistically significant degree.

Recommender (recommending a purchase based on order history):
Component Testing: model training/testing using prior order history.
System Integration Testing: ensuring the model is able to retrieve all the inputs to the model correctly.
System Testing: verifying that the recommendation system, including models and retrieval of order history, works as expected when integrated, and makes effective recommendations.
Acceptance Testing: conducting A/B testing with some end customers to see if results are improved.

Table 4: Examples of test objectives for each test level with example AI systems
Component Testing
Complex algorithms typically require significant component testing as they contribute significantly to
the overall system functionality.
In the context of machine learning models, this is likely to be highly data-driven as described in the
Model Training and Testing section.
With expert systems it is common to fully verify the requirements for the knowledge that the system
needs to have. As these are usually specified in a machine-readable form, it is often possible to fully
automate this testing by using the knowledge inputs as a test oracle.
Systems which take real-world inputs through sensors should have those sensors virtualized, so that
they can be controlled and observed.
System Testing
The goal of many AI systems is to reproduce the intelligence of human experts. If there is a heavily
documented set of rules or a pseudo-oracle that the system can be tested against, it may be possible
to define expected results which achieve a high level of coverage. However, in many cases there
may not be such a mechanism and it may be more appropriate to evaluate system performance
against human experts.
It is usually prudent to verify the accuracy of outputs, including the competence of expert
recommendations, prior to validating their acceptability to users. Otherwise, it will not be clear
whether any failures are caused by inadequate outputs, or human factors.
Confusion Matrix
While it is possible to describe accuracy as the percentage of classifications that were correct, this
does not tell the full picture, as the impact of incorrectly classifying a patient as requiring a
procedure (a false positive) could be very different from the impact of incorrectly classifying them as
not requiring the procedure when they actually did (a false negative). While this is an extreme
example because of the very physical impact on the patient, in practice the impact of a false positive
error, as compared to a false negative error, is usually different.
A confusion matrix is a chart which describes the accuracy of a system in a classification context. A
simple confusion matrix looks like Figure 2 below.
This matrix is in fact generated from the application of a decision tree algorithm to the Titanic data.
On one axis are the predicted labels (true and false, in the simple example above); on the other axis
is the true label. The numbers in each box indicate the ratio of results with that true label which fell
into the corresponding predicted label.
Using the matrix above, we can see that 88% of the people who did not survive were correctly
predicted. However, 12% of the people who did not survive were incorrectly predicted to survive,
i.e., false positive errors. Similarly, 65% of the people who survived were correctly predicted to do so
(true positives), and 35% of the people who survived were not predicted to do so (false negatives).
A confusion matrix can be presented differently, using more than one label, using absolute values,
and using different colors to highlight different aspects.
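A minimal sketch, assuming scikit-learn, of generating such a matrix normalized per true label (so each row sums to 1, matching the ratios described above); the labels here are invented, not the Titanic data.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # actual outcome (e.g., survived or not)
y_pred = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]   # outcome predicted by the model
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)   # rows = true label, columns = predicted label, values are ratios
```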
Statistical Significance
Due to the nature of probabilistic and non-deterministic systems, the concept of statistical
significance can become relevant to software testing, especially in understanding test coverage, for
example:
● Expected results can’t be determined exactly for a specific test and require multiple executions
to establish an average.
● A test analyst is trying to cover different types of users based on their personal characteristics.
● A test analyst needs to ensure all the data features used in a learning model are exercised in
testing.
● An A/B test in production is being scoped, and a test analyst needs to determine the number of
users to be targeted with new functionality in order to produce the most meaningful results.
Statistical significance is the likelihood that observed test results can be generalized across a wider
population of data. There are three factors that affect the statistical confidence that can be obtained
from a specific number of tests:
● The sample size, or the number of tests. The higher this number, the greater the accuracy of the
results. However, this is not linear, as doubling the number of tests does not double the
accuracy.
● The variation in responses. If the system responds a certain way 99% of the time, it will require
fewer tests to evaluate; if the system responds a certain way 49% of the time, it will require more
tests to determine a statistically significant result.
● The total population size. This might be the total number of permutations of data items in the
tests, or perhaps the total number of users. This is usually not relevant unless the test
population is greater than a few percentage points of the total population.
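To make this concrete for the A/B example mentioned above, a hedged sketch (assuming the statsmodels library; the counts are invented) applies a two-proportion z-test: the p-value indicates whether the observed difference between the two variants could plausibly be due to chance.

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 155]    # successes observed for variants A and B
visitors = [2400, 2500]     # users exposed to each variant
stat, p_value = proportions_ztest(conversions, visitors)
# A small p-value (e.g., below 0.05) suggests the difference between the
# variants is unlikely to be due to chance alone.
print(round(p_value, 4))
```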
A relevant concept is Simpson’s paradox. This is a phenomenon in statistics, in which a trend
appears in several different groups of data but disappears or reverses when these groups are
combined.
The largest number of tests is required when data is stratified or grouped into buckets to ensure it is
covered. Any data buckets for stratified sampling typically vary based on data in the input domain
that may affect the results, for example:
● A test where the expected results are unknowable should have a statistically significant volume
of tests for each possible input value with an important feature.
● A test analyst who is testing facial recognition may make a list of different ethnic backgrounds
and aim to cover a statistically significant sample of each.
● A test analyst assessing the quality of a machine learning model may wish to assess whether
there is a statistically significant set of training and testing values being used in component
testing.
● An A/B test may wish to consider different internet browsers or geographies, to ensure the
most representative results.
It should be noted that the number of such buckets in use increases the number of tests required
more than any other factor.
Keywords
Test automation, test maintenance, oracle problem, regression testing, difference testing, Golden
Master testing, test data generation, test data synthetization, bug triaging, risk estimation, test
generation, test selection, test prioritization, object recognition, identifier selection, cross-browser
testing, cross-device testing, visual regression testing
3.1 AI in Testing
AI-3.1.1 (K2) Summarize why AI is not directly applicable to testing.
AI-3.1.2 (K1) Recall examples of test oracles.
3.1 AI in Testing
AI can be used on both sides of testing. It should be tested itself, but it can also be applied to testing
and surrounding activities.
Test Oracles
Specified test oracles are oracles that are explicitly specified. For manual tests, this is usually the
specification or documentation. In typical test automation, it is the widely used assertion. But
these assertions usually apply only to the specific test for which they are defined. To be
usable for test automation or AI (i.e., for new tests), what is usually needed is some sort of formal
specification, contract or software model. The problem is the correct level of abstraction of the
specification: it must be abstract enough to be practical, but concrete enough to be applicable. After
all, the code itself is another form of formal specification. So if the system can be modelled, it is
often more efficient to derive an implementation in code than to use the model as a test oracle.
Derived test oracles are oracles derived from artefacts that were not originally intended to serve as
test oracles, such as alternative implementations or previous versions of the system under test.
The latter in particular can serve as a test oracle for test automation, to detect unintended side-effects
of changes. Such an oracle is sometimes referred to as a consistency oracle, and the corresponding
testing approach as Golden Master testing, difference testing, snapshot-based testing,
characterization testing or approval testing; it aims to keep the SUT consistent with earlier versions
of itself.
Implicit oracles are well-known characteristics that do not need to be modelled explicitly (e.g., the
fact that usually crashes are unwanted or software should respond to input in a timely manner).
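A minimal sketch of a consistency (Golden Master) check in Python; the output function and the snapshot file name are hypothetical. The first run records the snapshot, and later runs compare the current output against it to detect unintended changes.

```python
import json
import pathlib

def system_output():
    # Stand-in for a call to the system under test.
    return {"recommendations": ["A", "B", "C"]}

snapshot = pathlib.Path("golden_master.json")
if not snapshot.exists():
    snapshot.write_text(json.dumps(system_output(), indent=2))   # record the master
else:
    assert json.loads(snapshot.read_text()) == system_output()   # detect differences
```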
Defect analysis
Test optimization and unification
GAN (generative adversarial network), which actually consists of two neural networks. One
generates an image; the other one rates how realistic the image is and thus challenges the first
neural network to improve on its output.
● Parameters may be of complex types. For example, a code function may expect to be passed a
complex datatype of sorts, such as a user object or a list of arrays of sets of valid email
addresses.
● The oracle problem. Since most of the inner state of the system is available or reachable, a consistency or history oracle will likely show many non-meaningful and volatile differences (false positives), which reduces the value of the results. Even the implicit test oracle is often of little use, because exceptions and crashes may be non-problematic; this is the case, for example, when robustness is not an issue and the code is passed invalid data by the test (data that no other part of the system would pass).
● Unclear testing goal. When generating tests for a specific function, in contrast to the whole system or a large subsystem, it is unclear when that function has been tested enough in comparison to the rest of the system. So, test prioritization needs to be incorporated.
● Unclear output optimization. As the tests target code, the expected output of test generation is usually also code. Therefore, many additional hard questions arise that have no agreed-upon answers, such as:
○ What is an optimal test length?
○ How should variables be named?
○ How should code be structured into methods and sub-methods and how should they be
named?
○ When is code readable versus spaghetti-code?
● Unclear next action. With a wide variety of methods and functions available, and usually most of the system reachable (by breaking the Law of Demeter, see Lieberherr and Holland), it is hard to decide which method or function to call next. Also, there are often implicit protocols in place (e.g., call open before close) that make no sense to break.
Despite all these challenges, there is ongoing research in this area, which has resulted in free-to-use open source tools that generate unit tests for existing code. One such tool is EvoSuite, which can be used on the command line and even comes with plugins for Eclipse, IntelliJ and Maven.
Generating tests at the whole-system level, in contrast, avoids several of these problems:
● Implicit test oracles can be used; e.g., typically the system should respond within a certain amount of time and should not crash, no matter what the input is (see the sketch after this list).
● It is easier to define testing goals in relation to the overall system, as the system is under consideration anyway.
● The format of the output can be much less formal; e.g., an Excel document listing the individual test steps. Even if output as code, the requirements regarding code quality at this level are much lower, and spaghetti code is usually acceptable.
● The available and sensible set of next actions is often obvious. If it is not, this may also indicate an issue with the user experience design.
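As a minimal illustration, the sketch below generates random system-level inputs and relies only on implicit oracles; the endpoint URL and the payload structure are hypothetical assumptions.

```python
# Randomly generated system-level tests relying only on implicit oracles:
# the system must not crash (no HTTP 5xx) and must respond in a timely manner.
# The endpoint URL and the payload structure are hypothetical assumptions.
import random
import string
import requests

BASE_URL = "http://localhost:8080/api/orders"

def random_text(max_len: int = 50) -> str:
    return "".join(random.choices(string.printable, k=random.randint(0, max_len)))

def test_random_orders_do_not_crash_the_system():
    for _ in range(100):
        payload = {"customer": random_text(), "quantity": random.randint(-10, 10_000)}
        # The timeout acts as the implicit "responds in time" oracle.
        response = requests.post(BASE_URL, json=payload, timeout=2.0)
        # Implicit oracle: no crash / internal server error.
        assert response.status_code < 500
```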
Note that the knowledge incorporated in such a neural network is generally not bound to a specific piece of software, as it represents user experience design rules, such as Fitts’s law.
One could also describe using a GUI as being like playing a game: in any given GUI state, the AI has to decide which move to make next (i.e., which user action to execute). However, in contrast to a game, where the clear goal is to win, the goal in testing is not so clear, so special consideration has to be given to its definition. In addition to manually defined soft testing strategies, the goal for the AI has to be measurable in a fully automated way. Therefore, goals such as high overall code coverage are typically used.
These approaches can even be combined. An evolutionary algorithm starts off with a seed population. This seed population can either be created by humans (e.g., recordings of user stories) or be randomly generated. The better the seed population, the better and faster the results. So, if the seed population is (partly) generated randomly, a neural network can be used to make the generated tests more human-like.
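A minimal sketch of such a combination follows; the action set, the coverage measurement and the "human-likeness" scoring are hypothetical placeholders, not part of the syllabus.

```python
# Sketch of evolutionary GUI test generation with a neural-network-based
# "human-likeness" score. ACTIONS, measure_coverage() and human_likeness()
# are hypothetical placeholders.
import random

ACTIONS = ["open_menu", "click_save", "enter_text", "select_option", "press_back"]

def measure_coverage(test: list) -> float:
    # Placeholder: really, execute the action sequence against the SUT and
    # return the achieved code coverage.
    return random.random()

def human_likeness(test: list) -> float:
    # Placeholder: really, a neural network trained on recorded user sessions
    # could score how plausible the action sequence is.
    return random.random()

def fitness(test: list) -> float:
    return measure_coverage(test) + 0.3 * human_likeness(test)

def mutate(test: list) -> list:
    child = list(test)
    child[random.randrange(len(child))] = random.choice(ACTIONS)
    return child

# Seed population: recorded user stories and/or randomly generated sequences.
population = [[random.choice(ACTIONS) for _ in range(10)] for _ in range(20)]

for generation in range(50):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(10)]

best_test = max(population, key=fitness)
```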
used). This typically results in false positives. Depending on when that problem arises, it can even be hard to detect (e.g., when text is entered into the wrong field early during test execution).
AI can be applied to help identify the correct object using a multitude of identification criteria (e.g., XPath, label, id, class, X/Y coordinates), by using the “looks” of the object through some form of image recognition, or by choosing the historically most stable single identification criterion. All of these approaches have been used in the past. However, none of them addresses the principal underlying problem: the continual change of the software. If the correct object is identified by a multitude of criteria, the approach is robust against changes to one or even several of those criteria. However, when such changes occur, the underlying baseline then needs to be updated to reflect them. Otherwise, the confidence in the identification decreases, and after multiple such changes to different criteria over the lifetime of the software, at some point the remaining unchanged criteria will no longer yield a high enough confidence; this only postpones the problem rather than solving it. The same is true for image recognition: as the difference between the past and present images grows, at some point the confidence will no longer be high enough. And the same holds for choosing the historically most stable single identification criterion: even if it was the most stable one so far, at some point it may still change.
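As a minimal sketch of identification via a multitude of weighted criteria, consider the following; the attributes, weights and confidence threshold are illustrative assumptions only.

```python
# Sketch: identify a GUI object by combining several identification criteria.
# The attributes, weights and confidence threshold are illustrative assumptions.
from typing import Optional

RECORDED = {"id": "btn-submit", "label": "Submit", "class": "btn primary",
            "xpath": "/html/body/form/button[1]", "x": 310, "y": 540}

WEIGHTS = {"id": 0.35, "label": 0.25, "class": 0.15, "xpath": 0.15, "position": 0.10}

def confidence(candidate: dict) -> float:
    score = 0.0
    for attr in ("id", "label", "class", "xpath"):
        if candidate.get(attr) == RECORDED[attr]:
            score += WEIGHTS[attr]
    # The position counts as matching if the element has not moved too far.
    if (abs(candidate.get("x", 0) - RECORDED["x"]) < 20
            and abs(candidate.get("y", 0) - RECORDED["y"]) < 20):
        score += WEIGHTS["position"]
    return score

def identify(candidates: list) -> Optional[dict]:
    best = max(candidates, key=confidence)
    # After several UI changes, confidence drops below the threshold and
    # identification fails unless the recorded baseline is updated.
    return best if confidence(best) >= 0.6 else None
```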
The rate of type-II errors (missed bugs) often cannot be determined in a live system, or at least not directly (e.g., it is hard to know how many bugs were not detected).
Maintainability
After generating all those test cases, how do you maintain them with reasonable effort in relation to their value? As the pesticide paradox states, the ability of test cases to find defects decreases over time, while the maintenance effort often increases. Also, generated test cases are usually hard for humans to understand (e.g., it is difficult to determine the goal of any given test case) and therefore hard to maintain. Sometimes a small detail makes a test valuable (e.g., with regard to coverage, by checking a specific option). If the maintainer is oblivious to what makes the test valuable, maintenance can easily turn a valuable test into useless deadweight.
Summary
In summary, it can be said that, due to the oracle problem, manual testing cannot be replaced by AI. But AI can improve the efficiency and effectiveness of manual testing and of many other QA-related tasks. However, tool vendor claims should (as for any other tool) be critically scrutinized, and an extensive tool evaluation is recommended. When doing this evaluation, make sure not to perform it on a (vendor-delivered) toy example, but on a real-world project. As a tester, you should have the expertise to critically test not only your software, but also the vendor's tool.
4.0 Appendix
4.1 Glossary
artificial intelligence: Manually designed intelligence (i.e., in contrast to biologically evolved).
A/B testing: A type of testing that is run with two equivalent samples, changing one variable.
actuators: Components of a system which enable it to influence the environment.
adversarial actor: A person with adversarial objectives relating to a system; e.g., a hacker.
agency: A property of a test environment describing whether it contains one or more agents.
approval testing: Synonymous with golden master testing, in that changes to the golden master need to be approved.
automation bias: When a human decision maker favors recommendations made by an automated
decision-making system over information made without automation, even when the automated
decision-making system makes errors.
bias: 1) When a statistical model does not accurately reflect reality; or, 2) When being in favor of or against one thing, person, or group compared with another, usually in a way considered to be unfair. Both can occur independently of one another; e.g., a sociological bias present in the data can be accurately reflected by the model.
blind test: A testing procedure designed to avoid biased results by ensuring that at the time of the
test the subjects do not know if they are subject to a test or are part of a control group. Also
double-blind, where those evaluating the results are also not aware of whether they are evaluating
test results or a control group.
bug triaging: The process of grouping and maintaining defect tickets (e.g., removing duplicates).
characterization testing: Synonymous with golden master testing, in that the SUT is tested for “characterizing properties,” namely the result.
clustering: Identifying individuals as belonging to the same group or sharing some characteristic trait.
confusion matrix: A table that summarizes how successful a classification model's predictions were;
that is, the correlation between the label and the model's classification. One axis of a confusion
matrix is the label that the model predicted, and the other axis is the actual label.
correlation: A correlation between things is a connection or link between them.
cross-browser testing: Executing the same test with multiple browsers to identify differences.
cross-device testing: Executing the same test with multiple devices to identify differences.
data feature: A measurable property of the phenomenon being observed, usually an input to a
machine learning model.
decision trees: An explicitly modeled way to arrive at a specific resolution, given certain decision
parameters.
deep learning: A specific arrangement of a neural network that contains many neural layers.
deterministic system: A deterministic system is a system which, given the same inputs and initial
state, will always produce the same output.
difference testing: Testing approach that explicitly checks for changes in relation to a reference
instead of, for example, checking for “correctness”.
discreteness: A characteristic of an environment which reflects whether it can be divided into zones
or measures, as opposed to an environment which can be continuously measured.
drift: Drift is the concept that the correlation between the inputs and the outputs of an AI system
may change over time, including situations where training data labels change over time.
episodicness: A characteristic of an environment which reflects whether it has dependencies on
previous states.
expert knowledge-based system: A system in which knowledge (usually from subject matter experts) is explicitly represented (e.g., via decision trees or rules).
external validity: Using external data or benchmarks in order to validate the quality of a system.
fuzz testing: A random-based testing approach, where inputs are generated or changed (fuzzed).
golden master testing: A difference testing approach, where the reference is a previous version of the system. The term “Golden Master” refers to the original audio recording from which vinyl discs were copied.
human in the loop: An AI system monitored by a human, or where a human approves each decision.
hyperparameter: A parameter whose value is set to configure the learning algorithm before the
learning phase.
identifier selection: The selection of the unique identifying attribute from a machine readable
representation (e.g., HTML, CSS) used during user interface automation.
internal validity: Using the input data to a system in order to validate its quality.
regression model: A type of model that outputs continuous (typically, floating-point) values.
machine learning: A program or system that builds (trains) a predictive model from input data. Machine learning also refers to the field of study concerned with these programs or systems.
machine learning model: A trained model that makes predictions from new (never-before-seen) data drawn from the same distribution as the data used to train the model.
metamorphic relations: Probabilistic relationships between the inputs and outputs of a system.
metamorphic testing: A testing technique which uses metamorphic relations as a pseudo-oracle.
model training: The process of determining the ideal parameters comprising a model.
neural network: A model that, taking inspiration from the brain, is composed of layers (at least one
of which is hidden) consisting of simple connected units or neurons followed by nonlinearities.
object recognition: In user interface automation, the ability of the automation tool to recognize
user interface objects.
observability: The degree to which the AI system can monitor and observe its environment.
oracle problem: The inability to generally identify the correct output of a system.
overfitting: Creating a model that matches the training data so closely that the model fails to make
correct predictions on new data.
probabilistic system: A system where the occurrence of events cannot be perfectly predicted.
reinforcement learning: A machine learning approach to maximize an ultimate reward through
feedback (rewards and punishments) after a sequence of actions.
risk estimation: The process of identifying system quality risks relating to a testing activity.
sensors: Components used by the system to perceive the environment.
snapshot-based testing: Synonymous with golden master testing, in that the golden master is the snapshot.
staticness: A characteristic of an environment which reflects the degree to which it changes without
the AI system taking an action.
statistical significance: The likelihood that observed test results can be extrapolated across a wider
population of data.
supervised learning: Training a model from input data and its corresponding labels.
symbolic AI: Formalizing reasoning through mathematical logic and automatic deduction
techniques.
test automation: The use of software to perform or support test activities, e.g., test management,
test design, test execution and results checking.
test data generation: The automated generation of data for testing purposes.
test generation: The automated production of test procedures.
test maintenance: The process of updating an automated test following desirable changes to the
system under test which cause the test to fail.
test prioritization: The process of deciding which tests should be run first.
test selection: The process of deciding which tests should be run.
training data: The subset of the dataset used to train a model.
type I error: A false positive.
type II error: A false negative.
underfitting: Producing a model with poor predictive ability because the model hasn’t captured the
complexity of the training data.
unsupervised learning: Training a model to find patterns in a dataset, typically an unlabeled
dataset.
variance: A machine learning model’s sensitivity to specific sets of training data.
visual regression testing: Test automation which aims solely at identifying visual regressions or visual differences on different platforms.
4.2 References
Websites
EvoSuite: http://www.evosuite.org [accessed: June 15, 2019]
This Person Does Not Exist: https://thispersondoesnotexist.com/ [accessed: June 15, 2019]
Havrikov, N.; Höschele, M.; Galeotti, J. P.; Zeller, A. (2014). “XMLMate: Evolutionary XML Test
Generation,” in the Proceedings of 22nd ACM SIGSOFT International Symposium on the Foundations
of Software Engineering.
Hoffman, D. (1998). “A Taxonomy for Test Oracles,”
http://www.softwarequalitymethods.com/Papers/OracleTax.pdf [accessed: July 21, 2019]
Holler, C.; Herzig, K.; Zeller A. (2012). “Fuzzing with Code Fragments,” in the Proceedings of the 21st
USENIX Conference on Security Symposium, USENIX Association,
https://dl.acm.org/citation.cfm?id=2362793.2362831 [accessed: July 6, 2019]
Jürgens, E.; Hummel, B.; Deissenboeck, F.; Feilkas, M.; Schlögel, C.; Wübbeke, A. (2011). “Regression
Test Selection of Manual System Tests in Practice,” in the Proceedings of the 15th European
Conference on Software Maintenance and Reengineering (CSMR2011),
http://dx.doi.org/10.1109/CSMR.2011.44 [accessed: July 21, 2019]
Kim, D.; Wang, X.; Kim, S.; Zeller, A.; Cheung, S.C.; Park, S. (2011). “Which Crashes Should I Fix First?
Predicting Top Crashes at an Early Stage to Prioritize Debugging Efforts,” in the IEEE Transactions on
Software Engineering, volume 37, https://ieeexplore.ieee.org/document/5711013 [accessed: July 1,
2019].
Knauf, R.; Gonzalez, A. J.; Jantke, K. P. (1999). “Validating rule-based systems: a complete methodology,” in the IEEE SMC’99 Conference Proceedings, 1999 IEEE International Conference on Systems, Man, and Cybernetics, volume 5.
Lieberherr, K. J.; Holland, I. (1989). “Assuring Good Style for Object-Oriented Programs,” in IEEE Software, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.3681&rep=rep1&type=pdf [accessed: July 6, 2019]
McCorduck, P. (2004). Machines Who Think, second edition, CRC Press.
Minsky, M., and S. Papert (1969). Perceptrons: An Introduction to Computational Geometry, MIT
Press.
MMC Ventures, Barclays UK Ventures (2019). “The State of AI: Divergence 2019,”
https://www.mmcventures.com/wp-content/uploads/2019/02/The-State-of-AI-2019-Divergence.pdf
[accessed: July 6, 2019]
Nachmanson, L.; Veanes, M.; Schulte, W.; Tillmann, N.; and Grieskamp, W. (2004). “Optimal
Strategies for Testing Nondeterministic Systems,” https://www.microsoft.com/en-us/research/wp-
content/uploads/2004/01/OptimalStrategiesForTestingNondeterminsticSystemsISSTA2004.pdf
[accessed: July 21, 2019].
Patel, K.; Hierons, R. M. (2018). “A mapping study on testing non-testable systems,” Software Quality Journal, volume 26, number 4.
Russell, S. J.; Norvig, P. (2009). Artificial Intelligence: A Modern Approach, third edition, Prentice Hall.
Vanderah, T. (2018). Nolte's Essentials of the Human Brain, second edition, Elsevier.
Wolpert, D. H.; Macready, W. G. (1997). “No free lunch theorems for optimization,” in the IEEE
Transactions on Evolutionary Computation, volume 1, number 1.
Zhang, H. (2004). “The Optimality of Naïve Bayes,” in the Proceedings of the Seventeenth
International Florida Artificial Intelligence Research Society Conference (FLAIRS 2004), AAAI Press.
Zimmermann, T.; Nagappan, N.; Zeller, A. (2008). “Predicting Bugs from History,” in Software
Evolution (69-88), Springer.