A Survey On Load Testing of Large-Scale

This article has been accepted for publication in a future issue of this journal, but has not been
fully edited. Content may change prior to final publication. Citation

information: DOI 10.1109/TSE.2015.2445340, IEEE Transactions on Software Engineering
1
A Survey on Load Testing of Large-Scale

Software Systems
Zhen Ming Jiang, Member, IEEE, Ahmed E. Hassan, Member, IEEE
Abstract—Many large-scale software systems must service thousands or millions of concurrent requests. These systems must
be load tested to ensure that they can function correctly under load (i.e., the rate of the incoming requests). In this paper, we
survey the state of load testing research and practice. We compare and contrast current techniques that are used in the three
phases of a load test: (1) designing a proper load, (2) executing a load test, and (3) analyzing the results of a load test. This survey
will be useful for load testing practitioners and software engineering researchers with interest in the load testing of large-scale
software systems.
Index Terms—Software Testing, Load Testing, Software Quality, Large-Scale Software Systems, Survey
1 I NTRODUCTION Load Test Objectives
Many large-scale systems ranging from e-commerce

websites to telecommunication infrastructures must
Design a Load Test
support concurrent access from thousands or millions Designing Realistic Loads Designing Fault-Inducing Loads
of users. Studies show that failures in these systems Load Design Optimization and Reduction
tend to be caused by their inability to scale to meet

user demands, as opposed to feature bugs [1], [2]. The Testing Load
failure to scale often leads to catastrophic failures and
unfavorable media coverage (e.g., the meltdown of
the Firefox website [3], the botched launch of Apple’s Execute the Load Test
MobileMe [4] and US Government’s Health Care Web- Live-User Based Driver Based Emulation Based
Test Execution Test Execution Test Execution
site [5]). To ensure the quality of these systems, load Setup
testing is a required testing procedure in addition to Load Generation and Termination
conventional functional testing procedures, like unit Test Monitoring and Data Collection
testing and integration testing.
This paper surveys the state of research and prac- Recorded System
tices in the load testing of large-scale software sys- Behavior Data
tems. This paper will be useful for load testing prac-
titioners and software engineering researchers with Analyze the Load Test
interests in testing and analyzing large-scale soft- Verifying Against Detecting Known Detecting Anomalous
Threshold Values Problems Behavior
ware systems. Unlike functional testing, where we
have a clear objective (pass/fail criteria), load testing
can have one or more functional and non-functional
Test Results
objectives as well as different pass/fail criteria. As
illustrated in Figure 1, we propose the following three
research questions on load testing based on the three Fig. 1: Load Testing Process
phases of traditional software testing (test design, test
execution and test analysis [6]):
1) How is a proper load designed? The Load Design phase defines the load that will
be placed on the system during testing based on
the test objectives (e.g., detecting functional and
• Zhen Ming Jiang is with the Software Construction, AnaLytics and
Evaluation (SCALE) lab, Department of Electrical Engineering and
performance problems under load). There are
Computer Science, York University, Toronto, Ontario, Canada. two main schools of load designs: (1) designing
E-mail: [email protected] realistic loads, which simulate workload that
• Ahmed E. Hassan is with the Software Analysis and Intelligence
(SAIL) Lab, School of Computing, Queen’s University, Kingston,
may occur in the field; or (2) designing fault-
Ontario, Canada. inducing loads, which are likely to expose load
E-mail: [email protected] related problems. Once the load is designed,
some optimization and reduction techniques
0098-5589 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
2
could be applied to further improve various testing by contrasting among various interpretations
aspects of the load (e.g., reducing the duration of load, performance and stress testing. Then we
of a load test). In this research question, we briefly explain our selection process of the surveyed
will discuss various load design techniques and papers.
explore a number of load design optimization
and reduction techniques. 2.1 Definitions of Load Testing, Performance
2) How is a load test executed? Testing and Stress Testing
In this research question, we explore the tech-
niques and practices that are used in the Load We find that these three types of testing share some
Test Execution phase. There are three different common aspects, yet each has its own focus. In the
test execution approaches: (1) using live-users to rest of this section, we first summarize the various
manually generate load, (2) using load drivers to definitions of the testing types. Then we illustrate
automatically generate load, and (3) deploying their relationship with respect to each other. Finally,
and executing the load test on special platforms we present our definition of load testing. Our load
(e.g., a platform which enables deterministic testing definition unifies the existing load testing in-
test executions). These three load test execu- terpretations as well as performance and stress testing
tion approaches share some commonalities and interpretations, which are also about load testing.
differences in the following three aspects: (1) There could be other aspects/objectives (e.g., addi-
setup, which includes deploying the system and tional non-functional requirements) of load testing
configuring the test infrastructure and the test that we may have missed due to our understanding
environment, (2) load generation and termina- of the objectives of software testing and our survey
tion, and (3) test monitoring and data collection. process.
3) How is the result of a load test analyzed? Table 1 outlines the interpretations of load testing,
In this research question, we survey the tech- performance testing and stress testing in the existing
niques used in the Load Test Analysis phase. literature. The table breaks down various interpreta-
The system behavior data (e.g., execution logs tions of load, performance and stress testing along the
and performance counters) recorded during the following dimensions:
test execution phase needs to be analyzed to • Objectives refer to the goals that a test is trying
determine if there are any functional or non- to achieve (e.g., detecting performance problems
functional load-related problems. There are three under load);
general load test analysis approaches: (1) veri- • Stages refer to the applicable software devel-
fying against known thresholds (e.g., detecting opment stages (e.g., design, implementation, or
system reliability problems), (2) checking for testing), during which a test occurs;
known problems (e.g., memory leak detection), • Terms refer to the terminology used in the rele-
and (3) inferring anomalous system behavior. vant literature (e.g., load testing and performance
The structure of this paper is organized as follows: testing);
• Is It Load Testing? indicates whether we consider
Section 2 provides some background about this sur-
vey. Then, based on the flow of a load test, we discuss such cases (performance or stress testing) to be
the techniques that are used in designing a load test load testing based on our working definition
(Section 3), in executing a load test (Section 4), and of load testing. The criteria for deciding load,
in analyzing the results of a load test (Section 5). performance and stress testing is presented later
Section 6 concludes our survey. (Section 2.2).
2.1.1 Load Testing

2 BACKGROUND Load testing is the process of assessing the behavior
Contrary to functional testing, which has clear testing of a system under load in order to detect load-related
objectives (pass/fail criteria), load testing objectives problems. The rate at which different service requests
(e.g., performance requirements) are not clear in the are submitted to the system under test (SUT) is called
early development stages [7], [8] and are often defined the load [73]. The load-related problems can be either
later on a case-by-case basis (e.g., during the initial functional problems that appear only under load (e.g,
observation period in a load test [9]). There are many such as deadlocks, racing, buffer overflows and mem-
different interpretations of load testing, both in the ory leaks [23], [24], [25]) or non-functional problems
context of academic research and industrial practices which are violations in non-functional quality-related
(e.g., [10], [11], [12], [13]). In addition, the term load requirements under load (e.g., reliability [23], [37],
testing is often used interchangeably with two other stability [10], and robustness [30]).
terms: performance testing (e.g., [14], [15], [16]) and Load testing is conducted on a system (either a
stress testing (e.g., [11], [17], [18]). In this section, we prototype or a fully functional system) rather than
first provide our “own” working definition of load on a design or an architectural model. In the case
3
TABLE 1: Interpretations of Load Testing, Performance Testing and Stress Testing
Is It Load
Testing?
Objectives Stages Terms
Detecting functional problems under load Testing (After Conven- Load Testing [14], [18], [19], [20], [21], [22], [23], Yes
tional Functional Testing) [24], [25], [26], [27], [28], [29],
Stress Testing [18], [26], [30], [31], [32], [33], [34]
Detecting violations in performance re- Testing (After Conven- Load Testing [10], Yes
quirements under load tional Functional Testing) Performance Testing [2], [10], [35],
Stress Testing [36], [37], [38]
Detecting violations in reliability require- Testing (After Conven- Load Testing [10], [21], [22], [23], [37], Yes
ments under load tional Functional Testing) Reliability Testing [10]
Detecting violations in stability require- Testing (After Conven- Load Testing [10], Yes
ments under load tional Functional Testing) Stability Testing [10]
Detecting violations in robustness require- Testing (After Conven- Load Testing [10], Yes
ments under load tional Functional Testing) Stress Testing [30], [10]
Measuring and/or evaluating system per- Implementation Performance Testing [39], [40], [41], [42] Depends
formance under load
Testing (After Conven- Performance Testing [12], [13], [15], [16], [43], Depends
tional Functional Testing) [44], [45], [46], [47], [48], [49],
Load Testing [15], [50],
Stress Testing [37], [38], [51], [52], [53], [54]
Maintenance (Regression Performance Testing [55], Regression Bench- Depends
Testing) marking [56], [57]
Measuring and/or evaluating system per- Testing (After Conven- Performance Testing [58], [59], [60] No
formance without load tional Functional Testing)
Measuring and/or evaluating compo- Implementation Performance Testing [61] No
nent/unit performance
Measuring and/or evaluating various de- Design Performance Testing [62], [63], [64], [65], Stress No
sign alternatives Testing [66], [67], [68], [69]
Testing (After Conven- Performance Testing [70] No
tional Functional Testing)
Measuring and/or evaluating system per- Testing (After Conven- Performance Testing [48], [71], [72] No
formance under different configurations tional Functional Testing)
of missing non-functional requirements, the pass/fail with load (e.g., e-commerce systems [15], [16], [47],
criteria of a load test are usually derived based on middle-ware systems [2], [35] or service-oriented dis-
the “no-worse-than-before” principle. The “no-worse- tributed systems [39]), or single user request (e.g.,
than-before” principle states that the non-functional mobile applications [60] or desktop applications [58],
requirements of the current version should be at least [59]).
as good as the prior version [26]. Depending on the Contrary to load testing, the objectives of perfor-
objectives, the load can vary from a normal load mance testing are broader. Performance testing (1) can
(the load expected in the field when the system is verify performance requirements [48] or in case of ab-
operational [23], [37]) or a stress load (higher than sent performance requirements, the pass/fail criteria
the expected normal load) to uncover functional or are derived based on the “no-worse-than-previous”
non-functional problems [29]. principle [26] (similar to load testing); or (2) can be
exploratory (no clear pass/fail criteria). For example,
2.1.2 Performance Testing one type of performance testing aims to answer the
Performance testing is the process of measuring what-if questions like “what is system performance if
and/or evaluating performance related aspects of a we change this software configuration option or if we
software system. Examples of performance related as- increase the number of users?” [47], [48], [76], [77].
pects include response time, throughput and resource
utilizations [24], [25], [74]. 2.1.3 Stress Testing
Performance testing can focus on parts of the sys- Stress testing is the process of putting a system under
tem (e.g., unit performance testing [61] or GUI perfor- extreme conditions to verify the robustness of the
mance testing [75]), or on the overall system [24], [25], system and/or to detect various load-related prob-
[48]. Performance testing can also study the efficiency lems (e.g., memory leaks and deadlocks). Examples
of various design/architectural decisions [63], [64], of such conditions can either be load-related (putting
[65], different algorithms [58], [59] and various system system under normal [36], [37], [38] or extreme heavy
configurations [48], [71], [72]. Depending on the types load [14], [26], [37], [27]), limited computing resources
of systems, performance testing can be conducted (e.g., high CPU [78]), or failures (e.g., database fail-
4
needed to validate the system’s performance

under the expected field workload. Such type
of testing is not considered to be stress testing
as the testing load does not exceed the expected
field workload.
2) Scenarios Considered as Both Load Testing and
Stress Testing
The e-commerce system must be be robust under
extreme conditions. For example, this system
is required to stay up even under bursts of
heavy load (e.g., flash crowd [15]). In addition,
the system should be free of resource allocation
Fig. 2: Relationships Among Load, Performance and bugs, like deadlocks or memory leaks [34].
Stress Testing This type of testing, which imposes a heavy load
on the system to verify the system’s robustness
and to detect resource allocation bugs, is consid-
ure [20]). In other cases, stress testing is used to ered as both stress testing and load testing. Such
evaluate the efficiency of software designs [66], [67], testing is not performance testing, as software
[68], [69]. performance is not one of the testing objectives.
3) Scenarios Only Considered as Load Testing
Although this system is already tested manually
2.2 Relationships Between Load Testing, Perfor- using a small number of users to verify the
mance Testing and Stress Testing functional correctness of a service request (e.g.,
As Dijkstra pointed out in [79], software testing can the total cost of a shopping cart is calculated
only show the presence of bugs but not their absence. correctly when a customer checks out), the cor-
Bugs are the behavior of systems which deviate from rectness of the same types of requests should be
the specified requirements. Hence, we define our uni- verified under hundreds or millions of concur-
fied definition of load testing that is used in this paper rent users.
is as follows: The test, which aims to verify the functional
correctness of a system under load is consid-
ered only as a load test. This scenario is not
Load testing is the process of assessing system behavior performance testing, as the objective is not per-
under load in order to detect problems due to one or both formance related; nor is this scenario considered
of the following reasons: (1) functional-related problems as stress testing, as the testing conditions are not
(i.e., functional bugs that appear only under load), and (2) extreme.
non-functional problems (i.e., violations in non-functional 4) Scenarios Considered as Load, Performance
quality-related requirements under load). and Stress Testing
This e-commerce website can also be accessed
Comparatively, performance testing is used to mea- using smartphones. One of the requirements
sure and/or evaluate performance related aspects is that the end-to-end service request response
(e.g., response time, throughput and resource utiliza- time should be reasonable even under poor cel-
tions) of algorithms, designs/architectures, modules, lular network conditions (e.g., packet drops and
configurations, or the overall systems. Stress testing packet delays).
puts a system under extreme conditions (e.g., higher The type of test used to validate the performance
than expected load or limited computing resources) requirements of the SUT with limited computing
to verify the robustness of the system and/or detect resources (e.g., network conditions), can be con-
various functional bugs (e.g., memory leaks and dead- sidered as all three types of testing.
locks). 5) Scenarios Considered as Performance Testing
There are commonalities and differences among and Stress Testing
the three types of testing, as illustrated in the Venn Rather than testing the system performance after
Diagram shown in Figure 2. We use an e-commerce the implementation is completed. The system
system as a working example to demonstrate the architect may want to validate whether a com-
relation across these three types of testing techniques. pression algorithm can efficiently handle large
1) Scenarios Considered as Both Load Testing and image files (processing time and resulting com-
Performance Testing pressed file size). Such testing is not considered
The e-commerce system is required to provide to be load testing, as there is no load (concurrent
fast response under load (e.g., millions of con- access) applied to the SUT.
current client requests). Therefore, testing is 6) Scenarios Considered as Performance Testing
5
Only be related to load testing (e.g., [11], [17], [15], [21],

In addition to writing unit tests to check the [23], [87], [88]).
functional correctness of their code, the devel-
opers are also required to unit test the code per- 3 R ESEARCH Q UESTION 1: H OW IS A
formance. The test to verify the performance of
PROPER LOAD DESIGNED ?
one unit/component of the system is considered
only as performance testing. The goal of the load design phase is to devise a load,
In addition, the operators of this e-commerce which can uncover load-related problems under load.
system need to know the system deployment Based on the load test objectives, there are two general
configurations to achieve the maximal per- schools of thought for designing a proper load to
formance throughput using minimal hardware achieve such objectives:
costs. Therefore, performance testing should 1) Designing Realistic Loads
be carried out to measure the system per- As the main goal of load testing is to ensure
formance under various database or web- that the SUT can function correctly once it is
server configurations. The type of test to eval- deployed in the field, one school of thought is to
uate the performance of different architec- design loads, which resemble the expected usage
tures/algorithms/configurations is only consid- once the system is operational in the field. If the
ered as performance testing. SUT can handle such loads without functional
7) Scenarios Considered as Stress Testing Only and non-functional issues, the SUT would have
Developers have implemented a smartphone ap- passed the load test.
plication for this e-commerce system to enable Once the load is defined, the SUT executes the
users to access and buy items from their smart- load and the system behavior is recorded. Load
phones. This smartphone application is required testing practitioners then analyze the recorded
to work under sporadic network conditions. system behavior data to detect load-related
This type of test is considered as stress testing, problems. Test durations in such cases are usu-
since the application is tested under extreme ally not clearly defined and can vary from sev-
network condition. This testing is not considered eral hours to a few days depending on the
to be performance testing, since the objective testing objectives (e.g., to obtain steady state
is not performance related; nor is this scenario estimates of the system performance under load
considered as load testing, as the test does not or to verify that the SUT is deadlock-free) and
involve load. testing budget (e.g., limited testing time). There
are two approaches proposed in the literature
to design realistic testing loads as categorized
2.3 Our Paper Selection Process in [53]:
We first search the following three keywords on a) The Aggregate-Workload Based Load De-
the General scholarly article search engines (DBLP sign Approach
searches [80] and Google Scholar [81]): “load test”, The aggregate-workload based load design
“performance test” and “stress test”. Second, we filter approach aims to generate the individual
irrelevant papers based on the paper titles, publication target request rates. For example, an e-
venues and abstracts. For example, results like “Test commerce system is expected to handle
front loading in early stages of automotive software three types of requests with different trans-
development based on AUTOSAR” are filtered out. action rates: ten thousand purchasing re-
We also remove performance and stress testing papers quests per second, three million brows-
that are not related to load testing (e.g., “Backdrive ing requests per second, and five hundred
Stress-Testing of CMOS Gate Array Circuits”). Third, registration requests per second. The re-
we add additional papers and tools based on the sulting load, using the aggregate-workload
related work sections from relevant load testing pa- based load design approach, should resem-
pers, which do not contain the above three keywords. ble these transaction rates.
Finally, we include relevant papers that cite these pa- b) The Use-Case Based Load Design Ap-
pers, based on the “Cited by” feature from Microsoft proach
Academic Search [82], Google Scholar [81], ACM Por- The use-case (also called user equivalent
tal [83] and IEEE Explore [84]. For example, papers in [53]) based approach is more focused
like [85], [86] are included, because they cite [23] and on generating requests that are derived
[87], respectively. from realistic use cases. For example, in the
In the end, we have surveyed a total of 147 papers aforementioned e-commerce system, an in-
and tools between the year 1993 − 2013. To verify dividual user would alternate between sub-
the completeness of the surveyed papers, the final mitting page requests (browsing, searching
results include all the papers we knew beforehand to and purchasing) and being idle (reading the
6
web page or thinking). In addition, a user ample, an operational profile for an e-commerce
cannot purchase an item before he/she logs system would describe the number of concur-
into the system. rent requests (e.g., browsing and purchasing) that
2) Designing Fault-Inducing Loads the system would experience during a day. The
Another school of thought aims to design loads, process of extracting and representing the ex-
which are likely to cause functional or non- pected workload (operational profile) in the field
functional problems under load. Compared to is called Workload Characterization [89]. The goal
realistic loads, the test duration using faulting- of workload characterization is to extract the ex-
inducing loads are usually deterministic and pected usage from hundreds or millions hours of
the test results are easier to analyze. The test past field data. Various workload characterization
durations in these cases are the time taken for techniques have been surveyed in [89], [90], [91].
the SUT to enumerate through the loads or the • Output refers to the types of output from each
time until the SUT encounters a functional or load design technique. Examples can be work-
non-functional problem. load configurations or usage models. Workload
There are two approaches proposed in the liter- configuration refers to one set of workload mix
ature for designing fault-inducing loads: and workload intensity (covered in Section 3.1.1).
Models refer to various abstracted system usage
a) Deriving Fault-Inducing Loads by Analyz-
models (e.g., the Markov chain).
ing the Source Code
• References refer to the list of literatures, which
This approach uncovers various functional
propose each technique.
and non-functional problems under load by
systematically analyzing the source code of Both load design schools of thought (realistic v.s.
the SUT. For example, by analyzing the fault-inducing load designs) have their advantages
source code of our motivating e-commerce and disadvantages: In general, loads resulting from
system example, load testing practitioners realistic-load based design techniques can be used to
can derive loads that exercise these po- detect both functional and non-functional problems.
tential functional (e.g., memory leaks) and However, the test durations are usually longer and
non-functional (e.g., performance issues) the test analysis is more difficult, as the load testing
weak spots under load. practitioners have to search through large amounts
b) Deriving Fault-Inducing Loads by Build- of data to detect load-related problems. Conversely,
ing and Analyzing System Models although loads resulting from fault-inducing load de-
Various system models abstract different assign techniques take less time to uncover potential
pects of system behavior (e.g., performance functional and non-functional problems, the resulting
models for the performance aspects). By loads usually only cover a small portion of the testing
systematically analyzing these models, po- objectives (e.g., only detecting the violations in the
tential weak spots can be revealed. For ex- performance requirements). Thus, there are load opti-
ample, load testing practitioners can build mization and reduction techniques proposed to miti-
performance models in the aforementioned gate the deficiencies of each load design technique.
This section is organized as follows: Section 3.1
e-commerce system, and discover loads
covers the realistic load design techniques. Section 3.2
that could potentially lead to performance
covers the fault-inducing load design techniques. Sec-
problems (higher than expected response
tion 3.3 discusses the test optimization and reduction
time).
techniques used in the load design phase. Section 3.4
We introduce the following dimensions to compare summarizes the load design techniques and proposes
among various load design techniques as shown in a few open problems.
Table 2:
• Techniques refer to the names of the load design 3.1 Designing Realistic Loads
techniques (e.g., step-wise load); In this subsection, we discuss the techniques used to
• Objectives refer to the goals of the load (e.g., design loads, which resemble the realistic usage once
detecting performance problems); the system is operational in the field. Section 3.1.1
• Data Sources refer to the artifacts used in each and 3.1.2 cover the techniques from the Aggregate-
load design technique. Examples of artifacts can Workload and the Use-Case based load design ap-
be past field data or operational profiles. Past proaches, respectively.
field data could include web access logs, which
record the identities the visitors and their visited 3.1.1 Aggregate-Workload Based Load Design Tech-
sites, and database auditing logs, which show niques
the various database interactions. An Operational Aggregate-workload based load design techniques
Profile describes the expected field usage once the characterize loads along two dimensions: (1) Work-
system is operational in the field [23]. For ex- load Intensity, and (2) Workload Mix:
7
TABLE 2: Load Design Techniques
Techniques Objectives Data Sources Output References
Realistic Load Design - (1) Aggregate-Workload Based Load Design Approach (Section 3.1.1)
Steady Load Operational Profiles, One Configuration of Work- [45], [92]
Detecting Functional Past Usage Data load Mix and Workload In-
and Non-Functional tensity
Step-wise Load Problems Under Operational Profiles, Multiple Configurations of [30], [93],
Load Past Usage Data Workload Mixed and Work- [36], [94],
load Intensities [95], [96],
[97], [98]
Extrapolated Load Beta-user Usage Data, One or More Configura- [44], [99]
Interviewing Domain tions of Workload Mix and
Experts and Competi- Workload Intensities
tions Data
Realistic Load Design - (2) Use-Case Based Load Design Approach (Section 3.1.2)
Testing Loads Derived using UML UML Use Case Di- UML Diagrams Tagged [100], [101],
Models Detecting Functional agrams, UML Activ- with Request Rates [102], [103]
and Non-Functional ity Diagrams, Opera-
Problems tional Profile
Testing Loads Derived using Past Usage Data Markov Chain Models [104], [105],
Markov Models [16]
Testing Loads Derived using Operational Stochastic Form-oriented [106], [107]
Stochastic Form-oriented Models Profile, Business Models
Requirements, User
configurations
Testing Loads Derived using Prob- User Configurations Probabilistic Timed [108], [109],
abilistic Timed Automata Automata [110]
Fault-Inducing Load Design - (1) Deriving Load from Analyzing the Source Code (Section 3.2.1)
Testing Loads Derived using Data Detecting Functional Source Code Testing Loads Lead to Code [18]
Flow Analysis Problems (memory Paths with Memory Leaks
leaks)
Testing Loads Derived using Sym- Detecting Functional Source Code, Testing Loads Lead to Prob- [111], [29]
bolic Execution Problems (high Symbolic Execution lematic Code Paths with
memory usage) Analysis Tools Performance Problems
and Performance
Problems (high
response time)
Fault-Inducing Load Design - (2) Deriving Load from Building and Analyzing System Models (Section 3.2.2)
Testing Loads Derived using Linear Detecting Resource Usage Per Testing Loads Lead to [38], [54]
Programs Performance Request Performance Problems
Problems (audio (high response time)
and video not in
sync)
Testing Loads Derived using Ge- Detecting Resource Usage and [112], [113]
netic Algorithms Performance Response Time Per
Problems (high Task
response time)
• The Workload Intensity refers to the rate of the in- an existing operational profile. This steady load
coming requests (e.g., browsing, purchasing and could be the normal expected usage or the peak
searching), or the number of concurrent users; time usage depending on the testing objectives.
• The Workload Mix refers to the ratios among Running the SUT using a steady load can be
different types of requests (e.g., 30% browsing, used to verify the system resource requirements
10% purchasing and 60% searching). (e.g., memory, CPU and response time) [92] and
to identify resource usage problems (e.g., mem-
Three load design techniques have been proposed
ory leaks) [45].
to characterize loads with various workload intensity
2) Step-wise Load
and workload mix:
A system in the field normally undergoes vary-
1) Steady Load ing load characteristics throughout a normal
The most straightforward aggregate-workload day. There are periods of light usage (e.g., early
based load design technique is to devise a steady in the morning or late at night), normal usage
load, which contains only one configuration (e.g., during the working hours), and peak us-
of the workload intensity and workload mix ages (e.g., during lunch time). It might not be
throughout the entire load test [45]. A steady possible to load test a system using a single type
load can be inferred from past data or based on of steady load. Step-wise load design techniques
8
would devise loads consisting of multiple types these requests would fail due to invalid user states
of load, to model the light/normal/peak usage (e.g., some users do not have items added to their
expected in the field. shopping carts yet).
Step-wise load testing keeps the workload mix However, designing loads reflecting realistic use-
the same throughout the test, while increasing cases could be challenging, as there may be too many
the workload intensity periodically [30], [93], use cases available for the SUT. Continuing with our
[36], [94], [95], [96], [97], [98]. Step-wise load motivating example of the e-commerce system, dif-
testing, in essence, consists of multiple levels of ferent users can follow different navigation patterns:
steady load. Similar to the steady load approach, some users may directly locate items and purchase
the workload mix can be derived using the them. Some users may prefer to browse through a
past field data or an operational profile. The few items before buying the items. Some other users
workload intensity varies from system to sys- may just browse the catalogs without buying. It would
tem. For example, the workload intensity can be not be possible to cover all the combinations of these
the number of users, the normal and peak load sequences. Therefore, various usage models are pro-
usages, or even the amount of results returned posed to abstract the use cases from thousands and
from web search engines. millions of user-system interactions. In the rest of this
3) Load Extrapolation Based on Partial or Incom- subsection, we discuss four load design techniques
plete Data based on usage models.
The steady load and the step-wise load design
techniques require an existing operational pro- 1) Testing Loads Derived using UML Models
file or past field data. However, such data might UML diagrams, like Activity Diagrams and Use
not be available in some cases: For example, Case Diagrams, illustrate detailed user interac-
newly developed systems or systems with new tions in the system. One of the most straight
features have no existing operational profile or forward use-case based load design techniques
past usage data. Also, some past usage data may is to tag load information on the UML Activity
not be available due to privacy concerns. To cope Diagram [100], [101], [102] or Use Case Dia-
with these limitations, loads are extrapolated gram [100], [103] with probabilities. For exam-
from the following sources: ple, the probability beside each use case is the
• Beta-Usage Data likelihood that a user triggers that action [103].
Savoia [99] proposes to analyze log files For example, a user is more likely to navigate
from a limited beta usage and to extrapolate around (40% probability) than to delete a file
the load based on the number of expected (10% probability).
users in the actual field deployment. 2) Testing Loads Derived using Markov-Chain
• Interviews With Domain Experts Models
Domain experts like system administrators, The problem with the UML-based testing load
who monitor and manage deployed systems is that the UML Diagrams may not be available
in the field, generally have a sense of sys- or such information may be too detailed (e.g.,
tem usage patterns. Barber [44] suggests to hundreds of use cases). Therefore, techniques are
obtain a rough estimate of the expected field needed to abstract load information from other
usage by interviewing such domain experts. sources. A Markov Chain, which is also called
• Extrapolation from Using Competitors’ the User Behavior Graph [16], consists of a finite
Data number of states and a set of state transition
Barber [44] argues that in many cases, new probabilities between these states. Each state has
systems likely do not have a beta program a steady state probability associated with it. If
due to limited time and budgets and inter- two states are connected, there is a transition
viewing domain experts might be challeng- probability between these two states.
ing. Therefore, he proposes an even less for- Markov Chains are widely used to generate load
mal approach to characterize the load based for web-based e-commerce applications [104],
on checking out published competitors’ us- [105], [16], since Markov chains can be easily
age data, if such data exists. derived from the past field data (web access
logs [16]). Each entry of the log is a URL, which
3.1.2 Use-Case Based Load Design Techniques consists of the requested web pages and “param-
The main problem associated with the aggregate- eter name = parameter value” pairs. Therefore,
workload based load design approach is that the loads sequences of user sessions can be recovered by
might not be realistic/feasible in practice, because grouping sequences of request types belonging
the resulting requests might not reflect individual use to the same session. Each URL requested be-
cases. For example, although the load can generate comes one state in the generated Markov chain.
one million purchasing requests per second, some of Transition probabilities between states represent
9
real user navigation patterns, which are derived 3.2 Designing Fault-Inducing Loads
using the probabilities of a user clicking page B In this subsection, we cover the load design technique
when he/she is on page A. from the school of fault-inducing load design. There
During the course of a load test, user action se- are two approaches proposed to devise potential fault-
quences are generated based on the probabilities inducing testing loads: (1) by analyzing the source
modeled in the Markov chain. The think time code (Section 3.2.1), and (2) by building and analyzing
between successive actions is usually generated various system models (Section 3.2.2).
randomly based on a probabilistic distribution
(e.g., a normal distribution or an exponential 3.2.1 Deriving Fault-Inducing Loads via Source Code
distribution) [16], [105]. As the probability in the Analysis
Markov chain only reflects the average behavior
of a certain period of time, Barros et al. [104] rec- There are two techniques proposed to automatically
ommend the periodical updating of the Markov analyze the source code for specific problems. The first
chain based on the field data in order to ensure technique is trying to locate specific code patterns,
that load testing reflects the actual field behavior. which lead to known load-related problems (e.g.,
3) Testing Loads Derived using Stochastic Form- memory allocation patterns for memory allocation
oriented Models problems). The second technique uses model checkers
Stochastic Form-oriented Model is another tech- to systematically look for memory and performance
nique used to model a sequence of actions per- problems. In this section, we only look at the tech-
formed by users. Compared to the testing loads niques which analyze the source code statically. There
represented by the Markov Chain models, a are also load generation techniques which leverage
Stochastic Form-oriented model is richer in mod- dynamic analysis techniques. However, these tech-
eling user interactions in web-based applica- niques are tightly coupled with the load execution:
tions [106]. For example, a user login action can the system behavior is monitored and analyzed, while
either be a successful login and a redirect to the new loads are generated. We have categorized such
overview page, or a failure login and a redirect techniques as “dynamic-feedback-based load genera-
back to the login page. Such user behavior is tion and termination techniques” in Section 4.2.3.
difficult to model in a Markov chain [106], [107]. 1) Testing Loads Derived using Data Flow Anal-
Cai et al. [114], [115] propose a toolset that ysis
automatically generates a load for a web appli- Load sensitive regions are code segments, whose
cation using a three-step process: First, the web correctness depends on the amount of input data
site is crawled by a third party web crawler and the duration of testing [18]. Examples of
and the website’s structural data is recovered. load sensitive regions can be code dealing with
Then, their proposed toolset lays out the crawled various types of resource accesses (e.g., memory,
web structure using a Stochastic Form-Oriented thread pools and database accesses). Yang et
Model and prompts the performance engineer al. [18] use data flow analysis of the system’s
to manually specify the probabilities between source code to generate loads, which exercise the
the pages and actions based on an operational load sensitive regions. Their technique detects
profile. memory related faults (e.g., memory allocation,
4) Testing Loads Derived using Probabilistic memory deallocation and pointers referencing).
Timed Automata 2) Testing Loads Derived using Symbolic Execu-
Compared to the Markov Chain and the Stochas- tions
tic Form-oriented Models, Probability Timed Rather than matching the code for specific pat-
Automata is an abstraction which provides sup- terns (e.g., the resource accesses patterns in [18]),
port for user action modeling as well as timing Zhang et al. [29], [111] use symbolic test exe-
delays [108], [109], [110]. Similar to the Markov cution techniques to generate loads, which can
chain model, a Probabilistic Timed Automata cause memory or performance problems. Sym-
contains a set of states and transition probabil- bolic execution is a program analysis technique,
ities between states. In addition, for each tran- which can automatically generates input values
sition, a Probabilistic Timed Automata contains corresponding to different code paths.
the time delays before firing the transition. The Zhang et al. use the symbolic execution to derive
timing delays are useful for modeling realis- two types of loads:
tic user behaviors. For example, a user could a) Testing Loads Causing Large Response
pause for a few seconds (e.g., reading the page Time
contents) before triggering the next action (e.g., Zhang et al. assign a time value for each
purchasing the items). step along the code path (e.g., 10 for an
invoking routing and 1 for other routines).
Therefore, by summing up the costs for
10
each code path, they can identify the paths paths, on the Petri Net. For example, a new
that lead to the longest response time. The video action C cannot be fired until the
values that satisfy the path constraints form previous video action A and audio action
the loads. B are both completed.
b) Testing Loads Causing Large Memory b) Formulate System Behavior into a Linear
Consumptions Program in order to Identify Performance
Rather than tracking the time, Zhang et Problems
al. track the memory usage at each step Linear programming is used to identify
along the code path. The memory footprint the sequences of user actions, which can
information is available through a Symbolic trigger performance problems. Linear pro-
Execution tool, (e.g., the Java Path Finder gramming systematically searches for opti-
(JPF)). Zhang et al. use the JPF’s built-in mal solutions based on certain constraints.
object life cycle listener mechanism to track A linear program contains the following
the heap size of each path. Paths leading to two types of artifacts: an objective func-
large memory consumption are identified tion (the optimal criteria) and a set of con-
and values satisfying such code paths form straints. The objective function is to max-
the loads. imize or minimize a linear equation. The
constraints are a set of linear equations or
3.2.2 Deriving Fault-Inducing Loads by Building and inequalities. The sequence of arrivals of the
Analyzing System Models user action sequences is formulated using
linear constraints. There are two types of
In the previous subsection, we have presented load
constraint functions: One constraint func-
design techniques which analyze the source code
tion ensures the total testing time is within
of a system to explore potential problematic re-
a pre-specified value (the test will not run
gions/paths. However, some of those techniques only
for too long). The rest of the constraint func-
work for a particular type of systems. For example,
tions formulate the temporal requirements
[29], [111] only work for Java-based systems. In addi-
derived using the possible user action se-
tion, in some cases, source code might not be directly
quences, as the resource requirements (e.g.,
accessible. Hence, there are also techniques that have
CPU, memory, network bandwidth) asso-
been proposed to automatically search for potential
ciated with each multimedia object (video
problematic loads using various system models.
or audio) are assumed to be known. The
1) Testing Loads Derived using Petri Nets and objective function is set to evaluate whether
Linear Programs the arrival time sequence would cause the
Online multimedia systems have various tempo- saturations of one or more system resources
ral requirements: (1) Timing requirements: audio (CPU and network).
and video data streams should be delivered in
sequence and following strict timing deadlines; 2) Testing Loads Derived using Genetic Algo-
(2) Synchronization requirements: video and audio rithms
data should be in synch with each other; (3) An SLA, Service Level Agreement, is a contract
Functional requirements: some videos can only be with potential users on the non-functional prop-
displayed after collecting fee. erties like response time and reliability as well
Zhang et al. [38], [54] propose a two-step tech- as other requirements like costs. Penta et al. [113]
nique that automatically generates loads, which and Gu et al. [112] uses Genetic Algorithms to
can cause a system to violate the synchronization derive loads causing SLA or QoS (Quality of
and responsive requirements while satisfying Service) requirement violations (e.g., response
the business requirements. Their idea is based time) in service-oriented systems. Like linear
on the belief that timing and synchronization programming, Genetic Algorithms, is a search
requirements usually fail when the SUT’s re- algorithm, which mimics the process of natural
sources are saturated. For example, if the mem- evolution for locating optimal solutions towards
ory is used up, the SUT would slow down due a specific goal.
to paging. The genetic algorithms are applied twice to de-
rive potential performance sensitive loads:
a) Identify Data Flows using a Petri Net
The online multimedia system is modeled a) Penta et al. [113] use the genetic algorithm
using a Petri Net, which is a technique technique that is proposed by Canfora et
that models the temporal constraints of a al. [116], in order to identify risky work-
system. All possible user action sequences flows within a service. The response time
can be generated by conducting reachabil- for the risky workflows should be as close
ity analysis, which explores all the possible to the SLA (high response time) as possible.
11
b) Penta et al. [113] apply the genetic algo- • Step 1 - Extracting Realistic Individual User
rithm to generate loads that cover the iden- Behavior From Past Data
tified risky workflow and violate the SLA. Most of the e-commerce systems record past us-
age data in the form of web access logs. Each
time a user hits a web page, an entry is recorded
3.3 Load Design Optimization and Reduction
in the web access logs. Each log entry is usually
Techniques
a URL (e.g., the browse page or the login page),
In this subsection, we discuss two classes of load de- combined with some user identification data (e.g.,
sign optimization and reduction techniques aimed at session IDs). Therefore, individual user action
improving various aspects of load design techniques. sequences, which describe the step-by-step user
Both classes of techniques are aimed at improving the actions, can be recovered by grouping the log
realistic load design techniques. entries with user identification data.
• Hybrid Load Optimization Techniques • Step 2 - Deriving Targeted Aggregate Load By
The aggregate-workload based techniques focus Carefully Arranging the User Action Sequence
on generating the desired workload, but fail to Data
mimic realistic user behavior. The user-equivalent The aggregate load is achieved by carefully ar-
based techniques focus on mimicking the individ- ranging and stacking up the user action se-
ual user behaviour, but fail to match the expected quences (e.g., two concurrent requests are gener-
overall workload. The hybrid load optimization ated from two individual user action sequences).
techniques (Section 3.3.1) aim to combine the There are two proposed techniques to calculate
strength of the aggregate-workload and use-case user action sequences:
based load design approaches. For example, for
our example e-commerce system, the resulting 1) Matching the Peak Load By Compressing
load should resemble the targeted transaction Multiple Hours Worth of Load
rates and mimic real user behavior. Burstiness refers to short uneven spikes of
• Optimizing and Reducing the Duration of a requests. One type of burstiness is caused
Load Test by the flash crowd. The phenomenon where a
One major problem with loads derived from real- website suddenly experiences a heavier than
istic load testing is that the test durations in these expected request rate. An example of flash
testing loads are usually not clearly defined (i.e., crowd includes when many users flocked
no clear stopping rules). The same scenarios are to the news sites like CNN.com during the
repeatedly-executed over several hours or days. 9/11 incident, or during the World Cup
Table 3 compares the various load design optimiza- period, the FIFA website was often more
tion and reduction techniques along the following loaded when a goal was scored. During
dimensions: the flash crowd incident, the load could be
several times higher than the expected nor-
• Techniques refer to the used load design opti-
mal load. Incorporating realistic burstiness
mization and reduction techniques (e.g., the hy- into load testing is important to verify the
brid load optimization techniques). capacity of a system [120].
• Target Load Design Techniques refer to the load
Maccabee and Ma [117] squeeze multiple
design techniques that the reduction or optimiza- one-hour user action sequences together into
tion techniques are intended to improve. For one-hour testing load to generate a realistic
example, the hybrid load optimization techniques peak load, which is several times higher than
combine the strength of aggregate-workload and the normal load.
use-case based load design techniques. 2) Matching the Specific Request Rates By
• Optimization and Reducing Aspects refer to
Linear Programs
the aspects of the current load design that the Maccabee and Ma’s [117] technique is simple
optimization and reduction techniques attempt and can generate higher than normal load to
to improve. One example is to reduce the test verify the system capacity and guard against
duration. problems like a flash crowd. However, their
technique has problems like coarse-grained
propose each technique. aggregate load, which cannot reflect the nor-
mal expected field usage. For example, the
3.3.1 Hybrid Load Optimization Techniques individual requests rates (e.g., browsing or
Hybrid load optimization techniques aim to better purchasing rates) might not match with the
model the realistic load for web-based e-commerce targeting request rates. Krishnamurthy et
systems [53], [117]. These techniques consist of the al. [52] use linear programming to systemat-
following three steps: ically arrange user action sequences, which
12
TABLE 3: Test Reduction and Optimization Techniques That Are Used in the Load Design Phase
Techniques Target Load Design Optimizing and Reducing Data Sources References
Techniques Aspects
Hybrid Load Opti- All Realistic Load Design Combining the strength of Past usage data [51], [53], [117]
mization Techniques aggregate-workload and
use-case based load design
techniques
Extrapolation Step-wise Load Design Reducing the number of work- Step-wise testing loads, [15], [16], [118],
load intensity levels past usage data [119]
Deterministic State All Realistic Load Design Reducing repeated execution of Realistic testing loads [21], [22], [23]
Techniques the same scenarios
match with the desired workload. system behavior under all load levels, Menasce
• Step 3 - Specifying the Inter-arrival Time Be- et al. [15], [16] propose to only test a few load
tween User Actions levels and extrapolate the system performance
There is a delay between each user action, when at other load levels. Weyuker et al. [119] propose
the user is either reading the page or think- a metric, called the Performance Nonscalability
ing about what to do next. This delay is called Likelihood (PNL). The PNL metric is derived
the “think time” or the “inter-arrival time” be- from the past usage data and can be used to
tween actions. The think time distribution among predict the workload, which will likely cause
the user action sequences is specified manually performance problems.
in [52], [53]. Casale et al. [51] extend the technique Furthermore, Leganza [118] proposes to extrap-
in [52], [53] to create realistic burstiness. They olate the load testing data from the results con-
use a burstiness level metrics, called the Index of ducted on a lower number of users onto the
Dispersion [120], which can be calculated based on actual production workload (300 users in test-
the inter-arrival time between requests. They use ing versus 1, 500 users in production) to verify
the same constraint functions as [52], [53], but a whether the current SUT and hardware infras-
different non-linear objective function. The goal tructure can handle the desired workload.
of the objective function is to find the optimal 2) Load Test Optimization and Reduction By De-
session mix, whose index of dispersion is as close terministic States
to the real-time value as possible. Rather than repeatedly executing a set of scenar-
The hybrid technique outputs the exact individual ios over and over, like many of the aggregate-
user actions during the course of the actions. The workload based load design techniques (e.g.,
advantage of such output is to avoid some of the steady load and Markov-chain), Avritzer et
unexpected system failures from other techniques like al. [10], [21], [22], [23] propose a load opti-
the Markov chains [53]. However, special load gen- mization technique, called the Deterministic State
erators are required to take such input and generate Testing, which ensures each type of load is only
the testing loads. In addition, the scalability of the executed once.
approach would be limited by the machine’s memory, Avritzer et al. characterize the testing load using
as the load generators need to read in all the input states. Each state measures the number of differ-
data (testing user actions at each time instance) at ent active processing jobs at the moment. Each
once. number in the state represents the number of ac-
tive requests of a particular request. Suppose our
3.3.2 Optimizing and Reducing the Duration of a e-commerce system consists of four scenarios:
Load Test registration, browsing, purchasing and search-
Two techniques have been proposed to systematically ing. The state (1, 0, 0, 1) would indicate that cur-
optimize and reduce the load test duration for the rently only there is one registration request and
realistic-load-based techniques. One technique aims at one search request active and the state (0, 0, 0, 0)
reducing a particular load design technique (step-wise would indicate that the system is idle. The
load testing). The other technique aims at optimizing probability of these states, called “Probability
and reducing the realistic load design techniques by Mass Coverage”, measures the likelihood that
adding determinism. the testing states is going to be covered in the
1) Load Test Reduction By Extrapolation field. These probabilities are calculated based on
Load testing needs to be conducted at various the production data. The higher the probability
load levels (e.g., number of user levels) for step- of one particular state, the more likely it is going
wise load testing. Rather than examining the to happen in the field.
13
Load test optimization can also be achieved by al. [10], [21], [22], [23]. This metric only captures
making use of the probability associated with the workload coverage in terms of aggregate
each state to prioritize tests. If time is limited, workload. We need more metrics to capture other
only a small set of states with a high probability aspects of the system behavior. For example, we
of occurrence in the field can be selected. need new metrics to capture the coverage of
In addition to reducing the test durations, de- different types of users (a.k.a., use-case based
terministic state testing is very good at detect- loads).
ing and reproducing resource allocation failures • Testing Loads Evolution and Maintenance
(e.g., memory leaks and deadlocks). There is no existing work aimed at maintaining
and evolving the resulting loads. Below we pro-
3.4 Summary and Open Problems vide two examples where the evolution of the
There are two schools of thought for load design: (1) load is likely to play an important role:
Designing loads, which mimic realistic usage; and (2) 1) Realistic Loads: As users get more familiar
Designing loads, which are likely to trigger functional with the system, usage patterns are likely to
and non-functional failures. Realistic Load Design change. How much change would merit an
techniques are more general, but the resulting loads update to a realistic-based testing loads?
can take a long time to execute. Results of a load 2) Fault-Inducing Loads: As the system evolve
test are harder to analyze (due to the large volume over time, can we improve the model build-
of data). On the contrary, Fault-Inducing Load Design ing of fault-inducing loads by incrementally
techniques are more narrowly focused on a few objec- analyzing the system internals (e.g., changed
tives (i.e., you will not detect unexpected problems), source code or changed features)?
but the test duration is usually deterministic and
shorter. The test results are usually easier to analyze.
4 R ESEARCH Q UESTION 2: H OW IS A LOAD
However, a few issues are still not explored thor-
oughly: TEST EXECUTED ?
• Optimal Test Duration for the Realistic Load Once a proper load is designed, a load test is executed.
Design The load test execution phase consists of the following
One unanswered question among all the real- three main aspects: (1) Setup, which includes system
istic load design techniques is how to identify deployment and test execution setup; (2) Load Genera-
the optimal test duration, which is the shortest tion and Termination, which consists of generating the
test duration while still covering all the test ob- load according to the configurations and terminating
jectives. This problem is very similar as deter- the load when the load test is completed; and (3) Test
mining the optimal simulation duration for the Monitoring and Data Collection, which includes record-
discrete event simulation experiments. Recently, ing the system behavior (e.g., execution logs and
there have been works [121], [122] proposed to performance metrics) during execution. The recorded
leverage statistical techniques to limit the dura- data is then used in the Test Analysis phase.
tion of the simulation runs by determining the As shown in Table 4, there are three general ap-
number of sample observations required to reach proaches of load test executions:
certain accuracy in the output metrics. Similar 1) Live-User Based Executions
techniques may be used to help determine the A load test examines a SUT’s behavior when
optimal test duration of the realistic load design. the SUT is simultaneously used by many users.
• Benchmarking & Empirical Studies of the Ef- Therefore, one of the most intuitive load test
fectiveness of Various Techniques execution approach is to execute a load test
Among the load design techniques, the effective- by employing a group of human testers [19],
ness of these techniques, in terms of scale and [124], [118]. Individual users (testers) are se-
coverage, is not clear. In large-scale industrial lected based on the testing requirements (e.g.,
systems, which are not web-based systems, can locations and browsers).
we still apply techniques like Stochastic Form- The live-user based execution approach reflects
oriented Models? A benchmark suite (like the the most realistic user behaviors. In addition,
Siemens benchmark suite for functional bug de- this approach can obtain real user feedbacks
tection [123]) is needed to systematical evaluate on aspects like acceptable request performance
the scale and coverage of these techniques. (e.g., whether certain requests are taking too
• Test Coverage Metrics long) and functional correctness (e.g., a movie
Unlike functional testing suites, which have var- or a figure is not displaying properly). However,
ious metrics (e.g., code coverage) to measure the the live-user based execution approach cannot
test coverage. There are few load testing coverage scale well, as the approach is limited by the
metrics other than the “Probability Mass Cov- number of recruited testers and the test du-
erage” metric, which is proposed by Avritzer et ration [118]. Furthermore, the approach cannot
14
explore various timing issues due to complexity In comparison to benchmark suites, the
of manual coordination of many testers. Finally, following two categories of load drivers
the load tests that are executed by the live users (centralized and peer-to-peer load drivers)
cannot be reproduced or repeated exactly as they are more generic (applicable for many sys-
occurred. tems).
2) Driver Based Executions b) Centralized Load Drivers refer to a single
To overcome the scalability issue of the live- load driver, which generates the load [128],
user based approach, the driver based execu- [129].
tion approach is introduced to automatically c) Peer-to-peer Load Drivers refer to a set
generate thousands or millions of concurrent of load drivers, which collectively generate
requests for a long period of time. Compared to the target testing load. Peer-to-peer load
the live-user based executions, where individual drivers usually have a controller compo-
testers are selected and trained, driver based nent, which coordinates the load generation
executions require setup and configuration of among the peer load drivers [135], [136],
the load drivers. Therefore, a new challenge in [137].
driver based execution is the configuration of Centralized load drivers are better at gener-
load drivers to properly produce the load. In ating targeted load, as there is only one sin-
addition, some system behavior (e.g., the movie gle load driver to control the traffic. Peer-
or image display) cannot be easily tracked, as it to-peer load drivers can generate larger
is hard for the load driver to judge the audio or scale load (more scalable), as centralized
video quality. load drivers are limited by processing and
Different from existing driver based sur- storage capabilities of a single machine.
veys [125], [126], [127], which focus on com-
3) Emulation Based Executions
paring the capabilities of various load drivers,
The previous two load test execution (live-user
our survey of driver based execution focuses on
based and driver based execution) approaches
the techniques used by the load drivers. Com-
require a fully functional system. Moreover, they
paring the load driver techniques, as opposed
conduct load testing in the field or in a field-
to capabilities, has the following two advan-
like environment. The techniques that use the
tages in terms of knowledge contributions: (1)
emulation based load test execution approach
Avoid Repetitions: Tools from different vendors
conduct the load testing on special platforms.
can adopt similar techniques. For example, We-
In this survey, we focus on two types of special
bLoad [128] and HP LoadRunner [129] both sup-
platforms:
port the store-and-replay test configuration tech-
nique. (2) Tool Evolution: The evolution of such a) Special Platforms Enabling Early and
load drivers is not tracked in the driver based Continuous Examination of System Be-
surveys. Some tools get decommissioned over havior Under Load
time. For example, tools like Microsoft’s Web In the development of large distributed
App Stress Tool surveyed in [127], no longer software systems (e.g., service-oriented
exist. New features (e.g., supported protocols) systems), many components like
are constantly added into the load testing tools the application-level entities and the
over time. For example, Apache JMeter [130] has infrastructure-level entities are developed
recently added support for model-based testing and validated during different phases of
(e.g., Markov-chain models). the software lifecycle. This development
There are three categories of load drivers: process creates serialized-phasing problem,
as the end-to-end functional and quality-of-
a) Benchmark Suite is a specialized load service (QoS) aspects cannot be evaluated
driver, designed for one type of system. For until late in the software life cycle (e.g.,
example, LoadGen [131] is a load driver at the system integration time) [39], [40],
used to specifically load test the Microsoft [41], [42], [138]. Emulation based execution
Exchange MailServer. Benchmark suites are can emulate parts of the system that
also used to measure and compare the per- are not readily available. Such execution
formance of different versions of software techniques can be used to examine the
and/or hardware setup (called Benchmark- system’s functional and non-functional
ing). Practitioners specify the rate of re- behavior under load throughout the
quests as well as test duration. Such load software development lifecycle, even
drivers are usually customized and can before the system is completely developed.
only be used to load test one type of sys- b) Special Platforms Enabling Deterministic
tem [131], [132], [133], [134]. Execution
15
Reporting and reproducing problems like 4.1.1 System Deployment

deadlocks or high response time are much
The system deployment process is the same for the
easier on these special platforms, as these
live-user based and driver based executions, but dif-
platforms can provide fine-grained controls
ferent from the emulation based executions.
on method and thread inter-leavings. When
problems occur, such platforms can provide • System Installation and Configuration for the
more insights on the exact system state [34]. Live-user based and Driver based Executions
For live-user based and driver based executions,
Live-user based and driver based executions require
it is recommended to perform the load testing on
deploying the SUT and running the test in the field or
the actual field environment, although the testing
field-like environment. Both approaches need to face
time can be limited and there could be high cost
the challenge of setting up realistic test environment
associated [99]. However, in many cases, load
(e.g., with proper network latency mimicking dis-
tests are conducted in a lab environment due
tributed locations). Running the SUT on special plat-
to accessibility and cost concerns, as it is often
forms avoids such complications. However, emulation
difficult and costly to access the actual produc-
based executions usually focus on a few test objectives
tion environment) [44], [99], [118], [139]. This
(e.g., functional problems under load), which are not
lab environment can be built from the dedicated
general purposes like the live-user based and driver
computing hardware purchased in-house [118],
based executions. In addition, like driver based exe-
[139] or by renting the readily available cloud
cutions, emulation based executions use load drivers
infrastructures (Testing-as-a-Service) [140], [141].
to automatically generate the testing load.
The system deployed in the lab could behave dif-
Among the three main aspects of the load test
ferently compared to the field environment, due
execution phase, Table 4 outlines the similarities and
to issues like unexpected resource fluctuations in
differences among the aforementioned three load test
the cloud environment [142]. Hence, extra efforts
execution approaches. For example, there are two
are required to configure the lab environment to
distinct setup activities in the Setup aspect: System
reflect the most relevant field characteristics.
Deployment and Test Execution Setup. Some setup
The SUT and its associated components (e.g.,
activities would contain different aspects for the three
database and mail servers) are deployed in a
test execution approaches (e.g., during the test execu-
field-like setting. One of the important aspects
tion setup activity). Some other activities would be
mentioned in the load testing literature is creating
similar (e.g., the system deployment activity is the
realistic databases, which have a size and struc-
same for live-user and driver based executions).
ture similar to the field setting. It would be ideal
In the next three subsections, we compare and con-
to have a copy of the field database. However,
trast the different techniques applied in the three as-
sometimes no such data is available or the field
pects of the load execution phase: Section 4.1 explains
database cannot be directly used due to security
the setup techniques, Section 4.2 discusses the load
or privacy concerns. There are two proposed
generation and termination techniques. Section 4.3
techniques to create field-like test databases: (1)
describes the test monitoring and data collection tech-
importing raw data, which shares the same char-
niques. Section 4.4 summaries the load test execution
acteristics (e.g., size and structure) as the field
techniques and lists some open problems.
data [31]; (2) sanitizating the field database so
that certain sensitive information (e.g., customer
4.1 Setup information) is removed or anonymized [104],
[143], [144].
As shown in Table 4, there are two setup activities in • System Deployment for the Emulation based
the Setup aspect: Executions
• System Deployment refers to deploying the SUT For the emulation based executions, the SUT
in the proper test environment and making the needs to be deployed on the special platforms,
SUT operational. Examples can include installing in which the load test is to be executed. The de-
the SUT and configuring the associated third ployment techniques for the two types of special
party components (e.g., the mail server and the platforms mentioned above are different:
database server). – Automated Code Generation for the Incom-
• Test Execution Setup refers to setting up and con- plete System Components
figuring the load testing tools (for driver based The automated code generation for the in-
and emulation based executions), or recruiting complete system components is achieved
and training testers (for live-user based execu- using model-driven engineering platforms.
tions) and configuring the test environment to Rather than implementing the actual sys-
reflect the field environment (e.g., increasing net- tem components via programming, develop-
work latency for long distance communication). ers can work at a higher level of abstraction
16
TABLE 4: Load Execution Techniques
Live-user based Driver based Emulation based

Execution Execution Execution
Load Test Execution
Approaches
Aspect 1. Setup
System installation and
System installation and
configuration in the
configurations in the System deployment on
System Deployment
field/field-like/lab
field/field-like/lab the special platforms
Setup Activities environment environment
Load Driver
Tester Recruitment and
Installation and Load Driver
Training, Test
Test Execution Setup Configurations, Test Installation and
Environment
Environment Configurations
Configurations
Configurations
Aspect 2. Load Generation and Terminations
Options for Load Static Configurations X X X
Generation and Dynamic x X x
Termination Deterministic x x X
Aspect 3. Test Monitoring and Analysis
Functional Problems X X X
Types of System Execution Logs X X X
Behavior Data Performance Metrics X X x
System Snapshots x X X
in a model-driven engineering setup (e.g., 4.1.2 Test Execution Setup

using domain-specific modeling languages or
The test execution setup activity includes two parts:
visual representations). Concrete code arti-
(1) setting up and configuring the test components:
facts and system configurations are generated
testers (for live-user based executions) or load drivers
based on the model interpreter [39], [40], [41],
(for driver based and emulation based executions);
[42], [138] or the code factory [145]. The over-
and (2) configuring the test environment.
all system is implemented using a model-
based engineering framework in Domain- Setting Up and Configuring the Test Components
specific modeling languages. For the compo-
nents, which are not available yet, the frame- Depending on the execution approaches, the test
work interpreter will automatically gener- components for setup and configuration are different:
ate mock objects (method stubs) based on • Tester Recruitment, Setup and Training (Live-
the model specifications. These mock objects, user based executions)
which conform to the interface of the ac- For live-user based executions, the three main
tual components, emulate the actual compo- steps involved in the test execution setup and
nent functionality. In order to support a new configuration aspects [19], [118] are:
environment (e.g., middleware or operating
1) Tester Recruitment
system), the model interpreter needs to be
Testers are hired to perform load tests. There
adapted for various middleware or operating
are specific criteria to select live users de-
systems, but no change to the upper level
pending on the testing objectives and type of
model specifications is required.
system. For example, for web-applications,
– Special Profiling and Scheduling Platform
individual users are picked based on factors
In order to provide more detailed informa-
like geographical locations, languages, oper-
tion on the SUT’s state when a problem oc-
ating systems and browsers;
curs (e.g., deadlocks or racing), special plat-
2) Tester Setup
forms (e.g., the CHESS platform [34]), which
Necessary procedures are carried out to en-
control the inter-leaving of threads are used.
able testers to access the SUTs (e.g., network
The SUT needs to be run under a develop-
permission, account permission, monitoring
ment IDE (Microsoft Visual Studio) with a
and data recording software installation);
specific scheduling in CHESS. In this way,
3) Tester Training
the CHESS scheduler, rather than the oper-
The selected testers are trained to be familiar
ating system, can control the inter-leaving of
with the SUT and their testing scenarios.
threads.
• Load Driver Deployment (Driver based and
emulation based executions)
17
Deploying the load drivers involves the installa- nication applications [150], [151]. This
tion and configuration of load drivers: configuration technique consists of the
following three steps:
1) Installation of Load Drivers
The load drivers are usually installed on dif- i) The Storing Phase:
ferent machines from the SUT to avoid con- During the storing phase, load test
founding of measurements and resource us- practitioners perform a sequence of
age. The machines which have load drivers actions for each scenario. For ex-
installed should have enough computing re- ample, in a web-based system, a
sources such that they do not saturate dur- user would first login to the sys-
ing a load test. tem, browse a few catalogs then lo-
The installation of load drivers is usually gout. A probe, which is included
straight-forward [128], [129], except for peer- in the load drivers, is used to cap-
to-peer load drivers. Dumitrescu et al. [135] ture all incoming and outgoing data.
implement a framework to automatically For example, all HTTP requests can
push the peer load drivers to different ma- be captured by either implement-
chines for load testing Grid systems. The ing a probe at the client browser
framework picks one machine in the Grid to side (e.g., browser proxy in We-
act as a controller. The controller pushes the bLoad [128], [129]) or at the network
peer load driver to other machines, which packet level using a packet analyzer
are responsible for requesting web services like Wireshark [152]. The recorded
under test. scenarios are encoded in load-driver
2) Configuration of Load Drivers specific programming languages (e.g.,
The configuration of load drivers is the pro- C++ [129] and Javascript [128]).
cess of encoding the load as inputs, which Rich Internet Applications (RIA) dy-
the load drivers can understand. There are namically update parts of the web
currently four general load driver configu- page based on the user actions. There-
ration techniques: fore, the user action sequences cannot
be easily used in record-and-replay
a) Simple GUI Configuration via URL editing. Instead, The store-
Some load drivers (especially the bench- and-replay is achieved via using GUI
mark suites like [131]) provide a simple automation tools like Selenium [153]
graphical user interface for load test prac- to record user actions instead.
titioners to specify the rate of the requests ii) The Editing Phase:
as well as test durations. The recorded data needs to be edited
b) Programable Configuration and customized by load test practi-
Many of the general purpose load tioners in order to be properly exe-
drivers let load test practitioners en- cuted by the load driver. The stored
code the testing load using program- data is usually edited to remove
ming languages. The choice of program- runtime-specific values (e.g., session
ming languages varies between load IDs and user IDs).
drivers. For example, the language could iii) The Replaying Phase: Once load test
be generic programming languages like practitioners finish editing, they need
C++ [129], Javascript [128], Java [146] to identify the replay rates of these
or XML [147]; or domain specific lan- scenarios, the delay between individ-
guages, which enable easy specifications ual requests and the test duration.
of test environment components like the
setup/configration of database, network d) Model Configuration
and storage [148] and or for specialized Section 3.1.2 explains realistic load design
systems (e.g., TTCN-3 for telecommuni- techniques via usage models. There are
cation systems [28]). two approaches to translate the usage
c) Store-and-replay Configuration models into load driver inputs: on one
Rather than directly encoding the load hand, many load drivers can directly take
via coding, many load drivers sup- usage models as their inputs. On the
port store-and-replay to reduce the pro- other hand, a number of researchers have
gramming efforts. Store-and-replay load proposed to automatically generate load
driver configuration techniques are used driver configuration code based on usage
in web-based applications [102], [128], models.
[129], [149] and distributed telecommu- i) Readily Supported Models:
18
Test cases formulated in Markov chain practice of generating different IP addresses for
can be directly used in load test execu- workload requests coming from different sim-
tion tools like LoadRunner [129] and ulated users. IP Spoofing is needed to prop-
Apache JMeter [130] (through plugin) erly load test some web-based systems using
or research tools like [154]. the driver based executions, as these systems
ii) Automated Generation of Load usually deny large volume of requests from the
Driver Configuration Code same IP addresses to protect against the DoS
Many techniques have been proposed attacks. IP spoofing is usually configured in
to automatically generate load driver supported load drivers (e.g., [129]).
configuration code from usage
models. LoadRunner scripts can be 4.2 Load Generation and Termination
automatically generated from UML
diagrams (e.g., activity diagrams This subsection covers three categories of load gener-
and sequence diagrams) [100], [101]. ation and termination techniques: manual load gener-
The Stochastic Form Charts can be ation and termination techniques (Section 4.2.1), load
automatically encoded into JMeter generation and termination based on static configura-
scripts [114], [115]. tions (Section 4.2.2), and load generation and termi-
nation techniques based on dynamic system feedback
Configuring the Test Environment (Section 4.2.3).
As mentioned above, live-user based and driver
based executions usually take place in a lab envi- 4.2.1 Manual Load Generation and (Timer-based)
ronment. Extra care is needed to configure the test Termination Techniques
environment to be as realistic as possible. Each user repeatedly conducts a sequence of actions
First, it is important to understand the implica- over a fixed period of time. Sometimes, actions among
tion of the hardware platforms. Netto et al. [155] different live users need to be coordinated in order to
and White et al. [156] evaluate the stability of the reach the desired load.
generated load under virtualized environments (e.g.,
virtual machines). They find that the system through- 4.2.2 Static-Configuration-Based Load Generation
put sometimes might not produce stable load on and Termination Techniques
virtual machines. Second, additional operating system Each load driver has a controller component to gen-
configurations might need to be tuned. For exam- erate the specified load based on the configura-
ple, Kim [157] reports that extra settings need to be tions [104], [128], [129]. If the load drivers are installed
specified in Windows platforms in order to generate on multiple machines, the controller needs to send
hundreds or millions of concurrent connections. Last, messages among distributed components to coordi-
it is crucial to make network behavior as realistic as nate among the load drivers to generate the desired
possible. The realism of the network is covered in two load [135].
aspects: Each specific request is either generated based on
1) Network Latency a random number during runtime (e.g., 10% of the
Many load-driver based test execution tech- time user A is browsing) [104], [129] or based on a
niques are conducted within a local area net- specific pre-defined schedule (e.g., during the first five
work, where packets are delivered swiftly and minutes, user B is browsing) [52], [53].
reliably. The case of no/little packet latency is There are four types of load termination tech-
usually not applicable in the field, as packets niques based on pre-defined static configurations.
may be delayed, dropped or corrupted. IP Net- The first three techniques (continuous, timer-driven
work Emulator Tools like Shunra [158], [159] are and counter-driven) exist in many existing load
used in load testing to create a realistic load drivers [151]. The fourth technique (statistic-driven)
testing network environment [27]. was recently introduced [92], [159] to ensure the va-
2) Network Spoofing lidity or accuracy of the data collected.
Routers sometimes attempt to optimize the over- 1) Continuous: A load test runs continuously until
all network throughput by caching the source the load test practitioners manually stop it;
and destination. If the requests come from the 2) Timer-Driven: A load test runs for a pre-
same IP address, the network latency measure specified test duration then stops;
won’t be as realistic. In addition, some systems 3) Counter-Driven: A load test runs continuously
perform traffic controls based on requests from until a pre-specified number of requests have
different network addresses (IP addresses) for been processed or sent; and
purposes like guarding against Denial of Service 4) Statistic-Driven: A load test is terminated once
(DoS) attacks or providing different Quality of the performance metrics of interest (e.g., re-
Services. IP Spoofing in a load test refers to the sponse time, CPU and memory) are statistically
19
stable. This means the metrics of interest yield with respect to a set of randomly selected inputs.
high confidence interval to estimate such value Then they apply machine learning techniques
or have small standard deviations among the to derive performance rules, which describe the
collected data points [92], [159]. characteristics of user inputs causing bad per-
formance (e.g., long response time). The load
4.2.3 Dynamic-Feedback-Based Load Generation testing is conducted adaptively, so that only
and Termination Techniques new inputs are passed into the SUT. During the
Rather than generating and terminating a load test adaptive load testing process, execution traces
based on static configurations, techniques have been (method entry/exit), software performance met-
proposed to dynamically steer the load based on the rics (e.g., response time) are recorded and the
system feedback [11], [17], [24], [25], [160]. performance rules are re-learned. The adaptive
Depending on the load testing objectives, the defi- load testing is stopped when there are no new
nition of important inputs can vary. For example, one performance rules discovered.
goal is to detect memory leaks [17]. Thus, input pa- Once these important inputs are identified, the
rameters that significantly impact the system memory load driver automatically generates the target load
usage, are considered as important parameters. Other to detect memory leaks [17], to verify system per-
goals can be to find/verify the maximum number of formance requirements [11], or to identify software
users that the SUT can support before the response bottlenecks [24], [25], [160].
time degrades [17] or to locate software performance
bottleneck [24], [25], [160]. Thus, important inputs are 4.2.4 Deterministic Load Generation and Termination
the ones that significantly impact the testing objectives Techniques
(e.g., performance objectives like the response time or
Even though all of these load test execution tech-
throughput). There are three proposed techniques to
niques manage to inject many concurrent requests
locate the important inputs.
into the SUT, none of those techniques can guarantee
1) System Identification Technique to explore all the possible inter-leavings of threads
Bayan and Cangussu calculate the important and timing of asynchronous events. Such system state
inputs using the System Identification Tech- information is important, as some thread inter-leaving
nique [161], [162]. The general idea is as fol- events could lead to hard to catch and reproduce
lows: the metric mentioned in the objectives is problems like deadlocks or racing conditions.
considered as the output variable (e.g., memory As we mentioned in the beginning of this sec-
usage or response time). Different combinations tion, the CHESS platform [34] can be used to deter-
of input parameters lead to different values in ministically execute a test based on all the possible
the output variable. A series of random testing event inter-leavings. The deterministic inter-leaving
runs, which measure the system performance execution is achieved by the scheduling component,
using randomly generated inputs, would create as the actual scheduling during the test execution is
a set of linear equations with the output vari- controlled by the tool scheduler rather than the OS
able on one side and various combinations of scheduler. The CHESS scheduler understands the se-
input variables on the other side. Thus, locating mantics of all non-deterministic APIs and provides an
the resource impacting inputs is equivalent to alternative implementation of these APIs. By picking
solving these linear equations and identifying different threads to block at different execution points,
the inputs, which are large (i.e., sensitive to the the CHESS scheduler is able to deterministically ex-
resources of interest). plore all the possible inter-leavings of task executions.
2) Analytical Queuing Modeling The test stops when the scheduler explores all the
Compared with the System Identification Tech- task inter-leavings. During this process, the CHESS
nique, which calculates the important inputs platform automatically reports when there is a dead-
before load test execution starts, Branal et al. lock or race conditions, along with the exact execution
dynamically model the SUT using a two-layer context (e.g., thread interleaving and events).
queuing model and use analytical techniques
to find the workload mixes that change the
bottlenecks in the SUT. Branal et al. iteratively 4.3 Test Monitoring and Data Collection
tune the analytical queuing model based on The system behavior under load is monitored and
the system performance metrics (e.g., CPU, disk recorded during the course of the load test execution.
and memory). Through iteratively driving load, There is a tradeoff between the level of monitoring
their model gradually narrows down the bottle- details and monitoring overhead. Detailed monitoring
neck/important inputs. has a huge performance overhead, which may slow
3) Machine Learning Technique down the system execution and may even alter the
Similar to [161], [162], Grechanik et al. first apply system behavior [163]. Therefore, probing techniques
random testing to monitor the system behavior for load testing are usually light weight and are
20
intended to impose minimal overhead to the overall the logs could slow down the application under
system. test [169]. There are three types of instrumentation
In general, there are four categories of collected data mechanisms: (1) ad-hoc debug statements, like printf
in the research literature: Metrics, Execution Logs, or System.out, (2) general instrumentation frame-
Functional Failures, and System Snapshots. works, like Log4j [170], and (3) through specialized
instrumentation frameworks like ARM (Application
4.3.1 Monitoring and Collecting Metrics Response Measurement) [171]:
Metrics are tracked by recruited testers in the live- 1) Ad-hoc Logging: The ad-hoc logging mecha-
user based executions [19], [164], by load drivers in nism is the most commonly used, as develop-
the driver based and emulation based executions [13], ers insert output statements (e.g., printf or Sys-
[15], [49], [50], [129], [128], [165], or by light weight tem.out) into the source code for debugging pur-
system monitoring tools like PerfMon [166], pid- poses [39]. However, extra care is required to (1)
stats [164], Munin [167], and SNMP MIBs [168]. minimize the amount of information generated,
In general, there are two types of metrics that and to (2) to make sure that the statements are
are monitored and collected during the course of not garbled as multiple logging threads attempt
the load test execution phase: Throughput Metrics to write to the same file concurrently.
(“Number of Pass/Fail Requests”) and Performance 2) General Instrumentation Framework: General
Metrics (“End-to-End Response Time” and “Resource instrumentation frameworks, like Log4j [170],
Usage Metrics”). address the two limitations in the ad-hoc mecha-
1) Number of Passed and Failed Requests nism. The instrumentation framework provides
Once the load is terminated, the number of a platform to support thread-safe logging and
passed and failed requests are collected from live fine-grained control of information. Thread-safe
users. This metric can either be recorded peri- logging makes sure that each logging thread
odically (the number of pass and fail requests serially accesses the single log file for multi-
at this interval) or recorded once at the end of threaded systems. Fine-grained logging control en-
the load test (the total number of pass and failed ables developers to specify logging at various
requests). levels. For example, there can be many levels of
2) End-to-End Response Time logging suited for various purposes, like infor-
The end-to-end response time (or just response mation level logs for monitoring and legal com-
time) is the time that it takes to complete one pliances [172], and debug level logs for debug-
individual request. ging purposes. During load tests and actual field
3) Resource Usage Metrics deployments, only higher level logging (e.g., at
System resource usage metrics like CPU, mem- the information level) is generated to minimize
ory, disk and network usage, are collected for the overhead.
system under load. These resource usage metrics 3) Specialized Instrumentation Framework: Spe-
are usually collected and recorded at a fixed time cialized instrumentation frameworks like ARM
interval. Similar to the end-to-end metrics, de- (Application Response Measurement) [171] can
pending on the specifications, the recorded data facilitate the process of gathering performance
can either be aggregated values or a sampled information from running programs.
value at that particular time instance. System
resource usage metrics can either be collected 4.3.3 Monitoring and Collecting Functional Failures
through system monitoring tools like PerfMon Live-user based and emulation based executions
in Windows or pidstats in Unix/Linux. Such record functional problems, whenever the failure oc-
resource usage metrics are usually collected both curs. For each request that a live user executes, he/she
for the SUT and its associated components (e.g., records whether the request has completed success-
databases and mail servers). fully. If not, he/she will note the problem areas (e.g.,
Emulation-based test executions typically do not flash content is not displayed properly [19]).
track these metrics, as the systems are deployed on
specialized platforms which are not reflective of the 4.3.4 Monitoring System Behavior and Collecting
actual field behavior. System Snapshots
Rather than capturing information throughout the
4.3.2 Instrumenting and Collecting Execution Logs course of the load test, Bertolino et al. [165] propose
Execution logs are generated by the instrumentation a technique that captures a snapshot of the entire
of code that developers insert into the source code. test environment as well as the system state when
Execution logs record the runtime behavior of the a problem arises. Whenever the SUT’s overall QoS is
system under test. However, excessive instrumenta- below some threshold, all network requests as well as
tion is not recommended, as contention for outputting snapshot of the system state are saved. This snapshot
21
can be replayed later for debugging purposes. For data must be analyzed to decide whether the SUT has
the deterministic emulation based execution (e.g., in met the test objectives. Different types of data and
the case of the CHESS platform), the detailed system analysis techniques are needed to validate different
state is recorded when the deadlock or race conditions test objectives.
occur. As discussed in Section 4.3, there are four categories
of system behavior data: metrics, execution logs, func-
4.4 Summary and Open Problems tional failures and system snapshots. All of the re-
There are three general load test execution ap- search literature focuses on the analysis and reporting
proaches: (1) the live-user based executions, where techniques that are used for working with metrics
recruited testers manually generate the testing load; and execution logs. (It is relatively straight-forward to
(2) the driver based executions, where the testing handle the functional failure data by reporting them
load is automatically generated; and (3) the emula- to the development team, and there is no further
tion based executions, where the SUT is executed on discussion on how to analyze system snapshots [165]).
There are three categories of load testing analysis
top of special platforms. Live-user based executions
approaches:
provide the most realistic feedback on the system
behavior, but suffer from scalability issues. Driver 1) Verifying Against Threshold Values
based executions can scale to large testing load and Some system requirements under load (espe-
test durations, but require substantial efforts to deploy cially non-functional requirements) are defined
and configure the load drives for the targeted testing using threshold values. One example is the sys-
load. Emulation based executions provide special ca- tem resource requirements. The CPU and mem-
pacities over the other two execution approaches: (1) ory usage cannot be too high during the course
early examination of system behavior before system is of a load test, otherwise the request processing
fully implemented, (2) easy detection and reporting of can hang and system performance can be un-
load problems. However, emulation based execution stable [173]. Another example is the reliability
techniques can only focus on a small subset of the requirement for safety critical and telecommuni-
load testing objectives. cation systems [10], [85]. The reliability require-
Here, we list two open problems, which are still not ments are usually specified as “three-nines” or
explored thoroughly: “five-nines”, which means the system reliability
cannot be lower than 99.9% (for “three-nines”)
• Encoding Testing Loads into Testing Tools
and 99.999% (for “five-nines”). The most intu-
It is not straight-forward to translate the designed
itive load test analysis technique is to summarize
load into inputs used by load drivers. For ex-
the system behavior into one number and ver-
ample, the load resulted from hybrid load opti-
ify this number against a threshold. The usual
mization techniques [53] is in the form of traces.
output for such analysis is simply pass/fail.
Therefore, load drivers need to be modified to
2) Detecting Known Types of Problems
take these traces as inputs and replay the exact
Another general category of load test analysis
order of these sequences. However, if the size
is examining the system behavior to locate pat-
of traces becomes large, the load driver might
terns of known problems; as some problems are
not be able to handle traces. Similarity, testing
buried in the data and cannot be found based on
load derived from deterministic state testing [10],
threshold values, but can be spotted by known
[85] is not easily realized in existing load drivers,
patterns. One example of such analysis approach
either.
is to check the memory growth trend over time
• System Monitoring Details and Load Testing
for memory leaks. The usual output for such
Analysis
analysis is a list of detected problems.
On one hand, it is important to minimize the sys-
3) Detecting Anomalous Behavior
tem monitoring overhead during the execution
Unfortunately, not all problems can be specified
of a load test. On the other hand, the recorded
using patterns and certainly not all problems
data might not be sufficient (or straight-forward)
have been detected previously. In addition, the
for load testing analysis. For example, recorded
volume of recorded system behavior is too large
data (e.g., metrics and logs) can be too large to
for manual examination. Therefore, automated
be examined manually for problems. Additional
techniques have been proposed to systemati-
work is needed to find proper system monitoring
cally analyze the system behavior to uncover
data suited for load testing.
anomalous behavior. These techniques automat-
ically derive “normal/expected behavior” and
5 R ESEARCH Q UESTION 3: H OW IS THE flag “anomalous behavior” from the data. How-
RESULT OF A LOAD TEST ANALYZED ? ever, the accuracy of such techniques might not
During the load test execution phase, the system be as high as the above two approaches, as
behavior (e.g., logs and metrics) is recorded. Such the “anomalous behavior” are merely hints of
22
potential problems under load. The output for 5.1.2 Comparing Against Processed Data
such analysis is usually the anomalous behavior If the system resources, like CPU and memory utiliza-
and some reasoning/diagnosis on the potential tion are too high, the system performance may not be
problematic behavior. stable [173] and user experience could degrade (e.g.,
All three aforementioned techniques can analyze slow response time) [27], [43], [93], [95], [191].
different categories of data to verify a range of There can be many formats of system behavior. One
objectives (detecting functional problems and non- example is resource usage data, which is sampled
functional problems). These load test analysis tech- at a fixed interval. Another example is the end-to-
niques can be used individually or together based end response time, which is recorded as response
on the types of data available and the available time. time for each individual request. These types of data
For example, if time permits, load testing practition- need to be processed before comparing against thresh-
ers can verify against known requirements based on old values. On one hand, as Bondi points out [45],
thresholds, locate problems based on specific patterns system resources may fluctuate during the startup
and run the automated anomaly detection techniques time for warmup and cooldown periods. Hence, it
just to check if there are any more problems. We is important to only focus on the system behavior
categorize the various load test analysis techniques once the system reaches a stabilized state. On the
into the following six dimensions as shown in Table 5. other hand, a proper data summarization technique is
• Approaches refer to one of the above three load needed to describe these many data instances into one
test analysis approaches. number. There are three types of data summarization
• Techniques refer to the load test analysis tech- techniques proposed in the literature. We use response
nique like memory leak detection. time analysis as an example to describe the proposed
• Data refers to the types of system behavior data data summarization techniques:
that the test analysis technique can analyze. Ex- 1) Maximum Values
amples are execution logs and performance met- For online distributed multi-media systems, if
rics like response time. any video and audio packets are out of sync
• Test Objectives refer to the goal or goals of load
or not delivered in time, it is considered a fail-
test objectives (e.g., detecting performance prob- ure [38]. Therefore, the inability of the end-to-
lems), which the test analysis technique achieves. end response time to meet a specific threshold
• Reported Results refer to the types of reported
(e.g., video buffering period) is considered as a
outcomes, which can simply be pass/fail or de- failure.
tailed problem diagnoses. 2) Average or median Values
The average or median response time summa-
propose each technique. rizes the majority of the response times during
This section is organized as follows: The next three the load test and is used to evaluate the overall
subsections describe the three categories of load test- system performance under load [15], [39], [40],
ing analysis techniques respectively: Section 5.1 ex- [41], [42], [135], [174].
plains the techniques of verifying load test results 3) 90-percentile Values
against threshold values, Section 5.2 describes the Some researchers advocate that the 90-percentile
techniques of detecting known types of problems, response time is a better measurement than
and Section 5.3 explains the techniques of automated the average/median response time [173], [177],
anomaly detection and diagnosis. Section 5.4 summa- [178], as 90-percentile response time accounts for
rizes the load test analysis techniques and highlights most of the peaks, while eliminating the outliers.
some open problems.
5.1.3 Comparing Against Derived Data
5.1 Verifying Against Threshold Values
In some cases, either the data (e.g., the reliability)
The threshold-based test analysis approach can be
to compare or the threshold value is not directly
further broken down into three techniques based on
available. Extra steps need to be taken to derive this
the availability of the data and threshold values.
data before analysis.
5.1.1 Straight-forward Comparison • Deriving Thresholds
When the data is available and the threshold re- Some other threshold values for non-functional
quirement is clearly defined, load testing practitioners requirements are informally defined. One exam-
can perform a straight-forward comparison between ple is the “no-worse-than-before” principle when
the data and the threshold values. One example is verifying the overall system performance. The
throughput analysis. Throughput, which is the rate of “no-worse-than-before” principle states that the
successful requests completed, can be used to com- average response time (system performance re-
pare against the load to validate whether the SUT’s quirements) for the current version should be at
functionality scales under load [174], [175], [176]. least as good as prior versions [26].
23
TABLE 5: Load Test Analysis Techniques
Approaches Techniques Data Test Objectives Reported Results References
Straight-forward Com- Performance metrics Detecting violations in [174], [175], [176]

Verifying parison performance and scala-
Against bility requirements
Threshold Pass/Fail
Comparing Against Periodic sampling Detecting violations [15], [38], [39],
Values Processed Data (Max, metrics in performance [40], [41], [42],
median or 90-percentile requirements [135], [174], [173],
values) [177], [178]
Comparing Against Number of pass/fail Detecting violations in [15], [16], [26],
Derived (Threshold requests, past perfor- performance and relia- [179]
and/or target) Data mance metrics bility requirements
Detecting Memory Memory usage met- Detecting load-related Pass/Fail [9], [17], [45]
Detecting Leaks rics functional problems
Known Locating Error Execution logs Detecting functional Error log lines and [180]
Types of Keywords problems error types
Problems Detecting Deadlocks CPU Detecting load-related Pass/Fail [9]
functional problems
and violations in
scalability requirements
Detecting Unhealthy CPU, Response Time Detecting load-related Pass/Fail [9]
System States and Workload functional problems
and violations
in performance
requirements
Detecting Throughput Throughput, Detecting load-related Pass/Fail [159]
Problems response time metrics functional problems
and violations in
scalability requirements
Detecting Anomalous Performance metrics Detecting performance Anomalous perfor- [86], [181], [182],
Detecting Behavior using problems mance metrics [183], [184], [185],
Anomalous Performance Metrics [186], [187], [188],
Behavior [189]
Detecting Anomalous Execution logs Detecting functional Log sequences [87], [88]
Behavior using and performance with anomalous
Execution Logs problems functional or
performance
behaviors
Detecting Anomalous Execution logs and Detecting memory- Potential problem- [190]
Behavior using performance metrics related problems atic log lines caus-
Execution Logs and ing memory-related
Performance Metrics problems
• Deriving Target Data tain conditions. Mission critical systems usu-

There are two methods for deriving the target ally have very strict reliability requirements.
data to be analyzed: Avritzer et al. [23], [85] use the Bayesian
Network to estimate the system reliability
– Through Extrapolation: As mentioned in
from the load test data. Avritzer et al. use
Section 3.3.2, due to time or cost limitations,
the failure probability of each type of load
sometimes it is not possible to run the tar-
(workload mix and workload intensity) and
geted load, but we might run tests with lower
the likelihood of these types of load occurring
load levels (same workload mix but different
in the field. Load test practitioners can then
intensity). Based on the performance of these
use such reliability estimates to track the
lower workload intensity level tests, load test
quality of the SUT across various builds and
practitioners can extrapolate the performance
decide whether the SUT is ready for release.
metrics at the targeted load [179], [15], [16]. If
certain resource metrics are higher than the
hardware limits (e.g., requires more memory
5.2 Detecting Known Types of Problems
than provided or CPU utilization is greater
than 100%) based on the extrapolation, scal- There are five known load related problems, which
ability problems are noted. can be analyzed using patterns: detection of memory
– Through Bayesian Network: Software relia- leaks (Section 5.2.1), locating error keywords (Sec-
bility is defined as the probability of failure- tion 5.2.2), detecting deadlocks (Section 5.2.3), de-
free operation for a period of time, under cer- tecting unhealthy system states (Section 5.2.4), and
24
detecting throughput problems using queuing theory 5.2.5 Detecting Throughput Problems
(Section 5.2.5). Mansharamani et al. [159] use Little’s Law from Queu-
ing Theory to validate the load test results:
5.2.1 Detecting Memory Leaks
N umber of users
Memory leaks can cause long running systems to T hroughput = Response T ime + Average T hink T ime
crash. Memory leak problems can be detected if there If there is a big difference between the calculated
is an upward trend of the memory footprint through- and measured throughput, there could be failure in
out the course of load testing [9], [17], [45]. the transactions or load variations (e.g., during warm
up or cool down) or load generation errors (e.g.,
5.2.2 Locating Error Keywords load generation machines cannot keep up with the
Execution logs, generated by code instrumentations, specified loads).
provide textual descriptions of the system behavior
during runtime. Compared to system resource usage
5.3 Detecting Anomalous Behavior
data, which are structural and easy to analyze, exe-
cution logs are more difficult to analyze, but provide Depending on the types of data available, the au-
more in-depth knowledge. tomated anomaly detection approach can be further
One of the challenges of analyzing execution logs is broken down into two groups of techniques: (1) tech-
the size of the data. At the end of a load test, the size niques based on metrics; and (2) techniques based on
of execution logs can be several hundred megabytes execution logs.
or gigabytes. Therefore, automatic log analysis tech-
niques are needed to scan through logs to detect 5.3.1 Anomaly Detection Using Performance Metrics
problems. The proposed techniques based on metric-based
Load testing practitioners can search for specific anomaly detection are focused on analyzing the re-
keywords like “errors”, “failures”, “crash” or “restart” source usage data. There are six techniques proposed
in the execution logs [88]. Once these log lines are to derive the “expected/normal” behavior and flag
found, load test practitioners need to analyze the “anomalous” behavior based on resource usage data:
context of the matched log lines to determine whether 1) Deriving and Comparing Clusters
they indicate problems or not. One of the challenges As noted by Georges et al. [163], [192], it is
of performing a simple keyword search is that the important to execute the same tests multiple
data is not categorized. There can be hundreds of times to gain a better view of the system per-
“error” log lines belonging to several different types formance due to issues like system warmup and
of errors. Jiang et al. [180] extend this approach to memory layouts. Bulej et al. [56] propose the use
further categorize these log lines into various types of statistical techniques to detect performance re-
of errors or failures. They accomplish this by first gressions (performance degradations in the con-
abstracting each execution log line into an execution text of regression testing). Bulej et al. repeatedly
event where the runtime data is parameterized. Then, execute the same tests multiple times. Then, they
they group these execution events by their associated group the response time for each request into
keywords like “failures” or “errors”. A log summary clusters and compare the response time distri-
report is then produced with a clear breakdown of the butions cluster-by-cluster. They have used vari-
types of “failures” and “errors”, their frequencies and ous statistical tests (Student-t test, Kolmogorov-
examples of their occurrence in the logs. Smirnov Test, Wilcoxon test, Kruskal-Wallis test)
to compare the response time distributions be-
5.2.3 Detecting Deadlocks tween the current release and prior releases.
Deadlocks can cause CPU resource to deviate from The results of their case studies show that these
normal levels [9]. A typical pattern would be CPU re- statistical tests yield similar results.
source repeatedly drops below normal levels (indicat- 2) Deriving Clusters and Finding Outliers
ing deadlock) and returns to normal levels (indicating Rather than comparing the resulting clusters as
lock releases). in [56], Syer et al. [86], [189] use a hierarchical
clustering technique to identify outliers, which
5.2.4 Detecting Unhealthy System States represent threads with deviating behavior in a
Avritzer et al. [9] observe that under normal condi- thread pool. A thread pool, which is a popular
tions the CPU resource has a linear relation with the design pattern for large-scale software systems,
workload and that the response time should be stable contains a collection of threads available to per-
over time. However, when such observations (a.k.a., form the same type of computational tasks. Each
the linear relationship among the CPU resource, the thread in the thread pool performs similar tasks
workload and the stable response time) no longer and should exhibit similar behavior with respect
hold, then the SUT might have performance problems to resource usage metric, such as CPU and mem-
(e.g., software bottlenecks or concurrency problems). ory usage. Threads with performance deviations
25
likely indicate problems, such as deadlock or CPU). Metrics from the current test are matched
memory leaks. against these rules. Metrics (e.g., low database
3) Deriving Performance Ranges CPU), which violates these rules, are flagged as
A control chart consists of three parts: a control “anomalous behavior”.
line (center line), a lower control limit (LCL) and 5) Deriving Performance Signatures
an upper control limit. If a point lies outside Rather than deriving performance rules [181],
the controlled regions (between the upper and Malik et al. [186], [184], [185], [187] select the
lower limits), the point is counted as a violation. most important metrics among hundreds or
Control charts are used widely in the manufac- thousands of metrics and group these metrics
turing process to detect anomalies. Nguyen et into relevant groups, called “Performance Sig-
al. [188] use control charts to flag anomalous natures”. Malik et al. propose two main types
resource usage metrics. There are serval assump- of performance signature generation techniques:
tions associated with applying control chart: (1) an unsupervised learning approach and a super-
the collected data is normally distributed; (2) vised learning approach.
the testing loads should be constant or linearly If the past performance tests are not clearly la-
correlated with the system performance metrics; beled with pass/fail information, Malik et al. use
and (3) the performance metrics should be inde- an unsupervised learning technique, called Prin-
pendent of each other. cipal Component Analysis (PCA) [185], [186],
For each recorded resource usage metrics, [184], [187]. First, Malik et al. normalize all met-
Nguyen et al. derive the “expected behavior” rics into values between 0 and 1. Then PCA
in the form of control chart limits based on is applied to show the relationship between
prior good tests. For tests whose loads are not metrics. PCA groups metrics into groups, called
constant, Nguyen et al. use linear extrapolation Principle Components (PC). Each group has a
to transform the performance metrics data. Then value called variance, which explains the im-
current test data is overlayed on the control portance/relevance of the group to explain the
chart. If the examined performance metric (e.g., overall data. The higher the variance values of
subsystem CPU) has a high number of viola- the groups, the more relevant these groups are.
tions, this metric is flagged as an anomaly and Furthermore, each metric is a member of all the
is reported to the development team for further PCs, but the importance of the metrics within
analysis. one group varies. The higher the eigenvalue of
4) Deriving Performance Rules a metric within one group, the more important
Nguyen et al. [188] treat each metric separately the metric is to the group. Malik et al. select
and derive range boundary values for each of first N Principle Components with then largest
these metrics. However, in many cases the as- variance. Then within each Principle Compo-
sumptions of control chart may not hold by the nent, Malik et al. select important counters by
performance metrics data. For example, when calculating pair-wise correlations between coun-
the SUT is processing a large number of re- ters. These important counters forms the “Per-
quests, the CPU usage and memory usage could formance Signatures”. The performance signa-
be high. tures are calculated on the past good tests and
Foo et al. [181] build performance rules, and flag the current test, respectively. The discrepancies
metrics, which violates these rules. A perfor- between the performance signatures are flagged
mance rule groups a set of correlating metrics. as “Anomalous Behavior”.
For example, a large number of requests imply If the past performance tests are labeled with
high CPU and memory usage. For all the past as pass/fail, Malik et al. recommend to use a
tests, Foo et al. first categorize each metrics supervised learning approach to pin-point per-
into one of high/median/low categories, then formance problems over the aforementioned un-
derive performance rules by applying an arti- supervised learning approach, as the supervised
ficial intelligence technique, called Association learning approach yields better results. In [184],
Rule mining. The performance rules (associa- they first use a machine learning technique
tion rules) are derived by finding frequent co- called the Wrapped-based attribute selection
occurred metrics. For example, if high brows- technique to pick the top N performance coun-
ing requests, high Database CPU and high ters, which best characterizes the performance
web server memory footprint always appear behavior of the SUT under load. Then they build
together, Browsing/DB CPU/Web Server Memory a logistic regression model with these signature
form a set (called “frequent-item-set”). Based performance counters. The performance counter
on the frequent-item-set, association rules can data from the new test is passed into a logistic
be formed (e.g., high browsing requests and regression model to identify whether they are
high web server memory implies high database anomalous or not.
26
6) Deriving Transaction Profiles example, the database disconnects and recon-

The aforementioned five techniques uses data nects with the SUT intermittently throughout the
mining techniques to derive the expected behav- test. These types of anomalous behavior should
ior and flag the anomalously behavior, Ghaith be raised for further investigation.
et al. use queuing network model to derive the Similar as in Section 5.2.2, Jiang et al. first ab-
expected behavior, called the Transaction Pro- stract each log line into an execution event,
files (TP) [183], [182]. A TP represents the service then group these log lines into pairs (based on
demands on all resources when processing a runtime information like session IDs or thread
particular transaction. For example, in a web- IDs). Then, Jiang et al. group these event pairs
based application, the TP for a single “Brows- and flag small deviations. For example, if 99%
ing” transaction would be 0.1 seconds of server of the time a lock-open event is followed by a
CPU, 0.2 seconds of server disk and 0.05 seconds lock-close event and 1% of the time lock open
of client CPU. Ideally, the performance of each is followed by something else; such deviated
transaction would be identical regardless of the behavior should be flagged as an “anomalous
system load, if the SUT do not experience any behavior”.
performance bottleneck. Hence, the TP would 2) Detecting Anomalous Performance Behavior
be identical for a particular transaction type (Response Time)
regardless of the load. If the TP deviates from As a regular load test simulates periods of peak
one release to another, the SUT might have a usage and periods of off-hour usage, the same
performance regression problem. Ghaith et al. workload is usually applied across load tests,
derive TPs from the performance data on pre- so that the results of prior load tests are used
vious releases and compares again the current as an informal baseline and compared against
release. If the TPs differ, the new release might the current run. If the current run has scenarios,
have performance regression problems. which follow a different response time distribu-
tion than the baseline, this run is probably trou-
5.3.2 Anomaly Detection Using Execution Logs blesome and worth investigating. Jiang et al. pro-
posed an approach, which analyzes the response
There are two proposed log-based anomaly detection time extracted from the execution logs [87]. Jiang
techniques. One technique is focused on detecting et al. recover the scenario sequences by linking
anomalous functional behavior, the other one is fo- the corresponding identifiers (e.g., session IDs).
cused on detecting anomalous performance (i.e., non- In this way, both the end-to-end and step-wise
functional) behavior. response times are extracted for each scenario.
1) Detecting Anomalous Functional Behavior By comparing the distribution of end-to-end and
There are limitations associated with keyword- step-wise response times, this approach reports
based analysis approaches described in Sec- scenarios with performance problems and pin-
tion 5.2.2: First, not all log lines with the key- points performance bottlenecks within these sce-
words correspond to failures. For example, the narios.
log line, “Failure to locate item in the cache”,
contains the “failure” keyword, it is not an 5.3.3 Anomaly Detection Using Execution Logs and
anomalous log line worthy of investigation. Sec- Performance Metrics
ond, not all log lines without such keywords All the aforementioned anomaly detection techniques
are failure free. For example, the log line, “In- only examine one type of system behavior data (exe-
ternal message queue is full”, does not contain cution logs or performance metrics), Syer et al. [190]
the word failure, though it is an indication of analyze both execution logs and performance metrics
anomalous situation that should be investigated. for memory-related problems. Ideally, same set of log
Jiang et al. [88] propose a technique that detects lines (a.k.a., same workload) would lead to similar
the anomalous execution sequences in the execu- system resource usage levels (e.g., similar CPU and
tion logs, instead of relying on the log keywords. memory usages). Otherwise, scenarios corresponding
The main intuition behind this work is that a to these log lines might lead to potential performance
load test repeatedly executes a set of scenarios problems. Syer et al. first divide the logs and memory
over a period of time. The applications should usage data into equal time intervals and combine
follow the same behavior (e.g., generating the these two types of system behavior data into profiles.
same logs) when the scenario is executed each Then these profiles are clustered based on the sim-
time. As load testing is conducted after the func- ilarity of logs. Finally, outliers within these clusters
tional tests are completed, the dominant behav- are identified by the deviation of their memory foot-
ior is usually the normal (i.e., correct) behavior prints. Scenarios corresponding to the outlier clusters
and the minority (i.e., deviated) behaviors are could lead to potential memory-related problems (e.g.,
likely troublesome and worth investigating. For memory leaks).
27
5.4 Summary and Open Problems [2] E. J. Weyuker and F. I. Vokolos, “Experience with perfor-
mance testing of software systems: Issues, an approach, and
Depending on the types of data and test objectives, case study,” IEEE Transactions on Software Engineering, vol. 26,
there are different load test analysis techniques that no. 12, 2000.
[3] “Firefox download stunt sets record for quickest
have been proposed. There are three general test meltdown,” http://blogs.siliconvalley.com/gmsv/2008/06/
analysis approaches: verifying the test data against firefox-download-stunt-sets-record-for-quickest-meltdown.
fixed threshold values, searching through the test data html, visited 2014-11-24.
[4] “Steve Jobs on MobileMe,” http://
for known problem patterns and automated detection arstechnica.com/journals/apple.ars/2008/08/05/
of anomalous behaviors. steve-jobs-on-mobileme-the-full-e-mail, visited 2014-11-
Below are a few open problems: 24.
[5] S. G. Stolberg and M. D. Shear, “Inside the Race
• Can we use system monitoring techniques to to Rescue a Health Care Site, and Obama,” 2013,
analyze load test data? http://www.nytimes.com/2013/12/01/us/politics/
inside-the-race-to-rescue-a-health-site-and-obama.html,
Many research ideas in production system moni- visited 2014-11-24.
toring may be applicable for load testing analysis. [6] M. J. Harrold, “Testing: A roadmap,” in Proceedings of the
For example, approaches (e.g., [20], [193], [194], Conference on The Future of Software Engineering, pp. 61–72.
[7] C.-W. Ho, L. Williams, and A. I. Anton, “Improving perfor-
[195]) have been proposed to build performance mance requirements specification from field failure reports,”
signatures based on the past failures, so that in Proceedings of the 2007 15th IEEE International Requirements
whenever such symptoms occur in the field, the Engineering Conference (RE), 2007, pp. 79–88.
[8] C.-W. Ho, L. Williams, and B. Robinson, “Examining the
problems can be detected and notified right away. relationships between performance requirements and ”not a
Analogously, we can formulate our performance problem defect reports,” in Proceedings of the 2008 16th
signature based on mining the past load testing IEEE International Requirements Engineering Conference (RE),
2008, pp. 135–144.
history and use these performance signatures to [9] A. Avritzer and A. B. Bondi, “Resilience assessment based on
detect recurrent problems in load tests. A promis- performance testing,” in Resilience Assessment and Evaluation
ing research area is to explore the applicability of Computing Systems, K. Wolter, A. Avritzer, M. Vieira, and
A. van Moorsel, Eds. Springer Berlin Heidelberg, 2012, pp.
and ease of adapting system monitoring tech- 305–322.
niques for the analysis of load tests. [10] A. Avritzer, J. P. Ros, and E. J. Weyuker, “Reliability testing of
• Scalable and Efficient Analysis of the Results rule-based systems,” IEEE Software, vol. 13, no. 5, pp. 76–82,
1996.
of Load Tests [11] M. S. Bayan and J. W. Cangussu, “Automatic feedback,
As load tests generate large volumes of data, the control-based, stress and load testing,” in Proceedings of the
load test analysis techniques need to be scalable 2008 ACM symposium on Applied computing (SAC), 2008, pp.
661–666.
and efficient. However, as data grows larger (e.g., [12] G. Gheorghiu, “Performance vs. load vs. stress
bigger than one machine’s hard-drive to store), testing,” 2005, http://agiletesting.blogspot.com/2005/
many of the test analysis techniques may not 02/performance-vs-load-vs-stress-testing.html, visited
2014-11-24.
scale well. It is very important to explore scalable [13] J. Meier, C. Farre, P. Bansode, S. Barber, and D. Rea, “Per-
test analysis techniques, which can automatically formance Testing Guidance for Web Applications - patterns
examine gigabytes or terabyte of system behavior & practices,” September 1997, http://msdn.microsoft.com/
en-us/library/bb924375.aspx, visited 2014-11-24.
data efficiently. [14] B. Dillenseger, “Clif, a framework based on fractal for flex-
ible, distributed load testing,” Annals of Telecommunications,
vol. 64, pp. 101–120, 2009.
6 S URVEY C ONCLUSION [15] D. A. Menasce, “Load testing, benchmarking, and application
performance management for the web,” in Proceedings of the
To ensure the quality of large scale systems, load 2002 Computer Management Group Conference (CMG), 2002, pp.
271–281.
testing is required in addition to conventional func- [16] ——, “Load testing of web sites,” IEEE Internet Computing,
tional testing procedures. Furthermore, load testing is vol. 6, no. 4, 2002.
becoming more important, as an increasing number [17] M. S. Bayan and J. W. Cangussu, “Automatic stress and
load testing for embedded systems,” in Proceedings of the
of services are being offered in the cloud to millions 30th Annual International Computer Software and Applications
of users. However, as observed by Visser [196], load Conference (COMPSAC), 2006, pp. 229–233.
testing is a difficult task requiring a great under- [18] C.-S. D. Yang and L. L. Pollock, “Towards a structural load
testing tool,” in Proceedings of the 1996 ACM SIGSOFT interna-
standing of the SUT. In this paper, we have surveyed tional symposium on Software testing and analysis (ISSTA), 1996,
techniques that are used in the three phases of load pp. 201–208.
testing: the load design phase, the load execution [19] “uTest - Load Testing Services,” http://www.utest.com/
load-testing, visited 2014-11-24.
phase, and the load test analysis phase. We compared [20] M. Acharya and V. Kommineni, “Mining health models for
and contrasted these techniques and provided a few performance monitoring of services,” in Proceedings of the
open research problems for each phase of the load 2009 IEEE/ACM International Conference on Automated Software
Engineering (ASE), 2009, pp. 409–420.
testing problem. [21] A. Avritzer and B. B. Larson, “Load testing software using
deterministic state testing,” in Proceedings of the 1993 ACM
SIGSOFT international symposium on Software testing and anal-
R EFERENCES ysis (ISSTA 1993), 1993, pp. 82–88.
[22] A. Avritzer and E. J. Weyuker, “Generating test suites for soft-
[1] “Applied Performance Management Survey,” Oct 2007. ware load testing,” in Proceedings of the 1994 ACM SIGSOFT
28
international symposium on Software testing and analysis (ISSTA based software systems,” in Proceedings of the 14th Annual
1994), 1994, pp. 44–57. IEEE International Conference and Workshops on the Engineering
[23] ——, “The automatic generation of load test suites and the of Computer-Based Systems (ECBS), 2007, pp. 307–316.
assessment of the resulting software,” IEEE Transactions on [43] A. Avritzer, J. Kondek, D. Liu, and E. J. Weyuker, “Software
Software Engineering, vol. 21, no. 9, pp. 705–716, 1995. performance testing based on workload characterization,” in
[24] C. Barna, M. Litoiu, and H. Ghanbari, “Autonomic load- Proceedings of the 3rd international workshop on Software and
testing framework,” in Proceedings of the 8th ACM International performance (WOSP), 2002, pp. 17–24.
Conference on Autonomic Computing (ICAC), 2011. [44] S. Barber, “Creating effective load models for performance
[25] ——, “Model-based performance testing (nier track),” in testing with incomplete empirical data,” in Proceedings of the
Proceedings of the 33rd International Conference on Software Sixth IEEE International Workshop on the Web Site Evolution
Engineering (ICSE), 2011. (WSE), 2004, pp. 51–59.
[26] F. Huebner, K. S. Meier-Hellstern, and P. Reeser, “Perfor- [45] A. B. Bondi, “Automating the analysis of load test results
mance testing for ip services and systems,” in Performance to assess the scalability and stability of a component,” in
Engineering, State of the Art and Current Trends, 2001, pp. 283– Proceedings of the 2007 Computer Measurement Group Conference
299. (CMG), 2007, pp. 133–146.
[27] B. Lim, J. Kim, and K. Shim, “Hierarchical load testing [46] G. Gheorghiu, “More on performance vs. load
architecture using large scale virtual clients,” pp. 581–584, testing,” 2005, http://agiletesting.blogspot.com/2005/
2006. 04/more-on-performance-vs-load-testing.html, visited
[28] I. Schieferdecker, G. Din, and D. Apostolidis, “Distributed 2014-11-24.
functional and load tests for web services,” International [47] D. A. Menasce and V. A. F. Almeida, Scaling for E Business:
Journal on Software Tools for Technology Transfer (STTT), vol. 7, Technologies, Models, Performance, and Capacity Planning. Up-
pp. 351–360, 2005. per Saddle River, NJ, USA: Prentice Hall PTR, 2000.
[29] P. Zhang, S. G. Elbaum, and M. B. Dwyer, “Automatic gener- [48] B. A. Pozin and I. V. Galakhov, “Models in performance
ation of load tests,” in 26th IEEE/ACM International Conference testing,” Programming and Computer Software, vol. 37, no. 1,
on Automated Software Engineering (ASE 2011), November January 2011.
2011. [49] M. Woodside, G. Franks, and D. C. Petriu, “The future of soft-
[30] S. Abu-Nimeh, S. Nair, and M. Marchetti, “Avoiding denial ware performance engineering,” in Proceedings of the Future of
of service via stress testing,” in Proceedings of the IEEE In- Software Engineering (FOSE) track, International Conference on
ternational Conference on Computer Systems and Applications Software Engineering (ICSE), 2007, pp. 171–187.
(AICCSA), 2006, pp. 300–307. [50] D. A. Menasce, V. A. F. Almeida, R. Fonseca, and M. A.
[31] D. Bainbridge, I. H. Witten, S. Boddie, and J. Thompson, Re- Mendes, “A methodology for workload characterization of
search and Advanced Technology for Digital Libraries. Springer, e-commerce sites,” in Proceedings of the 1st ACM conference on
2009, ch. Stress-Testing General Purpose Digital Library Soft- Electronic commerce (EC), 1999, pp. 119–128.
ware, pp. 203–214.
[51] G. Casale, A. Kalbasi, D. Krishnamurthy, and J. Rolia,
[32] L. C. Briand, Y. Labiche, and M. Shousha, “Stress testing
“Automatic stress testing of multi-tier systems by dy-
real-time systems with genetic algorithms,” in Proceedings
namic bottleneck switch generation,” in Proceedings of the
of the 2005 conference on Genetic and evolutionary computation
10th ACM/IFIP/USENIX International Conference on Middleware
(GECCO), 2005, pp. 1021–1028.
(Middleware). New York, NY, USA: Springer-Verlag New
[33] C. D. Grosso, G. Antoniol, M. D. Penta, P. Galinier, and
York, Inc., 2009, pp. 1–20.
E. Merlo, “Improving network applications security: a new
heuristic to generate stress testing data,” in Proceedings of [52] D. Krishnamurthy, J. Rolia, and S. Majumdar, “Swat: A
tool for stress testing session-based web applications,” in
the 2005 conference on Genetic and evolutionary computation
(GECCO), 2005, pp. 1037–1043. Computer Measurement Group Conference, 2003.
[34] M. Musuvathi, S. Qadeer, T. Ball, G. Basler, P. A. Nainar, and [53] D. Krishnamurthy, J. A. Rolia, and S. Majumdar, “A synthetic
I. Neamtiu, “Finding and reproducing heisenbugs in concur- workload generation technique for stress testing session-
rent programs,” in Proceedings of the 8th USENIX Symposium based systems,” IEEE Transactions on Software Engineering,
on Operating Systems Design and Implementation (OSDI), 2008, vol. 32, no. 11, pp. 868–882, 2006.
pp. 267–280. [54] J. Zhang, S.-C. Cheung, and S. T. Chanson, “Stress testing
[35] F. I. Vokolos and E. J. Weyuker, “Performance testing of soft- of distributed multimedia software systems,” in Proceedings
ware systems,” in Proceedings of the 1st international workshop of the IFIP TC6 WG6.1 Joint International Conference on Formal
on Software and performance (WOSP), 1998, pp. 80–87. Description Techniques for Distributed Systems and Communica-
[36] A. Chakravarty, “Stress testing an ai based web service: A tion Protocols (FORTE XII) and Protocol Specification, Testing and
case study,” in Seventh International Conference on Information Verification (PSTV XIX), 1999.
Technology: New Generations (ITNG 2010), April 2010. [55] A. F. Karr and A. A. Porter, “Distributed performance testing
[37] M. Kalita and T. Bezboruah, “Investigation on performance using statistical modeling,” in Proceedings of the 1st interna-
testing and evaluation of prewebd: a .net technique for tional workshop on Advances in model-based testing (A-MOST),
implementing web application,” IET Software, vol. 5, no. 4, 2005, pp. 1–7.
pp. 357 –365, August 2011. [56] L. Bulej, T. Kalibera, and P. Tma, “Repeated results analysis
[38] J. Zhang and S. C. Cheung, “Automated test case generation for middleware regression benchmarking,” Performance Eval-
for the stress testing of multimedia systems,” Software - uation, vol. 60, no. 1-4, pp. 345–358, 2005.
Practice & Experience, vol. 32, no. 15, pp. 1411–1435, 2002. [57] T. Kalibera, L. Bulej, and P. Tuma, “Automated detection of
[39] J. Hill, D. Schmidt, J. Edmondson, and A. Gokhale, “Tools for performance regressions: the mono experience,” in Proceed-
continuously evaluating distributed system qualities,” IEEE ings of the 13th IEEE International Symposium on Modeling,
Software, vol. 27, no. 4, pp. 65 –71, july-aug. 2010. Analysis, and Simulation of Computer and Telecommunication
[40] J. H. Hill, “An architecture independent approach to emu- Systems (MASCOTS), 27-29 2005, pp. 183 – 190.
lating computation intensive workload for early integration [58] J. W. Cangussu, K. Cooper, and W. E. Wong, “Reducing the
testing of enterprise dre systems,” in Proceedings of the Confed- number of test cases for performance evaluation of compo-
erated International Conferences, CoopIS, DOA, IS, and ODBASE nents,” in Proceedings of the Nineteenth International Conference
2009 on On the Move to Meaningful Internet Systems (OTM). on Software Engineering & Knowledge Engineering (SEKE), 2007,
Berlin, Heidelberg: Springer-Verlag, 2009, pp. 744–759. pp. 145–150.
[41] J. H. Hill, D. C. Schmidt, A. A. Porter, and J. M. Slaby, “Cicuts: [59] ——, “A segment based approach for the reduction of the
Combining system execution modeling tools with continuous number of test cases for performance evaluation of com-
integration environments,” in Proceedings of the 15th Annual ponents,” International Journal of Software Engineering and
IEEE International Conference and Workshop on the Engineering Knowledge Engineering, vol. 19, no. 4, pp. 481–505, 2009.
of Computer Based Systems (ECBS), 2008, pp. 66–75. [60] M. G. Stochel, M. R. Wawrowski, and J. J. Waskiel, “Adaptive
[42] J. H. Hill, S. Tambe, and A. Gokhale, “Model-driven engi- agile performance modeling and testing,” in Proceedings of
neering for development-time qos validation of component- the 2012 IEEE 36th Annual Computer Software and Applications
29
Conference Workshops (COMPSACW). Washington, DC, USA: 2008 conference of the center for advanced studies on collaborative
IEEE Computer Society, 2012, pp. 446–451. research (CASCON), 2008, pp. 157–165.
[61] M. J. Johnson, C.-W. Ho, M. E. Maximilien, and L. Williams, [86] M. D. Syer, B. Adams, and A. E. Hassan, “Identifying perfor-
“Incorporating performance testing in test-driven develop- mance deviations in thread pools,” in Proceedings of the 2011
ment,” IEEE Software, vol. 24, pp. 67–73, May 2007. 27th IEEE International Conference on Software Maintenance
[62] A. Avritzer and E. J. Weyuker, “Deriving workloads for (ICSM), 2011.
performance testing,” Software - Practice & Experience, vol. 26, [87] Z. M. Jiang, A. E. Hassan, G. Hamann, and P. Flora, “Auto-
no. 6, pp. 613–633, 1996. mated performance analysis of load tests,” in Proceedings of
[63] P. Csurgay and M. Malek, “Performance testing at early the 25th IEEE International Conference on Software Maintenance
design phases,” in Proceedings of the IFIP TC6 12th International (ICSM), 2009.
Workshop on Testing Communicating Systems, 1999, pp. 317–330. [88] ——, “Automatic identification of load testing problems,” in
[64] G. Denaro, A. Polini, and W. Emmerich, “Early performance Proceedings of the 24th IEEE International Conference on Software
testing of distributed software applications,” 2004. Maintenance (ICSM), 2008.
[65] ——, Performance Testing of Distributed Component Architec- [89] R. Jain, The Art of Computer Systems Performance Analysis:
tures. In Building Quality into COTS Components: Testing and Techniques for Experimental Design, Measurement, Simulation,
Debugging. Springer-Verlag, 2005, ch. Performance Testing and Modeling. Wiley, April 1991.
of Distributed Component Architectures. [90] M. Calzarossa and G. Serazzi, “Workload characterization: a
[66] V. Garousi, “A genetic algorithm-based stress test require- survey,” Proceedings of the IEEE, vol. 81, no. 8, pp. 1136 –1150,
ments generator tool and its empirical evaluation,” Transac- Aug 1993.
tions on Softare Engineering. [91] S. Elnaffar and P. Martin, “Characterizing computer systems’
[67] ——, “Empirical analysis of a genetic algorithm-based stress workloads,” Queen’s University, Tech. Rep., 2002.
test technique,” in Proceedings of the 10th annual conference on [92] N. Snellman, A. Ashraf, and I. Porres, “Towards automatic
Genetic and evolutionary computation (GECCO), 2008, pp. 1743– performance and scalability testing of rich internet appli-
1750. cations in the cloud,” in 37th EUROMICRO Conference on
[68] V. Garousi, L. C. Briand, and Y. Labiche, “Traffic-aware Software Engineering and Advanced Applications (SEAA 2011),
stress testing of distributed systems based on uml models,” September 2011.
in Proceedings of the 28th international conference on Software [93] M. Andreolini, M. Colajanni, and P. Valente, “Design and
engineering (ICSE), 2006, pp. 391–400. testing of scalable web-based systems with performance con-
[69] ——, “Traffic-aware stress testing of distributed real-time straints,” in Proceedings of the 2005 Workshop on Techniques,
systems based on uml models using genetic algorithms,” Methodologies and Tools for Performance Evaluation of Complex
Journal of Systems and Software, vol. 81, no. 2, pp. 161–185, Systems (FIRB-PERF), 2005.
2008. [94] S. Dawar, S. Meer, J. Keeney, E. Fallon, and T. Bennet, “Cloud-
[70] E. Bozdag, A. Mesbah, and A. van Deursen, “Performance ifying mobile network management: Performance tests of
testing of data delivery techniques for ajax applications,” event distribution and rule processing,” in Mobile Networks
Journal of Web Engineering, vol. 8, no. 4, pp. 287–315, 2009. and Management, ser. Lecture Notes of the Institute for Com-
[71] D. S. Hoskins, C. J. Colbourn, and D. C. Montgomery, “Soft- puter Sciences, Social Informatics and Telecommunications
ware performance testing using covering arrays: efficient Engineering, D. Pesch, A. Timm-Giel, R. Calvo, B.-L. Wen-
screening designs with categorical factors,” in Proceedings ning, and K. Pentikousis, Eds. Springer International Pub-
of the 5th international workshop on Software and performance lishing, 2013, vol. 125, pp. 94–107.
(WOSP), 2005, pp. 131–136. [95] B. L. Farrell, R. Menninger, and S. G. Strickland, “Per-
[72] M. Sopitkamol and D. A. Menascé, “A method for evalu- formance testing & analysis of distributed client/server
ating the impact of software configuration parameters on e- database systems,” in Proceedings of the 1998 Computer Man-
commerce sites,” in Proceedings of the 5th international workshop agement Group Conference (CMG), 1998, pp. 910–921.
on Software and performance (WOSP), 2005, pp. 53–64. [96] R. Hayes, “How to load test e-commerce applications,” in
[73] B. Beizer, Software System Testing and Quality Assurance. Van Proceedings of the 2000 Computer Management Group Conference
Nostrand Reinhold, March 1984. (CMG), 2000, pp. 275–282.
[74] I. Gorton, Essential Software Architecture. Springer, 2000. [97] J. A. Meira, E. C. de Almeida, G. Suny, Y. L. Traon, and
[75] A. Adamoli, D. Zaparanuks, M. Jovic, and M. Hauswirth, P. Valduriez, “Stress testing of transactional database sys-
“Automated gui performance testing,” Software Quality Con- tems.” Journal of Information and Data Management (JIDM),
trol, vol. 19, no. 4, December 2011. vol. 4, no. 3, pp. 279–294, 2013.
[76] D. A. Menasce and V. A. F. Almeida, Capacity Planning for [98] J. K. Merton, “Evolution of performance testing in a dis-
Web Services: Metrics, Models, and Methods. Upper Saddle tributed client server environment,” in Proceedings of the 1994
River, NJ, USA: Prentice Hall PTR, 2001. Computer Management Group Conference (CMG), 1999, pp. 118–
[77] D. A. Menasce, V. A. Almeida, and L. W. Dowd, Capacity 124.
Planning and Performance Modeling: From Mainframes to Client- [99] A. Savoia, “Web load test planning: Predicting how your web
Server Systems. Upper Saddle River, NJ, USA: Prentice Hall site will respond to stress,” STQE Magazine, 2001.
PTR, 1997. [100] L. T. Costa, R. M. Czekster, F. M. de Oliveira, E. de M. Ro-
[78] S. Nejati, S. D. Alesio, M. Sabetzadeh, and L. Briand, “Mod- drigues, M. B. da Silveira, and A. F. Zorzo, “Generating
eling and analysis of cpu usage in safety-critical embedded performance test scripts and scenarios based on abstract
systems to support stress testing,” in Proceedings of the 15th intermediate models,” in Proceedings of the 24th International
International Conference on Model Driven Engineering Languages Conference on Software Engineering & Knowledge Engineering
and Systems (MODELS). Berlin, Heidelberg: Springer-Verlag, (SEKE), 2012, pp. 112–117.
2012, pp. 759–775. [101] M. B. da Silveira, E. de M. Rodrigues, A. F. Zorzo, L. T.
[79] E. W. Dijkstra, “Notes on Structured Programming,” April Costa, H. V. Vieira, and F. M. de Oliveira, “Reusing functional
1970. testing in order to decrease performance and stress testing
[80] “CompleteSearch DBLP,” http://dblp.mpi-inf.mpg.de/ costs,” in Proceedings of the 23rd International Conference on
dblp-mirror/index.php, visited 2014-11-24. Software Engineering & Knowledge Engineering (SEKE 2011),,
[81] “Google Scholar,” http://scholar.google.com/, visited 2014- 2011.
11-24. [102] I. de Sousa Santos, A. R. Santos, and P. de Alcantara dos
[82] “Microsoft Academic Search,” http://academic.research. S. Neto, “Generation of scripts for performance testing based
microsoft.com/, visited 2014-11-24. on uml models,” in Proceedings of the 23rd International Con-
[83] “ACM Portal,” http://portal.acm.org/, visited 2014-11-24. ference on Software Engineering & Knowledge Engineering (SEKE
[84] “IEEE Explore,” http://ieeexplore.ieee.org/Xplore/ 2011),, 2011.
guesthome.jsp, visited 2014-11-24. [103] X. Wang, B. Zhou, and W. Li, “Model based load testing of
[85] A. Avritzer, F. P. Duarte, a. Rosa Maria Meri Le E. de Souza e web applications,” in 2010 International Symposium on Parallel
Silva, M. Cohen, and D. Costello, “Reliability estimation and Distributed Processing with Applications (ISPA), September
for large distributed software systems,” in Proceedings of the 2010.
30
[104] M. D. Barros, J. Shiau, C. Shang, K. Gidewall, H. Shi, and 2012 IEEE International Symposium on Performance Analysis of
J. Forsmann, “Web services wind tunnel: On performance Systems & Software, ser. ISPASS ’12, 2012, pp. 35–45.
testing large-scale stateful web services,” in Proceedings of the [123] “Aristotle Analysis System – Siemens Programs, HR Vari-
37th Annual IEEE/IFIP International Conference on Dependable ants.”
Systems and Networks (DSN), 2007, pp. 612 –617. [124] G.-H. Kim, Y.-G. Kim, and S.-K. Shin, “Software performance
[105] K. Kant, V. Tewary, and R. Iyer, “An internet traffic generator test automation by using the virtualization,” in IT Convergence
for server architecture evaluation.” in Proceedings of Workshop and Security 2012, ser. Lecture Notes in Electrical Engineering,
Computer Architecture Evaluation Using Commercial Workloads, K. J. Kim and K.-Y. Chung, Eds. Springer Netherlands, 2013,
2001. vol. 215, pp. 1191–1199.
[106] D. Draheim, J. Grundy, J. Hosking, C. Lutteroth, and G. We- [125] T. Bear, “Shootout: Load Runner vs The Grinder vs Apache
ber, “Realistic load testing of web applications,” in Proceedings JMeter,” 2006, http://blackanvil.blogspot.com/2006/06/
of the Conference on Software Maintenance and Reengineering shootout-load-runner-vs-grinder-vs.html, visited 2014-11-24.
(CSMR), 2006, pp. 57–70. [126] A. Podelko, “Load Testing Tools,” http://alexanderpodelko.
[107] C. Lutteroth and G. Weber, “Modeling a realistic workload for com/PerfTesting.html#LoadTestingTools, visited 2014-11-24.
performance testing,” in Proceedings of the 2008 12th Interna- [127] C. Vail, “Stress, load, volume, performance, benchmark and
tional IEEE Enterprise Distributed Object Computing Conference base line testing tool evaluation and comparison,” 2005, http:
(EDOC), 2008, pp. 149–158. //www.vcaa.com/tools/loadtesttoolevaluationchart-023.
[108] F. Abbors, T. Ahmad, D. Truscan, and I. Porres, “Model-based pdf, visited 2014-11-24.
performance testing in the cloud using the mbpet tool,” in [128] “WebLOAD product overview,” http://radview.com/
Proceedings of the 4th ACM/SPEC International Conference on company/resources/, visited 2014-11-24.
Performance Engineering (ICPE). New York, NY, USA: ACM, [129] “HP LoadRunner software,” http://www8.hp.com/ca/en/
2013, pp. 423–424. software-solutions/loadrunner-load-testing/, visited 2014-
[109] A. J. Maalej, M. Hamza, M. Krichen, and M. Jmaiel, “Au- 11-24.
tomated significant load testing for ws-bpel compositions,” [130] “Apache JMeter,” http://jakarta.apache.org/jmeter/, visited
in 2013 IEEE Sixth International Conference on Software Testing, 2014-11-24.
Verification and Validation Workshops (ICSTW), March 2013, pp. [131] “Microsoft Exchange Load Generator (LoadGen),”
144–153. http://www.microsoft.com/en-us/download/details.aspx?
[110] A. J. Maalej, M. Krichen, and M. Jmaiel, “Conformance test- id=14060, visited 2014-11-24.
ing of ws-bpel compositions under various load conditions,” [132] X. Che and S. Maag, “Passive testing on performance require-
in 2012 IEEE 36th Annual Computer Software and Applications ments of network protocols,” in AINA Workshops, 2013.
Conference (COMPSAC), July 2012, pp. 371–371. [133] S. Dimitrov and T. Stoilov, “Loading test of apache http
[111] P. Zhang, S. Elbaum, and M. B. Dwyer, “Compositional load server by video file and usage measurements of the hardware
test generation for software pipelines,” in Proceedings of the components,” in Proceedings of the 14th International Conference
2012 International Symposium on Software Testing and Analysis on Computer Systems and Technologies (CompSysTech). New
(ISSTA). New York, NY, USA: ACM, 2012, pp. 89–99. York, NY, USA: ACM, 2013, pp. 59–66.
[112] Y. Gu and Y. Ge, “Search-based performance testing of ap- [134] M. Murth, D. Winkler, S. Biffl, E. Kuhn, and T. Moser, “Per-
plications with composite services,” in Proceedings of the 2009 formance testing of semantic publish/subscribe systems,”
International Conference on Web Information Systems and Mining in Proceedings of the 2010 International Conference on On the
(WISM), 2009, pp. 320–324. Move to Meaningful Internet Systems, ser. OTM 2010. Berlin,
[113] M. D. Penta, G. Canfora, G. Esposito, V. Mazza, and Heidelberg: Springer-Verlag, 2010.
M. Bruno, “Search-based testing of service level agreements,” [135] C. Dumitrescu, I. Raicu, M. Ripeanu, and I. Foster, “Diperf:
in Proceedings of the 9th annual conference on Genetic and An automated distributed performance testing framework,”
evolutionary computation (GECCO), 2007, pp. 1090–1097. in Proceedings of the 5th IEEE/ACM International Workshop on
[114] Y. Cai, J. Grundy, and J. Hosking, “Experiences integrating Grid Computing (GRID), 2004, pp. 289–296.
and scaling a performance test bed generator with an open [136] J. A. Meira, E. C. de Almeida, Y. L. Traon, and G. Sunye,
source case tool,” in Proceedings of the 19th IEEE international “Peer-to-peer load testing,” in 2012 IEEE Fifth Interna-
conference on Automated software engineering (ASE), 2004, pp. tional Conference on Software Testing, Verification and Validation
36–45. (ICST), 2012, pp. 642–647.
[115] ——, “Synthesizing client load models for performance en- [137] J. Xie, X. Ye, B. Li, and F. Xie, “A configurable web service
gineering via web crawling,” in Proceedings of the twenty- performance testing framework,” in Proceedings of the 10th
second IEEE/ACM international conference on Automated software IEEE International Conference on High Performance Computing
engineering (ASE), 2007, pp. 353–362. and Communications (HPCC), 2008, pp. 312–319.
[116] G. Canfora, M. D. Penta, R. Esposito, and M. L. Villani, “An [138] N. Baltas and T. Field, “Continuous performance testing in
approach for qos-aware service composition based on genetic virtual time,” in Proceedings of the 2012 Ninth International
algorithms,” in Proceedings of the 2005 conference on Genetic and Conference on Quantitative Evaluation of Systems. Washington,
evolutionary computation (GECCO), 2005, pp. 1069–1075. DC, USA: IEEE Computer Society, 2012, pp. 13–22.
[117] M. M. Maccabee and S. Ma, “Web application performance: [139] D. A. Menasce, “Workload characterization,” IEEE Internet
Realistic work load for stress test,” in Proceedings of the 2002 Computing, vol. 7, no. 5, pp. 89–92, 2003.
Computer Management Group Conference (CMG), 2002, pp. 353– [140] Q. Gao, W. Wang, G. Wu, X. Li, J. Wei, and H. Zhong,
362. “Migrating load testing to the cloud: A case study,” in 2013
[118] G. M. Leganza, “The stress test tutorial,” in Proceedings of the IEEE 7th International Symposium on Service Oriented System
1991 Computer Management Group Conference (CMG), 1991, pp. Engineering (SOSE), March 2013, pp. 429–434.
994–1004. [141] M. Yan, H. Sun, X. Wang, and X. Liu, “Building a taas
[119] E. J. Weyuker and A. Avritzer, “A metric for predicting the platform for web service load testing,” in Proceedings of
performance of an application under a growing workload,” the 2012 IEEE International Conference on Cluster Computing
IBM System Journal, vol. 41, no. 1, pp. 45–54, January 2002. (CLUSTER). Washington, DC, USA: IEEE Computer Society,
[120] N. Mi, G. Casale, L. Cherkasova, and E. Smirni, “Burstiness 2012, pp. 576–579.
in multi-tier applications: symptoms, causes, and new mod- [142] J. Zhou, S. Li, Z. Zhang, and Z. Ye, “Position paper: Cloud-
els,” in Proceedings of the 9th ACM/IFIP/USENIX International based performance testing: Issues and challenges,” in Proceed-
Conference on Middleware (Middleware), 2008, pp. 265–286. ings of the 2013 International Workshop on Hot Topics in Cloud
[121] F. Borges, A. Gutierrez-Milla, R. Suppi, and E. Luque, “Op- Services (HotTopiCS). New York, NY, USA: ACM, 2013, pp.
timal run length for discrete-event distributed cluster-based 55–62.
simulations,” in Proceedings of the 2014 International Conference [143] M. Grechanik, C. Csallner, C. Fu, and Q. Xie, “Is data pri-
on Computational Science (ICCS), 2014, pp. 73–83. vacy always good for software testing?” in Proceedings of the
[122] D. Meisner, J. Wu, and T. F. Wenisch, “Bighouse: A simulation 2010 IEEE 21st International Symposium on Software Reliability
infrastructure for data center systems,” in Proceedings of the Engineering (ISSRE),, Nov. 2010, pp. 368 –377.
31
[144] Y. Wang, X. Wu, and Y. Zheng, Trust and Privacy in Dig- the 2008 23rd IEEE/ACM International Conference on Automated
ital Business. Springer, 2004, ch. Efficient Evaluation of Software Engineering (ASE), 2008, pp. 399–402.
Multifactor Dependent System Performance Using Fractional [166] “PerfMon,” http://technet.microsoft.com/en-us/library/
Factorial Design, pp. 142–151. bb490957.aspx, visited 2014-11-24.
[145] A. Bertolino, G. Angelis, A. Marco, P. Inverardi, A. Sabetta, [167] “Munin,” http://munin-monitoring.org/, visited 2014-11-24.
and M. Tivoli, “A framework for analyzing and testing the [168] “Net SNMP,” http://www.net-snmp.org/, visited 2014-11-
performance of software services,” in Leveraging Applications 24.
of Formal Methods, Verification and Validation, ser. Communica- [169] B. Cornelissen, A. Zaidman, A. van Deursen, L. Moonen, and
tions in Computer and Information Science, T. Margaria and R. Koschke, “A systematic survey of program comprehension
B. Steffen, Eds. Springer Berlin Heidelberg, 2009, vol. 17, through dynamic analysis,” IEEE Transactions on Software
pp. 206–220. Engineering, vol. 35, no. 5, pp. 684–702, September 2009.
[146] X. Meng, “Designing approach analysis on small-scale soft- [170] T. A. S. Foundation, “Log4j,” http://logging.apache.org/
ware performance testing tools,” in 2011 International Confer- log4j/2.x/, visited 2014-11-24.
ence on Electronic and Mechanical Engineering and Information [171] T. O. Group, “Application Response Measurement - ARM,”
Technology (EMEIT), August 2011. https://collaboration.opengroup.org/tech/management/
[147] C. H. Kao, C. C. Lin, and J.-N. Chen, “Performance testing arm/, visited 2014-11-24.
framework for rest-based web applications,” in 2013 13th [172] “Sarbanes-Oxley Act of 2002,” http://www.soxlaw.com/,
International Conference on Quality Software (QSIC), July 2013, visited 2014-11-24.
pp. 349–354. [173] E. M. Friedman and J. L. Rosenberg, “Web load testing made
[148] S. Dunning and D. Sawyer, “A little language for rapidly easy: Testing with wcat and wast for windows applications,”
constructing automated performance tests,” in Proceedings in Proceedings of the 2003 Computer Management Group Confer-
of the Second Joint WOSP/SIPEW International Conference on ence (CMG), 2003, pp. 57–82.
Performance Engineering, ser. ICPE ’11, 2011. [174] G. Din, I. Schieferdecker, and R. Petre, “Performance test
[149] M. Dhote and G. Sarate, “Performance testing complexity design process and its implementation patterns for multi-
analysis on ajax-based web applications,” IEEE Software, services systems,” in Proceedings of the 20th IFIP TC 6/WG
vol. 30, no. 6, pp. 70–74, Nov. 2013. 6.1 international conference on Testing of Software and Commu-
[150] N. Stankovic, “Distributed tool for performance testing,” in nicating Systems (TestCom/FATES), 2008, pp. 135–152.
Software Engineering Research and Practice, 2006, pp. 38–44.
[175] P. Tran, J. Gosper, and I. Gorton, “Evaluating the sustained
[151] ——, “Patterns and tools for performance testing,” in 2006
performance of cots-based messaging systems,” Software Test-
IEEE International Conference on Electro/information Technology,
ing, Verification and Reliability, vol. 13, no. 4, pp. 229–240, 2003.
2006, pp. 152 –157.
[176] X. Yang, X. Li, Y. Ji, and M. Sha, “Crownbench: a grid
[152] “Wireshark - Go Deep,” http://www.wireshark.org/, visited
performance testing system using customizable synthetic
2014-11-24.
workload,” in Proceedings of the 10th Asia-Pacific web conference
[153] “Selenium - Web Browser Automation,” http://seleniumhq.
on Progress in WWW research and development (APWeb), 2008,
org/, visited 2014-11-24.
pp. 190–201.
[154] S. Shirodkar and V. Apte, “Autoperf: an automated load
generator and performance measurement tool for multi-tier [177] A. L. Glaser, “Load testing in an ir organization: Getting by
software systems,” in Proceedings of the 16th international ’with a little help from my friends’,” in Proceedings of the 1999
conference on World Wide Web (WWW), 2007, pp. 1291–1292. Computer Management Group Conference (CMG), 1999, pp. 686–
[155] M. A. S. Netto, S. Menon, H. V. Vieira, L. T. Costa, F. M. 698.
de Oliveira, R. Saad, and A. F. Zorzo:, “Evaluating load gen- [178] D. Grossman, M. C. McCabe, C. Staton, B. Bailey, O. Frieder,
eration in virtualized environments for software performance and D. C. Roberts, “Performance testing a large finance
testing,” in 2011 IEEE International Symposium on Parallel and application,” Software, IEEE, vol. 13, no. 5, pp. 50 –54, sep
Distributed Processing Workshops and Phd Forum (IPDPSW), 1996.
May 2011. [179] S. Duttagupta and M. Nambiar, “Performance extrapolation
[156] J. White and A. Pilbeam, “A survey of virtualiza- for load testing results of mixture of applications,” in 2011
tion technologies with performance testing,” CoRR, vol. Fifth UKSim European Symposium on Computer Modeling and
abs/1010.3233, 2010. Simulation (EMS), November 2011.
[157] G.-B. Kim, “A method of generating massive virtual clients [180] Z. M. Jiang, A. E. Hassa, G. Hamann, and P. Flora, “An auto-
and model-based performance test,” in Proceedings of the Fifth mated approach for abstracting execution logs to execution
International Conference on Quality Software (QSIC), 2005, pp. events,” Journal Software Maintenance Evolution, vol. 20, pp.
250–254. 249–267, July 2008.
[158] “Shunra,” http://www8.hp.com/us/en/ [181] K. C. Foo, Z. M. Jiang, B. Adams, Y. Z. Ahmed E. Hassan, and
software-solutions/network-virtualization/, visited 2014-11- P. Flora, “Mining performance regression testing repositories
24. for automated performance analysis,” in 10th International
[159] R. K. Mansharamani, A. Khanapurkar, B. Mathew, and Conference on Quality Software (QSIC 2010), July 2010.
R. Subramanyan, “Performance testing: Far from steady [182] S. Ghaith, M. Wang, P. Perry, and J. Murphy, “Profile-based,
state,” in IEEE 34th Annual Computer Software and Applications load-independent anomaly detection and analysis in perfor-
Conference Workshops (COMPSACW 2010), July 2010. mance regression testing of software systems,” in Proceedings
[160] M. Grechanik, C. Fu, and Q. Xie, “Automatically finding per- of the 2013 17th European Conference on Software Maintenance
formance problems with feedback-directed learning software and Reengineering (CSMR). Washington, DC, USA: IEEE
testing,” in Proceedings of the 34th International Conference on Computer Society, 2013, pp. 379–383.
Software Engineering (ICSE). Piscataway, NJ, USA: IEEE Press, [183] S. Ghaith, “Analysis of performance regression testing data
2012, pp. 156–166. by transaction profiles,” in Proceedings of the 2013 International
[161] J.-N. Juang, Applied System Identification, 1st ed. Prentice Hall, Symposium on Software Testing and Analysis (ISSTA). New
1993. York, NY, USA: ACM, 2013, pp. 370–373.
[162] L. Ljung, System Identification: Theory for the user. Prentice [184] H. Malik, H. Hemmati, and A. E. Hassan, “Automatic de-
Hall, 1987. tection of performance deviations in the load testing of
[163] T. Mytkowicz, A. Diwan, M. Hauswirth, and P. F. Sweeney, large scale systems,” in Proceedings of the 2013 International
“Evaluating the accuracy of java profilers,” Proceedings of Conference on Software Engineering (ICSE). Piscataway, NJ,
the ACM SIGPLAN 2010 Conference on Programming Language USA: IEEE Press, 2013, pp. 1012–1021.
Design and Implementation (PLDI), pp. 187–197, 2010. [185] H. Malik, B. Adams, and A. E. Hassan, “Pinpointing the
[164] G. M. Leganza, “Coping with stress tests: Managing the subsystems responsible for the performance deviations in a
application benchmark,” in Proceedings of the 1990 Computer load test,” in IEEE 21st International Symposium on Software
Management Group Conference (CMG), 1990, pp. 1018–1026. Reliability Engineering (ISSRE 2010), November 2010.
[165] A. Bertolino, G. D. Angelis, and A. Sabetta, “Vcr: Virtual [186] H. Malik, Bram, Adams, A. E. Hassan, P. Flora, and
capture and replay for performance testing,” in Proceedings of G. Hamann, “Using load tests to automatically compare the
32
subsystems of a large enterprise system,” in IEEE 34th An- Ahmed E. Hassan Ahmed E. Hassan is
nual Computer Software and Applications Conference (COMPSAC the NSERC/BlackBerry Software Engineer-
2010), July 2010. ing Chair at the School of Computing at
[187] H. Malik, Z. M. Jiang, B. Adams, P. Flora, and G. Hamann, Queens University, Canada. His research in-
“Automatic comparison of load tests to support the perfor- terests include mining software repositories,
mance analysis of large enterprise systems,” in Proceedings empirical software engineering, load testing,
of the 14th European Conference on Software Maintenance and and log mining. Hassan received a PhD in
Reengineering (CSMR), March 2010. computer science from the University of Wa-
[188] T. H. D. Nguyen, B. Adams, Z. M. Jiang, A. E. Hassan, M. N. terloo. He spearheaded the creation of the
Nasser, and P. Flora, “Automated verification of load tests Mining Software Repositories (MSR) confer-
using control charts,” in 18th Asia Pacific Software Engineering ence and its research community. Hassan
Conference (APSEC 2011), December 2011. also serves on the editorial boards of IEEE Transactions on Software
[189] M. D. Syer, B. Adams, and A. E. Hassan, “Industrial case Engineering, Springer Journal of Empirical Software Engineering,
study on supporting the comprehension of system behaviour and Springer Journal of Computing.
under load,” in Proceedings of the 2011 IEEE 19th International
Conference on Program Comprehension (ICPC), 2011.
[190] M. D. Syer, Z. M. Jiang, M. Nagappan, A. E. Hassan,
M. Nasser, and P. Flora, “Leveraging performance counters
and execution logs to diagnose memory-related performance
issues,” in Proceedings of the 2013 IEEE International Conference
on Software Maintenance (ICSM), 2013.
[191] J. K. Merton, “Performance testing in a client-server environ-
ment,” in Proceedings of the 1997 Computer Management Group
Conference (CMG), 1997, pp. 594–601.
[192] A. Georges, D. Buytaert, and L. Eeckhout, “Statistically rigor-
ous java performance evaluation,” vol. 42, no. 10, pp. 57–76,
2007.
[193] I. Cohen, S. Zhang, M. Goldszmidt, J. Symons, T. Kelly,
and A. Fox, “Capturing, indexing, clustering, and retrieving
system history,” in Proceedings of the 20th ACM Symposium on
Operating Systems Principles (SOSP), 2005, pp. 105–118.
[194] S. Han, Y. Dang, S. Ge, D. Zhang, and T. Xie, “Performance
debugging in the large via mining millions of stack traces,”
in Proceedings of the 34th International Conference on Software
Engineering (ICSE 2012), June 2012.
[195] M. Jiang, M. A. Munawar, T. Reidemeister, and P. A. S. Ward,
“Detection and diagnosis of recurrent faults in software
systems by invariant analysis,” in Proceedings of the 2008 11th
IEEE High Assurance Systems Engineering Symposium (HASE),
2008, pp. 323–332.
[196] W. Visser, “Who really cares if the program crashes?” in
Proceedings of the 16th International SPIN Workshop on Model
Checking Software, 2009, pp. 5–5.
Zhen Ming Jiang Zhen Ming Jiang is an

Assistant Professor at the Department of
Electrical Engineering and Computer Sci-
ence, York University. Prior to joining York,
he worked at BlackBerry Performance En-
gineering Team. His research interests lie
within Software Engineering and Computer
Systems, with special interests in software
performance engineering, mining software
repositories, source code analysis, software
architectural recovery, software visualiza-
tions and debugging and monitoring of distributed systems. Some
of his research results are already adopted and used in practice on
a daily basis. He is the co-founder and co-organizer of the annually
held International Workshop on Large-Scale Testing (LT). He is also
the recipient of several best paper awards including ICSE 2013,
WCRE 2011 and MSR 2009 (challenge track). Jiang received his
PhD from the School of Computing at the Queen’s University. He
received both his MMath and BMath degrees in Computer Science
from the University of Waterloo.

A Survey On Load Testing of Large-Scale

Uploaded by

Copyright:

Available Formats

A Survey On Load Testing of Large-Scale

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A Survey On Load Testing of Large-Scale

Uploaded by

Copyright:

Available Formats

This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation

A Survey on Load Testing of Large-Scale

1 I NTRODUCTION Load Test Objectives

Many large-scale systems ranging from e-commerce

tend to be caused by their inability to scale to meet

2.1.1 Load Testing

TABLE 1: Interpretations of Load Testing, Performance Testing and Stress Testing

needed to validate the system’s performance

Only be related to load testing (e.g., [11], [17], [15], [21],

TABLE 2: Load Design Techniques

Techniques Objectives Data Sources Output References

Reporting and reproducing problems like 4.1.1 System Deployment

TABLE 4: Load Execution Techniques

Live-user based Driver based Emulation based

in a model-driven engineering setup (e.g., 4.1.2 Test Execution Setup

TABLE 5: Load Test Analysis Techniques

Approaches Techniques Data Test Objectives Reported Results References

Straight-forward Com- Performance metrics Detecting violations in [174], [175], [176]

• Deriving Target Data tain conditions. Mission critical systems usu-

6) Deriving Transaction Profiles example, the database disconnects and recon-

Zhen Ming Jiang Zhen Ming Jiang is an

You might also like