Software Verification and Validation
Ajmal Khan
Blekinge Institute of Technology, 371 79 Karlskrona, Sweden
[email protected]

Eleftherios Alveras
Blekinge Institute of Technology, 371 79 Karlskrona, Sweden
[email protected]

ABSTRACT
This report presents the findings of a controlled experiment carried out in an academic setting. The purpose of the experiment was to compare the effectiveness of checklist-based code inspection (CBI) with automatic tool-based code inspection (TBI). The experiment was required to be run on the computer game Empire Classic version 4.2, written in C++, with over 34,000 lines of code. The students, however, were free to choose which part of the code, and how much of it, would be inspected. The choice of the checklist and the tool, for CBI and TBI respectively, was also left to the students. The experiment was carried out by 5 students, divided into two teams: Team A of 2 students and Team B of 3 students. Both teams carried out both CBI and TBI. No defect seeding was done and the number of defects in the game was unknown at the start of the experiment. No significant difference in the number of check violations between the two methods was found. The distributions of the discovered violations did not differ significantly with respect to either the severity or the type of the checks. Overall, the students could not come to a clear conclusion that one technique is more effective than the other, which indicates the need for further study.
Key Words: Code inspection, controlled experiment, static testing, tool support, checklist
IDEs (Integrated Development Environments), e.g., Eclipse and Microsoft Visual Studio, also provide basic automated code review functionality [14]. We went through the list of static code analysis tools available on Wikipedia [16] and initially selected some of them based on our basic requirements of supported source-code language and supported environments (platforms, IDEs, and compilers). The tools we considered are the following:
Coverity Static Analysis: Identifies security vulnerabilities and code defects in C, C++, C# and Java code.
CppCheck: Open-source tool that checks for several types of errors, including misuse of the STL (Standard Template Library).
CppLint: An open-source tool that checks for compliance with Google's style guide for C++ coding.
GrammaTech CodeSonar: A source code analysis tool that performs whole-program analysis on C/C++ and identifies programming bugs and security vulnerabilities.
Klocwork Insight: Provides security vulnerability, defect detection, architectural and build-over-build trend analysis for C, C++, C# and Java.
McCabe IQ Developers Edition: An interactive, visual environment for managing software quality through advanced static analysis.
Parasoft C/C++test: A C/C++ tool that provides static analysis, unit testing, code review, and runtime error detection.
PVS-Studio: A software analysis tool for C/C++/C++0x.
QA-C/C++: Deep static analysis of C/C++ for quality assurance and guideline enforcement.
Red Lizard's Goanna: Static analysis of C/C++ for the command line, Eclipse and Visual Studio.
1. INTRODUCTION
The importance of code inspection is not debatable, but whether these code inspections should be carried out by the scarce resource of human inspectors or with the available automatic tools certainly is. Despite the availability of a large number of automatic inspection tools, companies perform very little automatic testing [2]. Since inspection is a costly process, it is considered good practice to assign mundane, repetitive tasks to automatic tools instead of human testers. This frees more time for manual testing, which involves creativity [2][3][5]. Many studies [4][6][7][11] have compared automatic against manual code inspection. Our study differs in three ways. First, it does not consider a small piece of code, since the code base of the target program is larger than 34,000 lines of code. Second, the selection of the tool was itself part of the experiment. Third, no checklist was provided to the students; they were asked to devise a suitable checklist themselves.
All of these tools are commercial, except CppCheck and CppLint, which are open-source. Of the latter two, we were only able to successfully download and run CppCheck. The main features of CppCheck [1][19] are listed below:
Array bounds checking for overruns
Checking for unused functions, variable initialization and memory duplication
Exception safety checking, e.g., usage of memory allocation and destructor checking
Memory and resource leak checking
Checking for invalid usage of STL (Standard Template Library) idioms and functions
Checking for miscellaneous stylistic and performance errors
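To make the scope of these checks concrete, the following small C++ fragment is an invented illustration (not taken from the Empire Classic code) containing the kinds of issues CppCheck is designed to report: an array bounds overrun, a memory leak, and an unused function. Running cppcheck with --enable=all on such a file would be expected to flag all three.

#include <cstring>

// Illustrative defects only. The function is deliberately never called,
// which by itself triggers an "unused function" report.
static void copyName(const char* name) {
    char buffer[8];
    // Array bounds overrun: the literal needs 11 bytes, the buffer holds 8.
    std::strcpy(buffer, "0123456789");

    // Memory leak: allocated with new[] but never released with delete[].
    int* counters = new int[16];
    counters[0] = static_cast<int>(std::strlen(name));
}

int main() {
    // The defective function is intentionally not invoked, so the program
    // runs cleanly while the defects remain visible to static analysis.
    return 0;
}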
All of the above-mentioned commercial tools are well known and offer more or less the same kinds of features. Since we expected to find more defects with commercial tools, we decided to include at least one commercial tool in our TBI activities. Unfortunately, the only tool we were able to acquire and install successfully was Parasoft C/C++test. The features of this tool are described in detail on the tool website [18].
If the number of items to inspect was low (e.g., fewer than 20), the inspection would include all of them. However, if the number of items was too large, we would select those items that exist in either the 10 most complex files (i.e., the 10 files with the highest total complexity) or the 15 most complex functions (i.e., the 15 functions with the highest cyclomatic complexity, CC). For some checks both of the above selections were applied, in order to keep the number of items to inspect manageable. A list of files and functions ranked according to CC is found in Appendix D.
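Once a metrics tool has produced per-function cyclomatic complexity values, the selection step itself is easy to automate. The sketch below is our own illustration, not part of the original experiment setup: the input file name and its simple "name,cc" format are assumptions standing in for whatever output the metrics tool actually produced. It ranks functions by CC and prints the 15 most complex ones, mirroring the sampling rule described above.

#include <algorithm>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct FunctionMetric {
    std::string name;  // e.g. "gather.cpp:collectResources" (hypothetical)
    int cc = 0;        // McCabe cyclomatic complexity
};

int main() {
    // Hypothetical input produced by a metrics tool: one "name,cc" pair per line.
    std::ifstream in("function_metrics.csv");
    std::vector<FunctionMetric> metrics;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream row(line);
        FunctionMetric m;
        if (std::getline(row, m.name, ',') && (row >> m.cc)) {
            metrics.push_back(m);
        }
    }

    // Rank by cyclomatic complexity, highest first.
    std::sort(metrics.begin(), metrics.end(),
              [](const FunctionMetric& a, const FunctionMetric& b) { return a.cc > b.cc; });

    // Keep the 15 most complex functions as the inspection sample.
    const std::size_t sampleSize = std::min<std::size_t>(15, metrics.size());
    for (std::size_t i = 0; i < sampleSize; ++i) {
        std::cout << metrics[i].name << " (CC = " << metrics[i].cc << ")\n";
    }
    return 0;
}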
5. METHODOLOGY
The research problem, the research questions and the design of the experiment are given in this section of the report.
3. SELECTION OF CHECKLIST
The main objective of our experiment was to compare the effectiveness of the two inspection methods, i.e., CBI and TBI. Towards that end, we decided to use the same checks for the checklist as for the tool. Doing so gave us a clear measure for comparing the two methods: the number of reported violations. The selected checks are listed in Appendix A. It is worth mentioning that it was not possible to use the number of actual defects as the means of comparison. While some checks prove the existence of a defect, like MRM #8 (see Appendix A), most others provide only an indication. Owing to factors such as the absence of sufficient documentation, the lack of feedback from the development team and our limited programming experience, it was in many cases impossible to decide whether a check violation was indeed a defect. Therefore, the comparison was based on the number of violations of the selected checks and not on the number of actual defects. The tool of our choice implements hundreds of checks. We could only select a few, because of the time constraints imposed on us. An additional constraint was our limited experience with the target programming language (C++), which restricted our ability to apply some of the checks. Therefore, only 29 checks were included in the checklist and, as a result, only these were executed with the tool.
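The distinction between a violation that proves a defect and one that merely indicates a possible defect can be illustrated with a small, invented C++ example (none of this code comes from the game). The first function leaks a file handle, which is a defect regardless of the author's intent; the second performs a direct floating-point equality test, which the checklist flags (see Appendix A) but which may be deliberate in a given context.

#include <cstdio>

// A violation that proves a defect: the file handle opened here is never
// closed, so the resource leaks on every call.
void logScore(int score) {
    std::FILE* f = std::fopen("scores.log", "a");
    if (f != nullptr) {
        std::fprintf(f, "score=%d\n", score);
        // Missing std::fclose(f): a definite resource leak.
    }
}

// A violation that only indicates a possible defect: exact floating-point
// comparison is usually unsafe, but can be intentional when the value is
// known to have been assigned exactly (e.g., a 0.0 sentinel).
bool isUnset(double fuel) {
    return fuel == 0.0;   // flagged by the floating-point equality check
}

int main() {
    logScore(42);
    return isUnset(0.0) ? 0 : 1;
}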
suggestions. The result of this discussion is included in Appendix A.
Phase 2: In this phase, the code selection was discussed. Towards that end, different options were considered, as mentioned in section 4. The final decision regarding the strategy was taken in our next meeting, after some search for the code metric of our choice and for a tool that would automate its computation.
Phase 3: During this phase, team A used the checklist and team B used the tool to carry out the inspection. The time allotted for this phase was 7 days.
Phase 4: In the fourth phase, team B used the checklist and team A used the tool to carry out the inspection. The time allotted for this phase was also 7 days.
Phase 5: In the final phase, the two teams regrouped and the data analysis was performed.
5.2.7 Parameters
The important parameters in this experiment are the students' manual testing skills, the code selection strategy, the tool used for the automated code inspection, the checklist, the testing environment, the students' C++ programming skills, and the time available for the execution of the experiment.
5.2.3 Factors
The factor in this experiment is the inspection method, which has two alternatives: checklist-based code inspection (CBI) and tool-based code inspection (TBI).
7. RESULTS
The statistical analysis of the data collected during this experiment gives the following results.
The graph below shows the number of defects found using each technique, along with the performance of each team.
The table above indicates that, even though there were many violations in the inspected parts of the code, only a few of them were of high or highest severity. This suggests that, although the code can be improved considerably, it is not likely to have disastrous consequences for the target audience, the potential gamers. Even so, some code improvement is clearly required, as the highest-severity violations can be harmful.
Table 2: Defects found in each violation type
8. DISCUSSION
In this section we summarize the results, answer the research questions, state the limitations of this research and discuss future work. Research Question 1: How does the use of an automatic tool for code inspection affect the number of detected defects? The table below shows that checklist-based inspection proved better at discovering initialization defects, MISRA defects and violations of coding conventions. Overall, 2.81% more defects were detected using the checklist-based inspection technique than the tool-based inspection technique.
We can see in the graph below (Figure 2) that MRM dominates all the other violation types. Clearly, the run-time behavior of Empire Classic is unpredictable and probably deviates from the intentions of the game creators. It is worth noting, however, that 0 violations were reported for the PB type. A possible explanation is that, since the game is open-source, many bugs may have been reported by players and fixed before version 4.2, the version used in this experiment. No object-oriented defects were found in the game. This is partly because object-oriented features are used very sparingly in the code. We observed only limited use of classes and data encapsulation, while almost no inheritance or polymorphism was used. The team easily came to the conclusion that the code is much more procedural than object-oriented.
This supported hypothesis H1-2: more defects were detected in code inspection done by human inspectors without the help of an automatic tool, and in this experiment CBI therefore performed slightly better than TBI. Research Question 2: How does the use of an automatic tool for code inspection affect the type of detected violations? Hypothesis H2-0 was supported: there is virtually no difference in the types of violations found by the two methods, CBI and TBI. From the table in section 7.3, it is clear that both methods were equally effective across all the violation types.
9. VALIDITY THREATS
Figure 2: Technical types of defects found
In the table above it is clear that, since no checks were of the low severity level, no low-severity violations were reported. On the other hand, even though four checks were at the high severity level, no violations were reported for them either. These checks are of multiple types and no sampling was used for them, so the absence of violations for the high-severity checks was actually a surprise. By far most of the checks were assigned the medium severity, so it was expected that the number of medium-severity violations would be high.
1. Only one out of the 5 students who performed the experiment was competent in C++ before the start of the experiment. Time constraints limited the effectiveness of the C++ training, as mentioned in section 5.2.6. As a result, the technical skills of the team in C++ might have had a negative effect on the quality of the results.
2. No seeding of defects was done, which would have provided a defined target against which to judge performance.
3. The given program had over 34,000 lines of code, which was not well suited to this kind of experiment. Earlier experiments considered smaller code sizes. For example, [10] used two programs, with 58 lines of code and 8 seeded defects and 128 lines of code with 11 seeded defects, respectively.
4. The recommended length of two hours for an inspection meeting could not be followed, due to the size of the code.
5. Maturation (learning) effects concern improvement in the performance of subjects during the experiment.
6. Students, being the subjects of the experiment, cannot be compared with software engineering professionals. As a result, the total inspection time, the inspected code size and the code selection for inspection do not reflect real inspection practice in industry, as the students' choices were limited by the available expertise of the inspectors.
7. The author of the code was not available and, thus, the code selection may not have been as sophisticated as it could have been.
8. The authors could not do much to mitigate the external threats in particular, and hence the generalizability of this experiment is limited.
11. REFERENCES
[1] A Survey of Software Tools for Computational Science. http://www.docstoc.com/docs/81102992/A-Survey-of-Software-Tools-for-Computational-Science. Accessed: 2012-05-18.
[2] Andersson, C. and Runeson, P. 2002. Verification and Validation in Industry - A Qualitative Survey on the State of Practice. Proceedings of the 2002 International Symposium on Empirical Software Engineering (Washington, DC, USA, 2002), 37.
[3] Berner, S. et al. 2005. Observations and lessons learned from automated testing. Proceedings of the 27th International Conference on Software Engineering (New York, NY, USA, 2005), 571-579.
[4] Brothers, L.R. et al. 1992. Knowledge-based code inspection with ICICLE. Proceedings of the Fourth Conference on Innovative Applications of Artificial Intelligence (1992), 295-314.
[5] Fewster, M. and Graham, D. 1999. Software Test Automation. Addison-Wesley Professional.
[6] Gintell, J. et al. 1993. Scrutiny: A Collaborative Inspection and Review System. Proceedings of the 4th European Software Engineering Conference (London, UK, 1993), 344-360.
[7] Gintell, J.W. et al. 1995. Lessons learned by building and using Scrutiny, a collaborative software inspection system. Proceedings of the Seventh International Workshop on Computer-Aided Software Engineering (Jul. 1995), 350-357.
[8] Itkonen, J. et al. 2007. Defect Detection Efficiency: Test Case Based vs. Exploratory Testing. First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007) (Sep. 2007), 61-70.
[9] Juristo, N. and Moreno, A.M. 2001. Basics of Software Engineering Experimentation. Springer.
[10] Kitchenham, B.A. et al. 2002. Preliminary guidelines for empirical research in software engineering. IEEE Transactions on Software Engineering. 28, 8 (Aug. 2002), 721-734.
[11] MacDonald, F. and Miller, J. 1998. A Comparison of Tool-Based and Paper-Based Software Inspection. Empirical Software Engineering. 3, 3 (Sep. 1998), 233-253.
[12] McCabe, T.J. 1976. A Complexity Measure. IEEE Transactions on Software Engineering. SE-2, 4 (Dec. 1976), 308-320.
[13] Munson, J.C. and Khoshgoftaar, T.M. 1992. The Detection of Fault-Prone Programs. IEEE Transactions on Software Engineering. 18, 5 (May 1992), 423-433.
[14] Wikipedia contributors 2012. Automated code review. Wikipedia, the free encyclopedia. Wikimedia Foundation, Inc.
[15] Wikipedia contributors 2011. Imagix 4D. Wikipedia, the free encyclopedia. Wikimedia Foundation, Inc.
[16] Wikipedia contributors 2012. List of tools for static code analysis. Wikipedia, the free encyclopedia. Wikimedia Foundation, Inc.
[17] Yu, Y. et al. 2005. RETR: Reverse Engineering to Requirements. 12th Working Conference on Reverse Engineering (Nov. 2005), 234.
[18] Parasoft C++test Data Sheet. http://www.parasoft.com/jsp/printables/C++TestDataSheet.pdf?path=/jsp/products/cpptest.jsp&product=CppTest
[19] CppCheck Manual. http://cppcheck.sourceforge.net/manual.pdf
[20] Fagan, M.E. 1986. Advances in Software Inspections. IEEE Transactions on Software Engineering. SE-12, 7 (Jul. 1986), 744-751.
Do not declare the size of an array when the array is passed into a function as a parameter.
Bitwise operators, comparison operators, logical operators and the comma operator should be const.
The condition of an if-statement and the condition of an iteration-statement shall have type bool.
Each operand of the ! operator, the logical && or the logical || operators shall have type bool.
Initialization [INIT]
List members in an initialization list in the order in which they are declared.
All classes should contain the assignment operator or an appropriate comment.
Always provide empty brackets ([]) for delete when deallocating arrays.
Assignment operators shall not be used in expressions that yield a Boolean value.
Do not apply arithmetic to pointers that do not address an array or array element.
Floating-point expressions shall not be directly or indirectly tested for equality or inequality.
Do not directly access global data from a constructor.
The definition of a constructor shall not contain default arguments that produce a signature identical to that of the implicitly-declared copy constructor.
Do not call 'sizeof' on constants.
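As a brief illustration of how some of these checks read in practice, the fragment below is an invented example (not from the inspected code) showing a violating form and a compliant form for three of them: the delete[] rule for arrays, the ban on assignment operators inside expressions that yield a Boolean value, and the requirement that the condition of an if-statement have type bool.

#include <cstddef>

// The array parameter is declared without a size, in line with the
// array-parameter check listed above.
void demo(int* flags, std::size_t n) {
    int* scores = new int[n];

    // Violation would be: delete scores;
    // Compliant: always provide empty brackets when deallocating arrays.
    delete[] scores;

    // Violation would be: if (status = flags[0]) { ... }
    // (an assignment inside an expression yielding a Boolean value).
    int status = (n > 0) ? flags[0] : 0;

    // Violation would be: if (status) { ... } since the condition has type int.
    // Compliant: make the condition an expression of type bool.
    if (status != 0) {
        // handle a non-zero status here
    }
}

int main() {
    int flags[3] = {1, 2, 3};
    demo(flags, 3);
    return 0;
}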
It is worth noting that both the types and the checks originate from the tool, i.e., they do not necessarily conform to a known standard such as the Common Weakness Enumeration (CWE). However, for each check the tool does provide references. For example, for the check CODSTA #2, the tool includes two references:
1. Ellemtel Coding Standards, http://www.chris-lott.org/resources/cstyle/Ellemtel-rules-mm.html, 14 Flow Control Structures - Rule 48.
2. Joint Strike Fighter Air Vehicle C++ Coding Standards, Chapter 4.24 Flow Control Structures, AV Rule 194.
Appendix B - Checklist final results Check Type Check No. 1 2 3 4 Severity Level 2 3 3 3 Violations Reported 0 0 0 6 File Name types.hpp criteria.hpp gather.cpp CODSTA 5 1 8 island.hpp types.hpp 6 7 3 3 2 5* types.hpp fly.cpp services.cpp fly.cpp gather.cpp report.cpp 1 INIT 2 3 2 3 0 0 1 3 criteria.hpp types.hpp bomb.cpp criteria.hpp gather.cpp island.hpp newsman.cpp 1 3 30 radar.cpp report.cpp set.hpp ship.hpp tdda.hpp MRM types.hpp bomb.cpp criteria.hpp gather.cpp island.hpp newsman.cpp 2 3 28 radar.cpp report.cpp set.hpp ship.hpp tdda.hpp types.hpp All the classes (except class Island and class Region) in these files failed this check. 72 All the classes in these files failed this check. Line No. 116, 162, 172, 238, 679, 723 50 24 59, 60 119, 174, 175, 176 634, 720 664 369, 384, 649, 725 664 844 18 17 356 3 23 Total Reported Violations
2*
3 4 5
3 3 2
2 3 0
tdda.hpp tdda.cpp fly.cpp newsman.cpp bomb.cpp criteria.cpp fly.cpp island.cpp newsman.cpp tdda.cpp
38 27 629, 886 416 352 298 230 177 389, 393, 414 13 21 269, 540, 1124, 1124, 1128, 1129 378 297, 299 320, 325 748, 961, 1000 58, 63, 402, 404, 406 94, 131 197, 412, 417, 690, 779 280, 311, 667 TOTAL 128 0 0 30
7 8
3 3
0 1
tdda.cpp bomb.cpp
12*
2 MISRA 3
3 3
0* 0*
bomb.cpp course.cpp
18*
OOP
1 1 2
1 2 3 3 3 3
0 0 0 0 0*** 0***
PB
3 4 5
* Sampling used: 10 most complex files.
** Sampling used: 15 most complex functions.
*** Sampling used: 10 most complex files and 15 most complex functions.
Appendix C - Tool final results

Check Type
Check No. 1 2 3 4 5 6
Severity Level 2 3 3 3 1 3
File Name types.hpp island.hpp types.hpp types.hpp types.cpp fly.cpp radar.cpp services.cpp bomb.cpp
Line No. 679, 723 59, 60 50 238, 634, 720 113 628 796 369, 384, 649, 725 1335 664, 885 845 646 277 18 17 382 -
CODSTA 7 3 6*
20
5*
fly.cpp gather.cpp services.cpp criteria.cpp report.cpp criteria.hpp types.hpp bomb.cpp criteria.hpp gather.cpp island.hpp newsman.cpp
1 INIT 2 3
2 3
0 0
30
radar.cpp report.cpp set.hpp ship.hpp tdda.hpp types.hpp bomb.cpp criteria.hpp gather.cpp island.hpp
MRM
72
28
All the classes (except class Island and class Region) in these files failed this check.
tdda.hpp types.hpp 3 4 5 3 3 2 0 5 0 fly.cpp island.cpp newsman.cpp bomb.cpp criteria.cpp 6 3 8 fly.cpp island.cpp newsman.cpp tdda.cpp 7 8 3 3 0 1 tdda.cpp course.cpp genecis.cpp 1 5 12* move.cpp radar.cpp ship.cpp 2 MISRA 3 3 3 0* 0* course.cpp genecis.cpp 4 3 13* radar.cpp services.cpp ship.cpp OOP 1 1 2 PB 3 4 5 1 2 3 3 3 3 0 0 0 0 0*** 0*** 629, 886 194, 381 416 352 298 230 177 389, 393, 414 13 21 401 410, 411 106, 150 280, 370, 377 226, 280, 291, 330 177, 333, 340, 470 279, 616, 618, 631, 643, 654 554 335 312 TOTAL * ** *** Sampling used: 10 most complex files. Sampling used: 15 most complex functions. Sampling used: 10 most complex files and 15 most complex functions. 121 0 0 25
Appendix D - File and function metrics
Appendix E - Summarized results of both techniques

Check Type   Check No.   Severity Level   Manual Inspection (checklist)   Automated Inspection (tool)
CODSTA       1           2                0                               0
CODSTA       2           3                0                               0
CODSTA       3           3                0                               0
CODSTA       4           3                6                               2
CODSTA       5           1                8                               3
CODSTA       6           3                2                               4
CODSTA       7           3                5                               6
CODSTA       8           3                2                               5
INIT         1           1                3                               4
INIT         2           2                0                               0
INIT         3           3                0                               0
MRM          1           3                30                              30
MRM          2           3                28                              28
MRM          3           3                2                               0
MRM          4           3                3                               5
MRM          5           2                0                               0
MRM          6           3                8                               8
MRM          7           3                0                               0
MRM          8           3                1                               1
MISRA        1           5                12                              12
MISRA        2           3                0                               0
MISRA        3           3                0                               0
MISRA        4           3                18                              13
OOP          1           1                0                               0
PB           1           2                0                               0
PB           2           3                0                               0
PB           3           3                0                               0
PB           4           3                0                               0
PB           5           3                0                               0
TOTAL                                     128                             121

For some checks, sampling was applied (10 most complex files, 15 most complex functions, or both), as indicated by the asterisk footnotes in Appendices B and C.