A.V. Krishna Prasad et al. / International Journal of Engineering Science and Technology, Vol. 2 (7), 2010, 2667-2677, ISSN: 0975-5462

Data Mining for Secure Software Engineering: Source Code Management Tool Case Study

A.V. Krishna Prasad*, Dr. S. Rama Krishna1

* Associate Professor, Department of Computer Science, MIPGS, Hyderabad, A.P., India
Email: [email protected]
1 Professor, Department of Computer Science, S.V. University, Tirupathi, A.P., India
Email: [email protected]

Abstract
Because data mining for secure software engineering improves software productivity and quality, software engineers
are increasingly applying data mining algorithms to various software engineering tasks. However, mining software
engineering data poses several challenges and requires various algorithms to effectively mine sequences, graphs and
text from such data. Software engineering data includes code bases, execution traces, historical code changes,
mailing lists and bug databases. This data contains a wealth of information about a project's status, progress and
evolution. Using well-established data mining techniques, practitioners and researchers can explore the potential of
this valuable data to better manage their projects and to produce higher-quality software systems that are
delivered on time and within budget. Data mining can be used to gather and extract latent security requirements,
to extract algorithms and business rules from code, and to mine legacy applications for requirements and business
rules for new projects. Mining algorithms for software engineering fall into four main categories: frequent pattern
mining (finding commonly occurring patterns); pattern matching (finding data instances for given patterns);
clustering (grouping data into clusters); and classification (predicting labels of data based on already labeled
data). In this paper, we give an overview of strategies for data mining for secure software engineering and present
a case study that implements text mining for a source code management tool.
Keywords: Data Mining, Software Engineering, Source Code Management, Text Mining
1. INTRODUCTION
Data Mining for Software Engineering: To improve software productivity and quality, software engineers are
increasingly applying data mining algorithms to various software engineering tasks [1-15]. However, mining
software engineering data poses several challenges and requires various algorithms to effectively mine sequences,
graphs and text from such data. Software engineering data includes code bases, execution traces, historical code
changes, mailing lists and bug databases. This data contains a wealth of information about a project's status,
progress and evolution. Using well-established data mining techniques, practitioners and researchers can explore the
potential of this valuable data to better manage their projects and to produce higher-quality software systems
that are delivered on time and within budget. Data mining can be used to gather and extract latent security
requirements, to extract algorithms and business rules from code, and to mine legacy applications for
requirements and business rules for new projects. [1-5]
Mining algorithms for software engineering fall into four main categories:
Frequent pattern mining: finding commonly occurring patterns.
Pattern matching: finding data instances for given patterns.
Clustering: grouping data into clusters.
Classification: predicting labels of data based on already labeled data.
Software engineering data can be broadly categorized into:
Sequences, such as execution traces collected at runtime, static traces extracted from source code, and
co-changed code locations. Examples of mining algorithms used here are frequent itemset, sequence and
partial-order mining, and sequence matching, clustering and classification. Examples of software engineering
tasks here are programming, maintenance, bug detection and debugging.
Graphs, such as dynamic call graphs collected at runtime and static call graphs extracted from source code.
Examples of mining algorithms used here are frequent subgraph mining and graph matching, clustering and
classification. Examples of software engineering tasks here are bug detection and debugging.
Text, such as bug reports, e-mails, code comments and documentation. Examples of mining algorithms
used here are text matching, clustering and classification. Examples of software engineering tasks here are
maintenance, bug detection and debugging.
2. RESEARCH PROBLEM DEFINITION
The research problems addressed here concern mining software engineering data, in particular the implementation of
program source code mining tools that improve software debugging, and the related challenges. Strategies for
debugging-oriented mining include text mining, sequence mining and graph mining. [6-10]
Key research questions addressed are:
1. What types of Software Engineering Data are available to be mined?
2. Which Software Engineering Tasks can be helped using Data Mining?
3. How are Data Mining techniques used in Software Engineering?
4. What are the Challenges in applying Data Mining techniques to Software Engineering data?
5. Which Data Mining techniques are most suitable for specific types of Software Engineering data?
The importance of software is increasing in scientific research and in our daily life. Meanwhile, the cost and
consequences of software failures caused by bugs are becoming more and more serious. This research emphasizes
a standard process for data mining based software debugging. The proposed process provides guidelines for
software testing engineers and researchers on how to apply data mining techniques and software testing theory to
real-life software testing projects. A data mining based software debugging project is a five-step process: establish
the software testing project; collect, clean and transform the data; select, train and verify the data mining
models; classify, locate and describe the software bugs found in the previous steps; and deploy the knowledge gained
into a real-life software testing project.

3. OBJECTIVES OF THE RESEARCH WORK
The objective of the research work is to propose strategic data mining tools for program source code debugging that
improve software reliability and quality. The mining algorithms work on software engineering data such as
sequences, graphs and text, and support the corresponding software engineering tasks: programming, maintenance, bug
detection and debugging for sequences; bug detection and debugging for graphs; and maintenance, bug detection and
debugging for text. Initially a source code management tool is implemented, and finally data mining tools are
implemented for debugging Open Web API mining. [11-15]
Software engineers can start with either a problem-driven approach or a data-driven approach, but in practice they
commonly adopt a mixture of the first two steps: collecting and investigating the data to mine, and determining the
SE tasks to assist. The three remaining steps are, in order, preprocessing the data, adopting, adapting or developing
a mining algorithm, and postprocessing and applying the mining results.
Preprocessing data involves first extracting the relevant data from the raw SE data, for example static method call
sequences or call graphs from source code, dynamic method call sequences or call graphs from execution traces, or
word sequences from bug report summaries. This data is further processed by cleaning and properly formatting it for
the mining algorithm. For example, the input format for sequence data can be a sequence database where each
sequence is a series of events.
The next step produces a mining algorithm and its supporting tool, based on the mining requirements derived in the
first two steps. In general, mining algorithms fall into four main categories:
Frequent pattern mining: finding commonly occurring patterns.
Pattern matching: finding data instances for given patterns.
Clustering: grouping data into clusters.
Classification: predicting labels of data based on already labeled data.
The final step transforms the mining algorithm's results into the format required to assist the SE task. For
example, in the preprocessing step, a software engineer replaces each distinct method call with a unique symbol in
the sequence database that is fed to the mining algorithm. The mining algorithm then characterizes a frequent
pattern in terms of these symbols. In postprocessing, the engineer changes each symbol back to the corresponding
method call. When applying frequent pattern mining, this step also includes finding locations that match a mined
pattern (for example, to assist in programming or maintenance) and finding locations that violate a mined pattern
(for example, to assist in bug detection).
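The symbol substitution described above can be sketched in C++ as follows. This is a minimal illustration, not the paper's implementation: the class SymbolTable and the functions EncodeSequence and DecodePattern are hypothetical names, and integer symbols are an assumed encoding.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: maps each distinct method call to an integer symbol
// and back again. All names here are illustrative assumptions.
class SymbolTable {
public:
    // Preprocessing: return the symbol for a call, assigning a new one if needed.
    int Encode(const std::string& call) {
        auto it = ids_.find(call);
        if (it != ids_.end()) return it->second;
        int id = static_cast<int>(names_.size());
        ids_[call] = id;
        names_.push_back(call);
        return id;
    }

    // Postprocessing: translate a symbol back to the method call it stands for.
    const std::string& Decode(int symbol) const { return names_.at(symbol); }

private:
    std::map<std::string, int> ids_;   // method call -> symbol
    std::vector<std::string> names_;   // symbol -> method call
};

// Encode one call sequence as a row of a sequence database.
std::vector<int> EncodeSequence(SymbolTable& table,
                                const std::vector<std::string>& calls) {
    std::vector<int> out;
    for (const auto& call : calls) out.push_back(table.Encode(call));
    return out;
}

// Decode a mined pattern (a list of symbols) back into method-call names.
std::vector<std::string> DecodePattern(const SymbolTable& table,
                                       const std::vector<int>& pattern) {
    std::vector<std::string> out;
    for (int symbol : pattern) out.push_back(table.Decode(symbol));
    return out;
}
```

A mining algorithm would operate only on the integer rows; the same table is then reused to turn each mined pattern back into readable method calls.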
Text Mining Tool for Program Source Code Debugging
Neglected conditions are an important but difficult-to-find class of software defects. One approach for revealing
neglected conditions integrates static program analysis with advanced data mining techniques to discover implicit
conditional rules.
Data mining for legacy requirements: As of now, more than half of new applications are replacements for aging
legacy software applications. Some of these legacy applications may have been in continuous use for more than 25
years. Unfortunately, the software industry is lax in keeping requirements and design documents up to date, so for a
majority of legacy applications, there is no easy way to find out what requirements need to be transferred to the new
replacement. However, some automated tools can examine the source code of legacy applications and extract latent
requirements embedded in the code. These hidden requirements can be assembled for use in the replacement
application. They can also be used to calculate the size of the legacy application in terms of function points, and
thereby can assist in estimating the new replacement application. Latent requirements can also be extracted
manually using formal code inspections, but this is much slower than automated data mining.
Most software engineering data mining studies rely on well-known, publicly available tools such as association rule
mining and clustering. Such black-box reuse of mining tools may compromise the requirements unique to software
engineering by fitting them to the tools' undesirable features. Further, many such tools are general purpose and
should be adapted to assist the particular task at hand. However, software engineering researchers may lack the
expertise to adapt or develop mining algorithms or tools, while data mining researchers may lack the background to
understand mining requirements in the software engineering domain. One promising way to reduce this gap is to foster
close collaboration between the software engineering community (requirement providers) and the data mining
community (solution providers). This research effort represents one such instance.
Writing requirements is a two-way process. Requirement statements from Software Requirements Specification (SRS)
documents are classified as Functional Requirements (FR) and Non-Functional Requirements (NFR), and are then
systematically transformed into state charts that take all relevant information into account. The resulting test
cases can be used for automated or manual software testing at the system level. Mining methods can be used to reduce
the test suite, thereby facilitating mining and knowledge extraction from the test cases.
4. IMPLEMENTATIONS AND VALIDATIONS
Text Mining Source Code Management Tool Case Study
The management of source code is one of the greatest challenges facing programmers today. [16-21] As programs
become larger and more complex, the need to organize and manage source code increases. Our motivation is to
implement source code maintenance routines (with which C++ programmers gain control over their source code)
that parse tokens from an ANSI C++ file, format the file, extract header files and colorize the file. All of these
are valuable routines because they can be adapted to any computer language and can be easily extended thanks to
their object-oriented implementation in C++. [16-21]
Source code management involves two operations: analysis and manipulation. The analysis of source code can
yield useful information about the program that may not be readily apparent or easily obtained. For example, when
files are shared among objects, it is difficult to track which files depend on others. A source code maintenance
program can parse the source code and produce documentation that describes each class, its member variables and
its functions.
Source code can be manipulated to conform to standardized styles or to better display program flow. For example,
a particular indentation format can facilitate code sharing among team members. Maintaining consistently structured
code across a team is extremely difficult and time consuming because programmers must modify their
individual styles. One programmer may insist that the curly bracket following an if statement appear on a new line,
while another programmer may prefer appending the bracket to the same line as the if statement. A source code
formatter offers a convenient solution to this problem.
Our focus is to design and implement a source code management program that scans code and outputs it in a slightly
different format. Code maintenance modules receive source code as input, break the code down into tokens and then
output them in a new format. This output depends on the specific task performed by the module. The utility is based
on three class groups: tokens, scanners and parsers. A scanner reads the code, breaks it down into tokens, identifies
the type of each token and returns the tokens to the parser. The parser requests successive tokens
from the scanner and takes the appropriate action on each before requesting the next one; in the simplest case, the
action is to write the token out. These modules are generic enough to be adapted to many programming languages with
little modification. One utility scans a C++ file and lists all the filenames that are included in the file being
scanned. A second utility generates a colorized version of the input file in HTML (Hypertext Markup Language): after
the input file is parsed, each token is written to a new stream surrounded by HTML tags, so that the token types
appear in distinct colors within a browser. A further utility indents the lines of the file according to rules
defined for the language. This tool is valuable for configuration management in the maintenance phase, as it improves
the quality of the application under consideration.
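The division of labor between scanner and parser can be sketched as follows. This is a simplified illustration under stated assumptions, not the tool's code: the types Token, TokenType and Scanner and the function ListIncludes are hypothetical, and the scanner deliberately ignores comments and string literals, so a '#' inside either would be misread.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative token kinds (assumed; far fewer than the tool's token classes).
enum class TokenType { Word, Number, Punctuation, WhiteSpace, EndOfLine, EndOfFile };

struct Token {
    TokenType type;
    std::string text;
};

// Hypothetical scanner: hands out one token at a time from a source string.
class Scanner {
public:
    explicit Scanner(const std::string& src) : src_(src), pos_(0) {}

    Token GetNextToken() {
        if (pos_ >= src_.size()) return {TokenType::EndOfFile, ""};
        unsigned char c = static_cast<unsigned char>(src_[pos_]);
        if (c == '\n') { ++pos_; return {TokenType::EndOfLine, "\n"}; }
        if (std::isspace(c))
            return Take(TokenType::WhiteSpace,
                        [](unsigned char ch) { return ch != '\n' && std::isspace(ch) != 0; });
        if (std::isalpha(c) || c == '_')
            return Take(TokenType::Word,
                        [](unsigned char ch) { return std::isalnum(ch) != 0 || ch == '_'; });
        if (std::isdigit(c))
            return Take(TokenType::Number,
                        [](unsigned char ch) { return std::isdigit(ch) != 0; });
        ++pos_;  // anything else is a single punctuation character
        return {TokenType::Punctuation, std::string(1, static_cast<char>(c))};
    }

private:
    using Pred = bool (*)(unsigned char);

    // Consume characters while the predicate holds and wrap them in a token.
    Token Take(TokenType type, Pred keep) {
        std::size_t start = pos_;
        while (pos_ < src_.size() && keep(static_cast<unsigned char>(src_[pos_]))) ++pos_;
        return {type, src_.substr(start, pos_ - start)};
    }

    std::string src_;
    std::size_t pos_;
};

// Hypothetical parser: requests tokens one by one and collects the file names
// that follow "#include", for both <...> and "..." forms.
std::vector<std::string> ListIncludes(const std::string& source) {
    Scanner scanner(source);
    std::vector<std::string> files;
    auto next = [&scanner]() {  // next token that is not white space
        Token t = scanner.GetNextToken();
        while (t.type == TokenType::WhiteSpace) t = scanner.GetNextToken();
        return t;
    };
    for (Token t = next(); t.type != TokenType::EndOfFile; t = next()) {
        if (t.text != "#") continue;
        if (next().text != "include") continue;
        Token open = next();
        if (open.text != "<" && open.text != "\"") continue;
        const std::string close = (open.text == "<") ? ">" : "\"";
        std::string name;
        for (t = scanner.GetNextToken();
             t.type != TokenType::EndOfFile && t.type != TokenType::EndOfLine &&
             t.text != close;
             t = scanner.GetNextToken()) {
            name += t.text;  // rebuild the file name from its tokens
        }
        files.push_back(name);
    }
    return files;
}
```

The parser never inspects raw characters; it only reacts to token text and type, which is what allows the same loop structure to be reused for formatting or colorizing.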
We have implemented significant code maintenance routines, which can be customized easily for expansion and
enhancement based on further requirements. They demonstrate the potential of a source code parser: because this is a
purely object-oriented implementation, the routines are generic enough to be adapted to many other programming
languages with little modification. The tool can reduce maintenance costs considerably. Importantly, C++
programmers find it desirable to identify the classes and their invoked member functions across all the identified
include files taken together.
Figure 1 shows the sequence diagram of the overall code maintenance process.


Figure 1: The code maintenance process
Figure 2 shows a sample class diagram of the CToken hierarchy.
Figure 2: A sample class diagram of the CToken hierarchy
Table 1 lists the classes derived from CToken.
Table 1. Token Classes Derived from CToken
Class Name                Class Description
CEOFToken                 End of file.
CEOLToken                 End of line.
CWhiteSpaceToken          White space (either a space or a tab character).
CEOLCommentToken          End-of-line comment.
CCommentToken             Inline comment (for example: /* my comment */).
CStringToken              String literal (for example: "this is a string").
CCharacterToken           Character literal (for example: ';').
CNumericToken             Any numeric value.
CPunctuationToken         Punctuation (for example: { and .).
CWordToken                Any string of characters that does not fit any other category defined in this table.
                          Reserved words are a special type of CWordToken, but they do not have their own class.
                          Variable and function names are types of CWordToken objects.
CLineContinuationToken    The characters that cause a statement to continue onto the next line (for example: \).
CStatementEndToken        Signifies the end of a statement. For C++ this is a semicolon. Other languages may not
                          have such a character; instead, a linefeed is assumed to indicate the end of the
                          statement.
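A few of the classes in Table 1 can be sketched as a small C++ hierarchy. The sketch is an assumption about how such a hierarchy might look: only the class names, GetType( ) and eTokenTypeEOF appear in the paper, while the remaining enumerators, the member m_text and the helper ConcatenateWords are hypothetical.

```cpp
#include <memory>
#include <string>
#include <vector>

// Token type tags. eTokenTypeEOF appears in the paper's sample code;
// the other enumerators are assumed for illustration.
enum ETokenType { eTokenTypeEOF, eTokenTypeEOL, eTokenTypeWord, eTokenTypePunctuation };

// Base class: every token stores its text and reports its type.
class CToken {
public:
    explicit CToken(const std::string& text = "") : m_text(text) {}
    virtual ~CToken() = default;
    virtual ETokenType GetType() const = 0;
    const std::string& GetText() const { return m_text; }

private:
    std::string m_text;
};

class CEOFToken : public CToken {
public:
    ETokenType GetType() const override { return eTokenTypeEOF; }
};

class CEOLToken : public CToken {
public:
    CEOLToken() : CToken("\n") {}
    ETokenType GetType() const override { return eTokenTypeEOL; }
};

class CWordToken : public CToken {
public:
    explicit CWordToken(const std::string& text) : CToken(text) {}
    ETokenType GetType() const override { return eTokenTypeWord; }
};

class CPunctuationToken : public CToken {
public:
    explicit CPunctuationToken(const std::string& text) : CToken(text) {}
    ETokenType GetType() const override { return eTokenTypePunctuation; }
};

// A parser can treat all tokens uniformly through the base-class interface.
// This hypothetical helper joins the text of every word token.
std::string ConcatenateWords(const std::vector<std::unique_ptr<CToken>>& tokens) {
    std::string out;
    for (const auto& token : tokens) {
        if (token->GetType() == eTokenTypeWord) out += token->GetText();
    }
    return out;
}
```

Supporting a new token category then means adding one derived class, without touching code that only uses the CToken interface.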

Table 2 lists the valid formatting flags.
Table 2. Valid Formatting Flags
Enum Name                    Enum Description
eIndentNone                  Does not cause an indent.
eIndentAll                   Indent all the NEXT lines (until an eIndentDecrement token is encountered).
eIndentIgnore                Do not indent the CURRENT line at all when this token is encountered. This is used
                             for tokens such as #include in C++.
eIndentIgnoreStatementEnd    Do not increase the indent even if the statement on the previous line was not
                             reported as ended.
eIndentDecrement             Decrement the indent count for the CURRENT and FOLLOWING lines.
eIndentStatementEnded        Indicates that the text ends a statement or that the statement has ended.
eIndentLineContinuation      Extend the statement to the next line. The following line should be indented as
                             though the statement on the previous line was not completed.
eIndentNewLineBefore         Put this token onto a new line.
eIndentNewLineAfter          Put a new line after this token.
eIndentIgnoreNewLineAfter    If this token appears before a token with eIndentNewLineAfter, then ignore the
                             eIndentNewLineAfter flag (for example, to prevent a linefeed occurring inside a
                             C++ for statement).

Table 3 lists the format strings for C++.
Table 3. Format Strings for C++
Format String                 Assigned enum flags
#                             eIndentIgnore
{                             eIndentAll | eIndentIgnoreStatementEnd | eIndentStatementEnded
                              | eIndentNewLineBefore | eIndentNewLineAfter
}                             eIndentStatementEnded | eIndentDecrement
                              | eIndentIgnoreStatementEnd | eIndentNewLineBefore
:                             eIndentStatementEnded
case, default                 eIndentDecrement | eIndentAll
for                           eIndentIgnoreNewLineAfter
private, protected, public    eIndentIgnore
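Because Table 3 combines flags with the | operator, the flags of Table 2 are presumably bit values. The following sketch shows one way to encode them and to load the C++ format strings. The numeric values, the FormatTable alias, the helpers FlagsFor and HasFlag, and the signature given here to LoadFormatCPP (a function the paper names but does not show) are all assumptions.

```cpp
#include <map>
#include <string>

// Bit flags mirroring Table 2; the numeric values are assumptions.
enum EIndentFlag : unsigned {
    eIndentNone               = 0,
    eIndentAll                = 1u << 0,
    eIndentIgnore             = 1u << 1,
    eIndentIgnoreStatementEnd = 1u << 2,
    eIndentDecrement          = 1u << 3,
    eIndentStatementEnded     = 1u << 4,
    eIndentLineContinuation   = 1u << 5,
    eIndentNewLineBefore      = 1u << 6,
    eIndentNewLineAfter       = 1u << 7,
    eIndentIgnoreNewLineAfter = 1u << 8
};

// Maps a format string (token text) to its combined flags.
using FormatTable = std::map<std::string, unsigned>;

// Assumed shape for the paper's LoadFormatCPP(): fill the table from Table 3.
FormatTable LoadFormatCPP() {
    FormatTable table;
    table["#"] = eIndentIgnore;
    table["{"] = eIndentAll | eIndentIgnoreStatementEnd | eIndentStatementEnded |
                 eIndentNewLineBefore | eIndentNewLineAfter;
    table["}"] = eIndentStatementEnded | eIndentDecrement |
                 eIndentIgnoreStatementEnd | eIndentNewLineBefore;
    table[":"] = eIndentStatementEnded;
    table["case"] = eIndentDecrement | eIndentAll;
    table["default"] = eIndentDecrement | eIndentAll;
    table["for"] = eIndentIgnoreNewLineAfter;
    table["private"] = eIndentIgnore;
    table["protected"] = eIndentIgnore;
    table["public"] = eIndentIgnore;
    return table;
}

// Look up the flags for a token; unknown tokens carry no formatting flags.
unsigned FlagsFor(const FormatTable& table, const std::string& token) {
    auto it = table.find(token);
    return it == table.end() ? static_cast<unsigned>(eIndentNone) : it->second;
}

// Test whether one flag is present in a combined value.
bool HasFlag(unsigned flags, EIndentFlag flag) { return (flags & flag) != 0; }
```

A formatter for another language would supply its own load function that fills the same kind of table, leaving the indenting logic unchanged.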
Table 4 explains the format strings and the flags assigned to them.

TABLE 4. Format Strings and Flags
String: #
Flags assigned: The character used to define precompile items such as #define and #endif. These lines always start
in the first column, so they should not be indented at all. The indent flag is therefore set to eIndentIgnore.
Example:
    // Library include files
    #include <iostream>

String: {
Flags assigned: Used to start a block of code. eIndentNewLineBefore combined with eIndentNewLineAfter causes the
bracket to appear on its own line. eIndentIgnoreStatementEnd causes the line starting with this bracket not to be
indented even if the previous line did not end. The bracket could follow a while statement, for example, where the
line containing the while statement does not end in a semicolon; eIndentIgnoreStatementEnd allows the bracket to
appear directly below the beginning of the while statement. eIndentStatementEnded causes the line ending with a
bracket to be considered a complete statement.
Example:
    while ( token.GetType ( ) != eTokenTypeEOF )
    {
        cout << token;
        token = scanner . GetNextToken ( );
    }

String: }
Flags assigned: Decrements the indent of the current line and all following lines because of the eIndentDecrement
flag, which returns the indent level to the level before the opening bracket appeared. eIndentNewLineBefore causes
the close bracket to start on a new line. Note that the eIndentNewLineAfter flag is not set, because it would cause
closing comments or a semicolon on the same line to be shifted to the following line. eIndentIgnoreStatementEnd is
used so that even if the previous line did not end, the bracket is not indented. This is needed when the bracket is
used in an enum declaration, because there are no semicolons to indicate that a statement ended within the enum.
The last flag, eIndentStatementEnded, is used so that the line following the close bracket is not indented.
Example: consider the following fragment.
    enum EType
    {
        eInteger,
            eFloat,
            eDouble
    };
    while ( bNotFinished )
    {
        bNotFinished = GetNext ( );
    }
Note in the above code that eFloat is indented one level more than eInteger. This is because the eInteger line does
not end with a semicolon, and yet there is no character to which the eIndentIgnoreStatementEnd flag could be
assigned. Of course, you can alter this behavior. A second caveat concerns array definition and assignment. Because
arrays are assigned using brackets as well, the array will appear on its own line.
    int array [ ] =
    {
        1,2,3,4
    };

String: :
Flags assigned: eIndentStatementEnded is used so that the statement following a case or default will not be
indented.

String: case, default
Flags assigned: eIndentDecrement and eIndentAll cause the case and default lines to appear at the same level as the
switch statement.
Example:
    void myfunc()
    {
        switch ( token . GetType( ) )
        {
        case eTokenTypeEOF:
            sOutput += HTML_EOF;
            break;
        default:
            sOutput += token;
        }
    }

String: for
Flags assigned: eIndentIgnoreNewLineAfter is set to prevent a new line from occurring even if a statement end is
encountered on the same line. This is necessary because for statements include two semicolons, which are
end-of-statement characters.
Example:
    void MyFunc ( int repeat )
    {
        int i;
        for ( i = 0; i < repeat; i ++ )
        {
            cout << ;
        }
    }
One caveat with the for statement: if the original for statement carries over onto two or more lines, then each part
of the autoindented for statement will also appear on its own line.
    void MyFunc ( int repeat )
    {
        int i;
        for ( i = 0;
        i < repeat;
        i ++ )
        {
            cout << ;
        }
    }

String: private, public, protected
Flags assigned: Lines containing these tokens are placed flush against the margin in the same manner as #define and
#include. They therefore have the eIndentIgnore flag.
Example:
    class MyClass
    {
    public :
        MyClass ( );
        ~MyClass( );
    protected :
        bool bInitialized;
    };
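The effect of three of these rules (eIndentAll for an opening bracket, eIndentDecrement for a closing bracket, and eIndentIgnore for # lines and access specifiers) can be sketched with a much simplified, line-based re-indenter. This is not the tool's token-based algorithm: the functions Trim, IsAccessSpecifier and Reindent are hypothetical, the input is assumed to already have each bracket on its own line or at the end of a line, and statement-continuation handling is omitted.

```cpp
#include <cstddef>
#include <initializer_list>
#include <sstream>
#include <string>

// Strip leading and trailing blanks from one line.
std::string Trim(const std::string& s) {
    const char* ws = " \t\r";
    std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// True for lines such as "public :" or "protected:".
bool IsAccessSpecifier(const std::string& text) {
    for (const char* keyword : {"public", "private", "protected"}) {
        const std::string k(keyword);
        if (text.compare(0, k.size(), k) == 0) {
            std::size_t i = text.find_first_not_of(" \t", k.size());
            if (i != std::string::npos && text[i] == ':') return true;
        }
    }
    return false;
}

// Re-indent source text line by line using a simplified subset of the rules.
std::string Reindent(const std::string& source, int width = 4) {
    std::istringstream in(source);
    std::string line;
    std::string out;
    int level = 0;
    while (std::getline(in, line)) {
        const std::string text = Trim(line);
        if (text.empty()) { out += "\n"; continue; }
        // Like eIndentDecrement: a closing bracket lowers the current line too.
        if (text[0] == '}' && level > 0) --level;
        // Like eIndentIgnore: these lines stay flush against the margin.
        const bool flush = text[0] == '#' || IsAccessSpecifier(text);
        const std::size_t spaces = flush ? 0 : static_cast<std::size_t>(level * width);
        out += std::string(spaces, ' ') + text + "\n";
        // Like eIndentAll: an opening bracket indents all following lines.
        if (text.back() == '{') ++level;
    }
    return out;
}
```

Applied to the class fragment above with all indentation removed, Reindent indents the member declarations by one level and leaves the access specifiers at the margin.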
Figure 3 shows the colorization of tokens produced by SCodeMnt demo1ai.cpp /html.

Figure 3. Colorization of tokens by SCodeMnt demo1ai.cpp /html
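The colorizing step, in which each token is written out surrounded by HTML tags, can be sketched as below. The TokenKind categories, the color choices and the use of span elements with inline styles are assumptions made for illustration; the paper does not state which tags or colors SCodeMnt emits.

```cpp
#include <string>

// Illustrative token categories (assumed, not the tool's class list).
enum class TokenKind { Word, Keyword, Number, Comment, StringLiteral, Punctuation };

// Escape the characters that would otherwise be read as HTML markup.
std::string EscapeHtml(const std::string& text) {
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            default:  out += c;
        }
    }
    return out;
}

// Pick a display color per token kind (arbitrary choices).
const char* ColorFor(TokenKind kind) {
    switch (kind) {
        case TokenKind::Keyword:       return "blue";
        case TokenKind::Number:        return "purple";
        case TokenKind::Comment:       return "green";
        case TokenKind::StringLiteral: return "maroon";
        case TokenKind::Punctuation:   return "gray";
        default:                       return "black";
    }
}

// Wrap one token in a colored span so a browser shows each kind distinctly.
std::string Colorize(TokenKind kind, const std::string& text) {
    return std::string("<span style=\"color:") + ColorFor(kind) + "\">" +
           EscapeHtml(text) + "</span>";
}
```

Escaping matters here because C++ source is full of < and > characters (for example in #include <iostream>), which a browser would otherwise interpret as tags.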
Figure 4 shows the summarization of tokens produced by SCodeMnt demo1ai.cpp /html.

Figure 4. Summarization of tokens by SCodeMnt demo1ai.cpp /html
For details of the implementation, source code and documentation, please refer to the web site
http://sites.google.com/site/kpresearchgroup

5. CONCLUSION & FUTURE WORK
In modern integrated software engineering environments, software engineers must be able to collect and mine software
engineering data on the fly to provide rapid, just-in-time feedback, whereas SE researchers usually conduct offline
mining of data that has already been collected and stored. Stream data mining algorithms and tools could be
adopted or developed to satisfy such challenging mining requirements. The tool itself can be extended in several
ways. We can colorize or auto-indent other languages by defining new load functions similar to LoadScannerCPP( ) and
LoadFormatCPP( ), so that additional languages can also be displayed in a browser window or automatically
formatted. An even better solution would be to change the program so that it reads a language-definition file.
With this method, the language definition is loaded at run time and is data driven, so no code has to be written
each time a new language is to be parsed. We can also add a function that creates a file containing each unique
token found in a program file; this token file could be used, for example, as a custom dictionary by a word
processor. Finally, we can create a cross-reference database that indicates each line where a particular function
or variable is used.
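The proposed cross-reference database can be sketched as a map from each word to the set of lines on which it occurs. This is a rough illustration rather than a planned design: the CrossReference alias and the function BuildCrossReference are hypothetical, and the sketch splits on characters directly instead of using the tool's scanner, so keywords and words inside comments or string literals are recorded as well.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Maps each word to the (1-based) line numbers where it appears.
using CrossReference = std::map<std::string, std::set<int>>;

CrossReference BuildCrossReference(const std::string& source) {
    CrossReference xref;
    std::istringstream in(source);
    std::string line;
    int lineNo = 0;
    while (std::getline(in, line)) {
        ++lineNo;
        std::string word;
        // Iterate one position past the end so the last word is flushed.
        for (std::size_t i = 0; i <= line.size(); ++i) {
            const unsigned char c =
                i < line.size() ? static_cast<unsigned char>(line[i]) : ' ';
            if (std::isalnum(c) || c == '_') {
                word += static_cast<char>(c);
                continue;
            }
            // A word ended: record it unless it starts with a digit (a number).
            if (!word.empty() && !std::isdigit(static_cast<unsigned char>(word[0]))) {
                xref[word].insert(lineNo);
            }
            word.clear();
        }
    }
    return xref;
}
```

The keys of the resulting map also form the list of unique tokens mentioned above, so the same pass could feed the custom dictionary file.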
6. REFERENCES
[1] Tao Xie, Suresh Thummalapenta, David Lo, Chao Liu, Data Mining for Software Engineering, IEEE Computer, August 2009, pp. 55-62.
[2] Hamid Abdul Basit, Stan Jarzabek, A Data Mining Approach for Detecting Higher-Level Clones in Software, IEEE Transactions on
Software Engineering, Vol. 35, No. 4, July/August 2009, pp. 497-514.
[3] Ivano Malavolta, Henry Muccini, Patrizio Pelliccione, Damien Andrew Tamburri, Providing Architectural Languages and Tools
Interoperability through Model Transformation Technologies, IEEE Transactions on Software Engineering, Vol. 36, No. 1, January/February
2010, pp. 119-140.
[4] Tao Xie, Jian Pei, Ahmed E. Hassan, Mining Software Engineering Data, IEEE 29th International Conference on Software Engineering,
ICSE 07.
[5] Francisco P. Romero, Jose A. Olivas, Marcela Genero, Mario Piattini, Automatic Extraction of the Main Terminology Used in Empirical
Software Engineering through Text Mining Techniques, ACM ESEM 08, pp. 357-358.
[6] Mohammed J. Zaki, Christopher D. Carothers, Boleslaw K. Szymanski, VOGUE: A Variable Order Hidden Markov Model with Duration based
on Frequent Sequence Mining, ACM Transactions on Knowledge Discovery from Data, Vol. 4, No. 1, Article 5, January 2010.
[7] Francine Berman, Got Data? A Guide to Data Preservation in the Information Age, Communications of the ACM, Vol. 51, No. 12,
December 2008, pp. 50-56.
[8] Nizar R. Mabroukeh, Christie I. Ezeife, Using Domain Ontology for Semantic Web Usage Mining and Next Page Prediction, ACM CIKM 08,
pp. 1677-1680.
[9] Tim Menzies, Gary D. Boetticher, Smarter Software Engineering: Practical Data Mining Approaches, IEEE/NASA 27th Annual
Software Engineering Workshop, 2002.
[10] Josh Eno, Craig W. Thompson, Generating Synthetic Data to Match Data Mining Patterns, IEEE Internet Computing, May/June 2008,
pp. 78-82.
[11] O. Maqbool, A. Karim, H.A. Babri, M. Sarwar, Reverse Engineering using Association Rules, IEEE INMIC 2004, pp. 389-395.
[12] Gang Kou, Yi Peng, A Standard for Data Mining based Software Debugging, IEEE 4th International Conference on Networked
Computing and Advanced Information Management, pp. 149-152.
[13] Qi Wang, Bo Yu, Jie Zhu, Extract Rules from Software Engineering Quality Prediction Model based on Neural Networks, ICTAI 2004.
[14] Naouel Moha, Yann-Gaël Guéhéneuc, Laurence Duchien, Anne-Françoise Le Meur, DECOR: A Method for the Specification and
Detection of Code and Design Smells, IEEE Transactions on Software Engineering, Vol. 36, No. 1, January/February 2010, pp. 20-36.
[15] Ray-Yaung Chang, Andy Podgurski, Jiong Yang, Discovering Neglected Conditions in Software by Mining Dependence Graphs, IEEE
Transactions on Software Engineering, Vol. 34, No. 5, September/October 2008, pp. 579-596.
[16] Brian W. Kernighan, Rob Pike, The Practice of Programming, Addison-Wesley, 1999.
[17] Victor R. Volkman, C/C++ Programmers Tools and Libraries: A Developers Resource Kit of C/C++ and Source Code, R&D Books, Miller
Freeman Inc., 1998.
[18] Herbert Schildt, The Art of C++, Tata MGH, 2004.
[19] Steve McConnell, After the Gold Rush: Creating a True Profession of Software Engineering, Microsoft Press, 1999.
[20] C, C++, C#: A Reality Check, Developer IQ Software Technology Magazine, Vol. 5, Number 10, October 2005.
[21] www.cio.com/article/120802/Source_Code_Management_Systems_Trends_Analysis_and_Best_Features (Last accessed on 28/06/2010)