Data Mining For Secure Software Engineering - Source Code Management Tool Case Study
The next step produces a mining algorithm and its supporting tool, based on the mining requirements derived in the
first two steps. In general, mining algorithms fall into four main categories:
Frequent Pattern Mining: Finding commonly occurring patterns.
Pattern Matching: Finding data instances for given patterns.
Clustering: Grouping data into clusters.
Classification: Predicting labels of data based on already labeled data.
The final step transforms the mining algorithm results into the format required to assist the SE task. For
example, in the preprocessing step, a software engineer replaces each distinct method call with a unique symbol in
the sequence database that is fed to the mining algorithm. The mining algorithm then characterizes a frequent
pattern with these symbols. In postprocessing, the engineer changes each symbol back to the corresponding method
call. When applying frequent pattern mining, this step also includes finding locations that match a mined pattern
(for example, to assist in programming or maintenance) and finding locations that violate a mined pattern (for
example, to assist in bug detection).
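The symbol replacement described above can be sketched as follows. This is a minimal illustration, not the tooling used in the cited studies: the class name SymbolTable and its Encode and Decode members are hypothetical, and integers stand in for the unique symbols.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not from the paper): maps each distinct method call
// to an integer symbol before mining, and maps symbols back afterwards.
class SymbolTable {
public:
    // Preprocessing: return the symbol for a call, assigning a new one
    // the first time the call is seen.
    int Encode(const std::string& call) {
        auto it = ids_.find(call);
        if (it != ids_.end()) return it->second;
        int id = static_cast<int>(names_.size());
        ids_[call] = id;
        names_.push_back(call);
        return id;
    }

    // Postprocessing: translate a mined pattern of symbols back into the
    // corresponding method calls. Throws std::out_of_range for unknown symbols.
    std::vector<std::string> Decode(const std::vector<int>& pattern) const {
        std::vector<std::string> calls;
        for (int id : pattern) calls.push_back(names_.at(id));
        return calls;
    }

private:
    std::map<std::string, int> ids_;   // method call -> symbol
    std::vector<std::string> names_;   // symbol -> method call
};
```

A mining algorithm would operate only on the integer sequences produced by Encode; Decode is applied to whatever patterns it reports.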
Text Mining Tool for Program Source Code Debugging
Neglected conditions are an important but difficult-to-find class of software defects. A novel approach for
revealing neglected conditions integrates static program analysis with advanced data mining techniques to
discover implicit conditional rules.
Data mining for legacy requirements: As of now, more than half of new applications are replacements for aging
legacy software applications. Some of these legacy applications may have been in continuous use for more than 25
years. Unfortunately, the software industry is lax in keeping requirements and design documents up to date, so for a
majority of legacy applications, there is no easy way to find out what requirements need to be transferred to the new
replacement. However, some automated tools can examine the source code of legacy applications and extract latent
requirements embedded in the code. These hidden requirements can be assembled for use in the replacement
application. They can also be used to calculate the size of the legacy application in terms of function points, and
thereby can assist in estimating the new replacement application. Latent requirements can also be extracted
manually using formal code inspections, but this is much slower than automated data mining.
Most software engineering data mining studies rely on well-known, publicly available tools such as association rule
mining and clustering. Such black-box reuse of mining tools may compromise the requirements unique to software
engineering by fitting them to the tools' undesirable features. Further, many such tools are general purpose and
should be adapted to assist the particular task at hand. However, software engineering researchers may lack the
expertise to adapt or develop mining algorithms or tools, while data mining researchers may lack the background to
understand mining requirements in the software engineering domain. One promising way to reduce this gap is to foster
close collaborations between the software engineering community (requirement providers) and the data mining
community (solution providers). This research effort represents one such instance.
Writing requirements is a two-way process. Statements from Software Requirements Specification (SRS) documents
are classified as Functional Requirements (FR) and Non-Functional Requirements (NFR). These are systematically
transformed into state charts, considering all relevant information. The test cases can be used for automated or
manual software testing at the system level. Mining methods can be used to reduce the test suite, thereby
facilitating mining and knowledge extraction from test cases.
4. IMPLEMENTATIONS AND VALIDATIONS
Text Mining Source Code Management Tool Case Study
The management of source code is one of the greatest challenges facing programmers today [16-21]. As programs
become larger and more complex, the need to organize and manage source code increases. Our motivation is to
implement source code maintenance routines (with which C++ programmers gain control over their source code)
that parse tokens from an ANSI C++ file, format the file, extract header files, and colorize a file. All of these are
valuable routines because they can be adapted for any computer language and can be easily extended because of their
object-oriented implementation in C++ [16-21].
ISSN : 0975-5462 2669
A.V. Krishna Prasad et. al. / International Journal of Engineering Science and Technology
Vol. 2 (7), 2010, 2667-2677
Source code management involves two operations: analysis and manipulation. The analysis of source code can
yield useful information about the program that may not be readily apparent or easily obtained. For example, when
files are shared among objects, it is difficult to track which files depend on others. A source code maintenance
program can parse the source code and produce documentation that describes each class, its member variables, and
its functions.
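One simple form of this analysis is listing the files that a source file includes, which gives a first view of file dependencies. The sketch below is only an illustration under stated assumptions: the function name ListIncludes is hypothetical, each directive is assumed to sit on one line, and comments, macros and conditional compilation are not handled.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Sketch only (hypothetical function, not the paper's utility): returns the
// file names found in #include directives, in the order they appear.
std::vector<std::string> ListIncludes(const std::string& source) {
    std::vector<std::string> names;
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        // A directive starts with '#' as the first non-blank character.
        std::size_t pos = line.find_first_not_of(" \t");
        if (pos == std::string::npos || line[pos] != '#') continue;

        // Blanks are allowed between '#' and the directive name.
        pos = line.find_first_not_of(" \t", pos + 1);
        if (pos == std::string::npos || line.compare(pos, 7, "include") != 0) continue;

        // The file name is delimited by <...> or "...".
        std::size_t open = line.find_first_of("<\"", pos + 7);
        if (open == std::string::npos) continue;
        char closer = (line[open] == '<') ? '>' : '"';
        std::size_t close = line.find(closer, open + 1);
        if (close == std::string::npos) continue;

        names.push_back(line.substr(open + 1, close - open - 1));
    }
    return names;
}
```

Applying the same function recursively to each listed file would give a dependency list for a whole project, at the cost of resolving include paths, which this sketch does not attempt.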
Source code can be manipulated to conform to standardized styles or to better display program flow. For example,
a particular indentation format can facilitate code sharing among team members. Maintaining consistently structured
code across a team is extremely difficult and time consuming because programmers must modify their
individual styles. One programmer may insist that the curly bracket following an if statement appear on a new line,
while another programmer may prefer appending the bracket to the same line as the if statement. A source code
formatter offers a convenient solution to this problem.
Our focus is to design and implement a source code management program that scans code and outputs it in a slightly
different format. Code maintenance modules receive source code as input, break the code down into tokens, and then
output them in a new format. This output depends on the specific task performed by the module. The utility is based
on three class groups: tokens, scanners, and parsers. A scanner reads the code, breaks it down into tokens, and
returns them to the parser. It also identifies the type of each token it returns. The parser requests successive tokens
from the scanner and takes appropriate action before requesting the next token. The action of the parser is to write out
the token. These modules are generic enough to be adapted to many programming languages with little modification. One utility
scans a C++ file and lists all the file names that have been included in the file being scanned. The second utility
generates a colorized version of the input file in HTML (Hypertext Markup Language). After the input file is parsed,
each token is written out to a new stream surrounded by HTML tags, so that the token types appear in unique colors
within a browser. A third utility indents the lines of the file according to rules defined for the language. This tool is
valuable for configuration management in the maintenance phase because it improves the quality of the
application under consideration.
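The scanner and parser interaction and the HTML colorizing step can be sketched in a few lines. This is a reduced illustration, not the SCodeMnt implementation: the Token struct, the Scanner class, the Colorize function, and the span class names are all assumptions made for the example, and comments, string literals and reserved words are not recognized.

```cpp
#include <cctype>
#include <string>

// Hypothetical, reduced token model; the tool's CToken classes are richer.
enum class TokenType { Word, Number, Punctuation, WhiteSpace, EndOfFile };

struct Token {
    TokenType type;
    std::string text;
};

// Scanner: breaks the source into tokens and identifies each token's type.
class Scanner {
public:
    explicit Scanner(const std::string& source) : src_(source), pos_(0) {}

    Token Next() {
        if (pos_ >= src_.size()) return Token{TokenType::EndOfFile, ""};
        const std::size_t start = pos_;
        const unsigned char c = static_cast<unsigned char>(src_[pos_]);
        TokenType type;
        if (std::isspace(c)) {
            type = TokenType::WhiteSpace;
            SkipWhile([](unsigned char ch) { return std::isspace(ch) != 0; });
        } else if (std::isdigit(c)) {
            type = TokenType::Number;
            SkipWhile([](unsigned char ch) { return std::isdigit(ch) != 0; });
        } else if (std::isalpha(c) || c == '_') {
            type = TokenType::Word;
            SkipWhile([](unsigned char ch) { return std::isalnum(ch) != 0 || ch == '_'; });
        } else {
            type = TokenType::Punctuation;  // one character per token
            ++pos_;
        }
        return Token{type, src_.substr(start, pos_ - start)};
    }

private:
    // Advances the read position while the predicate accepts the character.
    template <typename Pred>
    void SkipWhile(Pred pred) {
        while (pos_ < src_.size() && pred(static_cast<unsigned char>(src_[pos_]))) ++pos_;
    }

    std::string src_;
    std::size_t pos_;
};

// Replaces the characters that have a special meaning in HTML.
std::string EscapeHtml(const std::string& text) {
    std::string out;
    for (char ch : text) {
        switch (ch) {
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '&': out += "&amp;"; break;
            default:  out += ch;
        }
    }
    return out;
}

// Span class per token type (assumed names); empty means "no markup".
const char* ClassName(TokenType type) {
    switch (type) {
        case TokenType::Word:        return "word";
        case TokenType::Number:      return "number";
        case TokenType::Punctuation: return "punct";
        default:                     return "";
    }
}

// Parser: requests successive tokens and writes each one out, wrapped in a
// tag that a style sheet can map to a color.
std::string Colorize(const std::string& source) {
    Scanner scanner(source);
    std::string html;
    for (Token t = scanner.Next(); t.type != TokenType::EndOfFile; t = scanner.Next()) {
        const std::string css = ClassName(t.type);
        if (css.empty()) {
            html += EscapeHtml(t.text);  // white space passes through unchanged
        } else {
            html += "<span class=\"" + css + "\">" + EscapeHtml(t.text) + "</span>";
        }
    }
    return html;
}
```

Keeping the scanner unaware of the output format is what lets the same token stream feed a colorizer, an include lister, or an indenter.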
We have implemented significant code maintenance routines, which can be customized easily for expansion and
enhancement based on further requirements. These routines demonstrate the potential of a source code parser:
because this is a pure object-oriented implementation, they are generic enough to be adapted to many other
programming languages with little modification. This tool can cut maintenance costs considerably. Importantly, C++
programmers find it desirable to identify classes and their invoked member functions across all the identified include
files.
Refer to Figure 1, which provides the sequence diagram of the overall code maintenance process.
Figure 1: The code maintenance process
Refer to Figure 2, which provides a sample class diagram of the CToken hierarchy.
Figure 2: A sample class diagram of the CToken hierarchy
Refer to Table 1, which provides the classes derived from CToken.
Table 1: Token classes derived from CToken
Class Name              Class Description
CEOFToken               End of file.
CEOLToken               End of line.
CWhiteSpaceToken        White space (either a space or a tab character).
CEOLCommentToken        End of line comment.
CCommentToken           Inline comment (for example: /* my comment */).
CStringToken            String literal (for example: "this is a string").
CCharacterToken         Character literal (for example: ';').
CNumericToken           Any numeric value.
CPunctuationToken       Punctuation (for example: { and .).
CWordToken              Any string of characters that does not fit any other category defined in this table.
                        Reserved words are a special type of CWordToken, but they do not have their
                        own class. Variable and function names are types of CWordToken objects.
CLineContinuationToken  This token represents the characters that cause a statement to continue onto the
                        next line (for example: \).
CStatementEndToken      Signifies the end of a statement. For C++ this would be a semicolon. Other
                        languages may not have such a character. Instead, a linefeed is assumed to
                        indicate the end of the statement.
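A class hierarchy of this kind could look roughly like the sketch below. Only the class names come from Table 1; the members GetTypeName and GetText, the type-name strings, and the DescribeTokens helper are assumptions made for illustration, and only three of the derived classes are shown.

```cpp
#include <memory>
#include <string>
#include <vector>

// Base class: stores the token text; derived classes identify the token type.
// The member names here are hypothetical, not taken from the tool.
class CToken {
public:
    explicit CToken(const std::string& text) : m_text(text) {}
    virtual ~CToken() = default;
    virtual std::string GetTypeName() const = 0;
    const std::string& GetText() const { return m_text; }

private:
    std::string m_text;
};

// Reserved words, variable names and function names all use this class.
class CWordToken : public CToken {
public:
    using CToken::CToken;
    std::string GetTypeName() const override { return "word"; }
};

class CNumericToken : public CToken {
public:
    using CToken::CToken;
    std::string GetTypeName() const override { return "numeric"; }
};

// For C++ this is the semicolon.
class CStatementEndToken : public CToken {
public:
    using CToken::CToken;
    std::string GetTypeName() const override { return "statement-end"; }
};

// A parser can treat every token uniformly through the base class.
std::string DescribeTokens(const std::vector<std::unique_ptr<CToken>>& tokens) {
    std::string out;
    for (const auto& token : tokens) {
        if (!out.empty()) out += ' ';
        out += token->GetTypeName() + ":" + token->GetText();
    }
    return out;
}
```

Adding support for a new kind of token then means adding one derived class, without touching code that only uses the CToken interface.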
Refer to Table 2 which provides Valid Formatting Flags
while ( bNotFinished )
{
    bNotFinished = GetNext ( );
}
Note in the above code that eFloat is indented one more than eInteger. This is because the
eInteger line does not end with a semicolon and yet there is no character to assign the
eIndentIgnoreStatementEnd flag to. Of course, you can alter this behavior. A second
caveat concerns array definition and assignment. Because arrays are assigned using
brackets as well, the array will appear on its own line.
int array [ ] =
{
    1,2,3,4
};
For the : token, eIndentStatementEnded is used so that the statement following a case or
default will not be indented.
For the case and default tokens, eIndentDecrement and eIndentAll cause the case and
default lines to appear at the same level as the switch statement.
void myfunc()
{
    switch ( token . GetType( ) )
    {
    case eTokenTypeEOF:
        sOutput += HTML_EOF;
        break;
    default:
        sOutput += token;
    }
}
For the for token, eIndentIgnoreNewLineAfter is set to prevent a new line from occurring
even if a statement end is encountered on the same line. This is necessary because for
statements include two semicolons, which are end-of-statement characters.
void MyFunc ( int repeat )
{
    int i;
    for(i =0;i <repeat; i ++)
    {
        cout << ;
    }
}
One caveat with the for statement: If the original for statement carries over to two lines,
then each part of the autoindented for statement will also appear on its own line.
void MyFunc ( int repeat )
{
    int i;
    for ( i =0;
    i <repeat;
    i ++)
    {
        cout <<;
    }
}
For the private, public, and protected tokens, lines containing these tokens are placed
flush against the margin in the same manner as #define and #include. They therefore have
the eIndentIgnore flag.
class MyClass
{
public :
    MyClass ( );
    ~MyClass( );
protected :
    bool bInitialized;
};
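The flag assignments described in this section can be pictured as a bit mask attached to each token. The sketch below is an assumption about how such flags might be combined, not the tool's actual definitions: the flag names come from the text, but the numeric values, eIndentNone, FlagsFor, and HasFlag are hypothetical.

```cpp
#include <map>
#include <string>

// Flag names follow the text; the bit values are assumed for illustration.
enum EIndentFlags : unsigned {
    eIndentNone               = 0,        // hypothetical "no flags" value
    eIndentIgnore             = 1u << 0,  // place the line flush against the margin
    eIndentDecrement          = 1u << 1,
    eIndentAll                = 1u << 2,
    eIndentStatementEnded     = 1u << 3,
    eIndentIgnoreNewLineAfter = 1u << 4,
    eIndentIgnoreStatementEnd = 1u << 5
};

// Returns the flags for a token, following the assignments described above.
// Tokens that are not listed get no flags.
unsigned FlagsFor(const std::string& token) {
    static const std::map<std::string, unsigned> table = {
        {":",         eIndentStatementEnded},
        {"case",      eIndentDecrement | eIndentAll},
        {"default",   eIndentDecrement | eIndentAll},
        {"for",       eIndentIgnoreNewLineAfter},
        {"private",   eIndentIgnore},
        {"public",    eIndentIgnore},
        {"protected", eIndentIgnore},
        {"#define",   eIndentIgnore},
        {"#include",  eIndentIgnore}
    };
    auto it = table.find(token);
    return it == table.end() ? eIndentNone : it->second;
}

// Tests whether a given flag is set in a flag combination.
bool HasFlag(unsigned flags, EIndentFlags flag) {
    return (flags & flag) != 0;
}
```

A table like this keeps the indentation rules as data, so altering the behavior for a token means changing one entry rather than the formatter logic.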
Refer to Figure 3 which provides Colorization of tokens by SCodeMnt demo1ai.cpp /html
Figure 3. Colorization of tokens by SCodeMnt demo1ai.cpp /html
Refer to Figure 4 which provides Summarization of tokens by SCodeMnt demo1ai.cpp /html
Figure 4. Summarization of tokens by SCodeMnt demo1ai.cpp /html
For details of the implementation, source code, and documentation, please refer to the web site
http://sites.google.com/site/kpresearchgroup
5. CONCLUSION & FUTURE WORK
In modern integrated software engineering environments, software engineers must be able to
collect and mine software engineering data on the fly to provide rapid just-in-time feedback. SE researchers usually
conduct offline mining of data already collected and stored. Stream data mining algorithms and tools could be
adopted or developed to satisfy such challenging mining requirements. Further work on this tool includes the
following extensions. We can colorize or auto-indent other languages by defining new load functions similar to
LoadScannerCPP( ) and LoadFormatCPP( ), so that additional languages can also be displayed in a browser window or
automatically formatted. An even better solution would be to change the program so that it can read a
language-definition file. Using this method, the language definition could be loaded at run time and be data driven,
rather than requiring new code each time a new language is to be parsed. We can also add a function that creates a file
containing each unique token found in a program file. This token file could be used as a custom dictionary by a
word processor, for example. Finally, we can create a cross-reference database that indicates each line where a
particular function or variable is used.
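The cross-reference idea can be sketched as a map from each identifier to the lines on which it occurs. This is a simplified illustration rather than a planned design: the function name BuildCrossReference is hypothetical, and the sketch does not skip comments, string literals or reserved words.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Sketch only: maps each identifier-like word to the 1-based line numbers
// on which it appears.
std::map<std::string, std::set<int>> BuildCrossReference(const std::string& source) {
    std::map<std::string, std::set<int>> xref;
    std::istringstream in(source);
    std::string line;
    int lineNo = 0;
    while (std::getline(in, line)) {
        ++lineNo;
        std::string word;
        // Iterate one position past the end so the last word is flushed.
        for (std::size_t i = 0; i <= line.size(); ++i) {
            const unsigned char c =
                i < line.size() ? static_cast<unsigned char>(line[i]) : ' ';
            if (std::isalnum(c) || c == '_') {
                word += static_cast<char>(c);
            } else {
                // Words that start with a digit are numbers, not identifiers.
                if (!word.empty() && !std::isdigit(static_cast<unsigned char>(word[0]))) {
                    xref[word].insert(lineNo);
                }
                word.clear();
            }
        }
    }
    return xref;
}
```

The keys of the resulting map are also the unique words of the file, so the same pass could produce the token dictionary mentioned above.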
6. REFERENCES
[1] Tao Xie, Suresh Thummalapenta, David Lo, Chao Liu, Data Mining for Software Engineering, IEEE Computer, August 2009, pp. 55-62.
[2] Hamid Abdul Basit, Stan Jarzabek, A Data Mining Approach for Detecting Higher-Level Clones in Software, IEEE Transactions on
Software Engineering, Vol. 35, No. 4, July/August 2009, pp. 497-514.
[3] Ivano Malavolta, Henry Muccini, Patrizio Pelliccione, Damien Andrew Tamburri, Providing Architectural Languages and Tools
Interoperability through Model Transformation Technologies, IEEE Transactions on Software Engineering, Vol. 36, No. 4, January/February
2010, pp. 119-140.
[4] Tao Xie, Jian Pei, Ahmed E. Hassan, Mining Software Engineering Data, IEEE 29th International Conference on Software Engineering,
ICSE 07.
[5] Francisco P. Romero, Jose A. Olivas, Marcela Genero, Mario Piattini, Automatic Extraction of the Main Terminology Used in Empirical
Software Engineering through Text Mining Techniques, ACM ESEM 08, pp. 357-358.
[6] Mohammed J. Zaki, Christopher D. Carothers, Boleslaw K. Szymanski, VOGUE: A Variable Order Hidden Markov Model with Duration Based
on Frequent Sequence Mining, ACM Transactions on Knowledge Discovery from Data, Vol. 4, No. 1, Article 5, January 2010.
[7] Francine Berman, Got Data? A Guide to Data Preservation in the Information Age, Communications of the ACM, Vol. 51, No. 12,
December 2008, pp. 50-56.
[8] Nizar R. Mabroukeh, Christie I. Ezeife, Using Domain Ontology for Semantic Web Usage Mining and Next Page Prediction, ACM CIKM 08,
pp. 1677-1680.
[9] Tim Menzies, Gary D. Boetticher, Smarter Software Engineering: Practical Data Mining Approaches, IEEE/NASA 27th Annual
Software Engineering Workshop, 2002.
[10] Josh Eno, Craig W. Thompson, Generating Synthetic Data to Match Data Mining Patterns, IEEE Internet Computing, May/June 2008,
pp. 78-82.
[11] O. Maqbool, A. Karim, H. A. Babri, M. Sarwar, Reverse Engineering Using Association Rules, IEEE INMIC 2004, pp. 389-395.
[12] Gang Kou, Yi Peng, A Standard for Data Mining Based Software Debugging, IEEE 4th International Conference on Networked
Computing and Advanced Information Management, pp. 149-152.
[13] Qi Wang, Bo Yu, Jie Zhu, Extract Rules from Software Engineering Quality Prediction Model Based on Neural Networks, ICTAI 2004.
[14] Naouel Moha, Yann-Gael Gueheneuc, Laurence Duchien, Anne-Francoise Le Meur, DECOR: A Method for the Specification and
Detection of Code and Design Smells, IEEE Transactions on Software Engineering, Vol. 36, No. 4, January/February 2010, pp. 20-36.
[15] Ray-Yaung Chang, Andy Podgurski, Jiong Yang, Discovering Neglected Conditions in Software by Mining Dependence Graphs, IEEE
Transactions on Software Engineering, Vol. 34, No. 5, September/October 2008, pp. 579-596.
[16] Brian W. Kernighan, Rob Pike, The Practice of Programming, Addison-Wesley, 1999.
[17] Victor R. Volkman, C/C++ Programmers Tools and Libraries: A Developers Resource Kit of C/C++ and Source Code, R&D Books, Miller
Freeman Inc., 1998.
[18] Herbert Schildt, The Art of C++, Tata MGH, 2004.
[19] Steve McConnell, After the Gold Rush: Creating a True Profession of Software Engineering, Microsoft Press, 1999.
[20] C, C++, C#: A Reality Check, Developer IQ Software Technology Magazine, Vol. 5, No. 10, October 2005.
[21] www.cio.com/article/120802/Source_Code_Management_Systems_Trends_Analysis_and_Best_Features (last accessed on 28/06/2010).