20-Design Recovery For Distributed Systems
20-Design Recovery For Distributed Systems
20-Design Recovery For Distributed Systems
Abstract—Two factors limit the utility of reverse engineering technology for many distributed software systems. First, with the
exception of tools that support Ada and its explicit tasking constructs, reverse engineering tools fail to capture information
concerning the flow of information between tasks. Second, relatively few reverse engineering tools are available for programming
languages in which many older legacy applications were written (e.g., Jovial, CMS-2, and various assembly languages). In this
paper, we describe approaches that were developed for overcoming these limitations. In particular, we have implemented an
approach for automatically extracting task flow information from a command and control system written in CMS-2. Our approach
takes advantage of a small amount of externally provided design knowledge in order to recover design information relevant to the
distributed nature of the target system.
—————————— ✦ ——————————
1 INTRODUCTION
is evaluated within the calling environment of each task of documentation is available as a starting point. Additionally,
the module containing the RTOS call. This evaluation returns the tasks that are contained in a module is a very general
a value for the specific variables together with the calling piece of information and does not change much in practice.
environment in which these values were computed. These The detailed information that we extract, the actual task flow
values identify the task spawned by a particular RTOS call from task to task within and outside of a module, is an ex-
and the calling environment identifies the task spawning the ample of information that changes much more often.
new task. In this section we will explain the overall strategy This documentation is written in English, but it is fairly
M-CLUE uses to automatically extract this information and structured. Thus, with a minimal amount of editing, we
explain the approach in the tool for recognizing the relevant automatically parsed the documentation, using a recursive-
design constructs in the MCE source code.
descent parser. The parser automatically created module
2.1 Design Construct Recognition and task objects for each module and task mentioned in the
The purpose of design construct recognition is to identify documentation. The module and task names and the list of
instances of the key architectual features implemented in a relevant files for each module were also set automatically
software system, and their inter-relationships. We created a during parsing.
domain model that identifies a small set of design con- After obtaining as much information about modules and
structs in the MCE code: tasks, modules, and a small set of tasks as possible from the documentation, we examined the
events through which tasks interact with each other (e.g., source code to complete the model. Each module has an
task spawning). associated entry procedure. Because the documentation only
One of the difficulties in determining task flows is in identifies the name of the file containing a module entry
determining what comprises a task. This is because no procedure, we needed to find this procedure from the
syntactic structure exists in CMS-2 that corresponds to a source code. This is done by generating the procedure call-
task. Additionally, procedures can belong to multiple tasks. ing hierarchy of the module. The module entry procedure is
Therefore one cannot simply read the source code to de- the root procedure in the procedure calling hierarchy. To
termine the task containing a particular RTOS call. How- avoid confusing orphan procedures for the root procedure,
ever, using knowledge of how tasks were implemented in the root of the largest disconnected subgraph is used. The
MCE, our design recovery system can determine all of the module entry procedure in the MCE system contains a
necessary design constructs through the use of a set of rec- CMS-2 construct called a p-switch. The p-switch is very
ognition rules. The current implementation hard codes rec- much like a case statement. It passes control to the task root
ognition rules for these design constructs. Each recognition procedure of a particular task, depending upon the value of
rule creates an instance of an abstract design construct or the p-switch variable. A task continues executing in MCE
determines the value of one of its attributes. until the root procedure terminates. Therefore, from the p-
Because it is impossible to recognize these abstract de- switch we were able to determine the names of the task root
sign constructs from information contained solely in the procedures for each of the tasks in that module.
source code, we implemented recognition rules that operate
on design documentation in addition to the ones that oper- 2.2 Determining the Task Called by RTOS
ate on the parsed representation of the source code. The use Once modules, tasks and their entry procedures have been
of noncode sources for reverse engineering has been ex- recognized, it is possible to determine the behavior of each
plored by other researchers [22]. task by identifying and interpreting the RTOS system calls
A portion of the on-line documentation for MCE de- used in a task. Each of the events produced via an RTOS
scribed each of the modules of the system. Fig. 2 is an ex- call is represented in our domain model. We implemented
ample of one specification. For each module, the module’s event recognition algorithms that identify occurrences of
name (first line, third token) and a list of tasks was enumer- these events. The first step is to find all of the RTOS calls in
ated (third column). A list of files relevant to the module the source code. This is easy to do by traversing the parse
(second column) and the file that contained the entry pro- tree of the program and testing for the name RTOS in each
cedure for the module were also identified (file name on procedure call object encountered.
same row as “Entry” in third column). The next step is to evaluate the value of the arguments
used by RTOS to determine the task behavior the RTOS call
MODULE ATC.ATATS ATA 20 represents. An RTOS call has two arguments, the type of
ATAFS-231016 ATAFSM Entry the RTOS call and a table (a CMS-2 data structure) con-
“ ATAPRE
taining information for that type of call. The fields in the
“ ATAPOS
“ ATASHUT
table vary depending on the type of RTOS call invoked. For
“ ATARST each RTOS call identified in the code, the first argument
“ ATAOAM identifying the type of RTOS call is accessed and the ap-
“ ATAMON propriate event object is created to represent the event. For
“ ATACAL example, if the RTOS call schedules a task its first argument
“ ATASI will indicate that this is a REGISTER type of call.
Fig. 2. The specification of the module ATA in the documentation. The task scheduled by an RTOS call is uniquely deter-
mined by a set of arguments passed to RTOS via the call.
Although this paper argues that documentation can These arguments identify a module and the task it contains.
quickly become out of date, resulting in the need for auto- A module/task pair uniquely identifies a task in the MCE
matic methods to keep it current—it is important to use what system. In order to determine the task spawned by an
464 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 23, NO. 7, JULY 1997
RTOS call, it is necessary to determine the state of these two common programming languages, because all variables in a
variables at the particular point in the program when an CMS-2 program, including formal arguments of proce-
RTOS call is made. In some cases determining the value of dures, are global therefore, procedure invocation does not
these variables is relatively straightforward because these introduce a new scope.
values are initialized and remain constant throughout the For any particular RTOS call, module and task variables
execution of that module. may assume different values in different execution contexts.
When a variable is not initialized, its state can be deter- Because a program slice contains the set of all statements that
mined by identifying the relavent program slice, the set of may influence the state of a variable, only a subset of these
statements that may impact the value of that variable at statements may actually be executed under a particular con-
that point in the code. The slice is then evaluated to yield text. In order to evaluate the program slice under each possible
the state of the variables. For our purposes, this technique context, each syntactically possible execution thread within a
assumed that the state of a module and task variable were module that may reach a designated RTOS call is evaluated.
completely determined within the scope of a task because Each of these evaluations could derive a distinct value for the
the program slicing algorithm does not trace data depend- module and task variables for the particular RTOS call.
encies across task boundaries. This assumption was valid The inputs given to the evaluator are: a list of the variables for
for most of the modules in the MCE system. which values are requested; the statement of interest, i.e., the
statement in the slice for which the variable values should be
2.3 Determining the Calling Task determined. The execution of a slice is guided by a pre-order
The identification of the root-procedure for each task pro- traversal of the procedure calling hierarchy. As it is traversed,
vides the knowledge necessary for identifying the context each procedure encountered may contain some statements
of an RTOS call. A specific RTOS call may be made within found in the slice. The statements are evaluated in the order in
different contexts. Each calling environment contains a root which they occur in the procedure (i.e., statements are sorted by
procedure that corresponds to the calling task. Therefore, line number), and their values are saved for use in other com-
determining the task that spawned a new task requires de- putations of the evaluator. When the statement of interest (the
termining the calling environment for a particular RTOS RTOS call in this case) is encountered, the values of the variables
call passed a specific set of module/task values. of interest are checkpointed. If the statement is encountered
again (see Fig. 3), the values at that time are also checkpointed.
2.4 Program Slicing and Evaluation Sets of checkpointed values, together with the calling environ-
A program slice on some variable v, or set of variables, at ment for each set, are returned from the evaluator.
statement n consists of those statements that contribute to
the value of v just before statement n is executed. By using
program slices, we substantially reduce the number of
source code lines to be considered by the tool.
An example of a program slice for one of the RTOS calls
is below. In this slice, procedure PMASS calls procedure
PMASDS to fill the slots of the table, PMMN and PMTN, with
the correct module and task numbers. They are filled with
CMGDMN and CMATTN, respectively which are named con-
stants for 13 and 6 so that no additional statements are
needed in the slice. Upon return from PMASDS, PMASS
calls the operating system (RTOS). The first argument is
Fig. 3. Two different execution contexts use the same RTOS call with
REGISTER, which indicates that the request is to schedule a
different values.
task. The second argument is the address of the table with
the pertinent information for performing this request.
Given the value for the module/task pair and the calling
Notice that this is an interprocedural slice, where the
environment, it is possible to compute the calling tasks and
module and task numbers are set in a subsidiary procedure.
the called tasks of the RTOS call. A calling task is one whose
As the line numbers following the procedure names indi-
entry procedure is contained in the calling environment. A
cate, this program contains many thousands of lines of
called task is the one that corresponds to the module/task
code, yet these are the only lines necessary to set up and
pair. If the module is the same as the one under investigation,
spawn this particular task.
the task number is an index into the module’s entry proce-
PMASDS.14744: SET IDRTSR(0, PMMN) TO CMGDMN dure p-switch statement, which can be thought of as a list of
PMASDS.14745: SET IDRTSR(0, PMTN) TO CMATTN all intramodule task entry procedures. If the module number
PMASS.13428: PMASDS corresponds to another module, then information from the
PMASS.13433: RTOS INPUT REGISTER, CORAD (IDRTSR)
documentation is used to determine the name of the task so
OUTPUT PMDUM
that it can be displayed in a task flow graph or table.
Our tool contains a simple CMS2 evaluator that is used
to determine the possible values of the table fields used by 2.6 Recognizing other Services
1
an RTOS system call by “executing” the slice. Evaluation of In addition to scheduling new tasks, other services are pro-
a slice is made somewhat easier in CMS-2 than in other vided via calls to the RTOS operating system. The method
outlined above also works for these services. The process of
1. Our program slicer does not generate a compilable slice. finding the calling task of the RTOS call is the same as out-
HOLTZBLATT ET AL.: DESIGN RECOVERY FOR DISTRIBUTED SYSTEMS 465
TABLE 1
REUSE STATISTICS
Analysis 68,000 Average Reuse
Capability Core C CMS-2 Ada Fortran Assembly ( percent)
Data Flow 2,638 861 915 745 not ported 447 78.4
Control Flow 1,021 171 287 in Refine/Ada 148 97 85.5
Orphans 56 - - not ported not ported not ported 100*
466 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 23, NO. 7, JULY 1997
TABLE 2
MANUAL VS. AUTOMATICALLY GENERATED GRAPHS
Edges in Edges in
Manually Automatically Edges in Intersection A-B B-A
Module Generated Graph Generated Graph Common ( percent) ( percent) ( percent)
ATA 19 21 18 85.7 5.3 14.3
CMU 49 49 45 91.8 8.2 8.2
CXU 13 13 10 76.9 23.1 23.1
DMA 9 6 6 66.7 33.3 0.0
DMB 9 9 8 88.9 11.1 11.1
DRA 41 42 41 97.6 0.0 2.4
DSD 96 130 87 66.9 9.4 33.1
DSE 37 57 29 50.9 21.6 49.1
PMA 56 56 54 96.4 3.6 3.6
SMB 19 19 19 100.0 0.0 0.0
SMD 5 5 5 100.0 0.0 0.0
SME 36 36 29 80.6 19.4 19.4
SMF 44 36 16 36.4 63.6 55.6
SMG 18 18 17 94.4 5.6 0
SRC 35 33 33 94.3 5.7 0.0
SRD 39 40 39 97.5 0 2.5
SRE 25 22 22 88.0 12.0 0.0
SRG 9 9 9 100.0 0.0 0.0
To compare an automatically generated and a manually the work of Murphy et al. on the empirical study of call
generated graph requires a common representation. There graph extractors [23]. This work gives the following equa-
must be a way to specify the manually generated graphs in tions to compare graphs pairwise.
a form that is readable by a computer, so that they can be
| Edges in Graph from Method A © Edges in Graph from Method B| * 100
put into this common representation. For this reason, a
simple graph specification language was developed. This max(|Edges in Graph from Method A|, |Edges in Graph from Method B|)
language is used to enter the relationship between tasks This equation generates the percentage of edges found
and other RTOS services. Essentially, it is a way to specify by both methods.
each edge of the graph. The nodes are the various tasks, or
other RTOS services. Together, a set of specified edges is |Edges in Graph from Method A Edges in Graph from Method B| * 100
used to form the graph. Edges in Graph from Method A
As an example to specify the task spawning relationship This equation generates the percentage of edges found
between task ATA5 and DSE9, the following specification by method A but not method B.
would be used:
|Edges in Graph from Method B Edges in Graph from Method A| * 100
(def-rtos-info Edges in Graph from Method B
:caller-module ATA
:caller-task-number 5 This equation generates the percentage of edges found
:called-module DSE by method B but not method A.
:called-task-number 9) The higher the intersection statistic, the more similar the
To specify that a particular io-request was made by a graphs are: 100 percent means the graphs share all the same
module, the following specification would be used: edges. It is desirable for the two set difference statistics to
be as close to 0 percent as possible: 0 percent means that
(def-rtos-info one method did not find any extra edges that were not
:type *rtos-io-request-call
found by the other method.
:caller-module SCB
:caller-task-number 11 Three modules did not adhere to the recognition rules
:file-number 50) developed via expert knowledge (see Section 3.3.3), so they
are not included in any of the results below. Table 2 con-
tains the comparison between the manual and automati-
3.2 Results cally generated graphs.
In the tables below, we compare three different graphs for We investigated each discrepancy to determine if there
each module. The first two are the manually and automati- was an error in the manually generated graphs or in the
cally generated graphs. The third is a graph that we created algorithm used to automatically compute the graphs. This
from investigating the discrepancies between the two involved investigating the program slices and the differ-
graphs. This third graph, called the “corrected” graph can ent execution threads found by the automatic method.
be thought of as being closer to “ground truth” of what the Table 3 expresses the results with respect to that analysis.
actual task flow graph is for each module. It is impossible In addition, the results in Table 3 include some modules
to state absolutely which method is better in terms of the 2
which were run in interactive mode. In this mode, the user
actually implemented task flow in the code, since that in-
formation is never truly available. Another way to measure
the results is to compare the methods to each other using 2. This was a new version of M-CLUE that was applied to only a few modules.
468 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 23, NO. 7, JULY 1997
TABLE 3
“CORRECTED” VS. AUTOMATICALLY GENERATED GRAPHS
Edges in Edges in Automatically Intersection A-B B-A
Module Corrected Graph Generated Graph Edges in Common ( percent) ( percent) ( percent)
ATA 21 21 20 95.2 4.8 4.8
CMU 49 49 49 100.0 0.0 0.0
CXU 13 13 13 100.0 0.0 0.0
DMA 6 6 6 100.0 0.0 0.0
DMB 9 9 9 100.0 0.0 0.0
DRA 42 42 42 100.0 0.0 0.0
DSD 99 130 95 73.1 26.9 26.9
DSE 42 57 34 59.6 40.4 40.4
PMA 56 56 56 100.0 0.0 0.0
SMB 19 19 19 100.0 0.0 0.0
SMD 5 5 5 100.0 0.0 0.0
SME 36 36 28 77.8 22.2 22.2
SMF 44 36 29 65.9 19.4 34.1
SMG 18 18 18 100.0 0.0 0.0
SRC 33 33 33 100.0 0.0 0.0
SRD 40 40 40 100.0 0.0 0.0
SRE 25 22 22 88.0 0.0 12.0
SRG 9 9 9 100.0 0.0 0.0
TABLE 4
“CORRECTED” VS. MANUAL GRAPHS
Edges in Edges in Manually Edges in Intersection A-B (B - A)
Module Corrected Graph Generated Graph Common ( percent) ( percent) ( percent)
ATA 21 19 19 90.5 0.0 9.5
CMU 49 49 49 100.0 0.0 0.0
CXU 13 13 13 100.0 0.0 0.0
DMA 6 9 6 66.7 33.3 0.0
DMB 9 9 9 100.0 0.0 0.0
DRA 42 41 41 97.6 0.0 2.4
DSD 99 96 96 96.0 0.0 3.0
DSE 42 37 37 88.1 0.0 11.9
PMA 56 56 56 100.0 0.0 0.0
SMB 19 19 19 100.0 0.0 0.0
SMD 5 5 5 100.0 0.0 0.0
SME 36 36 36 100.0 0.0 0.0
SMF 44 44 44 100.0 0.0 0.0
SMG 18 18 18 100.0 0.0 0.0
SRC 33 35 33 100.0 5.7 0.0
SRD 40 39 39 97.5 0.0 2.5
SRE 25 25 25 100.0 0.0 0.0
SRG 9 9 9 100.0 0.0 0.0
TABLE 5
AVERAGE STATISTICS OF PAIRWISE COMPARISON
Method Intersection A-B B-A
Manual / Automatic 84.1 12.3 12.7
Corrected / Automatic 92.2 6.3 5.3
Manual / Corrected 96.5 0.39 1.6
was asked to type in the value of a variable if the evaluator the graphs actually constructed program slices on paper, or
was unable to compute one for it. This can potentially solve in their heads. However, the software is too complex to con-
some of the problems caused when variables get their val- struct the correct slices without the aid of automatic tools.
ues from outside the module (see Section 3.3.1). Both the automatic method and the manual method
Table 5 summarizes the results by averaging the statis- found graphs which were about 90 percent the same as the
tics over the set of modules. “corrected” graph, itself not a true measure as described
Notice that the manual method of generating the graphs above. Since the automatic graphs are much less time-
was much more error-prone than the automatic method. This consuming to create than the manual ones, the automatic
is borne out by comparing the Manual/Automatic and the method is much more desirable to use. The automatic
Corrected/Automatic statistics. Once the discrepencies were method also has the potential to get better, if certain limita-
accounted for, the automatic graphs were more similar to the tions in the algorithms are overcome (see Section 3.4). The
corrected graphs than the manual graphs. This shows the manual method is always subject to human error—which is
power of generating program slices automatically. Although likely, given the complex data flow analysis that is needed.
they probably were unaware, the individuals who generated Lastly, the graphs produced by the manual method quickly
HOLTZBLATT ET AL.: DESIGN RECOVERY FOR DISTRIBUTED SYSTEMS 469
become out of date, yet the automatic method can be used ELSE BEGIN
to easily recreate the automatic graphs whenever the code SET IDDMSRD(0,MODULE) TO SMGSWPMN $
changes. Lastly, the automatic method generated 11 of the SET IDDMSRD(0,TASK) TO SMGRDCTN
END $
18 graphs with a 100 percent intersection statistic and 0 per-
cent set difference statistics. The statements
3.3.2 Current Limitations of the Implementation VARY VSUINDX FROM 0 THRU SMNRDR - 1
IF GTMOCU(VSUINDX,RDRNO) EQ RADARIFF THEN
Some of the results are incorrect because of current limita- BEGIN
tions of the implementation. Most of these limitations could SET IDRIOPKT(0,BUSADDR) TO VSUINDX + 567
be eliminated by enhancing the implementation of the pro- RTOS INPUT IOREQUES, CORAD(IDRIOPKT)
gram slicer, CMS-2 source code evaluator, or the task flow OUTPUT DUMMY
graph generator. The following explains these limitations in END “VARY Loop”
more detail. only the RTOS call that would have been executed the
1) The slicing algorithm does not currently include test first time through the loop will be evaluated. There-
statements in a conditional as part of a slice. As a re- fore, many I/O requests are missing.
sult, a slice may not be correctly evaluated. For exam- 3) Impossible Paths. Certain control flow paths are not
ple, consider the following MCE code: possible. For instance, consider two consecutive con-
ditional statements with contradictory tests. Because
(from procedure SMARDSCN):
IF GVMODE EQ MODIFYSM the conditionals are not evaluated this problem could
OR GVMODE EQ PRINTSM arise. This is a special case of 1.
THEN BEGIN IF A THEN
SET IDDMSRD(0,MODULE) TO SMAGENMN $ SET MNUM TO 5
SET IDDMSRD(0,TASK) TO SMARDCTN ELSE
END
470 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 23, NO. 7, JULY 1997
was expressed procedurally in M-CLUE. Each rule is repre- designs/modularizations as in the modularization tool de-
sented declaratively with information about its purpose, scribed in [19]. In general, design recovery tools need to take
pre-conditions, parameterization, effects, and rule methods. advantage of extra-linguistic design information such as that
Rule methods are largely language/operating-system inde- provided in our design recovery rules.
pendent descriptions that look up specialized constructs in
knowledge-based templates. These templates are automati-
cally loaded into the analysis based on language/operating 6 CONCLUSIONS
system dependencies for the system under analysis. While Currently available reverse engineering tools are limited
M-CLUE itself looks explicitly for RTOS calls, our broader to extracting information that is represented explicitly in
framework retrieves models for underlying services (e.g., a
the source code. Therefore, higher-level abstractions of the
model for Unix contains the names of any operating system
code, such as its distributed behavior and the interprocess
function that will achieve a task spawning action).
communication and control, that are not explicitly repre-
As in the M-CLUE work, all the recognizers contain
sented and so cannot be captured. Explicit constructs for
some form of knowledge that enables them to quickly
these abstractions are often not available in legacy lan-
identify code fragments of interest (e.g., RTOS calls), while
guages. This situation is still worse because no basic re-
avoiding a top-down approach requiring full recovery of all
verse engineering tools are generally available for legacy
complex interwoven patterns.
languages.
Using our broader framework, analysts freely select rec-
We have developed M-CLUE a tool that recovers this in-
ognizers to recover multiple as-built views of legacy sys-
formation from a multiple process system written in CMS-
tems. We have linked default recognizers to find features of
2. It is based on CLUE, our general purpose reverse engi-
specific architecture styles, but the analyst is free to select
neering tool for CMS-2. M-CLUE uses expert knowledge of
alternative recovery paths.
the operating system and the architectual design of the tar-
In this work, we are demonstrating that multiview ar-
get system, in addition to the source code. Using this in-
chitecture recovery is practical and effective and can be
formation in the form of recognition rules, and advanced
used to analyze and document large-scale legacy systems.
software methods such as program slicing, we derived the
task flow graphs for each process of the system. When
5 RELATED WORK compared to graphs contructed by analysts, they were on
the average 84 percent correct, and 92 percent correct when
Our approach is influenced by work that recognizes the
we altered the analysts’ graphs with information suggested
organizational power of software architecture, techniques
by the automatic results.
for program understanding and concept recognition, and
By applying powerful program analysis capabilities in
the enabling technology of reverse engineering.
concert with recognition rules derived from some basic
Software architectures as described in [20], [21] provide
system design knowledge, a significant level of system de-
the organizational patterns that drive the distributed design
sign recovery can be achieved. The information is derived
recovery system. In this case, the architecture we analyzed
directly from the source code and can be traced back to that
and recognized was a fairly simple distributed system ar-
source code. As software baselines change, design recovery
chitecture in which modules and tasks are the relevant ar-
can be reapplied to produce current design information,
chitectural objects and the key architectural event is the
yielding a form of “living” documentation that can reliably
spawning of tasks.
aid in program maintenance and understanding.
Traditional program understanding work has focused on
cliché-based recognition [14], [15]. Generally, this is a bot-
tom-up recognition approach in which the program is ACKNOWLEDGMENT
matched to a set of pre-defined plans/clichés from a li-
brary. Recognition is based on a precise data and control All work was done at the MITRE Corporation under Air
flow match which indicates that the recognized source Force Contract No. F19628-94-C-0001.
component is precisely the same as the library template.
Our approach is more of a top-down hypothesis driven REFERENCES
recognition coupled with bottom-up recognition rules. Our [1] T. Biggerstaff, “Design Recovery for Maintenance and Reuse,” Com-
recognition rules do not require algorithmic equivalence of puter, July 1989.
the plan and the source being matched, rather they are [2] T. Biggerstaff, J. Hoskins, and D. Webster, “DESIRE: A System for
Design Recovery,” MCC Technical Report, STP-081-89, May 1989.
based on source code level events [15] in the code. The ex- [3] R. Rock-Evans and K. Hales, “Reverse Engineering: Markets, Meth-
istence of patterns of these events is sufficient to establish a ods and Tools,” vol. 1, Ovum Ltd., 1990.
match. Quilici [16] also explores a mixed top-down, bot- [4] C. Sittenauer, M. Olsem, and D. Murdock, “Re-engineering Tools
Report,” technical report, Air Force Software Technology Support
tom-up recognition approach using traditional plan defini- Center, 1992.
tions. The style of source code event-based recognition [5] H. Reubenstein, R. Piazza, and S. Roberts, “Separating Parsing and
rules is also exemplified in [17], [18] which demonstrates a Analysis in Reverse Engineering Tools,” Proc. Working Conf. on Re-
combination of precise control and data flow relation rec- verse Eng., May 1993.
[6] L.J. Holtzblatt, R.L. Piazza, and S.N. Roberts, “Design Recovery
ognition and more abstract code event recognition. Technology for Real-Time Systems,” Mitre Technical Report
Design recovery work, such as DESIRE [1], relies on: ex- 95B0000057, July 1995.
ternally supplied cues regarding program structure, modu- [7] M. Weiser, “Program Slicing,” IEEE Trans. Software Eng., vol. 10, no.
4, July 1984.
larization heuristics, manual assistance and informal infor- [8] N. Chapin, “Real-Time Software Maintenance,” Proc. Eighth Int’l
mation. Informal information and heuristics can also be used Conf. Software Maintenance and Re-Engineering, Aug. 1991.
to reorganize and automatically refine recovered software
472 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 23, NO. 7, JULY 1997
[9] C. Castells-Schofield, “Engineering a Language-Independent Ap- Richard L. Piazza (M’93) holds a BS degree in
proach to Parsing for Analysis and Testing,” Vitro Technical J., vol. 8, computer science from the State University of
no. 1, 1990. New York at Albany and an MS degree from the
[10] P. Devanbu, “GENOA—A Customizable, Language and Front-End University of Maryland at College Park. He re-
Independent Code Analyzer,” Proc. 14th Int’l Conf. Software Eng., cently joined Reasoning, Inc. as a principal
May 1992. member of the technical staff. Previously, he was
[11] S. Roberts, “Reverse Engineering Assembly Language Programs,” with the MITRE Corporation, which he joined in
Proc. Fourth Reengineering Forum, vol. 2, Sept. 1994.
1985. At MITRE he worked on projects in auto-
[12] Reasoning Systems, “Refine User Guide,” May 1990.
[13] Reasoning Systems, “Refine/C User’s Guide,” Mar. 1992. matic programming; the programming of under-
[14] C. Rich and L. Wills, “Recognizing a Program’s Design: A Graph standing and software reverse engineering; ma-
Parsing Approach,” IEEE Software, vol. 7, no. 1, 1990. chine learning; constraint-based reasoning, and
[15] M. Harandi and J. Ning, “Knowledge-Based Program Analysis,” model-based diagnosis. His current research interests center on
IEEE Software, vol. 7, no. 1, 1990. building various reverse engineering tools. His main contribution was
[16] A. Quilici, “A Hybrid Approach to Recognizing Program Plans,” the Language Independent Method for constructing reverse engineer-
Proc. Working Conf. Reverse Eng., May 1993. ing tools. This method facilitates building tools for different computer
[17] W. Kozaczynski, J. Ning, and T. Sarver, “Program Concept Recog- languages, through sharing language independent analysis routines.
nition,” Proc. Seventh Ann. Knowledge-Based Software Eng. Conf., 1992. He is a member of the IEEE. He has been an instructor in Boston Uni-
[18] A. Engberts, W. Kozaczynski, and J. Ning, “Concept Recognition- versity’s Metropolitan College Computer Science Department since
Based Program Transformation,” Proc. IEEE Conf. Software Mainte- 1985.
nance, 1991.
[19] R. Schwanke, “An Intelligent Tool for Re-Engineering Software
Modularity,” Proc. 13th Int’l Conf. Software Eng., 1991.
[20] M. Shaw, “Larger Scale Systems Require Higher-Level Abstrac- Howard B. Reubenstein (S‘85–M‘90) received
tions,” Proc. Fifth Int’l Workshop Software Specification and Design, the SB and SM degrees in computer science in
1989. 1985 and the PhD degree in computer science in
[21] D. Perry and A. Wolf, “Foundations for the Study of Software Ar- 1990 for his work on the Requirements Appren-
chitecture,” ACM Software Eng. Notes, vol. 17, no. 4, 1992. tice, all from the Massachusetts Institute of
[22] P. Lutsky, “Automatic Testing by Reverse Engineering of Software Technology, Cambridge. The work presented in
Documentation,” Proc. Second Working Conf. Reverse Eng., July 1995. this paper was performed while at the MITRE
[23] G. Murphy, D. Notkin, and E. Lan, “An Empirical Study of Static Corporation as a lead scientist in a Knowledge-
Call Graph Extractors,” Univ. of Washington, Computer Science Based Software Engineering group. He is cur-
Dept, Technical Report 95-08-01. rently employed at Reasoning Inc., developing
[24] D. Garlan and M. Shaw, An Introduction to Software Architecture,“ commerical software reverse engineering and
Tutorial 15th Int’l Conf. Software Eng., 1993. transformation tools. Dr. Reubenstein is a past-general chair and pro-
[25] M. Shaw, “Larger Scale Systems Require Higher-Level Abstrac-
gram committee member of the Automated Software Engineering
tions,” Proc. Fifth Int’l Workshop Software Specification and Design,
1989. Conference, and currently on the Program Committee of the Fourth
[26] D. Perry and A. Wolf, “Foundations for the Study of Software Ar- Working Conference on Reverse Engineering. He is a member of the
chitecture,” Software Eng. Notes, vol. 17, no. 4, 1992. Association for Computing Machinery and the IEEE Computer Society.
[27] G. Abowd, R. Allen, and D. Garlan, “Style to Understand Descrip-
tions of Software Architecture,” Software Eng. Notes, vol. 18, no. 5,
1993. Also in Proc. First ACM SIGSOFT Symp. Foundations of Software
Eng., 1993. Susan N. Roberts holds a BS degree in com-
[28] W. Tracz, “Domain-Specific Software Architecture (DSSA)— puter science and an MS degree in computer
Frequently Asked Questions (FAQ), Software Eng. Notes, vol. 19, no. science, with a concentration in artificial intelli-
2, 1994. gence both from the University of Michigan. She
[29] D. Harris, H. Reubenstein, and A. Yeh, “Reverse Engineering to the recently joined Spyglass, Inc. as a software en-
Architectural Level,” Proc. 17th Int’l Conf. Software Eng., 1995. gineer. Previously, she was an artificial intelli-
[30] S. Roberts, R. Piazza, and D. Katz, “A Portable Assembler Reverse gence lead scientist at the MITRE Corporation,
Engineering Environment,” Proc. Third Working Conf. Reverse Eng., which she joined in 1990. While at MITRE, her
Nov. 1996. early work concentrated in the area of natural
[31] D Harris, A. Yeh, and H. Ruebenstein, “Extracting Architectural language understanding. Following that, she
Features from Source Code,” Automated Software Eng., vol. 3, pp. spent six years working in the area of reverse
109-138, Boston: Kluwer Academic, 1996. engineering. During this time, she developed numerous software
[32] A. Yeh, D. Harris, and M. Chase, “Manipulating Recovered Soft- workbenches for reverse engineering third generation software lan-
ware Architecture Views,” Proc. 19th Int’l Conf. Software Eng., pp. guages and assembly languages. At Spyglass, Inc., she is working on
184-194, Boston: ACM Press, May 1997. the HTML engine of the Device Mosaic web browser.