20-Design Recovery For Distributed Systems

Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 23, NO.

7, JULY 1997 461

Design Recovery for Distributed Systems


Lester J. Holtzblatt, Richard L. Piazza, Member, IEEE, Howard B. Reubenstein, Member, IEEE,
Susan N. Roberts, and David R. Harris

Abstract—Two factors limit the utility of reverse engineering technology for many distributed software systems. First, with the
exception of tools that support Ada and its explicit tasking constructs, reverse engineering tools fail to capture information
concerning the flow of information between tasks. Second, relatively few reverse engineering tools are available for programming
languages in which many older legacy applications were written (e.g., Jovial, CMS-2, and various assembly languages). In this
paper, we describe approaches that were developed for overcoming these limitations. In particular, we have implemented an
approach for automatically extracting task flow information from a command and control system written in CMS-2. Our approach
takes advantage of a small amount of externally provided design knowledge in order to recover design information relevant to the
distributed nature of the target system.

Index Terms—Design recovery, program understanding, legacy systems, language independence.

—————————— ✦ ——————————

1 INTRODUCTION

T HE software for many large distributed systems is


maintained by organizations that are responsible for
both enhancing as well as correcting problems in these
ple, all the tools we have surveyed display the calling hier-
archy of a program, although they will vary in how this
information is displayed. Many of the tools also display
systems’ software. The productivity of such software information about the flow of control within individual
maintenance organizations can be adversely affected by the procedures, and the structure and usage of data within a
considerable amount of time software maintainers spend program. For example, many tools generate reports con-
simply trying to understand the software they are main- cerning which procedures use or set particular variables.
taining. Studies of the software maintenance process indi- Additionally, reports concerning the structure of records,
cate that software maintainers, on the average, spend ap- tables, or arrays used within a program are often provided.
proximately one-half of their time developing an under- These tools can provide maintainers with good insight
standing of the software [2], [3]. One of the primary reasons into the structure of a program, particularly when coupled
for this is that the documentation and other formal descrip- with good navigational aids between the reports and the
tions of large distributed systems are often inadequate and source code. For example, using the tools available from
unreliable. As a result, software maintainers typically rely Reasoning Systems (e.g., [13]), a software maintainer can
on source code as the only completely reliable source of interactively navigate through code by selecting different
information on the software. The process of trying to de- portions of code from a structure chart. A maintainer may
velop an understanding of the software by manually navi- also forsee the potential impact of changes he plans to in-
gating through the code is extremely time consuming and troduce into a program by using these tools to identify ar-
error prone. This situation has created a need for a technol- eas of the program that may be affected by making the
ogy that both automatically extracts information from the change. In general, reverse engineering tools can improve
source code and presents this information in a comprehen- the productivity of a software maintainer, both by provid-
sible format. ing the maintainer with an understanding of the structure
of a program and by making relevant portions of a program
1.1 Limitations of Current Reverse Engineering Tools readily accessible to the maintainer.
Reverse engineering tools that extract certain aspects of the In spite of the potential of these tools, their utility is lim-
structure of a software system from source code are com- ited for many distributed software systems because of their
mercially available [3], [4]. Although these tools provide a inability to extract the nonsequential execution aspects of a
range of capabilities, most share a common core. For exam- computer program. Distributed systems consist of individ-
ual units of execution (tasks) that operate concurrently on
¥¥¥¥¥¥¥¥¥¥¥¥¥¥¥¥ different processors or by interleaving their functioning on
• L.J. Holtzblatt, R.L. Piazza, and H.B. Reubenstein, are with Reasoning, the same processor [8]. These concurrent tasks exchange
Inc., 700 East El Camino Real, Suite 300, Mountain View, CA 94040. both control information as well as data through a variety
E-mail: {ljh, rich, hbr}@reasoning.com.
• S.N. Roberts is with Spyglass, Inc., One Cambridge Center, Cambridge, of mechanisms. However, with the exception of tools that
MA 02142. E-mail: [email protected]. support Ada and its explicit tasking constructs, reverse en-
• D.R. Harris is with the MITRE Corporation, 202 Burlington Rd., Bedford, gineering tools fail to capture information concerning the
MA 01730. E-mail: [email protected].
data and control flow between tasks. As a result, these tools
Manuscript received 22 Jan. 1996; revised 12 June 1997.
Recommended for acceptance by H. Muller. provide limited support for understanding the structure of
For information on obtaining reprints of this article, please send e-mail to: distributed systems.
[email protected], and reference IEEECS Log Number 105251.

0098-5589/97/$10.00 © 1997 IEEE


462 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 23, NO. 7, JULY 1997

1.2 Problems Caused by these Limitations 2 TASK FLOW RECOVERY


One of the primary reasons for the lack of distributed de- Our initial efforts recovered distributed design information
sign recovery support is that understanding design con- from one system in particular, the Modular Control
structs relevant to the execution of concurrent tasks re- Equipment (MCE) system. MCE is a command and control
quires more than an implementation level understanding of system written in the Navy’s source language CMS-2. The
the software [1]. The syntax of programming languages, MCE software runs in a distributed, multiple CPU hard-
particularly older legacy languages such as CMS-2 or For- ware environment. The software is split into modules
tran, does not make constructs such as intertask communi- which are distributed across the different CPUs. The real-
cation and task synchronization explicit. Instead, the sys- time operating system (RTOS) enables the software on dif-
tem’s inter-task behavior often depends on the design and ferent CPUs to communicate, sharing both data and control
semantics of the specific operating system and the way in (task invocation).
which the application code interacts with the operating To perform standard implementation level reverse engi-
system. Since basic reverse engineering tools extract only neering that would provide basic analysis of the MCE sys-
information that is represented explicitly in the syntax of the tem, we constructed a workbench called CLUE (CMS-2 Lan-
programming language, tools for sequential programming guage Understanding Environment) using the Software Re-
languages can only extract information concerning the se- finery environment [12]. CLUE produces the standard views
quential execution of individual tasks. These tools will fail and reports for CMS-2 source code described in Section 1.1.
to capture information concerning how these tasks interact. These include procedure-calling hierarchies, set/use reports
However, maintainers extract knowledge about how and control-flow graphs. The tool we built to do do the task
concurrent tasks interact from the source code of older flow analysis is an extension of CLUE called M-CLUE.
systems, even though such information is not explicitly Tasks in MCE are executable units within a module and
available in the syntax of a programming language. Ex- are each comprised of a number of different procedures.
tracting this information requires knowledge about the type Tasks spawn a variety of actions on themselves or other tasks
of processing model used by the system software and how through procedure calls to RTOS (see Fig. 1). These actions
this processing model has been implemented in a particular include scheduling a task, removing a previously scheduled
operating system. task, and I/O to a file. In this paper we focus primarily on
In addition to knowing about the processing model techniques to determine which tasks schedule other tasks,
used by a system, it is also necessary for maintainers to although we also use this approach to recover information
understand the idiosyncratic techniques used by pro- regarding other types of operating system calls.
grammers to implement these constructs. For example,
although tasks may not be explicitly represented through
syntactic constructs in the code, specific recurring pat-
terns of code may be used to represent a task in a par-
ticular application. As a result, it may still be possible for
a maintainer to recognize those specific portions of code
that implement a particular task. Similarly, the specific
actions through which these tasks communicate with each
other may be implemented through particular types of
calls to the operating system. Interpreting how specific
tasks communicate with each other depends on inter-
preting the meaning of these specific calls. Fig. 1. Spawning a task via the operating system.
A person’s ability to manually extract extra-linguistic
information from the program’s source code depends on Determining the flow of tasks within MCE requires ex-
their ability to use knowledge of how specific design con- tracting information that is not directly available in the
structs are implemented in the source code. However, MCE source code. Extracting the task flows requires dis-
basic reverse engineering tools are not designed to make covering the two primary pieces of information required to
use of such architectural knowledge. Unless techniques understand any task flow: who called a task and what task
are developed to make use of this knowledge, reverse en- was called. Neither piece of information is explicitly repre-
gineering tools will fail to extract more than the imple- sented in the syntax of the source code.
mentation level detail of a program. As long as tools can The overall strategy for determining task flows requires
provide only limited visibility into the structure of a pro- implementating of a set of recognition rules that identify a
gram, they will not be able to provide the insight required small set of design constructs (e.g., tasks and modules) in the
to understand the design of a distributed system or any MCE code. This information is then supplemented with
type of other systems implemented using a particular techniques for evaluating the states of specific variables in the
software architecture (see Section 4). In Section 2, we de- code that identify the tasks spawned by a particular RTOS
scribe our approach for using this design information to call. Algorithms for identifying the set of statements that may
support the automatic extraction of task flow information impact the state of a variable at a particular location are
from a system implemented in CMS-2. In Section 3, we known as program slicing [7]. These evaluations required
summarize our empirical results. Finally, we discuss re- determining the program slice for the module and task vari-
lated work and conclusions. able in each RTOS call. For each RTOS call, this program slice
HOLTZBLATT ET AL.: DESIGN RECOVERY FOR DISTRIBUTED SYSTEMS 463

is evaluated within the calling environment of each task of documentation is available as a starting point. Additionally,
the module containing the RTOS call. This evaluation returns the tasks that are contained in a module is a very general
a value for the specific variables together with the calling piece of information and does not change much in practice.
environment in which these values were computed. These The detailed information that we extract, the actual task flow
values identify the task spawned by a particular RTOS call from task to task within and outside of a module, is an ex-
and the calling environment identifies the task spawning the ample of information that changes much more often.
new task. In this section we will explain the overall strategy This documentation is written in English, but it is fairly
M-CLUE uses to automatically extract this information and structured. Thus, with a minimal amount of editing, we
explain the approach in the tool for recognizing the relevant automatically parsed the documentation, using a recursive-
design constructs in the MCE source code.
descent parser. The parser automatically created module
2.1 Design Construct Recognition and task objects for each module and task mentioned in the
The purpose of design construct recognition is to identify documentation. The module and task names and the list of
instances of the key architectual features implemented in a relevant files for each module were also set automatically
software system, and their inter-relationships. We created a during parsing.
domain model that identifies a small set of design con- After obtaining as much information about modules and
structs in the MCE code: tasks, modules, and a small set of tasks as possible from the documentation, we examined the
events through which tasks interact with each other (e.g., source code to complete the model. Each module has an
task spawning). associated entry procedure. Because the documentation only
One of the difficulties in determining task flows is in identifies the name of the file containing a module entry
determining what comprises a task. This is because no procedure, we needed to find this procedure from the
syntactic structure exists in CMS-2 that corresponds to a source code. This is done by generating the procedure call-
task. Additionally, procedures can belong to multiple tasks. ing hierarchy of the module. The module entry procedure is
Therefore one cannot simply read the source code to de- the root procedure in the procedure calling hierarchy. To
termine the task containing a particular RTOS call. How- avoid confusing orphan procedures for the root procedure,
ever, using knowledge of how tasks were implemented in the root of the largest disconnected subgraph is used. The
MCE, our design recovery system can determine all of the module entry procedure in the MCE system contains a
necessary design constructs through the use of a set of rec- CMS-2 construct called a p-switch. The p-switch is very
ognition rules. The current implementation hard codes rec- much like a case statement. It passes control to the task root
ognition rules for these design constructs. Each recognition procedure of a particular task, depending upon the value of
rule creates an instance of an abstract design construct or the p-switch variable. A task continues executing in MCE
determines the value of one of its attributes. until the root procedure terminates. Therefore, from the p-
Because it is impossible to recognize these abstract de- switch we were able to determine the names of the task root
sign constructs from information contained solely in the procedures for each of the tasks in that module.
source code, we implemented recognition rules that operate
on design documentation in addition to the ones that oper- 2.2 Determining the Task Called by RTOS
ate on the parsed representation of the source code. The use Once modules, tasks and their entry procedures have been
of noncode sources for reverse engineering has been ex- recognized, it is possible to determine the behavior of each
plored by other researchers [22]. task by identifying and interpreting the RTOS system calls
A portion of the on-line documentation for MCE de- used in a task. Each of the events produced via an RTOS
scribed each of the modules of the system. Fig. 2 is an ex- call is represented in our domain model. We implemented
ample of one specification. For each module, the module’s event recognition algorithms that identify occurrences of
name (first line, third token) and a list of tasks was enumer- these events. The first step is to find all of the RTOS calls in
ated (third column). A list of files relevant to the module the source code. This is easy to do by traversing the parse
(second column) and the file that contained the entry pro- tree of the program and testing for the name RTOS in each
cedure for the module were also identified (file name on procedure call object encountered.
same row as “Entry” in third column). The next step is to evaluate the value of the arguments
used by RTOS to determine the task behavior the RTOS call
MODULE ATC.ATATS ATA 20 represents. An RTOS call has two arguments, the type of
ATAFS-231016 ATAFSM Entry the RTOS call and a table (a CMS-2 data structure) con-
“ ATAPRE
taining information for that type of call. The fields in the
“ ATAPOS
“ ATASHUT
table vary depending on the type of RTOS call invoked. For
“ ATARST each RTOS call identified in the code, the first argument
“ ATAOAM identifying the type of RTOS call is accessed and the ap-
“ ATAMON propriate event object is created to represent the event. For
“ ATACAL example, if the RTOS call schedules a task its first argument
“ ATASI will indicate that this is a REGISTER type of call.
Fig. 2. The specification of the module ATA in the documentation. The task scheduled by an RTOS call is uniquely deter-
mined by a set of arguments passed to RTOS via the call.
Although this paper argues that documentation can These arguments identify a module and the task it contains.
quickly become out of date, resulting in the need for auto- A module/task pair uniquely identifies a task in the MCE
matic methods to keep it current—it is important to use what system. In order to determine the task spawned by an
464 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 23, NO. 7, JULY 1997

RTOS call, it is necessary to determine the state of these two common programming languages, because all variables in a
variables at the particular point in the program when an CMS-2 program, including formal arguments of proce-
RTOS call is made. In some cases determining the value of dures, are global therefore, procedure invocation does not
these variables is relatively straightforward because these introduce a new scope.
values are initialized and remain constant throughout the For any particular RTOS call, module and task variables
execution of that module. may assume different values in different execution contexts.
When a variable is not initialized, its state can be deter- Because a program slice contains the set of all statements that
mined by identifying the relavent program slice, the set of may influence the state of a variable, only a subset of these
statements that may impact the value of that variable at statements may actually be executed under a particular con-
that point in the code. The slice is then evaluated to yield text. In order to evaluate the program slice under each possible
the state of the variables. For our purposes, this technique context, each syntactically possible execution thread within a
assumed that the state of a module and task variable were module that may reach a designated RTOS call is evaluated.
completely determined within the scope of a task because Each of these evaluations could derive a distinct value for the
the program slicing algorithm does not trace data depend- module and task variables for the particular RTOS call.
encies across task boundaries. This assumption was valid The inputs given to the evaluator are: a list of the variables for
for most of the modules in the MCE system. which values are requested; the statement of interest, i.e., the
statement in the slice for which the variable values should be
2.3 Determining the Calling Task determined. The execution of a slice is guided by a pre-order
The identification of the root-procedure for each task pro- traversal of the procedure calling hierarchy. As it is traversed,
vides the knowledge necessary for identifying the context each procedure encountered may contain some statements
of an RTOS call. A specific RTOS call may be made within found in the slice. The statements are evaluated in the order in
different contexts. Each calling environment contains a root which they occur in the procedure (i.e., statements are sorted by
procedure that corresponds to the calling task. Therefore, line number), and their values are saved for use in other com-
determining the task that spawned a new task requires de- putations of the evaluator. When the statement of interest (the
termining the calling environment for a particular RTOS RTOS call in this case) is encountered, the values of the variables
call passed a specific set of module/task values. of interest are checkpointed. If the statement is encountered
again (see Fig. 3), the values at that time are also checkpointed.
2.4 Program Slicing and Evaluation Sets of checkpointed values, together with the calling environ-
A program slice on some variable v, or set of variables, at ment for each set, are returned from the evaluator.
statement n consists of those statements that contribute to
the value of v just before statement n is executed. By using
program slices, we substantially reduce the number of
source code lines to be considered by the tool.
An example of a program slice for one of the RTOS calls
is below. In this slice, procedure PMASS calls procedure
PMASDS to fill the slots of the table, PMMN and PMTN, with
the correct module and task numbers. They are filled with
CMGDMN and CMATTN, respectively which are named con-
stants for 13 and 6 so that no additional statements are
needed in the slice. Upon return from PMASDS, PMASS
calls the operating system (RTOS). The first argument is
Fig. 3. Two different execution contexts use the same RTOS call with
REGISTER, which indicates that the request is to schedule a
different values.
task. The second argument is the address of the table with
the pertinent information for performing this request.
Given the value for the module/task pair and the calling
Notice that this is an interprocedural slice, where the
environment, it is possible to compute the calling tasks and
module and task numbers are set in a subsidiary procedure.
the called tasks of the RTOS call. A calling task is one whose
As the line numbers following the procedure names indi-
entry procedure is contained in the calling environment. A
cate, this program contains many thousands of lines of
called task is the one that corresponds to the module/task
code, yet these are the only lines necessary to set up and
pair. If the module is the same as the one under investigation,
spawn this particular task.
the task number is an index into the module’s entry proce-
PMASDS.14744: SET IDRTSR(0, PMMN) TO CMGDMN dure p-switch statement, which can be thought of as a list of
PMASDS.14745: SET IDRTSR(0, PMTN) TO CMATTN all intramodule task entry procedures. If the module number
PMASS.13428: PMASDS corresponds to another module, then information from the
PMASS.13433: RTOS INPUT REGISTER, CORAD (IDRTSR)
documentation is used to determine the name of the task so
OUTPUT PMDUM
that it can be displayed in a task flow graph or table.
Our tool contains a simple CMS2 evaluator that is used
to determine the possible values of the table fields used by 2.6 Recognizing other Services
1
an RTOS system call by “executing” the slice. Evaluation of In addition to scheduling new tasks, other services are pro-
a slice is made somewhat easier in CMS-2 than in other vided via calls to the RTOS operating system. The method
outlined above also works for these services. The process of
1. Our program slicer does not generate a compilable slice. finding the calling task of the RTOS call is the same as out-
HOLTZBLATT ET AL.: DESIGN RECOVERY FOR DISTRIBUTED SYSTEMS 465

lined above. However, different information is often sent to


RTOS. Depending upon the service requested, the appropri-
ate information is also passed to RTOS in slots of a table. The
values of these slots can be computed via a program slice and
evaluated with a similar method to that used to discover the
called module/task pair. The table below lists some of the
other services that RTOS performs. If the service includes
task spawning, then the table passed to RTOS contains a
module and task number. The other slots passed in the table
depend upon the service. In addition, other slots that we re-
cover for some of the RTOS calls are delay and period.
Other Information
Other Services Spawns Tasks (Slots)
Fig. 4. Reusing analysis code.
REGXTERNL yes
BUFREG yes
REMOVE task removed from of queue call-by-value) differences. Since the objects in the abstract tree
IOREQUES no file number are specific to the language parsed, some uniform program-
BUFIOREQ no file number mer’s interface between them and the analysis capabilities
NOTIFY no message id must be used. The interface routines are made up of a small
DMU sometimes ddb-address
amount of language-dependent code which is needed to ac-
2.7 Moving Towards Generic Approaches cess information from the parse tree [10], but the core code
for the analysis routines is common to all languages.
Another goal of our work was to develop tools useful for
This makes the analysis capabilities transparent to the
many different legacy systems. Since legacy systems use a
underlying language-dependent representation (see Fig. 4).
host of different source languages, it is important to dis-
Additionally, the interface and documentation routines can
cover ways to avoid creating a completely new tool for each
be treated similarly.
language. Commercial tools are generally not available for
Table 1 indicates the amount of code reuse that was
these languages due to their small customer base. This is
achieved by the LIM approach. Three different analysis
especially true for assembly languages for special purpose
capabilities are listed. The second column indicates the
hardware [11], [30]. Therefore, since little effort can be ex- number of lines of language-independent core code. The
pended for each language, a method must be found that third through sixth columns indicate the number of lines
will make the development of a similar tool for multiple that needed to be added to specialize the code so that it
legacy language straightforward and inexpensive. handles C, CMS-2, Ada, FORTRAN and 68000 Assembly,
Initially this problem seems trivial: The solution could be respectively. For the data flow analysis, this translated into
summed up as “all you have to do is add another parser.” needing to write about 75 language-dependent routines per
We found this solution naive based on our efforts to modify language, which contained an average of seven lines and a
a tool in just this manner [5]. Our approach is the use of a median of four lines.
language-independent method (LIM) for program under- In general, the amount of specialization code for control
standing [6]. A language-independent method can be the flow is related to the size of the domain model. Therefore,
basis for implementing language-independent analysis ca- CMS-2 which has the largest language domain model needs
pabilities that are not developed for a particular language, much more more specialization code than 68,000 assembly,
but for a family of languages based on an abstract model. which has a small language domain model.
Analysis capabilities can then be written following this Additionally, no extra code needed to be written to sup-
method, without worrying about or depending upon the port orphan analysis (procedures not called anywhere in
idiosyncrasies of any one language. the program). The necessary code had already been written
Our approach is predicated upon the ability to separate for the data flow analysis. This brings up another aspect of
the processes of parsing and analysis. Using this approach a code reuse—across analysis capabilities. This is indicated
parser converts the source code into a language-dependent by the 100%* figure in Table 1.
abstract syntax tree. We then use language-independent Using the LIM enabled us to avoid rewriting the analysis
analysis capabilities [9] to produce reports and higher-level code for every different language workbench. This has sig-
views of the software. nificantly reduced the amount of time required to port the
Even though the concepts represented in each language analysis capabilities. In general it is a matter of a week or
are similar, they contain certain small syntactic (e.g., the two to complete the port to a new language once the
name of the attribute) and semantic (e.g., call-by-reference vs. grammar and domain model have been verified.

TABLE 1
REUSE STATISTICS
Analysis 68,000 Average Reuse
Capability Core C CMS-2 Ada Fortran Assembly ( percent)
Data Flow 2,638 861 915 745 not ported 447 78.4
Control Flow 1,021 171 287 in Refine/Ada 148 97 85.5
Orphans 56 - - not ported not ported not ported 100*
466 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 23, NO. 7, JULY 1997

3 EXPERIMENTAL RESULTS maintainers. This documentation is integrated with the


source code listing. As a result, it can be used for source
Several factors can be used to measure the success of our
code navigation and can be integrated into other reports.
distributed design recovery effort. Automatically generating
task flow graphs has several advantages. First, the reduced Of course, the accuracy of our algorithms in determining
level of effort to generate task flow graphs is particularly use- the actual task flow is difficult to asertain without a more
ful during maintenance. The automated process described objective assessment in which to compare our results. For-
above will take less than an hour to produce a graph, once tunately, as a part of our companies’ support of the MCE
the data flow analysis has been done on a module of up to project, manually generated task flow graphs were pro-
40,000 LOC. The time required to produce the same graph by duced and form a benchmark from which we can compare
hand is on the order of several days to weeks. New versions the automatically generated results.
of the source code are often released on a regular basis. If
these graphs needed to be generated by hand it could require 3.1 Language Description for Graph Specification
significantly more time for a maintainer to notice how Language
changes between versions impact the task flow. An example of the manually generated task flow graph, for
Additionally, the discrepancies between automatically the ATA module can be found in Fig. 5.
generated documentation and the manually generated An example of the automatically generated task flow
documents point to possible errors in the manually generated graph for the ATA module can be found in Fig. 6.
reports (or possibly the algorithm discussed in the previous In order to validate the approach for generating task-
section). As a result, automatically generating documentation ing/service graphs discussed above accurately, it was nec-
will ensure that the documentation that is available to the essary to automatically compare the automatically gener-
maintainer is more reliable than may otherwise be possible. ated graphs with these manually generated ones. We de-
Finally, this process generates on-line documentation for veloped a tasking graph comparator to accomplish this task.

Fig. 5. A manually generated graph for ATA.

Fig. 6. An automatically generated graph for ATA.


HOLTZBLATT ET AL.: DESIGN RECOVERY FOR DISTRIBUTED SYSTEMS 467

TABLE 2
MANUAL VS. AUTOMATICALLY GENERATED GRAPHS
Edges in Edges in
Manually Automatically Edges in Intersection A-B B-A
Module Generated Graph Generated Graph Common ( percent) ( percent) ( percent)
ATA 19 21 18 85.7 5.3 14.3
CMU 49 49 45 91.8 8.2 8.2
CXU 13 13 10 76.9 23.1 23.1
DMA 9 6 6 66.7 33.3 0.0
DMB 9 9 8 88.9 11.1 11.1
DRA 41 42 41 97.6 0.0 2.4
DSD 96 130 87 66.9 9.4 33.1
DSE 37 57 29 50.9 21.6 49.1
PMA 56 56 54 96.4 3.6 3.6
SMB 19 19 19 100.0 0.0 0.0
SMD 5 5 5 100.0 0.0 0.0
SME 36 36 29 80.6 19.4 19.4
SMF 44 36 16 36.4 63.6 55.6
SMG 18 18 17 94.4 5.6 0
SRC 35 33 33 94.3 5.7 0.0
SRD 39 40 39 97.5 0 2.5
SRE 25 22 22 88.0 12.0 0.0
SRG 9 9 9 100.0 0.0 0.0

To compare an automatically generated and a manually the work of Murphy et al. on the empirical study of call
generated graph requires a common representation. There graph extractors [23]. This work gives the following equa-
must be a way to specify the manually generated graphs in tions to compare graphs pairwise.
a form that is readable by a computer, so that they can be
| Edges in Graph from Method A © Edges in Graph from Method B| * 100
put into this common representation. For this reason, a
simple graph specification language was developed. This max(|Edges in Graph from Method A|, |Edges in Graph from Method B|)
language is used to enter the relationship between tasks This equation generates the percentage of edges found
and other RTOS services. Essentially, it is a way to specify by both methods.
each edge of the graph. The nodes are the various tasks, or
other RTOS services. Together, a set of specified edges is |Edges in Graph from Method A  Edges in Graph from Method B| * 100
used to form the graph. Edges in Graph from Method A
As an example to specify the task spawning relationship This equation generates the percentage of edges found
between task ATA5 and DSE9, the following specification by method A but not method B.
would be used:
|Edges in Graph from Method B  Edges in Graph from Method A| * 100
(def-rtos-info Edges in Graph from Method B
:caller-module ATA
:caller-task-number 5 This equation generates the percentage of edges found
:called-module DSE by method B but not method A.
:called-task-number 9) The higher the intersection statistic, the more similar the
To specify that a particular io-request was made by a graphs are: 100 percent means the graphs share all the same
module, the following specification would be used: edges. It is desirable for the two set difference statistics to
be as close to 0 percent as possible: 0 percent means that
(def-rtos-info one method did not find any extra edges that were not
:type *rtos-io-request-call
found by the other method.
:caller-module SCB
:caller-task-number 11 Three modules did not adhere to the recognition rules
:file-number 50) developed via expert knowledge (see Section 3.3.3), so they
are not included in any of the results below. Table 2 con-
tains the comparison between the manual and automati-
3.2 Results cally generated graphs.
In the tables below, we compare three different graphs for We investigated each discrepancy to determine if there
each module. The first two are the manually and automati- was an error in the manually generated graphs or in the
cally generated graphs. The third is a graph that we created algorithm used to automatically compute the graphs. This
from investigating the discrepancies between the two involved investigating the program slices and the differ-
graphs. This third graph, called the “corrected” graph can ent execution threads found by the automatic method.
be thought of as being closer to “ground truth” of what the Table 3 expresses the results with respect to that analysis.
actual task flow graph is for each module. It is impossible In addition, the results in Table 3 include some modules
to state absolutely which method is better in terms of the 2
which were run in interactive mode. In this mode, the user
actually implemented task flow in the code, since that in-
formation is never truly available. Another way to measure
the results is to compare the methods to each other using 2. This was a new version of M-CLUE that was applied to only a few modules.
468 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 23, NO. 7, JULY 1997

TABLE 3
“CORRECTED” VS. AUTOMATICALLY GENERATED GRAPHS
Edges in Edges in Automatically Intersection A-B B-A
Module Corrected Graph Generated Graph Edges in Common ( percent) ( percent) ( percent)
ATA 21 21 20 95.2 4.8 4.8
CMU 49 49 49 100.0 0.0 0.0
CXU 13 13 13 100.0 0.0 0.0
DMA 6 6 6 100.0 0.0 0.0
DMB 9 9 9 100.0 0.0 0.0
DRA 42 42 42 100.0 0.0 0.0
DSD 99 130 95 73.1 26.9 26.9
DSE 42 57 34 59.6 40.4 40.4
PMA 56 56 56 100.0 0.0 0.0
SMB 19 19 19 100.0 0.0 0.0
SMD 5 5 5 100.0 0.0 0.0
SME 36 36 28 77.8 22.2 22.2
SMF 44 36 29 65.9 19.4 34.1
SMG 18 18 18 100.0 0.0 0.0
SRC 33 33 33 100.0 0.0 0.0
SRD 40 40 40 100.0 0.0 0.0
SRE 25 22 22 88.0 0.0 12.0
SRG 9 9 9 100.0 0.0 0.0

TABLE 4
“CORRECTED” VS. MANUAL GRAPHS
Edges in Edges in Manually Edges in Intersection A-B (B - A)
Module Corrected Graph Generated Graph Common ( percent) ( percent) ( percent)
ATA 21 19 19 90.5 0.0 9.5
CMU 49 49 49 100.0 0.0 0.0
CXU 13 13 13 100.0 0.0 0.0
DMA 6 9 6 66.7 33.3 0.0
DMB 9 9 9 100.0 0.0 0.0
DRA 42 41 41 97.6 0.0 2.4
DSD 99 96 96 96.0 0.0 3.0
DSE 42 37 37 88.1 0.0 11.9
PMA 56 56 56 100.0 0.0 0.0
SMB 19 19 19 100.0 0.0 0.0
SMD 5 5 5 100.0 0.0 0.0
SME 36 36 36 100.0 0.0 0.0
SMF 44 44 44 100.0 0.0 0.0
SMG 18 18 18 100.0 0.0 0.0
SRC 33 35 33 100.0 5.7 0.0
SRD 40 39 39 97.5 0.0 2.5
SRE 25 25 25 100.0 0.0 0.0
SRG 9 9 9 100.0 0.0 0.0

TABLE 5
AVERAGE STATISTICS OF PAIRWISE COMPARISON
Method Intersection A-B B-A
Manual / Automatic 84.1 12.3 12.7
Corrected / Automatic 92.2 6.3 5.3
Manual / Corrected 96.5 0.39 1.6

was asked to type in the value of a variable if the evaluator the graphs actually constructed program slices on paper, or
was unable to compute one for it. This can potentially solve in their heads. However, the software is too complex to con-
some of the problems caused when variables get their val- struct the correct slices without the aid of automatic tools.
ues from outside the module (see Section 3.3.1). Both the automatic method and the manual method
Table 5 summarizes the results by averaging the statis- found graphs which were about 90 percent the same as the
tics over the set of modules. “corrected” graph, itself not a true measure as described
Notice that the manual method of generating the graphs above. Since the automatic graphs are much less time-
was much more error-prone than the automatic method. This consuming to create than the manual ones, the automatic
is borne out by comparing the Manual/Automatic and the method is much more desirable to use. The automatic
Corrected/Automatic statistics. Once the discrepencies were method also has the potential to get better, if certain limita-
accounted for, the automatic graphs were more similar to the tions in the algorithms are overcome (see Section 3.4). The
corrected graphs than the manual graphs. This shows the manual method is always subject to human error—which is
power of generating program slices automatically. Although likely, given the complex data flow analysis that is needed.
they probably were unaware, the individuals who generated Lastly, the graphs produced by the manual method quickly
HOLTZBLATT ET AL.: DESIGN RECOVERY FOR DISTRIBUTED SYSTEMS 469

become out of date, yet the automatic method can be used ELSE BEGIN
to easily recreate the automatic graphs whenever the code SET IDDMSRD(0,MODULE) TO SMGSWPMN $
changes. Lastly, the automatic method generated 11 of the SET IDDMSRD(0,TASK) TO SMGRDCTN
END $
18 graphs with a 100 percent intersection statistic and 0 per-
cent set difference statistics. The statements

3.3 Limitations SET IDDMSRD(0,MODULE) TO SMAGENMN $


SET IDDMSRD(0,MODULE) TO SMGSWPMN $
In this section, we will enumerate the reasons for the incon-
sistent results produced between the manually generated are both part of the slice for IDDMSRD(0,MODULE)
and automatically generated graphs. since both statements could affect its value.
The problem arises during evaluation of this slice.
3.3.1 Inherent Limitations of the Approach Slices do not include information regarding the con-
1) The program slicer is a static tool. It cannot take into text in which a statement should be executed intra-
consideration dynamic information. Additionally, the procedurally. In actuality, only one branch of the
data flow analyzer does not currently take non- conditional is executed at any time, which means that
standard control flows into consideration, i.e., excep- the slice includes two different contexts. Since the
tion or error handling evaluator simply uses line numbers to establish the
2) Since data flow analysis among tasks would require a order of execution of the statements within a proce-
task flow graph, the assumption that was made when dure, one execution path will be obscured, because
processing the MCE code was that each module was the execution of the second assignment to the variable
self contained. By that we mean that all variables are will overwrite the first one. In the case of the example,
set within the module. This is for the most part true, the evaluator will execute the first statement:
however there were some exceptions. One example SET IDDMSRD(0,MODULE) TO SMAGENMN $
where this occurs is in module CXU. The automatic
method found that tasks CMU4, CMU5, and CMU6 assigning the table slot with the value SMAGENMN.
made an I/O request via RTOS with the file number When executing the second statement it will over-
of 2,052. However, the manual method found that the write the table slot value of SMAGENMN with the
same tasks made a slightly different I/O request, with value SMGSWPMN :
the file number of 2,084. SET IDDMSRD(0,MODULE) TO SMGSWPMN $
All of these discrepancies are caused by the same code. When computing the tasking graph we must inves-
CXUSMD: SET CXURTIO(0, BUSADDR) TO 2052 tigate all contexts. The missing conditional information
+ ((GCUMARK(0, DCUADDR) * 32)) causes the evaluator to miss at least one context.
CXUSMD: RTOS INPUT IOREQUES, CORAD(CXURTIO) To remedy this limitation, it is necessary to en-
OUTPUT GPOINTER hance the evaluator so that it uses the control flow
The problem is caused by the fact that GCUMARK(0, graph to order the statements, and starts a new con-
DCUADDR) is set somewhere outside of the module, so the text when encountering conditionals.
value cannot be determined under this assumption (it is 2) Statements controlling iteration are not included in
assumed to be 0). As stated previously, one strategy to the slice, nor are they handled by the evaluator. In
deal with this problem is to ask the user what the value MCE source code, many I/O requests are made
should be. within a loop. For example, in the following code

3.3.2 Current Limitations of the Implementation VARY VSUINDX FROM 0 THRU SMNRDR - 1
IF GTMOCU(VSUINDX,RDRNO) EQ RADARIFF THEN
Some of the results are incorrect because of current limita- BEGIN
tions of the implementation. Most of these limitations could SET IDRIOPKT(0,BUSADDR) TO VSUINDX + 567
be eliminated by enhancing the implementation of the pro- RTOS INPUT IOREQUES, CORAD(IDRIOPKT)
gram slicer, CMS-2 source code evaluator, or the task flow OUTPUT DUMMY
graph generator. The following explains these limitations in END “VARY Loop”
more detail. only the RTOS call that would have been executed the
1) The slicing algorithm does not currently include test first time through the loop will be evaluated. There-
statements in a conditional as part of a slice. As a re- fore, many I/O requests are missing.
sult, a slice may not be correctly evaluated. For exam- 3) Impossible Paths. Certain control flow paths are not
ple, consider the following MCE code: possible. For instance, consider two consecutive con-
ditional statements with contradictory tests. Because
(from procedure SMARDSCN):
IF GVMODE EQ MODIFYSM the conditionals are not evaluated this problem could
OR GVMODE EQ PRINTSM arise. This is a special case of 1.
THEN BEGIN IF A THEN
SET IDDMSRD(0,MODULE) TO SMAGENMN $ SET MNUM TO 5
SET IDDMSRD(0,TASK) TO SMARDCTN ELSE
END
470 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 23, NO. 7, JULY 1997

SET MNUM TO 6 tasking relationships should remain relatively unchanged. It


IF NOT A THEN would be useful to compare the tasking graphs from the old
SET TNUM TO 7 and the new versions to note differences. This would be a
ELSE
SET TNUM TO 8
good way to focus on the impact of the changes made in the
new version. Although certain differences would be correct,
The correct answer is the pair (5, 8) or (6, 7), but the other unexpected differences might indicate a software bug.
algorithm would determine (6, 8).
4) Array/Pointer References. In CMS-2 a variable can be set 3.4.3. Comparing for Debugging Our Software
to the address of a table via the language construct Lastly, the Comparator was very helpful in debugging the
CORAD. Because we do not perform data flow analysis task flow graph generator. First, it was useful in comparing
on what is essentially a pointer, we miss certain rela- the automatically generated graphs to the manually gener-
tionships. An example of this occurs in module SIB. ated ones so that one could focus on locating problems with
the code. Additionally, the current results could be compared
SIBMSACT: SIBREAD INPUT CORAD(IDMRPMRC) to previous results. It could be determined that a bug fix to
SIBREAD: SET CORAD(IPACKET) TO PCKTADDR
SIBREAD: DMUREAD INPUT CORAD(IPACKET)...
correct the graph of one module did not cause an incorrect
graph to be generated for another module. Thus, changes to
In the above slice, PCKTADDR is the formal parameter the software tool could be made with more confidence.
of the procedure SIBREAD. Since the algorithm is un-
able to determine the value of IPACKET (it is address
of IDMRPMRC since that is the actual parameter), it 4 GENERALIZING THE METHOD
cannot determine the task/module pair of this call. 4.1 Recovering Software Architectures in General
M-CLUE’s task recovery was the initial research and devel-
3.3.3 MCE Code Not Implemented According to the
MCE Documentation opment work we performed on architecture recovery—the
automatic (or semiautomatic) recognition of high-level ar-
The degree to which the automatically generated graphs
chitectural structures found in legacy code. M-CLUE’s ap-
accurately capture the correct relationships is dependent
plication to MCE stands out as a major effort where we
upon how well the actual MCE source code adheres to the
have been able to carry out an extensive comparison to
design documentation of the MCE system. This is necessary
more manual approaches. Building on this work, we have
because many of the heuristics built into the tool depend
conducted an extensive program to broaden its application.
upon the distributed design architecture documented for
In this broader setting, we have expanded architecture re-
the MCE system.
In particular, in modules DSB, RTH, and SRF the p- covery [29], [31], [32] in three ways. First, we cover additional
switches in the module entry procedures are nonstandard. languages (C, Ada, FORTRAN). Second, we recover multiple
Therefore, the heuristics do not work well, and produce architectural views (task spawning being one example) of leg-
incomplete results. acy software. Third, we provide for flexible analyst-access to a
broad-based recovery regime. Nuances in programming pat-
3.4 Other Uses for the Tasking Graph Comparator terns and recovery goals frequently make it essential for ana-
The Comparator can also serve other purposes. It can be lysts to adjust their approaches used to recover information.
used to compare the tool’s results to specifications of the This work draws from the evolving research area of
distributed system created during the design phase, or it software architectures (e.g., [24], [25], [26]). Each view we
could be used to compare different versions of the system recover is associated with a style. Styles are the idealized
to insure consistency. Lastly, it was used to debug the de- approaches that developers use for the construction of large
complex systems. Examples of views that we recover in-
sign recovery tool we described in this paper. Below we
clude interprocess communication, object-oriented, cli-
discuss all of these uses in more detail.
ent/server, and implicit invocation, along with the execu-
3.4.1 Comparing Specification to Implementation tion-oriented task spawning style described above.
Importantly there are some unifying principles which
The Tasking Graph Comparator can be used to compare the
we have used to develop our recognition framework. There
automatically generated graphs to a specification of the task
is consensus in software architecture community that an
flow produced during the design phase of the system In this
architectural representation consists of components, con-
way, a comparison between the design and the implementa- nectors, and constraints [27], [28]. Thus, the style establishes
tion would be possible. A situation similar to this could arise a set of recognition commitments which define compo-
if new modules were added to the MCE system. In general, nents/connectors types that our tools look for in the source
such a comparator could be an important part of a software code. Components are blocks of code (e.g., individual pro-
forward engineering tool set. cedures, collections of procedures defining a task, files).
Connectors are code fragments that activate a process, open
3.4.2 Comparing Different Software Versions a repository, or establish a communications path from one
The MCE system is undergoing constant changes during its component to another.
maintenance phase. New versions of the system are pro- As a cornerstone for architecture recovery, we have de-
duced when changes are made. Under the assumption that veloped recognition rules that extract connector/source-
the requirements and higher level abstractions of the soft- component/sink-component tuples. This is a more declara-
ware will not change much from version to version, the tive representation of the design recovery information that
HOLTZBLATT ET AL.: DESIGN RECOVERY FOR DISTRIBUTED SYSTEMS 471

was expressed procedurally in M-CLUE. Each rule is repre- designs/modularizations as in the modularization tool de-
sented declaratively with information about its purpose, scribed in [19]. In general, design recovery tools need to take
pre-conditions, parameterization, effects, and rule methods. advantage of extra-linguistic design information such as that
Rule methods are largely language/operating-system inde- provided in our design recovery rules.
pendent descriptions that look up specialized constructs in
knowledge-based templates. These templates are automati-
cally loaded into the analysis based on language/operating 6 CONCLUSIONS
system dependencies for the system under analysis. While Currently available reverse engineering tools are limited
M-CLUE itself looks explicitly for RTOS calls, our broader to extracting information that is represented explicitly in
framework retrieves models for underlying services (e.g., a
the source code. Therefore, higher-level abstractions of the
model for Unix contains the names of any operating system
code, such as its distributed behavior and the interprocess
function that will achieve a task spawning action).
communication and control, that are not explicitly repre-
As in the M-CLUE work, all the recognizers contain
sented and so cannot be captured. Explicit constructs for
some form of knowledge that enables them to quickly
these abstractions are often not available in legacy lan-
identify code fragments of interest (e.g., RTOS calls), while
guages. This situation is still worse because no basic re-
avoiding a top-down approach requiring full recovery of all
verse engineering tools are generally available for legacy
complex interwoven patterns.
languages.
Using our broader framework, analysts freely select rec-
We have developed M-CLUE a tool that recovers this in-
ognizers to recover multiple as-built views of legacy sys-
formation from a multiple process system written in CMS-
tems. We have linked default recognizers to find features of
2. It is based on CLUE, our general purpose reverse engi-
specific architecture styles, but the analyst is free to select
neering tool for CMS-2. M-CLUE uses expert knowledge of
alternative recovery paths.
the operating system and the architectual design of the tar-
In this work, we are demonstrating that multiview ar-
get system, in addition to the source code. Using this in-
chitecture recovery is practical and effective and can be
formation in the form of recognition rules, and advanced
used to analyze and document large-scale legacy systems.
software methods such as program slicing, we derived the
task flow graphs for each process of the system. When
5 RELATED WORK compared to graphs contructed by analysts, they were on
the average 84 percent correct, and 92 percent correct when
Our approach is influenced by work that recognizes the
we altered the analysts’ graphs with information suggested
organizational power of software architecture, techniques
by the automatic results.
for program understanding and concept recognition, and
By applying powerful program analysis capabilities in
the enabling technology of reverse engineering.
concert with recognition rules derived from some basic
Software architectures as described in [20], [21] provide
system design knowledge, a significant level of system de-
the organizational patterns that drive the distributed design
sign recovery can be achieved. The information is derived
recovery system. In this case, the architecture we analyzed
directly from the source code and can be traced back to that
and recognized was a fairly simple distributed system ar-
source code. As software baselines change, design recovery
chitecture in which modules and tasks are the relevant ar-
can be reapplied to produce current design information,
chitectural objects and the key architectural event is the
yielding a form of “living” documentation that can reliably
spawning of tasks.
aid in program maintenance and understanding.
Traditional program understanding work has focused on
cliché-based recognition [14], [15]. Generally, this is a bot-
tom-up recognition approach in which the program is ACKNOWLEDGMENT
matched to a set of pre-defined plans/clichés from a li-
brary. Recognition is based on a precise data and control All work was done at the MITRE Corporation under Air
flow match which indicates that the recognized source Force Contract No. F19628-94-C-0001.
component is precisely the same as the library template.
Our approach is more of a top-down hypothesis driven REFERENCES
recognition coupled with bottom-up recognition rules. Our [1] T. Biggerstaff, “Design Recovery for Maintenance and Reuse,” Com-
recognition rules do not require algorithmic equivalence of puter, July 1989.
the plan and the source being matched, rather they are [2] T. Biggerstaff, J. Hoskins, and D. Webster, “DESIRE: A System for
Design Recovery,” MCC Technical Report, STP-081-89, May 1989.
based on source code level events [15] in the code. The ex- [3] R. Rock-Evans and K. Hales, “Reverse Engineering: Markets, Meth-
istence of patterns of these events is sufficient to establish a ods and Tools,” vol. 1, Ovum Ltd., 1990.
match. Quilici [16] also explores a mixed top-down, bot- [4] C. Sittenauer, M. Olsem, and D. Murdock, “Re-engineering Tools
Report,” technical report, Air Force Software Technology Support
tom-up recognition approach using traditional plan defini- Center, 1992.
tions. The style of source code event-based recognition [5] H. Reubenstein, R. Piazza, and S. Roberts, “Separating Parsing and
rules is also exemplified in [17], [18] which demonstrates a Analysis in Reverse Engineering Tools,” Proc. Working Conf. on Re-
combination of precise control and data flow relation rec- verse Eng., May 1993.
[6] L.J. Holtzblatt, R.L. Piazza, and S.N. Roberts, “Design Recovery
ognition and more abstract code event recognition. Technology for Real-Time Systems,” Mitre Technical Report
Design recovery work, such as DESIRE [1], relies on: ex- 95B0000057, July 1995.
ternally supplied cues regarding program structure, modu- [7] M. Weiser, “Program Slicing,” IEEE Trans. Software Eng., vol. 10, no.
4, July 1984.
larization heuristics, manual assistance and informal infor- [8] N. Chapin, “Real-Time Software Maintenance,” Proc. Eighth Int’l
mation. Informal information and heuristics can also be used Conf. Software Maintenance and Re-Engineering, Aug. 1991.
to reorganize and automatically refine recovered software
472 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 23, NO. 7, JULY 1997

[9] C. Castells-Schofield, “Engineering a Language-Independent Ap- Richard L. Piazza (M’93) holds a BS degree in
proach to Parsing for Analysis and Testing,” Vitro Technical J., vol. 8, computer science from the State University of
no. 1, 1990. New York at Albany and an MS degree from the
[10] P. Devanbu, “GENOA—A Customizable, Language and Front-End University of Maryland at College Park. He re-
Independent Code Analyzer,” Proc. 14th Int’l Conf. Software Eng., cently joined Reasoning, Inc. as a principal
May 1992. member of the technical staff. Previously, he was
[11] S. Roberts, “Reverse Engineering Assembly Language Programs,” with the MITRE Corporation, which he joined in
Proc. Fourth Reengineering Forum, vol. 2, Sept. 1994.
1985. At MITRE he worked on projects in auto-
[12] Reasoning Systems, “Refine User Guide,” May 1990.
[13] Reasoning Systems, “Refine/C User’s Guide,” Mar. 1992. matic programming; the programming of under-
[14] C. Rich and L. Wills, “Recognizing a Program’s Design: A Graph standing and software reverse engineering; ma-
Parsing Approach,” IEEE Software, vol. 7, no. 1, 1990. chine learning; constraint-based reasoning, and
[15] M. Harandi and J. Ning, “Knowledge-Based Program Analysis,” model-based diagnosis. His current research interests center on
IEEE Software, vol. 7, no. 1, 1990. building various reverse engineering tools. His main contribution was
[16] A. Quilici, “A Hybrid Approach to Recognizing Program Plans,” the Language Independent Method for constructing reverse engineer-
Proc. Working Conf. Reverse Eng., May 1993. ing tools. This method facilitates building tools for different computer
[17] W. Kozaczynski, J. Ning, and T. Sarver, “Program Concept Recog- languages, through sharing language independent analysis routines.
nition,” Proc. Seventh Ann. Knowledge-Based Software Eng. Conf., 1992. He is a member of the IEEE. He has been an instructor in Boston Uni-
[18] A. Engberts, W. Kozaczynski, and J. Ning, “Concept Recognition- versity’s Metropolitan College Computer Science Department since
Based Program Transformation,” Proc. IEEE Conf. Software Mainte- 1985.
nance, 1991.
[19] R. Schwanke, “An Intelligent Tool for Re-Engineering Software
Modularity,” Proc. 13th Int’l Conf. Software Eng., 1991.
[20] M. Shaw, “Larger Scale Systems Require Higher-Level Abstrac- Howard B. Reubenstein (S‘85–M‘90) received
tions,” Proc. Fifth Int’l Workshop Software Specification and Design, the SB and SM degrees in computer science in
1989. 1985 and the PhD degree in computer science in
[21] D. Perry and A. Wolf, “Foundations for the Study of Software Ar- 1990 for his work on the Requirements Appren-
chitecture,” ACM Software Eng. Notes, vol. 17, no. 4, 1992. tice, all from the Massachusetts Institute of
[22] P. Lutsky, “Automatic Testing by Reverse Engineering of Software Technology, Cambridge. The work presented in
Documentation,” Proc. Second Working Conf. Reverse Eng., July 1995. this paper was performed while at the MITRE
[23] G. Murphy, D. Notkin, and E. Lan, “An Empirical Study of Static Corporation as a lead scientist in a Knowledge-
Call Graph Extractors,” Univ. of Washington, Computer Science Based Software Engineering group. He is cur-
Dept, Technical Report 95-08-01. rently employed at Reasoning Inc., developing
[24] D. Garlan and M. Shaw, An Introduction to Software Architecture,“ commerical software reverse engineering and
Tutorial 15th Int’l Conf. Software Eng., 1993. transformation tools. Dr. Reubenstein is a past-general chair and pro-
[25] M. Shaw, “Larger Scale Systems Require Higher-Level Abstrac-
gram committee member of the Automated Software Engineering
tions,” Proc. Fifth Int’l Workshop Software Specification and Design,
1989. Conference, and currently on the Program Committee of the Fourth
[26] D. Perry and A. Wolf, “Foundations for the Study of Software Ar- Working Conference on Reverse Engineering. He is a member of the
chitecture,” Software Eng. Notes, vol. 17, no. 4, 1992. Association for Computing Machinery and the IEEE Computer Society.
[27] G. Abowd, R. Allen, and D. Garlan, “Style to Understand Descrip-
tions of Software Architecture,” Software Eng. Notes, vol. 18, no. 5,
1993. Also in Proc. First ACM SIGSOFT Symp. Foundations of Software
Eng., 1993. Susan N. Roberts holds a BS degree in com-
[28] W. Tracz, “Domain-Specific Software Architecture (DSSA)— puter science and an MS degree in computer
Frequently Asked Questions (FAQ), Software Eng. Notes, vol. 19, no. science, with a concentration in artificial intelli-
2, 1994. gence both from the University of Michigan. She
[29] D. Harris, H. Reubenstein, and A. Yeh, “Reverse Engineering to the recently joined Spyglass, Inc. as a software en-
Architectural Level,” Proc. 17th Int’l Conf. Software Eng., 1995. gineer. Previously, she was an artificial intelli-
[30] S. Roberts, R. Piazza, and D. Katz, “A Portable Assembler Reverse gence lead scientist at the MITRE Corporation,
Engineering Environment,” Proc. Third Working Conf. Reverse Eng., which she joined in 1990. While at MITRE, her
Nov. 1996. early work concentrated in the area of natural
[31] D Harris, A. Yeh, and H. Ruebenstein, “Extracting Architectural language understanding. Following that, she
Features from Source Code,” Automated Software Eng., vol. 3, pp. spent six years working in the area of reverse
109-138, Boston: Kluwer Academic, 1996. engineering. During this time, she developed numerous software
[32] A. Yeh, D. Harris, and M. Chase, “Manipulating Recovered Soft- workbenches for reverse engineering third generation software lan-
ware Architecture Views,” Proc. 19th Int’l Conf. Software Eng., pp. guages and assembly languages. At Spyglass, Inc., she is working on
184-194, Boston: ACM Press, May 1997. the HTML engine of the Device Mosaic web browser.

Lester J. Holtzblatt holds a BS degree in psy-


David R. Harris received the BS degree in
chology from the University of Illinois and an MS
mathematics from Cleveland State University in
degree in computer science from the University
1966 and the MS degree in mathematics from
of Massachusetts—Lowell. He also has an MA
Northeastern University in 1968. In 1993, he
degree in educational psychology from the Uni-
joined the MITRE Corporation, where he has
versity of Toronto. He is area manager for Tech-
made contributions to software reverse engi-
nical Support Services in the Boston office of
neering research and practical application. His
Reasoning, Inc. Prior to joining Reasoning, Inc.,
research interest is in building computer sys-
Holtzblatt was a senior principal engineer at the
tems to support complex engineering tasks. He
MITRE Corporation. In this role, Holtzblatt was
currently is applying artificial intelligence tech-
responsible for MITRE’s software reengineering
niques to software architecture recovery and
technology program, directing the development of new software re-
software understanding problems.
engineering technology to better support MITRE’s Department of De-
fense customers. Previously at MITRE, Mr. Holtzblatt managed the
Software Re-engineering Technology group and the Knowledge Based
Applications group. In these assignments, Holtzblatt was responsible
for evaluating and prototyping the application of advanced software
technology to support government information systems projects.

You might also like