Abstract:
There are four phases to the forward development of software: Analysis, Design, Logical Code and Environment Code. Similarly, there are four steps to complete the circle from code back to design: Restructuring, Maintenance, Design Recovery, and Analysis. At each stage of forward development, information is added; at each stage of reverse development, information is lost. The data that reside in a completed system are information about the environment, logic, and design. These data must be kept separately when reverse-developing to allow forward development from the recovered design. The complete cyclic development process will require restructurers for moving about software environments, code editors for maintenance, and code generators (and 4GLs) to generate new systems from old. The purpose of reverse engineering is to provide tools to move the representation of a system from one point to the next.
Article in the July 1992 issue of CASE Trends.
The Reverse Engineering / Forward Engineering Cycle
Restructuring
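The implementation of a computer software system occurs in three phases, here called Analysis, Design, and Code. The Code phase is further divided into Logical Code and Environmental Code, giving four phases in all.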
The Analysis phase is the determination and expression of the problem. It contains the business rules appropriate for the problem and the business needs to be met. Expression is usually in terms of the problem and problem-poser rather than computer-specific.
The Design phase expresses the Analysis phase solution in terms of the schedule, fiscal, and resource constraints, and the interactions that are necessary for completion of the task. Resource constraints are the hardware environment and target implementation personnel (experienced or entry-level programmers).
For computer system development, this phase focuses on the computer-related aspects of the solution. Data flow, modularity, dependencies and user interface are the primary items of this phase. Resource constraints are further specified in terms of systems being used, the operating system, data management software, communications protocols, coding standards, and coding language.
The Logical Code phase is the expression of the logical aspects of the design in a particular coding language, such as C or COBOL. This expression excludes environment-specific details such as input and output.
The Environmental Code phase expresses the details specific to a particular environment and set of resources. This expression is the database I/O calls (for IMS, get uniques or inserts; for DB2, selects and fetches), database definitions (such as IMS DBDs or DB2 tables and views), and other files and procedures that are required for the system to execute (such as IMS PSBs, ACBs, etc.).
The Logical and Environmental Code phases are usually executed at the same time, but they really deal with very different data. One sign that you are shifting from one phase to the other within a sequence of code is a change in the syntax of the statements. For example, while CICS instructions are embedded inline with a stream of COBOL code, the syntax of CICS command language is quite different from that of COBOL.
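The syntax shift described above can be used mechanically to separate the two kinds of code. The following is a minimal sketch, assuming the environmental statements are delimited by EXEC CICS or EXEC SQL and END-EXEC; the sample source, the `split_phases` name, and the field names are hypothetical and are not taken from any real system.

```python
import re

# Hypothetical sample: COBOL with one embedded CICS command.
SOURCE = """\
MOVE CUST-ID TO WS-KEY.
EXEC CICS READ FILE('CUSTFILE') INTO(WS-REC) RIDFLD(WS-KEY) END-EXEC.
ADD 1 TO WS-COUNT.
"""

# Assumption: environmental code is whatever sits between EXEC ... END-EXEC.
EXEC_BLOCK = re.compile(r"EXEC\s+(CICS|SQL)\b.*?END-EXEC\.?", re.DOTALL)


def split_phases(source):
    """Return (logical, environmental) lists of statements."""
    environmental = [m.group(0).strip() for m in EXEC_BLOCK.finditer(source)]
    remainder = EXEC_BLOCK.sub("", source)
    logical = [line.strip() for line in remainder.splitlines() if line.strip()]
    return logical, environmental


logical, environmental = split_phases(SOURCE)
```

Here `logical` holds the two COBOL statements and `environmental` holds the single CICS command. A real tool would need a proper parser, since environmental details also live in file descriptions, JCL, and database definitions that no delimiter marks.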

Note that each phase down the chart (see Figure 1) is increasingly specific, and each phase requires additional human input; that is, human intervention adds to the knowledge representing the system at each step. Likewise, each phase up the chart is increasingly abstract and less specific.
Each phase also contributes a specific set of knowledge about the problem solution. The Analysis phase contains business rules. The Design phase contains logic, and the Code phase contains knowledge about the execution environment.
Much interest is currently being shown in retracing the steps of system development, particularly for older systems, which have little documentation of their development or current functionality. Indeed, the system may not have been "designed" but rather was badly constructed with no methodological approach. Two terms related to these activities have come into vogue: reverse engineering and design recovery.
First, it is important to make a distinction between design recovery and reverse engineering. Both start with existing system code. System code is that amalgam of code fragments that together are required to generate the running system. This starts with program source code (such as COBOL), but also includes database definitions, data access control statements, transaction control statements, human interfaces, and job control statements. In the IBM world, these would be Database Descriptions (DBDs), Program Specification Blocks (PSBs), Application Control Blocks (ACBs), Message Format Service (MFS) screen maps, CICS BMS screen maps, report layouts, and Job Control Language (JCL).
Reverse engineering is a set of transformations that can be applied to these data to move them to an equivalent state in the forward development path. Examples would be code restructurers, "pretty printers", and conversion programs that change data access methods, say from IMS to DB2.
Design recovery is a less algorithmic process. When a system designer designs a software system, there is a certain set of things s/he would do. The designer creates a series of charts to map data flow (Entity-Relation (E-R) diagrams, data flow diagrams, data model diagrams), and a series of charts to show data or functional dependency (E-R diagrams, structure charts, state transition diagrams, etc.). The designer would do some data descriptions and some functional descriptions. Design recovery is the attempt to derive those data, to the extent possible, from the system code.
At each step in the reverse development process, more information is lost. Information that resides in the code as part of the instance expression of the problem solution is selectively discarded to expose the design "essence". As you ascend the arch shown in Figure 1, each step sheds the knowledge that had been added in the parallel step down the arch in forward development. And herein lie the two problems at the heart of dealing with reverse development: the abstract nature of design and the lack of instance-specific data.
First, because of the more abstract nature of a system design, there is more than one instance of a code expression that will fulfill the design (just as there is more than one design that can be expressed by the same code). For example, if the COBOL code is...
P1.
    PERFORM P2.
    PERFORM P3.
    PERFORM P2.
    PERFORM P3.
    PERFORM P2.
    PERFORM P3.
...and you want to derive structural dependency, the iterating instances and their sequence are lost in the dependency structure diagram:
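As a rough illustration of that loss, the sketch below collapses the PERFORM sequence of P1 into a dependency structure. The list-of-pairs representation and the `dependency_structure` name are assumptions made for this example, not part of any particular tool.

```python
# The PERFORM statements of paragraph P1, in execution order.
performs = [("P1", "P2"), ("P1", "P3"),
            ("P1", "P2"), ("P1", "P3"),
            ("P1", "P2"), ("P1", "P3")]


def dependency_structure(calls):
    """Collapse a call sequence into a map of caller -> set of callees."""
    structure = {}
    for caller, callee in calls:
        structure.setdefault(caller, set()).add(callee)
    return structure


structure = dependency_structure(performs)

# Six ordered PERFORMs reduce to two unordered dependency edges.
edge_count = sum(len(callees) for callees in structure.values())
```

The result records only that P1 depends on P2 and P3. The number of repetitions and their order cannot be recovered from `structure` alone, so many different code sequences map to the same diagram.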

Furthermore, there is a wealth of instance-specific data that is discarded in deriving the design. Examples are access methods for DBDs, terminal-specific attributes for screen images, file attributes from files described in JCL or COBOL File Description statements (FDs) and Environment Divisions.
The Reverse Engineering / Forward Engineering Cycle
Intuitively, reverse development should be complementary to forward development; the process should be commutative. Yet the above two problems make the completion of that circle very difficult.
In addition, the top of the arc, the Analysis Phase, is not addressed. This would be the phase that derives business rules from existing code. This phase is ill-defined even for new systems. There is no standard methodology for expressing business requirements to be handed to systems analysts, nor are there programs that could take an expression of business rules, have knowledge of the computing environment, and generate a design which could then be processed by a code generator to create a finished computer system. When more work is done in this area from the standpoint of forward design, it will be easier for reverse processing to deal with the question.
Fortunately, to be able to make the transition from the reverse development branch to the forward development branch does not require a faithful adherence to the arc through the Analysis phase; intermediate bridges can be built. The requirements for each bridge change with how high up the arch the bridge is to be placed.
There are two intermediate bridges that represent the most common functions of system maintenance: what I will call Restructuring and Maintenance.
Restructuring is a set of linear transformations applied to a system that do not alter the inherent correctness or knowledge expressed by the system. Examples of Restructuring are converting to a new language, converting to a new database engine, enforcing corporate coding standards (for naming, structure, format), and converting to a new operating system.
For example, converting from one language to another within a syntactic family could be as simple as a translation facility.
COBOL:
    PERFORM P1 VARYING A FROM 1 BY 1 UNTIL A = B.
C:
    for (a = 1; a != b; ++a)
    {
        p1();
    }
This transformation requires little knowledge of the other data that have gone into the system, such as the design or logic of the code.
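A translation facility of this kind can be sketched as a purely syntactic rewrite. The example below is a minimal sketch that handles only the single PERFORM ... VARYING statement shape shown above; the `translate` and `c_name` names are hypothetical. It emits `!=` as the loop condition because UNTIL tests for equality before each iteration, and it assumes the performed paragraph becomes a C function with no arguments.

```python
import re

# Matches only the one statement shape used in the example above.
PERFORM_VARYING = re.compile(
    r"PERFORM\s+(?P<para>[\w-]+)\s+VARYING\s+(?P<var>[\w-]+)\s+"
    r"FROM\s+(?P<start>\d+)\s+(?:BY\s+(?P<step>\d+)\s+)?"
    r"UNTIL\s+(?P=var)\s+(?:EQUALS|=)\s+(?P<limit>[\w-]+)\s*\.",
    re.IGNORECASE,
)


def c_name(cobol_name):
    """Map a COBOL name to a C identifier (lower case, hyphens to underscores)."""
    return cobol_name.lower().replace("-", "_")


def translate(statement):
    """Rewrite one PERFORM ... VARYING statement as a C for loop."""
    m = PERFORM_VARYING.fullmatch(statement.strip())
    if m is None:
        raise ValueError("unsupported statement: " + statement)
    var, limit = c_name(m["var"]), c_name(m["limit"])
    step = m["step"] or "1"  # treat a missing BY phrase as a step of 1
    update = f"++{var}" if step == "1" else f"{var} += {step}"
    return (f"for ({var} = {m['start']}; {var} != {limit}; {update})\n"
            "{\n"
            f"    {c_name(m['para'])}();\n"
            "}")


c_code = translate("PERFORM P1 VARYING A FROM 1 UNTIL A EQUALS B.")
```

The function consults nothing but the text of the statement itself, which is the point: no design or logic knowledge is needed for this class of restructuring.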
Maintenance alters the knowledge within a system. In the MIS world, a payroll union calculation is changed to reflect a new contract, a new type of insurance policy is added to the others that can be processed, an inventory system is modified to handle a new part. Given the investment already made in an existing, working system, it would be foolish to throw away that investment to create a wholly new system when only a small part of the system is to be changed.
This is the major omission of current CASE products.

On the other hand, if the bridge is to be built across at the design level, and the logic and environment data have been lost, it is impossible to complete the crossing (unless humans are required to re-input the lost data). Therefore, when recovering design, there must be some way to extract and store the logic and environment of the source code.
A symmetrical reverse/forward engineering process should follow these steps:

A change to a design may alter the design so as to be inconsistent with the captured logical or environmental data. For example, it is possible to move a logical block to a point preceding the setting of a control variable. Therefore, provision must be made to allow human input to prevent or resolve logical or environmental inconsistencies.
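One such inconsistency, a block moved ahead of the point where its control variable is set, can be detected mechanically and then handed to a person to resolve. The following is a minimal sketch under a simplified, hypothetical model in which each logical block lists the variables it sets and reads; the `Block` class, the block names, and the variable names are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A logical block with the variables it sets and reads (hypothetical model)."""
    name: str
    sets: set = field(default_factory=set)
    reads: set = field(default_factory=set)


def find_inconsistencies(blocks):
    """Return (block name, variable) pairs read before any earlier block sets them."""
    defined = set()
    problems = []
    for block in blocks:
        for var in sorted(block.reads - defined):
            problems.append((block.name, var))
        defined |= block.sets
    return problems


init = Block("INIT", sets={"EOF-FLAG"})
loop = Block("READ-LOOP", sets={"RECORD"}, reads={"EOF-FLAG"})

original_order = [init, loop]
edited_order = [loop, init]  # a design change moved READ-LOOP ahead of INIT

problems = find_inconsistencies(edited_order)
```

The original order produces no findings, while the edited order reports that READ-LOOP reads EOF-FLAG before it is set. The check only flags the problem; deciding whether to reject the change or repair the captured logic remains a human decision.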
Bits and pieces of the development cycle are being implemented. Micro Focus has the Workbench for code creation and maintenance. Companies like Texas Instruments, KnowledgeWare, and Bachman are addressing the data part of the analysis phase, and companies like Aion are addressing the process part of the analysis phase. Easel and Borland are addressing the 4GL/code generator phase.
For reverse development, among others, Intersolv has Excelerator for Design Recovery to derive process information, and Bachman has tools to recover database information and data relationships.
Still, there are critical pieces missing. Even to the design level, no one has put together an integrated platform that can support the cycle from design to implementation in the forward development line, much less an integrated environment that encompasses existing systems.
There is very little in the way of intelligence to support the programmer/analyst (though the Ada specifications show the way).
And last, there is little formalism in the codification of the rules by which a company operates.
As existing reverse development pieces are implemented, and as the missing pieces are provided, the process of generating and maintaining computer systems will be transformed.