System Software


UNIT-1

INTRODUCTION TO SYSTEM SOFTWARE

Software is a set of instructions or programs written to carry out certain tasks on digital computers. It
is classified into system software and application software. System software consists of a variety of programs
that support the operation of a computer, while application software focuses on an application or problem to be
solved.

Examples of system software are the operating system, compiler, assembler, macro processor, loader
or linker, debugger, text editor, database management systems (some of them), and software engineering tools.
These programs make it possible for the user to focus on an application or other problem to be solved, without
needing to know the details of how the machine works internally.

System Software and Machine Architecture:

One characteristic in which most system software differs from application software is machine dependency.

System software supports operation and use of computer. Application software provides solution to a
problem. Assembler translates mnemonic instructions into machine code. The instruction formats, addressing
modes etc., are of direct concern in assembler design. Similarly,

Compilers must generate machine language code, taking into account such hardware characteristics
as the number and type of registers and the machine instructions available. Operating systems are directly
concerned with the management of nearly all of the resources of a computing system.

There are aspects of system software that do not directly depend upon the type of computing system:
the general design and logic of an assembler, the general design and logic of a compiler, and code optimization
techniques are independent of the target machine. Likewise, the process of linking together independently
assembled subprograms does not usually depend on the computer being used.
The Simplified Instructional Computer (SIC):

Simplified Instructional Computer (SIC) is a hypothetical computer that includes the hardware features
most often found on real machines. There are two versions of SIC: the standard model (SIC) and the extended
version (SIC/XE, where XE stands for "extra equipment" or "extra expensive").

SIC Machine Architecture:

We discuss here the SIC machine architecture with respect to its Memory and Registers, Data Formats,
Instruction Formats, Addressing Modes, Instruction Set, Input and Output

 Memory:

There are 2^15 bytes (32,768 bytes) in the computer memory. Each location in memory contains an
8-bit byte, and 3 consecutive bytes form a word.

 Registers:

There are five registers, each 24 bits in length. Their mnemonic, number and use are given in the
following table.

Mnemonic Number Use

A 0 Accumulator; used for arithmetic operations

X 1 Index register; used for addressing

L 2 Linkage register; JSUB

PC 8 Program counter

SW 9 Status word, including CC

 Data Formats:
Integers are stored as 24-bit binary numbers. 2’s complement representation is used for negative
values, characters are stored using their 8-bit ASCII codes.No floating-point hardware on the standard version
of SIC.
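The 24-bit two's-complement integer format described above can be sketched in a few lines of Python (an illustration of the representation, not part of the SIC definition):

```python
def to_word(n):
    """Encode an integer as a 24-bit two's-complement value."""
    assert -(2**23) <= n < 2**23
    return n & 0xFFFFFF

def from_word(w):
    """Decode a 24-bit word back to a signed integer."""
    return w - 0x1000000 if w & 0x800000 else w

print(hex(to_word(-1)))     # 0xffffff
print(from_word(0xFFFFFB))  # -5
```

Masking with 0xFFFFFF keeps exactly 24 bits; the top bit (0x800000) acts as the sign bit.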

 Instruction Formats:

Opcode (8) | x (1) | Address (15)

All machine instructions on the standard version of SIC have the 24-bit format shown above: an 8-bit
opcode, a 1-bit index flag x, and a 15-bit address.

 Addressing Modes:

Mode Indication Target address calculation

Direct x=0 TA = address

Indexed x=1 TA = address + (x)

There are two addressing modes available, which are as shown in the above table.
Parentheses are used to indicate the contents of a register or a memory location.
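The two target-address calculations in the table can be sketched as follows (a hypothetical helper for illustration, not SIC code):

```python
def target_address(address, x_bit, x_reg=0):
    """Direct mode: TA = address. Indexed mode: TA = address + (X)."""
    return address + (x_reg if x_bit else 0)

assert target_address(0x1000, 0) == 0x1000           # direct (x = 0)
assert target_address(0x1000, 1, x_reg=3) == 0x1003  # indexed (x = 1)
```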

 Instruction Set :

1. SIC provides, load and store instructions (LDA, LDX, STA, STX, etc.). Integer arithmetic operations:
(ADD, SUB, MUL, DIV, etc.).
2. All arithmetic operations involve register A and a word in memory, with the result being left in the
register. Two instructions are provided for subroutine linkage.
3. COMP compares the value in register A with a word in memory; this instruction sets a condition code
CC to indicate the result. There are conditional jump instructions (JLT, JEQ, JGT); these instructions
test the setting of CC and jump accordingly.
4. JSUB jumps to the subroutine placing the return address in register L, RSUB returns by jumping to the
address contained in register L.

 Input and Output:


Input and Output are performed by transferring 1 byte at a time to or from the rightmost 8 bits of
register A (accumulator). The Test Device (TD) instruction tests whether the addressed device is ready to send
or receive a byte of data. Read Data (RD), Write Data (WD) are used for reading or writing the data.

 Data movement and Storage Definition

LDA, STA, LDL, STL, LDX, STX (A – accumulator, L – linkage register, X – index register) all
use a 3-byte word. LDCH and STCH, associated with characters, use 1 byte. There are no memory-to-memory
move instructions.

Storage definitions are

 WORD - ONE-WORD CONSTANT

 RESW - ONE-WORD VARIABLE

 BYTE - ONE-BYTE CONSTANT

 RESB - ONE-BYTE VARIABLE

Example Programs (SIC):

Example 1: Simple data and character movement operation

LDA FIVE

STA ALPHA

LDCH CHARZ

STCH C1

ALPHA RESW 1

FIVE WORD 5

CHARZ BYTE C’Z’

C1 RESB 1

Example 2: Arithmetic operations

LDA ALPHA
ADD INCR

SUB ONE

STA BETA

……..

……..

……..

ONE WORD 1

ALPHA RESW 1

BEETA RESW 1

INCR RESW 1

Example 3: Looping and Indexing operation

LDX ZERO ;X=0

MOVECH LDCH STR1, X ; LOAD A FROM STR1

STCH STR2, X ; STORE A TO STR2

TIX ELEVEN ; ADD 1 TO X, TEST

JLT MOVECH

STR1 BYTE C ‘HELLO WORLD’

STR2 RESB 11

ZERO WORD 0

ELEVEN WORD 11
Example 4:Input and Output operation

INLOOP TD INDEV : TEST INPUT DEVICE

JEQ INLOOP : LOOP UNTIL DEVICE IS READY

RD INDEV : READ ONE BYTE INTO A

STCH DATA : STORE A TO DATA

OUTLP TD OUTDEV : TEST OUTPUT DEVICE

JEQ OUTLP : LOOP UNTIL DEVICE IS READY

LDCH DATA : LOAD DATA INTO A

WD OUTDEV : WRITE A TO OUTPUT DEVICE

INDEV BYTE X'F5' : INPUT DEVICE NUMBER

OUTDEV BYTE X'08' : OUTPUT DEVICE NUMBER

DATA RESB 1 : ONE-BYTE VARIABLE

Example 5: To transfer two hundred bytes of data from input device to memory

LDX ZERO

CLOOP TD INDEV

JEQ CLOOP

RD INDEV

STCH RECORD, X
TIX B200

JLT CLOOP

INDEV BYTE X ‘F5’

RECORD RESB 200

ZERO WORD 0

B200 WORD 200

SIC/XE Machine Architecture:

 Memory

Maximum memory available on a SIC/XE system is 1 megabyte (2^20 bytes).

 Registers

Additional B, S, T, and F registers are provided by SIC/XE, in addition to the registers of SIC.

Mnemonic Number Special use

B 3 Base register

S 4 General working register

T 5 General working register

F 6 Floating-point accumulator (48 bits)

 Floating-point data type:


There is a 48-bit floating-point data type with the format shown below: a 1-bit sign s, an 11-bit
exponent, and a 36-bit fraction f. The value represented is f*2^(e-1024).

1 11 36

s exponent fraction

 Instruction Formats:

The new set of instruction formats for the SIC/XE machine architecture is as follows.

 Format 1 (1 byte): contains only operation code (straight from table).

 Format 2 (2 bytes): first eight bits for operation code, next four for register 1 and following four for
register 2. The numbers for the registers go according to the numbers indicated at the registers section
(ie, register T is replaced by hex 5, F is replaced by hex 6).
 Format 3 (3 bytes): First 6 bits contain operation code, next 6 bits contain flags, last 12 bits contain
displacement for the address of the operand. Operation code uses only 6 bits, thus the second hex digit
will be affected by the values of the first two flags (n and i). The flags, in order, are: n, i, x, b, p, and e.
Their functionality is explained in the next section. The last flag, e, indicates the instruction
format (0 for format 3 and 1 for format 4).
 Format 4 (4 bytes): same as format 3 with an extra 2 hex digits (8 bits) for addresses that require more
than 12 bits to be represented.

Format 1 (1 byte)

op

Format 2 (2 bytes)
op r1 r2

Formats 1 and 2 are for instructions that do not reference memory at all.

Format 3 (3 bytes)

6 1 1 1 1 1 1 12

op n i x b p e disp

Format 4 (4 bytes)

6 1 1 1 1 1 1 20

op n i x b p e address
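As an illustration of the format-3 layout shown above, the following sketch packs the opcode, flags, and 12-bit displacement into three bytes. The field packing follows the bit widths in the diagram; treat it as a sketch under those assumptions rather than a reference encoder:

```python
def format3(opcode, n, i, x, b, p, disp):
    """Assemble a 3-byte SIC/XE instruction (e = 0). The opcode's low
    2 bits must be zero; they are replaced by the n and i flags."""
    assert 0 <= disp < 2**12
    b1 = (opcode & 0xFC) | (n << 1) | i            # 6-bit opcode + n, i
    b2 = (x << 7) | (b << 6) | (p << 5) | ((disp >> 8) & 0x0F)  # flags + disp hi
    b3 = disp & 0xFF                               # disp low byte
    return bytes([b1, b2, b3])

# LDA (opcode 0x00), simple PC-relative addressing (n = i = 1, p = 1):
print(format3(0x00, 1, 1, 0, 0, 1, 0x02D).hex().upper())  # 03202D
```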

 Addressing modes & Flag Bits

Five possible addressing modes plus the combinations are as follows.

1. Direct (x, b, and p all set to 0): the operand address is used as it is. n and i are both set to the same value,
either 0 or 1. In general that value is 1; if both are set to 0 in format 3, the instruction is treated as a
standard SIC instruction, and the b, p, and e flags become part of the 15-bit operand address, making the format
compatible with the SIC format.

2. Relative (either b or p equal to 1 and the other one to 0): the address of the operand should be added to
the current value stored at the B register (if b = 1) or to the value stored at the PC register (if p = 1)

3. Immediate(i = 1, n = 0): The operand value is already enclosed on the instruction (ie. lies on the last
12/20 bits of the instruction)

4. Indirect(i = 0, n = 1): The operand value points to an address that holds the address for the operand
value.
5. Indexed (x = 1): value to be added to the value stored at the register x to obtain real address of the
operand. This can be combined with any of the previous modes except immediate.

The various flag bits used in the above formats have the following meanings:

Bit e: e = 0 means format 3, e = 1 means format 4.

Bits x, b, p: used to calculate the target address using relative, direct, and indexed addressing modes.

Bits i and n: say how the target address is to be used.

When b and p are both set to 0, the disp field of a format 3 instruction is taken to be the target address.
For a format 4 instruction, bits b and p are normally set to 0, and the 20-bit address field is the target address.

x: when x is set to 1, the value of register X is added in the target address calculation.

i=1, n=0 Immediate addressing: TA itself is used as the operand value; no memory reference is made.

i=0, n=1 Indirect addressing, ((TA)): the word at location TA is fetched, and its value is taken as the address of
the operand value.

i=0, n=0 or i=1, n=1 Simple addressing, (TA): TA is taken as the address of the operand value.

Two new relative addressing modes are available for use with instructions assembled using format 3.

Mode Indication Target address calculation

Base relative b=1, p=0 TA = (B) + disp (0 <= disp <= 4095)

Program-counter relative b=0, p=1 TA = (PC) + disp (-2048 <= disp <= 2047)
 Instruction Set:

SIC/XE provides all of the instructions that are available on the standard version. In addition there are:
instructions to load and store the new registers (LDB, STB, etc.), floating-point arithmetic operations (ADDF,
SUBF, MULF, DIVF), a register move instruction (RMO), register-to-register arithmetic operations (ADDR, SUBR,
MULR, DIVR), and a supervisor call instruction (SVC).

 Input and Output:

There are I/O channels that can be used to perform input and output while the CPU is executing other
instructions. This allows overlap of computing and I/O, resulting in more efficient system operation. The
instructions SIO, TIO, and HIO are used to start, test, and halt the operation of I/O channels.

Example Programs (SIC/XE)

Example 1: Simple data and character movement operation

LDA #5

STA ALPHA

LDA #90

STCH C1

ALPHA RESW 1

C1 RESB 1

Example 2: Arithmetic operations

LDS INCR

LDA ALPHA

ADDR S, A

SUB #1

STA BETA

………….

…………..
ALPHA RESW 1

BETA RESW 1

INCR RESW 1

Example 3: Looping and Indexing operation

LDT #11

LDX #0 : X = 0

MOVECH LDCH STR1, X : LOAD A FROM STR1

STCH STR2, X : STORE A TO STR2

TIXR T : ADD 1 TO X, TEST (T)

JLT MOVECH

……….

……….

………

STR1 BYTE C ‘HELLO WORLD’

STR2 RESB 11

ASSEMBLERS-1

Basic Assembler Functions:

The basic assembler functions are:

 Translating mnemonic language code to its equivalent object code.

 Assigning machine addresses to symbolic labels.

SOURCE PROGRAM -> ASSEMBLER -> OBJECT CODE

• The design of assembler can be to perform the following:


– Scanning (tokenizing)

– Parsing (validating the instructions)


– Creating the symbol table

– Resolving the forward references

– Converting into the machine language


 SIC Assembler Directive:

– START: Specify name & starting address.


– END: End of the program, specify the first execution instruction.

– BYTE, WORD, RESB, RESW

– End of record: a null character (00)

– End of file: a zero-length record

• The design of the assembler, in other words:

– Convert mnemonic operation codes to their machine language equivalents

– Convert symbolic operands to their equivalent machine addresses

– Decide the proper instruction format

– Convert the data constants to internal machine representations

– Write the object program and the assembly listing

So for the design of the assembler we need to concentrate on the machine architecture of the SIC/XE machine.
We need to identify the algorithms and the various data structures to be used. In addition to the steps above,
the assembler also has to handle assembler directives; these do not generate object code but direct the
assembler to perform certain operations (START, END, BYTE, WORD, RESB, RESW).

The assembler design can be done:

 Single pass assembler

 Multi-pass assembler

Single-pass Assembler:

In this case the whole process of scanning, parsing, and object code conversion is done in a single pass.
The only problem with this method is resolving forward references. This is shown with an example below:

10 1000 FIRST STL RETADR 141033

--

--

--

95 1033 RETADR RESW 1

In the above example, in line number 10 the instruction STL will store the contents of the linkage
register into RETADR. But during the processing of this instruction the value of this symbol is not known, as it
is defined at line number 95. Since in a single-pass assembler the scanning, parsing, and object code conversion
happen simultaneously, the instruction is fetched, scanned for tokens, and parsed for syntactic and semantic
validity. If it is valid then it has to be converted to its equivalent object code. For this, the object code is
generated for the opcode STL, and the value of the symbol RETADR needs to be added, which is not available.

Due to this reason usually the design is done in two passes. So a multi-pass assembler resolves the
forward references and then converts into the object code. Hence the process of the multi-pass assembler can be
as follows:

Pass-1

 Assign addresses to all the statements

 Save the addresses assigned to all labels to be used in Pass-2

 Perform some processing of assembler directives such as RESW, RESB to find the length of data
areas for assigning the address values.

 Define the symbols in the symbol table (generate the symbol table).

Pass-2

 Assemble the instructions (translating operation codes and looking up addresses).

 Generate data values defined by BYTE, WORD etc.

 Perform the processing of the assembler directives not done during pass-1.

 Write the object program and assembler listing.

Assembler Design:

The most important concerns are the generation of the symbol table and the resolution of
forward references.
• Symbol Table:

– This is created during pass 1


– All the labels of the instructions are symbols

– Table has entry for symbol name, address value.


• Forward reference:

– Symbols that are defined in the later part of the program are called forward references.
– There will not be any address value for such symbols in the symbol table in pass 1.
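The role of the two passes in resolving a forward reference can be sketched with a toy three-line program (the tuple-based line format here is invented for illustration):

```python
program = [
    ("FIRST",  "STL",  "RETADR"),   # RETADR is used here ...
    (None,     "RSUB", None),
    ("RETADR", "RESW", "1"),        # ... but defined only here
]

loc, symtab = 0x1000, {}
for label, opcode, operand in program:      # pass 1: assign addresses to labels
    if label:
        symtab[label] = loc
    loc += 3                                # every SIC statement here is 3 bytes

for label, opcode, operand in program:      # pass 2: operands are now resolvable
    addr = symtab.get(operand, "")
    print(opcode, operand or "", hex(addr) if addr else "")
```

Pass 2 can look up RETADR because pass 1 already recorded its address (0x1006).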

Example Program:

The example program considered here has a main module and two subroutines.

• Purpose of example program

- Reads records from input device (code F1)

- Copies them to output device (code 05)

- At the end of the file, writes EOF on the output device, then RSUB to the operating system

• Data transfer (RD, WD)


-A buffer is used to store record

-Buffering is necessary for different I/O rates

-The end of each record is marked with a null character (hexadecimal 00)

-The end of the file is indicated by a zero-length record

 Subroutines (JSUB, RSUB)


-RDREC, WRREC

-Save link register first before nested jump


The first column shows the line number for that instruction, second column shows the addresses
allocated to each instruction. The third column indicates the labels given to the statement, and is followed by
the instruction consisting of opcode and operand. The last column gives the equivalent object code.

The object code later will be loaded into memory for execution. The simple object program we use
contains three types of records:

• Header record
- Col. 1 H

- Col. 2~7 Program name

- Col. 8~13 Starting address of object program (hex)

- Col. 14~19 Length of object program in bytes (hex)

• Text record
- Col. 1 T

- Col. 2~7 Starting address for object code in this record (hex)
- Col. 8~9 Length of object code in this record in bytes (hex)

- Col. 10~69 Object code, represented in hex (2 col. per byte)

• End record
- Col.1 E

- Col.2~7 Address of first executable instruction in object program (hex). (The character "^" is used only to
separate fields for readability.)
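The column layouts above translate directly into format strings; the following sketch builds the three record types (the program name COPY and the addresses are invented example values):

```python
def header_record(name, start, length):
    # H + name padded to 6 cols + 6-digit hex start + 6-digit hex length
    return "H%-6s%06X%06X" % (name, start, length)

def text_record(start, objcodes):
    # T + 6-digit hex start + 2-digit hex byte count + concatenated object code
    body = "".join(objcodes)
    return "T%06X%02X%s" % (start, len(body) // 2, body)

def end_record(first_exec):
    # E + 6-digit hex address of the first executable instruction
    return "E%06X" % first_exec

print(header_record("COPY", 0x1000, 0x107A))         # HCOPY  00100000107A
print(text_record(0x1000, ["141033", "483039"]))
print(end_record(0x1000))
```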

Simple SIC Assembler

The program below is shown with the object code generated. The column named LOC gives the machine
addresses of each part of the assembled program (assuming the program is starting at location 1000). The
translation of the source program to the object program requires us to accomplish the following functions:

1. Convert the mnemonic operation codes to their machine language equivalent.


2. Convert symbolic operands to their equivalent machine addresses.
3. Build the machine instructions in the proper format.
4. Convert the data constants specified in the source program into their internalmachine
representations in the proper format.
5. Write the object program and assembly listing.

All these steps except the second can be performed by sequential processing of the source program, one line at a
time. Consider the instruction

10 1000 LDA ALPHA 00-----

This instruction contains a forward reference; i.e., the symbol ALPHA is used before it is defined. If
the program is processed (scanning, parsing, and object code conversion) line by line, we will be unable to
resolve the address of this symbol. Due to this problem most assemblers are designed to process the program in
two passes.
In addition to the translation to the object program, the assembler has to take care of handling assembler
directives. These directives are not translated into object code but give directions to the assembler to perform
some function. Examples of directives are statements like RESB and RESW, which direct the assembler to
reserve memory locations without generating data values. Other directives are START, which indicates the
beginning of the program, and END, indicating the end of the program.

The assembled program will be loaded into memory for execution. The simple object program contains
three types of records: Header record, Text record and end record. The header record contains the starting
address and length. Text record contains the translated instructions and data of the program, together with an
indication of the addresses where these are to be loaded. The end record marks the end of the object program
and specifies the address where the execution is to begin.

The format of each record is as given below.

Header record:

Col 1 H

Col 2-7 Program name

Col 8-13 Starting address of object program (hexadecimal)

Col 14-19 Length of object program in bytes (hexadecimal)

Text record:

Col 1 T

Col 2-7 Starting address for object code in this record (hexadecimal)

Col 8-9 Length of object code in this record in bytes (hexadecimal)

Col 10-69 Object code, represented in hexadecimal (2 columns per byte of object code)

 In pass 1 the OPTAB is used to look up and validate the operation code in the source program. In pass
2, it is used to translate the operation codes to machine language. In simple SIC machine this process
can be performed in either in pass 1 or in pass 2. But for machine like SIC/XE that has instructions of
different lengths, we must search OPTAB in the first pass to find the instruction length for
incrementing LOCCTR.

 In pass 2 we take the information from OPTAB to tell us which instruction format to use in
assembling the instruction, and any peculiarities of the object code instruction.
 OPTAB is usually organized as a hash table, with mnemonic operation code as the key. The hash
table organization is particularly appropriate, since it provides fast retrieval with a minimum of
searching. In most cases OPTAB is a static table; that is, entries are not normally added to or
deleted from it. In such cases it is possible to design a special hashing function or other data structure
to give optimum performance for the particular set of keys being stored.
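Since Python's dict is itself a hash table, OPTAB can be sketched as one; the few entries below follow the SIC/XE opcode values, but treat the exact set as illustrative:

```python
OPTAB = {
    "LDA":  {"opcode": 0x00, "format": 3},
    "STA":  {"opcode": 0x0C, "format": 3},
    "ADDR": {"opcode": 0x90, "format": 2},
    "RSUB": {"opcode": 0x4C, "format": 3},
}

def lookup(mnemonic):
    """Pass 1 uses this to validate the mnemonic and find the instruction
    length; pass 2 uses the opcode value for translation."""
    entry = OPTAB.get(mnemonic)
    if entry is None:
        raise ValueError("invalid operation code: " + mnemonic)
    return entry

assert lookup("ADDR")["format"] == 2
```

The dict lookup is average-case O(1), which is exactly the "fast retrieval with a minimum of searching" property noted above.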
SYMTAB:

 This table includes the name and value for each label in the source program, together with flags to
indicate the error conditions (e.g., if a symbol is defined in two different places).

 During Pass 1: labels are entered into the symbol table along with their assigned address values as they
are encountered. All symbol address values should be resolved by the end of pass 1.

 During Pass 2: symbols used as operands are looked up in the symbol table to obtain the address value to
be inserted in the assembled instructions.

 SYMTAB is usually organized as a hash table for efficiency of insertion and retrieval. Since entries are
rarely deleted, efficiency of deletion is not an important criterion for optimization.

 Both pass 1 and pass 2 require reading the source program. Apart from this an intermediate file is
created by pass 1 that contains each source statement together with its assigned address, error
indicators, etc. This file is one of the inputs to the pass2.
 A copy of the source program is also an input to pass 2, which is used to retain the results of operations
that may be performed during pass 1 (such as scanning the operation field for symbols and addressing
flags), so that these need not be performed again during pass 2. Similarly, pointers into OPTAB and
SYMTAB may be retained for each operation code and symbol used. This avoids the need to repeat many of
the table-searching operations.

LOCCTR:

Apart from the SYMTAB and OPTAB, this is another important variable which helps in the assignment of the
addresses. LOCCTR is initialized to the beginning address mentioned in the START statement of the program.
After each statement is processed, the length of the assembled instruction is added to the LOCCTR to make it
point to the next instruction. Whenever a label is encountered in an instruction the LOCCTR value gives the
address to be associated with that label.
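A sketch of how LOCCTR advances per statement (instruction length fixed at 3 bytes, as on standard SIC; directive handling follows the pass-1 rules given below):

```python
def step(locctr, opcode, operand):
    """Return the LOCCTR value after processing one statement."""
    if opcode == "WORD":
        return locctr + 3
    if opcode == "RESW":
        return locctr + 3 * int(operand)
    if opcode == "RESB":
        return locctr + int(operand)
    if opcode == "BYTE":
        # C'...' reserves one byte per character, X'...' one per hex pair
        body = operand[2:-1]
        return locctr + (len(body) if operand[0] == "C" else len(body) // 2)
    return locctr + 3   # every SIC machine instruction is 3 bytes

loc = 0x1000
loc = step(loc, "LDA", "ALPHA")     # 0x1003
loc = step(loc, "RESW", "2")        # 0x1009
loc = step(loc, "BYTE", "C'EOF'")   # 0x100C
```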


The Algorithm for Pass 1:

Begin
    read first input line
    if OPCODE = 'START' then
        begin
            save #[OPERAND] as starting address
            initialize LOCCTR to starting address
            write line to intermediate file
            read next input line
        end {if START}
    else
        initialize LOCCTR to 0
    while OPCODE != 'END' do
        begin
            if this is not a comment line then
                begin
                    if there is a symbol in the LABEL field then
                        begin
                            search SYMTAB for LABEL
                            if found then
                                set error flag (duplicate symbol)
                            else
                                insert (LABEL, LOCCTR) into SYMTAB
                        end {if symbol}
                    search OPTAB for OPCODE
                    if found then
                        add 3 (instruction length) to LOCCTR
                    else if OPCODE = 'WORD' then
                        add 3 to LOCCTR
                    else if OPCODE = 'RESW' then
                        add 3 * #[OPERAND] to LOCCTR
                    else if OPCODE = 'RESB' then
                        add #[OPERAND] to LOCCTR
                    else if OPCODE = 'BYTE' then
                        begin
                            find length of constant in bytes
                            add length to LOCCTR
                        end
                    else
                        set error flag (invalid operation code)
                end {if not a comment}
            write line to intermediate file
            read next input line
        end {while not END}
    write last line to intermediate file
    save (LOCCTR - starting address) as program length
End {Pass 1}
 It next checks the mnemonic code by searching for it in the OPTAB. If found, the length of the
instruction is added to the LOCCTR to make it point to the next instruction.
 If the opcode is the directive WORD, it adds 3 to the LOCCTR. If it is RESW, it adds 3 times the
number of data words to the LOCCTR. If it is BYTE, it adds the length of the constant in bytes; if
RESB, it adds the number of bytes reserved.
 If it is the END directive, then it is the end of the program; the length of the program is found by
evaluating the current LOCCTR minus the starting address saved from the START statement. Each
processed line is written to the intermediate file.

            if object code will not fit into the current Text record then
                begin
                    write Text record to object program
                    initialize new Text record
                end
            add object code to Text record
        end {if not comment}
        write listing line
        read next input line
    end {while not END}
    write last Text record to object program
    write End record to object program
    write last listing line
End {Pass 2}

Here the first input line is read from the intermediate file. If the opcode is START, then this line is directly
written to the list file. A header record is written in the object program which gives the starting address and the
length of the program (which is calculated during pass 1). Then the first text record is initialized. Comment
lines are ignored. In the instruction, for the opcode the OPTAB is searched to find the object code.

If a symbol is there in the operand field, the symbol table is searched to get the address value for it,
which gets added to the object code of the opcode. If the symbol is not found, then zero is stored as the
operand address, an error flag is set indicating that the symbol is undefined, and the object code instruction is
assembled with that zero value.
If the opcode is BYTE or WORD, then the constant value is converted to its equivalent object code
(for example, for the character constant EOF, its equivalent hexadecimal value 454F46 is stored). If the object
code cannot fit into the current text record, a new text record is created and the remaining object code is placed
there. The text records are written to the object program. Once the whole program is assembled and the END
directive is encountered, the End record is written.

Design and Implementation Issues

Some of the features in the program depend on the architecture of the machine. If the program is for SIC
machine, then we have only limited instruction formats and hence limited addressing modes. We have only
single operand instructions. The operand is always a memory reference. Anything to be fetched from memory
requires more time. Hence the improved version of SIC/XE machine provides more instruction formats and
hence more addressing modes. The moment we change the machine architecture the availability of number of
instruction formats and the addressing modes changes. Therefore the design usually requires considering two
things: Machine-dependent features and Machine-independent features.

Machine-Dependent Assembler Features:

 Instruction formats and addressing modes

 Program relocation.

Instruction formats and Addressing Modes

The instruction formats depend on the memory organization and the size of the memory. In the SIC machine
the memory is byte addressable and the word size is 3 bytes, so the size of the memory is 2^15 (32,768) bytes.
Accordingly it supports only one instruction format. It has only two registers involved in addressing: register A
and the index register X. Therefore the addressing modes supported by this architecture are direct and indexed.
The memory of a SIC/XE machine, on the other hand, is 2^20 bytes (1 MB). It supports four different
instruction formats, namely:
 1 byte instruction
 2 byte instruction
 3 byte instruction
 4 byte instruction

• Instructions can be:

– Instructions involving register to register

– Instructions with one operand in memory, the other in the accumulator (single-operand instruction)

– Extended instruction format

• Addressing Modes are:

– Index Addressing(SIC): Opcode m, x

– Indirect Addressing: Opcode @m


– PC-relative: Opcode m
– Base relative: Opcode m

– Immediate addressing: Opcode #c

1. Translations for the Instruction involving Register-Register addressing mode:

During pass 1 the registers can be entered as part of the symbol table itself. The value for these registers is their
equivalent numeric codes. During pass2, these values are assembled along with the mnemonics object code. If
required a separate table can be created with the register names and their equivalent numeric values.

2. Translation involving Register-Memory instructions:

In the SIC/XE machine there are four instruction formats and five addressing modes, as described earlier.

Among the instruction formats, format 3 and format 4 instructions are register-memory type
instructions. One of the operands is always in a register and the other operand is in memory. The addressing
mode tells us the way in which the operand is to be fetched from memory.

There are two ways: program-counter relative and base relative. These addressing modes can be represented
by either the format 3 or the format 4 instruction format. In format 3, the instruction has the opcode followed
by a 12-bit displacement value in the address field, whereas in format 4 the instruction contains the opcode
followed by a 20-bit address value.

Program-Counter Relative:
Format 3 is usually used in this mode. The instruction contains the opcode followed by a 12-bit
displacement value.

The range of displacement values is from -2048 to 2047. This displacement (which must be small enough to
fit in a 12-bit field) is added to the current contents of the program counter to get the target address of the
operand required by the instruction.

This is a relative way of calculating the address of the operand: the displacement of the operand is relative
to the current program counter value. The following example shows how the displacement is calculated:
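A small sketch of the displacement computation (the PC already holds the address of the next instruction when the displacement is applied, and the 12-bit field is stored in two's complement):

```python
def pc_relative_disp(operand_addr, next_instr_addr):
    """Return the 12-bit PC-relative displacement, or raise if out of range."""
    disp = operand_addr - next_instr_addr
    if not -2048 <= disp <= 2047:
        raise ValueError("out of range; assembler must fall back to base-relative")
    return disp & 0xFFF   # encode as a 12-bit two's-complement field

# Instruction at 0x0000 (3 bytes long), operand at 0x0030:
assert pc_relative_disp(0x0030, 0x0003) == 0x02D
```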
Base-Relative Addressing Mode:

In this mode the base register is used together with the displacement value. Therefore the target address is

TA = (base) + displacement value

 This addressing mode is used when the range of the PC-relative displacement is not sufficient; here the
operand is not relative to the instruction as in PC-relative addressing mode. Whenever this mode is used,
it is indicated by using the directive BASE.
 The moment the assembler encounters this directive the next instruction uses base- relative addressing
mode to calculate the target address of the operand.

 When the NOBASE directive is used, it indicates that the base register is no longer used to calculate the
target address of the operand. The assembler first chooses PC-relative; when the displacement field is
not sufficient it uses base-relative.

LDB #LENGTH (instruction)

BASE LENGTH (directive)

:

NOBASE (directive)
UNIT – 2
ASSEMBLERS
An assembler is a program for converting instructions written in low-level assembly code into
relocatable machine code, generating along with it information for the loader. It is necessary to
convert user-written programs into machine code; this is the translation of the symbolic assembly
language into machine language.

It generates instructions by evaluating the mnemonics (symbols) in the operation field and finding
the values of symbols and literals to produce machine code. The assembler divides these tasks into
two passes:
 Pass-1:
1. Define symbols and literals and remember them in the symbol table and literal table
respectively.
2. Keep track of the location counter.
3. Process pseudo-operations.
4. Assign memory addresses to the variables and statements.
 Pass-2:
1. Generate object code by converting symbolic op-codes into their respective numeric op-codes.
2. Generate data for literals and look up the values of symbols.
3. Read the source code a second time.
4. Translate the code into object code.
Working of Pass-1:
 Define Symbol and literal table with their addresses. Note: Literal address is specified
by LTORG or END.
 Step-1: START 200
 (here no symbol or literal is found so both table would be empty)
 Step-2: MOVER R1, =’3′ 200
 ( =’3′ is a literal so literal table is made)

Literal Address

=’3′ –––

 Step-3: MOVEM R1, X 201


 X is a symbol referred to prior to its declaration, so it is stored in the symbol table with a blank
address field.
Symbol Address

X –––

 Step-4: L1 MOVER R2, =’2′ 202


 L1 is a label and =’2′ is a literal so store them in respective tables
Symbol Address

X –––

L1 202

Literal Address

=’3′ –––

=’2′ –––

 Step-5: LTORG 203


 Assign address to first literal specified by LC value, i.e., 203
Literal Address

=’3′ 203

=’2′ –––

 Step-6: X DS 1 204
 This is a data declaration statement: X is assigned a data space of 1. But X is a symbol
that was referred to earlier in step 3 and is only defined here in step 6. This situation is called the
Forward Reference Problem, where a variable is referred to prior to its declaration; it can
be solved by back-patching. The assembler now assigns X the address specified by the
LC value of the current step.
Symbol Address

X 204

L1 202

 Step-7: END 205


 Assembly finishes here, and the remaining literal gets the address specified by the LC
value of the END instruction. Here are the complete symbol and literal tables made by pass 1
of the assembler.
Symbol Address

X 204

L1 202

Literal Address

=’3′ 203

=’2′ 205

Now tables generated by pass 1 along with their LC value will go to pass-2 of
assembler for further processing of pseudo-opcodes and machine op-codes.
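The pass-1 bookkeeping traced above can be sketched in Python. This is a simplified model: statements are (label, opcode, operand) tuples with register operands omitted, every statement is assumed to occupy one location, and the LTORG step is left out, so literals are pooled only at END and the exact addresses differ slightly from the trace.

```python
def pass1(lines):
    """Build symbol and literal tables, back-patching forward references."""
    symtab, littab, pending = {}, {}, []
    lc = 0
    for label, opcode, operand in lines:
        if opcode == "START":
            lc = int(operand)
            continue
        if label:
            symtab[label] = lc            # back-patches a blank entry
        if operand and operand.startswith("='"):
            if operand not in littab:
                littab[operand] = None    # address assigned at END
                pending.append(operand)
        elif operand and operand.isalpha() and operand not in symtab:
            symtab[operand] = None        # forward reference: blank address
        if opcode == "END":
            for lit in pending:           # dump the literal pool
                littab[lit] = lc
                lc += 1
        else:
            lc += 1
    return symtab, littab
```

When the definition `X DS 1` is finally reached, the blank entry left for X is overwritten with the current LC value, which is exactly the back-patching step described above.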
Working of Pass-2:
Pass-2 of the assembler generates machine code by converting symbolic machine op-codes
into their respective bit configurations (machine-understandable form). It stores all
machine op-codes in the MOT (machine op-code table) with their symbolic codes, lengths and
bit configurations. It also processes pseudo-ops, which are stored in the POT
(pseudo-op table). Databases required by pass-2:
1. MOT table(machine opcode table)
2. POT table(pseudo opcode table)
3. Base table(storing value of base register)
4. LC ( location counter)
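A toy version of pass 2 using these tables might look like the sketch below. The numeric op-codes in the MOT are invented for illustration, not taken from any real machine, and storage directives are simply skipped.

```python
# Illustrative MOT and POT; the numeric op-codes are made up.
MOT = {"MOVER": 0x04, "MOVEM": 0x05, "ADD": 0x01}
POT = {"START", "DS", "LTORG", "END"}

def pass2(lines, symtab):
    """Emit (address, numeric op-code, operand address) triples,
    looking up symbols in the table built by pass 1."""
    code, lc = [], 0
    for label, opcode, operand in lines:
        if opcode == "START":
            lc = int(operand)
            continue
        if opcode in POT:
            continue          # pseudo-ops produce no machine code here
        code.append((lc, MOT[opcode], symtab.get(operand, operand)))
        lc += 1
    return code
```

The lookup `symtab.get(operand, operand)` is where the symbol addresses recorded in pass 1 replace the symbolic operands.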

 The flowchart accompanying these notes illustrates this two-pass process.


Assembler Features
• Machine Dependent Assembler Features
– Instruction formats and addressing modes
(SIC/XE)
– Program relocation
• Machine Independent Assembler Features
 Literals
 Symbol-defining statements
 Expressions
 Program blocks
 Control sections and program linking
SIC/XE INSTRUCTION -FORMATS AND ADDRESSING MODES
• PC-relative or Base-relative (BASE directive needs to be used) addressing: op m
• Indirect addressing: op @m
• Immediate addressing: op #c
• Extended format (4 bytes): +op m
• Index addressing: op m,X
• Register-to-register instructions
RELATIVE ADDRESSING MODES
• PC-relative or base-relative addressing mode is preferred over direct addressing mode.
– It can save one byte by using format 3 rather than format 4:
• reduces program storage space
• reduces program instruction fetch time
– Relocation will also be easier.
THE DIFFERENCES BETWEEN THE SIC AND SIC/XE PROGRAMS
• Register-to-register instructions are used whenever possible to improve execution speed.
– Fetching a value stored in a register is much faster than fetching it from memory.
• Immediate addressing mode is used whenever possible.
– The operand is already included in the fetched instruction, so there is no need to fetch the operand
from memory.
• Indirect addressing mode is used whenever possible.
– Just one instruction rather than two is enough.
PC OR BASE-RELATIVE MODES
• Format 3: 12-bit displacement field (in total 3 bytes)
– Base-relative: 0~4095
– PC-relative: -2048~2047
• Format 4: 20-bit address field (in total 4 bytes)
• The displacement needs to be calculated so that when the displacement is added to PC (which
points to the following instruction after the current instruction is fetched) or the base register (B),
the resulting value is the target address.
• If the displacement cannot fit into 12 bits, format 4 needs to be used (e.g., lines 15 and 125 of the
example program).
– Bit e needs to be set to indicate format 4.
– The programmer must specify the use of format 4 by putting a + before the instruction.
Otherwise, the assembler treats the out-of-range reference as an error.
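These displacement rules can be checked with a small sketch that assembles a format-3 word for the simple-addressing case (n=i=1, x=0). The first two test values (17202D for STL RETADR and the 3F2FEC jump) come from the standard COPY example; the base-relative case uses hypothetical addresses.

```python
def assemble_format3(opcode, target, pc, base=None):
    """Assemble a SIC/XE format-3 word with n=i=1 and x=0.
    Tries PC-relative first, then base-relative, as the assembler does."""
    d = target - pc                         # PC points past this instruction
    if -2048 <= d <= 2047:
        xbpe, disp = 0b0010, d & 0xFFF      # p=1; 12-bit two's complement
    elif base is not None and 0 <= target - base <= 4095:
        xbpe, disp = 0b0100, target - base  # b=1
    else:
        raise ValueError("displacement too large: use format 4 (+op, e=1)")
    word = ((opcode | 0b11) << 16) | (xbpe << 12) | disp   # n=i=1
    return f"{word:06X}"
```

Note how a backward branch produces a negative displacement that is stored in two's-complement form in the 12-bit field.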
Example of PC-relative addressing

Indirect Addressing Example


• The target address is computed as usual (either PC-relative or BASE-relative)
• We only need to set the n bit to 1 to indicate that the content stored at this
location represents the address of the operand, not the operand itself.

RELOCATABLE IS DESIRED
• The program in Fig. 2.1 specifies that it must be loaded at address 1000 for
correct execution. This restriction is too inflexible for the loader.
• If the program is loaded at a different address, say 2000, its memory references will access the
wrong data! For example:
– 55 101B LDA THREE 00102D
• Thus, we want to make programs relocatable so that they can be loaded and execute correctly at
any place in the memory.

INSTRUCTIONS NEEDS TO BE MODIFIED


• Only those instructions that use absolute (direct) addresses to reference symbols.
• The following need not be modified:
– Immediate addressing (no memory references)
– PC or Base-relative addressing (Relocatable is one advantage of relative addressing, among
others.)
– Register-to-register instructions (no memory references)
ASSEMBLER DESIGN OPTIONS
One-Pass and Multi-Pass Assemblers.
One-Pass Assemblers. One-pass assemblers are used when it is necessary or desirable to avoid a
second pass over the source program, or when the external storage for the intermediate file between
the two passes is slow or inconvenient to use.

Unit-3
Basic Functions of Loader
Assemblers and compilers are used to convert source code to object code. The loader
accepts that object code, makes it ready for execution, and helps to execute it. The loader performs
its task via four functions, as follows:

1. Allocation: It allocates memory for the program in the main memory.


2. Linking: It combines two or more separate object programs or modules and
supplies necessary information.
3. Relocation: It modifies the object program so that it can be loaded at an address
different from the one originally specified.
4. Loading: It brings the object program into the main memory for execution.

A general loading scheme is shown below:

Allocation:

In order to allocate memory to the program, the loader allocates the memory on the basis of the
size of the program, this is known as allocation. The loader gives the space in memory where the
object program will be loaded for execution.

Linking:

The linker resolves the symbolic reference code or data between the object modules by allocating
all of the user subroutine and library subroutine addresses. This process is known as linking.

Relocation:

There are some address-dependent locations in the program, and these address constants must be
modified to fit the available space, this can be done by loader and this is known as relocation. In
order to allow the object program to be loaded at a different address than the one initially
supplied, the loader modifies the object program by modifying specific instructions.
Loading:

The loader loads the program into the main memory for execution of that program. It loads
machine instruction and data of related programs and subroutines into the main memory, this
process is known as loading. The loader performs loading; hence, the assembler must provide the
loader with the object program.

Eg. Absolute Loader

Absolute Loader:

The absolute loader reads the object program line by line and transfers the text of the program
into memory at the addresses provided by the assembler. There are two types of information that
the object program must communicate from the assembler to the loader: the machine
instructions that the assembler has created, and the memory addresses where they belong.

SIMPLE BOOTSTRAP LOADER:


When a computer is first turned on or restarted, a special type of absolute loader called a
bootstrap loader is executed. The bootstrap loader loads the first program to be run by the
computer, usually the operating system. The bootstrap itself begins at a fixed address in memory
(often held in ROM or read in from a fixed location on a device). It loads the operating system,
or some other target program, starting at a predetermined address, and then jumps to that address
to begin execution. Because no other software support is available at power-on, the bootstrap must
be kept extremely small and simple.
Machine-Dependent Loader Features
The absolute loader is simple and efficient, but the scheme has potential disadvantages. One of the
main disadvantages is that the programmer has to specify the actual starting address from which the
program is to be loaded. This does not create difficulty if only one program is to be run, but it does
when several programs share memory. Further, it is difficult to use subroutine libraries efficiently.
This needs the design and implementation of a more complex loader. The loader must provide
program relocation and linking, as well as simple loading functions.

Relocation

The concept of program relocation is, the execution of the object program using any part of the
available and sufficient memory. The object program is loaded into memory wherever there is
room for it. The actual starting address of the object program is not known until load time.
Relocation provides the efficient sharing of the machine with larger memory and when several
independent programs are to be run together. It also supports the use of subroutine libraries
efficiently. Loaders that allow for program relocation are called relocating loaders or relative
loaders.

Methods for specifying relocation

Use of modification record and, use of relocation bit, are the methods available for specifying
relocation. In the case of modification record, a modification record M is used in the object
program to specify any relocation. In the case of use of relocation bits, each instruction is
associated with one relocation bit, and these relocation bits in a Text record are gathered into bit
masks.

Modification records are used in complex machines and are also called Relocation and Linkage
Directory (RLD) specifications. The format of the modification record (M) is as follows. The
object program with relocation by Modification records is also shown here.

Modification record

HΛCOPY Λ000000 001077


TΛ000000 Λ1DΛ17202DΛ69202DΛ48101036Λ…Λ4B105DΛ3F2FECΛ032010
TΛ00001DΛ13Λ0F2016Λ010003Λ0F200DΛ4B10105DΛ3E2003Λ454F46
TΛ001035 Λ1DΛB410ΛB400ΛB440Λ75101000Λ…Λ332008Λ57C003ΛB850
TΛ001053Λ1DΛ3B2FEAΛ134000Λ4F0000ΛF1Λ..Λ53C003ΛDF2008ΛB850

TΛ001070Λ07Λ3B2FEFΛ4F0000Λ05

MΛ000007Λ05+COPY
MΛ000014Λ05+COPY

MΛ000027Λ05+COPY

EΛ000000
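The effect of one Modification record can be sketched as follows. Each M record gives the starting address of the field to be modified and its length in half-bytes (05 here, i.e., the low five half-bytes of a 3-byte field), and the loader adds the actual load address to that field. The object bytes and load address in the test are illustrative.

```python
def apply_modification(obj, m_addr, m_len, delta):
    """Apply one M record to `obj` (a bytearray holding the program).
    m_addr: byte offset of the field; m_len: length in half-bytes;
    delta: actual load address minus assembled origin."""
    nbytes = (m_len + 1) // 2                      # 5 half-bytes -> 3 bytes
    field = int.from_bytes(obj[m_addr:m_addr + nbytes], "big")
    mask = (1 << (4 * m_len)) - 1                  # low half-bytes only
    field = (field & ~mask) | ((field + delta) & mask)
    obj[m_addr:m_addr + nbytes] = field.to_bytes(nbytes, "big")
```

The mask keeps the untouched high half-byte (here the x/b/p/e flag nibble of a format-4 instruction) intact while the address portion is adjusted.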

The relocation bit method is used for simple machines. Relocation bit is 0: no modification is
necessary, and is 1: modification is needed. This is specified in the columns 10-12 of text record
(T), the format of text record, along with relocation bits is as follows.

Text record

Col 1: T

Col 2-7: starting address

Col 8-9: length (byte)

Col 10-12: relocation bits

Col 13-72: object code

A twelve-bit mask is used in each Text record (columns 10-12, the relocation bits). Since each
text record contains at most 12 words, unused bits are set to 0, and any value that is to be
modified during relocation must coincide with one of these 3-byte segments. For an absolute
loader there are no relocation bits, and columns 10-69 contain object code. The object program with
relocation by bit mask is as shown below. Observe that FFC means all ten words of that record are
to be modified and E00 means the first three words are to be modified.

HΛCOPY Λ000000 00107A

TΛ000000Λ1EΛFFCΛ140033Λ481039Λ000036Λ280030Λ300015Λ…Λ3C0003 Λ …

TΛ00001EΛ15ΛE00Λ0C0036Λ481061Λ080033Λ4C0000Λ…Λ000003Λ000000

TΛ001039Λ1EΛFFCΛ040030Λ000030Λ…Λ30103FΛD8105DΛ280030Λ…

TΛ001057Λ0AΛ800Λ100036Λ4C0000ΛF1Λ001000

TΛ001061Λ19ΛFE0Λ040030ΛE01079Λ…Λ508039ΛDC1079Λ2C0036Λ…

EΛ000000
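The bit-mask scheme can be sketched directly: the 12-bit mask is scanned left to right, one bit per 3-byte word of the Text record, and delta (the difference between the actual and assembled load addresses) is added wherever the bit is 1. The word values in the test are illustrative.

```python
def relocate_text_record(mask_hex, words_hex, delta):
    """Apply a 12-bit relocation mask to the words of one Text record.
    The leftmost mask bit corresponds to the first word."""
    mask = int(mask_hex, 16)
    out = []
    for i, w in enumerate(words_hex):
        value = int(w, 16)
        if mask & (1 << (11 - i)):                 # bit set: relocate word i
            value = (value + delta) & 0xFFFFFF
        out.append(f"{value:06X}")
    return out
```

With mask E00 (binary 1110 0000 0000), only the first three words receive the correction, matching the description above.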

Program Linking
The Goal of program linking is to resolve the problems with external references (EXTREF) and
external definitions (EXTDEF) from different control sections.

EXTDEF (external definition) – The EXTDEF statement in a control section names symbols,
called external symbols, that are defined in this (present) control section and may be used by
other sections.

Ex: EXTDEF BUFFER, BUFFEND, LENGTH

EXTDEF LISTA, ENDA

EXTREF (external reference) – The EXTREF statement names symbols used in this (present)
control section and are defined elsewhere.
Ex: EXTREF RDREC, WRREC EXTREF LISTB, ENDB, LISTC, ENDC
UNIT-4

MACRO PREPROCESSOR
Macro
 A macro (or macro instruction)
– It is simply a notational convenience for the programmer.
– It allows the programmer to write shorthand version of a program
– It represents a commonly used group of statements in the source program.
 For example:
 Suppose a program needs to add two numbers frequently. This requires a sequence
of instructions. We can define and use a macro called SUM, to represent this
sequence of instructions.

SUM MACRO &X,&Y


LDA &X
MOV B
LDA &Y
ADD B
MEND

Macro Preprocessor
 The macro pre-processor (or macro processor) is system software that replaces
each macro instruction with the corresponding group of source language statements.
This operation is called expanding the macro.
 It does not concern the meaning of the involved statements during macro expansion.
 The design of a macro processor generally is machine independent.
BASIC MACRO PROCESSOR FUNCTIONS

The fundamental functions common to all macro processors are: ( Code to remember - DIE )
o Macro Definition
o Macro Invocation
o Macro Expansion

Macro Definition
Macro definitions are typically located at the start of a program.
A macro definition is enclosed between a macro header statement(MACRO) and a
macro end statement(MEND)
Format of macro definition
macroname MACRO parameters
:
body
:
MEND

A macro definition consists of a macro prototype statement and the body of the macro.


A macro prototype statement declares the name of a macro and its parameters. It has
following format:
macroname MACRO parameters
where macroname indicates the name of macro, MACRO indicates the beginning of
macro definition and parameters indicates the list of formal parameters. parameters is
of the form &parameter1, &parameter2,…Each parameter begins with ‘&’. Whenever
we use the term macro prototype it simply means the macro name along with its
parameters.

The body of the macro consists of the statements that will be generated as the expansion of the macro.
Consider the following macro definition:
SUM MACRO &X,&Y
LDA &X
MOV B
LDA &Y
ADD B
MEND
Here, the macro named SUM is used to find the sum of two variables passed to it.
Macro Invocation(or Macro Call)
A macro invocation statement (a macro call) gives the name of the macro instruction
being invoked and the arguments to be used in expanding the macro.
The format of macro invocation
macroname p1, p2,...pn
The above defined macro can be called as SUM P,Q

Macro Expansion
Each macro invocation statement will be expanded into the statements that form the
body of the macro.
Arguments from the macro invocation are substituted for the parameters in the macro
prototype.
The arguments and parameters are associated with one another according to their
positions. The first argument in the macro invocation corresponds to the first
parameter in the macro prototype, etc.
Comment lines within the macro body have been deleted, but comments on individual
statements have been retained. Macro invocation statement itself has been included
as a comment line.
Consider the example for macro expansion on next page:
In this example, the macro named SUM is defined at the start of the program. This
macro is invoked with the macro call SUM P,Q and the macro is expanded as
LDA P
MOV B
LDA Q
ADD B
Again the same macro is invoked with the macro call SUM M,N and the macro is expanded
as

LDA M
MOV B
LDA N
ADD B
Figure: Example for macro expansion
Difference between Macro and Subroutine
Macros differ from subroutines in two fundamental aspects:
1. A macro call leads to its expansion, whereas a subroutine call leads to its execution. So
there is a difference in size and execution efficiency.
2. Statements of the macro body are expanded each time the macro is invoked. But the
statements of the subroutine appear only once, regardless of how many times the
subroutine is called.
MACRO PROCESSOR ALGORITHM AND DATA STRUCTURES

Two pass macro processor


 It is easy to design a two-pass macro processor in which all macro definitions are
processed during the first pass and all macro invocation statements are expanded
during second pass.
 Such a two pass macro processor cannot handle nested macro definitions. Nested
macros are macros in which definition of one macro contains definition of other
macros.
 Consider the macro definition example given below, which is used to swap two
numbers. The macro named SWAP defines another macro named STORE inside it.
These type of macro are called nested macros.

SWAP MACRO &X,&Y .outer macro
LDA &X
LDX &Y
STORE MACRO &X,&Y .inner macro
STA &Y
STX &X
MEND
MEND
One pass macro processor
 A one-pass macro processor uses only one pass for processing macro definitions and
macro expansions.
 It can handle nested macro definitions.
 To implement one pass macro processor, the definition of a macro must appear in the
source program before any statements that invoke that macro.
Data Structures involved in the design of one pass macro processor
 There are 3 main data structures involved in the design of one pass macro processor:
DEFTAB
NAMTAB
ARGTAB

Definition table (DEFTAB)


 All the macro definitions in the program are stored in DEFTAB, which includes
macro prototype and macro body statements.
 Comment lines from macro definition are not entered into DEFTAB because they will
not be a part of macro expansion.
 References to the macro instruction parameters are converted to a positional notation
for efficiency in substituting arguments.

Name table (NAMTAB)


 The macro names are entered into NAMTAB
 NAMTAB contains pointers to beginning and end of definition in DEFTAB.

Argument table (ARGTAB)


 The third data structure is an argument table (ARGTAB), which is used during
expansion of macro invocations.
 When macro invocation statements are recognized, the arguments are stored in
ARGTAB according to their position in argument list.
 As the macro is expanded, arguments from ARGTAB are substituted for the
corresponding parameters in the macro body.
 Example: Consider the following source code
SUM MACRO &X,&Y
LDA &X
MOV B
LDA &Y
ADD

B MEND
START
LDA 4500
ADD B
SUM P,Q
LDA 3000
………
…. END
When the macro definition for SUM is encountered, the macro name SUM along with its
parameters &X and &Y is entered into DEFTAB. Then the statements in the body of the macro are also
entered into DEFTAB. Positional notation is used for the parameters: the parameter &X has been
converted to ?1, and &Y has been converted to ?2.
 The macro name SUM is entered into NAMTAB and the beginning and end pointers
are also marked.
 On processing the input code, the opcode of each statement is compared with
NAMTAB to check whether it is a macro call. When the macro call SUM P,Q is
recognized, the arguments P and Q are entered into ARGTAB. The macro is
expanded by taking the statements from DEFTAB using the beginning and end
pointers stored in NAMTAB.
 When the ?n notation is recognized in a line from DEFTAB, the corresponding
argument is taken from ARGTAB.
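A minimal model of these three tables, using the SUM example above, can be sketched in Python. Real implementations also handle labels, comments, and longest-match parameter substitution, all skipped here.

```python
# Minimal model of DEFTAB / NAMTAB / ARGTAB.
DEFTAB = []    # macro prototype and body lines
NAMTAB = {}    # macro name -> (begin, end) indices into DEFTAB

def define(name, params, body):
    """Store a macro definition, converting &param to ?n notation."""
    begin = len(DEFTAB)
    DEFTAB.append((name, params))              # prototype entry
    for line in body:
        for n, p in enumerate(params, start=1):
            line = line.replace(p, f"?{n}")    # positional notation
        DEFTAB.append(line)
    NAMTAB[name] = (begin, len(DEFTAB) - 1)

def expand(name, args):
    """Expand a macro call, substituting ARGTAB entries for ?n."""
    argtab = dict(enumerate(args, start=1))    # position -> argument
    begin, end = NAMTAB[name]
    out = []
    for line in DEFTAB[begin + 1:end + 1]:
        for n, a in argtab.items():
            line = line.replace(f"?{n}", a)
        out.append(line)
    return out
```

Calling `define` once and `expand` per invocation mirrors the flow described above: the ?n markers stored in DEFTAB are replaced by whatever ARGTAB holds for that call.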

Figure shows the different data structures used


Algorithm for one pass macro processor
Explanation of algorithm
 The algorithm uses 5 procedures
o MACROPROCESSOR (main function)
o DEFINE
o EXPAND
o PROCESSLINE
o GETLINE

MACROPROCESSOR (MAIN function)


 This function initializes the variable named EXPANDING to false.
 It then calls GETLINE procedure to get next line from the source program and
PROCESSLINE procedure to process that line.
 This process will continue until the END of program.
PROCESSLINE
 This procedure checks
o If the opcode of current statement is present in NAMTAB. If so it is a macro
invocation statement and calls the procedure EXPAND
o Else if opcode =MACRO, then it indicates the beginning of a macro definition
and calls the procedure DEFINE
o Else it is identified as a normal statement(not a macro definition or macro
call) and write it to the output file.
DEFINE
 The control will reach in this procedure if and only if it is identified as a macro
definition statement.Then:
o Macro name is entered into NAMTAB
o Then the macro name along with its parameters are entered into DEFTAB.
o The statements in the body of the macro are also entered into DEFTAB. References to
the macro instruction parameters are converted to positional notation for
efficiency in substituting arguments.
o Comment lines from macro definition are not entered into DEFTAB because
they will not be a part of macro expansion.
o Store in NAMTAB the pointers to beginning and end of definition in
DEFTAB.

 To deal with Nested macro definitions DEFINE procedure maintains a counter named
LEVEL.
o When the assembler directive MACRO is read, the value of LEVEL is
incremented by 1
o When MEND directive is read, the value of LEVEL is decremented by 1
o That is, whenever a new macro definition is encountered within the current
definition, the value of LEVEL is incremented, and the while loop that
processes the macro definition terminates only when the value of
LEVEL reaches 0. This ensures that nested macro definitions are properly
handled.
EXPAND
 The control will reach in this procedure if and only if it is identified as a macro call.
 In this procedure, the variable EXPANDING is set to true. This indicates to the
GETLINE procedure that a macro call is being expanded, so that GETLINE
will read the next line from DEFTAB instead of reading from the input file.
 The arguments of macro call are entered into ARGTAB.
 The macro call is expanded with the lines from the DEFTAB. When the ?n notation is
recognized in a line from DEFTAB, the corresponding argument is taken from
ARGTAB.

GETLINE
 This procedure is used to get the next line.
 If EXPANDING = TRUE, the next line is fetched from DEFTAB. (It means we are
expanding the macro call)
 If EXPANDING = False, the next line is read from input file.
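The LEVEL counter used by DEFINE to capture nested definitions can be sketched like this. Lines are plain strings, starting just after the outer MACRO header; the closing MEND of the outer macro is not stored.

```python
def define_with_level(lines):
    """Copy a macro body into a definition table, tracking nested
    MACRO/MEND pairs with a LEVEL counter as DEFINE does."""
    deftab, level = [], 1
    for line in lines:
        tokens = line.split()
        if "MACRO" in tokens:
            level += 1                 # entering a nested definition
        elif tokens and tokens[0] == "MEND":
            level -= 1
            if level == 0:             # MEND that closes the outer macro
                break
        deftab.append(line)
    return deftab
```

Because inner MEND lines only decrement LEVEL to 1, they are stored as part of the outer macro's body, exactly as the algorithm above requires.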

Flow Diagram of a one pass macroprocessor


MACHINE INDEPENDENT MACRO PROCESSOR FEATURES
The features of a macro processor that do not depend on the architecture of the machine are called
machine-independent macro processor features.
The features includes:
1. Concatenation of Macro Parameters
2. Generation of Unique Labels
3. Conditional Macro Expansion
4. Keyword Macro Parameters

1. Concatenation of Macro Parameters


Most macro processors allow parameters to be concatenated with other character
strings.
Suppose that a program contains a series of variables named by the symbols XA1,
XA2, XA3,..., and another series of variables named XB1, XB2, XB3,..., etc.
If similar processing is to be performed on each series of labels, the programmer
might put this as a macro instruction.
The parameter to such a macro instruction could specify the series of variables to be
operated on (A, B, etc.). The macro processor would use this parameter to construct
the symbols required in the macro expansion (XA1, XB1, etc.).
For example, suppose that the parameter to such a macro instruction is named &ID.
The body of the macro definition contain a statement like LDA X&ID1, which means
&ID is concatenated after the string “X” and before the string “1”
TOTAL MACRO &ID
LDA X&ID1
ADD X&ID2
STA X&ID3
MEND

The macro call TOTAL A will be expanded as:


LDA XA1
ADD XA2
STA XA3
Ambiguity Problem: In the statement LDA X&ID1, & marks the start of
the macro parameter, but the end of the parameter is not marked.
So X&ID1 may mean
X + &ID + 1 or X + &ID1

Most of the macro processors deal with this problem by providing a special
concatenation operator to mark the end of the parameter. Beck's notation uses → as this
operator, so LDA X&ID1 can be rewritten as LDA X&ID→1, so that the end of the parameter
&ID is clearly defined.
Example:

The example given above shows a macro definition that uses the concatenation
operator as previously described. The statement TOTAL A and TOTAL BETA shows
the invocation statements and the corresponding macro expansion.
The macro processor deletes all occurrences of the concatenation operator
immediately after performing parameter substitution, so the → symbol will not
appear in the macro expansion.
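Parameter substitution with the concatenation operator can be sketched as below. The → character stands in for the operator, and trying longer parameter names first is one simple way to resolve the X&ID1 ambiguity; both choices are illustrative.

```python
CONCAT = "→"   # concatenation operator; any marker character would do

def substitute(line, args):
    """Replace each &PARAM with its argument, then delete the
    concatenation operator as described above.  Longest names are
    tried first so &ID1 is never mistaken for &ID followed by '1'."""
    for param in sorted(args, key=len, reverse=True):
        line = line.replace(param, args[param])
    return line.replace(CONCAT, "")
```

So for TOTAL A, the body line LDA X&ID→1 becomes LDA XA1, with the operator removed from the final expansion.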

2. Generation of unique labels


Ordinary labels cannot be used for the instructions in a macro definition, since every
expansion of the macro would repeat the same label.
This results in the duplicate-labels problem (which is not allowed by the assembler) if the
macro is invoked and expanded multiple times.
To avoid this we can use the technique of generating unique labels for every macro
invocation and expansion.
During macro expansion each $ will be replaced with $XX, where XX is a two
character alphanumeric counter of the number of macro instructions expansion.
For the first macro expansion in a program, XX will have the value AA. For
succeeding macro expansions XX will be set to AB,AC etc. This allows 1296 macro
expansions in a single program
Consider the following program:

SAMPLE START 0
COPY MACRO &A,&B
LDA &A
.........
$LOOP ADD &B
..........
JLE $LOOP
STA &B
MEND
.............
COPY X,Y
LDA M
COPY P,Q
...........
END

After the macro expansion above code becomes:

SAMPLE START 0
.............
LDA X
.........
$AALOOP ADD Y Expansion of COPY X,Y
..........
JLE $AALOOP
STA Y
LDA M
LDA P
.........
$ABLOOP ADD Q Expansion of COPY P,Q
..........
JLE $ABLOOP
STA Q
...........
END

In the example, for the first invocation of COPY X,Y the label generated is $AALOOP and
for the second invocation COPY P,Q the label generated is $ABLOOP. Thus for each
invocation of macro unique label is generated.
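The $ to $XX substitution can be sketched as follows. This version uses a purely alphabetic two-letter counter, giving 26 * 26 = 676 combinations; the 1296 figure above assumes both letters and digits are allowed.

```python
import string

def unique_suffix(count):
    """Two-letter counter for the $ labels: 0 -> AA, 1 -> AB,
    25 -> AZ, 26 -> BA, and so on (letters only in this sketch)."""
    letters = string.ascii_uppercase
    return letters[count // 26] + letters[count % 26]

def relabel(body, count):
    """Replace each $ in the macro body with $XX for expansion `count`."""
    return [line.replace("$", "$" + unique_suffix(count)) for line in body]
```

Each invocation of the macro passes a higher `count`, so the labels in successive expansions never collide.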

3. Conditional Macro Expansion


Arguments in macro invocation can be used to:
o Substitute the parameters in the macro body without changing the sequence of
statements expanded.(sequential macro expansion)
o Modify the sequence of statements for conditional macro expansion. This
capability increases power and flexibility of a macro language.
Macro-Time Variables:
o Macro-time variables (often called as SET Symbol) can be used to store
working values during the macro expansion.
o It can also be used to store the evaluation result of Boolean expression
o Any symbol that begins with & and not a macro instruction parameter is
considered as macro-time variable. All such variables are initialized to zero.
o The value of macro-time variables can be modified by using the macro
processor directive SET
o The macro processor must maintain a symbol table to store the value of all
macro-time variables used. The table is used to look up the current value of
the macro-time variable whenever it is required.

Implementation of Macro-Time Conditional Structure IF-ELSE-ENDIF


Structure of IF-ELSE_ENDIF:
Macroname MACRO &COND
........
IF (&COND NE ’’)
.part I
ELSE
.part II
ENDIF
.........
MEND
When an IF statement is encountered during the expansion of a macro, the specified
Boolean expression is evaluated.
If the value of this expression is TRUE,
o The macro processor continues to process lines from the DEFTAB until it
encounters the ELSE or ENDIF statement.
o If an ELSE is found, macro processor skips lines in DEFTAB until the next
ENDIF.
o Once it reaches ENDIF, it resumes expanding the macro in the usual way.
If the value of the expression is FALSE,
o The macro processor skips ahead in DEFTAB until it encounters next ELSE
or ENDIF statement.
o The macro processor then resumes normal macro expansion.

Example for conditional macro:


COPY START 0
EVAL MACRO &X,&Y,&Z
IF (&Y LE &X)
LDA &X
SUB &Z
ELSE
LDA &Y
ADD &Z
ENDIF
MEND
STA P
....................

EVAL 2,3,4
STA Q
...............
END
After expansion the above code becomes:
COPY START 0
STA P
................
LDA 3
ADD 4
STA Q
.........................
END

Implementation of the Macro-Time Looping (or Expansion-Time Looping) Structure: WHILE-ENDW


The WHILE statement specifies that the following lines, until the next ENDW statement,
are to be generated repeatedly as long as a particular condition is true. The testing
of this condition, and the looping, are done while the macro is being expanded.
When a WHILE statement is encountered during the expansion of a macro, the
specified Boolean expression is evaluated.
If expression is TRUE, the macro processor continues to process lines from
DEFTAB until it encounters the next ENDW statement.When ENDW is
encountered, the macro processor returns to the preceding WHILE, re-evaluates the
Boolean expression, and takes action based on the new value.
If the expression is FALSE, the macro processor skips ahead in DEFTAB until it
finds the next ENDW statement and then resumes normal macro expansion.
The example given below shows the usage of a macro-time looping statement.
In this example, the variable &LIMIT is a macro-time variable.

Above loop after expansion

4. Keyword Macro Parameters


 The parameters used in macro can be of two types
a)Positional Parameters
b)Keyword Parameters

a) Positional Parameters
 Parameters and arguments are associated according to their positions in the macro
prototype and invocation. The programmer must specify the arguments in proper
order.
 Syntax : In macro definition,
macroname MACRO &parameter1, &parameter2,……
In macro invocation,
macroname argument1, argument2,….
Example: In macro definition,
EVAL MACRO &X, &Y
In macro invocation,
EVAL P,Q
Here &X receives the value of P and &Y receives the value of Q
 If an argument is to be omitted, a null argument should be used to maintain the
proper order in macro invocation statement.
 For example: Suppose a macro named EVAL has 5 possible parameters, but in a
particular invocation of the macro only the 1st and 4th parameters are to be
specified. Then the macro call will be EVAL P,,,S,

 Positional parameter is not suitable if a macro has a large number of parameters,


and only a few of these are given values in a typical invocation.
b) Keyword Parameters
 Each argument value is written with a keyword that names the corresponding
parameter.
 Arguments may appear in any order; the position of the arguments has no
significance.
 Null arguments no longer need to be used.
 For macros using keyword parameters the macro prototype statement
specification is different. Here each parameter name is followed by equal sign,
which identifies a keyword parameter and a default value is specified for some of
the parameters.
 Keyword parameters are easier to read and much less error-prone than the
positional parameters.
 It is of the form
&formal parameter name = <ordinary string>
 Consider the example:
INCR MACRO &X=, &Y=, &Z=
MOV &Z, &X
ADD &Z, &Y
MOV &X, &Z
MEND
The following calls are now equivalent
1) INCR X=A, Y=B, Z=C
2) INCR Y=B, X= A, Z= C
Default specification of parameters
 Default specification of parameters can also be possible in macros.
 This specification is useful in situations where a parameter has the same value in most
calls.
 When the desired value is different from the default value, the desired value can be
specified explicitly in a macro call.

 This specification overrides the default value of the parameter for the duration of that
macro call.
 Consider the example, where the default value R is
specified for the parameter &Z:
INCR MACRO &X=, &Y=, &Z=R
MOV &Z, &X
ADD &Z, &Y
MOV &X, &Z
MEND

 Then the macro call INCR X=A, Y=B will take the values A for parameter X, B for
parameter Y and R for the parameter Z.
 The macro call INCR X=A, Y=B, Z=C will take the values A for parameter X, B for
parameter Y and C for the parameter Z.
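Keyword-argument binding with defaults can be sketched as follows. The dictionaries mirror the INCR prototype above; the exact representation is illustrative.

```python
def bind_keyword_args(defaults, call_args):
    """Bind keyword arguments the way the INCR example does: start
    from the declared defaults, then let explicit X=A style arguments
    override them for this one invocation."""
    binding = dict(defaults)           # copy, so defaults are untouched
    for arg in call_args:
        key, _, value = arg.partition("=")
        binding["&" + key] = value     # override the default
    return binding
```

Because the defaults dictionary is copied, the override lasts only for the duration of that macro call, as the notes require.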

MACRO PROCESSOR DESIGN OPTIONS

Two Pass Macro Processor


 Same as on page 4
 It is easy to design a two-pass macro processor in which all macro definitions are
processed during the first pass and all macro invocation statements are expanded
during second pass.
 Such a two pass macro processor cannot handle nested macro definitions.
 Nested macros are macros in which definition of one macro contains definition of
other macros.
 Consider the macro definition example given below, which is used to swap two
numbers.
 The macro named SWAP defines another macro named STORE inside it. Such macros are called nested macros.
SWAP  MACRO &X,&Y        . outer macro
      LDA   &X
      LDX   &Y
STORE MACRO &X,&Y        . inner macro
      STA   &Y
      STX   &X
      MEND
      MEND

One Pass Macro Processor


Same as on page 4 (here only brief description is needed)
A one-pass macro processor uses only one pass for processing macro definitions and
macro expansions.
It can handle nested macro definitions.
To implement a one-pass macro processor, the definition of a macro must appear in the
source program before any statements that invoke that macro.
Data Structures involved in the design of one pass macro processor
DEFTAB
NAMTAB
ARGTAB
Whenever a macro definition is encountered, the macro prototype and body of macro
is entered into DEFTAB. References to the macro instruction parameters are
converted to a positional notation for efficiency in substituting arguments.
The macro name along with the begin and end pointers are entered into NAMTAB.
Whenever a macro invocation is encountered, the arguments are entered into
ARGTAB.
The macro call is expanded with the lines from the DEFTAB. When the ?n notation is
recognized in a line from DEFTAB, the corresponding argument is taken from
ARGTAB.
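The interplay of the three tables can be sketched in Python (the table layouts and helper names here are assumptions chosen for illustration; the actual one-pass algorithm differs in detail):

```python
# Illustrative sketch of the one-pass macro processor tables: DEFTAB holds
# the prototype and body lines with parameters converted to positional ?n
# notation, NAMTAB maps a macro name to its begin/end indexes in DEFTAB,
# and ARGTAB holds the arguments of the current invocation.
DEFTAB = []
NAMTAB = {}

def define(name, params, body):
    begin = len(DEFTAB)
    DEFTAB.append(name + " MACRO " + ",".join(params))
    for line in body:
        for i, p in enumerate(params, start=1):
            line = line.replace("&" + p, "?%d" % i)   # positional notation
        DEFTAB.append(line)
    DEFTAB.append("MEND")
    NAMTAB[name] = (begin, len(DEFTAB) - 1)

def expand(name, args):
    ARGTAB = args                       # one entry per positional argument
    begin, end = NAMTAB[name]
    out = []
    for line in DEFTAB[begin + 1:end]:  # skip the prototype and MEND
        for i, arg in enumerate(ARGTAB, start=1):
            line = line.replace("?%d" % i, arg)       # ?n -> ARGTAB entry
        out.append(line)
    return out

define("RDBUFF", ["INDEV", "BUFADR"], ["LDX &BUFADR", "TD =X'&INDEV'"])
print(expand("RDBUFF", ["F1", "BUFFER"]))
# -> ['LDX BUFFER', "TD =X'F1'"]
```

The ?n conversion at definition time is what makes argument substitution at expansion time a simple positional lookup.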
Recursive Macro Expansion
 Invocation of one macro by another macro is called recursive macro expansion.
 Example for recursive macro:

SUM MACRO &X,&Y
    STA   &X
    ADD   &Y
    MEND

INPUT MACRO &A,&B
      SUM   &A,&B        . invoking the macro SUM
      MEND

Here the macro named INPUT is calling another macro named SUM. This is called recursive
macro expansion.
 The macro processor design algorithm discussed previously cannot handle recursive macro
invocation and expansion.
 Reasons are:
o The procedure EXPAND would be called recursively, thus the invocation
arguments in the ARGTAB will be overwritten.
o The Boolean variable EXPANDING would be set to FALSE when the “inner”
macro expansion is finished, that is, the macro process would forget that it had
been in the middle of expanding an “outer” macro.
o A similar problem would occur with PROCESSLINE since this procedure too
would be called recursively.
 Solutions:
o Write the macro processor in a programming language that allows recursive calls,
thus local variables will be retained.
o If we are writing in a language without recursion support, use a stack to take care of
pushing and popping local variables and return addresses. So the recursive calls can
be handled.
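The first solution, writing EXPAND in a language that supports recursion so each invocation keeps its own local bindings, can be sketched as follows (the MACROS table and its name -> (parameters, body) layout are assumptions of this sketch):

```python
# Sketch of recursive macro expansion via a recursive EXPAND: each call
# holds its own local argument bindings, so an inner expansion cannot
# overwrite the arguments of the outer one.
MACROS = {
    "SUM":   (["X", "Y"], ["STA &X", "ADD &Y"]),
    "INPUT": (["A", "B"], ["SUM &A,&B"]),       # INPUT invokes SUM
}

def expand(name, args):
    params, body = MACROS[name]
    out = []
    for line in body:
        for p, a in zip(params, args):          # local, per-call bindings
            line = line.replace("&" + p, a)
        op, _, rest = line.partition(" ")
        if op in MACROS:                        # nested invocation: recurse
            out.extend(expand(op, rest.split(",")))
        else:
            out.append(line)
    return out

print(expand("INPUT", ["P", "Q"]))
# -> ['STA P', 'ADD Q']
```

In a language without recursion, the same effect is obtained by explicitly pushing the argument table and return point on a stack before expanding the inner macro.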
General-Purpose Macro Processors
 The macro processors we have discussed so far are related to assembly language programming.
Macro processors for high-level languages have also been developed. Macro processors
which are designed for a specific language are called special-purpose macro processors.
An example of a special-purpose macro processor is the MASM macro processor.
 These special purpose macro processors are similar in general function and approach but
the implementation details differ from language to language.

 A general-purpose macro processor does not depend on any particular programming
language, but can be used with a variety of different languages.
 Example for a general purpose macro processor is ELENA Macro processor

Advantages
 Programmers do not need to learn many macro languages, so much of the time and
expense involved in training are eliminated.
 Although its development costs are somewhat greater than those for a specific-language
macro processor, this expense does not need to be repeated for each language, thus saving
substantial overall cost.

Disadvantages
In spite of the advantages noted, there are still relatively few general purpose macro
processors. The reasons are:
 Large number of details must be dealt with in a real programming language.
 There are several situations in which normal macro parameter substitution or normal
macro expansion should not occur.
o For example, comments are usually ignored by the macro processor. But each
programming language uses its own method for specifying comments, so a
general-purpose macro processor should be designed with the capability of
identifying comments in any programming language.
 Another problem is with the facilities for grouping together terms, expressions, or
statements.
o Some languages use keywords such as begin and end for grouping statements,
while other languages use special characters like { }. A general-purpose
macro processor may need to take these groupings into account in scanning the
source statements.
 Another problem is the identification of tokens of the programming language. Tokens
are the identifiers, constants, operators, and keywords of the language. Different
languages use different rules for the formation of tokens, so the design of a
general-purpose macro processor must take this into consideration.
 Another problem is with the syntax used for macro definitions and macro invocation
statements.
Macro Processing within Language Translators
 Macros can be processed
1. Outside the language translators
2. Within the Language translators

1. Macro processing outside the language translators


 The macro processors that we have discussed so far belong to this group. They are
called macro preprocessors.
 They read the source program statements, process the macro definitions, and expand
macro invocations, producing an expanded version of the source program.
 This expanded program is then used as input to an assembler or compiler.
 So this can be considered as macro processing outside the language translators.

2. Macro processing within the language translators


 In this section we discuss the methods for combining the macro processing functions
with the language translator itself.
 Two common methods are
a) Line-by-Line Macro Processors
b) Integrated Macro Processors
a) Line-by-line macro processor
 The simplest method for combining the macro processing functions with the
language translator is a line-by-line approach.
 Using this approach, the macro processor reads the source program statements, processes
the macro definitions, and expands macro invocations. But it does not produce an
expanded version of the source program.
 The processed lines are passed to the language translator (compiler or assembler) as
they are generated, instead of being written to an expanded source file.
 Thus macro processor operates as a sort of input routine for the assembler or
compiler.
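The "input routine" idea maps naturally onto a generator: processed lines are handed to the translator one at a time as they are produced. A minimal sketch (the macro table, the INCR example, and the dispatch-by-opcode scheme are assumptions of this sketch):

```python
# Sketch of a line-by-line macro processor: instead of writing an expanded
# source file, it yields processed lines, and the assembler/compiler
# consumes them directly as its input routine.
def macro_input_routine(source_lines, macros):
    for line in source_lines:
        op, _, rest = line.strip().partition(" ")
        if op in macros:                     # expand the invocation in place
            for body_line in macros[op](rest):
                yield body_line
        else:
            yield line                       # ordinary line, pass through

# hypothetical single-parameter macro INCR: load, add one, store back
macros = {"INCR": lambda arg: ["LDA " + arg, "ADD #1", "STA " + arg]}
for processed in macro_input_routine(["START 0", "INCR COUNT", "END"], macros):
    print(processed)       # each line reaches the "assembler" as generated
```

No intermediate expanded file ever exists; the translator pulls lines on demand, which is exactly what avoids the extra pass noted below.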
Benefits of line-by-line macro processor
 It avoids making an extra pass over the source program. So it can be more efficient
than using a macro preprocessor.
 Some of the data structures required by the macro processor and the language
translator can be combined (e.g., OPTAB and NAMTAB)
 Many utility subroutines can be used by both the macro processor and the language
translator. This involves scanning input lines, searching tables, converting numeric
values to internal representation, etc.
 It is easier to give diagnostic messages related to the source statements.

b) Integrated Macro Processors


 Although a line-by-line macro processor may use some of the same utility routines as
language translator, the functions of macro processing and program translation are
still relatively independent.
 It is possible to have even closer cooperation between the macro processor and the
language translator. Such a scheme is called as language translator with an integrated
macro processor.
 An integrated macro processor can potentially make use of any information about
the source program that is extracted by the language translator.
 The macro processor may use the results of translator operations, such as the scanning
of symbols and constants, without being involved in that processing itself.
 For example, in FORTRAN, consider the statement:
DO 100 I = 1,20
where DO is a keyword, 100 is a statement number, I is a variable name, etc. Now consider:
DO 100 I = 1
Since blanks are not significant in FORTRAN, this statement is an assignment
that gives the value 1 to the variable DO100I. Thus the proper interpretation
of the characters cannot be decided until the rest of the statement is examined.

Disadvantages of line-by-line and integrated macro processors


 They must be specially designed and written to work with a particular implementation
of an assembler or compiler. The cost of macro processor development is added to the
costs of the language translator, which results in more expensive software. The
assembler or compiler will be considerably larger and more complex than using a
macro preprocessor. Size may be a problem if the translator is to run on a computer
with limited memory.
TYPES OF MACRO
Different types of macro are 1. Parameterized Macro 2. Nested Macro 3. Recursive Macro
1. Parameterized Macro
 Macros which use parameters are called parameterized macros. This type of macro has
the capability to insert the given parameters into its expansion. The macros we have
studied so far belong to this type.
 Example:
SUM MACRO &X,&Y
    LDA   &X
    MOV   B
    LDA   &Y
    ADD   B
    MEND

2. Nested Macros
 Nested macros are macros in which the definition of one macro contains the definition of
other macros. Consider the macro definition example given below, which is used to
swap two numbers. The macro named SWAP defines another macro named STORE
inside it. Such macros are called nested macros.

SWAP  MACRO &X,&Y        . outer macro
      LDA   &X
      LDX   &Y
STORE MACRO &X,&Y        . inner macro
      STA   &Y
      STX   &X
      MEND
      MEND

3. Recursive Macro
 Invocation of one macro by another macro is called recursive macro expansion. Example
of a recursive macro:
SUM   MACRO &X,&Y
      STA   &X
      ADD   &Y
      MEND
INPUT MACRO &A,&B
      SUM   &A,&B        . invoking the macro SUM
      MEND
Here the macro named INPUT is calling another macro named SUM. This is called recursive
macro expansion.
UNIT-5

Compilers-- Basic functions

 Compiler construction is generally an independent course, sometimes with an additional
semester on the implementation of a compiler.
• A compiler translates a high-level language program into an assembly language program, or into object code directly.
• Three steps:
– Scanning, parsing, and (object) code generation.
– (more generally, five steps: scanning, parsing, intermediate code generation, code
optimization, and object code generation)
• What does a program consist of?
– a string of characters? From lowest level, yes.
– A sequence of tokens: a keyword, a variable name, an integer, an operator, …,
• each token consists of a string of characters satisfying some rules/format, called lexical
rules.
– A sequence of statements, such as a declaration statement, an assignment statement,
an IF statement.
• Each statement consists of tokens satisfying some rules/format, called syntax or grammar.
• Each statement also has a specific meaning, called semantics.
– Scan the character string of a program, analyze it (according to rules, called lexical rules),
then figure out each token. Also called lexical analysis. The part of a compiler for this function
is called the scanner.
• Parsing:
– Pass through the sequence of tokens, parse them (according to rules, called grammars), then figure
out each statement. Also called syntactic analysis. The part of a compiler for this function is called
the parser.
• (Object) code generation:
– Each statement has its meaning (semantics); for each parsed statement, generate its code according
to its meaning. Also called semantic analysis. The part of a compiler for this function is called the
code generator.
• Notes:
– Three steps, not necessarily three passes.
– Comparison with an assembly language program and assembler:
• What does an assembly language program consist of? How does the assembler finish its
task?
Grammars
• Syntax and semantics of a statement.
• Grammar:
– Describe the form, or syntax, of the legal statements in high level language.
– Does not describe the meaning of a statement.
– For examples:
• I:=J+K and X:=Y+I
• where I, J, K are INTEGER variables and X, Y are REAL variables.
• They have identical syntax: an assignment in which the value to be assigned is given by a
two-variable expression with operator +.
• Recall the SIC machine: integer arithmetic and floating-point arithmetic.
• The first assignment will add J and K together and store the result in I.
• The second assignment needs to first convert I to floating point, add two floating-point
numbers, and then store the result in X.
• So they will generate different machine code.
– The meaning of a statement is used in, or one may say controls, code generation.
• Many different ways to describe grammars.
– One typical way is BNF (Backus-Naur Form). We will use it.
Grammars (cont.)
• BNF grammars: a BNF grammar consists of a set of rules, each of which defines the syntax of some
construct in the programming language.
– <read>::= READ(<id-list>)
– The symbol ::= means "is defined as".
– The left symbol: a language construct being defined
– The right side: description of the syntax being defined for it.
– A name between < and > is a nonterminal symbol, which is the construct of the language
– Entities not in < and > are terminal symbols, also called tokens.
– In the example, <read> and <id-list> are non-terminal symbols, and READ, (, and ) are terminal
symbols.
– It says that the language construct <read> consists of the token READ, followed by the token (,
followed by a language construct <id-list>, followed by the token ).
– Note: blanks are not significant and they are there simply for readability.
– Furthermore, <id-list> ::= id | <id-list>, id
• Recursive definition. id is an identifier that is recognized by the scanner.
• So <id-list> consists of one or more id’s separated by commas.
• ALPHA is an <id-list>. ALPHA, BETA is another <id-list>
• More examples: READ(VALUE), <assign>::= id:=<exp>, <exp>::=<term>| <exp> + <term>| <exp> -
<term>
• Simplified Pascal grammars (Figure 5.2)
• Parse tree for two statements in Fig. 5.1 (Figure 5.3)
• Parse tree for the entire program in Figure 5.1 (Figure 5.4) and Figure 5.4 (cont'd)
Lexical analysis

• Scan the program to be compiled and recognize the tokens (from the string of characters).
• Tokens make up the source program.
• Tokens: keywords, operators, identifiers, integers, floating-point numbers, character strings,
and other items.
• Token coding scheme for the grammar from Figure 5.2 (Figure 5.5)
• Lexical scan of the program from Fig5.1 (Figure 5.6)
– Line, token type, token specifier
– Where token specifier is either a pointer to a symbol table or the value of integers.
• Relation among scanner and parser
– As a subroutine, called by parser, and return a token on each call.
– Scanner reads line by line of the program.
– Comments are ignored by scanner.
– Possibly print the source listing.
• Specific features of different languages
– Any specific format of statements, such as FORTRAN: columns 1-5 hold the statement number, not an integer token.
– Whether blanks as delimiters for tokens (such as PASCAL) or not (such as FORTRAN).
– Continuation of one line to the next line (as in PASCAL) or continuation flag (as in FORTRAN).
– Rules for token formation, such as READ within a quoted character string, may not be a token.
– Moreover, DO 10 I=1,100 (a DO loop) versus DO 10 I=1 (an assignment to DO10I).
– Languages without reserved key words: IF, THEN and ELSE may be keyword or variable names:
• IF (THEN .EQ. ELSE) THEN IF=THEN ELSE THEN=IF ENDIF is legal in FORTRAN.
Lexical analysis

--Modeling scanners as finite automata

• A set of states and a set of transitions from one state to another. One initial state and some final
states.
– Begin at the start state, scan the characters, transit to next states, and recognize a token when
reaching a final state.
• Graphical representation of a finite automaton (Figure 5.7)
• Finite automata for typical programming language tokens (Figure 5.8)
– (a) recognizes identifiers and keywords beginning with letters and then letters and/or digits.
– (b) allows _ in identifiers and keywords.
– (c) recognizes integers, allowing leading 0.
– (d) does not allow leading 0, except number 0.
• Finite automaton to recognize tokens from Fig 5.5 (Figure 5.9)
– Same final state for keywords and identifiers, so a special keyword look-up is needed to
distinguish keywords.
– In addition, a special check for the length of identifiers, if any. (Automata cannot represent a
limitation on the length of strings.)
– State 3 is for the keyword END.; it should check for others such as VAR., in which case it moves back to
state 2.
• Finite automata provide an easy way to visualize the operation of a scanner. However, the
real advantage is their easy implementation.
• Token recognition using (a) algorithmic code and (b) tabular representation of a finite automaton
(Figure 5.10)
– The tabular implementation is generally preferred.
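A minimal table-driven recognizer in the spirit of the tabular implementation might look like this (the state numbering and table entries are assumptions of this sketch, not Figure 5.10's exact table):

```python
# Table-driven recognizer for identifiers: a letter, then letters or digits.
# State 1 is the start state; state 2 is the final (accepting) state.
TABLE = {
    (1, "letter"): 2,
    (2, "letter"): 2,
    (2, "digit"):  2,
}

def char_class(c):
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"

def is_identifier(token):
    state = 1
    for c in token:
        state = TABLE.get((state, char_class(c)))
        if state is None:
            return False            # no transition defined: reject
    return state == 2               # accept only if we end in a final state

print(is_identifier("SUM2"))   # True
print(is_identifier("2SUM"))   # False
```

Adding a token type is just adding rows to TABLE, which is why the tabular form is easier to extend than hand-written algorithmic code.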
Syntactic Analysis

• The source statements written by programmers are recognized as language constructs described
by the grammar.
– Building the parse tree for the statements being translated.
• Bottom-up and top-down techniques.

– Bottom-up: building the leaves of the tree first, which match the statements, and then combining them
into higher-level nodes until the root is reached.
– Top-down: beginning from the root, i.e., the rule of the grammar specifying the goal of the
analysis, and constructing the tree so that the leaves match the statements being analyzed.
Bottom-up parsing

--operator-precedence parsing
• E.g. A+B*C-D,
– + < * and * > -,
– i.e. multiplication and division have higher precedence than addition and subtraction.
– So when analyzing the above expression, B*C will be constructed first at the lower level and then +
and – at the higher level.
– The statement being analyzed is scanned for a sub-expression whose operator has higher precedence
than the surrounding operators.
• Here, operator is taken to mean any terminal symbol (i.e., any token), so there are precedence relations
among BEGIN, END, id, (, ...
– E.g., PROGRAM = VAR, and BEGIN < FOR
– also, ; > END as well as END > ;
– Note: there is no relation between some pairs of tokens, which means that the second token should not
follow the first one. If this occurs, it is an error.
– If there is a relation, it is unique. Non-unique relation will make operator-precedence parsing fail.
• Precedence matrix for the grammar from Fig 5.2 (Figure 5.11)
– (Automatically) Generated from the grammar in Figure 5.2.
• Operator-precedence parse of a READ statement from line 9 of program 5.1 (Figure 5.12)
– BEGIN < READ, so keep BEGIN (in the stack) and proceed to READ.
– READ = (, so they are children of the same parent.
– ( < id, so keep READ and ( and proceed to id; VALUE is recognized as id (from the scanner).
– id > ), so id is recognized as <id-list>, simply denoted as <N1> (a non-terminal). If rule 12 of the
grammar were followed, id could be recognized as <factor>, but it does not matter.
– ( = ), so READ, (, <N1>, ) are all children of the same parent.
– ) > ;, finishing the parse tree for READ; by following rule 13, we get the non-terminal <read>,
simply denoted as <N2>.
– Compare this parse tree with Fig. 5.3(a); they are the same except for the names of the non-terminals.
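The shift/reduce decisions driven by the precedence relations can be sketched with a toy table (the mini-grammar of just id, +, and *, the `@` non-terminal marker, and the table entries are assumptions of this sketch, not Figure 5.11):

```python
# Toy operator-precedence parser: shift while the relation is < or =,
# reduce on >.  The reduction order below shows B*C recognized before +.
PREC = {   # relation between the topmost stack terminal and the next token
    ("$", "id"): "<", ("$", "+"): "<",
    ("+", "id"): "<", ("+", "*"): "<", ("+", "$"): ">",
    ("*", "id"): "<", ("*", "+"): ">", ("*", "$"): ">",
    ("id", "+"): ">", ("id", "*"): ">", ("id", "$"): ">",
}
NT = "@"                                  # marker for a reduced non-terminal

def kind(sym):                            # identifiers all map to token "id"
    return "id" if sym.isalpha() else sym

def top_terminal(stack):
    return next(s for s in reversed(stack) if s != NT)

def parse(tokens):
    stack, reductions, i = ["$"], [], 0
    tokens = tokens + ["$"]
    while not (tokens[i] == "$" and top_terminal(stack) == "$"):
        a = kind(top_terminal(stack))
        if PREC.get((a, kind(tokens[i]))) in ("<", "="):
            stack.append(tokens[i]); i += 1                    # shift
        else:                                                  # reduce
            handle = [stack.pop()]
            while handle[-1] == NT or PREC.get(
                    (kind(top_terminal(stack)), kind(handle[-1]))) != "<":
                handle.append(stack.pop())
            if stack[-1] == NT:            # include a leading non-terminal
                handle.append(stack.pop())
            reductions.append(" ".join(reversed(handle)))
            stack.append(NT)
    return reductions

print(parse(["A", "+", "B", "*", "C"]))
# -> ['A', 'B', 'C', '@ * @', '@ + @']   (B*C reduced before +)
```

Because (+, *) is < while (*, $) is >, the multiplication's handle is reduced before the addition's, which is exactly the precedence behavior described above.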
Bottom-up parsing (cont.)

--operator-precedence parsing
• Operator-precedence parse of an assignment in line 14 of program 5.1 (Figure 5.13) and Figure 5.13
(cont'd)
– Note: the left-to-right scan is continued in each step only far enough to determine the next portion of the
statement to be recognized, i.e., the first portion delimited by < and >.
– Each portion is constructed bottom-up.
– Compare this parse tree with Fig 5.3(b). In Fig 5.3(b), id SUMSQ → <factor> → <term>, which is one
operand of DIV. But here, SUMSQ is interpreted as <N1>. These differences are consistent with the use
of arbitrary names for non-terminals. The id → <factor> → <term> rules of the grammar are for
determining precedence, which is already combined into our precedence matrix, so they are not needed
in the analyzed parse tree.
• Working on the entire program of Fig 5.1 with the operator-precedence parsing method, a
resulting parse tree similar to Fig 5.4 will be generated, except for the differences in the naming of
non-terminals.
• Other parsing techniques:
– Shift-reduce parsing technique:
• Shift: pushing the current token into stack, similar to when < and = are encountered
• Reduce: recognize symbols on top of the stack according to a rule of the grammar, similar to
when > is encountered.
• Example of shift-reduce parsing (Figure 5.14)
– LR(k) parsing technique: the most powerful one.
• K: indicates the number of tokens following the current position that are considered in making
parsing decision.
– SLR and LALR techniques, for more restricted set of grammars.
Top-down parsing technique

--Recursive-descent parsing
• For each non-terminal, there is a procedure which
– Begins from the current token, searches the following tokens, and tries to recognize the
rule associated with the non-terminal.
– May call other procedures, or even itself, for non-terminals included in the rule of this non-
terminal.
– When a match is recognized, the procedure returns an indication of success; otherwise, an error.
• Recursive-descent parse of a READ statement (Figure 5.16)
– Begins with the token READ, then searches for the next token (, then calls the procedure IDLIST;
after IDLIST returns true, it searches for the token ); on success it returns true (and goes
to the next token), otherwise it returns false.
• There are several alternatives for a non-terminal:
– The procedure must determine which alternative to try.
– For recursive-descent parsing, assume that the next input token is enough to check for making
the decision.
• E.g., for the procedure of <stmt>, see the next token, if READ, then call <read>, if
id, then call <assign>, if FOR, call <for>, and if WRITE, call <write>.
• Problem with a left-recursive non-terminal such as <id-list>:
– If it tries <id-list>,id, the procedure recursively calls itself, and calls itself again, giving an
unending chain without consuming any input token.
– So top-down parsers cannot be used directly with a grammar containing left-recursive rules.
• Solution:
– Change the rule <id-list> ::= id | <id-list>,id to <id-list> ::= id {,id}, where {,id}
means zero or more occurrences of ",id".
– This kind of rule is an extension to BNF.
– Similar for others: such as: <exp>::=<term> {+ <term> | - <term>}
– Simplified Pascal grammar modified for recursive-descent parse (Figure 5.15)
• Note: <exp> → <term> → <factor> → (<exp>). This kind of indirect recursive
call is OK, since this chain of calls will consume at least one token from the
input; that is, progress can be made.
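The rewritten rule <id-list> ::= id {,id} turns the left recursion into a loop, which makes a recursive-descent procedure straightforward. A sketch (the closure-based token handling is an assumption of this sketch, not Figure 5.16's code):

```python
# Recursive-descent recognition of  <id-list> ::= id {,id}
# The {,id} repetition becomes a while loop, so the parser consumes a
# token on every step and the unending-recursion problem disappears.
def make_parser(tokens):
    state = {"i": 0}
    def peek():
        return tokens[state["i"]] if state["i"] < len(tokens) else None
    def advance():
        state["i"] += 1
    def id_list():
        if peek() is None or not peek().isalpha():
            return False              # must begin with an id
        advance()
        while peek() == ",":          # zero or more occurrences of ",id"
            advance()
            if peek() is None or not peek().isalpha():
                return False          # a comma must be followed by an id
            advance()
        return True
    return id_list

print(make_parser(["ALPHA", ",", "BETA"])())   # True:  ALPHA, BETA
print(make_parser(["ALPHA", ","])())           # False: dangling comma
```

Each alternative of a non-terminal would be chosen by inspecting the next token, as the text describes for <stmt>.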
Top-down parsing technique (cont.)

--Recursive-descent parsing
• Graphical representation of the recursive-descent parsing process for a READ statement (Figure 5.16
cont'd)
– Part (iii) is the same as the parse tree in Figure 5.3(a).
– Here the tree is constructed in a top-down manner.
• Recursive-descent parsing procedures of an assignment statement (Figure 5.17) and Figure 5.17
(Cont'd)
• Graphical representation of recursive-descent parsing of an assignment statement (Figure 5.17 cont'd)
and Figure 5.17 (Cont'd)
– Compare the parse tree in Figure 5.17(b) to the one in Figure 5.3(b). They are nearly identical; the
differences correspond to the differences between the grammars in Figure 5.15 and Figure 5.2.
• You can write all the procedures for the entire grammar, beginning the parsing by calling the procedure
<prog>. The top-down parse tree will be similar to the one in Figure 5.4.
• Combination of top-down and bottom-up techniques:
– Recursive-descent parsing for high-level constructs and operator-precedence parsing for
expressions.
Code generation

• Generate object code in the form of machine code directly or assembly language.
• A basic technique:
– Associate each rule (or an alternative rule) of the grammar with a routine, which translates the
construct into object code according to its meaning/semantics.
– Called semantic routine or code-generation routine.
– Possibly generate an intermediate form so that optimization can be done to get more efficient code.
– Code generation need not be associated with a specific parsing technique.
– Code generated clearly depends on a particular machine. Here assume SIC/XE.
• Data structures needed:
– A list, or a queue, first-in-first-out, also a LISTCOUNT variable
– A stack, first-in-last-out.
– S(token): specifier of a token, i.e., a pointer to the symbol table or the integer value.
– In addition, S(<non-terminal>) for a non-terminal, is set to rA, indicating that the result of code
for this construct is stored in register A.
– LOCCTR: location counter, indicating the next available address.
• The generated code here is represented in SIC assembly language, for easy understanding and
readability.
• Code generation for a READ statement (Figure 5.18)
– (a): parse tree; no matter what parsing technique is used, the left-most substring which can be
interpreted by a rule of the grammar is recognized:
• In recursive descent, a recognition occurs when a procedure returns to its caller with a success.
• In operator precedence, a recognition occurs when a substring of the input is reduced to a non-terminal.
• Thus, the parser first recognizes the id VALUE as an <id-list>, and then recognizes the complete statement
as <read>.
– (b): A set of routines for the rules of this part.
– (c): the symbolic representation of code generated.
• +JSUB XREAD, call standard library function XREAD.
• WORD 1, a word in memory, indicates how many data words follow. WORD VALUE, a word in
memory, holds the address at which to store the value read.
• The address of WORD 1 will be stored in register L when JSUB is executed, so that XREAD can
access the memory following JSUB and put the read value in the address given by WORD VALUE. Also,
after finishing, XREAD will return to (L)+2.
Code generation (cont.)
• Code generation for an assignment statement (Figure 5.19) and Figure 5.19 (Cont'd) and Figure 5.19
(Cont'd)
– The order in which the parts of the statement are recognized is the same as the order in which the
calculations are performed.
– As each portion of the statement is recognized, the corresponding routine is called to generate code.
– E.g., <term>1 ::= <term>2 * <factor>; all arithmetic operations are performed in register A.
• Clearly need to generate MUL instruction. The result will be in A.
• If either the result of <term>2 or <factor> is already in A, (perhaps as the result of a
previous computation), MUL is all we need.
• Otherwise, we must generate LDA to load <term>2 or <factor> to register A, preceding MUL. We
may also need to save the previous value in A for later use (so generate STA instruction).
– Need to keep track of the result left in A by each segment of code generated:
• Specifier S(<non-terminal>) is set to rA, indicating that the result of this computation is in A
• REGA: indicate the highest-level node of the parsing tree whose value is left in A by code
generation so far. Clearly, just one such node at any time.
• temporary variables introduced by compiler, which will be assigned memory locations in object code.
– Similar code generation process for +, DIV and – operations.
– For <assign>, it generates LDA <exp>, STA S(id), and REGA=NULL.
– Rules in Figure 5.19 (b) do not generate object code, but set the node specifier of the higher-level
node to reflect the location of the corresponding value.
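The REGA bookkeeping for the multiplication rule can be sketched as follows (the routine name, the temporary names i1, i2, ..., and the operand COUNT are assumptions of this sketch, not Beck's exact routines):

```python
# Sketch of register-A tracking for  <term>1 ::= <term>2 * <factor>.
# REGA records which node's value currently sits in register A.
code, REGA, ntemp = [], None, 0

def gen_mul(term2, factor):
    global REGA, ntemp
    if REGA == term2:                    # left operand already in A
        code.append("MUL " + factor)
    elif REGA == factor:                 # right operand already in A
        code.append("MUL " + term2)
    else:                                # neither in A: load one first
        if REGA is not None:
            code.append("STA " + REGA)   # save the previous value of A
        code.append("LDA " + term2)
        code.append("MUL " + factor)
    ntemp += 1
    REGA = "i%d" % ntemp                 # the result of this node is in A
    return REGA

t = gen_mul("MEAN", "MEAN")   # neither operand in A: LDA MEAN, MUL MEAN
gen_mul(t, "COUNT")           # i1 already in A: generates only MUL COUNT
print(code)
# -> ['LDA MEAN', 'MUL MEAN', 'MUL COUNT']
```

The second call shows the payoff: because REGA records that i1 is already in A, the redundant LDA is never emitted.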



Code generation (Cont’d)

• Other code-generation routines for the grammar in Figure 5.2 (Figure5.20) and Figure 5.20
(Cont'd)
– <prog-name> generates header information, similar to that created by STARTand EXTREF, it also
generates save return address and jump to the first executable instruction.
– When <prog> is recognized, storage locations are assigned to any temporary variables and any
references to these variables are then fixed using the sameprocess performed for forward references
by one-pass assembler.
– The <for> construct is a little more complicated.
• When <index-exp> is recognized, code is generated to initialize the index variable and test for
loop termination. The information is also stored on a stack for later use.
• For each statement in the body, code is generated.
• When the complete <for> is recognized, code is generated to increment the value of the index variable and
jump back to the beginning of the loop to test for loop termination. The information about the index
variable, the beginning address of the loop, and the address for jumping out of the loop is obtained from
the stack.
• One feature is that a stack is used, which also allows nested FOR loops.
• Symbolic representation of object code generated for the program in Figure 5.1 (Figure 5.21)
– Please go through it carefully and understand the compilation process completely.
– Note: object code is in symbolic representation. Also, code and data interleave.
Machine-Dependent Compiler Features

• In general, high-level languages are machine-independent.


• But the generation and optimization of code are machine-dependent.
• For code optimization, an intermediate form of the program is usually generated.
• So: lexical analysis → syntactic analysis → intermediate form generation → optimization → code
generation.
• Here discuss:
– intermediate form generation (which is machine-independent, but needed for optimization)
– optimization (machine-dependent).
– Many machine-independent optimizations will be discussed later.
Intermediate form of program

• Quadruples: operation, op1, op2, result.


• Ex1: SUM := SUM + VALUE
– +, SUM, VALUE, i1
– :=, i1, , SUM (:= is treated as an operator)
• EX2: VARIANCE := SUMSQ DIV 100 – MEAN*MEAN
– DIV, SUMSQ, #100, i1
– *, MEAN, MEAN, i2
– -, i1, i2, i3
– :=, i3, , VARIANCE
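The emission of such quadruples can be sketched in a few lines (the emit helper and the global tables are assumptions of this sketch; the traversal order and temporary names follow EX2 above):

```python
# Sketch: emitting quadruples (operation, op1, op2, result) for
#   VARIANCE := SUMSQ DIV 100 - MEAN * MEAN
quads, ntemp = [], 0

def emit(op, a, b):
    global ntemp
    ntemp += 1
    result = "i%d" % ntemp          # compiler-generated intermediate result
    quads.append((op, a, b, result))
    return result

i1 = emit("DIV", "SUMSQ", "#100")
i2 = emit("*", "MEAN", "MEAN")
i3 = emit("-", i1, i2)
quads.append((":=", i3, "", "VARIANCE"))   # := treated as an operator
for q in quads:
    print(q)
```

Each emit call returns the temporary holding its result, so nested subexpressions chain naturally into the operand fields of later quadruples.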
• Examples of optimization:
– Re-arrange quadruples to remove redundant LOAD and STORE
– Intermediate result is assigned to register to make code more efficient.
• Intermediate code for the program from Fig5.1 (Figure 5.22)
• How to generate intermediate form of the program?
– Replace the code generation routines with intermediate form generation routines.
Machine-dependent optimization

• Assignment and use of registers


– Faster than from memory
– Try to have values in registers and use them as much as possible.
– Ex: VALUE is used in 7 and then in 9. If enough registers are available, VALUE should be
retained in a register.
– Ex: in 16, i5 is assigned to MEAN. If i5 is assigned to a register, it can be used directly in 18.
– Ex: machine code in Fig 5.21 uses only one register A to efficiently handle six of eight
intermediate results (ij) in Fig 5.22.
• Since the number of registers is fixed, how do we select a register to replace when needed?
– Scan ahead for the next point at which each register's value will be used; the register whose next
use is farthest away is selected for replacement.
– The value in the replaced register may need to be stored to memory first, if it is not there already
(see the GETA procedure, which stores A's value to memory before a load if needed).
• The control flow of the program must also be considered, since jumps make it difficult to keep track of
register contents.
• Ex: Quadruple 1 assigns 0 to SUM, quadruple 7 uses SUM. If SUM is in a register, then use the
register directly. However, the JUMP in quadruple 14 jumps to 4, then to 7. But this time, SUM may
not be in the register.
• Solution: divide the program into basic blocks, with jumps only between blocks: each block has one
entry point, one exit point (which can be a jump), and no jumps within the block.
– Basic blocks and flow graph for the quadruples in Fig5.22 (Figure 5.23)
– Then limit the register assignments to basic blocks.
– More sophisticated techniques may allow register assignments from one block to the next.
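The basic-block partition can be sketched with the standard "leaders" rule (the quadruple layout with the jump target in the result field, and the JMP/JGT opcode names, are assumptions of this sketch):

```python
# Sketch: split a quadruple list into basic blocks.  A leader is the first
# quadruple, any jump target, or the quadruple right after a jump.
def basic_blocks(quads):
    leaders = {0}
    for i, q in enumerate(quads):
        if q[0] in ("JMP", "JGT"):            # jump instructions (assumed)
            leaders.add(int(q[3]))            # jump target starts a block
            if i + 1 < len(quads):
                leaders.add(i + 1)            # fall-through after a jump
    cuts = sorted(leaders) + [len(quads)]
    return [quads[cuts[k]:cuts[k + 1]] for k in range(len(cuts) - 1)]

quads = [
    (":=", "#0", "", "SUM"),                  # 0
    ("JGT", "I", "#100", "4"),                # 1: exit the loop to quad 4
    ("+", "SUM", "VALUE", "SUM"),             # 2
    ("JMP", "", "", "1"),                     # 3: back to the loop test
    (":=", "SUM", "", "TOTAL"),               # 4
]
for block in basic_blocks(quads):
    print(block)
```

Within each resulting block there are no jumps, so register contents can be tracked safely from the block's entry to its exit.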
Machine-dependent optimization (cont.)

• Re-arrangement of quadruples:
– Rearrangement of quadruples for code optimization (Figure 5.24)
• Reduced from 9 instructions to 7 after rearrangement of the quadruples.
• Other optimizations take advantage of specific machine characteristics and instructions:
– Specific loop control or addressing modes.
– Calling procedures and manipulating data structures in a single operation.
– Parallel features, if any.
Machine independent compiler features

--structured variables
• Array, record, string, set,…
• A: array[1..10] of integer
– If each integer is one word, then allocate 10 words for A.
– Generally, array[l..u] of integer, allocate u-l+1 words.
• B: array[0..3,1..6] of integer, allocate 4*6=24 words.
– Generally, array[l1..u1,l2..u2] of integer allocates (u1-l1+1)*(u2-l2+1) words.
• Reference to an element of an array:
– One dimension: direct correspondence
– Multiple dimensions: row-major or column-major. FORTRAN uses column-major.
• Storage of B: array[0..3,1..6] in (a) row-major order and (b) column-major order (Figure 5.25)
– Must calculate the address of the element relative to the beginning of the array.
• Such code is generated by the compiler: the computed displacement is placed in the index register and indexed addressing is used.
• Ex: reference to A[6], with w = 3 bytes and l = 1: relative address = 3*5 = 15 (3 bytes × 5 preceding elements). This can be computed by the compiler.
• However, for a reference to A[i], the compiler must generate code to compute the relative address: w*(i-l).
• Likewise, for a reference to B[i,j], it generates code to compute: w*[(i-l1)*(u2-l2+1)+(j-l2)]
• Code generation for array references (Figure 5.26)
– The symbol table entry for an array contains the array type, dimensions, and lower and upper limits.
• This is enough for the compiler to generate reference code if the information is constant at compile time; such an array is called a static array.
– Dynamic array:
• INTEGER, ALLOCATABLE, ARRAY(:,:) :: MATRIX, then ALLOCATE (MATRIX(ROWS, COLUMNS)).
• Since ROWS and COLUMNS are variables whose values are not known at compile time,
– the compiler cannot generate address-calculation code as above for element references. How is this case handled?
– Instead, the compiler creates a descriptor (called a dope vector) for the array. When storage is allocated for the array, the values of the bounds are calculated and stored in the descriptor. The generated code for an array reference then uses the values from the descriptor to calculate the relative address.
• The same principle can be used to compile other structured variables: information about the structure of a variable must be stored so that correct code can be generated.
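The static and dynamic addressing cases above can be sketched together in C: the static functions compute a relative address from bounds known at compile time, while the dope vector stores the same bounds at allocation time so that the identical formula can be applied at run time. The struct layout and names here are illustrative, not from the text:

```c
#include <stdlib.h>

/* Static case: relative address of A[i] for A: array[l..u], width w bytes. */
int rel_addr_1d(int w, int i, int l) {
    return w * (i - l);
}

/* Static case: relative address of B[i,j] for B: array[l1..u1, l2..u2]
   in row-major order: w * ((i - l1) * (u2 - l2 + 1) + (j - l2)). */
int rel_addr_2d(int w, int i, int j, int l1, int l2, int u2) {
    return w * ((i - l1) * (u2 - l2 + 1) + (j - l2));
}

/* Dynamic case: a dope vector holding the bounds, filled in when the
   array is allocated rather than at compile time. */
struct Dope { int l1, u1, l2, u2, w; void *base; };

struct Dope *alloc_matrix(int rows, int cols, int w) {
    struct Dope *d = malloc(sizeof *d);
    d->l1 = 1; d->u1 = rows; d->l2 = 1; d->u2 = cols; d->w = w;
    d->base = malloc((size_t)rows * (size_t)cols * (size_t)w);
    return d;
}

/* Same formula as rel_addr_2d, but the bounds come from the descriptor. */
int dope_rel_addr(const struct Dope *d, int i, int j) {
    return d->w * ((i - d->l1) * (d->u2 - d->l2 + 1) + (j - d->l2));
}
```

With w = 3 and l = 1, rel_addr_1d(3, 6, 1) reproduces the A[6] example above: 3*5 = 15.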
Machine independent compiler features

--machine independent code optimization

• Elimination of common subexpressions


– Code optimization by elimination of common subexpressions (Figure 5.27)
• 2*J is a common subexpression.
• It can be detected by analyzing the quadruples.
• Quadruples 5 and 12 compute the same value, since J does not change between them. Quadruple 12 can therefore be deleted, with all references to i10 changed to i3.
• After substituting i3 for i10, quadruples 6 and 13 become identical,
– so quadruple 13 is removed and i4 is substituted for i11.
• Similarly, quadruples 10 and 11 can be removed, since they are equivalent to 3 and 4.
• Total number of quadruples reduced from 19 to 15.
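The elimination step can be sketched as a scan over the quadruples: if an earlier quadruple with the same operator and operands exists (and no operand has been redefined in between, which this toy version simply assumes), the later quadruple is dropped and its result mapped onto the earlier one. A hedged sketch, not Beck's actual algorithm:

```c
#include <string.h>

/* Toy quadruple: operator, two operand names, result name, and a
   substitution field filled in when the quadruple is eliminated. */
struct Q { const char *op, *a1, *a2, *res, *subst; };

/* For each quadruple, look for an identical earlier surviving one;
   if found, mark this quadruple for deletion and record which earlier
   result should replace references to its result.
   Returns the number of quadruples eliminated. */
int eliminate_cse(struct Q q[], int n) {
    int removed = 0;
    for (int i = 0; i < n; i++) {
        q[i].subst = NULL;
        for (int k = 0; k < i; k++) {
            if (q[k].subst == NULL &&
                strcmp(q[i].op, q[k].op) == 0 &&
                strcmp(q[i].a1, q[k].a1) == 0 &&
                strcmp(q[i].a2, q[k].a2) == 0) {
                q[i].subst = q[k].res;  /* reuse earlier value; drop quad i */
                removed++;
                break;
            }
        }
    }
    return removed;
}
```

On the pattern from the text, two occurrences of 2*J (results i3 and i10) collapse to one, with i10 replaced by i3.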
• Code optimization by removal (movement) of loop invariants (Figure 5.27 Cont'd)
– Although the total number of quadruples remains the same, the quadruples within the loop body are reduced from 14 to 11.
• Substitution of a faster operation for a slower one
– In the loop, 2**I is replaced by power := power*2.
– In quadruples 3 and 4, 3*(I-1) is replaced by disp := disp+3.
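The 2**I substitution can be sketched in C: instead of recomputing an exponentiation on every iteration, a running variable is doubled once per pass. The loop body below is illustrative:

```c
/* Sum of 2**I for I = 1..n, with the exponentiation strength-reduced
   to a single doubling per iteration. */
int sum_of_powers(int n) {
    int sum = 0;
    int power = 1;              /* invariant: power == 2**(i-1) at loop top */
    for (int i = 1; i <= n; i++) {
        power = power * 2;      /* replaces computing 2**i from scratch */
        sum += power;
    }
    return sum;
}
```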
• Other optimizations:
– Folding: computing constant expressions at compile time.
– Loop unrolling: converting a loop into in-line code.
– Loop jamming: merging the bodies of loops.
• Optimizations by programmers:
– Change for I:=1 to 10 do X[I,2*J-1]:=X[I,2*J] to:
– T1:=2*J; T2:=T1-1; for I:=1 to 10 do X[I,T2]:=X[I,T1]
– This burdens the programmer and makes the source less clear.
– It is better to let the compiler perform such optimizations, so that programmers can write programs freely and clearly.
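The hand transformation above, written out in C: the loop-invariant subscript expressions 2*J and 2*J-1 are computed once before the loop. The array bounds here are arbitrary, chosen so that rows 1..10 and column 2*J fit:

```c
/* Copies column 2*j into column 2*j-1 for rows 1..10, with the
   invariant subscripts hoisted out of the loop. */
void copy_columns(int x[11][21], int j) {
    int t1 = 2 * j;             /* computed once, not on every iteration */
    int t2 = t1 - 1;
    for (int i = 1; i <= 10; i++)
        x[i][t2] = x[i][t1];
}
```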
Storage allocation

• Variables are assigned storage locations


– Programmer-defined variables
– Temporary variables
– Assigned by the compiler at compile time: so-called static allocation.
• Problems with static allocation
– A recursive procedure call will fail
• Problem with recursive call of a procedure with static allocation (Figure 5.29)
• SUB stores the return address for call 3 into RETADR from L, destroying the return address for call 2. Thus, control cannot return to MAIN.
• The same holds for any variable in SUB: the values from the previous call are lost.
– Static allocation also does not allow the user to request dynamic storage.
• In C: p = malloc(size); free(p);
• Dynamic allocation (automatic allocation):
– Activation record: created for each procedure call; it includes parameters, temporaries, local variables, the return address, and a register save area.
– A recursive call creates another activation record.
– The starting address of an activation record is stored in a base register.
– The procedure accesses its variables using base-relative addressing.
– A stack is used: when a call is made, its activation record is pushed on the stack. When the call returns, its activation record is popped from the stack, and the base register is reset to point to the previous activation record.
– Recall the MASM SS (Stack Segment) register.
– The compiler generates code to manage activation records.
• At the beginning of a procedure, code called the prologue creates the activation record on the stack.
• At the end of the procedure, code called the epilogue deletes the activation record and resets the pointers.
– Recursive call of a procedure using dynamic allocation (Figure 5.30 and Figure 5.30 Cont'd)
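The activation-record mechanism can be sketched with an explicit stack: each call pushes a record for its own parameter and result, and a base index plays the role of the base register. This is a didactic model with an invented record layout, not how a real compiler emits prologue and epilogue code:

```c
/* Dynamic allocation sketch: an explicit stack of activation records.
   Each call to fact pushes a record (prologue) and pops it (epilogue),
   so recursive calls no longer overwrite each other's storage the way
   the static allocation in Figure 5.29 does. */
struct ActRec { int param; int result; };

static struct ActRec actstack[64];
static int base = -1;               /* plays the role of the base register */

int fact(int n) {
    base++;                         /* prologue: push this call's record */
    actstack[base].param = n;
    if (n <= 1) {
        actstack[base].result = 1;
    } else {
        int sub = fact(n - 1);      /* inner call pushes/pops its own record */
        actstack[base].result = actstack[base].param * sub;
    }
    int r = actstack[base].result;
    base--;                         /* epilogue: pop, restoring previous base */
    return r;
}
```

Note that the inner call changes `base` while it runs and restores it on return, exactly as the base register is reset when an activation record is popped.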
Storage allocation (cont.)

• Programmer-controlled dynamic allocation:


– In Pascal: new(p) and dispose(p); in C: malloc(size) and free(p).
– A variable allocated this way does not have a fixed location in memory and must be addressed by indirect addressing via a pointer P. P itself does have a fixed location in the activation record.
– Such storage may be managed by the OS, or by a HEAP: a large area of memory allocated to the user program by the OS and managed by a run-time procedure.
– In some languages, such as Java, there is no need to free dynamically allocated storage; run-time garbage collection does the work.
– This provides another example of delayed binding: the association of an address with a variable is made when the procedure is executed, not when it is compiled or loaded.
• In general, there are three kinds of memory:
– Static/global memory: for static and global variables
– Stack: for procedure calls
– Heap: for user-requested dynamic allocation.
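A minimal C illustration of the heap case above: the pointer p occupies a fixed slot in the activation record, while the ten integers it names live in the heap and are reached indirectly through p:

```c
#include <stdlib.h>

/* p has a fixed location (a local in this activation record); the
   storage it points to is heap memory obtained from malloc and is
   accessed only by indirect addressing through p. */
int heap_demo(void) {
    int *p = malloc(10 * sizeof *p);        /* storage from the heap */
    if (p == NULL) return -1;
    for (int i = 0; i < 10; i++)
        p[i] = i * i;                       /* indirect store via p */
    int sum = 0;
    for (int i = 0; i < 10; i++)
        sum += p[i];                        /* indirect load via p */
    free(p);                                /* explicit deallocation */
    return sum;
}
```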
Block-structured languages

• A very interesting construct in languages


• A block is a portion of a program that has the ability to declare its own identifiers.
– Procedures/blocks may be nested within one another.
– A name defined in a block is available within that block, including within all the inner blocks defined within it.
– A name cannot be used outside its defining block.
– A name defined in an inner block overrides the same name defined in an outer block.
• Nesting of blocks in a source program (Figure 5.31)
– Ex: in A, VAR X,Y,Z: INTEGER; in B: VAR W,X,Y: REAL; in C: VAR V,W: INTEGER.
– In B, references to X and Y will be REAL, and to Z will be INTEGER.
– In C, references to W will be INTEGER, to X and Y will be REAL, and to Z will be INTEGER.
– Blocks and their levels: the nesting depth of each block.
Block-structured languages (Cont.)

• Compiling blocks
– The entries representing declarations of the same name in different blocks are linked together in the symbol table with a chain of pointers.
– When a reference to a name is encountered, the current block is checked for its definition; if it is not found there, its directly surrounding block is checked, then that block's surrounding block, and so on.
– Automatic storage allocation: on each call to a procedure, its activation record is pushed on the stack.
– In addition, a Display is kept: it contains pointers to the most recent activation records for the current block and all its surrounding blocks in the source program. When a name declared in a surrounding block is referenced, the Display is used to find the corresponding activation record on the stack.
– Use of the display for the procedures in Fig. 5.31 (Figure 5.32)
• Ex: A calls B calls C: (a); C calls C: (b); C calls D: (c) (since D's directly surrounding block is A); D calls B: (d) (since B can only access B and A).
– At the beginning of a block, code is generated to initialize the Display for the block.
– At the end of a block, code is generated to restore the previous Display.
– Where is the Display kept?
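The outward search described above can be sketched as a chain of scope records, each pointing to its surrounding block. The names and types below mirror the Figure 5.31 example; the struct itself is an invented simplification of the symbol-table chain:

```c
#include <stddef.h>
#include <string.h>

/* One scope record per block: its declared names and types, plus a
   pointer to the directly surrounding block's scope. */
struct Scope {
    const char *names[8];
    const char *types[8];
    int n;
    struct Scope *outer;
};

/* Search the current block, then each surrounding block in turn;
   returns the type of the name, or NULL if it is undeclared. */
const char *lookup(const struct Scope *s, const char *name) {
    for (; s != NULL; s = s->outer)            /* walk outward */
        for (int i = 0; i < s->n; i++)
            if (strcmp(s->names[i], name) == 0)
                return s->types[i];
    return NULL;
}
```

With A declaring X, Y, Z as INTEGER and B (nested in A) declaring W, X, Y as REAL, a lookup in B finds X as REAL (the inner declaration overrides) and Z as INTEGER (found in A).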
Compiler Design Options

• One-pass compiler


– Difficult to perform optimization
– A declaration of a name may appear after its reference.
• Ex: X:=Y+Z; if X, Y, and Z are not the same type, conversion code must be generated. A declaration appearing after the reference causes problems.
• Multiple passes:
– ALGOL 68 requires at least three passes.
– Compilation efficiency vs. execution efficiency:
• E.g., in student environments, which mainly involve compilation and testing, one-pass compilation is good.
• If a program is executed many times after compilation, or processes a large amount of data, multiple passes are worthwhile.
• Interpreters
– Each statement in the source program is translated into some intermediate form and then executed directly, before moving on to the next statement.
– In general, statements within a loop are repeatedly analyzed and translated.
– Interpretation is therefore slower than executing compiled code.
– Real advantage: debugging. Errors are reported in terms of the source program, directly and immediately.
– Interpreters are therefore attractive in educational environments and for beginners.
– Some cases where interpreters do well:
• Calls to many system libraries, which are the same (already in object code) whether interpreting or compiling.
• The types of variables can change dynamically.
• Dynamic scoping: the variables that can be referred to by a function or subroutine are determined by the sequence of calls made during execution, rather than by the nesting of blocks in the source program.
Compiler Design Options (Cont.)
• P-code compiler:
– The intermediate form is the machine language for a hypothetical computer, called a pseudo-machine or P-machine.
– The machine structure is essentially a stack.
– Advantage: portability. P-code can be run on any machine that has a P-code interpreter.
– Translation and execution using a P-code compiler (Figure 5.33)
• Main disadvantage: slower execution.
• Java takes the same approach with .class files.
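The "essentially a stack" structure of the P-machine can be sketched with a tiny interpreter loop: each instruction either pushes a constant or pops two values, combines them, and pushes the result. The three opcodes below are illustrative, not real P-code:

```c
/* Toy stack machine: the interpreter fetches one instruction at a
   time and operates on an evaluation stack, as a P-code interpreter
   does for its hypothetical machine. */
enum Op { PUSH, ADD, MUL };
struct Instr { enum Op op; int arg; };

int run(const struct Instr *code, int n) {
    int stack[32];
    int sp = 0;                             /* stack pointer */
    for (int pc = 0; pc < n; pc++) {
        switch (code[pc].op) {
        case PUSH: stack[sp++] = code[pc].arg; break;
        case ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        }
    }
    return stack[0];                        /* result left on the stack */
}
```

The expression (2+3)*4 compiles to PUSH 2, PUSH 3, ADD, PUSH 4, MUL and evaluates entirely on the stack.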
• Compiler-compiler: a compiler generator.
– Both scanners and parsers can be constructed automatically, in many cases.
– So, given lexical rules and a grammar, along with semantic routines for the rules, a compiler can be generated automatically.
– Automated compiler construction using a compiler-compiler (Figure 5.34)
– Advantage: reduces the work of implementing a compiler.
– Disadvantage: the generated compiler tends to require more memory and compiles more slowly than a hand-written compiler.
– However, the code produced by such a compiler may be better, since the writer can focus more on optimization.
– Snowball compiler construction technique.
Implementation Examples

--SunOS C Compiler

• Pre-processing
– File inclusion, including assembly subroutines
– Trigraph sequence conversion, such as ??< to {
– Line combination
– Lexical analysis
– Macro processing
– Escape sequence conversion, such as \n to a newline
– Concatenation of adjacent string literals: "Hello, " "World" becomes "Hello, World".
• The source is compiled to an assembly language program.
• Many system programs are written in C, so efficiency is important.
• Four levels of code optimization:
– O1: local optimization; O2: global optimization;
– O3 and O4: improve speed, e.g., loop unrolling (O3) and converting a procedure call into in-line code (O4).
• Code can be inserted into the object code to gather run-time information and perform certain tasks.
• As an option, symbolic information can be included in the object code (e.g., for debugging purposes).
Implementation Examples

--Cray MPP FORTRAN Compiler

• MPP: Massively Parallel Processing.
• Mainly parallelization for shared data.
