
Melhum and Mahmood, Iraqi Journal of Science, Vol. 53, No. 1, 2012, pp. 179-185

DESIGN ASSEMBLER BASED ON LEX AND YACC

Amera Ismail Melhum, *Suzan Abdulla Mahmood


Department of Computer Science, College of Science, University of Duhok. Duhok-Iraq.
* Department of Computer Science, College of Science, University of Sulaimani. Sulaimaniah-Iraq.

Abstract
LEX and YACC are very useful tools for constructing assemblers; they generate functions that perform the standard operations of lexical analysis and parsing without affecting the organization of the processes of semantic analysis, machine code generation and compilation listing. In this work, a cross assembler was designed using the compiler development tools LEX and YACC, which have the ability to implement the lexical analyzer and parser, generating the scanning and parsing modules in a short period of time. These activities are expressed in the form of actions; a large number of lexical and grammar rules is handled by these actions in a simple way.
Keywords: Assembly language, Assembler, Context Free Grammar, LEX, YACC

Design of a Cross Assembler Using LEX and YACC

Amera Ismail Melhum, *Suzan Abdulla Mahmood

Department of Computer Science, College of Science, University of Duhok. Duhok-Iraq.
* Department of Computer Science, College of Science, University of Sulaimani. Sulaimaniah-Iraq.

Abstract (translated from the Arabic)
LEX and YACC are of great value for building assemblers, as they generate functions that perform the standard operations of lexical analysis without affecting the organization of the semantic analysis processes. In this work, the cross assembler was designed using compiler development tools such as LEX and YACC because of their ability to accomplish lexical analysis, token definition and syntax construction in a short period of time compared with programming languages such as C++, Visual Basic and other languages. It is worth noting that these activities are described in the form of rules and actions in a simplified way, with the capacity to handle a large number of grammar rules and lexical entries.

1. Introduction
In our world there is a great variety of microprocessors, and the construction of assemblers for these microprocessors requires much man-power and time [1]. Assemblers are often considered as consisting of two phases: the analysis phase, where most of the error checking is done, and the synthesis phase, where the object code is created. During the analysis phase, the assembler source code is scanned for any lexical or syntactic errors. This is analogous to the spelling and grammar checks in a word processor. If errors are found, exceptions are thrown and reported to the developer. If, however, no errors are found, method calls are made to the backend and synthesis begins [2].
This paper is of interest not only to newcomers to assembler construction, but also to those already familiar with the subject who wish to follow the development of an assembler for a microprocessor. It can therefore be


used as an extensive programming project in a software engineering setting.
This research shows how to build an assembler using compiler development tools, LEX and YACC [3], which have the ability to implement the lexical analyzer and parser and help us to generate the scanning and parsing modules in a short period of time.
In the next section, the structure of the assembler is implemented as a two-pass assembler; "two-pass" refers to the number of times the input file is read.

1.1 LEX and Yacc
An assembler, compiler or interpreter for a programming language is often decomposed into two parts [4]:
1- Read the source program and discover its structure.
2- Process this structure, e.g. to generate the target program.
Lex and Yacc can generate program fragments that solve the first task. The task of discovering the source structure is in turn decomposed into subtasks:
1- Split the source file into tokens (Lex).
2- Find the hierarchical structure of the program (Yacc).
Lex and Yacc are tools that help programmers build compilers and interpreters, but they also have a wider range of applications, like converting text data to HTML, XML or LaTeX.

1.2 Lex: A Lexical Analyzer Generator
Lex helps to write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine. While Lex can be used to generate sequences of tokens for a parser, it can also be used to perform simple input operations. A number of programs have been written to convert simple English text into dialects using only Lex.
Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized, the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream [5, 6]. The following is a sample of the Lex part; on the left-hand side is the description of a token and on the right-hand side the action associated with it.

%{
#include <stdlib.h>
void yyerror(char *);
#include "y.tab.h"
%}
%%
 /* variables */
[a-z]       { yylval = *yytext - 'a';
              return VARIABLE; }
 /* integers */
[0-9]+      { yylval = atoi(yytext);
              return INTEGER; }
 /* operators */
[-+()=/*\n] { return *yytext; }
 /* skip whitespace */
[ \t]       ;
 /* anything else is an error */
.           yyerror("invalid character");
%%
int yywrap(void) {
    return 1;
}

1.3 Yacc: Yet Another Compiler-Compiler
Computer program input generally has some structure; in fact, every computer program that does input can be thought of as defining an ``input language'' which it accepts. An input language may be as complex as a programming language, or as simple as a sequence of numbers. Unfortunately, usual input facilities are limited, difficult to use, and often are lax about checking their inputs for validity.
Yacc provides a general tool for describing the input to a computer program. The Yacc user specifies the structures of his input, together with code to be invoked as each such structure is recognized. Yacc turns such a specification into a subroutine that handles the input process; frequently, it is convenient and appropriate to have most of the flow of control in the user's application handled by this subroutine [5, 6].
The following is a sample of the Yacc part; on the left-hand side is the description of a grammar rule and on the right-hand side the action associated with it.

%token INTEGER VARIABLE
%left '+' '-'


%left '*' '/'


%{
void yyerror(char *);
int yylex(void);
int sym[26];
%}
%%
program:
program statement '\n'
|
;
statement:
expr { printf("%d\n", $1); }
| VARIABLE '=' expr { sym[$1] = $3; }
;
expr:
INTEGER
| VARIABLE { $$ = sym[$1]; }
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
| expr '*' expr { $$ = $1 * $3; }
| expr '/' expr { $$ = $1 / $3; }
| '(' expr ')' { $$ = $2; }
;
%%

Figure 1: Flowchart of Pass 1
2. Proposed structure of the Assembler
Since the problem of assembling is as old as symbolic languages, many different solutions to it are described in detail in the literature. All the approaches may be roughly classified according to the number of readings (passes) of the input file: there are one-pass, two-pass and multi-pass assemblers, with well-known advantages and disadvantages. The two-pass assembler is surely the simplest to implement, which is why we decided to adopt this approach.
In this paper, the whole algorithm has been divided into two parts, called "pass 1" and "pass 2", performing the functions shown in (Figure 1) and (Figure 2).
The functions of Pass_1:
-Checking lexical, syntax and semantic correctness of a program.
-Storing label definitions and calls (references).
-Checking the possibility of resolving all internal and external references.
-Printing out the source program merged (if needed) with error messages.
Figure 2: Flowchart of Pass 2
The functions of Pass_2:
-Checking semantic correctness of a program.
-Machine code generation.
-Label processing (searching for the required labels).
-Generation of compilation listing - in case of


error a suitable message is displayed.
The flowcharts were translated to a C-language program with the following structural units:
- data
- main segment
- lexical analyzer routine
- terminal symbol table
- parser routine
- grammar rules actions
- post-pass 1 processing
- post-pass 2 processing
The most important problem is the distribution of "work" (i.e. the functions specified for both passes) among these structural units. In particular, the roles of the lexical analyzer, the parser, and the terminal symbol and grammar rule actions are interchangeable over a wide range. It is even possible to organize the process of lexical analysis within the parser routine, but this is surely an ineffective approach. On the other hand, the lexical analyzer, supported by carefully designed terminal symbol actions, may shorten the routine and speed up its execution. Assigning strict and unique functions to the structural units would result in inefficient use of the features of the parser, the lexical analyzer and C-language constructions. So the machine code can be obtained by the parser routine either from the machine code tables or from the lexical analyzer.
The internal structure of the designed assembler is shown in (Figure 3).
In the first step, the context-free grammar for the assembly language was defined. Once a context-free grammar for the language has been properly defined, the next step is to design tools that will check the developer's code for errors based upon these rules. LEX/YACC, or the GNU (General Public License) Flex/Bison [7], are software tools that essentially scan a grammar definition, given in BNF, and then generate the lexer and parser accordingly. First, all the pattern-matching rules must be specified for lex (bas.l) and the grammar rules for yacc (bas.y).
A context-free grammar consists of four components [8, 9]:
1. A set of tokens, known as terminal symbols.
2. A set of nonterminals.
3. A set of productions, i.e. a nonterminal followed by a sequence of tokens and/or nonterminals.
4. A designation of one of the nonterminals as the start symbol.
This definition essentially states that a context-free grammar must begin with a start symbol and eventually reduce to a finite set of keywords, known as tokens or terminals, and non-terminals (e.g. variable names, strings of numbers or characters). The rules that define these objects and their order in the language are known as productions. This will be made clearer in the following example. Backus-Naur Form (BNF), developed collaboratively with John Backus, the father of FORTRAN, is a syntax for describing context-free grammars as given by Chomsky. An example of the BNF for a simple calculator is shown below:

expr → mexpr ( + mexpr )*
mexpr → atom ( * atom )*
atom → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Here, the tokens are + * 0 1 2 3 4 5 6 7 8 9, and the nonterminals are expr, mexpr and atom.

Figure 3: The internal structure of the design-assembler

The start symbol is expr and the first two lines are productions. The | is "or" and → is the symbol for definition, such that the last line defines an atom as a zero or a one or a two, and so on. Note that the quantifier ( )* allows for zero or more of the objects contained in the parentheses. Hence, the first line essentially defines an expression as a multiplicative expression, optionally added to one or more multiplicative expressions. This allows for statements such as


2+3*4 in this simple example grammar [3]. The following is a part of the BNF for the proposed assembler:

prog → stmts end
stmts → stmt newline stmts | epsilon
stmt → line | comment | label line | equ arg
line → ld arg , arg
     | in arg , arg
     | out arg , arg
     | brs arg , arg
     | al1
     | al2
     | al2 arg , arg
     | posh ras
     | ret conr
arg → a | b | c | d | e | h | i | r | hl | bc | de | af
     | sp | af | ix | iy
     | [ bc | de | hl | sp | ixy | label | number ]
conr → z | c | p | m | nz | nc | po | pe
ixy → i xy mp number
xy → x | y
mp → - | +
posh → pop | push
al1 → cp | sub | and | or | xor
al2 → add | adc | sbc
al3 → inc | dec
ras → rlc | rrc | rl | rr | sla | sra | srl
brs → bit | set | res
..................
comment → ; identifier
identifier → letter identifier | letter letterdigit | epsilon
label → letter letterdigit
letterdigit → letter letterdigit | digit letterdigit | _ letterdigit | epsilon
number → digit number | epsilon
digit → 0 | 1 | 2 | 3 | … | 9
letter → a | b | c | d | … | z

2.1 Data (machine code for the instructions)
The assembler uses 2 types of data:
- One- or two-dimensional tables containing the machine codes of instructions for different operands. The correct association of an instruction and its operands corresponds to a non-zero value of a machine code table element.
- Two lists of symbolic names (labels):
  - ldl, the list of the label definitions: the address of the actual label that appears in the label field.
  - lrl, the list of the required labels: the labels that appear in the operand field.
Both lists have the same structure:

struct list {
    char name[10];   // name of label
    int lnum;        // line number
    int address;     // address: contents of the byte counter
    struct list *next;
} *ldl,              // pointer to the current element of the ldl
  *pd,               // pointer to the last element of the ldl
  *lrl,              // pointer to the current element of the lrl
  *pr;               // pointer to the last element of the lrl

Applying the list structure is an acute necessity, since the number of labels and of references to them depends on the program and can vary over a big range.

2.2 Main segment
In contrast to its name, this segment performs only initialization of the lists LDL and LRL, calls the parser and, after it terminates, prints the closing message.

2.3 Lexical Analyzer Routine
The lexical analyzer routine identifies terminal symbols in the input text. Each time, the routine tries to find the longest string of characters matching one of the given patterns of terminal symbols. In case of ambiguity (there are two or more patterns matching the same string) the first pattern found is taken into account, so the order of the patterns has an important role. Each of the patterns corresponds to user-defined actions coded in the C language. Recognition of a given pattern in the input text is associated with execution of the suitable action.
This routine was produced by the lexical analyzer generator LEX: the input data to this program is just a specification of patterns and corresponding actions (terminal symbol actions). The output file of LEX is the (lex.yy.c) file that contains the C-language lexical analyzer routine.

2.4 Terminal symbol actions
The main job of a terminal symbol action is informing the parser routine about the identified token. Additionally, this action arranges for the characters to be printed out (at a suitable place in the output buffer char linebuf[75]). The number of lexical rules decides the size of the lexical analyzer routine. It is a common view that when an action continues the process of input string


identification instead of multiplying the rules, it is enough to add one or two statements in the C language. Maybe the final specification becomes not so clear, but it undoubtedly produces a shorter (both source and object) routine. Usually an action transmits a value corresponding to the token: sometimes it is a machine code, sometimes a selector of an instruction belonging to the token, and sometimes it is a string of characters.

2.5 Parser routine
The parser routine has to establish the correctness of the input sentence. The sequence of tokens received from the lexical analyzer is compared with the grammar rules defined within the parser; when such a rule is found, the tokens are reduced to the nonterminal symbol and the action corresponding to this grammar rule is executed. If there is no such grammar rule, the parser calls the function errorok(), sending a message about the error, and makes a recovery (removes the effects of the error; usually the rest of the line is skipped).
The parser routine was produced by the compiler-compiler YACC [6, 7] on the basis of a user (designer) supplied grammar, accompanied by actions (grammar rule actions); YACC accepts this data and produces the C-language routine yyparse(), loaded to the file y.tab.c. To obtain the object code of the assembler, the contents of the two files y.tab.c and lex.yy.c must be processed by the C-language compiler (CC).
The output of the Lex part will be the input to the Yacc part, each token with its associated value stored in a variable. The tokens are NA LABEL JPCAL COMMT ORG ARG CONR LD AL2 AL1 AL3 RAS BRS RAT IN OUT RET POSH END CON DJR RESB ACON; the following are examples of some tokens with their values:

Token     value associated with token
--------- ------------------------------------------
ARG       a | b | c | d | e | h | i | r | hl | bc | de | af
          | sp | af | ix | iy
          | [ bc | de | hl | sp | ixy | label | number ]
AL2       add | adc | sbc
AL1       cp | sub | and | or | xor
AL3       inc | dec
RAS       rlc | rrc | rl | rr | sla | sra | srl
BRS       bit | set | res
POSH      pop | push
JPCAL     call | jp
END       end
ORG       org
RESB      resb

2.6 Grammar rules actions
Most of the grammar rules are associated with actions. It means that each time a grammar rule is applied, the corresponding action is performed. The topmost grammar rule (corresponding to recognition of a complete assembly language line terminated by the 'new line' character) calls an action printing out the line along with an error message (if any), and the byte counter is incremented by the machine code length of the current instruction.
The majority of actions cooperate with grammar rules identifying assembly language statements. Each such action identifies the kind of instruction and operand(s) and continues testing the syntax correctness of the statement; if nothing wrong is detected, the action finds the machine code corresponding to the instruction (pass 2) or only determines the length of the instruction machine code (pass 1).

2.7 Post-pass 1 processing
Pass 1 is terminated when the lexical analyzer discovers the end of the input file. The lexical analyzer then calls a designer-supplied function yywrap(); the same function is called again after pass 2 terminates.
In case of errors coming from the syntax analysis of a program or from post-pass 1 processing, the assembler sends a message and ends its work. Otherwise the input file is reopened and pass 2 is started.

2.8 Post-pass 2 processing
At the end of the file the function yywrap() is called again (yywrap is called by lex when input is exhausted; it returns 1 if processing is done or 0 if more processing is required), as in post-pass 1 processing, but this time the second part of the function is active, to produce a reference table including information about all defined/declared labels and the references to them.

3. Conclusion
It is worth noticing that LEX and YACC may seem to be too advanced a tool for constructing such simple compilers as assemblers. Yet they helped us to generate the syntax parsing modules in a short period of time; LEX and YACC have the ability to implement the lexical analyzer and parser, and both are based on the minimal specification of a programming language, which includes the definitions of the set of symbols used (lexical rules), the set of valid programs (syntax rules) and the "meaning"


of valid programs (semantics). Using this approach, a Z80 assembler was implemented.
References
1. Peter, P.K. and Sammy, T.K. 1990. A Generative Approach to Universal Cross Assembler Design, ACM SIGPLAN Notices, 25(1), pp 43-51.
2. Nakano, K. and Ito, Y. 2008. Processor, Assembler, and Compiler Design Education using an FPGA, IEEE International Conference on Parallel and Distributed Systems, 14(1), pp 723-728.
3. Lappalainen, P. 2001. Visualization of Assembly-Code Conversion into Machine Code. http://www.ee.oulu.fi/~pl1/tkt1/
4. Niemann, T. 2004. A Guide to LEX & YACC. http://www.scribd.com/doc/43874121/A-Guide-to-Lex-Yacc-By-Thomas-Niemann
5. Linseman, A. and Nicol, S. 1988. MKS LEX & YACC Reference Manual, Mortice Kern Systems, Inc., pp 240-270.
6. Levine, J.; Mason, T. and Brown, D. 1992. Lex & Yacc, Second Edition, O'Reilly and Associates, pp 235-300.
7. Ashton, B.; Bradley, J.; Dixon, T.; Gaines, J.; Lodder, M. and Ricks, M. 2007. DDC Assembly Language Specification and Assembler Implementation, Engineering Clinic Project Sponsored by Hill Air Force Base, University of Utah. http://www.jim-gaines.com/research_DDC/DDCFinalPaper3.pdf
8. Aho, A.V.; Sethi, R. and Ullman, J. 2006. Compilers: Principles, Techniques, and Tools, 2nd Edition (Dragon Book), pp 156-170.
9. Nicol, G.T. 1993. Flex: The Lexical Scanner Generator, Free Software Foundation, Version 2.3.7. http://www.amazon.com/Flex-Lexical-Scanner-Generator-Version/dp/1882114213

