Compiler Design
Compiler Design
Compiler Design
Looking for GATE Preparation Material? Join & Get here now!
<<Previous Next>>
Introduction Of Tokanisation
Parsing actually starts to get interesting with tokenisation. This is where input
is finally converted into a form that the players couldn’t have typed in
directly.
Here’s the gist of it. You take a pre-processed input line and chunk it into
units called symbols. A symbol is one of:
1. Anything between matching quotation marks, or between an unmatched
quotation mark and the end of the line. Example: "this is a string".
2. Any other piece of punctuation. Examples: ? ; , ! .
3. Any series of digits not bounded by alphabetical characters. Examples: 7,
26. Minus signs can count as part of integers too.
4. Any series of alphanumerical characters. Examples: WALK, DRAGON,
COIN22.
5. Whatever special characters you use to talk to the tokenisation process
directly. I’ll discuss these later.
ADVERTISEMENT
OK, now what you need from the tokeniser is a list of tokens. These are
nodes that represent multiple parts of speech (nouns, verbs etc.), of which
the main part of the parser can then attempt to make sense. They usually
consist of three elements:
1. A type identifier, so you know what kind of token it is.
2. Data (for freestyle tokens).
3. A set of parts of speech that the token can take.
For strings, the type will be some predefined constant, such as T_STRING,
StringType or whatever your naming convention decrees. The data will be the
body of the string, e.g. WHAT?!!. The set of parts of speech will contain some
representation for nouns, and maybe also for verbs. I’ll write this as [noun,
verb]. Don’t panic, I shall explain parts of speech in detail when I reach the
main parsing process in a later article.
For integers, the type will be T_INTEGER or IntegerType or whatever, and the
data will be a number such as 142857. The set of parts of speech will be at
least [noun, adjective], with maybe some others thrown in too.
Punctuation marks will have their own predefined nodes. You can look them
up in a table, it’s simple enough. If you like, you can point some of them to
the same record, e.g. question marks and exclamation marks could be
mapped to the same node as a full stop (my apologies to American readers, I
know you call these latter two "exclamation points" and "periods").
This brings us to words...
The Vocabulary
Words must be translated into atoms (from the inheritance hierarchy, as I
described earlier in this set of articles). The data structure linking the two is
the vocabulary. This consists of a symbol table that connects words, parts of
speech (PoS) and atoms. Here’s an extract showing what a vocabulary might
contain:
If a player typed HIT ORANGE BOX then the tokeniser would need to look up
all definitions of each word and the appropriate possible meanings, i.e.:
Compiler Design
Looking for GATE Preparation Material? Join & Get here now!
<<Previous Next>>
LookaHead