01 Compiler_Scanner intro
Introduction
• Transform a stream of characters into a stream of words in the input
language
• Each word must be classified into a syntactic category, or “part of
speech.”
• The only pass in the compiler that touches every character in the input
program
• Aggregates characters to form words and applies a set of rules to
determine whether each word is legal in the source language
• If the word is valid, the scanner assigns it a syntactic category, or part
of speech
Overview
• Compiler’s scanner reads an input stream that consists of characters
and produces an output stream that contains words, each labelled
with its syntactic category
• Classification of words according to their grammatical usage
• Scanner applies a set of rules, called the microsyntax, that describe
the lexical structure of the input programming language
• specifies how to group characters into words and, conversely, how to separate
words that run together
note: punctuation marks and other symbols also count as words
Converting the DFA to executable code:
• Table-driven scanner
• Direct-coded scanner
• Hand-coded scanner
• All of these scanners operate in the same manner, by
simulating the DFA
• Repeatedly read the next character in the input and
simulate the DFA transition caused by that character
• Process stops when the DFA recognizes a word
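The simulation loop above can be sketched as follows; the particular DFA (one that recognizes register names such as r17) and its dictionary encoding are assumptions for illustration:

```python
# Minimal DFA simulation: repeatedly read the next character and
# follow the transition it causes; stop when no transition exists.
# Assumed example DFA: recognizes 'r' followed by one or more digits.
delta = {
    ("s0", "r"): "s1",       # initial 'r' moves to s1
    ("s1", "digit"): "s2",   # first digit reaches the accepting state
    ("s2", "digit"): "s2",   # further digits stay in the accepting state
}
accepting = {"s2"}

def simulate(stream: str) -> bool:
    state = "s0"
    for ch in stream:
        cat = "digit" if ch.isdigit() else ch
        nxt = delta.get((state, cat))
        if nxt is None:      # no outbound transition on this character
            return False
        state = nxt
    # a word is recognized iff the DFA halts in an accepting state
    return state in accepting
```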
Implementation strategies
• When the current state, s, has no outbound transition on the
current input character:
• If s is an accepting state, the scanner recognizes the word and
returns the lexeme and its syntactic category to the calling
procedure
• If s is a nonaccepting state, the scanner must determine whether
it passed through an accepting state on the way to s
• If the scanner did encounter an accepting state, it should roll
back its internal state and its input stream to that point and report
success
• If it did not, it should report failure
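The decision logic above (accept in place, roll back to the most recent accepting state, or fail) can be sketched as a longest-match routine; the tiny DFA at the end, which accepts "a" or "abc", is an assumption chosen so that a roll back actually occurs:

```python
def next_word(stream, pos, delta, accepting, start="s0"):
    """Return (lexeme, end_pos) for the longest prefix of stream[pos:]
    the DFA accepts, or (None, pos) on failure.  A sketch."""
    state = start
    last_accept = None            # position just past the last accepting state
    i = pos
    while i < len(stream):
        nxt = delta.get((state, stream[i]))
        if nxt is None:
            break                 # no outbound transition on this character
        state = nxt
        i += 1
        if state in accepting:
            last_accept = i       # remember passing through an accepting state
    if state in accepting:        # stopped in an accepting state: success
        return stream[pos:i], i
    if last_accept is not None:   # roll back to the most recent accepting state
        return stream[pos:last_accept], last_accept
    return None, pos              # never saw an accepting state: failure

# Assumed example DFA accepting the words "a" and "abc".
delta = {("s0", "a"): "A1", ("A1", "b"): "s2", ("s2", "c"): "A3"}
accepting = {"A1", "A3"}
```

On input "abd" the DFA blocks in a nonaccepting state after "ab", so the routine rolls back and returns "a".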
Table-Driven Scanners
• Uses a skeleton scanner for control and a set of
generated tables that encode language-specific
knowledge
• Skeleton scanner uses the variable state to hold the current state of the simulated
DFA
• Updates state using a two-step, table-lookup process
• Classifies char into one of a small set of categories using the CharCat
table.
The example scanner has three categories: Register, Digit, or Other.
• Uses the current state and the character category as indices into the transition
table
• The two-step translation (character to category, then state and category
to new state) lets the scanner use a compressed transition table
• Tradeoff between direct access into a larger table and indirect access into the
compressed table is straightforward
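The two-step lookup can be sketched as follows, using the Register/Digit/Other categories from the example; the dictionary encoding of the compressed table is an assumption for illustration:

```python
# Step 1: CharCat maps each character to a small category.
def char_cat(ch):
    if ch == "r":
        return "Register"
    if ch.isdigit():
        return "Digit"
    return "Other"

# Step 2: the compressed transition table is indexed by
# (state, category) rather than (state, character).
transition = {
    ("s0", "Register"): "s1",
    ("s1", "Digit"): "s2",
    ("s2", "Digit"): "s2",
}   # every missing entry stands for the error state

def step(state, ch):
    return transition.get((state, char_cat(ch)), "s_error")
```

Because the table is indexed by category, it has one column per category instead of one per character in Σ, which is what makes the compression pay off.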
Table-Driven Scanners
• Complete table would eliminate the mapping through CharCat, but
would increase the memory footprint of the table
• Uncompressed transition table grows as the product of the number
of states in the DFA and the number of characters in Σ; it can grow
to the point where it will not stay in cache
• With a small, compact character set, such as ASCII, CharCat can be
represented as a simple table lookup
• Relevant portions of CharCat should stay in the cache
• In that case, table compression adds one cache reference per input
character
• As the character set grows (e.g. Unicode), more complex implementations of
CharCat may be needed
• The precise tradeoff between the per-character costs of both
compressed and uncompressed tables will depend on properties of
both the language and the computer that runs the scanner
Table-Driven Scanners
• To provide a character-by-character interface to the input stream,
Skeleton scanner uses
• NextChar macro, which sets its sole parameter to contain the
next character in the input stream
• RollBack macro moves the input stream back by one
character
• If the scanner reads too far, state will not contain an accepting
state at the end of the first while loop
• In that case, the second while loop uses the state trace from the
stack to roll the state, lexeme, and input stream back to the most
recent accepting state
• In most languages, the scanner’s overshoot will be limited
• Pathological behavior, however, can cause the scanner to
examine individual characters many times, significantly
increasing the overall cost of scanning
• In most programming languages, the amount of roll back is small
relative to the word lengths
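The two-loop structure above can be sketched as follows; NextChar and RollBack are modeled with an index into the input string, and the state trace is an explicit Python list (the goto-free control flow is an approximation of a generated skeleton):

```python
def skeleton_scan(stream, delta, accepting):
    """Skeleton scanner with an explicit state trace.  A sketch."""
    state, lexeme, stack = "s0", "", []
    i = 0
    # First while loop: run the DFA until it blocks, tracing states.
    while state != "bad" and i < len(stream):
        ch = stream[i]; i += 1            # NextChar
        lexeme += ch
        if state in accepting:
            stack.clear()                 # never roll back past this point
        stack.append(state)
        state = delta.get((state, ch), "bad")
    # Second while loop: use the trace to roll the state, lexeme,
    # and input stream back to the most recent accepting state.
    while state not in accepting and stack:
        state = stack.pop()
        lexeme = lexeme[:-1]
        i -= 1                            # RollBack
    if state in accepting:
        return lexeme, i
    return None, i

# Assumed example DFA accepting the words "a" and "abc".
delta = {("s0", "a"): "A1", ("A1", "b"): "s2", ("s2", "c"): "A3"}
accepting = {"A1", "A3"}
```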
Avoiding Excess Roll Back
• Some regular expressions can cause quadratic amounts of roll back
in the scanner
• The problem arises from the requirement that the scanner return the
longest word that is a prefix of the input stream
• RE ab | (ab)*c: the corresponding DFA recognizes either ab or
any number of occurrences of ab followed by a final c
• On the input string “ababababc”, a scanner built from the DFA
will read all the characters and return the entire string as a single
word
• If, however, the input is “abababab”, it must scan all of the
characters before it can determine that the longest prefix is “ab”
• On the next invocation, it will scan “ababab” to return “ab”
• Third call will scan “abab” to return “ab”, and the final call will
simply return “ab” without any roll back
• In the worst case, the scanner can spend quadratic time reading the
input stream
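The quadratic behavior can be demonstrated by instrumenting a rollback scanner for ab | (ab)*c to count every character examination; the hand-built DFA below is an assumption for illustration:

```python
# Hand-built DFA for the RE  ab | (ab)*c.
delta = {
    ("s0", "a"): "s1", ("s0", "c"): "s5",
    ("s1", "b"): "s2",
    ("s2", "a"): "s3", ("s2", "c"): "s5",
    ("s3", "b"): "s4",
    ("s4", "a"): "s3", ("s4", "c"): "s5",
}
accepting = {"s2", "s5"}    # s2 accepts "ab", s5 accepts (ab)*c

def scan_all(stream):
    """Tokenize stream, counting every character examination."""
    pos, examined, words = 0, 0, []
    while pos < len(stream):
        state, last, i = "s0", None, pos
        while i < len(stream):
            examined += 1
            nxt = delta.get((state, stream[i]))
            if nxt is None:
                break
            state, i = nxt, i + 1
            if state in accepting:
                last = i
        end = i if state in accepting else last
        if end is None:
            break                       # no word recognized
        words.append(stream[pos:end])   # roll back to last accepting state
        pos = end
    return words, examined
```

On "ababababc" the scanner examines each of the 9 characters once and returns one word; on "abababab" each invocation re-reads the remaining suffix before rolling back to "ab", so the 8-character input costs 8 + 6 + 4 + 2 = 20 examinations.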
Direct-Coded Scanners
• Direct-coded scanners reduce the cost of computing DFA transitions by replacing the
explicit representation of the DFA’s state and transition graph with an implicit one
• The implicit representation simplifies the two-step, table-lookup computation
• Eliminates the memory references entailed in that computation and allows other
specializations.
• Resulting scanner has the same functionality as the table-driven scanner, but with a
lower overhead per character
• A direct-coded scanner is no harder to generate than the equivalent table-driven
scanner
• The table-driven scanner spends most of its time inside the central while loop; thus, the
heart of a direct-coded scanner is an alternate implementation of that while loop.
• With some detail abstracted, that loop performs the following actions:
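As a hedged illustration, a direct-coded recognizer for the register-name example (r followed by one or more digits, the assumed DFA from earlier) replaces both table lookups with state-specific tests; generated scanners use goto-style transfers between state fragments, approximated here with straight-line Python:

```python
# Direct-coded recognizer for register names r[0-9]+ (a sketch).
# Each DFA state becomes a block of code that tests the next
# character directly, eliminating the CharCat lookup and the
# transition-table reference per character.
def recognize_register(stream: str) -> bool:
    i, n = 0, len(stream)
    # state s0: expect 'r'
    if i >= n or stream[i] != "r":
        return False
    i += 1
    # state s1: expect at least one digit
    if i >= n or not stream[i].isdigit():
        return False
    i += 1
    # state s2 (accepting): consume any remaining digits
    while i < n and stream[i].isdigit():
        i += 1
    return i == n    # accept only if the whole input was consumed
```

The functionality matches the table-driven version; only the per-character overhead changes.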