01 Compiler_Scanner intro [Recovered]

The document discusses the implementation of a scanner in a compiler, which transforms a stream of characters into words classified by syntactic categories. It outlines the process of constructing scanners using regular expressions, DFA (Deterministic Finite Automaton), and various implementation strategies, including table-driven and direct-coded scanners. Additionally, it addresses challenges such as excess roll back during scanning and methods to optimize performance.


Scanner

Introduction
• Transforms a stream of characters into a stream of words in the input
language
• Each word must be classified into a syntactic category, or “part of
speech”
• The scanner is the only pass in the compiler that touches every character
in the input program
• It aggregates characters to form words and applies a set of rules to
determine whether or not each word is legal in the source language
• If the word is valid, the scanner assigns it a syntactic category, or part
of speech
Overview
• Compiler’s scanner reads an input stream that consists of characters
and produces an output stream that contains words, each labelled
with its syntactic category
• Classifies words according to their grammatical usage
• The scanner applies a set of rules, called the microsyntax, that describe
the lexical structure of the input programming language
• The microsyntax specifies how to group characters into words and,
conversely, how to separate words that run together
• Note: punctuation marks and other symbols are treated as words

• Keyword: a word that is reserved for a particular syntactic purpose and,
thus, cannot be used as an identifier
Example
• while, static: these match the rule for an identifier but have special meanings
• To recognize keywords, the scanner can either use dictionary lookup or
encode the keywords directly into its microsyntax rules
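The dictionary-lookup approach can be sketched as follows; the keyword set here is illustrative, not a complete list for any particular language.

```python
# Hypothetical sketch of keyword recognition via dictionary lookup:
# once a word matches the identifier rule, a table lookup decides
# whether it is actually a reserved keyword.
KEYWORDS = {"while", "static", "if", "else", "for", "return"}

def classify(word):
    """Return the syntactic category for an identifier-shaped word."""
    return "keyword" if word in KEYWORDS else "identifier"
```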
• The lexical structure of programming languages lends itself to efficient
scanners
• The compiler writer starts from a specification of the language’s
microsyntax and either:
• encodes the microsyntax into a notation accepted by a scanner
generator, which constructs an executable scanner, or
• uses that specification to build a hand-crafted scanner
• Both generated and hand-crafted scanners can be implemented to
require just O(1) time per character, so they run in time proportional
to the number of characters in the input stream
Cycle of Construction
Implementing scanners
• For most languages, the compiler writer can produce an
acceptably fast scanner directly from a set of regular
expressions
• The compiler writer creates a regular expression (RE) for
each syntactic category and gives the REs as input to
a scanner generator
• The generator constructs an NFA for each RE, joins them
with ϵ-transitions, creates the corresponding DFA, and
minimizes that DFA
• At that point, the scanner generator must convert the
DFA into executable code
Implementation strategies
• Implementation strategies for converting a DFA into
executable code:
• Table-driven scanner
• Direct-coded scanner
• Hand-coded scanner
• All of these scanners operate in the same manner, by
simulating the DFA
• Repeatedly read the next character in the input and
simulate the DFA transition caused by that character
• The process stops when the DFA recognizes a word
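The simulation loop described above can be sketched as follows; the transition-table encoding and the state names are assumptions made for illustration.

```python
# Sketch of DFA simulation: repeatedly read the next character and
# take the transition it causes, remembering the most recent accept.
def simulate(delta, accepting, start, text):
    """Return the longest accepted prefix of text, or None."""
    state, last_accept = start, None
    for i, ch in enumerate(text):
        if (state, ch) not in delta:
            break                       # no outbound transition: stop
        state = delta[(state, ch)]
        if state in accepting:
            last_accept = i + 1         # length of longest accepted prefix
    return text[:last_accept] if last_accept is not None else None
```

For example, with a DFA for register names of the form r followed by digits, `simulate(delta, {"s2"}, "s0", "r17,")` would return `"r17"`.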
Implementation strategies
• When the current state, s, has no outbound transition on the
current input character:
• If s is an accepting state, the scanner recognizes the word and
returns a lexeme and its syntactic category to the calling
procedure
• If s is a nonaccepting state, the scanner must determine whether
or not it passed through an accepting state on the way to s
• If the scanner did encounter an accepting state, it should roll
back its internal state and its input stream to that point and report
success
• If it did not, it should report failure
Table-Driven Scanners
• Uses a skeleton scanner for control and a set of
generated tables that encode language-specific
knowledge
• The compiler writer provides a set of lexical
patterns, specified as regular expressions
• The scanner generator then produces tables that
drive the skeleton scanner
Table-Driven Scanners
• A table-driven scanner for the RE that was our first attempt at an
RE for iloc register names
• Left side: the skeleton scanner
• Right side: the tables for the RE and the underlying DFA
Table-Driven Scanners

• The skeleton scanner divides into four sections:
• Initializations
• A scanning loop that models the DFA’s behavior
• A roll back loop, in case the DFA overshoots the end of the token
• A final section that interprets and reports the results
• The scanning loop repeats the two basic actions of a scanner:
• Read a character
• Simulate the DFA’s action
• It halts when the DFA enters the error state
• Two tables, CharCat and the transition table, encode all knowledge
about the DFA
• The roll back loop uses a stack of states to revert the scanner to its
most recent accepting state
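A compact sketch of the four-section skeleton, specialized here to the register-name RE r followed by digits; the table contents, state names, and the string-based interface are illustrative assumptions.

```python
# Sketch of a table-driven scanner: CharCat and DELTA encode all
# knowledge about the DFA; the skeleton code is language-independent.
CHARCAT = {c: "Digit" for c in "0123456789"}
CHARCAT["r"] = "Register"                     # any other character -> "Other"

DELTA = {                                     # compressed transition table
    ("s0", "Register"): "s1",
    ("s1", "Digit"):    "s2",
    ("s2", "Digit"):    "s2",
}
ACCEPTING = {"s2"}

def next_word(stream):
    """Return (lexeme, remaining input) for the word at the front."""
    state, lexeme, stack = "s0", "", ["bad"]  # init; "bad" marks stack bottom
    for ch in stream:                         # scanning loop: model the DFA
        cat = CHARCAT.get(ch, "Other")
        if (state, cat) not in DELTA:
            break                             # DFA enters the error state
        stack.append(state)
        state = DELTA[(state, cat)]
        lexeme += ch
    while state not in ACCEPTING and state != "bad":  # roll back loop
        state = stack.pop()
        lexeme = lexeme[:-1]
    if state in ACCEPTING:                    # interpret and report
        return lexeme, stream[len(lexeme):]
    return None, stream
```

Changing the language then amounts to supplying new CHARCAT, DELTA, and ACCEPTING tables while the skeleton stays fixed.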
Table-Driven Scanners

• The skeleton scanner uses the variable state to hold the current state of the simulated
DFA
• It updates state using a two-step, table-lookup process:
• It classifies char into one of a small set of categories using the CharCat
table. The scanner for register names has three categories: Register, Digit, or Other.
• It uses the current state and the character category as indices into the
transition table
• This two-step translation (character to category, then category and state to
new state) lets the scanner use a compressed transition table
• The tradeoff between direct access into a larger table and indirect access into the
compressed table is straightforward
Table-Driven Scanners
• A complete table would eliminate the mapping through CharCat, but
would increase the memory footprint of the table
• The uncompressed transition table grows as the product of the number
of states in the DFA and the number of characters in Σ; it can grow
to the point where it will not stay in cache
• With a small, compact character set, such as ASCII, CharCat can be
represented as a simple table lookup
• The relevant portions of CharCat should stay in the cache, so
table compression adds only one cache reference per input character
• As the character set grows (e.g., Unicode), more complex implementations of
CharCat may be needed
• The precise tradeoff between the per-character costs of
compressed and uncompressed tables depends on properties of
both the language and the computer that runs the scanner
Table-Driven Scanners
• To provide a character-by-character interface to the input stream,
the skeleton scanner uses:
• The NextChar macro, which sets its sole parameter to contain the
next character in the input stream
• The RollBack macro, which moves the input stream back by one
character
• If the scanner reads too far, state will not contain an accepting
state at the end of the first while loop
• In that case, the second while loop uses the state trace from the
stack to roll the state, lexeme, and input stream back to the most
recent accepting state
• In most languages, the scanner’s overshoot will be limited
• Pathological behavior, however, can cause the scanner to
examine individual characters many times, significantly
increasing the overall cost of scanning
• In most programming languages, the amount of roll back is small
relative to the word lengths
Avoiding Excess Roll Back
• Some regular expressions can produce a quadratic number of roll backs
in the scanner
• The problem arises from the requirement that the scanner return the longest
word that is a prefix of the input stream
• RE ab | (ab)* c: the corresponding DFA recognizes either ab or
any number of occurrences of ab followed by a final c
• On the input string “ababababc”, a scanner built from the DFA
will read all the characters and return the entire string as a single
word
• If, however, the input is “abababab”, it must scan all of the
characters before it can determine that the longest prefix is “ab”
• On the next invocation, it will scan “ababab” to return “ab”
• The third call will scan “abab” to return “ab”, and the final call will
simply return “ab” without any roll back
• In the worst case, the scanner can spend quadratic time reading the input stream
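The quadratic behavior can be demonstrated by counting character reads against a hand-built DFA for ab | (ab)* c; the state names are illustrative.

```python
# Hand-built DFA for the RE ab | (ab)* c (state names illustrative).
DELTA = {("s0", "a"): "s1", ("s1", "b"): "s2",   # first "ab"
         ("s0", "c"): "s5", ("s2", "c"): "s5",   # a final "c"
         ("s2", "a"): "s3", ("s3", "b"): "s4",   # further "ab"s
         ("s4", "a"): "s3", ("s4", "c"): "s5"}
ACCEPTING = {"s2", "s5"}

def scan_count(text):
    """Scan one word; return (longest accepted prefix, characters read)."""
    state, reads, last = "s0", 0, None
    for i, ch in enumerate(text):
        if (state, ch) not in DELTA:
            break
        reads += 1
        state = DELTA[(state, ch)]
        if state in ACCEPTING:
            last = i + 1
    return (text[:last] if last is not None else None), reads
```

On “abababab”, the first call reads all 8 characters but returns only “ab”; tokenizing the whole string this way reads 8 + 6 + 4 + 2 = 20 characters for an 8-character input.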
Avoiding Excess Roll Back

• Differs from the earlier scanner in three important ways:
• First, it has a global counter, InputPos, to record the position in the
input stream
• Second, it has a bit-array, Failed, to record dead-end
transitions as the scanner finds them
• Failed has a row for each state and a column for each
position in the input stream
• Third, it has an initialization routine that must be called before
NextWord() is invoked
• The routine sets InputPos to zero and sets Failed uniformly to false
Avoiding Excess Roll Back
• This scanner, called the maximal munch scanner, avoids the pathological behavior
by marking dead-end transitions as they are popped from the stack
• Thus, over time, it records specific (state, input position) pairs that cannot lead to
an accepting state
• Inside the scanning loop, the first while loop, the code tests each (state, input
position) pair and breaks out of the scanning loop whenever a failed transition is
attempted
• Optimizations can drastically reduce the space requirements of this scheme
• Most programming languages have simple enough microsyntax that this kind of
quadratic roll back cannot occur
• If building a scanner for a language that can exhibit this behavior, the scanner can
avoid it for a small additional overhead per character
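A sketch of the maximal munch idea described above, using the same ab | (ab)* c DFA as before; here Failed is represented as a set of (state, position) pairs rather than a bit-array, and all names are illustrative.

```python
# DFA for ab | (ab)* c, as in the roll back example.
DELTA = {("s0", "a"): "s1", ("s1", "b"): "s2",
         ("s0", "c"): "s5", ("s2", "c"): "s5",
         ("s2", "a"): "s3", ("s3", "b"): "s4",
         ("s4", "a"): "s3", ("s4", "c"): "s5"}
ACCEPTING = {"s2", "s5"}

def make_scanner(delta, accepting, start):
    # failed persists across calls; it is valid for one input stream,
    # so a fresh scanner must be made per stream (the init routine).
    failed = set()                       # dead-end (state, position) pairs

    def next_word(text, pos):
        state, stack, last, i = start, [], None, pos
        while i < len(text):
            if (state, i) in failed:     # known dead end: stop early
                break
            stack.append((state, i))     # record the trace for roll back
            if (state, text[i]) not in delta:
                break
            state = delta[(state, text[i])]
            i += 1
            if state in accepting:
                last, stack = i, []      # accepted here: clear the trace
        for pair in stack:               # mark dead ends as we roll back
            failed.add(pair)
        return (text[pos:last], last) if last is not None else (None, pos)

    return next_word
```

After the first call on “abababab” reads to the end and rolls back, later calls hit recorded dead ends after only a couple of characters, so total work stays close to linear.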
Generating the Transition and
Classifier Tables
• Given a DFA, the scanner generator can generate the tables in a
straightforward fashion
• The initial table has one column for every character in the input
alphabet and one row for each state in the DFA
• For each state, in order, the generator examines the outbound
transitions and fills the row with the appropriate states
• The generator can collapse identical columns into a single instance; as
it does so, it can construct the character classifier. (Two characters
belong in the same class if and only if they have identical columns
in the table.)
• If the DFA has been minimized, no two rows can be identical, so row
compression is not an issue
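The column-collapsing step can be sketched as follows; the representation (dicts keyed by state and character, with missing entries meaning the error state) is an assumption for illustration.

```python
# Sketch: build CharCat and a compressed transition table from an
# uncompressed table[state][char] -> next_state (missing = error).
def build_tables(states, alphabet, table):
    """Return (charcat, delta) with identical columns collapsed."""
    charcat, delta, classes = {}, {}, {}
    for ch in alphabet:
        column = tuple(table[s].get(ch) for s in states)
        if column not in classes:        # new column shape -> new class
            cls = classes[column] = len(classes)
            for s, nxt in zip(states, column):
                if nxt is not None:
                    delta[(s, cls)] = nxt
        charcat[ch] = classes[column]
    return charcat, delta
```

For the register-name DFA, all ten digits have identical columns, so they collapse into a single character class.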
Changing Languages
• To model another DFA, the compiler writer can simply supply new tables
Direct-Coded Scanners
• To improve the performance of a table-driven scanner, we must reduce the cost of one or
both of its basic actions:
• Read a character
• Compute the next DFA transition

• Direct-coded scanners reduce the cost of computing DFA transitions by replacing the
explicit representation of the DFA’s state and transition graph with an implicit one
• The implicit representation simplifies the two-step, table-lookup computation,
eliminates the memory references entailed in that computation, and allows other
specializations
• The resulting scanner has the same functionality as the table-driven scanner, but with a
lower overhead per character
• A direct-coded scanner is no harder to generate than the equivalent table-driven
scanner
• The table-driven scanner spends most of its time inside the central while loop; thus, the
heart of a direct-coded scanner is an alternate implementation of that while loop
• With some detail abstracted, that loop performs the same two basic actions, read a
character and simulate the DFA transition, with the transitions encoded directly in the code
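A direct-coded sketch for the register-name RE r followed by digits: each DFA state becomes a fragment of code, so the inner loop makes no table references. Names and the string-based interface are illustrative.

```python
# Direct-coded scanner for r[0-9]+: the DFA's states and transitions
# are encoded in the control flow rather than in lookup tables.
def next_word(stream, pos):
    """Return (lexeme, new position), or (None, pos) on failure."""
    start = pos
    # state s0: the only legal transition is on 'r'
    if pos < len(stream) and stream[pos] == "r":
        pos += 1
    else:
        return None, start
    # state s1: require at least one digit
    if pos < len(stream) and stream[pos].isdigit():
        pos += 1
    else:
        return None, start
    # state s2 (accepting): consume any further digits
    while pos < len(stream) and stream[pos].isdigit():
        pos += 1
    return stream[start:pos], pos
```

Because the character-category test becomes an ordinary comparison and the transition becomes a branch, both memory references of the two-step lookup disappear.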
