Lexical Analysis
[Figure: phases of a compiler. Lexical analyzer, syntax analyzer, semantic analyzer, code optimizer, and code generator (back end), alongside the symbol table manager and the error handler.]
Outline
• Role of lexical analyzer
• Specification of tokens
• Recognition of tokens
• Lexical analyzer generator
• Finite automata
• Design of lexical analyzer generator
The Role of the Lexical Analyzer
(Interaction of Lexical analyzer with parser)
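[Figure: the lexical analyzer reads the source program and returns a token to the parser on each getNextToken call; the parser's output goes on to semantic analysis. Both the lexical analyzer and the parser report errors and use the symbol table.]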
Lexical Analyzer
• Functions (tasks)
– Grouping input characters into tokens
– Stripping out comments and white space
– Keeping track of the number of newline characters seen
– Correlating error messages with the source program
– Handling include files and macros
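Two of the tasks listed above, stripping comments and white space and counting newlines, can be sketched together in C. This is only an illustration: the name skip_blanks is hypothetical, and the sketch assumes C-style block comments and a NUL-terminated source string.

```c
#include <stddef.h>

/* Hypothetical helper (not from the slides): advance past white space and
 * C-style block comments starting at src[pos], incrementing *line for
 * every newline skipped. Returns the index of the first character that
 * could start a token. */
size_t skip_blanks(const char *src, size_t pos, int *line)
{
    for (;;) {
        char c = src[pos];
        if (c == '\n') {
            (*line)++;              /* keep the line count for error messages */
            pos++;
        } else if (c == ' ' || c == '\t' || c == '\r') {
            pos++;
        } else if (c == '/' && src[pos + 1] == '*') {
            pos += 2;               /* consume the opening delimiter */
            /* Scan to the closing delimiter or to the end of the string. */
            while (src[pos] != '\0' &&
                   !(src[pos] == '*' && src[pos + 1] == '/')) {
                if (src[pos] == '\n')
                    (*line)++;
                pos++;
            }
            if (src[pos] != '\0')
                pos += 2;           /* consume the closing delimiter */
        } else {
            return pos;
        }
    }
}
```

Because the line counter is updated while skipping, a later error message can be tied back to the correct source line.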
Compiler Construction
The Reason for Using the Lexical Analyzer
• Simplifies the design of the compiler
– A parser that had to deal with comments and white space as
syntactic units would be more complex.
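– If lexical analysis were not separated from parsing, the parser would have to work on individual characters, and LL(1) or LR(1) parsing with one token of lookahead would not be possible, because recognizing a single token can require examining several characters.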
• Compiler efficiency is improved
– Systematic techniques to implement lexical analyzers by hand or
automatically from specifications
– Stream buffering methods to scan input
• Compiler portability is enhanced
– Input-device-specific peculiarities can be restricted to the lexical
analyzer.
Why Separate Lexical Analysis and Parsing
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability (e.g. Linux to Windows)
Lexical Analyzer
• A lexical analyzer is sometimes divided into a cascade of two processes.
– Scanning
• Consists of the simple processes that do not require tokenization of the input.
– Deletion of comments.
– Compaction of consecutive whitespace characters into one.
– Lexical analysis proper
• The more complex portion, which produces the sequence of tokens as output from the scanned input.
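As a rough sketch of the scanning stage, the function below compacts each run of whitespace into a single blank. The name compact_whitespace and the caller-supplied output buffer are assumptions made for this example; a real scanner would also record line numbers before the newlines are collapsed.

```c
#include <stddef.h>

/* Hypothetical scanning pass: copy src to dst, replacing every run of
 * blanks, tabs, and newlines with one space. dst must have room for
 * strlen(src) + 1 bytes. Returns the length of the compacted text. */
size_t compact_whitespace(const char *src, char *dst)
{
    size_t n = 0;
    int in_run = 0;                 /* are we inside a whitespace run? */

    for (; *src != '\0'; src++) {
        if (*src == ' ' || *src == '\t' || *src == '\n') {
            if (!in_run)
                dst[n++] = ' ';     /* emit one blank per run */
            in_run = 1;
        } else {
            dst[n++] = *src;
            in_run = 0;
        }
    }
    dst[n] = '\0';
    return n;
}
```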
Lexical Analysis
• What do we want to do? Example:
    if (i == j)
        z = 0;
    else
        z = 1;
• The input is just a string of characters:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
• Goal: Partition input string into substrings
– Where the substrings are tokens
What’s a Token?
• A syntactic category
– In English:
• noun, verb, adjective, …
– In a programming language:
• Identifier, Integer, Keyword, Whitespace, …
Tokens
• Tokens correspond to sets of strings.
– Identifier: strings of letters or digits, starting with
a letter
– Integer: a non-empty string of digits
– Keyword: “else” or “if” or “begin” or …
– Whitespace: a non-empty sequence of blanks,
newlines, and tabs
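The four token classes above can be recognized with a small hand-written routine. The sketch below is one possible shape, not a definitive implementation: the names TokenKind, Token, and next_token are hypothetical, the keyword list contains only the three keywords named on the slide, and any other character (such as "(" or "=") is lumped into a catch-all class because the slide does not define tokens for it.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    TOK_IDENTIFIER, TOK_INTEGER, TOK_KEYWORD,
    TOK_WHITESPACE, TOK_OTHER, TOK_EOF
} TokenKind;

typedef struct {
    TokenKind kind;
    size_t start;    /* index of the first character of the lexeme */
    size_t length;   /* number of characters in the lexeme */
} Token;

/* Keywords named on the slide; a real language would have more. */
static int is_keyword(const char *s, size_t len)
{
    static const char *const keywords[] = { "if", "else", "begin" };
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strlen(keywords[i]) == len && strncmp(s, keywords[i], len) == 0)
            return 1;
    return 0;
}

/* Return the token that starts at src[pos]. */
Token next_token(const char *src, size_t pos)
{
    Token t = { TOK_EOF, pos, 0 };
    size_t end = pos;
    unsigned char c = (unsigned char)src[pos];

    if (c == '\0')
        return t;
    if (isalpha(c)) {                       /* letters or digits, starting with a letter */
        while (isalnum((unsigned char)src[end]))
            end++;
        t.kind = is_keyword(src + pos, end - pos) ? TOK_KEYWORD : TOK_IDENTIFIER;
    } else if (isdigit(c)) {                /* non-empty string of digits */
        while (isdigit((unsigned char)src[end]))
            end++;
        t.kind = TOK_INTEGER;
    } else if (c == ' ' || c == '\t' || c == '\n') {
        while (src[end] == ' ' || src[end] == '\t' || src[end] == '\n')
            end++;
        t.kind = TOK_WHITESPACE;
    } else {                                /* anything else: one-character catch-all */
        end = pos + 1;
        t.kind = TOK_OTHER;
    }
    t.length = end - pos;
    return t;
}
```

Calling next_token repeatedly, each time starting at start + length of the previous token, partitions the whole input string into tokens.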
• Two issues in lexical analysis.
– How to specify tokens (patterns)?
– How to recognize the tokens given a token specification (how to
implement the nexttoken() routine)?
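Sentinels
[Figure: a buffer pair holding the input "E = M * C * 2". Each buffer half ends with an eof sentinel, and a further eof marks the end of the input. The lexemeBegin and forward pointers delimit the current lexeme.]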
Lookahead Code with Sentinels
switch (*forward++) {
    case eof:
        if (forward is at end of first buffer) {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if (forward is at end of second buffer) {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else
            /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
    cases for the other characters;
}
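The pseudocode above can be turned into a small runnable sketch. Several things here are assumptions made for illustration: the names Scanner, scanner_init, and next_char are hypothetical, the input comes from an in-memory string rather than a file, the NUL character plays the role of eof (so the source text must not contain NUL), and the buffer size N is kept tiny so that reloads actually occur. Only the forward pointer is modeled; lexemeBegin is left out.

```c
#include <stddef.h>
#include <string.h>

#define N 4                 /* characters per buffer half (tiny on purpose) */
#define SENTINEL '\0'       /* assumption: NUL never occurs in the source */

typedef struct {
    const char *src;        /* source text not yet loaded */
    char buf[2 * (N + 1)];  /* two halves, each followed by a sentinel slot */
    char *forward;          /* next character to read */
} Scanner;

/* Load up to N characters into one half and terminate it with a sentinel.
 * A full half has its sentinel in slot N; at end of input it comes earlier. */
static void reload(Scanner *s, char *half)
{
    size_t n = strlen(s->src);
    if (n > N)
        n = N;
    memcpy(half, s->src, n);
    half[n] = SENTINEL;
    s->src += n;
}

void scanner_init(Scanner *s, const char *text)
{
    s->src = text;
    reload(s, s->buf);
    s->forward = s->buf;
}

/* Return the next input character, or -1 at the end of the input.
 * The common case costs one comparison: "is this the sentinel?" */
int next_char(Scanner *s)
{
    for (;;) {
        char c = *s->forward++;
        if (c != SENTINEL)
            return (unsigned char)c;
        if (s->forward == s->buf + N + 1) {
            /* end of first half: forward already points at the second half */
            reload(s, s->buf + N + 1);
        } else if (s->forward == s->buf + 2 * (N + 1)) {
            /* end of second half: wrap around to the first half */
            reload(s, s->buf);
            s->forward = s->buf;
        } else {
            /* sentinel inside a half marks the end of the input */
            s->forward--;   /* stay put so later calls also report the end */
            return -1;
        }
    }
}
```

The point of the sentinel is visible in next_char: the buffer-boundary checks run only after a sentinel has been read, not on every character.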
Specification of tokens
• In compiler theory, regular expressions are used to formalize the specification of tokens
• Regular expressions are a means of specifying regular languages
• Example:
– letter_ (letter_ | digit)*
• Each regular expression is a pattern specifying the form of strings
Resolving Ambiguity
Operator         Precedence   Associativity
*                highest      left
concatenation    second       left
|                lowest       left
Algebraic Laws for Regular Expressions
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn
• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
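The regular definition for id translates almost line by line into code. The following is a minimal sketch; the function names are hypothetical, and the character range comparisons assume an ASCII-compatible character set.

```c
#include <stddef.h>

/* letter_ -> A | B | ... | Z | a | b | ... | z | _   (assumes ASCII) */
static int is_letter_(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
}

/* digit -> 0 | 1 | ... | 9 */
static int is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* id -> letter_ (letter_ | digit)*
 * Returns the length of the longest prefix of s that matches id,
 * or 0 if s does not start with an identifier. */
size_t match_id(const char *s)
{
    size_t n;

    if (!is_letter_(s[0]))
        return 0;
    n = 1;
    while (is_letter_(s[n]) || is_digit(s[n]))   /* the starred part */
        n++;
    return n;
}
```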
Extensions
• One or more instances: r+
• Zero or one instance: r?
• Character classes: [abc]
• Example:
– letter_ -> [A-Za-z_]
– digit -> [0-9]
– id -> letter_ (letter_ | digit)*
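These extensions can be tried out with the POSIX regular expression functions regcomp, regexec, and regfree from <regex.h>. This is a sketch under assumptions: it needs a POSIX system, the helper name matches is hypothetical, and the signed-integer pattern used in the usage note is an illustration that does not appear in the slides. POSIX extended syntax is close to, but not identical with, the notation Lex accepts.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if the whole pattern matches s, else 0.
 * The pattern uses POSIX extended regular expression syntax. */
int matches(const char *pattern, const char *s)
{
    regex_t re;
    int ok;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                       /* treat a bad pattern as "no match" */
    ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

For example, matches("^[A-Za-z_][A-Za-z_0-9]*$", s) tests the id definition written with character classes, and matches("^[+-]?[0-9]+$", s) combines r? and r+ to accept an optionally signed integer.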
Lex Regular Expressions
Expression   Matches                              Example
\c           Character c literally                \*
"s"          String s literally                   "**"
.            Any character but newline            a.*b
^            Beginning of a line                  ^a
$            End of a line                        a$
[^s]         Any one character not in string s    [^a]
Lexeme   Token name   Attribute value
<        relop        LT
<=       relop        LE
=        relop        EQ
<>       relop        NE
>        relop        GT
>=       relop        GE
Transition Graph for FA
[Figure: legend for transition graphs. A circle is a state, a labeled arrow is a transition, and a double circle is a final state.]
Transition Diagrams
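Transition diagram for relop (<, <=, <>, >, >=, =), starting in state 0:
State   Input   Next state   Action
0       <       1
1       =       2            return(relop, LE)
1       >       3            return(relop, NE)
1       other   4 *          return(relop, LT)
0       =       5            return(relop, EQ)
0       >       6
6       =       7            return(relop, GE)
6       other   8 *          return(relop, GT)
* The forward pointer is retracted one position before returning.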
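The relop transition diagram can be simulated with one switch case per state. The code below is a sketch of that idea rather than a definitive implementation: the names RelopAttr and get_relop are hypothetical, and the value NONE is an addition that signals "no relop here" because the diagram itself has no failure state.

```c
#include <stddef.h>

typedef enum { LT, LE, EQ, NE, GT, GE, NONE } RelopAttr;

/* Simulate the relop transition diagram on the text at *p.
 * On success, advance *p past the lexeme and return its attribute.
 * If no relop starts at *p, leave *p unchanged and return NONE. */
RelopAttr get_relop(const char **p)
{
    const char *forward = *p;
    int state = 0;
    char c;

    for (;;) {
        switch (state) {
        case 0:
            c = *forward++;
            if (c == '<')      state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else return NONE;                   /* not a relational operator */
            break;
        case 1:
            c = *forward++;
            if (c == '=')      state = 2;
            else if (c == '>') state = 3;
            else               state = 4;       /* "other" */
            break;
        case 2: *p = forward;     return LE;
        case 3: *p = forward;     return NE;
        case 4: *p = forward - 1; return LT;    /* starred: retract one character */
        case 5: *p = forward;     return EQ;
        case 6:
            c = *forward++;
            state = (c == '=') ? 7 : 8;
            break;
        case 7: *p = forward;     return GE;
        case 8: *p = forward - 1; return GT;    /* starred: retract one character */
        }
    }
}
```

The two starred states read one character too many in order to decide that the operator has ended, which is why they give that character back before returning.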
Transition Diagrams
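[Figure: transition diagram for identifiers, with a loop labeled "letter or digit".]
[Figure: the C compiler translates lex.yy.c into a.out.]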