
Regular Expressions and Lexical Analysis

Chapter 3, Unit 2
Review: Compiler Phases
Source program → lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator (the front end) → code optimizer → code generator (the back end). The symbol-table manager and the error handler interact with all phases.
Outline
• Role of lexical analyzer
• Specification of tokens
• Recognition of tokens
• Lexical analyzer generator
• Finite automata
• Design of lexical analyzer generator
The Role of the Lexical Analyzer
(Interaction of the lexical analyzer with the parser)

The lexical analyzer reads the source program and, each time the parser calls getNextToken, returns the next token; the parser passes tokens on to semantic analysis. Both the lexical analyzer and the parser report errors and consult the symbol table.
Lexical Analyzer

• Functions (tasks)
  – Grouping input characters into tokens
  – Stripping out comments and whitespace
  – Keeping track of the number of newline characters seen
  – Correlating error messages with the source program
  – Handling include files and macros
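Newline tracking exists to support error correlation; below is a minimal sketch (advance() and the sample input are illustrative, not from the slides):

    #include <stdio.h>

    /* Count '\n' characters as input is consumed so that error
       messages can be correlated with source line numbers. */
    static int lineno = 1;

    static int advance(const char **p)
    {
        int c = *(*p)++;
        if (c == '\n') lineno++;          /* one more newline seen */
        return c;
    }

    int main(void)
    {
        const char *src = "a\nbb\n@";     /* '@' sits on line 3 */
        while (*src)
            if (advance(&src) == '@')
                fprintf(stderr, "line %d: unexpected character '@'\n", lineno);
        return 0;
    }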
The Reason for Using the Lexical Analyzer
• Simplifies the design of the compiler
  – A parser that had to deal with comments and whitespace as syntactic units would be more complex.
  – If lexical analysis were not separated from parsing, then LL(1) or LR(1) parsing with one token of lookahead would not be possible (multiple characters/tokens to match).
• Compiler efficiency is improved
  – Systematic techniques to implement lexical analyzers by hand or automatically from specifications
  – Stream buffering methods to scan the input
• Compiler portability is enhanced
  – Input-device-specific peculiarities can be restricted to the lexical analyzer.
Why Separate Lexical Analysis and Parsing?
1. Simplicity of design
2. Improved compiler efficiency
3. Enhanced compiler portability (e.g. Linux to Windows)
Lexical Analyzer
• The lexical analyzer is divided into a cascade of two processes:
  – Scanning
    • The simple processes that do not require tokenization of the input:
      – deletion of comments;
      – compaction of consecutive whitespace characters into one.
  – Lexical analysis
    • The more complex part, which consumes the scanner's output and produces the sequence of tokens.
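A small sketch of the scanning pass just described (the function name and sample input are illustrative, not from the slides): it deletes // comments and compacts each run of whitespace into a single blank.

    #include <stdio.h>

    /* Scanning pass (a sketch): delete '//' comments and compact
       consecutive whitespace characters into one blank. */
    static void scan(const char *in, char *out)
    {
        int in_ws = 0;
        while (*in) {
            if (in[0] == '/' && in[1] == '/') {        /* delete comment */
                while (*in && *in != '\n') in++;
                continue;
            }
            if (*in == ' ' || *in == '\t' || *in == '\n') {
                if (!in_ws) *out++ = ' ';              /* compact the run */
                in_ws = 1; in++;
            } else {
                *out++ = *in++; in_ws = 0;
            }
        }
        *out = '\0';
    }

    int main(void)
    {
        char buf[128];
        scan("x  =\t1; // set x\ny = 2;", buf);
        printf("%s\n", buf);     /* -> "x = 1; y = 2;" */
        return 0;
    }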
Lexical Analysis
• What do we want to do? Example:
    if (i == j)
        z = 0;
    else
        z = 1;
• The input is just a string of characters:
    \t if (i == j) \n \t \t z = 0; \n \t else \n \t \t z = 1;
• Goal: partition the input string into substrings, where the substrings are tokens.
What’s a Token?
• A syntactic category
  – In English: noun, verb, adjective, …
  – In a programming language: identifier, integer, keyword, whitespace, …
Tokens
• Tokens correspond to sets of strings.
  – Identifier: strings of letters or digits, starting with a letter
  – Integer: a non-empty string of digits
  – Keyword: “else” or “if” or “begin” or …
  – Whitespace: a non-empty sequence of blanks, newlines, and tabs
• Two issues in lexical analysis:
  – How to specify tokens (patterns)?
  – How to recognize the tokens given a token specification (how to implement the nexttoken() routine)?
• How to specify tokens:
  – All the basic elements in a language must be tokens so that they can be recognized.

    #include <stdio.h>

    int main() {
        int i, j;
        for (i = 0; i < 50; i++) {
            printf("i = %d", i);
        }
    }

• Token types: constant, identifier, reserved word, operator and misc. symbol.
  – Tokens are specified by regular expressions.
Tokens, Patterns and Lexemes
• A token is a pair of a token name and an optional attribute value.
  • Example: num, id
• A pattern is a description of the form that the lexemes of a token may take.
  • Example: “non-empty sequence of digits”, “letter followed by letters and digits”
  • identifier: ([a-zA-Z_]) ([a-zA-Z_]|[0-9])*
• A lexeme is a sequence of characters in the source program that matches the pattern for a token.
  • Example: 123, abc
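As an illustration (a hand-written check, assuming the identifier pattern above; is_identifier is not from the slides), a C function that tests whether a whole string is a lexeme of that pattern:

    #include <ctype.h>
    #include <stdio.h>

    /* Does s match ([a-zA-Z_])([a-zA-Z_]|[0-9])* in full? (a sketch) */
    static int is_identifier(const char *s)
    {
        if (!(isalpha((unsigned char)*s) || *s == '_'))
            return 0;                      /* must start with a letter or _ */
        for (s++; *s; s++)
            if (!(isalnum((unsigned char)*s) || *s == '_'))
                return 0;                  /* then letters, digits, or _ */
        return 1;
    }

    int main(void)
    {
        printf("%d %d %d\n",
               is_identifier("abc"),       /* 1: a valid lexeme */
               is_identifier("_x1"),       /* 1: a valid lexeme */
               is_identifier("123"));      /* 0: starts with a digit */
        return 0;
    }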
Examples: Tokens, Patterns, and Lexemes

Token        Pattern                                  Sample Lexemes
if           the characters i, f                      if
else         the characters e, l, s, e                else
comparison   < or > or <= or >= or == or !=           <=, !=
id           letter followed by letters and digits    pi, score, D2
number       any numeric constant                     3.14, 0, 6.23
literal      anything but ", surrounded by "s         "core dump"
An Example
• E = M * C ** 2
• The lexical analyzer produces the sequence of pairs:
  <id, pointer to symbol-table entry for E>
  <assign_op>
  <id, pointer to symbol-table entry for M>
  <mult_op>
  <id, pointer to symbol-table entry for C>
  <exp_op>
  <number, integer value 2>
Lexical errors
• Some errors are beyond the power of the lexical analyzer to recognize:
  – fi (a == f(x)) …
    (fi could equally be a misspelled if or a valid function identifier; only a later phase can tell.)
• However, it may be able to recognize errors like:
  – d = 2r
• Such errors are recognized when no pattern for tokens matches a character sequence.
Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed token or a delimiter.
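A minimal sketch of panic-mode recovery (panic_skip and its delimiter set are illustrative, not from the slides):

    #include <stdio.h>
    #include <string.h>

    /* Panic mode: discard characters until one that could start a new
       token or is a delimiter; scanning resumes from there. */
    static const char *panic_skip(const char *p)
    {
        while (*p && strchr(" \t\n;,(){}", *p) == NULL)
            p++;                          /* ignore successive characters */
        return p;
    }

    int main(void)
    {
        const char *resume = panic_skip("@#$ x = 1;");
        printf("resumed at: \"%s\"\n", resume);   /* -> " x = 1;" */
        return 0;
    }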
Corrective Actions
• Delete one character from the remaining input
• Insert a missing character into the remaining
input
• Replace a character by another character
• Transpose two adjacent characters
Input buffering
• Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to return
  – In C: we need to look past -, = or < to decide which token to return
  – In Fortran: DO 5 I = 1.25
    (not until the . is seen can this be distinguished from the loop header DO 5 I = 1,25)
• We need to introduce a two-buffer scheme to handle large lookaheads safely

Input Buffering
The buffer pair holds the input (here E = M * C * 2) with an eof sentinel at the end of each half: lexemeBegin marks the start of the current lexeme, and forward scans ahead until a pattern match is found. The sentinel eof also marks the true end of input, so every advance of forward needs only one test.
Sentinels
Lookahead Code with Sentinels
switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else
        /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters;
}
Specification of tokens
• In the theory of compilation, regular expressions are used to formalize the specification of tokens
• Regular expressions are a means for specifying regular languages
• Example: letter_ (letter_ | digit)*
• Each regular expression is a pattern specifying the form of strings
Ambiguity Resolving
• Find the longest matching token
• Between two tokens with the same length, use the one declared first

How to Implement Ambiguity Resolving
• First find the longest matching token; only when two matches have the same length does declaration order break the tie
Pathological Example
if                                   { return IF; }
[a-z][a-z0-9]*                       { return ID; }
[0-9]+                               { return NUM; }
[0-9]+"."[0-9]*|[0-9]*"."[0-9]+      { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)           { ; }
.                                    { error(); }
The Lexical Analysis Problem
• Given
  – a set of token descriptions
    • token name
    • regular expression
  – an input string
• Partition the string into tokens (class, value)
• Ambiguity resolution
  – prefer the longest matching token
  – between two equal-length tokens, select the one declared first
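A compact sketch of this resolution policy in C (the rule table and recognizers are illustrative, not from the slides): each recognizer reports how many characters it can match at the current position; the scanner keeps the longest match and, on a tie, the rule declared first.

    #include <ctype.h>
    #include <stdio.h>

    static int match_if(const char *s)            /* the keyword "if" */
    {
        return (s[0] == 'i' && s[1] == 'f') ? 2 : 0;
    }
    static int match_id(const char *s)            /* letter (letter|digit)* */
    {
        int n = 0;
        if (!isalpha((unsigned char)s[0])) return 0;
        while (isalnum((unsigned char)s[n])) n++;
        return n;
    }
    static int match_num(const char *s)           /* digit+ */
    {
        int n = 0;
        while (isdigit((unsigned char)s[n])) n++;
        return n;
    }

    /* Rule table in declaration order: earlier entries win length ties. */
    static struct { const char *name; int (*match)(const char *); } rules[] = {
        { "if",  match_if  },
        { "id",  match_id  },
        { "num", match_num },
    };

    int main(void)
    {
        const char *p = "if if8 42";
        while (*p) {
            if (*p == ' ') { p++; continue; }      /* skip whitespace */
            int best_len = 0, best = -1;
            for (int i = 0; i < 3; i++) {          /* keep the longest match; */
                int len = rules[i].match(p);       /* ties go to the first rule */
                if (len > best_len) { best_len = len; best = i; }
            }
            if (best < 0) { printf("error at '%c'\n", *p); p++; continue; }
            printf("%s \"%.*s\"\n", rules[best].name, best_len, p);
            p += best_len;
        }
        return 0;
    }

On the input if if8 42 this prints if "if" (a length tie won by the keyword rule), id "if8" (the longest match beats the keyword), and num "42".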
Strings and Languages
• An alphabet is a finite set of symbols; a string over an alphabet is a finite sequence of symbols drawn from it; a language is any set of strings over an alphabet.
String Operations
• Concatenation: xy is x followed by y; exponentiation: s^0 = ε and s^i = s^(i-1) s.
Language Operations
• Union L ∪ M, concatenation LM, Kleene closure L*, and positive closure L+.
Regular Expressions
• Regular Expressions
– A convenient means of specifying certain simple sets
of strings.
– We use regular expressions to define structures of
tokens.
– Tokens are built from symbols of a finite vocabulary.
• Regular Sets
– The sets of strings defined by regular expressions.
Regular Expressions
Operator Precedence

Operator        Precedence   Associativity
*               highest      left
concatenation   second       left
|               lowest       left
Algebraic Laws for Regular Expressions
• r | s = s | r                    (| is commutative)
• r | (s | t) = (r | s) | t        (| is associative)
• r (s t) = (r s) t                (concatenation is associative)
• r (s | t) = r s | r t            (concatenation distributes over |)
• ε r = r ε = r                    (ε is the identity for concatenation)
• r* = (r | ε)*                    (ε is guaranteed in a closure)
• r** = r*                         (* is idempotent)
Regular definitions
d1 → r1
d2 → r2
…
dn → rn

• Example:
  letter_ → A | B | … | Z | a | b | … | z | _
  digit → 0 | 1 | … | 9
  id → letter_ (letter_ | digit)*
Extensions
• One or more instances: (r)+
• Zero or one instance: r?
• Character classes: [abc]

• Example:
  – letter_ → [A-Za-z_]
  – digit → [0-9]
  – id → letter_ (letter_ | digit)*
Lex Regular Expressions

Expression   Matches                               Example
\c           character c literally                 \*
"s"          string s literally                    "**"
.            any character but newline             a.*b
^            beginning of a line                   ^a
$            end of a line                         a$
[^s]         any one character not in string s     [^a]
r*           zero or more strings matching r       a*
r+           one or more strings matching r        a+
r?           zero or one r                         a?
r{m,n}       between m and n occurrences of r      a{1,2}
r1r2         an r1 followed by an r2               ab
r1|r2        an r1 or an r2                        a|b
(r)          same as r                             (a|b)
r1/r2        r1 when followed by r2                i/am
Regular Definitions
• If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
  d1 → r1
  d2 → r2
  …
  dn → rn
  – Each di is a new symbol, not in Σ and not the same as any other of the d's.
  – Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}.
• Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions.
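For instance, substituting the letter_ and digit definitions from the earlier example into id yields a single self-contained expression:

  id → letter_ (letter_ | digit)*
     = [A-Za-z_] ([A-Za-z_] | [0-9])*    (after substitution)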
Extensions of Regular Definitions
• One or more instances
  – r+ = rr* = r*r
  – r* = r+ | ε
• Zero or one instance
  – r? = r | ε
• Character classes
  – [a-z] = a|b|c|…|z
  – [A-Za-z] = A|B|…|Z|a|…|z
• Example
  – digit → [0-9]
  – num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Write character classes for the following sets of characters:
1. The first ten letters (up to "j") in either upper or lower case.
2. The lowercase consonants.
3. The "digits" in a hexadecimal number (choose either upper or lower case for the "digits" above 9).
4. The characters that can appear at the end of a legitimate English sentence (e.g., an exclamation point).
Write Regular Expressions for
1. Arithmetic expression
2. Relational expression
I. Most languages are case sensitive, so keywords can be written only one way, and the regular expressions describing their lexemes are very simple. However, some languages, like SQL, are case insensitive, so a keyword can be written either in lowercase or in uppercase, or in any mixture of cases. Thus, the SQL keyword SELECT can also be written select, Select, or sElEcT, for instance. Show how to write a regular expression for a keyword in a case-insensitive language. Illustrate the idea by writing the expression for "select" in SQL.
Answer
• select → [Ss][Ee][Ll][Ee][Cc][Tt]
• or equivalently
• select → (S|s)(E|e)(L|l)(E|e)(C|c)(T|t)
Regular Definitions and Grammars

Context-Free Grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | num

Regular Definitions
digit  → [0-9]
letter → [A-Za-z]
if     → if
then   → then
else   → else
relop  → < | <= | <> | > | >= | =
id     → letter ( letter | digit )*
num    → digit+ (. digit+)? ( E (+ | -)? digit+ )?
ws     → ( blank | tab | newline )+
Recognition of tokens
• The starting point is the language grammar, to understand the tokens:
  stmt → if expr then stmt
       | if expr then stmt else stmt
  expr → term relop term
       | term
  term → id
       | number
Recognition of tokens (cont.)
• The next step is to formalize the patterns:
  digit  → [0-9]
  digits → digit+
  number → digits (. digits)? (E [+-]? digits)?
  letter → [A-Za-z_]
  id     → letter (letter | digit)*
  if     → if
  then   → then
  else   → else
  relop  → < | > | <= | >= | = | <>
• We also need to handle whitespace:
  ws → (blank | tab | newline)+
Lexemes      Token Name   Attribute Value
any ws       –            –
if           if           –
then         then         –
else         else         –
any id       id           pointer to table entry
any number   number       pointer to table entry
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
>            relop        GT
>=           relop        GE
Transition Graph for FA
• A circle denotes a state.
• A labeled arrow denotes a transition.
• An arrow marked start points to the start state.
• A double circle denotes a final (accepting) state.
Transition Diagrams
relop → < | <= | <> | > | >= | =

State 0 (start): on <, go to state 1; on =, go to state 5; on >, go to state 6.
State 1: on =, go to state 2 and return (relop, LE); on >, go to state 3 and return (relop, NE); on any other character, go to state 4* and return (relop, LT).
State 5: return (relop, EQ).
State 6: on =, go to state 7 and return (relop, GE); on any other character, go to state 8* and return (relop, GT).
(States marked * have read one character beyond the lexeme and must retract the forward pointer.)
Transition Diagrams
id → letter ( letter | digit )*

State 9 (start): on a letter, go to state 10.
State 10: on a letter or digit, stay in state 10; on any other character, go to state 11*.
State 11*: retract one character and return (getToken(), installID()).
Transition diagrams (cont.)
• Transition diagram for reserved words and
identifiers
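A hedged C rendering of the id diagram's states (9 through 11): nextChar, retract and installID are the helper names from the slides, given toy bodies here so the sketch runs; pre-installing reserved words in the symbol table is the standard way the same diagram serves both keywords and identifiers.

    #include <ctype.h>
    #include <stdio.h>

    /* A sketch of states 9-11 over an in-memory buffer. In a real
       lexer, installID() enters the lexeme in the symbol table, and
       reserved words are pre-installed so that the lookup returns the
       keyword token instead of id. */
    static const char *buf = "count1 + 42";
    static int pos = 0;

    static int nextChar(void) { return buf[pos++]; }
    static void retract(void) { pos--; }

    static int recognizeId(void)
    {
        int start = pos;
        if (!isalpha(nextChar())) {        /* state 9: need a letter */
            retract();
            return 0;                      /* fail: not an identifier */
        }
        while (isalnum(nextChar()))        /* state 10: letter | digit */
            ;
        retract();                         /* state 11*: one char too many */
        printf("installID(\"%.*s\")\n", pos - start, buf + start);
        return 1;
    }

    int main(void)
    {
        recognizeId();                     /* -> installID("count1") */
        return 0;
    }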
Transition diagrams (cont.)
• Transition diagram for unsigned numbers
Transition diagrams (cont.)
• Transition diagram for whitespace
Transition diagrams (cont.)
• Transition diagram for relop
Architecture of a transition-diagram-based
lexical analyzer
Implementation of relop
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) {  /* repeat character processing until a
                    return or failure occurs */
        switch (state) {
        case 0: c = nextChar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail();  /* lexeme is not a relop */
            break;
        case 1: ...
        ...
        case 8: retract();
            retToken.attribute = GT;
            return(retToken);
        }
    }
}
Lexical Analyzer Generator - Lex

lex.l (Lex source program)  →  Lex compiler  →  lex.yy.c
lex.yy.c                    →  C compiler    →  a.out
input stream                →  a.out         →  sequence of tokens
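A sketch of how the generated scanner can be driven from C (this driver is illustrative and assumes the Lex specification's actions return nonzero token codes; yylex() and yytext are the standard entry points exposed by lex.yy.c):

    #include <stdio.h>

    extern int  yylex(void);   /* the generated scanner, from lex.yy.c */
    extern char *yytext;       /* text of the current lexeme */

    int main(void)
    {
        int tok;
        /* yylex() returns 0 at end of input; each nonzero value is a
           token code returned by some action in the Lex rules. */
        while ((tok = yylex()) != 0)
            printf("token %d, lexeme \"%s\"\n", tok, yytext);
        return 0;
    }

Built with, e.g., lex lex.l && cc lex.yy.c driver.c -ll, this prints one line per recognized token.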
