
Regular Expressions and Lexical Analysis

Chapter 3, Unit 2
Review: Compiler Phases
Source program → lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator (the front end) → code optimizer → code generator (the back end). The symbol-table manager and the error handler interact with all phases.
Outline
• Role of lexical analyzer
• Specification of tokens
• Recognition of tokens
• Lexical analyzer generator
• Finite automata
• Design of lexical analyzer generator
The Role of the Lexical Analyzer
(Interaction of the lexical analyzer with the parser)

The lexical analyzer reads the source program and, each time the parser calls getNextToken, returns the next token; the parser passes tokens on to semantic analysis. Both the lexical analyzer and the parser report errors and consult the symbol table.
Lexical Analyzer

• Functions (tasks)
  – Grouping input characters into tokens
  – Stripping out comments and whitespace
  – Keeping track of the number of newline characters seen
  – Correlating error messages with the source program
  – Handling include files and macros
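Newline tracking exists to support error correlation; below is a minimal sketch (advance() and the sample input are illustrative, not from the slides):

    #include <stdio.h>

    /* Count '\n' characters as input is consumed so that error
       messages can be correlated with source line numbers. */
    static int lineno = 1;

    static int advance(const char **p)
    {
        int c = *(*p)++;
        if (c == '\n') lineno++;          /* one more newline seen */
        return c;
    }

    int main(void)
    {
        const char *src = "a\nbb\n@";     /* '@' sits on line 3 */
        while (*src)
            if (advance(&src) == '@')
                fprintf(stderr, "line %d: unexpected character '@'\n", lineno);
        return 0;
    }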
The Reason for Using the Lexical Analyzer
• Simplifies the design of the compiler
  – A parser that had to deal with comments and whitespace as syntactic units would be more complex.
  – If lexical analysis were not separated from parsing, then LL(1) or LR(1) parsing with one token of lookahead would not be possible (multiple characters/tokens to match).
• Compiler efficiency is improved
  – Systematic techniques to implement lexical analyzers by hand or automatically from specifications
  – Stream buffering methods to scan the input
• Compiler portability is enhanced
  – Input-device-specific peculiarities can be restricted to the lexical analyzer.
Why Separate Lexical Analysis and Parsing?
1. Simplicity of design
2. Improved compiler efficiency
3. Enhanced compiler portability (e.g. Linux to Windows)
Lexical Analyzer
• The lexical analyzer is divided into a cascade of two processes:
  – Scanning
    • The simple processes that do not require tokenization of the input:
      – deletion of comments;
      – compaction of consecutive whitespace characters into one.
  – Lexical analysis
    • The more complex part, which consumes the scanner's output and produces the sequence of tokens.
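A small sketch of the scanning pass just described (the function name and sample input are illustrative, not from the slides): it deletes // comments and compacts each run of whitespace into a single blank.

    #include <stdio.h>

    /* Scanning pass (a sketch): delete '//' comments and compact
       consecutive whitespace characters into one blank. */
    static void scan(const char *in, char *out)
    {
        int in_ws = 0;
        while (*in) {
            if (in[0] == '/' && in[1] == '/') {        /* delete comment */
                while (*in && *in != '\n') in++;
                continue;
            }
            if (*in == ' ' || *in == '\t' || *in == '\n') {
                if (!in_ws) *out++ = ' ';              /* compact the run */
                in_ws = 1; in++;
            } else {
                *out++ = *in++; in_ws = 0;
            }
        }
        *out = '\0';
    }

    int main(void)
    {
        char buf[128];
        scan("x  =\t1; // set x\ny = 2;", buf);
        printf("%s\n", buf);     /* -> "x = 1; y = 2;" */
        return 0;
    }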
Lexical Analysis
• What do we want to do? Example:
    if (i == j)
        z = 0;
    else
        z = 1;
• The input is just a string of characters:
    \t if (i == j) \n \t \t z = 0; \n \t else \n \t \t z = 1;
• Goal: partition the input string into substrings, where the substrings are tokens.
What’s a Token?
• A syntactic category
  – In English: noun, verb, adjective, …
  – In a programming language: identifier, integer, keyword, whitespace, …
Tokens
• Tokens correspond to sets of strings.
  – Identifier: strings of letters or digits, starting with a letter
  – Integer: a non-empty string of digits
  – Keyword: “else” or “if” or “begin” or …
  – Whitespace: a non-empty sequence of blanks, newlines, and tabs
• Two issues in lexical analysis:
  – How to specify tokens (patterns)?
  – How to recognize the tokens given a token specification (how to implement the nexttoken() routine)?
• How to specify tokens:
  – All the basic elements in a language must be tokens so that they can be recognized.

    #include <stdio.h>

    int main() {
        int i, j;
        for (i = 0; i < 50; i++) {
            printf("i = %d", i);
        }
    }

• Token types: constant, identifier, reserved word, operator and misc. symbol.
  – Tokens are specified by regular expressions.
Tokens, Patterns and Lexemes
• A token is a pair of a token name and an optional attribute value.
  • Example: num, id
• A pattern is a description of the form that the lexemes of a token may take.
  • Example: “non-empty sequence of digits”, “letter followed by letters and digits”
  • identifier: ([a-zA-Z_]) ([a-zA-Z_]|[0-9])*
• A lexeme is a sequence of characters in the source program that matches the pattern for a token.
  • Example: 123, abc
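As an illustration (a hand-written check, assuming the identifier pattern above; is_identifier is not from the slides), a C function that tests whether a whole string is a lexeme of that pattern:

    #include <ctype.h>
    #include <stdio.h>

    /* Does s match ([a-zA-Z_])([a-zA-Z_]|[0-9])* in full? (a sketch) */
    static int is_identifier(const char *s)
    {
        if (!(isalpha((unsigned char)*s) || *s == '_'))
            return 0;                      /* must start with a letter or _ */
        for (s++; *s; s++)
            if (!(isalnum((unsigned char)*s) || *s == '_'))
                return 0;                  /* then letters, digits, or _ */
        return 1;
    }

    int main(void)
    {
        printf("%d %d %d\n",
               is_identifier("abc"),       /* 1: a valid lexeme */
               is_identifier("_x1"),       /* 1: a valid lexeme */
               is_identifier("123"));      /* 0: starts with a digit */
        return 0;
    }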
Examples: Tokens, Patterns, and Lexemes

Token        Pattern                                  Sample Lexemes
if           the characters i, f                      if
else         the characters e, l, s, e                else
comparison   < or > or <= or >= or == or !=           <=, !=
id           letter followed by letters and digits    pi, score, D2
number       any numeric constant                     3.14, 0, 6.23
literal      anything but ", surrounded by "s         "core dump"
An Example
• E = M * C ** 2
• The lexical analyzer produces the sequence of pairs:
  <id, pointer to symbol-table entry for E>
  <assign_op>
  <id, pointer to symbol-table entry for M>
  <mult_op>
  <id, pointer to symbol-table entry for C>
  <exp_op>
  <number, integer value 2>
Lexical errors
• Some errors are beyond the power of the lexical analyzer to recognize:
  – fi (a == f(x)) …
    (fi could equally be a misspelled if or a valid function identifier; only a later phase can tell.)
• However, it may be able to recognize errors like:
  – d = 2r
• Such errors are recognized when no pattern for tokens matches a character sequence.
Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed token or a delimiter.
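A minimal sketch of panic-mode recovery (panic_skip and its delimiter set are illustrative, not from the slides):

    #include <stdio.h>
    #include <string.h>

    /* Panic mode: discard characters until one that could start a new
       token or is a delimiter; scanning resumes from there. */
    static const char *panic_skip(const char *p)
    {
        while (*p && strchr(" \t\n;,(){}", *p) == NULL)
            p++;                          /* ignore successive characters */
        return p;
    }

    int main(void)
    {
        const char *resume = panic_skip("@#$ x = 1;");
        printf("resumed at: \"%s\"\n", resume);   /* -> " x = 1;" */
        return 0;
    }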
Corrective Actions
• Delete one character from the remaining input
• Insert a missing character into the remaining
input
• Replace a character by another character
• Transpose two adjacent characters
Input buffering
• Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to return
  – In C: we need to look past -, = or < to decide which token to return
  – In Fortran: DO 5 I = 1.25
    (not until the . is seen can this be distinguished from the loop header DO 5 I = 1,25)
• We need to introduce a two-buffer scheme to handle large lookaheads safely

Input Buffering
The buffer pair holds the input (here E = M * C * 2) with an eof sentinel at the end of each half: lexemeBegin marks the start of the current lexeme, and forward scans ahead until a pattern match is found. The sentinel eof also marks the true end of input, so every advance of forward needs only one test.
Sentinels
Lookahead Code with Sentinels
switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else
        /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters;
}
Specification of tokens
• In the theory of compilation, regular expressions are used to formalize the specification of tokens
• Regular expressions are a means for specifying regular languages
• Example: letter_ (letter_ | digit)*
• Each regular expression is a pattern specifying the form of strings
Ambiguity Resolving
• Find the longest matching token
• Between two tokens with the same length, use the one declared first

How to Implement Ambiguity Resolving
• First find the longest matching token; only when two matches have the same length does declaration order break the tie
Pathological Example
if                                   { return IF; }
[a-z][a-z0-9]*                       { return ID; }
[0-9]+                               { return NUM; }
[0-9]+"."[0-9]*|[0-9]*"."[0-9]+      { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)           { ; }
.                                    { error(); }
The Lexical Analysis Problem
• Given
  – a set of token descriptions
    • token name
    • regular expression
  – an input string
• Partition the string into tokens (class, value)
• Ambiguity resolution
  – prefer the longest matching token
  – between two equal-length tokens, select the one declared first
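A compact sketch of this resolution policy in C (the rule table and recognizers are illustrative, not from the slides): each recognizer reports how many characters it can match at the current position; the scanner keeps the longest match and, on a tie, the rule declared first.

    #include <ctype.h>
    #include <stdio.h>

    static int match_if(const char *s)            /* the keyword "if" */
    {
        return (s[0] == 'i' && s[1] == 'f') ? 2 : 0;
    }
    static int match_id(const char *s)            /* letter (letter|digit)* */
    {
        int n = 0;
        if (!isalpha((unsigned char)s[0])) return 0;
        while (isalnum((unsigned char)s[n])) n++;
        return n;
    }
    static int match_num(const char *s)           /* digit+ */
    {
        int n = 0;
        while (isdigit((unsigned char)s[n])) n++;
        return n;
    }

    /* Rule table in declaration order: earlier entries win length ties. */
    static struct { const char *name; int (*match)(const char *); } rules[] = {
        { "if",  match_if  },
        { "id",  match_id  },
        { "num", match_num },
    };

    int main(void)
    {
        const char *p = "if if8 42";
        while (*p) {
            if (*p == ' ') { p++; continue; }      /* skip whitespace */
            int best_len = 0, best = -1;
            for (int i = 0; i < 3; i++) {          /* keep the longest match; */
                int len = rules[i].match(p);       /* ties go to the first rule */
                if (len > best_len) { best_len = len; best = i; }
            }
            if (best < 0) { printf("error at '%c'\n", *p); p++; continue; }
            printf("%s \"%.*s\"\n", rules[best].name, best_len, p);
            p += best_len;
        }
        return 0;
    }

On the input if if8 42 this prints if "if" (a length tie won by the keyword rule), id "if8" (the longest match beats the keyword), and num "42".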
Strings and Languages
• An alphabet is a finite set of symbols; a string over an alphabet is a finite sequence of symbols drawn from it; a language is any set of strings over an alphabet.
String Operations
• Concatenation: xy is x followed by y; exponentiation: s^0 = ε and s^i = s^(i-1) s.
Language Operations
• Union L ∪ M, concatenation LM, Kleene closure L*, and positive closure L+.
Regular Expressions
• Regular Expressions
– A convenient means of specifying certain simple sets
of strings.
– We use regular expressions to define structures of
tokens.
– Tokens are built from symbols of a finite vocabulary.
• Regular Sets
– The sets of strings defined by regular expressions.
Regular Expressions
Operator Precedence

Operator        Precedence   Associativity
*               highest      left
concatenation   second       left
|               lowest       left
Algebraic Laws for Regular Expressions
• r | s = s | r                    (| is commutative)
• r | (s | t) = (r | s) | t        (| is associative)
• r (s t) = (r s) t                (concatenation is associative)
• r (s | t) = r s | r t            (concatenation distributes over |)
• ε r = r ε = r                    (ε is the identity for concatenation)
• r* = (r | ε)*                    (ε is guaranteed in a closure)
• r** = r*                         (* is idempotent)
Regular definitions
d1 → r1
d2 → r2
…
dn → rn

• Example:
  letter_ → A | B | … | Z | a | b | … | z | _
  digit → 0 | 1 | … | 9
  id → letter_ (letter_ | digit)*
Extensions
• One or more instances: (r)+
• Zero or one instance: r?
• Character classes: [abc]

• Example:
  – letter_ → [A-Za-z_]
  – digit → [0-9]
  – id → letter_ (letter_ | digit)*
Lex Regular Expressions

Expression   Matches                               Example
\c           character c literally                 \*
"s"          string s literally                    "**"
.            any character but newline             a.*b
^            beginning of a line                   ^a
$            end of a line                         a$
[^s]         any one character not in string s     [^a]
r*           zero or more strings matching r       a*
r+           one or more strings matching r        a+
r?           zero or one r                         a?
r{m,n}       between m and n occurrences of r      a{1,2}
r1r2         an r1 followed by an r2               ab
r1|r2        an r1 or an r2                        a|b
(r)          same as r                             (a|b)
r1/r2        r1 when followed by r2                i/am
Regular Definitions
• If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
  d1 → r1
  d2 → r2
  …
  dn → rn
  – Each di is a new symbol, not in Σ and not the same as any other of the d's.
  – Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}.
• Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions.
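For instance, substituting the letter_ and digit definitions from the earlier example into id yields a single self-contained expression:

  id → letter_ (letter_ | digit)*
     = [A-Za-z_] ([A-Za-z_] | [0-9])*    (after substitution)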
Extensions of Regular Definitions
• One or more instances
  – r+ = rr* = r*r
  – r* = r+ | ε
• Zero or one instance
  – r? = r | ε
• Character classes
  – [a-z] = a|b|c|…|z
  – [A-Za-z] = A|B|…|Z|a|…|z
• Example
  – digit → [0-9]
  – num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Write character classes for the following sets of characters:
1. The first ten letters (up to "j") in either upper or lower case.
2. The lowercase consonants.
3. The "digits" in a hexadecimal number (choose either upper or lower case for the "digits" above 9).
4. The characters that can appear at the end of a legitimate English sentence (e.g., an exclamation point).
Write Regular Expressions for
1. Arithmetic expression
2. Relational expression
I. Most languages are case sensitive, so keywords can be written only one way, and the regular expressions describing their lexemes are very simple. However, some languages, like SQL, are case insensitive, so a keyword can be written either in lowercase or in uppercase, or in any mixture of cases. Thus, the SQL keyword SELECT can also be written select, Select, or sElEcT, for instance. Show how to write a regular expression for a keyword in a case-insensitive language. Illustrate the idea by writing the expression for "select" in SQL.
Answer
• select → [Ss][Ee][Ll][Ee][Cc][Tt]
• or equivalently
• select → (S|s)(E|e)(L|l)(E|e)(C|c)(T|t)
Regular Definitions and Grammars

Context-Free Grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | num

Regular Definitions
digit  → [0-9]
letter → [A-Za-z]
if     → if
then   → then
else   → else
relop  → < | <= | <> | > | >= | =
id     → letter ( letter | digit )*
num    → digit+ (. digit+)? ( E (+ | -)? digit+ )?
ws     → ( blank | tab | newline )+
Recognition of tokens
• The starting point is the language grammar, to understand the tokens:
  stmt → if expr then stmt
       | if expr then stmt else stmt
  expr → term relop term
       | term
  term → id
       | number
Recognition of tokens (cont.)
• The next step is to formalize the patterns:
  digit  → [0-9]
  digits → digit+
  number → digits (. digits)? (E [+-]? digits)?
  letter → [A-Za-z_]
  id     → letter (letter | digit)*
  if     → if
  then   → then
  else   → else
  relop  → < | > | <= | >= | = | <>
• We also need to handle whitespace:
  ws → (blank | tab | newline)+
Lexemes      Token Name   Attribute Value
any ws       –            –
if           if           –
then         then         –
else         else         –
any id       id           pointer to table entry
any number   number       pointer to table entry
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
>            relop        GT
>=           relop        GE
Transition Graph for FA
• A circle denotes a state.
• A labeled arrow denotes a transition.
• An arrow marked start points to the start state.
• A double circle denotes a final (accepting) state.
Transition Diagrams
relop → < | <= | <> | > | >= | =

State 0 (start): on <, go to state 1; on =, go to state 5; on >, go to state 6.
State 1: on =, go to state 2 and return (relop, LE); on >, go to state 3 and return (relop, NE); on any other character, go to state 4* and return (relop, LT).
State 5: return (relop, EQ).
State 6: on =, go to state 7 and return (relop, GE); on any other character, go to state 8* and return (relop, GT).
(States marked * have read one character beyond the lexeme and must retract the forward pointer.)
Transition Diagrams
id → letter ( letter | digit )*

State 9 (start): on a letter, go to state 10.
State 10: on a letter or digit, stay in state 10; on any other character, go to state 11*.
State 11*: retract one character and return (getToken(), installID()).
Transition diagrams (cont.)
• Transition diagram for reserved words and
identifiers
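A hedged C rendering of the id diagram's states (9 through 11): nextChar, retract and installID are the helper names from the slides, given toy bodies here so the sketch runs; pre-installing reserved words in the symbol table is the standard way the same diagram serves both keywords and identifiers.

    #include <ctype.h>
    #include <stdio.h>

    /* A sketch of states 9-11 over an in-memory buffer. In a real
       lexer, installID() enters the lexeme in the symbol table, and
       reserved words are pre-installed so that the lookup returns the
       keyword token instead of id. */
    static const char *buf = "count1 + 42";
    static int pos = 0;

    static int nextChar(void) { return buf[pos++]; }
    static void retract(void) { pos--; }

    static int recognizeId(void)
    {
        int start = pos;
        if (!isalpha(nextChar())) {        /* state 9: need a letter */
            retract();
            return 0;                      /* fail: not an identifier */
        }
        while (isalnum(nextChar()))        /* state 10: letter | digit */
            ;
        retract();                         /* state 11*: one char too many */
        printf("installID(\"%.*s\")\n", pos - start, buf + start);
        return 1;
    }

    int main(void)
    {
        recognizeId();                     /* -> installID("count1") */
        return 0;
    }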
Transition diagrams (cont.)
• Transition diagram for unsigned numbers
Transition diagrams (cont.)
• Transition diagram for whitespace
Transition diagrams (cont.)
• Transition diagram for relop
Architecture of a transition-diagram-based
lexical analyzer
Implementation of relop
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) {  /* repeat character processing until a
                    return or failure occurs */
        switch (state) {
        case 0: c = nextChar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail();  /* lexeme is not a relop */
            break;
        case 1: ...
        ...
        case 8: retract();
            retToken.attribute = GT;
            return(retToken);
        }
    }
}
Lexical Analyzer Generator - Lex

lex.l (Lex source program)  →  Lex compiler  →  lex.yy.c
lex.yy.c                    →  C compiler    →  a.out
input stream                →  a.out         →  sequence of tokens
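A sketch of how the generated scanner can be driven from C (this driver is illustrative and assumes the Lex specification's actions return nonzero token codes; yylex() and yytext are the standard entry points exposed by lex.yy.c):

    #include <stdio.h>

    extern int  yylex(void);   /* the generated scanner, from lex.yy.c */
    extern char *yytext;       /* text of the current lexeme */

    int main(void)
    {
        int tok;
        /* yylex() returns 0 at end of input; each nonzero value is a
           token code returned by some action in the Lex rules. */
        while ((tok = yylex()) != 0)
            printf("token %d, lexeme \"%s\"\n", tok, yytext);
        return 0;
    }

Built with, e.g., lex lex.l && cc lex.yy.c driver.c -ll, this prints one line per recognized token.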
