0% found this document useful (0 votes)
25 views26 pages

Code Source Tokens Scanner Parser IR

The document describes scanners and how they work. It discusses: - Scanners map characters into tokens which are the basic units of syntax. They eliminate whitespace and identify tokens like numbers, identifiers, and operators. - Scanners must specify patterns to recognize tokens using notations like regular expressions and regular grammars. Regular expressions use operations like union, concatenation, and Kleene closure to describe patterns. - From regular expressions, a scanner can construct a deterministic finite automaton to recognize tokens as it reads character input.

Uploaded by

moien
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
25 views26 pages

Code Source Tokens Scanner Parser IR

The document describes scanners and how they work. It discusses: - Scanners map characters into tokens which are the basic units of syntax. They eliminate whitespace and identify tokens like numbers, identifiers, and operators. - Scanners must specify patterns to recognize tokens using notations like regular expressions and regular grammars. Regular expressions use operations like union, concatenation, and Kleene closure to describe patterns. - From regular expressions, a scanner can construct a deterministic finite automaton to recognize tokens as it reads character input.

Uploaded by

moien
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 26

Scanner

source tokens
code scanner parser IR

errors
• maps characters into tokens – the basic unit of syntax
x = x + y;
becomes
<id, x> = <id, x> + <id, y> ;
• character string value for a token is a lexeme
• typical tokens: number, id, +, -, *, /, do, end
• eliminates white space (tabs, blanks, comments)
• a key issue is speed
⇒ use specialized recognizer (as opposed to lex)

1
Specifying patterns

A scanner must recognize the units of syntax


Some parts are easy:

white space
<ws> ::= <ws> ’ ’
| <ws> ’\t’
| ’ ’
| ’\t’
keywords and operators
specified as literal patterns: do, end
comments
opening and closing delimiters: /* · · · */

2
Specifying patterns

A scanner must recognize the units of syntax


Other parts are much harder:

identifiers
alphabetic followed by k alphanumerics ( , $, &, . . . )
numbers
integers: 0 or digit from 1-9 followed by digits from 0-9
decimals: integer ’.’ digits from 0-9
reals: (integer or decimal) ’E’ (+ or -) digits from 0-9
complex: ’(’ real ’,’ real ’)’

We need a powerful notation to specify these patterns


3
Operations on languages

Operation Definition
union of L and M L ∪ M = {s | s ∈ L or s ∈ M}
written L ∪ M
concatenation of L and M LM = {st | s ∈ L and t ∈ M}
written LM
Kleene closure of L L∗ = ∞ i
S
i=0 L
written L∗
+ S∞
positive closure of L L = i=1 Li
written L+

4
Regular expressions

Patterns are often specified as regular languages


Notations used to describe a regular language (or a regular set) include
both regular expressions and regular grammars
Regular expressions (over an alphabet Σ):

1. ε is a RE denoting the set {ε}


2. if a ∈ Σ, then a is a RE denoting {a}
3. if r and s are REs, denoting L(r) and L(s), then:
(r) is a RE denoting L(r)
(r) | (s) is a RE denoting L(r) L(s)
S

(r)(s) is a RE denoting L(r)L(s)


(r)∗ is a RE denoting L(r)∗

If we adopt a precedence for operators, the extra parentheses can go


away. We assume closure, then concatenation, then alternation as the
order of precedence.
5
Examples

identifier
letter → (a | b | c | ... | z | A | B | C | ... | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )∗
numbers
integer → (+ | − | ε) (0 | (1 | 2 | 3 | ... | 9) digit ∗)
decimal → integer . ( digit )∗
real → ( integer | decimal ) E (+ | −) digit ∗
complex → ’(’ real , real ’)’

Numbers can get much more complicated

Most programming language tokens can be described with REs

We can use REs to build scanners automatically


6
Algebraic properties of REs

Axiom Description
r|s = s|r | is commutative
r|(s|t) = (r|s)|t | is associative
(rs)t = r(st) concatenation is associative
r(s|t) = rs|rt concatenation distributes over |
(s|t)r = sr|tr
εr = r ε is the identity for concatenation
rε = r
r∗ = (r|ε)∗ relation between ∗ and ε
r∗∗ = r∗ ∗ is idempotent

7
Examples

Let Σ = {a, b}

1. a|b denotes {a, b}


2. (a|b)(a|b) denotes {aa, ab, ba, bb}
i.e., (a|b)(a|b) = aa|ab|ba|bb
3. a∗ denotes {ε, a, aa, aaa, . . .}
4. (a|b)∗ denotes the set of all strings of a’s and b’s (including ε)
i.e., (a|b)∗ = (a∗b∗)∗
5. a|a∗b denotes {a, b, ab, aab, aaab, aaaab, . . .}

8
Recognizers
From a regular expression we can construct a
deterministic finite automaton (DFA)
Recognizer for identifier :
letter
digit

letter other
0 1 2

digit accept
other
3

error

identifier
letter → (a | b | c | ... | z | A | B | C | ... | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )∗
9
Code for the recognizer
char ← next char();
state ← 0; /* code for state 0 */
done ← false;
token value ← "" /* empty string */
while( not done ) {
class ← char class[char];
state ← next state[class,state];
switch(state) {
case 1: /* building an id */
token value ← token value + char;
char ← next char();
break;
case 2: /* accept state */
token type = identifier;
done = true;
break;
case 3: /* error */
token type = error;
done = true;
break;
}
}
return token type;

10
Tables for the recognizer

Two tables control the recognizer


a−z A−Z 0−9 other
char class:
value letter letter digit other

class 0 1 2 3
letter 1 1 — —
next state:
digit 3 1 — —
other 3 2 — —

To change languages, we can just change tables


11
Automatic construction

Scanner generators automatically construct code from RE-like


descriptions

• construct a DFA
• use state minimization techniques
• emit code for the scanner
(table driven or direct code )

A key issue in automation is an interface to the parser

lex is a scanner generator supplied with UNIX

• emits C code for scanner


• provides macro definitions for each token
(used in the parser)
12
Grammars for regular languages

Can we place a restriction on the form of a grammar to ensure that it


describes a regular language?

Provable fact:

For any RE r, ∃ a grammar g such that L(r) = L(g)

Grammars that generate regular sets are called regular grammars:

They have productions in one of 2 forms:


1. A → aA
2. A → a
where A is any non-terminal and a is any terminal symbol

These are also called type 3 grammars (Chomsky)


13
More regular languages

Example: the set of strings containing an even number of zeros and an


even number of ones

1
s0 s1
1
0 0 0 0
1
s2 s3
1
The RE is (00 | 11)∗((01 | 10)(00 | 11)∗(01 | 10)(00 | 11)∗)∗

14
More regular expressions

What about the RE (a | b)∗abb ?

a jb

a b b
s0 s1 s2 s3

State s0 has multiple transitions on a!


⇒ nondeterministic finite automaton
a b
s0 {s0, s1} {s0}
s1 – {s2}
s2 – {s3}

15
Finite automata

A non-deterministic finite automaton (NFA) consists of:

1. a set of states S = {s0, . . . , sn}


2. a set of input symbols Σ (the alphabet)
3. a transition function move mapping state-symbol pairs to sets of
states
4. a distinguished start state s0
5. a set of distinguished accepting or final states F

A Deterministic Finite Automaton (DFA) is a special case of an NFA:

1. no state has a ε-transition, and


2. for each state s and input symbol a, there is at most one edge labelled
a leaving s

A DFA accepts x iff. ∃ a unique path through the transition graph from s0 to
a final state such that the edges spell x.
16
DFAs and NFAs are equivalent

1. DFAs are clearly a subset of NFAs


2. Any NFA can be converted into a DFA, by simulating sets of
simultaneous states:
• each DFA state corresponds to a set of NFA states
• possible exponential blowup

17
NFA to DFA using the subset construction: example 1
ajb

a b b
s0 s1 s2 s3

a b
{s0} {s0, s1} {s0}
{s0, s1} {s0, s1} {s0, s2}
{s0, s2} {s0, s1} {s0, s3}
{s0, s3} {s0, s1} {s0}
b

b
a

a b b
fs0 g fs0 ; s1 g fs0 ; s2 g fs0 ; s3 g

a
a

18
Constructing a DFA from a regular expression

DFA
minimized

RE DFA

NFA
ε moves

RE →NFA w/ε moves


build NFA for each term
connect them with ε moves
NFA w/ε moves to DFA
construct the simulation
the “subset” construction
DFA → minimized DFA
merge compatible states
DFA → RE
construct Rkij = Rk−1
ik (Rk−1 ∗ k−1 S k−1
kk ) Rk j Ri j

19
RE to NFA
ε
N(ε)
a
N(a)

N(A) A
ε ε

N(A|B)
ε ε
N(B) B

N(AB) N(A) A N(B) B

N(A∗) ε
ε
ε
N(A) A

20
RE to NFA: example

N(A) A
ε ε

a|b
ε ε
N(B) B

a
2 3

ε ε

(a|b)∗ 0
ε
1 6
ε
7

ε ε

4 5
b

a b b
abb 7 8 9 10

21
NFA to DFA: the subset construction
Input: NFA N
Output: A DFA D with states Dstates and transitions Dtrans such that L(D) = L(N)
Method: Let s be a state in N and T be a set of states, and using the following operations:

Operation Definition
ε-closure(s) set of NFA states reachable from NFA state s on ε-transitions
alone
ε-closure(T ) set of NFA states reachable from some NFA state s in T on
ε-transitions alone
move(T, a) set of NFA states to which there is a transition on input symbol
a from some NFA state s in T

add state T = ε-closure(s0) unmarked to Dstates


while ∃ unmarked state T in Dstates
mark T
for each input symbol a
U = ε-closure(move(T, a))
if U 6∈ Dstates then add U to Dstates unmarked
Dtrans[T, a] = U
endfor
endwhile

ε-closure(s0) is the start state of D


A state of D is final if it contains at least one final state in N

22
NFA to DFA using subset construction: example 2
ε

a
2 3

ε ε

ε ε a b b
0 1 6 7 8 9 10

ε ε

4 5
b

a b
A B C
A = {0, 1, 2, 4, 7} D = {1, 2, 4, 5, 6, 7, 9}
B B D
B = {1, 2, 3, 4, 6, 7, 8} E = {1, 2, 4, 5, 6, 7, 10}
C B C
C = {1, 2, 4, 5, 6, 7}
D B E
E B C
b

b
b
a
a

a b b
A B D E

a
a

23
Limits of regular languages

Not all languages are regular

One cannot construct DFAs to recognize these languages:

• L = {pk qk }
• L = {wcwr | w ∈ Σ∗}

Note: neither of these is a regular expression!


(DFAs cannot count!)

But, this is a little subtle. One can construct DFAs for:

• alternating 0’s and 1’s


(ε | 1)(01)∗(ε | 0)
• sets of pairs of 0’s and 1’s
(01 | 10)+

24
Ramification - Internet Protocol

How does your browser establish a connection with a web server?

• The client sends a SYN message to the server.

• In response, the server replies with a SYN-ACK.

• Finally the client sends an ACK back to the server.

This is done through two DFAs in the client and server, respectively.

25
Ramification - Intrusion Detection
Code Operating System
FILE * f;
f=fopen("demo", "r"); ------> SYS_OPEN
strcpy(...); //vulnerability
if (!f)
printf("Fail to open\n"); ------> SYS_WRITE
else
fgets(f, buf); ------> SYS_READ
...

A DFA will be exercised simultaneously with the program on the OS side


to detect intrusion.

26

You might also like