Chapter 2 (Lexical Analysis)
A lexical analyser, or lexer for short, takes a string of individual
letters as its input and divides this string into tokens. Additionally, it will
filter out whatever separates the tokens (the so-called white-space),
i.e., lay-out characters (spaces, newlines etc.) and comments.
The main purpose of lexical analysis is to make life easier for the
subsequent syntax analysis phase.
Lexical Analysis
• The work that is done during lexical analysis can be made an integral
part of syntax analysis, and in simple systems this is indeed often
done. However, there are reasons for keeping the phases separate:
Efficiency: A lexer may do the simple parts of the work faster than the
more general parser can. Furthermore, the size of a system that is split in
two may be smaller than a combined system.
Tradition: Languages are often designed with separate lexical and syntactic
phases in mind, and the standard documents of such languages typically
separate lexical and syntactic elements of the languages.
Example of Natural language
1- Introduction: Structure of natural language
Three level approach
- alphabet of letters: Σ = {letters} = {a, b, …, z}
Example of Formal languages
Computer languages = programming languages
Computer languages = formal languages
Another example:
alphabet: Σ = {a,b}
words: a.b.a = aba
a.ε = a
Definition: A word is obtained by concatenating symbols of the alphabet.
Example: Σ = {a,b}
Σ* = {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, …}
Definition: The mirror image of a word m, denoted m~, is obtained by reversing the order of the
symbols of m.
Example: m = abaaab and m~ = baaaba
Exercise: Let Σ = {a,b} and let L be the set of words of the form m m~.
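The mirror image can be sketched in C as follows. This is only an illustration: the function names mirror and is_word_followed_by_mirror are hypothetical, and the second function assumes the exercise is about recognising words of the form m m~, which have even length and a second half that mirrors the first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Writes the mirror image of m into out; out must hold strlen(m) + 1 bytes. */
void mirror(const char *m, char *out) {
    assert(m != NULL && out != NULL);
    size_t n = strlen(m);
    for (size_t i = 0; i < n; i++)
        out[i] = m[n - 1 - i];   /* symbol i of the mirror is symbol n-1-i of m */
    out[n] = '\0';
}

/* Returns 1 if w has the form m m~ for some word m, 0 otherwise.
   Such a word has even length and its second half mirrors its first half. */
int is_word_followed_by_mirror(const char *w) {
    assert(w != NULL);
    size_t n = strlen(w);
    if (n % 2 != 0) return 0;
    for (size_t i = 0; i < n / 2; i++)
        if (w[i] != w[n - 1 - i]) return 0;
    return 1;
}
```

Calling mirror on abaaab fills the buffer with baaaba, the example above. The word abba is accepted by the second function (m = ab), while abab is not.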
Steps:
Alphabet
concatenation
word
Set of words
Language
3- Operations on languages:
complement: L̄ = {m | m ∈ Σ* and m ∉ L}
union: L' ∪ L'' = {m | m ∈ L' or m ∈ L''}
intersection: L' ∩ L'' = {m | m ∈ L' and m ∈ L''}
product / concatenation: L'.L'' = {m | m = m'm'' such that m' ∈ L' and m'' ∈ L''}
Cartesian product: L' × L'' = {(m',m'') | m' ∈ L' and m'' ∈ L''}
power: L^i = {m | m = m1 … mi such that m1 ∈ L, …, mi ∈ L}
star: L* = {ε} ∪ L ∪ L^2 ∪ … ∪ L^i ∪ …
4- Membership problem:
Given a language L and a word m, how can we test whether m ∈ L?
Difficulty: L is an infinite language, so L cannot be stored in an
in-memory data structure.
Solution: Describe the infinite language by a finite and
programmable formalism.
[Figure: a word m is given as input; the answer is m ∈ L or m ∉ L]
5- The automaton
Example: A = <Σ = {a,b}, Q = {q0, q1}, q0, F = {q1}, δ = {δ(q0,a) = q0, δ(q0,b) = q1}>
The transition function can be represented by a table:
      a    b
q0    q0   q1
[Figure: the same automaton drawn as a graph, with a transition labelled b from q0 to q1]
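One possible way to store such a table in C is a two-dimensional array indexed by state and symbol. This is a minimal sketch: the names Q0, Q1, NO_STATE and delta are assumptions made for the illustration, and NO_STATE is a hypothetical marker for transitions the example does not define.

```c
#include <assert.h>

enum state { Q0, Q1, NO_STATE };           /* NO_STATE marks an undefined transition */

/* delta for the example: delta(q0,a) = q0 and delta(q0,b) = q1.
   No transition out of q1 is given in the example, so that row is undefined. */
enum state delta(enum state q, char x) {
    static const enum state table[2][2] = {
        /*          a         b      */
        /* q0 */ { Q0,       Q1       },
        /* q1 */ { NO_STATE, NO_STATE },
    };
    assert(q == Q0 || q == Q1 || q == NO_STATE);
    if (q == NO_STATE) return NO_STATE;
    if (x == 'a') return table[q][0];
    if (x == 'b') return table[q][1];
    return NO_STATE;                       /* symbol outside the alphabet {a,b} */
}
```

With this layout, adding a state means adding a row and adding a symbol means adding a column, so the code stays close to the table drawn above.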
6- Words recognized by an automaton:
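Extension of δ
[Figure: automaton with states q0 and q1 and a transition labelled c]
The automaton can be represented by the same graph with one entry and
several exits:
[Figure: automaton with states q0 and q1 and transitions labelled a, b and c]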
The words recognized by the automaton A
[Figure: automaton A with states q0 and q1 and transitions labelled a, b and c]
are:
the empty word, because q0 is both an initial and a terminal
state
The words without the letter b: ac, aac, aaac, …
The words without the letter a: cb, cbb, cbbb, …
The words with the letters a and b:
acb, acbb, acbbb, …
aacb, aacbb, aacbbb, …
…
L(A) = {a^n c b^m | n ≥ 0, m ≥ 0}
7- Implementation of the membership test:
[Figure: a word m and a language L are given; an automaton A such that L(A) = L is constructed, and a program answers m ∈ L or m ∉ L]
Problem: How, in particular, do we implement the function δ*?
δ*(q,m) = q'
How do we implement the full automaton?
δ*: Q × Σ* → Q
∀ m ∈ Σ* with m = xm', x ∈ Σ and m' ∈ Σ*, and ∀ q ∈ Q:
δ*(q,m) = δ*(q,xm') = δ*(δ(q,x),m')
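The recursive definition of the extended transition function translates almost directly into C. The sketch below reuses the earlier example automaton (δ(q0,a) = q0, δ(q0,b) = q1, F = {q1}). The names delta, delta_star, accepts and NO_STATE are assumptions made for the illustration, and the base case (the extended function leaves the state unchanged on the empty word) is the usual convention rather than something stated on the slide.

```c
#include <assert.h>
#include <stddef.h>

enum state { Q0, Q1, NO_STATE };   /* NO_STATE: no transition is defined */

/* One-step function delta of the earlier example:
   delta(q0,a) = q0, delta(q0,b) = q1, everything else undefined. */
static enum state delta(enum state q, char x) {
    if (q == Q0 && x == 'a') return Q0;
    if (q == Q0 && x == 'b') return Q1;
    return NO_STATE;
}

/* Extended function: delta*(q, "") = q and
   delta*(q, x m') = delta*(delta(q, x), m'). */
enum state delta_star(enum state q, const char *m) {
    assert(m != NULL);
    if (q == NO_STATE || *m == '\0') return q;
    return delta_star(delta(q, *m), m + 1);
}

/* m is in L(A) when delta*(q0, m) is a final state; here F = {q1}. */
int accepts(const char *m) {
    return delta_star(Q0, m) == Q1;
}
```

For this automaton, accepts returns 1 for aab and b, and 0 for the empty word and for aba. The recursion is a tail call, so it can equally be written as a loop that consumes one symbol per iteration.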
Implementation of the automaton:
/** implementation of δ* **/
repeat
/** implementation of δ **/
switch (state) {
case q0: if (*symbol == 'a') { state = δ(q0,a); break; }
else if (*symbol == 'b') { state = δ(q0,b); break; }
else { printf("the word is not in the language"); exit(1); }
case q1: transitions over q1
…
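Example: Write the program for the automaton
[Figure: automaton with states q0 and q1, transitions labelled a and b, and a transition labelled c from q0 to q1]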
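#include <stdio.h>
#include <stdlib.h>

enum { q0, q1 };                 /* the two states of the automaton */

char word[30]; char *symbol; int state;

int main(void) {
    /* read at most 29 characters; no input is treated as the empty word */
    if (scanf("%29s", word) != 1) word[0] = '\0';
    symbol = word;
    state = q0;
    while (1) {
        switch (state) {
        case q0: if (*symbol == 'a') { state = q0; break; }
                 else if (*symbol == 'c') { state = q1; break; }
                 else { printf("Lexical error"); exit(1); }
        case q1: if (*symbol == 'b') { state = q1; break; }
                 else if (*symbol == '\0') { printf("The word is in the language");
                                             exit(0); }
                 else { printf("Lexical error"); exit(1); }
        }
        /** go to the next symbol **/ symbol++;
    }
}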
Regular expressions
The set of all integer constants or the set of all variable names are
sets of strings, where the individual letters are taken from a particular
alphabet. Such a set of strings is called a language.
For integers, the alphabet consists of the digits 0-9 and for variable
names the alphabet contains both letters and digits.
When talking about regular expressions, we will use the letters (r, s
and t) in italics to denote unspecified regular expressions. When
letters stand for themselves (i.e., in regular expressions that describe
strings that use these letters) we will use typewriter font, e.g., a or b.
A single letter describes the language that has the one-letter string
consisting of that letter as its only element.
The symbol ε (the Greek letter epsilon) describes the language that consists
solely of the empty string. Note that this is not the empty set of strings.
s|t (pronounced “s or t”) describes the union of the languages described by s and
t.
st (pronounced “s t”) describes the concatenation of the languages L(s) and L(t),
i.e., the sets of strings obtained by taking a string from L(s) and putting this in
front of a string from L(t).
For example, if L(s) is {“a”, “b”} and L(t) is {“c”, “d”}, then L(st) is the set {“ac”, “ad”,
“bc”, “bd”}.
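For finite languages, the concatenation above can be computed with two nested loops. The sketch below is only an illustration: concat_languages and MAX_WORD are hypothetical names, and the function does not remove duplicates, so it matches the set definition only when all concatenations are distinct, as in the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

#define MAX_WORD 32   /* assumed upper bound on word length, including '\0' */

/* Fills out[] with every word m'm'' where m' is in s and m'' is in t.
   Returns the number of words written (ns * nt); out must have room for them.
   Duplicates are not removed. */
size_t concat_languages(const char *const s[], size_t ns,
                        const char *const t[], size_t nt,
                        char out[][MAX_WORD]) {
    assert(s != NULL && t != NULL && out != NULL);
    size_t k = 0;
    for (size_t i = 0; i < ns; i++)          /* take a string from L(s) ...   */
        for (size_t j = 0; j < nt; j++)      /* ... and put it before one from L(t) */
            snprintf(out[k++], MAX_WORD, "%s%s", s[i], t[j]);
    return k;
}
```

Called with s = {"a", "b"} and t = {"c", "d"}, the function writes the four words ac, ad, bc and bd, in that order.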
• The language for s* (pronounced “s star”) is described recursively: It consists of
the empty string plus whatever can be obtained by concatenating a string from
L(s) to a string from L(s* ).
• This is equivalent to saying that L(s* ) consists of strings that can be obtained by
concatenating zero or more (possibly different) strings from L(s).
• If, for example, L(s) is {“a”, “b”} then L(s* ) is {“”, “a”, “b”, “aa”, “ab”, “ba”, “bb”,
“aaa”, . . . }, i.e., any string (including the empty) that consists entirely of as and
bs.
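The recursive description of L(s*) can be turned into a membership test when L(s) is a finite set of strings. This is a sketch under that assumption; in_star is a hypothetical name, and the function tries every string of L(s) as a prefix, which is simple but can be slow on long inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if w is in L(s*), where L(s) is the finite set words[0..n-1].
   Follows the recursive definition: w is in L(s*) if w is empty, or if
   w = u v with u a non-empty string from L(s) and v in L(s*). */
int in_star(const char *w, const char *const words[], size_t n) {
    assert(w != NULL && words != NULL);
    if (*w == '\0') return 1;                      /* the empty string */
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(words[i]);
        if (len == 0) continue;                    /* an empty prefix makes no progress */
        if (strncmp(w, words[i], len) == 0 && in_star(w + len, words, n))
            return 1;
    }
    return 0;
}
```

With L(s) = {"a", "b"}, the strings abba and the empty string are accepted, while abc is rejected because c cannot start any string of L(s).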
Precedence rules
• When we combine different constructor symbols, e.g., in the regular expression a|ab*, it is
not a priori clear how the different subexpressions are grouped. We can use
parentheses to make the grouping of symbols explicit, such as in (a|(ab))*.
• For regular expressions, we use the following conventions: ∗ binds tighter than
concatenation, which binds tighter than alternative (|). The example a|ab* from
above, hence, is equivalent to a|(a(b* )).
Short hands
Writing out every alternative, as in 0|1|2|···|9 for a single digit, quickly
becomes tedious. It gets even worse when we get to variable names, where we must
enumerate all alphabetic letters.
We introduce a shorthand for sets of letters. Sequences of letters within
square brackets represent the set of these letters. For example, we use
[ab01] as a shorthand for a|b|0|1. An interval such as [0-9] abbreviates
[0123456789].
We can combine several intervals within one bracket and for example write
[a-zA-Z] to denote all alphabetic letters in both lower and upper case.
• Getting back to the example of integer constants above, we can now write this
much shorter as [0-9][0-9]* . Since s* denotes zero or more occurrences of s, we
needed to write the set of digits twice to describe that one or more digits are
allowed.
• Such non-zero repetition is quite common, so we introduce another shorthand,
s+, to denote one or more occurrences of s. With this notation, we can abbreviate
our description of integers to [0-9]+.
• On a similar note, it is common that we can have zero or one occurrence of
something (e.g., an optional sign to a number). Hence we introduce the
shorthand s? for s|ε.
• + and ? bind with the same precedence as ∗.
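To see how the shorthands read operationally, here is a small hand-written matcher for [+-]?[0-9]+, i.e. an optional sign (s?) followed by one or more digits (s+). It is a sketch rather than how a lexer generator would do it, and is_integer_constant is a hypothetical name.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if the whole string w matches [+-]?[0-9]+, 0 otherwise. */
int is_integer_constant(const char *w) {
    assert(w != NULL);
    if (*w == '+' || *w == '-') w++;              /* s? : zero or one sign */
    if (!isdigit((unsigned char)*w)) return 0;    /* s+ : at least one digit */
    while (isdigit((unsigned char)*w)) w++;       /* then zero or more digits */
    return *w == '\0';                            /* nothing may follow the digits */
}
```

The two digit tests mirror the expansion of [0-9]+ into [0-9][0-9]*: one mandatory digit followed by a loop for the remaining ones.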
Properties of Regular expression
Examples
• Keywords. A keyword like if is described by a regular expression that
looks exactly like that keyword, e.g., the regular expression if (which is
the concatenation of the two regular expressions i and f).
• Integers. An integer constant is an optional sign followed by a non-empty
sequence of digits: [+-]?[0-9]+. In some languages, the sign is a separate
symbol and not part of the constant itself. This will allow whitespace
between the sign and the number, which is not possible with the above.
• Floats. A floating-point constant can have an optional sign. After this, the
mantissa part is described as a sequence of digits followed by a decimal
point and then another sequence of digits.
• Finally, there is an optional exponent part, which is the letter e (in upper or lower
case) followed by an (optionally signed) integer constant.
• If there is an exponent part to the constant, the mantissa part can be written as an
integer constant (i.e., without the decimal point). Some examples: 3.14, -3., .23, 3e+4,
11.22e-3.
• We can make the description simpler if we make the regular expression for floats
also include integers, and instead use other means of distinguishing integers from
floats.
• If we do this, the regular expression can be simplified to:
[+-]?(([0-9]+(.[0-9]*)?|.[0-9]+)([eE][+-]?[0-9]+)?)
where . stands for a literal decimal point.
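The simplified expression can be checked by hand with a matcher that follows its structure piece by piece. The sketch below assumes the dot is a literal decimal point and that the whole string must match; skip_digits and is_number_constant are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Skips a run of digits starting at p; returns the first non-digit position. */
static const char *skip_digits(const char *p) {
    while (isdigit((unsigned char)*p)) p++;
    return p;
}

/* Returns 1 if the whole string w matches
   [+-]?(([0-9]+(.[0-9]*)?|.[0-9]+)([eE][+-]?[0-9]+)?), with . a literal point. */
int is_number_constant(const char *w) {
    assert(w != NULL);
    if (*w == '+' || *w == '-') w++;                 /* [+-]? */

    const char *p = skip_digits(w);
    if (p > w) {                                     /* [0-9]+(.[0-9]*)? */
        if (*p == '.') p = skip_digits(p + 1);
    } else if (*p == '.') {                          /* .[0-9]+ */
        const char *q = skip_digits(p + 1);
        if (q == p + 1) return 0;                    /* no digit after the point */
        p = q;
    } else {
        return 0;                                    /* neither alternative applies */
    }

    if (*p == 'e' || *p == 'E') {                    /* ([eE][+-]?[0-9]+)? */
        const char *q = p + 1;
        if (*q == '+' || *q == '-') q++;
        const char *r = skip_digits(q);
        if (r == q) return 0;                        /* the exponent needs a digit */
        p = r;
    }
    return *p == '\0';                               /* the whole string was consumed */
}
```

All five examples above (3.14, -3., .23, 3e+4, 11.22e-3) are accepted, and so is a plain integer such as 42, since the simplified expression deliberately includes integers. A lone point or a bare exponent such as e5 is rejected.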
• String constants. A string constant starts with a quotation mark followed
by a sequence of symbols and finally another quotation mark.
• There are usually some restrictions on the symbols allowed between the
quotation marks. For example, line-feed characters or quotes are typically
not allowed, though these may be represented by special “escape”
sequences of other characters, such as "\n\n" for a string containing two
line-feeds.
• Allowing only alphanumeric characters and escape sequences made of a
backslash followed by a letter, string constants are described by:
"([a-zA-Z0-9]|\[a-zA-Z])*"
Nondeterministic finite automata (NFA)
• A finite automaton is, in the abstract sense, a machine that has a finite number of
states and a finite number of transitions between these. A transition between
states is usually labelled by a character from the input alphabet, but we will also
use transitions marked with ε, the so-called epsilon transitions.
When all characters from the input are read, we see if the current state
is marked as being accepting. If so, the string we have read from the
input is in the language defined by the automaton.
• We will mostly use a graphical notation to describe finite automata. States
are denoted by circles, possibly containing a number or name that
identifies the state.
• This name or number has, however, no operational significance; it is solely
used for identification purposes.
• Accepting states are denoted by using a double circle instead of a single
circle. The initial state is marked by an arrow pointing to it from outside the
automaton.
• A transition is denoted by an arrow connecting two states. Near its midpoint, the
arrow is labelled by the symbol (possibly ε) that triggers the transition. Note that
the arrow that marks the initial state is not a transition and is, hence, not marked
by a symbol.
• Figure 2.3 shows an example of a nondeterministic finite automaton having three
states. State 1 is the starting state and state 3 is accepting. There is an epsilon
transition from state 1 to state 2, transitions on the symbol a from state 2 to
states 1 and 3 and a transition on the symbol b from state 1 to state 3.
• This NFA recognises the language described by the regular expression a*(a|b).
• As an example, the string aab is recognised by the following sequence of
transitions: from state 1 to 2 on ε, from 2 to 1 on a, from 1 to 2 on ε, from 2 to 1
on a, and from 1 to 3 on b.
At the end of the input we are in state 3, which is accepting. Hence, the string is
accepted by the NFA. You can check this by placing a coin at the starting state and
following the transitions by moving the coin.
• A program that decides if a string is accepted by a given NFA will have to check all
possible paths to see if any of these accepts the string.
• This requires either backtracking until a successful path is found or simultaneously
following all possible paths, both of which are too time-consuming to make NFAs
suitable for efficient recognisers.
• We will, hence, use NFAs only as a stepping stone between regular expressions
and the more efficient DFAs. We use this stepping stone because it makes the
construction simpler than direct construction of a DFA from a regular expression.
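As an illustration of following all possible paths simultaneously, the sketch below simulates the three-state NFA described above (ε from 1 to 2, a from 2 to 1 and 3, b from 1 to 3, start state 1, accepting state 3) by keeping the set of states the NFA could currently be in. The bit-mask representation and the names state_set, epsilon_closure, step and nfa_accepts are assumptions made for this example. The transitions are hard-coded for this one NFA; a general simulator would read them from a table and iterate the closure until no new state is added.

```c
#include <assert.h>
#include <stddef.h>

/* A set of NFA states as a bit mask: bit k is set when state k is in the set. */
typedef unsigned state_set;
#define S(k) (1u << (k))

/* Adds every state reachable by epsilon-transitions. The only
   epsilon-transition in this NFA goes from state 1 to state 2,
   so a single pass is enough here. */
static state_set epsilon_closure(state_set set) {
    if (set & S(1)) set |= S(2);
    return set;
}

/* All states reachable from some state in set by one transition on c. */
static state_set step(state_set set, char c) {
    state_set next = 0;
    if (c == 'a' && (set & S(2))) next |= S(1) | S(3);   /* 2 -a-> 1 and 2 -a-> 3 */
    if (c == 'b' && (set & S(1))) next |= S(3);          /* 1 -b-> 3 */
    return next;
}

/* Returns 1 if some path through the NFA accepts w, 0 otherwise. */
int nfa_accepts(const char *w) {
    assert(w != NULL);
    state_set current = epsilon_closure(S(1));           /* start in state 1 */
    for (; *w != '\0'; w++)
        current = epsilon_closure(step(current, *w));
    return (current & S(3)) != 0;                        /* state 3 is accepting */
}
```

On the input aab the set of possible states goes {1,2}, {1,2,3}, {1,2,3}, {3}, so the string is accepted, in agreement with the regular expression a*(a|b). The empty string and ba are rejected.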
Converting a regular expression to an NFA
• We will construct an NFA compositionally from a regular expression,
i.e., we will construct the NFA for a composite regular expression
from the NFAs constructed from its subexpressions.
NFA for the regular expression (a|b)*ac
Optimisations
We can use the construction in figure 2.4 for any regular expression
by expanding out all shorthand, e.g. converting s+ to ss* , [0-9] to
0|1|2|···|9 and s? to s|ε, etc.
Note that the optimised NFA for [0-9]+, while optimised, is not optimal. You can
make an NFA for this language using only two states.
Example of Optimised NFA for [0-9]+
Deterministic finite automata
• We now turn to the deterministic finite automaton, or DFA for short. DFAs are
NFAs, but obey a number of additional restrictions:
• There are no epsilon-transitions.
• There may not be two identically labelled transitions out of the same state.
This means that we never have a choice of several next-states: The state
and the next input symbol uniquely determine the transition (or lack of
same).
This is why these automata are called deterministic. Figure 2.8 shows a DFA
equivalent to the NFA in figure 2.3
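Because the state and the next symbol determine the move, a DFA can be run with a single table lookup per input symbol. The sketch below uses a DFA for a*(a|b) that was derived by hand for this example from the NFA described earlier; it is not necessarily the automaton of figure 2.8. The state names S12, S123 and S3 (named after the sets of NFA states they stand for) and the function dfa_accepts are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* NONE marks a missing transition ("lack of same"). */
enum { S12, S123, S3, NUM_STATES, NONE = -1 };

/* next_state[state][0] is the move on a, next_state[state][1] the move on b. */
static const int next_state[NUM_STATES][2] = {
    /* S12  */ { S123, S3   },
    /* S123 */ { S123, S3   },
    /* S3   */ { NONE, NONE },
};
static const int accepting[NUM_STATES] = { 0, 1, 1 };

/* Returns 1 if the DFA accepts w, 0 otherwise. */
int dfa_accepts(const char *w) {
    assert(w != NULL);
    int state = S12;                                /* the single start state */
    for (; *w != '\0'; w++) {
        if (*w != 'a' && *w != 'b') return 0;       /* symbol outside the alphabet */
        state = next_state[state][*w == 'b'];       /* exactly one candidate move */
        if (state == NONE) return 0;                /* no transition: reject */
    }
    return accepting[state];
}
```

Unlike the NFA, there is never a set of candidate states to track: each input symbol costs one array access, which is why DFAs are preferred for efficient recognisers.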
Example of a DFA
Examples of Automata and Regular Expressions
Introduction
Question: Is there an algebraic formalism, other than automata, for
representing rational languages?
[Figure: automaton with states q0 and q1 and a transition labelled c]
Definition: The formalism of rational expressions is an algebraic
formalism for representing some of the languages built on a given
alphabet; these languages are called rational languages.
1- Regular expressions
Definition: Given an alphabet Σ, a regular expression (RE) is defined by:
RE                     Language                  Automaton
ε is a RE              L(ε) = {ε}                q0 --ε--> q1
a ∈ Σ is a RE          L(a) = {a}                q0 --a--> q1
If r and s are RE, then
- r|s is a RE          L(r|s) = L(r) ∪ L(s)      ?
- r s is a RE          L(r.s) = L(r).L(s)        ?
- r* is a RE           L(r*) = (L(r))*           ?
- (r) is a RE          L((r)) = L(r)             ?
Automaton of r|s:
[Figure: a new initial state q0 leading to the automaton of r (states q0,r, q1,r, …, qm,r) and to the automaton of s (states q0,s, q1,s, …, qn,s)]
Automaton of r.s:
[Figure: the automaton of r (states q0,r, q1,r, …, qm,r) chained to the automaton of s (states q0,s, q1,s, …, qn,s)]
Automaton of r*:
[Figure: the automaton of r (states q0,r, q1,r, …, qm,r) placed between a new initial state q0 and a state qf]
1.1- Algorithm for converting an RE to an automaton
Let A_r = <Σ_r, Q_r, q_{r,0}, δ_r, T_r> and A_s = <Σ_s, Q_s, q_{s,0}, δ_s, T_s> be the automata of r and s.
- The automaton for the regular expression ε is:
A_ε = <{}, {q0,q1}, q0, {(q0, ε, q1)}, {q1}>
- Automaton for r|s:
A_{r|s} = <Σ_r ∪ Σ_s, Q_r ∪ Q_s ∪ {q_{r|s,0}}, q_{r|s,0},
δ_r ∪ δ_s ∪ {(q_{r|s,0}, ε, q_{r,0}), (q_{r|s,0}, ε, q_{s,0})}, T_r ∪ T_s>
- Automaton for r.s:
A_{r.s} = <Σ_r ∪ Σ_s, Q_r ∪ Q_s, q_{r,0},
δ_r ∪ δ_s ∪ {(q, ε, q_{s,0}) | q ∈ T_r}, T_s>