Chapter 2 (Lexical Analysis)


Lexical Analysis

A lexical analyser, or lexer for short, takes a string of individual
letters as its input and divides this string into tokens. Additionally, it
filters out whatever separates the tokens (the so-called white-space),
i.e., lay-out characters (spaces, newlines etc.) and comments.

The main purpose of lexical analysis is to make life easier for the
subsequent syntax analysis phase.
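
As an illustration of dividing a string into tokens while filtering out white-space, here is a minimal sketch in C. The function name next_token and the token rule (a maximal run of letters, digits and underscores, or else a single character) are assumptions made for this example only; comments are not handled.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies the next token of *src into buf (capacity
   cap >= 1, longer tokens are truncated) and advances *src past it.
   Returns 1 if a token was found, 0 at the end of the input. */
int next_token(const char **src, char *buf, size_t cap) {
    const char *p = *src;
    size_t n = 0;

    while (isspace((unsigned char)*p)) p++;      /* filter out lay-out characters */
    if (*p == '\0') { *src = p; return 0; }

    if (isalnum((unsigned char)*p) || *p == '_') {
        /* a maximal run of letters, digits and underscores is one token */
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n + 1 < cap) buf[n++] = *p;
            p++;
        }
    } else {
        /* any other character is treated as a one-character token */
        if (n + 1 < cap) buf[n++] = *p;
        p++;
    }
    buf[n] = '\0';
    *src = p;
    return 1;
}
```

Calling next_token repeatedly on "  x1 = 42 ;" yields the tokens x1, =, 42 and ; in that order; the parser then never sees the spaces.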


• The work that is done during lexical analysis can be made an integral
part of syntax analysis, and in simple systems this is indeed often
done. However, there are reasons for keeping the phases separate:

- Efficiency: A lexer may do the simple parts of the work faster than the
more general parser can. Furthermore, the size of a system that is split in
two may be smaller than a combined system.

- Tradition: Languages are often designed with separate lexical and syntactic
phases in mind, and the standard documents of such languages typically
separate lexical and syntactic elements of the languages.

• For lexical analysis, specifications are traditionally written using
regular expressions: an algebraic notation for describing sets of
strings.

• The generated lexers are in a class of extremely simple programs
called finite automata.

Example of Natural language
1- Introduction: Structure of natural language
Three-level approach:
- Alphabet of letters: Σ = {letters} = {a, b, …, z}
- Vocabulary of words = {words}: words are built by concatenation of letters.
  Example: cedar is the concatenation of c, e, d, a and r.
- Language = {sentences}: sentences are constructed according to rules.
  Example: with the rule "subject verb complement", "the computer makes calculations" is a correct sentence.
  French language = {sentences}
- Usefulness of the language: dialogue between people

Example of Formal languages
Computer languages = programming languages

Building a programming language: 3-level approach
- Alphabet of symbols: choose an alphabet Σ = {symbols}
  Example: Σ = {a, …, z, 0, …, 9, $, -, _, …} = {keyboard characters}
- Vocabulary of words = {words}: words are built by concatenation of symbols (language of words, keywords)
  Examples of words: int, char, scanf, printf, for
  Keyword language = {int, char, scanf, printf, for, …}
- Instruction language = {instructions}: instructions are constructed according to rules (instruction language = programming language)
  Sample rules: int identifier; printf("format", identifier);
  Example statements: int i; printf("format", i);
- Usefulness of programming languages: dialogue between a person and a computer

Computer languages = formal languages

Construction of a formal language: 3-level approach
- Alphabet of symbols: choose an alphabet Σ = {symbols}
  Example: Σ = {0, 1}
- Construction of words: a single operation, the concatenation (written .)
  0.1 = 01
  0.0.1.0 = 0010
  0.ε = 0 (the empty word ε is the neutral element for concatenation)
- Language of words = {words}: words are built by concatenation of symbols
  Example: the word 100 is obtained by concatenating 1 with 0 with 0.
  The word 110 is obtained by concatenating 1 with 1 with 0.

Language of binary numbers = {0, 1, 00, 01, 10, 11, 000, …}

Another example:
alphabet: Σ = {a, b}
words: a.b.a = aba
a.ε = a
Definition: A word is obtained by concatenating symbols of the alphabet.

Definition: The free monoid Σ* on an alphabet Σ is the set of all words
obtained by concatenating symbols of the alphabet, including the empty word ε.

Example: Σ = {a, b}
Σ* = {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, …}

Definition: A language L on an alphabet Σ is a subset of Σ*: L ⊆ Σ*.

Example: Σ = {a, b}
L1 = {ε, a, aa, aaa, …}: the set of words built from zero or more occurrences of a
L2 = {ε, ab, abab, ababab, …}: the set of words built from zero or more occurrences of ab

2- Operations on words:

Definition: The length of a word ω, written |ω|, is the number of symbols that
constitute it.
Example: |abbab| = 5

We can also count the occurrences of a particular symbol in a word: |abbab|_a = 2

Note: Concatenation is not commutative: ab ≠ ba

Definition: The mirror image of a word m, written m~, is obtained by reversing the order of the
symbols of m.
Example: m = abaaab and m~ = baaaba

Exercise: Let Σ = {a, b} and let L be the set of words ω of the form ω = m m~.
Give eight words of the language L.
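
The mirror image and the form m m~ can be checked mechanically. The sketch below is only an illustration; the function names mirror and has_form_m_mirror are made up for this example. It relies on the observation that a word has the form m m~ exactly when its length is even and it reads the same in both directions.

```c
#include <stddef.h>
#include <string.h>

/* Writes the mirror image of m into out.
   Assumption: out has room for strlen(m) + 1 characters. */
void mirror(const char *m, char *out) {
    size_t n = strlen(m);
    for (size_t i = 0; i < n; i++)
        out[i] = m[n - 1 - i];
    out[n] = '\0';
}

/* Returns 1 if w = m m~ for some word m, 0 otherwise. */
int has_form_m_mirror(const char *w) {
    size_t n = strlen(w);
    if (n % 2 != 0) return 0;              /* m m~ always has even length */
    for (size_t i = 0; i < n / 2; i++)
        if (w[i] != w[n - 1 - i]) return 0; /* second half must mirror the first */
    return 1;
}
```

For the word of the example above, mirror("abaaab", out) stores baaaba in out.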

Steps: alphabet → concatenation → word → set of words → language

- Σ* is the set of all words
- Each language L on Σ satisfies L ⊆ Σ*

3- Operations on languages:

complement: L̄ = {m | m ∈ Σ* and m ∉ L}
union: L' ∪ L'' = {m | m ∈ L' or m ∈ L''}
intersection: L' ∩ L'' = {m | m ∈ L' and m ∈ L''}
product / concatenation: L'.L'' = {m | m = m'm'' such that m' ∈ L' and m'' ∈ L''}
Cartesian product: L' × L'' = {(m', m'') | m' ∈ L' and m'' ∈ L''}
power: L^i = {m | m = m1 … mi such that m1 ∈ L, …, mi ∈ L}
star: L* = {ε} ∪ L ∪ L^2 ∪ … ∪ L^i ∪ …


Exercise 1: Let Σ = {a, b} and let L = {a^n b^m | n ≥ 0 and m ≥ 0} be a language.
Question 1.1: Compare L with Σ*.

Exercise 2: Let Σ = {a, b} and let L1 = {a^n b^m | n ≥ 0 and m ≥ 0} and
L2 = {(ab)^n | n ≥ 0} be two languages.
Question 2.1: Give five words of L1 and five words of L2.
Question 2.2: Determine L1 ∩ L2.
Question 2.3: Compare L1 and L2.

Exercise 3: Let Σ = {a, b} and let L be a language. Prove that (L*)* = L*.

4- Membership problem:
Given a language L and a word m, how can we test whether m ∈ L?
Difficulty: L is an infinite language, so L cannot be stored in an
in-memory data structure.
Solution: Describe the infinite language by a finite and
programmable formalism.
[Diagram: from the language L, construct an automaton A such that L(A) = L; given a word m as input, A answers m ∈ L or m ∉ L.]

5- The automaton

Definition: An automaton is an abstract machine with states which
recognizes words composed of symbols of an alphabet.

Definition: An automaton is a 5-tuple A = <Σ, Q, q0, F, δ> where:
Σ is the alphabet
Q is the set of states
q0 is the initial state
F is the set of terminal states
δ is the transition function, δ: Q × Σ → Q

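Example: A = <Σ = {a, b}, Q = {q0, q1}, q0, F = {q1}, δ = {δ(q0, a) = q0, δ(q0, b) = q1}>
The transition function can be represented by a table:
δ    a    b
q0   q0   q1

The transition function can be represented by a graph:
[Figure: q0 has a loop labelled a and an arrow labelled b to q1.]

The automaton can be represented by the same graph with one entry (the
initial state) and one or more outputs (the terminal states):
[Figure: the same graph, with q0 marked as the entry and q1 marked as an output.]
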
6- Words recognized by an automaton:
Extension of δ

δ is defined on the symbols of Σ:
δ: Q × Σ → Q

Question: How can δ be extended to the words of Σ*?
δ*: Q × Σ* → Q
For every q ∈ Q: δ*(q, ε) = q, and for every m ∈ Σ* with m = xm', x ∈ Σ and m' ∈ Σ*:
δ*(q, m) = δ*(q, xm') = δ*(δ(q, x), m')

Example: δ*(q0, aab) = δ*(δ(q0, a), ab) = δ*(q0, ab) = δ*(δ(q0, a), b) = δ*(q0, b) = δ*(δ(q0, b), ε) = δ*(q1, ε) = q1

As q1 ∈ F, aab is recognized by the automaton, i.e. aab ∈ L(A).

The language recognized by the automaton A is:
L(A) = {m | m ∈ Σ* and δ*(q0, m) ∈ F}.

Example: L(A) = {b, ab, aab, aaab, …} = {a^n b | n ≥ 0}
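
The extension of δ to words can be sketched as a loop over a transition table. The C code below is a minimal illustration for the example automaton with L(A) = {a^n b | n ≥ 0}; the names delta, delta_star, accepts and the marker DEAD (used where δ is undefined) are assumptions of this sketch.

```c
#include <stddef.h>

enum { Q0, Q1, N_STATES, DEAD = -1 };   /* DEAD marks "no transition defined" */

/* delta[state][symbol]: column 0 is the symbol a, column 1 is the symbol b */
static const int delta[N_STATES][2] = {
    /* q0 */ { Q0,   Q1   },
    /* q1 */ { DEAD, DEAD },
};

static const int is_terminal[N_STATES] = { 0, 1 };   /* F = {q1} */

/* Extended transition function: applies delta once per symbol of m. */
int delta_star(int q, const char *m) {
    for (; *m != '\0'; m++) {
        if (q == DEAD || (*m != 'a' && *m != 'b')) return DEAD;
        q = delta[q][*m - 'a'];
    }
    return q;
}

/* m is recognized when delta_star(q0, m) is a terminal state. */
int accepts(const char *m) {
    int q = delta_star(Q0, m);
    return q != DEAD && is_terminal[q];
}
```

With this table, accepts("aab") returns 1, matching the hand computation above, while accepts("ba") returns 0 because no transition leaves q1.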


Example: A = <Σ = {a, b, c}, Q = {q0, q1}, q0, F = {q0, q1}, δ = {δ(q0, a) = q0, δ(q0, c) = q1,
δ(q1, b) = q1}>
The transition function can be represented by a table (- means no transition):
δ    a    b    c
q0   q0   -    q1
q1   -    q1   -

The transition function can be represented by a graph:
[Figure: q0 has a loop labelled a and an arrow labelled c to q1; q1 has a loop labelled b.]

The automaton can be represented with the same graph with one entry and
many outputs.

The words recognized by the automaton A
[Figure: q0 has a loop labelled a and an arrow labelled c to q1; q1 has a loop labelled b.]
are:
- the empty word ε, because q0 is both the initial state and a terminal state
- the words without the letter b: ac, aac, aaac, …
- the words without the letter a: cb, cbb, cbbb, …
- the words with the letters a and b:
  acb, acbb, acbbb, …
  aacb, aacbb, aacbbb, …

L(A) = {a^n c b^m | n ≥ 0, m ≥ 0}

Definition: The language L(A) recognized by an automaton A is called a rational language.

Example: A = <Σ = {a, b, c}, Q = {q0, q1}, q0, F = {q0, q1}, δ = {δ(q0, a) = q0, δ(q0, b) = q0,
δ(q0, c) = q1}>:
[Figure: q0 has loops labelled a and b and an arrow labelled c to q1.]

The words recognized by the automaton A are:
c, ac, bc, aac, abc, bac, bbc, aaac, aabc, abac, baac, abbc, babc, bbac, …

7- Implementation of the membership test:

[Diagram: from the language L, construct an automaton A such that L(A) = L; a program implementing A takes a word m as input and answers m ∈ L or m ∉ L.]

Problem: How, in particular, can the transition function be implemented?
δ*(q, m) = q'
How can the full automaton be implemented?

δ*: Q × Σ* → Q
For m ∈ Σ* with m = xm', x ∈ Σ, m' ∈ Σ* and q ∈ Q:
δ*(q, m) = δ*(q, xm') = δ*(δ(q, x), m')

Implementation of the automaton:

The variables of the program:
- The automaton tests whether an input word belongs to a language: word variable
- The automaton performs the transitions δ(q, a) = q': state variable and symbol variable

Initialization of the automaton:
- word input: scanf(word)
- initialization of the symbol variable: symbol = word[0]
- initialization of the state variable: state = q0
General scheme:
/** Implementation of δ* **/
Repeat
  /** implementation of δ **/
  switch state of
    case q0: transitions over q0
    case q1: transitions over q1
    …
    case qn: transitions over qn

The same scheme with the transitions over q0 written out:
/** Implementation of δ* **/
Repeat
  /** implementation of δ **/
  switch state of
    case q0: if (*symbol == 'a') {state = δ(q0, a); break;}
             else if (*symbol == 'b') {state = δ(q0, b); break;}
             else {printf("the word is not in the language"); exit(1);}
    case q1: transitions over q1
    …
    case qn: transitions over qn

Example: Write the program of the automaton
[Figure: q0 has a loop labelled a and an arrow labelled c to q1; q1 has a loop labelled b.]

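#include <stdio.h>
#include <stdlib.h>

enum { q0, q1 };    /* the two states of the automaton */

char word[30];      /* input word */
char *symbol;       /* points to the current symbol of the word */
int state;          /* current state */

int main(void) {
    if (scanf("%29s", word) != 1) word[0] = '\0';   /* no input: empty word */
    symbol = word;
    state = q0;
    while (1) {
        switch (state) {
        case q0: if (*symbol == 'a') { state = q0; break; }
                 else if (*symbol == 'c') { state = q1; break; }
                 else { printf("Lexical error\n"); exit(1); }
        case q1: if (*symbol == 'b') { state = q1; break; }
                 else if (*symbol == '\0') { printf("The word is in the language\n"); exit(0); }
                 else { printf("Lexical error\n"); exit(1); }
        }
        symbol++;   /* go to the next symbol */
    }
}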
Regular expressions

The set of all integer constants or the set of all variable names are
sets of strings, where the individual letters are taken from a particular
alphabet. Such a set of strings is called a language.

For integers, the alphabet consists of the digits 0-9 and for variable
names the alphabet contains both letters and digits.


Given an alphabet, we will describe sets of strings by regular
expressions, an algebraic notation that is compact and easy for
humans to use and understand.

The idea is that regular expressions that describe simple sets of
strings can be combined to form regular expressions that describe
more complex sets of strings.


When talking about regular expressions, we will use the letters (r, s
and t) in italics to denote unspecified regular expressions. When
letters stand for themselves (i.e., in regular expressions that describe
strings that use these letters) we will use typewriter font, e.g., a or b.

A single letter describes the language that has the one-letter string
consisting of that letter as its only element.

The symbol ε (the Greek letter epsilon) describes the language that consists
solely of the empty string. Note that this is not the empty set of strings.

s|t (pronounced “s or t”) describes the union of the languages described by s and
t.

st (pronounced “s t”) describes the concatenation of the languages L(s) and L(t),
i.e., the sets of strings obtained by taking a string from L(s) and putting this in
front of a string from L(t).

For example, if L(s) is {“a”, “b”} and L(t) is {“c”, “d”}, then L(st) is the set {“ac”, “ad”,
“bc”, “bd”}.
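
For finite languages, the concatenation L(st) can be computed by pairing every string of L(s) with every string of L(t). The C sketch below illustrates this; the function name concat_languages and the bound WORD_MAX are assumptions of the example, and duplicates are not removed, so the result is a list rather than a true set.

```c
#include <stddef.h>
#include <stdio.h>

#define WORD_MAX 32   /* assumed upper bound on the length of a result, including '\0' */

/* Stores every string xy with x taken from ls and y taken from lt in out
   and returns how many strings were written.
   Assumption: out has room for ns * nt entries. */
size_t concat_languages(const char *const ls[], size_t ns,
                        const char *const lt[], size_t nt,
                        char out[][WORD_MAX]) {
    size_t k = 0;
    for (size_t i = 0; i < ns; i++)
        for (size_t j = 0; j < nt; j++)
            snprintf(out[k++], WORD_MAX, "%s%s", ls[i], lt[j]);
    return k;
}
```

With ls = {"a", "b"} and lt = {"c", "d"}, the function fills out with ac, ad, bc, bd, as in the example above.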
• The language for s* (pronounced “s star”) is described recursively: It consists of
the empty string plus whatever can be obtained by concatenating a string from
L(s) to a string from L(s*).
• This is equivalent to saying that L(s*) consists of strings that can be obtained by
concatenating zero or more (possibly different) strings from L(s).
• If, for example, L(s) is {“a”, “b”} then L(s*) is {“”, “a”, “b”, “aa”, “ab”, “ba”, “bb”,
“aaa”, . . . }, i.e., any string (including the empty) that consists entirely of a's and
b's.

Precedence rules
• When we combine different constructor symbols, e.g., in the regular expression a|ab*, it is
not a priori clear how the different subexpressions are grouped. We can use
parentheses to make the grouping of symbols explicit such as in (a|(ab))*.

• Additionally, we use precedence rules, similar to the algebraic convention that
3 + 4 * 5 means 3 added to the product of 4 and 5 and not multiplying the sum of 3
and 4 by 5.

• For regular expressions, we use the following conventions: * binds tighter than
concatenation, which binds tighter than alternative (|). The example a|ab* from
above is, hence, equivalent to a|(a(b*)).

• The | operator is associative and commutative (as it corresponds to
set union, which has these properties).

• Concatenation is associative (but obviously not commutative) and
distributes over |.

Short hands

If we want to describe non-negative integer constants, we can do so
by saying that it is one or more digits, which is expressed by the
regular expression (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*.

It gets even worse when we get to variable names, where we must
enumerate all alphabetic letters.

We introduce a shorthand for sets of letters. Sequences of letters within
square brackets represent the set of these letters. For example, we use
[ab01] as a shorthand for a|b|0|1.

Additionally, we can use interval notation to abbreviate [0123456789] to
[0-9].

We can combine several intervals within one bracket and for example write
[a-zA-Z] to denote all alphabetic letters in both lower and upper case.

• Getting back to the example of integer constants above, we can now write this
much shorter as [0-9][0-9]*. Since s* denotes zero or more occurrences of s, we
needed to write the set of digits twice to describe that one or more digits are
allowed.
• Such non-zero repetition is quite common, so we introduce another shorthand,
s+, to denote one or more occurrences of s. With this notation, we can abbreviate
our description of integers to [0-9]+.
• On a similar note, it is common that we can have zero or one occurrence of
something (e.g., an optional sign to a number). Hence we introduce the
shorthand s? for s|ε.
• + and ? bind with the same precedence as *.
Properties of Regular expression

Examples
• Keywords. A keyword like if is described by a regular expression that
looks exactly like that keyword, e.g., the regular expression if (which is
the concatenation of the two regular expressions i and f).

• Variable names. In the programming language C, a variable name
consists of letters, digits and the underscore symbol and it must begin
with a letter or underscore. This can be described by the regular
expression [a-zA-Z_][a-zA-Z_0-9]*.
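
The variable-name expression can be checked by hand-written code that mirrors its two parts: one test for the first character and a loop for the rest. This is a minimal sketch; the helper names are invented for the example, and the explicit ranges assume an ASCII-compatible character set.

```c
/* [a-zA-Z_] */
static int is_letter_or_underscore(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

/* [0-9] */
static int is_digit(char c) {
    return c >= '0' && c <= '9';
}

/* Returns 1 if the whole string s matches [a-zA-Z_][a-zA-Z_0-9]*, else 0. */
int is_variable_name(const char *s) {
    if (!is_letter_or_underscore(*s)) return 0;    /* also rejects the empty string */
    for (s++; *s != '\0'; s++)
        if (!is_letter_or_underscore(*s) && !is_digit(*s)) return 0;
    return 1;
}
```

For instance, is_variable_name("_tmp1") returns 1 and is_variable_name("1abc") returns 0. Note that the expression also matches keywords such as if; telling keywords and names apart needs an additional rule.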

• Integers. An integer constant is an optional sign followed by a non-empty
sequence of digits: [+-]?[0-9]+. In some languages, the sign is a separate
symbol and not part of the constant itself. This will allow whitespace
between the sign and the number, which is not possible with the above.

• Floats. A floating-point constant can have an optional sign. After this, the
mantissa part is described as a sequence of digits followed by a decimal
point and then another sequence of digits.

• Finally, there is an optional exponent part, which is the letter e (in upper or lower
case) followed by an (optionally signed) integer constant.
• If there is an exponent part to the constant, the mantissa part can be written as an
integer constant (i.e., without the decimal point). Some examples: 3.14, -3., .23, 3e+4,
11.22e-3.
• We can make the description simpler if we make the regular expression for floats
also include integers, and instead use other means of distinguishing integers from
floats.
• If we do this, the regular expression can be simplified to (the dot stands for a literal decimal point):
[+-]?(([0-9]+(.[0-9]*)?|.[0-9]+)([eE][+-]?[0-9]+)?)
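
The integer and float expressions can be tried out with the POSIX regular expression functions regcomp, regexec and regfree from <regex.h>. This is a sketch under the assumption of a POSIX system; the helper full_match and the two pattern constants are names chosen for the example. The patterns are anchored with ^ and $ so that the whole string must match, and the decimal point is escaped because an unescaped dot matches any character in POSIX syntax.

```c
#include <regex.h>
#include <stddef.h>

/* [+-]?[0-9]+ */
static const char INTEGER_RE[] = "^[+-]?[0-9]+$";

/* [+-]?(([0-9]+(.[0-9]*)?|.[0-9]+)([eE][+-]?[0-9]+)?) with a literal dot */
static const char FLOAT_RE[] =
    "^[+-]?(([0-9]+(\\.[0-9]*)?|\\.[0-9]+)([eE][+-]?[0-9]+)?)$";

/* Returns 1 if the whole string s matches the extended regular expression
   pattern, 0 if it does not, and -1 if the pattern cannot be compiled. */
int full_match(const char *pattern, const char *s) {
    regex_t re;
    int rc;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

All five examples from the text (3.14, -3., .23, 3e+4, 11.22e-3) match FLOAT_RE. Since the simplified float expression includes integers, 42 matches it as well, whereas a lone dot does not.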
• String constants. A string constant starts with a quotation mark followed
by a sequence of symbols and finally another quotation mark.
• There are usually some restrictions on the symbols allowed between the
quotation marks. For example, line-feed characters or quotes are typically
not allowed, though these may be represented by special “escape”
sequences of other characters, such as "\n\n" for a string containing two
line-feeds.
"([a-zA-Z0-9]|\[a-zA-Z])*"

Nondeterministic finite automata (NFA)
• A finite automaton is, in the abstract sense, a machine that has a finite number of
states and a finite number of transitions between these. A transition between
states is usually labelled by a character from the input alphabet, but we will also
use transitions marked with ε, the so-called epsilon transitions.

• A finite automaton can be used to decide if an input string is a member in some
particular set of strings. To do this, we select one of the states of the automaton
as the starting state. We start in this state and in each step, we can do one of the
following:
  - Follow an epsilon transition to another state, or
  - Read a character from the input and follow a transition labelled by
    that character.

When all characters from the input are read, we see if the current state
is marked as being accepting. If so, the string we have read from the
input is in the language defined by the automaton.

• We may have a choice of several actions at each step: We can choose
between either an epsilon transition or a transition on an alphabet
character, and if there are several transitions with the same symbol,
we can choose between these.

• This makes the automaton nondeterministic, as the choice of action
is not determined solely by looking at the current state and input.


• Definition 2.1 A nondeterministic finite automaton consists of a set S of states.
One of these states, s0 ∈ S, is called the starting state of the automaton and a
subset F ⊆ S of the states are accepting states. Additionally, we have a set T of
transitions.
• Each transition t connects a pair of states s1 and s2 and is labelled with a symbol,
which is either a character c from the alphabet Σ, or the symbol ε, which indicates
an epsilon-transition.
• A transition from state s to state t on the symbol c is written as s^c t.

• We will mostly use a graphical notation to describe finite automata. States
are denoted by circles, possibly containing a number or name that
identifies the state.
• This name or number has, however, no operational significance, it is solely
used for identification purposes.
• Accepting states are denoted by using a double circle instead of a single
circle. The initial state is marked by an arrow pointing to it from outside the
automaton.
• A transition is denoted by an arrow connecting two states. Near its midpoint, the
arrow is labelled by the symbol (possibly ε) that triggers the transition. Note that
the arrow that marks the initial state is not a transition and is, hence, not marked
by a symbol.
• Figure 2.3 shows an example of a nondeterministic finite automaton having three
states. State 1 is the starting state and state 3 is accepting. There is an epsilon
transition from state 1 to state 2, transitions on the symbol a from state 2 to
states 1 and 3 and a transition on the symbol b from state 1 to state 3.
• This NFA recognises the language described by the regular expression a*(a|b).
• As an example, the string aab is recognised by the following sequence of
transitions: 1 -ε-> 2 -a-> 1 -ε-> 2 -a-> 1 -b-> 3
At the end of the input we are in state 3, which is accepting. Hence, the string is
accepted by the NFA. You can check this by placing a coin at the starting state and
following the transitions by moving the coin.


• If, in the example above, we had chosen to follow the a-transition to
state 3 instead of state 1, we would have been stuck: We would have
no legal transition and yet we would not be at the end of the input.

• But, as previously stated, it is enough that there exists a path
leading to acceptance, so the string aab is still accepted.

• A program that decides if a string is accepted by a given NFA will have to check all
possible paths to see if any of these accepts the string.
• This requires either backtracking until a successful path is found or simultaneously
following all possible paths, both of which are too time-consuming to make NFAs
suitable for efficient recognisers.
• We will, hence, use NFAs only as a stepping stone between regular expressions
and the more efficient DFAs. We use this stepping stone because it makes the
construction simpler than direct construction of a DFA from a regular expression.
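
To make "simultaneously following all possible paths" concrete, here is a small C sketch that simulates the three-state NFA described for Figure 2.3 by keeping the set of states the automaton could currently be in. The set is stored as a bit mask (bit i stands for state i+1); the table and function names are assumptions of this example.

```c
enum { NFA_STATES = 3 };

/* Transitions of the NFA as described in the text, one bit mask of target
   states per source state (index 0 is state 1). */
static const unsigned EPS[NFA_STATES]  = { 1u << 1, 0, 0 };                /* 1 -ε-> 2          */
static const unsigned ON_A[NFA_STATES] = { 0, (1u << 0) | (1u << 2), 0 };  /* 2 -a-> 1, 2 -a-> 3 */
static const unsigned ON_B[NFA_STATES] = { 1u << 2, 0, 0 };                /* 1 -b-> 3          */

/* Adds every state reachable through epsilon transitions. */
static unsigned closure(unsigned set) {
    unsigned prev;
    do {
        prev = set;
        for (int i = 0; i < NFA_STATES; i++)
            if (set & (1u << i)) set |= EPS[i];
    } while (set != prev);
    return set;
}

/* Follows the transitions of one table from every state in the set. */
static unsigned step(unsigned set, const unsigned table[NFA_STATES]) {
    unsigned next = 0;
    for (int i = 0; i < NFA_STATES; i++)
        if (set & (1u << i)) next |= table[i];
    return next;
}

/* Returns 1 if some path through the NFA accepts w, 0 otherwise. */
int nfa_accepts(const char *w) {
    unsigned cur = closure(1u << 0);            /* start in state 1 */
    for (; *w != '\0'; w++) {
        if (*w == 'a')      cur = step(cur, ON_A);
        else if (*w == 'b') cur = step(cur, ON_B);
        else return 0;                          /* symbol outside the alphabet */
        cur = closure(cur);
    }
    return (cur & (1u << 2)) != 0;              /* is state 3 in the set? */
}
```

The string aab is accepted without any backtracking: the stuck path through state 3 simply drops out of the set. The price is that every input symbol costs work proportional to the number of states and transitions, which is the overhead a DFA avoids.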

Converting a regular expression to an NFA
• We will construct an NFA compositionally from a regular expression,
i.e., we will construct the NFA for a composite regular expression
from the NFAs constructed from its subexpressions.

• To be precise, we will from each subexpression construct an NFA
fragment and then combine these fragments into bigger fragments.

• A fragment is not a complete NFA, so we complete the construction
by adding the necessary components to make a complete NFA.

NFA for the regular expression (a|b)*ac

Optimisations
We can use the construction in figure 2.4 for any regular expression
by expanding out all shorthand, e.g. converting s+ to ss* , [0-9] to
0|1|2|···|9 and s? to s|ε, etc.

The optimised constructions are shown in figure 2.6. As an example,
an NFA for [0-9]+ is shown in figure 2.7.

Note that while this is optimised, it is not optimal. You can make an
NFA for this language using only two states.

Example of Optimised NFA for [0-9]+

Deterministic finite automata

• Nondeterministic automata are, as mentioned earlier, not quite as
close to “the machine” as we would like. Hence, we now introduce a
more restricted form of finite automaton: the deterministic finite
automaton, or DFA for short. DFAs are NFAs, but obey a number of
additional restrictions:

• There are no epsilon-transitions.
• There may not be two identically labelled transitions out of the same state.
This means that we never have a choice of several next-states: The state
and the next input symbol uniquely determine the transition (or lack of
same).
This is why these automata are called deterministic. Figure 2.8 shows a DFA
equivalent to the NFA in figure 2.3.
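
Because state and input symbol determine the next state uniquely, a DFA can be run with one table lookup per symbol. The C sketch below uses a DFA for a*(a|b) that was worked out by hand from the NFA described for Figure 2.3, by grouping the NFA states that can be reached together; this is an assumption of the example, and the states and their names may differ from those drawn in Figure 2.8.

```c
/* Each DFA state stands for a set of NFA states: {1,2}, {1,2,3} and {3}.
   STUCK marks a missing transition. */
enum { S12, S123, S3, DFA_STATES, STUCK = -1 };

/* dfa_delta[state][symbol]: column 0 is the symbol a, column 1 is the symbol b */
static const int dfa_delta[DFA_STATES][2] = {
    /* {1,2}   */ { S123,  S3    },
    /* {1,2,3} */ { S123,  S3    },
    /* {3}     */ { STUCK, STUCK },
};

/* A DFA state is accepting when its set contains NFA state 3. */
static const int dfa_accepting[DFA_STATES] = { 0, 1, 1 };

/* Returns 1 if w is accepted, 0 otherwise; no choices, no backtracking. */
int dfa_accepts(const char *w) {
    int s = S12;                                /* starting state */
    for (; *w != '\0'; w++) {
        if (*w != 'a' && *w != 'b') return 0;   /* symbol outside the alphabet */
        s = dfa_delta[s][*w - 'a'];
        if (s == STUCK) return 0;               /* lack of a transition */
    }
    return dfa_accepting[s];
}
```

On aab the run is {1,2}, {1,2,3}, {1,2,3}, {3}, ending in an accepting state after exactly three lookups.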

Example of a DFA

Examples of Automata and Regular Expressions
Introduction
Question: Does an algebraic formalism exist to represent rational
languages, other than the formalism of automata?

Example: L(A) = {a^n c b^m | n ≥ 0, m ≥ 0}. How can L be represented?

Answer: L can be represented by the regular expression a*cb*,
in addition to the representation by the automaton:
[Figure: q0 has a loop labelled a and an arrow labelled c to q1; q1 has a loop labelled b.]

What is a regular expression?

Definition: The formalism of rational (regular) expressions is an algebraic
formalism for representing some of the languages built on a given
alphabet; the languages that can be represented this way are called rational languages.

Example: Let Σ = {a, b} be an alphabet.
The language L1 = {a^n b^m | n ≥ 0 and m ≥ 0} is represented by the regular
expression a*b*.
The language L2 = {(ab)^n | n ≥ 0} is represented by the regular expression
(ab)*.
The language L3 = Σ* is represented by the regular expression (a|b)*.
The language L4 = {a^n b^n | n ≥ 0} cannot be represented by a regular
expression.
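
One way to see why L4 is different: recognizing a^n b^n requires remembering how many a's were read, and n has no upper bound, whereas a finite automaton only has a fixed number of states. The C sketch below (the function name is_anbn is an assumption of the example) recognizes L4 with a counter, which is exactly the kind of unbounded memory that regular expressions and finite automata do not have.

```c
#include <stddef.h>

/* Returns 1 if w = a^n b^n for some n >= 0, 0 otherwise. */
int is_anbn(const char *w) {
    size_t n = 0;
    while (*w == 'a') { n++; w++; }            /* count the leading a's     */
    while (*w == 'b' && n > 0) { n--; w++; }   /* cancel one a for each b   */
    return *w == '\0' && n == 0;               /* nothing left, counts agree */
}
```

For example, is_anbn("aabb") returns 1, while is_anbn("aab") and is_anbn("abb") return 0. By contrast, L1 = a*b* needs no counter at all.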

How to define a regular expression?

1- Regular expressions
Definition: Let Σ be an alphabet. A regular expression (RE) is defined by:
RE                  Language                  Automaton
ε is a RE           L(ε) = {ε}                q0 -ε-> q1
a ∈ Σ is a RE       L(a) = {a}                q0 -a-> q1
If r and s are REs, then:
r | s is a RE       L(r|s) = L(r) ∪ L(s)      ?
r s is a RE         L(r.s) = L(r).L(s)        ?
r* is a RE          L(r*) = (L(r))*           ?
(r) is a RE         L((r)) = L(r)             ?

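Automaton of r|s:
[Figure: a new initial state q0 has an ε-transition to q0,s, the initial state of the automaton of s (states q0,s, q1,s, …, qn,s), and an ε-transition to q0,r, the initial state of the automaton of r (states q0,r, q1,r, …, qm,r).]
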
Automaton of r.s:
[Figure: the automata of the two subexpressions are chained: ε-transitions lead from the terminal states of the first automaton to the initial state of the second.]

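Automaton of r*:
[Figure: a new initial state q0 has an ε-transition to q0,r, the initial state of the automaton of r (states q0,r, q1,r, …, qm,r); ε-transitions lead from the terminal states of the automaton of r to a new terminal state qf; q0 and qf are connected by ε-transitions.]
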
1.1- Algorithm for converting an RE into an automaton
Automata are written here as A = <Σ, Q, q0, δ, T>, where δ is a set of transitions (q, x, q') and T is the set of terminal states.

- The automaton for the regular expression ε is:
A = <{ε}, {q0, q1}, q0, {(q0, ε, q1)}, {q1}>

- The automaton for the regular expression a is:
A = <{a}, {q0, q1}, q0, {(q0, a, q1)}, {q1}>

- The automaton for r | s:
A_r = <Σ_r, Q_r, q_{r,0}, δ_r, T_r>   A_s = <Σ_s, Q_s, q_{s,0}, δ_s, T_s>

A_{r|s} = <Σ_r ∪ Σ_s, Q_r ∪ Q_s ∪ {q_{r|s,0}}, q_{r|s,0},
δ_r ∪ δ_s ∪ {(q_{r|s,0}, ε, q_{r,0}), (q_{r|s,0}, ε, q_{s,0})}, T_r ∪ T_s>

- The automaton for r . s:
A_r = <Σ_r, Q_r, q_{r,0}, δ_r, T_r>   A_s = <Σ_s, Q_s, q_{s,0}, δ_s, T_s>

A_{r.s} = <Σ_r ∪ Σ_s, Q_r ∪ Q_s, q_{r,0},
δ_r ∪ δ_s ∪ {(q, ε, q_{s,0}) | q ∈ T_r}, T_s>

- The automaton for r*:
A_r = <Σ_r, Q_r, q_{r,0}, δ_r, T_r>

A_{r*} = <Σ_r, Q_r ∪ {q_{r*,0}, q_{r*,f}}, q_{r*,0},
δ_r ∪ {(q_{r*,0}, ε, q_{r,0})} ∪ {(q, ε, q_{r*,f}) | q ∈ T_r} ∪
{(q_{r*,0}, ε, q_{r*,f}), (q_{r*,f}, ε, q_{r*,0})}, {q_{r*,f}}>
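
The constructions above can be sketched in C as functions that combine fragments. Note that this sketch is a simplified variant and not the exact tuples given above: every fragment has a single entry state and a single exit state (so the union adds a new exit state instead of keeping T_r ∪ T_s), transitions are kept in one global list, and at most 64 states are supported so that sets of states fit in a bit mask. All names (Frag, frag_symbol, frag_union, frag_concat, frag_star, frag_matches, build_example) are made up for the illustration.

```c
#include <stddef.h>

#define MAX_TRANS 256
#define EPS '\0'                    /* label used for ε-transitions */

typedef struct { int from; char label; int to; } Trans;
typedef struct { int start; int accept; } Frag;   /* one entry, one exit */

static Trans trans[MAX_TRANS];      /* global transition list (sketch: no bounds checks) */
static int n_trans = 0;
static int n_states = 0;            /* must stay <= 64 for the bit-mask simulation */

static int new_state(void) { return n_states++; }

static void add_trans(int from, char label, int to) {
    trans[n_trans].from = from;
    trans[n_trans].label = label;
    trans[n_trans].to = to;
    n_trans++;
}

static Frag new_frag(void) {
    Frag f;
    f.start = new_state();
    f.accept = new_state();
    return f;
}

/* Automaton for a single symbol c: start -c-> accept */
Frag frag_symbol(char c) {
    Frag f = new_frag();
    add_trans(f.start, c, f.accept);
    return f;
}

/* r | s: a new start state with ε-transitions into both fragments */
Frag frag_union(Frag r, Frag s) {
    Frag f = new_frag();
    add_trans(f.start, EPS, r.start);
    add_trans(f.start, EPS, s.start);
    add_trans(r.accept, EPS, f.accept);
    add_trans(s.accept, EPS, f.accept);
    return f;
}

/* r . s: an ε-transition from the exit of r to the entry of s */
Frag frag_concat(Frag r, Frag s) {
    Frag f;
    add_trans(r.accept, EPS, s.start);
    f.start = r.start;
    f.accept = s.accept;
    return f;
}

/* r*: new start and exit states, linked in both directions by ε */
Frag frag_star(Frag r) {
    Frag f = new_frag();
    add_trans(f.start, EPS, r.start);
    add_trans(r.accept, EPS, f.accept);
    add_trans(f.start, EPS, f.accept);
    add_trans(f.accept, EPS, f.start);
    return f;
}

/* Adds every state reachable through ε-transitions. */
static unsigned long long closure(unsigned long long set) {
    unsigned long long prev;
    do {
        prev = set;
        for (int i = 0; i < n_trans; i++)
            if (trans[i].label == EPS && ((set >> trans[i].from) & 1ULL))
                set |= 1ULL << trans[i].to;
    } while (set != prev);
    return set;
}

/* Simulates the fragment on w by tracking the set of reachable states. */
int frag_matches(Frag f, const char *w) {
    unsigned long long cur = closure(1ULL << f.start);
    for (; *w != '\0'; w++) {
        unsigned long long next = 0;
        for (int i = 0; i < n_trans; i++)
            if (trans[i].label == *w && ((cur >> trans[i].from) & 1ULL))
                next |= 1ULL << trans[i].to;
        cur = closure(next);
    }
    return (int)((cur >> f.accept) & 1ULL);
}

/* Builds the automaton for (a|b)*ac, the expression used in the earlier figure. */
Frag build_example(void) {
    Frag a = frag_symbol('a');
    Frag b = frag_symbol('b');
    Frag ab_star = frag_star(frag_union(a, b));
    Frag a2 = frag_symbol('a');
    Frag c = frag_symbol('c');
    return frag_concat(frag_concat(ab_star, a2), c);
}
```

Building (a|b)*ac this way uses 12 states. The resulting automaton accepts ac, abac and bbac, and rejects a, acb and the empty word.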
