CC 2
Lexical Analysis
Lexical Analyzer
Tokens
Example:
if( i == j )
z = 0;
else
z = 1;
• Input is just a sequence of characters (\b is a blank, \n a newline, \t a tab):
i f ( \b i \b = = \b j \b ) \n \t ....
Goal:
• partition the input string into substrings
• classify them according to their role
• A token is a syntactic category
• Natural language: “He wrote the program”
• Words: “He”, “wrote”, “the”, “program”
• Programming language: “if(b == 0) a = b”
• Words: “if”, “(”, “b”, “==”, “0”, “)”, “a”, “=”, “b”
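The two goals above (partition the input, then classify each piece) can be sketched in a few lines of Python. This is only an illustration: the token class names (KEYWORD, IDENTIFIER, and so on) and the patterns in TOKEN_SPEC are assumptions made for this example, not part of the slides.

```python
import re

# Hypothetical token classes for this example; a real language defines many more.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:if|else)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"[0-9]+"),
    ("OPERATOR",   r"==|="),
    ("PUNCT",      r"[();]"),
    ("WHITESPACE", r"\s+"),
]
# One alternation with a named group per token class.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Partition text into lexemes and classify each one as (class, lexeme)."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "WHITESPACE":   # whitespace separates tokens but is not one
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

for token_class, lexeme in tokenize("if(b == 0) a = b"):
    print(token_class, lexeme)
```

The lexemes produced for the example input are the same nine "words" listed above, each paired with its class.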
Alphabets
• An alphabet is any finite set of symbols. {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z, A-Z} is the alphabet of English letters.
Strings
• Any finite sequence of symbols (characters) drawn from an alphabet is called a string. The length of a string is the total number of symbol occurrences in it, e.g., the length of the string Pakistan is 8, written |Pakistan| = 8. A string with no symbols, i.e. a string of length zero, is known as the empty string and is denoted by ε (epsilon).
Specification of Tokens
Special symbols
A typical high-level language contains the following symbols:-
Regular Expression
Language
• A language is a set of strings over some finite alphabet; the set itself may be finite or infinite.
• Because languages are sets, set operations can be performed on them mathematically.
• Regular languages, which include every finite language, can be described by means of regular expressions.
Regular Expressions
• The lexical analyzer needs to scan and identify only the valid strings (tokens/lexemes) that belong to the language at hand.
• It searches for the patterns defined by the language rules.
• Regular expressions can express such languages by defining a pattern for finite strings of symbols.
• The grammar defined by regular expressions is known as a regular grammar.
• The language defined by a regular grammar is known as a regular language.
• A regular expression is an important notation for specifying patterns.
• Each pattern matches a set of strings, so regular expressions serve as names for sets of strings.
• Programming language tokens can be described by regular languages.
• The specification of regular expressions is an example of a recursive definition.
• Regular languages are easy to understand and have efficient implementations.
• Regular expressions obey a number of algebraic laws, which can be used to manipulate them into equivalent forms.
Operations
The various operations on languages are:
• Union of two languages L and M: L ∪ M = {s | s is in L or s is in M}
• Concatenation of two languages L and M: LM = {st | s is in L and t is in M}
• Kleene closure of a language L: L* = zero or more concatenations of strings from L
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
• Union: (r)|(s) is a regular expression denoting L(r) ∪ L(s)
• Concatenation: (r)(s) is a regular expression denoting L(r)L(s)
• Kleene closure: (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
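For small finite languages, the three operations can be tried out directly on Python sets. The sketch below is illustrative only; because L* is infinite whenever L contains a non-empty string, the hypothetical helper kleene_closure takes a max_repeats bound and returns just the strings built from at most that many pieces.

```python
from itertools import product

def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M

def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s, t in product(L, M)}

def kleene_closure(L, max_repeats):
    """Finite slice of L*: concatenations of at most max_repeats strings of L."""
    result = {""}          # zero occurrences gives the empty string
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, L)
        result |= layer
    return result

L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(kleene_closure({"ab"}, 2)))
```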
Precedence and Associativity
• *, concatenation (.), and | (pipe sign) are left associative
• * has the highest precedence
• Concatenation (.) has the second highest precedence.
• | (pipe sign) has the lowest precedence of all.
This expression...   matches this...      but not this...
(ab)*c               abc, ababababc       ababab, ababd
(.a)+b               xab, ra5afab         b, aagb
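The table and the precedence rules can be checked mechanically. The sketch below uses Python's re module as a stand-in for a Lex-style matcher (an assumption of this example; the two agree on these operators) and requires the whole string to match.

```python
import re

def matches(pattern, text):
    """True when the entire text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

# Examples taken from the table above.
EXAMPLES = {
    r"(ab)*c": {"match": ["abc", "ababababc"], "no_match": ["ababab", "ababd"]},
    r"(.a)+b": {"match": ["xab", "ra5afab"], "no_match": ["b", "aagb"]},
}

for pattern, cases in EXAMPLES.items():
    for text in cases["match"]:
        print(pattern, text, matches(pattern, text))
    for text in cases["no_match"]:
        print(pattern, text, matches(pattern, text))

# Precedence: * binds tighter than concatenation, which binds tighter than |,
# so ab*|c is read as (a(b*))|c and not as (ab)*|c or a(b*|c).
print(matches(r"ab*|c", "abbb"), matches(r"ab*|c", "abab"), matches(r"ab*|c", "ac"))
```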
Regular Expression – Choosing one character from many
• A string of characters enclosed in square brackets ([]) matches any one character in that string.
• If the first character in the brackets is a caret (^), it matches any character except those in the
string.
• For example, [abc] matches a, b, or c, but not x, y, or z.
• However, [^abc] matches x, y, or z, but not a, b, or c.
• A minus sign (-) within square brackets indicates a range of consecutive ASCII characters.
• For example, [0-9] is the same as [0123456789].
• If a right square bracket is immediately after a left square bracket, it does not terminate the string but is
considered to be one of the characters to match.
• If any special character, such as backslash (\), asterisk (*), or plus sign (+), is immediately after the left
square bracket, it doesn't have its special meaning and is considered to be one of the characters to
match.
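These bracket rules can be tried out with Python's re module, used here as an assumed stand-in for Lex. One difference to keep in mind: in Python a backslash inside brackets still acts as an escape character, so only the asterisk and plus cases are shown.

```python
import re

def accepts(char_class, ch):
    """Return True when the single character ch is matched by char_class."""
    return re.fullmatch(char_class, ch) is not None

# Each line prints the result for one character inside and one outside the class.
print(accepts("[abc]", "a"), accepts("[abc]", "x"))     # only the listed characters
print(accepts("[^abc]", "x"), accepts("[^abc]", "a"))   # caret negates the set
print(accepts("[0-9]", "7"), accepts("[0-9]", "a"))     # range of consecutive characters
print(accepts("[]a]", "]"), accepts("[]a]", "b"))       # ] right after [ is a literal ]
print(accepts("[*+]", "*"), accepts("[*+]", "x"))       # * and + are ordinary characters here
```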
The Lex and Flex Scanner Generators
Creating a Lexical Analyzer with Lex and Flex
• lex source program (lex.l) → lex or flex compiler → lex.yy.c
• lex.yy.c → C compiler → a.out
• input stream → a.out → sequence of tokens
Lex Specification
A LEX program consists of three sections: Declarations, Rules, and Auxiliary functions.
DECLARATIONS
%%
RULES
%%
AUXILIARY FUNCTIONS
Lex Specification
Declarations
• The declarations section consists of two parts: auxiliary declarations and regular definitions.
Lex Specification
Rules
Each rule in a LEX program consists of two parts:
1. The pattern to be matched
2. The corresponding action to be executed
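The pattern/action idea can be mimicked in Python to see how a rule table drives a scanner. This is a rough sketch and not how Lex is implemented: make_scanner and the RULES table are hypothetical names for this example. It does follow the Lex convention that the longest match wins and that ties go to the rule listed first.

```python
import re

def make_scanner(rules):
    """Build a scan(text) function from a list of (pattern, action) rules."""
    compiled = [(re.compile(pattern), action) for pattern, action in rules]

    def scan(text):
        results = []
        pos = 0
        while pos < len(text):
            best_len, best_action = 0, None
            for regex, action in compiled:
                match = regex.match(text, pos)
                # Only a strictly longer match replaces the current best,
                # so equal-length matches go to the earlier rule.
                if match and match.end() - pos > best_len:
                    best_len, best_action = match.end() - pos, action
            if best_action is None:
                pos += 1   # no rule matched; this sketch simply skips the character
                continue
            value = best_action(text[pos:pos + best_len])
            if value is not None:
                results.append(value)
            pos += best_len
        return results

    return scan

# Hypothetical rules: each pairs a pattern with the action to run on its lexeme.
RULES = [
    (r"if",        lambda lexeme: ("KEYWORD", lexeme)),
    (r"[A-Za-z]+", lambda lexeme: ("WORD", lexeme)),
    (r"[0-9]+",    lambda lexeme: ("NUMBER", int(lexeme))),
    (r"[ \t\n]+",  lambda lexeme: None),   # action that discards whitespace
]

scan = make_scanner(RULES)
print(scan("if iffy 42"))
```

Here "if" matches both the keyword rule and the word rule with the same length, so the earlier rule is chosen, while "iffy" goes to the word rule because that match is longer.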
Regular Expressions in Lex
x         match the character x
\.        match the character .
"string"  match the contents of the string of characters
.         match any character except newline
^         match the beginning of a line
$         match the end of a line
[xyz]     match one character x, y, or z (use \ to escape -)
[^xyz]    match any character except x, y, and z
[a-z]     match one of a to z
r*        closure (match zero or more occurrences)
r+        positive closure (match one or more occurrences)
r?        optional (match zero or one occurrence)
r1r2      match r1 then r2 (concatenation)
r1|r2     match r1 or r2 (union)
(r)       grouping
{d}       match the regular expression defined by d
Lex Specification
Auxiliary functions
• LEX generates C code for the rules specified in the Rules section and places this code into a single function called yylex().
Lab Assignment # 2
Write a lex file to check whether the user enters a VALID operator or an INVALID operator. The output should look like the sample below:
Deadline: ???
For Assignment:
1. https://www.youtube.com/watch?v=54bo1qaHAfk
2. https://www.youtube.com/watch?v=ilwXAchl4uw
3. https://codedost.com/flex/flex-programs/
Regular Expression
Representing occurrences of symbols using regular expressions
• letter = [a-z] | [A-Z], i.e. [a-zA-Z]
• digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9, i.e. [0-9]
• sign = + | -, i.e. [+-]
• The remaining problem for the lexical analyzer is how to check input strings against the regular expressions that specify the token patterns of a language.
• A well-accepted solution is to use finite automata for this recognition.
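The letter, digit, and sign definitions above become useful once they are combined into token patterns. The sketch below does this in Python; IDENTIFIER and SIGNED_INT are hypothetical tokens chosen for illustration and are not defined in the slides.

```python
import re

# Regular definitions, written as Python regex fragments.
LETTER = "[a-zA-Z]"
DIGIT = "[0-9]"
SIGN = "[+-]"

# Hypothetical tokens built from the definitions.
IDENTIFIER = f"{LETTER}({LETTER}|{DIGIT})*"   # a letter, then letters or digits
SIGNED_INT = f"{SIGN}?{DIGIT}+"               # optional sign, then one or more digits

def is_token(definition, text):
    """True when the whole text matches the given definition."""
    return re.fullmatch(definition, text) is not None

print(is_token(IDENTIFIER, "x1"), is_token(IDENTIFIER, "1x"))
print(is_token(SIGNED_INT, "-42"), is_token(SIGNED_INT, "+-4"))
```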
Finite Automata
• A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly.
• A finite automaton is a recognizer for the language of a regular expression.
• When an input string is fed into a finite automaton, it changes its state for each symbol.
• If the input string is processed successfully and the automaton reaches a final state, the string is accepted, i.e., it is a valid token of the language at hand.
The transition function (δ) maps each pair of a state from the finite set Q and an input symbol from the finite set Σ to a next state:
δ: Q × Σ → Q
Finite Automata Construction
Let L(r) be a regular language recognized by some finite automaton (FA).
• States: The states of an FA are represented by circles. State names are written inside the circles.
• Start state: The state from which the automaton starts is known as the start state. The start state has an arrow pointing towards it.
• Intermediate states: All intermediate states have at least two arrows, one pointing to them and another pointing out from them.
• Final state: If the input string is successfully parsed, the automaton is expected to be in this state. A final state is represented by double circles.
• Transition: The transition from one state to another happens when a desired symbol is found in the input. Upon a transition, the automaton can either move to the next state or stay in the same state. Movement from one state to another is shown as a directed arrow that points to the destination state. If the automaton stays in the same state, an arrow pointing from the state to itself is drawn.
Finite Automata Construction - Example
We assume the FA accepts any three-digit binary value ending in the digit 1.
FA = (Q = {q0, qf}, Σ = {0, 1}, q0, F = {qf}, δ)
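A table-driven simulation makes the role of δ concrete. The transition table below is an assumption, since the slide's diagram is not reproduced in the text: from either state, input 1 leads to qf and input 0 leads back to q0. With only these two states the machine accepts every binary string that ends in 1, three-digit values included; restricting the length to exactly three digits would need additional states.

```python
def run_dfa(delta, start, accepting, text):
    """Follow delta one symbol at a time; accept if the run ends in an accepting state."""
    state = start
    for symbol in text:
        if (state, symbol) not in delta:
            return False            # symbol outside the alphabet: reject
        state = delta[(state, symbol)]
    return state in accepting

# Assumed transition function delta: Q x Sigma -> Q for Q = {q0, qf}, Sigma = {0, 1}.
DELTA = {
    ("q0", "0"): "q0", ("q0", "1"): "qf",
    ("qf", "0"): "q0", ("qf", "1"): "qf",
}
START = "q0"
ACCEPTING = {"qf"}

for value in ["101", "110", "001"]:
    print(value, run_dfa(DELTA, START, ACCEPTING, value))
```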
Finite Automata - State Graphs
[Figure: state-graph notation showing a state, an accepting state, and a transition labeled a]
[Figures: finite automata examples with transitions labeled 0, 1, a, and b]
[Figure: finite automata transition table]
Nondeterministic Finite Automaton (NFA)
• NFA stands for nondeterministic finite automaton. For a given regular language, it is easier to construct an NFA than a DFA.
• A finite automaton is called an NFA when several paths may exist for a specific input from the current state to the next state.
• Not every NFA is a DFA, but every NFA can be translated into an equivalent DFA.
• An NFA is defined in the same way as a DFA, with two exceptions: a state may have multiple next states for the same input, and ε-transitions are allowed.
• In the example NFA, from state q0 on input a there are two next states, q1 and q2; similarly, from q0 on input b the next states are q0 and q1.
• Thus a particular input does not determine a unique next state. Hence this FA is called a nondeterministic finite automaton.
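One way to run such a machine is to track the set of all states it could be in. The sketch below encodes only the two q0 transitions described above; the rest of the diagram is not available in the text, so it is assumed here that q1 and q2 have no outgoing transitions and that q2 is the accepting state. There are no ε-moves in this example.

```python
def run_nfa(delta, start, accepting, text):
    """Simulate an NFA without ε-moves by tracking every state it may be in."""
    current = {start}
    for symbol in text:
        current = {nxt
                   for state in current
                   for nxt in delta.get((state, symbol), set())}
    return bool(current & accepting)

# Transitions from q0 as described above; everything else is an assumption.
DELTA = {
    ("q0", "a"): {"q1", "q2"},
    ("q0", "b"): {"q0", "q1"},
}
START = "q0"
ACCEPTING = {"q2"}   # assumed accepting state

for text in ["a", "ba", "ab", "b"]:
    print(text, run_nfa(DELTA, START, ACCEPTING, text))
```

The input is accepted when at least one of the possible paths ends in an accepting state.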
Comparison of NFA/DFA
• A DFA has one transition per input per state.
• A DFA has no ε-moves.
• A DFA can take only one path through the state graph.
• That path is completely determined by the input.
• DFAs are easier to implement (table driven).
• NFAs and DFAs recognize the same set of languages (the regular languages).
• For a given language, the NFA can be simpler than the DFA.
• A DFA can be exponentially larger than the equivalent NFA.
• NFAs are the key to automating the RE → DFA construction.
RE to Finite Automata
• We can use Thompson's construction to build a finite automaton from a regular expression.
• We break the regular expression into its smallest subexpressions, convert each of these to an NFA, combine the NFAs with ε-moves, and finally convert the result to a DFA.
• Some basic regular expressions are the following −
Case 3 − For a regular expression (a+b), we can construct the following FA −
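A compact sketch of Thompson-style construction is given below, with one builder per operator and ε-moves gluing the fragments together. The class and method names are hypothetical, and concatenation is done with an ε-edge instead of merging states, which is a common simplification. Note that the + in (a+b) denotes union, written | elsewhere in these slides.

```python
from itertools import count

EPSILON = None   # label used for ε-moves

class NFA:
    """Holds the edges of all fragments; a fragment is a (start, accept) pair."""

    def __init__(self):
        self.edges = {}        # (state, label) -> set of next states
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def add_edge(self, src, label, dst):
        self.edges.setdefault((src, label), set()).add(dst)

    def symbol(self, ch):
        start, accept = self.new_state(), self.new_state()
        self.add_edge(start, ch, accept)
        return start, accept

    def concat(self, first, second):
        self.add_edge(first[1], EPSILON, second[0])
        return first[0], second[1]

    def union(self, left, right):
        start, accept = self.new_state(), self.new_state()
        for fragment in (left, right):
            self.add_edge(start, EPSILON, fragment[0])
            self.add_edge(fragment[1], EPSILON, accept)
        return start, accept

    def star(self, inner):
        start, accept = self.new_state(), self.new_state()
        self.add_edge(start, EPSILON, inner[0])      # enter the loop
        self.add_edge(start, EPSILON, accept)        # or skip it (zero occurrences)
        self.add_edge(inner[1], EPSILON, inner[0])   # repeat
        self.add_edge(inner[1], EPSILON, accept)     # or leave
        return start, accept

    def closure(self, states):
        """All states reachable from the given states using only ε-moves."""
        stack, seen = list(states), set(states)
        while stack:
            state = stack.pop()
            for nxt in self.edges.get((state, EPSILON), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def accepts(self, fragment, text):
        start, accept = fragment
        current = self.closure({start})
        for ch in text:
            moved = {n for s in current for n in self.edges.get((s, ch), ())}
            current = self.closure(moved)
        return accept in current

nfa = NFA()
a_or_b = nfa.union(nfa.symbol("a"), nfa.symbol("b"))   # the (a+b) case
print(nfa.accepts(a_or_b, "a"), nfa.accepts(a_or_b, "b"), nfa.accepts(a_or_b, "ab"))
```

Each fragment should be used in only one larger expression; build fresh fragments with symbol() when the same subexpression appears twice.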
Step 4: In the DFA, the final states are all the states that contain a final state of the NFA (a member of F).
Conversion from NFA to DFA - Example 1
Convert the given NFA to DFA.
Solution: For the given transition diagram we will first construct the transition table.
State    0          1
→q0      q0         q1
q1       {q1, q2}   q1
*q2      q2         {q1, q2}
Now we will obtain the δ' transitions for state q0, then for state q1, and then for state [q1, q2].
[Transition table of the resulting DFA, with columns State, 0, and 1]
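The δ' transitions for this example can be computed mechanically with the subset construction. The sketch below encodes the NFA transition table given above (start state q0, final state q2) and explores only the DFA states reachable from [q0]; the function and variable names are illustrative. This NFA has no ε-moves, so no ε-closure step is needed.

```python
from collections import deque

# NFA transition table from the example: (state, input) -> set of next states.
NFA_DELTA = {
    ("q0", "0"): {"q0"},       ("q0", "1"): {"q1"},
    ("q1", "0"): {"q1", "q2"}, ("q1", "1"): {"q1"},
    ("q2", "0"): {"q2"},       ("q2", "1"): {"q1", "q2"},
}
NFA_START = "q0"
NFA_FINALS = {"q2"}
ALPHABET = ["0", "1"]

def subset_construction(delta, start, finals, alphabet):
    """Return (dfa_delta, dfa_start, dfa_finals); each DFA state is a frozenset of NFA states."""
    start_set = frozenset({start})
    dfa_delta = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            target = frozenset(nxt
                               for state in current
                               for nxt in delta.get((state, symbol), ()))
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A DFA state is final when it contains at least one final state of the NFA.
    dfa_finals = {state for state in seen if state & finals}
    return dfa_delta, start_set, dfa_finals

dfa_delta, dfa_start, dfa_finals = subset_construction(NFA_DELTA, NFA_START, NFA_FINALS, ALPHABET)
for (state, symbol), target in dfa_delta.items():
    print(sorted(state), symbol, "->", sorted(target))
print("final:", [sorted(state) for state in dfa_finals])
```

The reachable DFA states are [q0], [q1], and [q1, q2], and [q1, q2] is final because it contains q2.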
RE to Finite Automata - Examples
Design an FA for the given regular expression 10 + (0 + 11)0*1.
Deadline: ???