CC 2

This document discusses lexical analysis and tokens in compiler construction. It covers: 1. The goal of lexical analysis is to partition the input string into substrings and classify them according to their roles. 2. A token is a syntactic category that can represent things like identifiers, keywords, integers, floats, symbols, and strings in a programming language. 3. Regular expressions are used to define patterns to identify valid tokens in a language and represent the language's grammar. Operations like union, concatenation, and Kleene closure can manipulate regular expressions.

Uploaded by

Kami

Compiler Construction

Lexical Analysis
Lexical Analyzer
Tokens

Example:
if( i == j )
z = 0;
else
z = 1;

Tokens
• Input is just a sequence of characters (\b denotes a blank, \n a newline, \t a tab):

i f ( \b i \b = = \b j \b ) \n \t ....

Tokens

Goal:
• partition input string into substrings
• classify them according to their role

• A token is a syntactic category
• Natural language: “He wrote the program”
• Words: “He”, “wrote”, “the”, “program”

• Programming language: “if(b == 0) a = b”
• Words: “if”, “(”, “b”, “==”, “0”, “)”, “a”, “=”, “b”

Tokens

• Identifiers: x y11 maxsize
• Keywords: if else while for
• Integers: 2 1000 -44 5L
• Floats: 2.0 0.0034 1e5
• Symbols: ( ) + * / { } < > ==
• Strings: “enter x” “error”
Tokens
• A lexeme is the sequence of characters in the source program that matches the pattern for a token.
• Every lexeme must satisfy some predefined rules to be identified as a valid token.
• These rules are expressed by means of a pattern.
• A pattern explains what can be a token, and these patterns are defined by means of regular expressions.
• In a programming language, keywords, constants, identifiers, strings, numbers, operators and punctuation symbols can be considered as tokens.
• For example, in the C language, the variable declaration line

int value = 100;

contains the following tokens:

int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol)
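
The classification above can be sketched with Python's re module. This is only an illustration, not how a production scanner is built: the token class names, the keyword list and the tokenize helper are assumptions made for this example.

```python
import re

# Token classes and their patterns; names and keyword list are illustrative.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|if|else|while|for)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"==|="),
    ("SYMBOL",     r"[;(){}]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (token_class, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("int value = 100;"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'value'), ('OPERATOR', '='), ('CONSTANT', '100'), ('SYMBOL', ';')]
```

The keyword pattern is listed before the identifier pattern and ends with a word boundary, so int is a keyword while integer is still an identifier.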
Specification of Tokens
• Let us understand how language theory defines the following terms:

Alphabets
• An alphabet is any finite set of symbols. {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z, A-Z} is the alphabet of English letters.

Strings
• Any finite sequence of symbols (characters) drawn from an alphabet is called a string. The length of a string is the total number of symbol occurrences in it, e.g., the length of the string Pakistan is 8 and is denoted by |Pakistan| = 8. A string having no symbols, i.e. a string of zero length, is known as the empty string and is denoted by ε (epsilon).
Specification of Tokens
Special symbols
A typical high-level language contains the following symbols:-
Regular Expression
Language
• A language is a set of strings over some finite alphabet.
• Languages are sets, so mathematical set operations can be performed on them.
• Regular languages, which include every finite language, can be described by means of regular expressions.
Regular Expression
Regular Expressions
• The lexical analyzer needs to scan and identify only the valid strings/tokens/lexemes that belong to the language in hand.
• It searches for the patterns defined by the language rules.
• Regular expressions can express such languages, including infinite ones, by defining a pattern for finite strings of symbols.
• The grammar defined by regular expressions is known as regular grammar.
• The language defined by regular grammar is known as regular language.
• Regular expression is an important notation for specifying patterns.
• Each pattern matches a set of strings, so regular expressions serve as names for a set of
strings.
• Programming language tokens can be described by regular languages.
• The specification of regular expressions is an example of a recursive definition.
• Regular languages are easy to understand and have efficient implementation.
• There are a number of algebraic laws that are obeyed by regular expressions, which can be
used to manipulate regular expressions into equivalent forms.
Regular Expression
Operations
The various operations on languages are:
• Union of two languages L and M is written as
  L ∪ M = {s | s is in L or s is in M}
• Concatenation of two languages L and M is written as
  LM = {st | s is in L and t is in M}
• The Kleene closure of a language L is written as
  L* = zero or more concatenations of strings from L (the empty string ε is always included)

Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
• Union: (r)|(s) is a regular expression denoting L(r) ∪ L(s)
• Concatenation: (r)(s) is a regular expression denoting L(r)L(s)
• Kleene closure: (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
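
The three language operations can be tried out on small finite languages represented as Python sets. This is a minimal sketch: the function names are made up for the example, and because L* is infinite whenever L contains a non-empty string, kleene only builds the strings reachable within a chosen number of repetitions.

```python
from itertools import product

def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M

def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s, t in product(L, M)}

def kleene(L, max_repeats):
    """Finite approximation of L*: at most max_repeats concatenations of L."""
    result = {""}      # zero repetitions gives the empty string ε
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, L)
        result |= layer
    return result

L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))        # ['0', '1', 'a', 'b']
print(sorted(concat(L, M)))       # ['a0', 'a1', 'b0', 'b1']
print(sorted(kleene({"ab"}, 2)))  # ['', 'ab', 'abab']
```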
Regular Expression
Precedence and Associativity
• *, concatenation (.), and | (pipe sign) are left associative
• * has the highest precedence
• Concatenation (.) has the second highest precedence.
• | (pipe sign) has the lowest precedence of all.

Representing valid tokens of a language in regular expression


If x is a regular expression, then:
• x* means zero or more occurrences of x,
  i.e., it can generate { ε, x, xx, xxx, xxxx, … }
• x+ means one or more occurrences of x,
  i.e., it can generate { x, xx, xxx, xxxx, … }, which is the same as x.x*
• x? means at most one occurrence of x,
  i.e., it can generate { ε, x }
• [a-z] is all lower-case letters of the English alphabet.
• [A-Z] is all upper-case letters of the English alphabet.
• [0-9] is all decimal digits.
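
A quick way to check these operators is to filter candidate strings with Python's re.fullmatch, which succeeds only when the whole string is generated by the pattern. The helper name generated and the candidate lists are assumptions for this sketch.

```python
import re

def generated(pattern, candidates):
    """Return the candidates that the pattern generates in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["", "x", "xx", "xxx", "y"]
print(generated(r"x*", candidates))  # ['', 'x', 'xx', 'xxx']
print(generated(r"x+", candidates))  # ['x', 'xx', 'xxx']
print(generated(r"x?", candidates))  # ['', 'x']
print(generated(r"[a-z][0-9]", ["a1", "A1", "z9", "ab"]))  # ['a1', 'z9']
```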
Regular Expression – Examples
UNIX style Regular Expression examples with outcomes
Regular Expression – Matching simple expressions
Most characters match themselves. The only exceptions are called special characters:
• asterisk (*),
• plus sign (+),
• question mark (?),
• backslash (\),
• period (.),
• caret (^),
• square brackets ([ and ]),
• dollar sign ($),
• ampersand (&),
• or sign (|).

To match a special character, precede it with a backslash, like this \*.

This expression...   matches this...   but not this...
a                    a                 b
\.\*                 .*                dog
100                  100               ABCDEFG

Regular Expression – Matching any character
A period (.) matches any character except a newline character.

This expression...   matches this...   but not this...
.art                 dart              art
                     cart              hurt
                     tart              dark
Regular Expression – Repeating expressions
You can repeat expressions with an asterisk or plus sign.
• A regular expression followed by an asterisk (*) matches zero or more occurrences of the regular expression.
• A regular expression followed by a plus sign (+) matches one or more occurrences of the one-character regular expression.
• A regular expression followed by a question mark (?) matches zero or one occurrence of the one-character regular expression.

This expression...   matches this...   but not this...
a+b                  ab                b
                     aaab              baa
a*b                  b                 daa
                     ab
                     aaab
.*cat                cat               dog
                     9393cat
                     the old cat
                     c7sb@#puiercat

So to match any series of zero or more characters, use ".*"
Regular Expression – Grouping expressions
If an expression is enclosed in parentheses (( and )), it is treated as one expression, and any asterisk (*) or plus (+) that follows applies to the whole expression.

This expression...   matches this...   but not this...
(ab)*c               abc               ababab
                     ababababc         ababd
(.a)+b               xab               b
                     ra5afab           aagb
Regular Expression – Choosing one character from many
• A string of characters enclosed in square brackets ([]) matches any one character in that string.
• If the first character in the brackets is a caret (^), it matches any character except those in the
string.
• For example, [abc] matches a, b, or c, but not x, y, or z.
• However, [^abc] matches x, y, or z, but not a, b, or c.
• A minus sign (-) within square brackets indicates a range of consecutive ASCII characters.
• For example, [0-9] is the same as [0123456789].
• If a right square bracket is immediately after a left square bracket, it does not terminate the string but is
considered to be one of the characters to match.
• If any special character, such as backslash (\), asterisk (*), or plus sign (+), is immediately after the left
square bracket, it doesn't have its special meaning and is considered to be one of the characters to
match.
Regular Expression – Choosing one character from many

This expression...   matches this...   but not this...
[aeiou][0-9]         a6                ex
                     i3                9a
                     u2                $6
[^cfl]og             dog               cog
                     bog               fog
END[.]               END.              END;
                                       END DO
                                       ENDIAN
Regular Expression – Matching the beginning or end of a line
• You can specify that a regular expression match only the beginning or end of the line. These are called
anchor characters:
• If a caret (^) is at the beginning of the entire regular expression, it matches the beginning of a line.
• If a dollar sign ($) is at the end of the entire regular expression, it matches the end of a line.
• If an entire regular expression is enclosed by a caret and dollar sign (^like this$), it matches an entire
line.

This expression...   matches this...   but not this...
^(the cat).+         the cat runs      see the cat run
.+(the cat)$         watch the cat     the cat eats
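
The rows of the "matches / but not" tables in this part of the deck can be checked mechanically. The sketch below uses Python's re.search as a stand-in for a UNIX-style matcher, which is an assumption: Python's dialect differs in some details, although the constructs used here behave the same way. The CASES list and the check helper are made up for this example.

```python
import re

# (pattern, strings the slides say match, strings the slides say do not match)
CASES = [
    (r".art",         ["dart", "cart", "tart"],           ["art", "hurt", "dark"]),
    (r"a+b",          ["ab", "aaab"],                     ["b", "baa"]),
    (r"a*b",          ["b", "ab", "aaab"],                ["daa"]),
    (r".*cat",        ["cat", "9393cat", "the old cat"],  ["dog"]),
    (r"(ab)*c",       ["abc", "ababababc"],               ["ababab", "ababd"]),
    (r"(.a)+b",       ["xab", "ra5afab"],                 ["b", "aagb"]),
    (r"[aeiou][0-9]", ["a6", "i3", "u2"],                 ["ex", "9a", "$6"]),
    (r"[^cfl]og",     ["dog", "bog"],                     ["cog", "fog"]),
    (r"END[.]",       ["END."],                           ["END;", "END DO", "ENDIAN"]),
    (r"^(the cat).+", ["the cat runs"],                   ["see the cat run"]),
    (r".+(the cat)$", ["watch the cat"],                  ["the cat eats"]),
]

def check(cases):
    """Return every (pattern, string, expectation) that disagrees with the tables."""
    failures = []
    for pattern, should_match, should_not_match in cases:
        for s in should_match:
            if re.search(pattern, s) is None:
                failures.append((pattern, s, "expected a match"))
        for s in should_not_match:
            if re.search(pattern, s) is not None:
                failures.append((pattern, s, "expected no match"))
    return failures

print(check(CASES))  # [] when every row agrees with the tables
```

re.search looks for the pattern anywhere in the string, which is why ^ and $ are needed to pin a pattern to the beginning or end of a line.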
The Lex and Flex Scanner Generators

• lex and its newer cousin flex are scanner generators.
• Input is a set of regular expressions and associated actions (written in C).
• Output is a table-driven scanner (lex.yy.c).
• flex: an open source implementation of the original UNIX lex utility.

The Lex and Flex Scanner Generators
Creating a Lexical Analyzer with Lex and Flex

lex source program (lex.l) → lex or flex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
input stream → a.out → sequence of tokens

The Lex and Flex Scanner Generators
Lex Specification

A LEX program consists of three sections : Declarations, Rules and Auxiliary functions
DECLARATIONS
%%
RULES
%%
AUXILIARY FUNCTIONS
Lex Specification
Declarations
• The declarations section consists of two parts: auxiliary declarations and regular definitions.

• The auxiliary declarations are copied as such by LEX to the output lex.yy.c file. This C code consists of instructions to the C compiler and is not processed by the LEX tool.

• The auxiliary declarations (which are optional) are written in C and are enclosed within ' %{ ' and ' %} '. They are generally used to declare functions, include header files, or define global variables and constants.

The Lex and Flex Scanner Generators
Lex Specification
Rules
Each rule in a LEX program consists of two parts:
1. The pattern to be matched
2. The corresponding action to be executed

• LEX obtains the regular expressions of the symbols 'number' and 'op' from the declarations section and generates code into a function yylex() in the lex.yy.c file.

• This function scans the input stream for a match to one of the specified patterns and executes the code in the action part of that pattern. When several patterns match, the longest match wins; if the lengths are equal, the rule listed first is chosen.

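
The pattern/action idea behind the Rules section can be imitated in a few lines of Python. This is only a sketch of the matching policy (longest match, then earliest rule), not of what lex generates: a real lex scanner is a table-driven automaton written in C, and the RULES list, token names and scan function here are hypothetical.

```python
import re

# Each rule pairs a pattern with an action, in the spirit of a lex Rules section.
RULES = [
    (re.compile(r"if"),             lambda text: ("IF", text)),
    (re.compile(r"[a-z][a-z0-9]*"), lambda text: ("ID", text)),
    (re.compile(r"[0-9]+"),         lambda text: ("NUMBER", text)),
    (re.compile(r"[+\-*/]"),        lambda text: ("OP", text)),
    (re.compile(r"[ \t\n]+"),       lambda text: None),  # whitespace: no token
]

def scan(source):
    """Apply the rules repeatedly and collect the tokens returned by the actions."""
    tokens = []
    pos = 0
    while pos < len(source):
        best_len, best_action = 0, None
        for pattern, action in RULES:
            m = pattern.match(source, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_len, best_action = m.end() - pos, action
        if best_action is None:
            raise SyntaxError(f"no rule matches at position {pos}")
        token = best_action(source[pos:pos + best_len])
        if token is not None:
            tokens.append(token)
        pos += best_len
    return tokens

print(scan("if iffy + 42"))
# [('IF', 'if'), ('ID', 'iffy'), ('OP', '+'), ('NUMBER', '42')]
```

The input if matches both the IF and the ID rule with the same length, so the earlier rule is used, while iffy is longer as an identifier and is therefore scanned as ID.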
The Lex and Flex Scanner Generators
Regular Expressions in Lex
x          match the character x
\.         match the character .
"string"   match contents of string of characters
.          match any character except newline
^          match beginning of a line
$          match the end of a line
[xyz]      match one character x, y, or z (use \ to escape -)
[^xyz]     match any character except x, y, and z
[a-z]      match one of a to z
r*         closure (match zero or more occurrences)
r+         positive closure (match one or more occurrences)
r?         optional (match zero or one occurrence)
r1r2       match r1 then r2 (concatenation)
r1|r2      match r1 or r2 (union)
(r)        grouping
{d}        match the regular expression defined by d
The Lex and Flex Scanner Generators
Lex Specification
Auxiliary functions
• LEX generates C code for the rules specified in the Rules section and places this code into a single function called yylex().

• In addition to this LEX generated code, the programmer may wish to add their own code to the lex.yy.c file.

• The auxiliary functions section allows the programmer to achieve this.

• The auxiliary declarations and auxiliary functions are copied as such to the lex.yy.c file.

• Once the code is written, lex.yy.c may be generated using the command lex "filename.l" and compiled as gcc lex.yy.c

The Lex and Flex Scanner Generators
Lab Assignment # 2
Write a lex file that checks whether the user entered a VALID operator or an INVALID operator. The output should look like the sample below.

Deadline: ???

For Assignment:
1. https://www.youtube.com/watch?v=54bo1qaHAfk
2. https://www.youtube.com/watch?v=ilwXAchl4uw
3. https://codedost.com/flex/flex-programs/
Regular Expression
Representing occurrences of symbols using regular expressions
• letter = [a-z] | [A-Z]
• digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
• sign = + | - or [+-]

Representation of language tokens using regular expressions

• Decimal = (sign)?(digit)+
• Identifier = (letter)(letter | digit)*
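
The two token definitions can be written almost literally with Python's re module. A minimal sketch; the helper names is_decimal and is_identifier are assumptions for this example.

```python
import re

LETTER = r"[a-zA-Z]"
DIGIT = r"[0-9]"
SIGN = r"[+-]"

# Decimal = (sign)?(digit)+ and Identifier = (letter)(letter | digit)*
DECIMAL = re.compile(rf"(?:{SIGN})?(?:{DIGIT})+")
IDENTIFIER = re.compile(rf"(?:{LETTER})(?:{LETTER}|{DIGIT})*")

def is_decimal(text):
    """True when the whole string is an optionally signed run of digits."""
    return DECIMAL.fullmatch(text) is not None

def is_identifier(text):
    """True when the whole string is a letter followed by letters or digits."""
    return IDENTIFIER.fullmatch(text) is not None

print(is_decimal("-44"), is_decimal("4a"))        # True False
print(is_identifier("y11"), is_identifier("1x"))  # True False
```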

• The remaining problem for the lexical analyzer is how to check whether an input string matches the regular expression that specifies a token pattern of the language.
• A well-accepted solution is to use finite automata for this check.
Finite Automata
• A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly.
• A finite automaton is a recognizer for the language denoted by a regular expression.
• When an input string is fed into a finite automaton, it changes its state for each symbol.
• If the input string is successfully processed and the automaton reaches a final state, the string is accepted, i.e., the string just fed is a valid token of the language in hand.

The mathematical model of finite automata consists of:

• Finite set of states (Q)
• Finite set of input symbols (Σ)
• One start state (q0)
• Set of final states (qf)
• Transition function (δ)

The transition function (δ) maps each pair of a state from Q and an input symbol from Σ to a next state:
δ : Q × Σ ➔ Q
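
The five components map directly onto a small data structure. The sketch below is one possible encoding, with an illustrative automaton (binary strings ending in 1) that is not taken from the slides; the class and field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    states: set    # Q
    alphabet: set  # Σ
    start: str     # q0
    finals: set    # set of final states
    delta: dict    # transition function: (state, symbol) -> state

    def accepts(self, text):
        """Follow δ symbol by symbol; accept if the run ends in a final state."""
        state = self.start
        for symbol in text:
            if (state, symbol) not in self.delta:
                return False  # no transition defined: reject
            state = self.delta[(state, symbol)]
        return state in self.finals

# Illustrative automaton: binary strings whose last digit is 1.
ends_in_one = DFA(
    states={"q0", "q1"},
    alphabet={"0", "1"},
    start="q0",
    finals={"q1"},
    delta={
        ("q0", "0"): "q0", ("q0", "1"): "q1",
        ("q1", "0"): "q0", ("q1", "1"): "q1",
    },
)

print(ends_in_one.accepts("101"))  # True
print(ends_in_one.accepts("110"))  # False
```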
Finite Automata Construction
Let L(r) be a regular language recognized by some finite automata (FA).

• States: States of an FA are represented by circles. State names are written inside the circles.
• Start state: The state from which the automaton starts is known as the start state. The start state has an arrow pointing towards it.
• Intermediate states: All intermediate states have at least two arrows, one pointing to them and another pointing out from them.
• Final state: If the input string is successfully parsed, the automaton is expected to be in this state. A final state is represented by double circles.
• Transition: The transition from one state to another happens when a desired symbol in the input is found. Upon transition, the automaton can either move to the next state or stay in the same state. Movement from one state to another is shown as a directed arrow, where the arrow points to the destination state. If the automaton stays in the same state, an arrow pointing from the state to itself is drawn.
Finite Automata Construction - Example
We assume FA accepts any three digit binary value ending in digit 1.
FA = {Q(q0, qf), Σ(0,1), q0, qf, δ}
Finite Automata

State Graphs
• A state
• The start state
• An accepting state
• A transition (an arrow labelled with an input character, e.g. a)

• A finite automaton accepts a string if we can follow transitions labelled with the characters in the string from the start state to some accepting state.

Finite Automata - Example

• A FA that accepts only “1”

• A FA that accepts any number of 1’s followed by a single 0 (edge labels in the figure: 1, 0)

• A FA that accepts ab*a
• Alphabet: {a,b} (edge labels in the figure: a, b, a)

Finite Automata – Transition Table
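
A transition table lists, for every state and input symbol, the state to move to. As a minimal sketch, here is one possible table for the ab*a example, stored as a nested dictionary. The state names s0, s1, s2 are assumptions (the figure's own labels are not available), and None stands for "no transition".

```python
# Rows are states, columns are input symbols; None means there is no transition.
TABLE = {
    "s0": {"a": "s1", "b": None},
    "s1": {"a": "s2", "b": "s1"},
    "s2": {"a": None, "b": None},
}
START = "s0"
ACCEPTING = {"s2"}

def run(text):
    """Drive the table over the input; accept if it ends in an accepting state."""
    state = START
    for ch in text:
        state = TABLE[state].get(ch)  # symbols outside {a, b} also give None
        if state is None:
            return False
    return state in ACCEPTING

for word in ["aa", "abba", "a", "ab", "aab"]:
    print(word, run(word))
```

Only aa and abba are accepted among the sample words: the automaton needs an a, any number of b's, and a closing a with nothing after it.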

Nondeterministic Finite Automaton (NFA)

• NFA stands for non-deterministic finite automaton. It is easier to construct an NFA than a DFA for a given regular language.
• A finite automaton is called an NFA when there can be several paths for a specific input from the current state to the next state.
• Not every NFA is a DFA, but each NFA can be translated into an equivalent DFA.
• An NFA is defined in the same way as a DFA but with the following two exceptions: a state may have multiple next states for the same input, and it may contain ε transitions.
Nondeterministic Finite Automaton (NFA)

• We can see that from state q0 for input a, there are two next states q1 and q2; similarly, from q0 for input b, the next states are q0 and q1.

• Thus it is not fixed or determined where to go next on a particular input. Hence this FA is called a non-deterministic finite automaton.
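
One common way to run an NFA is to track the set of all states it could currently be in. The sketch below does that for a hypothetical NFA over {a, b} that accepts strings ending in ab; it is not the automaton from the slide's figure, and it has no ε transitions.

```python
# Hypothetical NFA: (state, symbol) -> set of possible next states.
NFA = {
    ("q0", "a"): {"q0", "q1"},  # two choices on a: the non-deterministic step
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
START = "q0"
FINALS = {"q2"}

def accepts(text):
    """Accept if at least one path through the NFA ends in a final state."""
    current = {START}
    for ch in text:
        # Union of the moves of every state we might be in.
        current = set().union(*(NFA.get((q, ch), set()) for q in current))
    return bool(current & FINALS)

print(accepts("aab"))  # True
print(accepts("aba"))  # False
```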
Comparison of NFA/DFA
DFA:
• One transition per input per state.
• No ε-moves.
• Can take only one path through the state graph.
• Completely determined by input.
• DFAs are easier to implement (table driven).
NFA and DFA:
• NFAs and DFAs recognize the same set of languages (regular languages).
• For a given language, the NFA can be simpler than the DFA.
• The DFA can be exponentially larger than the NFA.
• NFAs are the key to automating RE → DFA construction.
RE to Finite Automata
• We can use Thompson's Construction to obtain a finite automaton from a regular expression.
• We break the regular expression into its smallest subexpressions, convert each of these to an NFA, combine the NFAs with ε-moves, and finally convert the result to a DFA.
• Some basic regular expressions are the following (in these cases + denotes union, written | elsewhere in these slides) −

Case 1 − For a regular expression ‘a’, we can construct the following FA −

Case 2 − For a regular expression ‘ab’, we can construct the following FA −

Case 3 − For a regular expression (a+b), we can construct the following FA −

Case 4 − For a regular expression (a+b)*, we can construct the following FA −

RE to Finite Automata - Examples
building NFA for a ( b|c )*

building NFA for a ( b|c)
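
Thompson's construction can be sketched as a handful of fragment-combining methods, one per regular expression operator. This is an illustrative version under its own conventions (integer state names, None as the ε label, a class called Thompson), not the exact automata drawn on the slides. It builds the NFA for a(b|c)* and simulates it with ε-closures.

```python
from itertools import count

class Thompson:
    """Builds NFA fragments; a fragment is a (start_state, accept_state) pair."""

    def __init__(self):
        self.edges = {}      # state -> list of (symbol or None for ε, target)
        self._ids = count()

    def _new(self):
        state = next(self._ids)
        self.edges[state] = []
        return state

    def symbol(self, ch):
        s, t = self._new(), self._new()
        self.edges[s].append((ch, t))
        return s, t

    def concat(self, a, b):
        self.edges[a[1]].append((None, b[0]))   # ε from a's accept to b's start
        return a[0], b[1]

    def union(self, a, b):
        s, t = self._new(), self._new()
        self.edges[s] += [(None, a[0]), (None, b[0])]
        self.edges[a[1]].append((None, t))
        self.edges[b[1]].append((None, t))
        return s, t

    def star(self, a):
        s, t = self._new(), self._new()
        self.edges[s] += [(None, a[0]), (None, t)]      # enter or skip
        self.edges[a[1]] += [(None, a[0]), (None, t)]   # repeat or leave
        return s, t

    def closure(self, states):
        """All states reachable from the given ones using only ε-moves."""
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for sym, target in self.edges[q]:
                if sym is None and target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

    def accepts(self, fragment, text):
        current = self.closure({fragment[0]})
        for ch in text:
            moved = {t for q in current for sym, t in self.edges[q] if sym == ch}
            current = self.closure(moved)
        return fragment[1] in current

builder = Thompson()
# a ( b | c )*
nfa = builder.concat(
    builder.symbol("a"),
    builder.star(builder.union(builder.symbol("b"), builder.symbol("c"))),
)
print(builder.accepts(nfa, "abcb"))  # True
print(builder.accepts(nfa, "ba"))    # False
```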


Conversion from NFA to DFA
• In this section, we will discuss the method of converting an NFA to its equivalent DFA.
• In an NFA, when a specific input is given to the current state, the machine may go to multiple states.
• It can have zero, one or more than one move on a given input symbol.
• On the other hand, in a DFA, when a specific input is given to the current state, the machine goes to only one state.
• A DFA has only one move on a given input symbol.

• Let M = (Q, ∑, δ, q0, F) be an NFA which accepts the language L(M).

• There is an equivalent DFA denoted by M' = (Q', ∑, δ', q0', F') such that L(M) = L(M').

Steps for converting NFA to DFA:

Step 1: Initially Q' = ϕ
Step 2: Add the start state q0 of the NFA to Q' as [q0]. Then find the transitions from this start state.
Step 3: For each state in Q', find the set of NFA states reachable on each input symbol. If this set of states is not in Q', then add it to Q'. Repeat until no new state is added.
Step 4: The final states of the DFA are all the states in Q' that contain at least one final state of the NFA (a member of F).
Conversion from NFA to DFA- Example - 1
Convert the given NFA to DFA.

Solution: For the given transition diagram we will first construct the transition table.

State    0           1
→q0      q0          q1
q1       {q1, q2}    q1
*q2      q2          {q1, q2}

Now we will obtain δ' transition for state q0.
δ'([q0], 0) = [q0]
δ'([q0], 1) = [q1]

Now we will obtain δ' transition for state q1.
δ'([q1], 0) = [q1, q2]
δ'([q1], 1) = [q1]

Now we will obtain δ' transition for state q2.
δ'([q2], 0) = [q2]
δ'([q2], 1) = [q1, q2]

Now we will obtain δ' transition for state [q1,q2].
δ'([q1, q2], 0) = δ(q1, 0) ∪ δ(q2, 0) = {q1, q2} ∪ {q2} = [q1, q2]
δ'([q1, q2], 1) = δ(q1, 1) ∪ δ(q2, 1) = {q1} ∪ {q1, q2} = [q1, q2]

Transition table for the constructed DFA:

State        0           1
→[q0]        [q0]        [q1]
[q1]         [q1, q2]    [q1]
*[q2]        [q2]        [q1, q2]
*[q1, q2]    [q1, q2]    [q1, q2]

The state q2 can be eliminated because q2 is an unreachable state.
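
The conversion steps can be sketched as a worklist algorithm (the subset construction) and run on the NFA of Example 1. This version assumes an NFA without ε-moves; the function name nfa_to_dfa and the dictionary encoding are choices made for this example. Because it only ever creates states reachable from [q0], the unreachable state [q2] never appears in the result.

```python
from collections import deque

def nfa_to_dfa(nfa, start, finals, alphabet):
    """Subset construction for an NFA without ε-moves.

    nfa maps (state, symbol) to the set of next states.
    Returns (dfa_table, dfa_start, dfa_finals); DFA states are frozensets.
    """
    start_set = frozenset({start})
    dfa = {}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for symbol in alphabet:
            target = frozenset(r for q in current for r in nfa.get((q, symbol), ()))
            dfa[current][symbol] = target
            if target not in dfa:
                queue.append(target)
    # A DFA state is final if it contains at least one final NFA state.
    dfa_finals = {state for state in dfa if state & finals}
    return dfa, start_set, dfa_finals

# Transition table of Example 1.
example1 = {
    ("q0", "0"): {"q0"},       ("q0", "1"): {"q1"},
    ("q1", "0"): {"q1", "q2"}, ("q1", "1"): {"q1"},
    ("q2", "0"): {"q2"},       ("q2", "1"): {"q1", "q2"},
}
dfa, dfa_start, dfa_finals = nfa_to_dfa(example1, "q0", {"q2"}, ["0", "1"])
for state, row in dfa.items():
    print(sorted(state), {sym: sorted(tgt) for sym, tgt in row.items()})
```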
Conversion from NFA to DFA- Example - 2
Convert the given NFA to DFA.

State    0           1
→q0      {q0, q1}    {q1}
*q1      ϕ           {q0, q1}

Now we will obtain δ' transition for state q0.
δ'([q0], 0) = [q0, q1]
δ'([q0], 1) = [q1]

Now we will obtain δ' transition for state q1.
δ'([q1], 0) = ϕ
δ'([q1], 1) = [q0, q1]

Now we will obtain δ' transition for state q0,q1.
δ'([q0, q1], 0) = δ(q0, 0) ∪ δ(q1, 0) = {q0, q1} ∪ ϕ = [q0, q1]
δ'([q0, q1], 1) = δ(q0, 1) ∪ δ(q1, 1) = {q1} ∪ {q0, q1} = [q0, q1]

Transition table for the constructed DFA:

State        0           1
→[q0]        [q0, q1]    [q1]
*[q1]        ϕ           [q0, q1]
*[q0, q1]    [q0, q1]    [q0, q1]

With these new names the DFA will be as follows:


RE to Finite Automata – A complete example
Design a FA from given regular expression 10 + (0 + 11)0* 1.

Transition Table

Equivalent DFA will be

Theory Assignment No. 1

Convert following DFA/NFA to RE- show all steps

Deadline: ???
