Lecture 4: Lexical Analysis

Introduction to Lexical Analysis

Lexical analysis is the first phase of the compiler, also known as scanning. It converts the high-level input program into a sequence of tokens.

1. Lexical analysis can be implemented with a deterministic finite automaton (DFA).

2. The output is a sequence of tokens that is sent to the parser for syntax analysis.

What is a Token?

A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language. Examples of tokens:

 Type token (id, number, real, . . . )

 Punctuation tokens (IF, void, return, . . . )

 Alphabetic tokens (keywords)

Keywords; examples: for, while, if, etc.

Identifiers; examples: variable names, function names, etc.

Operators; examples: '+', '++', '-', etc.

Separators; examples: ',', ';', etc.

Examples of non-tokens:

 Comments, preprocessor directive, macros, blanks, tabs, newline, etc.

Lexeme: A lexeme is the sequence of characters matched by a pattern to form the corresponding token, i.e., a sequence of input characters that comprises a single token. Examples: "float", "abs_zero_Kelvin", "=", "-", "273", ";".

How a Lexical Analyzer Works


1. Input preprocessing: This stage involves cleaning up the input text and preparing it for lexical
analysis. This may include removing comments, whitespace, and other non-essential characters
from the input text.

2. Tokenization: This is the process of breaking the input text into a sequence of tokens. This is
usually done by matching the characters in the input text against a set of patterns or regular
expressions that define the different types of tokens.

3. Token classification: In this stage, the lexer determines the type of each token. For example, in a
programming language, the lexer might classify keywords, identifiers, operators, and
punctuation symbols as separate token types.

4. Token validation: In this stage, the lexer checks that each token is valid according to the rules of
the programming language. For example, it might check that a variable name is a valid identifier,
or that an operator has the correct syntax.

5. Output generation: In this final stage, the lexer generates the output of the lexical analysis
process, which is typically a list of tokens. This list of tokens can then be passed to the next stage
of compilation or interpretation.
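Stages 2 through 5 above can be made concrete with a minimal regex-based tokenizer sketch in Python. The token names and patterns below are illustrative assumptions for a toy C-like language, not a definitive implementation:

import re

# Illustrative token patterns for a toy C-like language. Order matters:
# keywords must be tried before the more general identifier pattern.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|return|if|while)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"\d+"),
    ("OPERATOR",   r"==|[+\-*/=<>]"),
    ("SEPARATOR",  r"[(){},;]"),
    ("SKIP",       r"\s+|//[^\n]*"),   # whitespace and comments are non-tokens
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(code):
    """Yield (token_type, lexeme) pairs; flag unrecognized characters."""
    pos = 0
    while pos < len(code):
        m = MASTER.match(code, pos)
        if not m:
            raise SyntaxError(f"Unexpected character {code[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":        # drop blanks and comments (stage 1)
            yield (m.lastgroup, m.group())
        pos = m.end()

print(list(tokenize("a = b + c;")))
# [('IDENTIFIER', 'a'), ('OPERATOR', '='), ('IDENTIFIER', 'b'),
#  ('OPERATOR', '+'), ('IDENTIFIER', 'c'), ('SEPARATOR', ';')]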
 The lexical analyzer identifies errors with the help of the automaton and the grammar of the language it is based on (such as C or C++), and reports the row number and column number of each error.

Suppose we pass the statement a = b + c; through the lexical analyzer. It will generate a token sequence like this: id = id + id ;, where each id refers to its variable's entry in the symbol table, which references all of its details. For example, consider the program:

int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}

All the valid tokens are:

'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'

'a' '=' '10' ';' 'return' '0' ';' '}'

Above are the valid tokens. You can observe that we have omitted comments. As another example, consider a simple printf statement such as printf("Hello");.

There are 5 valid tokens in this printf statement: printf, (, "Hello", ), and ;. Exercise 1: Count the number of tokens:

int main()
{
int a = 10, b = 20;
printf("sum is:%d",a+b);
return 0;
}

Answer: Total number of tokens: 27.


Exercise 2: Count the number of tokens in: int max(int i);

 The lexical analyzer first reads int, finds it valid, and accepts it as a token.

 max is read and found to be a valid function name after ( is read.

 int is also a token, then i is another token, and finally ;.

Answer: Total number of tokens: 7

int, max, (, int, i, ), ;

 We can represent these in the form of lexemes and tokens, for the statement while (a >= b) a = a - 2;, as under:

Lexeme   Token        Lexeme   Token

while    WHILE        a        IDENTIFIER

(        LPAREN       =        ASSIGNMENT

a        IDENTIFIER   a        IDENTIFIER

>=       COMPARISON   -        ARITHMETIC

b        IDENTIFIER   2        INTEGER

)        RPAREN       ;        SEMICOLON

Advantages

1. Simplifies Parsing: Breaking down the source code into tokens makes it easier for computers to understand and work with the code. This helps programs like compilers or interpreters figure out what the code is supposed to do. It is like breaking a big puzzle into smaller pieces, which makes it easier to put together and solve.
2. Error Detection: Lexical analysis will detect lexical errors such as misspelled keywords or
undefined symbols early in the compilation process. This helps in improving the overall efficiency
of the compiler or interpreter by identifying errors sooner rather than later.

3. Efficiency: Once the source code is converted into tokens, subsequent phases of compilation or
interpretation can operate more efficiently. Parsing and semantic analysis become faster and
more streamlined when working with tokenized input.

Disadvantages

1. Limited Context: Lexical analysis operates based on individual tokens and does not consider the
overall context of the code. This can sometimes lead to ambiguity or misinterpretation of the
code’s intended meaning especially in languages with complex syntax or semantics.

2. Overhead: Although lexical analysis is necessary for the compilation or interpretation process, it
adds an extra layer of overhead. Tokenizing the source code requires additional computational
resources which can impact the overall performance of the compiler or interpreter.

3. Debugging Challenges: Lexical errors detected during the analysis phase may not always provide
clear indications of their origins in the original source code. Debugging such errors can be
challenging especially if they result from subtle mistakes in the lexical analysis process.

Compiler Design - Syntax Analysis

Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall learn the basic
concepts used in the construction of a parser.

We have seen that a lexical analyzer can identify tokens with the help of regular expressions and
pattern rules. But a lexical analyzer cannot check the syntax of a given sentence due to the limitations
of the regular expressions. Regular expressions cannot check balancing tokens, such as parenthesis.
Therefore, this phase uses context-free grammar (CFG), which is recognized by push-down automata.

CFG, on the other hand, is a superset of Regular Grammar.


It implies that every regular grammar is also context-free, but there exist some problems that are beyond the scope of regular grammars. CFG is a helpful tool for describing the syntax of programming languages.

Context-Free Grammar

In this section, we will first see the definition of context-free grammar and introduce terminologies
used in parsing technology.

A context-free grammar has four components:

 A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings.
The non-terminals define sets of strings that help define the language generated by the
grammar.

 A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which
strings are formed.

 A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.

 One of the non-terminals is designated as the start symbol (S); from where the production
begins.

The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start
symbol) by the right side of a production, for that non-terminal.

Example

We take the palindrome language, which cannot be described by means of a regular expression. That is, L = { w | w = w^R } (each string equals its reverse) is not a regular language. But it can be described by means of a CFG, as illustrated below:

G = ( V, Σ, P, S )

Where:

V = { Q, Z, N }

Σ = { 0, 1 }

P = { Q → Z | Q → N | Q → 0 | Q → 1 | Q → ε | Z → 0Q0 | N → 1Q1 }

S = Q

(The productions Q → 0 and Q → 1 are needed so that odd-length palindromes such as 00100 can be derived.)

This grammar describes the palindrome language over {0, 1}, generating strings such as: 1001, 11100111, 00100, 1010101, 11111, etc.
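As a quick check, here is a small recursive recognizer sketch in Python for the corrected productions above; the function name and test strings are illustrative:

def derives_Q(w):
    """True if w is derivable from Q with
    Q -> 0Q0 | 1Q1 | 0 | 1 | ε (the productions above)."""
    if w in ("", "0", "1"):          # Q -> ε | 0 | 1
        return True
    if len(w) >= 2 and w[0] == w[-1] and w[0] in "01":
        return derives_Q(w[1:-1])    # Q -> 0Q0 via Z, or Q -> 1Q1 via N
    return False

for s in ["1001", "00100", "1010101", "10"]:
    print(s, derives_Q(s))           # True, True, True, False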

Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams. The
parser analyzes the source code (token stream) against the production rules to detect any errors in the
code. The output of this phase is a parse tree.

In this way, the parser accomplishes two tasks: parsing the code while looking for errors, and generating a parse tree as the output of the phase.

Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use
error recovering strategies, which we will learn later in this chapter.

Derivation

A derivation is a sequence of production-rule applications used to obtain the input string. During parsing, we take two decisions for some sentential form of the input:

 Deciding the non-terminal which is to be replaced.

 Deciding the production rule, by which, the non-terminal will be replaced.

To decide which non-terminal to replace with a production rule, we have two options.

Left-most Derivation

If the sentential form of an input is scanned and replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-most derivation is called the left-sentential form.

Right-most Derivation

If we scan and replace the input with production rules, from right to left, it is known as right-most
derivation. The sentential form derived from the right-most derivation is called the right-sentential
form.

Example

Production rules:

E→E+E

E→E*E
E → id

Input string: id + id * id

The left-most derivation is:

E→E*E

E→E+E*E

E → id + E * E

E → id + id * E

E → id + id * id

Notice that the left-most non-terminal is always processed first.

The right-most derivation is:

E→E+E

E→E+E*E

E → E + E * id

E → E + id * id

E → id + id * id

Parse Tree

A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived
from the start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us
see this by an example from the last topic.

We take the left-most derivation of a + b * c (with a, b, and c each lexed as the token id).

The left-most derivation is:

E→E*E

E→E+E*E

E → id + E * E

E → id + id * E

E → id + id * id

Step 1:
E→E*E

Step 2:

E→E+E*E

Step 3:

E → id + E * E
Step 4:

E → id + id * E

Step 5:

E → id + id * id

In a parse tree:

 All leaf nodes are terminals.

 All interior nodes are non-terminals.

 In-order traversal gives original input string.

A parse tree depicts associativity and precedence of operators. The deepest sub-tree is traversed first,
therefore the operator in that sub-tree gets precedence over the operator which is in the parent
nodes.
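These properties can be checked with a tiny hand-built tree; the Node class below is an illustrative sketch (not from the text), with the tree matching the left-most derivation of id + id * id shown earlier:

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)   # empty for leaf (terminal) nodes

    def leaves(self):
        """Left-to-right (in-order) leaf sequence: the original input."""
        if not self.children:
            return [self.label]          # all leaf nodes are terminals
        out = []
        for child in self.children:      # all interior nodes are non-terminals
            out.extend(child.leaves())
        return out

# Tree for E => E * E => E + E * E => ... => id + id * id
tree = Node("E", [
    Node("E", [Node("E", [Node("id")]), Node("+"), Node("E", [Node("id")])]),
    Node("*"),
    Node("E", [Node("id")]),
])
print(" ".join(tree.leaves()))           # id + id * id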

Ambiguity

A grammar G is said to be ambiguous if it has more than one parse tree (left or right derivation) for at
least one string.
Example

E→E+E

E→E–E

E → id

For the string id + id – id, the above grammar generates two parse trees: one grouping the expression as (id + id) – id and the other as id + (id – id).

A context-free language for which every grammar is ambiguous is said to be inherently ambiguous. Ambiguity in a grammar is not good for compiler construction. No method can detect and remove ambiguity automatically, but it can be removed either by re-writing the whole grammar without ambiguity, or by setting and following associativity and precedence constraints.

Associativity

If an operand has operators on both sides, the side on which the operator takes this operand is decided by the associativity of those operators. If the operation is left-associative, the operand is taken by the left operator; if the operation is right-associative, the right operator takes the operand.

Example

Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If the
expression contains:

id op id op id

it will be evaluated as:

(id op id) op id

For example, (id + id) + id

Operations like Exponentiation are right associative, i.e., the order of evaluation in the same
expression will be:
id op (id op id)

For example, id ^ (id ^ id)

Precedence

If two different operators share a common operand, the precedence of operators decides which will
take the operand. That is, 2+3*4 can have two different parse trees, one corresponding to (2+3)*4 and
another corresponding to 2+(3*4). By setting precedence among operators, this problem can be easily
removed. As in the previous example, mathematically * (multiplication) has precedence over +
(addition), so the expression 2+3*4 will always be interpreted as:

2 + (3 * 4)

These methods decrease the chances of ambiguity in a language or its grammar.
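The idea can be made concrete with a small precedence-climbing sketch (an illustrative implementation, not a method from the text): the table PREC encodes precedence, the set RIGHT encodes associativity, and the function returns a fully parenthesized reading of the input:

# Precedence and associativity as data: '*' and '/' bind tighter than
# '+' and '-', and '^' is right-associative. All names here are illustrative.
PREC  = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT = {"^"}

def parse(tokens, pos=0, min_prec=1):
    """Return (parenthesized expression, next position) for tokens[pos:]."""
    lhs, pos = tokens[pos], pos + 1                  # an operand (id/number)
    while pos < len(tokens) and PREC[tokens[pos]] >= min_prec:
        op = tokens[pos]
        # Left-associative: the right side must bind strictly tighter.
        # Right-associative: the right side may bind at the same level.
        next_min = PREC[op] if op in RIGHT else PREC[op] + 1
        rhs, pos = parse(tokens, pos + 1, next_min)
        lhs = f"({lhs} {op} {rhs})"
    return lhs, pos

print(parse("2 + 3 * 4".split())[0])   # (2 + (3 * 4))
print(parse("a - b - c".split())[0])   # ((a - b) - c)   left-associative
print(parse("a ^ b ^ c".split())[0])   # (a ^ (b ^ c))   right-associative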

Left Recursion

A grammar becomes left-recursive if it has any non-terminal 'A' whose derivation contains 'A' itself as the left-most symbol. Left-recursive grammars are a problematic situation for top-down parsers. Top-down parsers start parsing from the start symbol, which is itself a non-terminal. So, when the parser encounters the same non-terminal in its derivation, it becomes hard for it to judge when to stop parsing the left non-terminal, and it goes into an infinite loop.

Example:

(1) A => Aα | β

(2) S => Aα | β

A => Sd

(1) is an example of immediate left recursion, where A is any non-terminal symbol and α and β represent strings of terminals and non-terminals (with β not starting with A).

(2) is an example of indirect-left recursion.


A top-down parser will first try to expand A, which in turn yields a sentential form that again begins with A, so the parser may go into a loop forever.
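A tiny sketch of why this happens: a naive recursive-descent procedure for A => Aα | β (here with α = 'a' and β = 'b', both assumed for illustration) calls itself before consuming any input:

import sys
sys.setrecursionlimit(50)            # keep the inevitable failure small

def parse_A(tokens, pos):
    # Trying the alternative A -> A 'a' first: recurse on the SAME position,
    # so no input is ever consumed and no progress is made.
    p = parse_A(tokens, pos)
    return p                         # never reached

try:
    parse_A(list("baa"), 0)
except RecursionError:
    print("left recursion: the parser recursed without consuming input")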

Removal of Left Recursion

One way to remove left recursion is to use the following technique:

The production

A => Aα | β

is converted into following productions

A => βA'

A'=> αA' | ε

This does not impact the strings derived from the grammar, but it removes immediate left recursion.

The second method is to use the following algorithm, which eliminates all direct and indirect left recursion.

START

Arrange the non-terminals in some order A1, A2, A3, …, An

for each i from 1 to n

    for each j from 1 to i-1

        replace each production of the form Ai ⟹ Aj 𝜸

        with Ai ⟹ δ1𝜸 | δ2𝜸 | δ3𝜸 | … | δn𝜸

        where Aj ⟹ δ1 | δ2 | … | δn are the current Aj-productions

    eliminate the immediate left recursion among the Ai-productions

END

Example
The production set

S => Aα | β

A => Sd

after applying the above algorithm, should become

S => Aα | β

A => Aαd | βd

and then, remove immediate left recursion using the first technique.

A => βdA'

A' => αdA' | ε

Now none of the productions has either direct or indirect left recursion.
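The first technique is mechanical enough to sketch in code. Below is an illustrative Python sketch (the grammar representation is assumed: each production alternative is a non-empty list of symbols, and ε marks the empty alternative), applied to the A-productions above:

# Sketch of immediate left-recursion removal:
#   A -> A a1 | ... | b1 | ...   becomes
#   A -> b1 A' | ...  and  A' -> a1 A' | ... | ε
def remove_immediate_left_recursion(head, alternatives):
    recursive = [alt[1:] for alt in alternatives if alt[0] == head]
    others    = [alt for alt in alternatives if alt[0] != head]
    if not recursive:
        return {head: alternatives}      # no immediate left recursion
    new_head = head + "'"
    return {
        head:     [alt + [new_head] for alt in others],
        new_head: [alt + [new_head] for alt in recursive] + [["ε"]],
    }

# The example above, A => A αd | βd:
print(remove_immediate_left_recursion("A", [["A", "αd"], ["βd"]]))
# {'A': [['βd', "A'"]], "A'": [['αd', "A'"], ['ε']]}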

Left Factoring

If more than one production rule of a grammar has a common prefix string, then the top-down parser cannot make a choice as to which of the productions it should take to parse the string in hand.

Example

If a top-down parser encounters a production like

A ⟹ αβ | α𝜸 | …

Then it cannot determine which production to follow to parse the string, as both productions start with the same symbol. To remove this confusion, we use a technique called left factoring.

Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we make one production for each common prefix, and the rest of the derivation is added by new productions.

Example

The above productions can be written as

A => αA'

A'=> β | 𝜸 | …

Now the parser has only one production per prefix, which makes it easier to take decisions.
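In code form, one factoring step looks like this (an illustrative sketch with productions as symbol lists and ε marking an empty tail, mirroring the representation used in the left-recursion sketch above):

def left_factor(head, alternatives):
    """Factor out the longest prefix shared by ALL alternatives (one step)."""
    prefix = alternatives[0]
    for alt in alternatives[1:]:
        i = 0
        while i < min(len(prefix), len(alt)) and prefix[i] == alt[i]:
            i += 1
        prefix = prefix[:i]
    if not prefix:
        return {head: alternatives}          # nothing in common: no change
    new_head = head + "'"
    tails = [alt[len(prefix):] or ["ε"] for alt in alternatives]
    return {head: [prefix + [new_head]], new_head: tails}

# The example above, A => α β | α 𝜸:
print(left_factor("A", [["α", "β"], ["α", "𝜸"]]))
# {'A': [['α', "A'"]], "A'": [['β'], ['𝜸']]}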

First and Follow Sets

An important part of parser-table construction is to create the FIRST and FOLLOW sets. These sets tell us which terminals can appear at the beginning of, or immediately after, the strings derived from a non-terminal. They are used to build the parsing table, where the entry T[A, t] = α records which production A → α to apply when the lookahead terminal is t.
First Set

This set is created to know what terminal symbol is derived in the first position by a non-terminal. For
example,

α→tβ

That is α derives t (terminal) in the very first position. So, t ∈ FIRST(α).

Algorithm for calculating First set

Look at the definition of the FIRST(α) set:

 if α is a terminal, then FIRST(α) = { α }.

 if α is a non-terminal and α → ε is a production, then ε ∈ FIRST(α).

 if α is a non-terminal with a production α → 𝜸1 𝜸2 𝜸3 … 𝜸n, then t ∈ FIRST(α) whenever t ∈ FIRST(𝜸1), or t ∈ FIRST(𝜸i) and every one of 𝜸1 … 𝜸i-1 can derive ε.

The FIRST set can be seen as: FIRST(α) = { t | α ⇒* tβ } ∪ { ε | α ⇒* ε }.

Follow Set

Likewise, we calculate which terminal symbols can immediately follow a non-terminal α in the production rules. We do not consider what the non-terminal can generate; instead, we look at what the next terminal symbol can be after that non-terminal in a derivation.

Algorithm for calculating the FOLLOW set:

 if α is the start symbol, then $ ∈ FOLLOW(α).

 if α is a non-terminal with a production α → AB, then everything in FIRST(B) except ε is in FOLLOW(A).

 if α is a non-terminal with a production α → AB, where B can derive ε (or B is empty), then everything in FOLLOW(α) is in FOLLOW(A).

The FOLLOW set can be seen as: FOLLOW(α) = { t | S ⇒* β α t 𝜸 }.
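Both sets can be computed by iterating to a fixed point. Below is an illustrative Python sketch for an assumed toy grammar S → A b, A → a A | ε (uppercase letters are non-terminals, lowercase are terminals; all names are illustrative):

GRAMMAR = {                       # assumed toy grammar: S -> A b, A -> a A | ε
    "S": [["A", "b"]],
    "A": [["a", "A"], ["ε"]],
}
START = "S"

def first_of(seq, FIRST):
    """FIRST of a symbol sequence, given the FIRST sets computed so far."""
    out = set()
    for X in seq:
        f = FIRST[X] if X in GRAMMAR else {X}   # terminal: FIRST is itself
        out |= f - {"ε"}
        if "ε" not in f:
            return out
    out.add("ε")                  # reached only if every symbol can vanish
    return out

FIRST  = {A: set() for A in GRAMMAR}
FOLLOW = {A: set() for A in GRAMMAR}
FOLLOW[START].add("$")            # rule 1: $ follows the start symbol
changed = True
while changed:                    # iterate until nothing new is added
    changed = False
    for A, alts in GRAMMAR.items():
        for alt in alts:
            new_first = first_of(alt, FIRST) - FIRST[A]
            if new_first:
                FIRST[A] |= new_first; changed = True
            for i, X in enumerate(alt):
                if X not in GRAMMAR:        # only non-terminals get FOLLOW
                    continue
                rest = first_of(alt[i + 1:], FIRST)
                new = (rest - {"ε"}) | (FOLLOW[A] if "ε" in rest else set())
                if not new <= FOLLOW[X]:
                    FOLLOW[X] |= new; changed = True

print(FIRST)   # {'S': {'a', 'b'}, 'A': {'a', 'ε'}}
print(FOLLOW)  # {'S': {'$'}, 'A': {'b'}}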

Limitations of Syntax Analyzers

Syntax analyzers receive their inputs, in the form of tokens, from lexical analyzers. Lexical analyzers
are responsible for the validity of a token supplied by the syntax analyzer. Syntax analyzers have the
following drawbacks -

 it cannot determine if a token is valid,

 it cannot determine if a token is declared before it is being used,

 it cannot determine if a token is initialized before it is being used,

 it cannot determine if an operation performed on a token type is valid or not.


These tasks are accomplished by the semantic analyzer, which we shall study in Semantic Analysis.
Explained Another Way

Lexical Analysis

Lexical Analysis is the first phase of a compiler, where the source code is analyzed to identify meaningful
elements called tokens. It is often referred to as the "scanning" phase, as it involves reading the source
code sequentially and breaking it down into basic units.

Example:

Consider a simple arithmetic expression:

int sum = a + b * 5;

During lexical analysis, this code will be broken down into the following tokens:

 int, sum, =, a, +, b, *, 5, ;

Role of Lexical Analyzer

The lexical analyzer (also known as a lexer or scanner) is responsible for reading the input source code
and converting it into a stream of tokens. Its roles include:

1. Tokenization: Breaking the input string into meaningful symbols or tokens.

o Example: For the expression a + b, the tokens generated could be a, +, b.

2. Removing White Spaces and Comments: The lexer typically removes unnecessary elements like
white spaces, line breaks, and comments that do not contribute to the code's meaning.

3. Error Reporting: If the lexer encounters an invalid token (e.g., an unrecognized character), it
reports an error and may provide the location of the issue.

o Example: If the source code contains an invalid character like # in a language where # is
not defined, the lexer will flag this as an error.

4. Handling Literals and Identifiers: The lexer distinguishes between different types of tokens such
as keywords, identifiers, literals (e.g., numbers, strings), and operators.

Tokens, Lexemes, and Patterns

 Token: A token is a pair consisting of a token name and an optional attribute value. It represents
a class of lexemes.

o Example: The keyword int is a token representing the data type integer.

 Lexeme: A lexeme is a sequence of characters in the source code that matches the pattern for a
token and is identified by the lexical analyzer.
o Example: In the code int sum = 5;, the lexemes are int, sum, =, 5, and ;.

 Pattern: A pattern is a rule that specifies the structure of lexemes that can be recognized as a
particular token.

o Example: The pattern for an identifier might be [a-zA-Z_][a-zA-Z0-9_]*, which matches any string starting with a letter or underscore, followed by any number of letters, digits, or underscores.

Regular Expressions and Finite Automata

 Regular Expressions: Regular expressions are used to specify the patterns for tokens in a formal
and concise manner. They define the set of strings that match a particular pattern.

o Example: The regular expression [0-9]+ matches one or more digits, which can be used
to recognize integer literals.

 Finite Automata: Finite automata (FA) are abstract machines used to recognize patterns
described by regular expressions. There are two types of finite automata:

1. Deterministic Finite Automaton (DFA): A DFA has a finite number of states and
transitions between them based on input symbols. It can be used to recognize a regular
language.

 Example: A DFA for recognizing binary numbers (0 or 1) might have states that
transition between accepting or rejecting based on the input bit.

2. Non-deterministic Finite Automaton (NFA): An NFA allows multiple transitions for the
same input symbol or no transition at all. NFAs can be converted to DFAs for practical
implementation.

 Example: An NFA for recognizing the pattern a*b could have multiple paths
where the a* part can accept any number of a characters followed by b.
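As a small illustration of this correspondence (all names assumed), here is the integer-literal pattern [0-9]+ written both as a Python regular expression and as an explicit two-state DFA; both accept exactly the same strings:

import re

INT_RE = re.compile(r"[0-9]+")       # the regular-expression form

DFA = {                              # the DFA form: state -> input class -> state
    "start":  {"digit": "in_int"},
    "in_int": {"digit": "in_int"},   # loop on further digits
}
ACCEPTING = {"in_int"}

def dfa_accepts(s):
    state = "start"
    for ch in s:
        cls = "digit" if ch in "0123456789" else "other"
        state = DFA.get(state, {}).get(cls)
        if state is None:            # missing transition: reject
            return False
    return state in ACCEPTING

for s in ["123", "", "12a"]:
    print(repr(s), bool(INT_RE.fullmatch(s)), dfa_accepts(s))
# '123' True True / '' False False / '12a' False False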

Summary

 Lexical Analysis involves breaking down source code into tokens.

 The lexical analyzer performs tokenization, removes white spaces and comments, reports errors,
and distinguishes between different types of tokens.

 Tokens are basic units, lexemes are the actual character sequences, and patterns define what
lexemes should look like.

 Regular expressions and finite automata are tools used to define and recognize patterns in the
source code.

Exercise: Generate an Automaton for d((a+bd)c)*

To generate a finite automaton for the regular expression d((a+bd)c)*, we'll break down the regular expression and construct the corresponding deterministic finite automaton (DFA). Here's a step-by-step explanation:

Regular Expression Breakdown

1. d: This requires that the string starts with the character d.

2. (a + bd): This indicates that after d, there can be either a or bd.

3. c: This c follows whatever was chosen in (a + bd).

4. *: This asterisk * means that the pattern (a + bd)c can repeat zero or more times.

Steps to Create DFA

1. Start State:

o Begin with a start state q0.

2. Transition for d:

o From q0, if the input is d, move to state q1.

3. Handling (a + bd):

o From state q1, if the input is a, transition to state q2.

o Alternatively, if the input is b, transition to state q3.

o From state q3, if the input is d, transition to state q4.

4. Transition for c:

o From state q2, if the input is c, transition back to state q1 (because the pattern can
repeat).

o From state q4, if the input is c, transition back to state q1.

5. Accept State:

o The accepting state is q1: since the group (a+bd)c may repeat zero or more times, a string is accepted whenever it ends in q1, either right after the initial d or after a complete repetition ending in c.

DFA Representation in English

1. Start at q0.

2. If the input is d, move to q1.

3. If at q1:
o If the input is a, move to q2.

o If the input is b, move to q3.

4. If at q3 and the input is d, move to q4.

5. If at q2 or q4 and the input is c, move back to q1.

6. Accept the string if it ends in state q1.

Summary of Transitions

 q0 → q1 on input d

 q1 → q2 on input a

 q1 → q3 on input b

 q3 → q4 on input d

 q2 → q1 on input c

 q4 → q1 on input c

Final DFA

The DFA will have states q0, q1, q2, q3, q4, where q0 is the start state and q1 is both the accepting and
repeating state. The transitions will allow the string to be accepted if it follows the pattern d((a+bd)c)*.
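The summary of transitions above translates directly into a table-driven recognizer; here is a minimal Python sketch (state names as above):

DELTA = {                        # (state, input) -> next state; missing = reject
    ("q0", "d"): "q1",
    ("q1", "a"): "q2",
    ("q1", "b"): "q3",
    ("q3", "d"): "q4",
    ("q2", "c"): "q1",
    ("q4", "c"): "q1",
}
ACCEPT = {"q1"}

def accepts(s):
    state = "q0"
    for ch in s:
        state = DELTA.get((state, ch))
        if state is None:        # no transition defined: reject
            return False
    return state in ACCEPT

for s in ["d", "dac", "dbdc", "dacbdc", "da", "abdc"]:
    print(s, accepts(s))
# d True / dac True / dbdc True / dacbdc True / da False / abdc False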


Lexical Analyzer Examples

A lexical analyzer (or lexer) is a component of a compiler or interpreter that processes the input source
code and converts it into a sequence of tokens. Here are some examples to illustrate how a lexical
analyzer works:

Example 1: Simple Arithmetic Expression

Source Code:

int result = a + b * 5;

Lexical Analysis Process:

1. Input: int result = a + b * 5;


2. Output Tokens:

o int → Keyword

o result → Identifier

o = → Assignment Operator

o a → Identifier

o + → Addition Operator

o b → Identifier

o * → Multiplication Operator

o 5 → Numeric Literal

o ; → Semicolon

Explanation:

 The lexical analyzer reads the code character by character.

 It recognizes keywords like int, operators like =, +, *, and the semicolon ;.

 It also identifies identifiers like result, a, b, and the numeric literal 5.

Example 2: Variable Declaration and Assignment

Source Code:

name = ""

Lexical Analysis Process:

1. Input: name = ""

2. Output Tokens:

o name → Identifier

o = → Assignment Operator

o "" → String Literal

Explanation:

 The lexer identifies name as an identifier and the equal sign = as an assignment operator.

 The string "" is recognized as a string literal.

Example 3: Conditional Statement


Source Code:

if (x > 10) {
    y = x * 2;
}

Lexical Analysis Process:

1. Input: if (x > 10) { y = x * 2; }

2. Output Tokens:
 if → Keyword
 ( → Left Parenthesis
 x → Identifier
 > → Relational Operator
 10 → Numeric Literal
 ) → Right Parenthesis
 { → Left Brace
 y → Identifier
 = → Assignment Operator
 x → Identifier
 * → Multiplication Operator
 2 → Numeric Literal
 ; → Semicolon
 } → Right Brace

Explanation:

 The lexical analyzer processes the if statement, identifying keywords (if), operators (>, =, *),
literals (10, 2), and symbols ((, ), {, }, ;).

 Each component is categorized into its respective token type.

Example 4: Function Definition

Source Code:

def add(a, b):


return a + b

Lexical Analysis Process:

1. Input: def add(a, b): return a + b

2. Output Tokens:
 def → Keyword
 add → Identifier
 ( → Left Parenthesis
 a → Identifier
 , → Comma
 b → Identifier
 ) → Right Parenthesis
 : → Colon
 return → Keyword
 a → Identifier
 + → Addition Operator
 b → Identifier

Explanation:

 The lexer identifies function-related keywords (def, return), identifiers (add, a, b), and
operators/symbols ((, ), ,, :, +).

Summary

In each example, the lexical analyzer reads the input code, removes irrelevant characters like white
spaces or comments, and produces a sequence of tokens. These tokens are then passed on to the next
stage of compilation or interpretation (usually syntax analysis).
