Lecture 4 Lexical Analysis
Lexical Analysis is the first phase of the compiler, also known as scanning (the lexical analyzer is often called a scanner). It converts the high-level input program into a sequence of tokens. This sequence of tokens is the output of the phase and is sent to the parser for syntax analysis.
What is a Token?
A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming language. Examples of tokens: keywords (int, while), identifiers, operators (+, =), constants, string literals, and punctuation symbols.
Examples of non-tokens: comments, preprocessor directives, and white space (blanks, tabs, newlines), which are removed during scanning.
Lexeme: The sequence of characters matched by a pattern to form the corresponding token, i.e. the sequence of input characters that comprises a single token, is called a lexeme, e.g. “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;”.
A lexer typically works in the following stages:
1. Input preprocessing: cleaning up the input text, e.g. removing comments, white space, and other non-essential characters.
2. Tokenization: This is the process of breaking the input text into a sequence of tokens. This is usually done by matching the characters in the input text against a set of patterns or regular expressions that define the different types of tokens.
3. Token classification: In this stage, the lexer determines the type of each token. For example, in a
programming language, the lexer might classify keywords, identifiers, operators, and
punctuation symbols as separate token types.
4. Token validation: In this stage, the lexer checks that each token is valid according to the rules of
the programming language. For example, it might check that a variable name is a valid identifier,
or that an operator has the correct syntax.
5. Output generation: In this final stage, the lexer generates the output of the lexical analysis
process, which is typically a list of tokens. This list of tokens can then be passed to the next stage
of compilation or interpretation.
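The stages above can be illustrated with a small self-contained sketch in C. It is only an illustration: the token classes, the keyword list, and the handful of operators it recognizes are assumptions made for this example, not the rules of any particular compiler.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Illustrative keyword list; a real language defines many more. */
static const char *keywords[] = { "int", "return", "if", "else" };

static int is_keyword(const char *s) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(s, keywords[i]) == 0) return 1;
    return 0;
}

/* Scan one token starting at p, print its class and lexeme, return the new position. */
static const char *next_token(const char *p) {
    while (isspace((unsigned char)*p)) p++;              /* skip white space (preprocessing)   */
    if (*p == '\0') return p;

    char lexeme[64];
    int n = 0;
    if (isalpha((unsigned char)*p) || *p == '_') {       /* identifier or keyword              */
        while ((isalnum((unsigned char)*p) || *p == '_') && n < 63) lexeme[n++] = *p++;
        lexeme[n] = '\0';
        printf("%-10s %s\n", is_keyword(lexeme) ? "KEYWORD" : "IDENTIFIER", lexeme);
    } else if (isdigit((unsigned char)*p)) {             /* integer literal                    */
        while (isdigit((unsigned char)*p) && n < 63) lexeme[n++] = *p++;
        lexeme[n] = '\0';
        printf("%-10s %s\n", "NUMBER", lexeme);
    } else if (strchr("+-*/=", *p)) {                    /* a few operators                    */
        printf("%-10s %c\n", "OPERATOR", *p++);
    } else if (strchr("(),;{}", *p)) {                   /* punctuation                        */
        printf("%-10s %c\n", "PUNCT", *p++);
    } else {                                             /* token validation: report an error  */
        printf("ERROR: unrecognized character '%c'\n", *p++);
    }
    return p;
}

int main(void) {
    const char *src = "int a = 10, b = 20;";
    while (*src) src = next_token(src);                  /* output generation: one token per line */
    return 0;
}

Running this on the sample line prints one classified token per line, which is exactly the token stream a parser would receive.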
The lexical analyzer identifies errors with the help of the automaton and the grammar of the given language on which it is based (such as C or C++), and reports the row number and column number of the error.
int main()
{
    // 2 variables
    int a, b;
    a = 10;
    return 0;
}
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';' 'a' '=' '10' ';' 'return' '0' ';' '}'
Above are the valid tokens. You can observe that we have omitted the comment. As another example, a printf statement such as
printf("Hello");
contains 5 valid tokens: printf, (, "Hello" (a string literal is a single token), ), and ;.
Exercise 1: Count the number of tokens in the following program:
int main()
{
    int a = 10, b = 20;
    printf("sum is:%d",a+b);
    return 0;
}
The lexical analyzer first reads int, finds it to be valid, and accepts it as a token; it continues this way through the input, pairing each lexeme with its token class, for example:
( LPAREN
a IDENTIFIER
b IDENTIFIER
) RPAREN
= ASSIGNMENT
a IDENTIFIER
2 INTEGER
; SEMICOLON
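Inside a lexer, each such pair is usually stored as a small record. The C struct below is one possible representation; the type and field names are illustrative, not taken from any particular compiler.

/* One possible in-memory representation of a (token name, attribute) pair. */
enum TokenName { LPAREN, RPAREN, ASSIGNMENT, IDENTIFIER, INTEGER, SEMICOLON };

struct Token {
    enum TokenName name;    /* token class, e.g. IDENTIFIER           */
    const char    *lexeme;  /* the matched characters, e.g. "a", "2"  */
};

/* Example: the lexeme "2" classified as an integer constant. */
struct Token t = { INTEGER, "2" };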
Advantages
1. Simplifies Parsing: Breaking down the source code into tokens makes it easier for computers to
understand and work with the code. This helps programs like compilers or interpreters to figure
out what the code is supposed to do. It’s like breaking down a big puzzle into smaller pieces,
which makes it easier to put together and solve.
2. Error Detection: Lexical analysis will detect lexical errors such as misspelled keywords or
undefined symbols early in the compilation process. This helps in improving the overall efficiency
of the compiler or interpreter by identifying errors sooner rather than later.
3. Efficiency: Once the source code is converted into tokens, subsequent phases of compilation or
interpretation can operate more efficiently. Parsing and semantic analysis become faster and
more streamlined when working with tokenized input.
Disadvantages
1. Limited Context: Lexical analysis operates based on individual tokens and does not consider the
overall context of the code. This can sometimes lead to ambiguity or misinterpretation of the
code’s intended meaning, especially in languages with complex syntax or semantics.
2. Overhead: Although lexical analysis is necessary for the compilation or interpretation process, it
adds an extra layer of overhead. Tokenizing the source code requires additional computational
resources which can impact the overall performance of the compiler or interpreter.
3. Debugging Challenges: Lexical errors detected during the analysis phase may not always provide
clear indications of their origins in the original source code. Debugging such errors can be
challenging, especially if they result from subtle mistakes in the lexical analysis process.
Syntax Analysis
Syntax analysis or parsing is the second phase of a compiler. In this section, we shall learn the basic concepts used in the construction of a parser.
We have seen that a lexical analyzer can identify tokens with the help of regular expressions and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence due to the limitations of regular expressions. Regular expressions cannot check balanced tokens, such as parentheses. Therefore, this phase uses context-free grammar (CFG), which is recognized by push-down automata.
Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce the terminology used in parsing technology.
A context-free grammar has four components:
A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings. The non-terminals define sets of strings that help define the language generated by the grammar.
A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which strings are formed.
A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
One of the non-terminals is designated as the start symbol (S); this is where derivation begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start
symbol) by the right side of a production, for that non-terminal.
Example
We take the palindrome language, which cannot be described by means of a regular expression. That is, L = { w | w = wᴿ } (where wᴿ denotes the reverse of w) is not a regular language. But it can be described by means of a CFG, as illustrated below:
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | Q → N | Q → ℇ | Q → 0 | Q → 1 | Z → 0Q0 | N → 1Q1 }
S = { Q }
This grammar describes the palindrome language over {0, 1}, containing strings such as 1001, 11100111, 00100, 1010101, 11111, etc. (the productions Q → 0 and Q → 1 supply the middle symbol of odd-length palindromes).
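For example, the palindrome 1001 is derived as
Q → N → 1Q1 → 1Z1 → 10Q01 → 1001 (applying Q → ℇ in the last step),
and the odd-length palindrome 00100 uses Q → 1 to supply the middle symbol:
Q → Z → 0Q0 → 0Z0 → 00Q00 → 00100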
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams. The
parser analyzes the source code (token stream) against the production rules to detect any errors in the
code. The output of this phase is a parse tree.
This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and generating a
parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use error-recovery strategies, which we will learn later.
Derivation
A derivation is basically a sequence of production rules, applied in order to obtain the input string. During parsing, we take two decisions for some sentential form of the input: which non-terminal to replace, and which production rule to use to replace it.
To decide which non-terminal to replace first, we have two options.
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-most derivation is called the left-sentential form.
Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as right-most
derivation. The sentential form derived from the right-most derivation is called the right-sentential
form.
Example
Production rules:
E → E + E
E → E * E
E → id
Input string: id + id * id
The left-most derivation is:
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id
The right-most derivation is:
E → E + E
E → E + E * E
E → E + E * id
E → E + id * id
E → id + id * id
Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived
from the start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us
see this by an example from the last topic.
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id
Step 1: E → E * E
Step 2: E → E + E * E
Step 3: E → id + E * E
Step 4: E → id + id * E
Step 5: E → id + id * id
(At each step, the replaced non-terminal gains children in the tree corresponding to the right side of the production used.)
In a parse tree:
all leaf nodes are terminals,
all interior nodes are non-terminals, and
in-order traversal gives the original input string.
A parse tree depicts associativity and precedence of operators. The deepest sub-tree is traversed first,
therefore the operator in that sub-tree gets precedence over the operator which is in the parent
nodes.
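As a rough text sketch (in place of a drawn diagram), the parse tree for id + id * id produced by the derivation above looks like this:

            E
         /  |  \
        E   *   E
      / | \      \
     E  +  E     id
     |     |
     id    id

In this particular tree the + sub-tree is the deepest, so it would be evaluated before *; the Precedence section below discusses how precedence between operators is normally resolved.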
Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (left or right derivation) for at
least one string.
Example
E → E + E
E → E – E
E → id
For the string id + id – id, the above grammar generates two parse trees: one grouping the expression as (id + id) – id and the other as id + (id – id).
Associativity
If an operand has operators on both sides, the side on which the operator takes this operand is
decided by the associativity of those operators. If the operation is left-associative, then the operand
will be taken by the left operator; if the operation is right-associative, the right operator will take the
operand.
Example
Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If the
expression contains:
id op id op id
then it will be evaluated as:
(id op id) op id
Operations like Exponentiation are right associative, i.e., the order of evaluation in the same
expression will be:
id op (id op id)
Precedence
If two different operators share a common operand, the precedence of operators decides which will
take the operand. That is, 2+3*4 can have two different parse trees, one corresponding to (2+3)*4 and
another corresponding to 2+(3*4). By setting precedence among operators, this problem can be easily
removed. As in the previous example, mathematically * (multiplication) has precedence over +
(addition), so the expression 2+3*4 will always be interpreted as:
2 + (3 * 4)
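One common way to build this precedence into the grammar itself, rather than into the parser, is to use one non-terminal per precedence level. The names E, T, and F below are the conventional ones, not symbols used in the earlier examples:

E → E + T | T
T → T * F | F
F → id

Because the * production sits one level lower, multiplication always ends up deeper in the parse tree than addition, so 2 + 3 * 4 is necessarily grouped as 2 + (3 * 4). Note that this grammar is left-recursive, which is the topic of the next section.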
Left Recursion
A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation contains ‘A’ itself as
the left-most symbol. Left-recursive grammar is considered to be a problematic situation for top-down
parsers. Top-down parsers start parsing from the Start symbol, which in itself is non-terminal. So,
when the parser encounters the same non-terminal in its derivation, it becomes hard for it to judge
when to stop parsing the left non-terminal and it goes into an infinite loop.
Example:
(1) A => Aα | β
(2) S => Aα | β
A => Sd
(1) is an example of immediate left recursion, where A is a non-terminal symbol and α, β are strings of terminals and/or non-terminals (with β not beginning with A). (2) is an example of indirect left recursion, since S ⇒ Aα ⇒ Sdα.
The production
A => Aα | β
can be transformed into the following productions:
A => βA'
A' => αA' | ε
This does not impact the strings derived from the grammar, but it removes immediate left recursion.
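As a concrete instance, consider the left-recursive expression grammar E → E + T | T. Here α is + T and β is T, so the transformation gives:

E → T E'
E' → + T E' | ε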
The second method is to use the following algorithm, which eliminates all direct and indirect left recursion.
START
Arrange the non-terminals in some order A1, A2, ..., An.
for each i from 1 to n
{
   for each j from 1 to i-1
   {
      replace each production of the form Ai => Aj𝜸
      with Ai => δ1𝜸 | δ2𝜸 | ... | δk𝜸,
      where Aj => δ1 | δ2 | ... | δk are the current Aj productions
   }
   eliminate immediate left recursion among the Ai productions
}
END
Example
The production set
S => Aα | β
A => Sd
after applying the above algorithm becomes
S => Aα | β
A => Aαd | βd
and then, removing immediate left recursion using the first technique, we get
A => βdA'
A' => αdA' | ε
Now none of the productions has either direct or indirect left recursion.
Left Factoring
If more than one grammar production rule has a common prefix string, then the top-down parser cannot decide which of the productions it should take to parse the string in hand.
Example
A ⟹ αβ | α𝜸 | …
With such productions, the parser cannot determine which production to follow to parse the string, because both productions start with the same symbol α. To remove this confusion, we use a technique called left factoring.
Left factoring transforms the grammar to make it suitable for top-down parsers. In this technique, we make one production for each common prefix, and the rest of the derivation is added by new productions.
Example
The grammar above can be left-factored as:
A => αA'
A' => β | 𝜸 | …
Now the parser has only one production per prefix which makes it easier to take decisions.
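A classic instance is the grammar S → if E then S else S | if E then S, where both productions share the prefix if E then S. Left factoring it gives:

S → if E then S S'
S' → else S | ε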
First and Follow Sets
An important part of parser table construction is to create FIRST and FOLLOW sets. These sets indicate which terminal can appear at a given position in a derivation. They are used while building the parsing table, where the entry T[A, t] = α records which production rule α to apply when the parser sees terminal t while expanding non-terminal A.
First Set
This set is created to know which terminal symbols can be derived in the first position by a non-terminal. For example, if
α → t β
then t (a terminal) is derived in the very first position, so t ∈ FIRST(α).
Follow Set
The FOLLOW set of a non-terminal α is the set of terminal symbols that can appear immediately to the right of α in some sentential form derived from the start symbol.
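As a small worked example (a toy grammar chosen only for illustration), consider:

S → A b
A → a A | ε

Here FIRST(A) = { a, ε }, because A can begin with a or derive the empty string, and FIRST(S) = { a, b }, because when A derives ε the first terminal of S is b. FOLLOW(A) = { b }, since b is the only terminal that can appear immediately to the right of A in any sentential form (from S → A b), and FOLLOW(S) = { $ }, the end-of-input marker.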
Limitations of Syntax Analyzers
Syntax analyzers receive their inputs, in the form of tokens, from lexical analyzers. Lexical analyzers are responsible for the validity of the tokens supplied to the syntax analyzer. Syntax analyzers have the following drawbacks:
they cannot determine whether a token is valid,
they cannot determine whether a token is declared before it is being used,
they cannot determine whether an operation performed on a token type is valid or not.
These tasks are handled by the semantic analyzer.
Lexical Analysis
Lexical Analysis is the first phase of a compiler, where the source code is analyzed to identify meaningful
elements called tokens. It is often referred to as the "scanning" phase, as it involves reading the source
code sequentially and breaking it down into basic units.
Example:
int sum = a + b * 5;
During lexical analysis, this code will be broken down into the following tokens:
int, sum, =, a, +, b, *, 5, ;
The lexical analyzer (also known as a lexer or scanner) is responsible for reading the input source code
and converting it into a stream of tokens. Its roles include:
1. Tokenization: The lexer breaks the input character stream into lexemes and groups them into tokens.
2. Removing White Spaces and Comments: The lexer typically removes unnecessary elements like white spaces, line breaks, and comments that do not contribute to the code's meaning.
3. Error Reporting: If the lexer encounters an invalid token (e.g., an unrecognized character), it
reports an error and may provide the location of the issue.
o Example: If the source code contains an invalid character like # in a language where # is
not defined, the lexer will flag this as an error.
4. Handling Literals and Identifiers: The lexer distinguishes between different types of tokens such
as keywords, identifiers, literals (e.g., numbers, strings), and operators.
Token: A token is a pair consisting of a token name and an optional attribute value. It represents
a class of lexemes.
o Example: The keyword int is a token representing the data type integer.
Lexeme: A lexeme is a sequence of characters in the source code that matches the pattern for a
token and is identified by the lexical analyzer.
o Example: In the code int sum = 5;, the lexemes are int, sum, =, 5, and ;.
Pattern: A pattern is a rule that specifies the structure of lexemes that can be recognized as a
particular token.
Regular Expressions: Regular expressions are used to specify the patterns for tokens in a formal
and concise manner. They define the set of strings that match a particular pattern.
o Example: The regular expression [0-9]+ matches one or more digits, which can be used
to recognize integer literals.
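As a sketch of how such a pattern could be checked in code, the C fragment below uses the POSIX regex.h API (available on POSIX systems, not part of standard C) to test whether a lexeme matches [0-9]+; a production lexer would normally compile all token patterns into a single automaton instead.

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* ^[0-9]+$ : the whole lexeme must consist of one or more digits. */
    if (regcomp(&re, "^[0-9]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return 1;

    const char *lexemes[] = { "273", "abs_zero_Kelvin", "10" };
    for (int i = 0; i < 3; i++) {
        int is_int = (regexec(&re, lexemes[i], 0, NULL, 0) == 0);
        printf("%-16s %s\n", lexemes[i], is_int ? "INTEGER literal" : "not an integer");
    }
    regfree(&re);
    return 0;
}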
Finite Automata: Finite automata (FA) are abstract machines used to recognize patterns
described by regular expressions. There are two types of finite automata:
1. Deterministic Finite Automaton (DFA): A DFA has a finite number of states and
transitions between them based on input symbols. It can be used to recognize a regular
language.
Example: A DFA for recognizing binary numbers (0 or 1) might have states that
transition between accepting or rejecting based on the input bit.
2. Non-deterministic Finite Automaton (NFA): An NFA allows multiple transitions for the
same input symbol or no transition at all. NFAs can be converted to DFAs for practical
implementation.
Example: An NFA for recognizing the pattern a*b could have multiple paths
where the a* part can accept any number of a characters followed by b.
Summary
The lexical analyzer performs tokenization, removes white spaces and comments, reports errors,
and distinguishes between different types of tokens.
Tokens are basic units, lexemes are the actual character sequences, and patterns define what
lexemes should look like.
Regular expressions and finite automata are tools used to define and recognize patterns in the
source code.
Example: constructing a DFA for the regular expression d((a+bd)c)*. Here + denotes alternation (a or bd), and the asterisk * means that the group (a+bd)c can repeat zero or more times.
1. Start State: q0 is the start state.
2. Transition for d: From q0, on input d, move to q1; every accepted string must begin with d.
3. Handling (a + bd): From q1, input a moves to q2; input b moves to q3, and from q3 input d moves to q4.
4. Transition for c: From state q2 (and from q4), if the input is c, transition back to state q1 (because the pattern can repeat).
5. Accept State: The accepting state is q1, because the pattern can repeat; a string is accepted if it ends in q1, i.e. right after the initial d or after a complete (a+bd)c group.
Summary of Transitions
q0 → q1 on input d
q1 → q2 on input a
q1 → q3 on input b
q3 → q4 on input d
q2 → q1 on input c
q4 → q1 on input c
Final DFA
The DFA will have states q0, q1, q2, q3, q4, where q0 is the start state and q1 is both the accepting and
repeating state. The transitions will allow the string to be accepted if it follows the pattern d((a+bd)c)*.
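Such a transition table maps almost directly onto code. The following C sketch is one possible table-driven implementation of the DFA just described, using the same state numbering and the input alphabet {a, b, c, d}; a value of -1 marks the dead (reject) state.

#include <stdio.h>
#include <string.h>

static const char alphabet[] = "abcd";

/* delta[state][symbol]: rows are states q0..q4, columns are a, b, c, d. */
/* -1 means there is no transition (the string is rejected).             */
static const int delta[5][4] = {
    /* q0 */ { -1, -1, -1,  1 },   /* q0 --d--> q1             */
    /* q1 */ {  2,  3, -1, -1 },   /* q1 --a--> q2, --b--> q3  */
    /* q2 */ { -1, -1,  1, -1 },   /* q2 --c--> q1             */
    /* q3 */ { -1, -1, -1,  4 },   /* q3 --d--> q4             */
    /* q4 */ { -1, -1,  1, -1 },   /* q4 --c--> q1             */
};

/* Returns 1 if s matches d((a+bd)c)*, 0 otherwise. */
static int accepts(const char *s) {
    int state = 0;                                  /* start in q0               */
    for (; *s; s++) {
        const char *p = strchr(alphabet, *s);
        if (p == NULL) return 0;                    /* symbol not in alphabet    */
        state = delta[state][(int)(p - alphabet)];
        if (state < 0) return 0;                    /* dead state                */
    }
    return state == 1;                              /* q1 is the accepting state */
}

int main(void) {
    const char *tests[] = { "d", "dac", "dbdc", "dacbdc", "da" };
    for (int i = 0; i < 5; i++)
        printf("%-8s %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}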
A lexical analyzer (or lexer) is a component of a compiler or interpreter that processes the input source
code and converts it into a sequence of tokens. Here are some examples to illustrate how a lexical
analyzer works:
Source Code:
int result = a + b * 5;
2. Output Tokens:
o int → Keyword
o result → Identifier
o = → Assignment Operator
o a → Identifier
o + → Addition Operator
o b → Identifier
o * → Multiplication Operator
o 5 → Numeric Literal
o ; → Semicolon
Explanation:
The lexer classifies int as a keyword; result, a, and b as identifiers; =, +, and * as operators; 5 as a numeric literal; and ; as a punctuation symbol.
Source Code:
name = ""
2. Output Tokens:
o name → Identifier
o = → Assignment Operator
o "" → String Literal
Explanation:
The lexer identifies name as an identifier, the equal sign = as an assignment operator, and the empty string "" as a string literal.
Source Code (Java):
if (x > 10) {
    y = x * 2;
}
2. Output Tokens:
if → Keyword
( → Left Parenthesis
x → Identifier
> → Relational Operator
10 → Numeric Literal
) → Right Parenthesis
{ → Left Brace
y → Identifier
= → Assignment Operator
x → Identifier
* → Multiplication Operator
2 → Numeric Literal
; → Semicolon
} → Right Brace
Explanation:
The lexical analyzer processes the if statement, identifying keywords (if), operators (>, =, *),
literals (10, 2), and symbols ((, ), {, }, ;).
Source Code (Python):
def add(a, b):
    return a + b
2. Output Tokens:
def → Keyword
add → Identifier
( → Left Parenthesis
a → Identifier
, → Comma
b → Identifier
) → Right Parenthesis
: → Colon
return → Keyword
a → Identifier
+ → Addition Operator
b → Identifier
Explanation:
The lexer identifies function-related keywords (def, return), identifiers (add, a, b), and
operators/symbols ((, ), ,, :, +).
Summary
In each example, the lexical analyzer reads the input code, removes irrelevant characters like white
spaces or comments, and produces a sequence of tokens. These tokens are then passed on to the next
stage of compilation or interpretation (usually syntax analysis).