Unit 2
1. Introduction to Lexical Analysis
Lexical analysis is the first phase of the compilation process in computer science and programming
language theory. It is also known as scanning or tokenization. The main objective of lexical analysis is
to break down the source code of a program into smaller units called tokens. These tokens represent the
fundamental building blocks of a programming language, such as keywords, identifiers, literals, and
operators.
1. Scanning: The source code is read character by character, and the scanner identifies sequences of
characters that form a token. It skips whitespace and comments that do not affect the structure of the
program.
2. Tokenization: The scanner groups characters together to form tokens based on predefined patterns or
regular expressions. Each token has a specific meaning in the programming language. Common types of
tokens include keywords like "if" and "while," identifiers like variable names, literals like numbers and
strings, and operators like "+" and "=".
3. Symbol Table: As the tokens are identified, a symbol table is typically maintained. The symbol table
is a data structure that keeps track of identifiers and their associated information, such as their data types
and memory locations.
4. Output: The output of the lexical analysis phase is a sequence of tokens, usually represented as a
stream of token names or token codes, along with any additional information such as the lexeme (the actual
character sequence that forms the token) and, if applicable, the location in the source code (line number,
column number) where the token was found, as sketched below.
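As a concrete sketch (the type and field names here are illustrative, not taken from any particular compiler), a token and a simplified symbol-table entry in a hand-written C lexer might be represented like this:

#include <stdio.h>

/* Illustrative token kinds for a tiny language. */
typedef enum { TK_IDENTIFIER, TK_NUMBER, TK_OPERATOR, TK_DELIMITER } TokenKind;

/* A token records its kind, the lexeme, and where it was found. */
typedef struct {
    TokenKind kind;
    char lexeme[64];    /* the matched character sequence */
    int line, column;   /* 1-based source position */
} Token;

/* A much-simplified symbol-table entry, maintained alongside scanning. */
typedef struct {
    char name[64];      /* the identifier */
    char type[16];      /* e.g. "int", once known */
    int offset;         /* e.g. a memory location or frame offset */
} Symbol;

/* Print a token the way a scanner trace might. */
static void print_token(const Token *t) {
    static const char *names[] = { "IDENTIFIER", "NUMBER", "OPERATOR", "DELIMITER" };
    printf("%s(\"%s\") at %d:%d\n", names[t->kind], t->lexeme, t->line, t->column);
}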
The token stream produced by the lexical analyzer is then passed to the next phase of the compiler
(usually the syntax analysis or parsing phase) to create an abstract syntax tree (AST) and check the
syntactic correctness of the source code. The AST is then used for further processing, optimization, and
eventually code generation to produce the executable program.
Figure 1: Interaction of Lexical Analyzer with Parser
• Compiler efficiency is improved. A large amount of time is spent reading the source program and
partitioning it into tokens. Buffering techniques are used for reading input characters and
processing tokens, which speed up the performance of the compiler.
• Compiler portability is enhanced.
2. Tokens, Patterns, Lexemes:
Token: Token is a sequence of characters in the input that form a meaningful word. In most languages,
the tokens fall into these categories:
• Keywords
• Operators
• Identifiers
• Constants
• Literal strings
• Punctuation.
Pattern: A pattern is a rule describing the set of strings in the input for which a given token is produced
as output.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a
token.
For example, consider the statement `x = 10 + y;`:
• x is an identifier token.
Pattern for an identifier: [a-zA-Z_][a-zA-Z0-9_]*
Lexeme: x
• = is an operator token.
Pattern for the assignment operator: =
Lexeme: =
• 10 is a number token.
Pattern for a number: [0-9]+
Lexeme: 10
• + is an operator token.
Pattern for the addition operator: +
Lexeme: +
• y is an identifier token.
Pattern for an identifier: [a-zA-Z_][a-zA-Z0-9_]*
Lexeme: y
• ; is a delimiter token.
Pattern for the semicolon delimiter: ;
Lexeme: ;
So, the token stream produced by the lexical analysis phase for this code snippet would be:
IDENTIFIER("x")
OPERATOR("=")
NUMBER("10")
OPERATOR("+")
IDENTIFIER("y")
DELIMITER(";")
3. Lexical Errors
A lexical error occurs when the scanner cannot match a sequence of input characters against any token
pattern. Common lexical errors include:
• Invalid Characters: If the source code contains characters that are not recognized by the programming
language or are not part of the defined regular expressions, the lexical analyzer will raise an error for
each invalid character.
• Unterminated Strings: If a string literal is not properly terminated with a closing quotation mark, it
will result in an error. For example, `"Hello, World!` is an unterminated string literal.
• Unterminated Comments: If a comment is not properly terminated, it can lead to lexical errors. For
example, in some languages the comment might begin with `/*` but not have a closing `*/`, causing the
rest of the code to be treated as a comment.
• Incorrect Identifiers: If an identifier does not follow the language's rules for naming variables or uses
reserved keywords, it will lead to a lexical error.
• Ambiguous Tokens: If a particular sequence of characters can be interpreted as multiple token types,
the lexer may not be able to decide the correct token, leading to ambiguity and potential errors.
• Integer Overflow: If a numeric literal exceeds the range that can be represented by the language's data
types, it will lead to an integer overflow error.
• Invalid Numeric Literals: Numeric literals that do not follow the language's rules for representing
numbers, such as having multiple decimal points, may result in lexical errors.
• Incomplete Operators: Some languages have multi-character operators like `<=`, `>=`, `==`, etc. If the
code contains an incomplete or unrecognized operator, it will lead to a lexical error.
• Missing Semicolons: In languages that require statements to end with semicolons, missing semicolons
at the end of statements will result in errors.
• Reserved Words Usage: If the code uses reserved words or keywords in an inappropriate context, the
lexical analyzer may raise errors.
During the lexical analysis phase, the lexer typically reports these errors and tries to recover as best as
possible to continue processing the rest of the source code. The error reporting mechanism may vary
depending on the compiler or programming language being used. In some cases, the lexer may halt
processing after encountering the first lexical error, while in other cases, it may continue scanning the entire
source code and report all errors in one go.
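As a sketch of this "report and recover" behaviour (the function names and globals here are illustrative, not from any particular compiler), the error path of a hand-written scanner in C often looks like this:

#include <stdio.h>

static int g_error_count = 0;

/* Report a lexical error with its source location; the caller then
 * skips the offending character and resumes scanning, so that all
 * errors in the file can be collected in one pass. */
static void lex_error(int line, int col, char bad) {
    g_error_count++;
    fprintf(stderr, "line %d, col %d: lexical error: invalid character '%c'\n",
            line, col, bad);
}

/* After scanning, the driver decides whether to stop compilation. */
static int had_errors(void) { return g_error_count > 0; }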
4. Lex Specification Example
The following Lex specification recognizes the tokens from the example above:
%{
/* This section may include code or declarations that will be
   copied into the generated lexer. */
#include <stdio.h>
%}
/* Regular definitions */
DIGIT [0-9]
LETTER [a-zA-Z]
IDENTIFIER {LETTER}({LETTER}|{DIGIT})*
NUMBER {DIGIT}+
ASSIGNMENT_OP =
ADDITION_OP \+
SUBTRACTION_OP -
SEMICOLON ;
/* Rules section */
%%
{IDENTIFIER}     { printf("IDENTIFIER(%s)\n", yytext); }
{NUMBER}         { printf("NUMBER(%s)\n", yytext); }
{ASSIGNMENT_OP}  { printf("ASSIGNMENT_OP\n"); }
{ADDITION_OP}    { printf("ADDITION_OP\n"); }
{SUBTRACTION_OP} { printf("SUBTRACTION_OP\n"); }
{SEMICOLON}      { printf("SEMICOLON\n"); }
[ \t\n]+         { /* Skip whitespace */ }
.                { /* Ignore unrecognized characters */ }
%%
// Custom functions or code can be included after the second '%%' section.
In this example, we use `%{ ... %}` to include C code that will be copied directly to the generated
lexical analyzer (lexer). The section after `%{ ... %}` is for defining regular definitions, which are named
patterns that can be used later in the rules section.
In the rules section, we define how to recognize different tokens based on the regular definitions. For
example, `{IDENTIFIER}` represents the pattern defined as `IDENTIFIER`, which matches any valid
identifier according to the specified regular expression. Similarly, `{NUMBER}` matches any sequence of
digits, and so on.
When a lexical analyzer is generated from this specification, it will recognize and print tokens based
on the patterns defined in the rules section. For example, given the input `x = 10 + y;`, the lexer would
produce the following token stream:
IDENTIFIER(x)
ASSIGNMENT_OP
NUMBER(10)
ADDITION_OP
IDENTIFIER(y)
SEMICOLON
The generated lexer skips whitespace via the `[ \t\n]+` rule and silently ignores any other unrecognized
characters via the final `.` rule (note that `.` in Lex does not match a newline, which is why newlines are
listed explicitly). This way, it focuses only on the meaningful tokens defined in the specification.
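Assuming the specification above is saved as tokens.l, a typical build with flex (the file names and flags shown are the common defaults, not mandated by the text) looks like:

flex tokens.l
cc lex.yy.c -o scanner -lfl
echo "x = 10 + y;" | ./scanner

The -lfl library supplies default main and yywrap functions; some systems use -ll instead.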
5. REGULAR EXPRESSIONS
A regular expression is a formula that describes a set of strings. The components of a regular
expression are:
x        the character x
.        any character, usually except newline
[xyz]    any one of the characters x, y, or z
R?       an R or nothing (i.e., an optional R)
R1R2     an R1 followed by an R2
A token is either a single string or one of a collection of strings of a certain type. If we view the
set of strings in each token class as a language, we can use regular-expression notation to
describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or digits.
In regular-expression notation we would write:
identifier = letter (letter | digit)*
Here are the rules that define the regular expressions over an alphabet Σ:
• ε is a regular expression denoting {ε}, that is, the language containing only the empty
string.
• For each a in Σ, a is a regular expression denoting {a}, the language with only one
string, consisting of the single symbol a.
• If R and S are regular expressions, then R | S denotes L(R) ∪ L(S), RS denotes the
concatenation L(R)L(S), and R* denotes (L(R))*, the Kleene closure of L(R). For example,
(a | b)* denotes the set of all strings of a's and b's, including the empty string.
6. REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular expressions and to define
regular expressions using these names as if they were symbols.
Identifiers are the set of strings of letters and digits beginning with a letter. The following regular
definitions provide a precise specification for this class of strings.
Example 1:
ab*|cd? is equivalent to (a(b*)) | (c(d?))
Pascal identifier:
letter → A | B | …… | Z | a | b | …… | z
digit → 0 | 1 | 2 | …. | 9
id → letter (letter | digit)*
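Outside of Lex, the same identifier pattern can be tested with the POSIX regex API in C. This is only a sketch showing the pattern in action (the sample strings are made up); it is not part of any lexer generator.

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* Anchored Pascal-style identifier: a letter, then letters or digits. */
    regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED | REG_NOSUB);

    const char *samples[] = { "x", "count2", "2bad", "a_b" };
    for (int i = 0; i < 4; i++) {
        int ok = regexec(&re, samples[i], 0, NULL, 0) == 0;
        printf("%-8s %s\n", samples[i], ok ? "identifier" : "not an identifier");
    }
    regfree(&re);
    return 0;
}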
7. Recognition of tokens:
We have learned how to express patterns using regular expressions. Now we must study how to take the
patterns for all the needed tokens and build a piece of code that examines the input string and
finds a prefix that is a lexeme matching one of the patterns. Consider the following grammar fragment:
stmt → if expr then stmt
| if expr then stmt else stmt
| ε
expr → term relop term
| term
term → id
| number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is
"equals" and <> is "not equals", because this presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens
as far as the lexical analyzer is concerned. The patterns for these tokens are described using the
following regular definitions:
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing the
"token" ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of
the same names. Token ws is different from the other tokens in that, when we recognize it, we do
not return it to the parser, but rather restart the lexical analysis from the character that follows the
whitespace. It is the following token that gets returned to the parser.
Lexeme    Token Name    Attribute Value
any ws    -             -
if        if            -
then      then          -
else      else          -
<=        relop         LE
=         relop         EQ
<>        relop         NE
8. TRANSITION DIAGRAM:
A transition diagram has a collection of nodes or circles, called states. Each state represents a
condition that could occur during the process of scanning the input looking for a lexeme that
matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled by a
symbol or set of symbols.
If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled
by a. If we find such an edge, we advance the forward pointer and enter the state of the transition
diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has
been found, although the actual lexeme may not consist of all positions between the
lexemeBegin and forward pointers. We always indicate an accepting state by a double
circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we shall
additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge labeled "start"
entering from nowhere. The transition diagram always begins in the start state before any input
symbols have been used.
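As an illustration (following the classic textbook treatment of the relop diagram, with illustrative names), the transition diagram for relop can be coded in C as one function whose branches correspond to the diagram's states; the starred accepting states correspond to the ungetc retractions:

#include <stdio.h>

typedef enum { LT, LE, EQ, NE, GT, GE, ERR } Relop;

/* Recognize a Pascal-style relational operator from stdin by
 * simulating the relop transition diagram. */
Relop get_relop(void) {
    int c = getchar();
    if (c == '<') {
        c = getchar();
        if (c == '=') return LE;          /* lexeme "<=" */
        if (c == '>') return NE;          /* lexeme "<>" */
        ungetc(c, stdin);                 /* starred state: retract one char */
        return LT;                        /* lexeme "<"  */
    }
    if (c == '=') return EQ;              /* lexeme "="  */
    if (c == '>') {
        c = getchar();
        if (c == '=') return GE;          /* lexeme ">=" */
        ungetc(c, stdin);                 /* starred state: retract one char */
        return GT;                        /* lexeme ">"  */
    }
    ungetc(c, stdin);
    return ERR;                           /* not a relop */
}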
• A finite automaton can be deterministic (DFA) or non-deterministic (NFA). This means
that we may use either a deterministic or a non-deterministic automaton as a lexical analyzer.
• Both deterministic and non-deterministic finite automata recognize exactly the regular sets.
• In a DFA, for each symbol a and state s, there is at most one edge labeled a leaving s; i.e., the
transition function maps a state-symbol pair to a single state (not to a set of states).
12. Converting RE to NFA (Thompson's Construction)
• This is one way to convert a regular expression into an NFA.
• It guarantees that the resulting NFA will have exactly one final state and one start state.
• If N(r1) and N(r2) are the NFAs for regular expressions r1 and r2, the constructions for
r1 | r2, r1 r2, and r1* combine them by adding new start/final states and ε-transitions.
Example: For the RE (a|b)*a, the NFA construction is sketched below.
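Since the original figure is not reproduced here, a compact equivalent NFA for (a|b)*a (smaller than the full Thompson construction, but accepting the same language) can be given as a transition table:

State    a         b        Accepting?
→ 0      {0, 1}    {0}      no
  1      {}        {}       yes

State 0 loops on both a and b (matching (a|b)*) and additionally guesses, on reading an a, that this is the final a of the input.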
13. Converting NFA to DFA (Subset Construction)
We merge together NFA states by looking at them from the point of view of the input
characters:
• From the point of view of the input, any two states that are connected by an ε-transition
may as well be the same, since we can move from one to the other without
consuming any character. Thus states which are connected by an ε-transition will be
represented by the same state in the DFA.
• If it is possible to have multiple transitions based on the same symbol, then we can regard a
transition on a symbol as moving from a state to a set of states (i.e., the union of all those
states reachable by a transition on the current symbol). Thus these states will be
combined into a single DFA state.
To perform this operation, let us define two functions:
• The ε-closure function takes a state and returns the set of states reachable from it based on
(zero or more) ε-transitions. Note that this will always include the state itself. We should
be able to get from a state to any state in its ε-closure without consuming any input.
• The function move takes a state and a character, and returns the set of states reachable
by one transition on this character.
We can generalise both these functions to apply to sets of states by taking the union of
the application to the individual states.
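A minimal sketch of these two functions in C, using bit masks for state sets (everything here is illustrative; a real implementation would also need the worklist loop that enumerates the DFA states):

#include <stdint.h>

#define MAX_STATES 32                     /* NFA states are bits in a uint32_t */

/* Transition tables, to be filled in by the NFA builder (omitted here):
 * eps[s]      = set of states reachable from s by one epsilon-transition
 * delta[s][c] = set of states reachable from s on input character c     */
static uint32_t eps[MAX_STATES];
static uint32_t delta[MAX_STATES][256];

/* epsilon-closure of a set of states: grow until nothing new is added.
 * The result always contains the input states themselves. */
static uint32_t eps_closure(uint32_t set) {
    uint32_t result = set, prev;
    do {
        prev = result;
        for (int s = 0; s < MAX_STATES; s++)
            if (result & (1u << s))
                result |= eps[s];
    } while (result != prev);
    return result;
}

/* move: union of the one-step transitions on c from every state in set. */
static uint32_t move_set(uint32_t set, unsigned char c) {
    uint32_t result = 0;
    for (int s = 0; s < MAX_STATES; s++)
        if (set & (1u << s))
            result |= delta[s][c];
    return result;
}

/* One step of the subset construction: the successor of DFA state D
 * (a set of NFA states) on character c is eps_closure(move(D, c)). */
static uint32_t dfa_step(uint32_t D, unsigned char c) {
    return eps_closure(move_set(D, c));
}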
14. The Lex Tool
A Lex specification has three sections, separated by %%: declarations, translation rules, and
auxiliary procedures. The translation rules take the form:
p1 {action 1}
p2 {action 2}
p3 {action 3}
……
where each p is a regular expression and each action is a program fragment describing
what action the lexical analyzer should take when the pattern p matches a lexeme. In Lex
the actions are written in C.
The third section holds whatever auxiliary procedures are needed by the
actions. Alternatively, these procedures can be compiled separately and loaded with the
lexical analyzer.
Note: You can refer to the sample Lex program given on page 109 of chapter 3 of the
book Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman for more
clarity.
15. Input Buffering
The lexical analyzer scans the characters of the source program one at a time to discover tokens.
Because a large amount of time can be consumed scanning characters, specialized buffering
techniques have been developed to reduce the amount of overhead required to process an input
character. Two such techniques are:
1. Buffer pairs
2. Sentinels
Often, however, many characters beyond the next token may have to be examined before the
next token itself can be determined. For this and other reasons, it is desirable for the lexical
analyzer to read its input from an input buffer. Consider a buffer divided into two halves of,
say, 100 characters each. One pointer marks the beginning of the token being discovered; a
lookahead pointer scans ahead of the beginning point until the token is discovered. We view the
position of each pointer as being between the character last read and the character next to be read.
In practice, each buffering scheme adopts one convention: either a pointer is at the symbol last
read, or at the symbol it is ready to read.
The distance which the lookahead pointer may have to travel past the actual token may be
large. For example, in a PL/I program we may see
DECLARE (ARG1, ARG2, …, ARGn)
without knowing whether DECLARE is a keyword or an array name until we see the character
that follows the right parenthesis.
In either case, the token itself ends at the second E. If the lookahead pointer travels
beyond the buffer half in which it began, the other half must be loaded with the next
characters from the source file. Since the buffer is of limited size, there is an implied
constraint on how much lookahead can be used before the next token is discovered. In
the above example, if the lookahead traveled to the left half and all the way through the
left half to the middle, we could not reload the right half, because we would lose
characters that had not yet been grouped into tokens. While we can make the buffer
larger if we choose, or use another buffering scheme, we cannot ignore the fact that the
amount of lookahead is limited.
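A minimal sketch in C of the buffer-pairs scheme with sentinels (following the textbook idea; the buffer size, names, and I/O details are illustrative). Planting an EOF byte after each half lets the scanner test for "end of half" and "end of input" with the same single comparison per character; the sketch assumes the EOF byte does not occur in the source text.

#include <stdio.h>

#define HALF 1024                         /* size of each buffer half */

static char buf[2 * HALF + 2];            /* two halves, one sentinel slot each */
static char *forward;                     /* lookahead pointer */
static FILE *src;

/* Load one half from the source file and plant the sentinel after the
 * last character read (a mid-half sentinel marks real end of input). */
static void load_half(char *half) {
    size_t n = fread(half, 1, HALF, src);
    half[n] = (char)EOF;                  /* sentinel */
}

static void init_buffer(FILE *f) {
    src = f;
    load_half(buf);                       /* fill the first half */
    forward = buf;
}

/* Return the next input character with one comparison in the common case. */
static int next_char(void) {
    char c = *forward++;
    if (c != (char)EOF) return (unsigned char)c;          /* fast path */
    if (forward == buf + HALF + 1) {                      /* end of first half */
        load_half(buf + HALF + 1);                        /* reload second half */
        return next_char();
    }
    if (forward == buf + 2 * HALF + 2) {                  /* end of second half */
        load_half(buf);                                   /* reload first half */
        forward = buf;
        return next_char();
    }
    return EOF;                           /* sentinel inside a half: real EOF */
}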