Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source code produced by
language preprocessors, written in the form of sentences, and breaks it into a series of tokens,
removing any whitespace and comments in the source code. If the lexical analyzer finds an
invalid token, it generates an error. The lexical analyzer works closely with the syntax analyzer:
it reads character streams from the source code, checks for legal tokens, and passes the data
to the syntax analyzer on demand.
Tokens
A lexeme is a sequence of characters (alphanumeric) that forms a token. There are predefined
rules for every lexeme to be identified as a valid token. These rules are defined by grammar
rules, by means of a pattern. A pattern explains what can be a token, and these patterns are
defined by means of regular expressions.
In a programming language, keywords, constants, identifiers, strings, numbers, operators, and
punctuation symbols can be considered tokens.
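For example, in the C statement int x = 10; the lexemes int, x, =, 10, and ; belong to the token
classes keyword, identifier, operator, number, and punctuation respectively. The sketch below
shows one common way to represent these classes; the enum and struct names are illustrative,
not from any real compiler:

    #include <stdio.h>

    /* Illustrative token classes for the statement "int x = 10;". */
    enum TokenType { KEYWORD, IDENTIFIER, OPERATOR, NUMBER, PUNCTUATION };

    struct Token {
        enum TokenType type;   /* the token class */
        const char    *lexeme; /* the matched character sequence */
    };

    int main(void) {
        struct Token tokens[] = {
            { KEYWORD,     "int" },
            { IDENTIFIER,  "x"   },
            { OPERATOR,    "="   },
            { NUMBER,      "10"  },
            { PUNCTUATION, ";"   },
        };
        const char *names[] = { "keyword", "identifier", "operator",
                                "number", "punctuation" };
        for (int i = 0; i < 5; i++)
            printf("%-4s -> %s\n", tokens[i].lexeme, names[tokens[i].type]);
        return 0;
    }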
Specifications of Tokens
Let us understand how language theory defines the following terms:
Alphabets
Any finite set of symbols is called an alphabet. For example, {0,1} is the binary alphabet,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z, A-Z} is the alphabet of
English letters.
Strings
Any finite sequence of symbols drawn from an alphabet is called a string. The length of a string
is the total number of occurrences of symbols in it, e.g., the length of the string tutorialspoint
is 14 and is denoted by |tutorialspoint| = 14. A string having no symbols, i.e., a string of zero
length, is known as the empty string and is denoted by ε (epsilon).
Special symbols
A typical high-level language contains special symbols such as arithmetic operators (+, -, *, /),
punctuation (comma, semicolon, parentheses), assignment (=), comparison operators (==, !=,
<, >), logical operators (&&, ||, !), and preprocessor symbols (#).
Language
A language is a set of strings over some finite alphabet. Computer languages are such sets,
and mathematically, set operations can be performed on them. Regular languages, including all
finite languages, can be described by means of regular expressions.
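For example, if L1 = {a, b} and L2 = {c} are languages over the alphabet {a, b, c}, then the
union is L1 ∪ L2 = {a, b, c} and the concatenation is L1L2 = {ac, bc}.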
Regular Expressions
The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes
that belong to the language at hand. It searches for the patterns defined by the language rules.
Regular expressions can express regular languages by defining patterns for strings of symbols.
The grammar defined by regular expressions is known as a regular grammar, and the language
defined by a regular grammar is known as a regular language.
A regular expression is an important notation for specifying patterns. Each pattern matches a
set of strings, so regular expressions serve as names for sets of strings. Programming language
tokens can be described by regular languages. The specification of regular expressions is an
example of a recursive definition. Regular languages are easy to understand and have efficient
implementations.
There are a number of algebraic laws that are obeyed by regular expressions, which can be
used to manipulate regular expressions into equivalent forms.
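As a concrete sketch, the program below uses the POSIX regex library to classify a few lexemes
against two illustrative token patterns, one for identifiers and one for unsigned integer
constants. The patterns are assumptions for the sake of the example, not part of any particular
language definition:

    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        /* Illustrative token patterns: identifiers and unsigned
           integer constants, anchored to match the whole lexeme. */
        regex_t id_re, num_re;
        regcomp(&id_re, "^[A-Za-z_][A-Za-z0-9_]*$", REG_EXTENDED);
        regcomp(&num_re, "^[0-9]+$", REG_EXTENDED);

        const char *lexemes[] = { "abs_zero_Kelvin", "273", "float", "27a" };
        for (int i = 0; i < 4; i++) {
            /* Note: a keyword like "float" also matches the identifier
               pattern; real lexers separate keywords with a table lookup. */
            if (regexec(&id_re, lexemes[i], 0, NULL, 0) == 0)
                printf("%-16s -> identifier\n", lexemes[i]);
            else if (regexec(&num_re, lexemes[i], 0, NULL, 0) == 0)
                printf("%-16s -> number\n", lexemes[i]);
            else
                printf("%-16s -> no match\n", lexemes[i]);
        }
        regfree(&id_re);
        regfree(&num_re);
        return 0;
    }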
What is a Lexeme?
A lexeme is an actual string of characters that matches a pattern and generates a token,
e.g., "float", "abs_zero_Kelvin", "=", "-", "273", ";".
Lexical analysis is the first phase of the compiler, where the lexical analyzer operates as an
interface between the source code and the rest of the phases of the compiler. It reads the input
characters of the source program, groups them into lexemes, and produces a token for each
lexeme. The tokens are sent to the parser for syntax analysis.
If the lexical analyzer is implemented as a separate pass in the compiler, it may need an
intermediate file to hold its output, from which the parser would then take its input. To eliminate
the need for the intermediate file, the lexical analyzer and the syntax analyzer (parser) are often
grouped into the same pass, where the lexical analyzer operates either under the control of the
parser or as a subroutine called by the parser.
The lexical analyzer also interacts with the symbol table while passing tokens to the parser.
Whenever a token is discovered, the lexical analyzer returns a representation of that token to
the parser. If the token is a simple construct such as a parenthesis, comma, or colon, it returns
an integer code. If the token is a more complex item, such as an identifier or another token with
a value, the value is also passed to the parser.
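The fragment below sketches this convention; the token codes and the install routine are
hypothetical stand-ins for whatever representation a real compiler uses. Simple tokens are
returned as plain integer codes, while an identifier is paired with its symbol-table index:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical token codes; single-character tokens reuse their
       character value, identifiers get a code above the char range. */
    enum { TOK_ID = 256, TOK_COMMA = ',', TOK_LPAREN = '(', TOK_COLON = ':' };

    #define MAX_SYMS 64
    static char symtab[MAX_SYMS][32];  /* toy symbol table: names only */
    static int  nsyms = 0;

    /* Install a name, returning its index; reuse the entry if present.
       No bounds checking: this is a sketch, not production code. */
    static int install(const char *name) {
        for (int i = 0; i < nsyms; i++)
            if (strcmp(symtab[i], name) == 0) return i;
        strcpy(symtab[nsyms], name);
        return nsyms++;
    }

    int main(void) {
        /* A simple token is just an integer code... */
        printf("simple token code: %d\n", TOK_COMMA);
        /* ...while an identifier carries its symbol-table index. */
        int attr = install("abs_zero_Kelvin");
        printf("token code: %d, symbol-table index: %d\n", TOK_ID, attr);
        return 0;
    }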
The lexical analyzer separates the characters of the source language into groups that logically
belong together, called tokens. A token includes a token name, which is an abstract symbol that
defines a type of lexical unit, and an optional attribute value. Tokens can be identifiers,
keywords, constants, operators, and punctuation symbols such as commas and parentheses. A
rule that describes the set of input strings for which the same token is produced as output is
called a pattern.
The lexical analyzer may also perform secondary tasks. For example, it can keep track of all
newline characters so that it can associate a line number with each error message. It can also
implement the expansion of macros when a macro preprocessor is used on the source program.
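As a small illustration of the newline-tracking task (the toy source string and the choice of '@'
as an invalid character are assumptions for the example):

    #include <stdio.h>

    /* A minimal sketch of newline tracking: the lexer counts '\n'
       characters so any error message can carry a line number. */
    int main(void) {
        const char *source = "int a;\nint @b;\n";  /* '@' is illegal here */
        int line = 1;
        for (const char *p = source; *p != '\0'; p++) {
            if (*p == '\n')
                line++;                /* keep the line counter current */
            else if (*p == '@')        /* toy example of a bad character */
                printf("error: invalid character '@' at line %d\n", line);
        }
        return 0;
    }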
Input Buffering
Without buffering, the lexical analyzer would have to access secondary memory each time to
identify tokens, which is time-consuming and costly. So, the input string is stored in a buffer
and then scanned by the lexical analyzer.
The lexical analyzer scans the input string from left to right one character at a time to identify
tokens. It uses two pointers to scan tokens −
Begin Pointer (bptr) − It points to the beginning of the token being read.
Look Ahead Pointer (lptr) − It moves ahead to search for the end of the token.
Both pointers start at the beginning of the string, which is stored in the buffer.
For example, while scanning the declaration int a, the character (a blank space) beyond the
token "int" has to be examined before the token "int" can be determined. After the token "int"
is processed, both pointers are set to the start of the next token ('a'), and this process is
repeated for the whole program.
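A minimal sketch of this two-pointer scan, run on the statement int a; is shown below. The
tokenization rules are simplified assumptions for the example, treating any run of alphanumeric
characters or underscores as a single token:

    #include <ctype.h>
    #include <stdio.h>

    int main(void) {
        const char *input = "int a;";
        const char *begin = input;          /* begin pointer (bptr) */

        while (*begin != '\0') {
            if (isspace((unsigned char)*begin)) { begin++; continue; }
            const char *lookahead = begin;  /* look-ahead pointer (lptr) */
            if (isalnum((unsigned char)*lookahead) || *lookahead == '_') {
                /* extend over the whole identifier/keyword/number */
                while (isalnum((unsigned char)*lookahead) || *lookahead == '_')
                    lookahead++;
            } else {
                lookahead++;    /* single-character token such as ';' */
            }
            printf("token: %.*s\n", (int)(lookahead - begin), begin);
            begin = lookahead;  /* both pointers move to the next token */
        }
        return 0;
    }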
A buffer can be divided into two halves. If the look-ahead pointer moves beyond the first half,
the second half is filled with new characters to be read. If the look-ahead pointer moves towards
the right end of the second half, the first half is refilled with new characters, and so on.
Buffer Pairs
A specialized buffering technique can decrease the amount of overhead needed to process an
input character. It uses two buffers, each N characters in size, which are reloaded alternately.
Two pointers, lexemeBegin and forward, are maintained. lexemeBegin points to the start of the
current lexeme being discovered. forward scans ahead until a match for a pattern is found; once
the next lexeme is determined, forward is set to the character at its right end. After the lexeme
is processed, lexemeBegin is set to the character immediately after the lexeme just found.
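The sketch below implements this scheme in C for input read from a FILE stream. The function
and variable names are hypothetical, and a '\0' sentinel marks the end of each half, so the
sketch assumes the input contains no NUL bytes. Production lexers typically use a dedicated
sentinel character and must also handle lexemes that straddle the boundary between the two
halves, which this sketch omits:

    #include <stdio.h>

    #define N 4096                  /* size of each buffer half */

    static char buffer[2][N + 1];   /* two halves, each with a sentinel slot */
    static int  current = 0;        /* which half forward is scanning */
    static char *forward;           /* the look-ahead pointer */

    static void load_half(FILE *in, int half) {
        size_t n = fread(buffer[half], 1, N, in);
        buffer[half][n] = '\0';     /* place the sentinel after the data */
    }

    /* Return the next input character, reloading the other half when
       the forward pointer reaches a sentinel. */
    static int next_char(FILE *in) {
        if (*forward == '\0') {
            current = 1 - current;      /* switch halves ... */
            load_half(in, current);     /* ... and refill the new one */
            forward = buffer[current];
            if (*forward == '\0')       /* refill got nothing: end of input */
                return EOF;
        }
        return (unsigned char)*forward++;
    }

    int main(void) {
        load_half(stdin, 0);
        forward = buffer[0];
        long count = 0;
        while (next_char(stdin) != EOF)
            count++;
        printf("read %ld characters\n", count);
        return 0;
    }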