Lexical Analysis
Lexemes   Tokens
(         LPAREN
a         IDENTIFIER
b         IDENTIFIER
)         RPAREN
=         ASSIGNMENT
a         IDENTIFIER
2         INTEGER
;         SEMICOLON
The lexer then groups these characters into tokens, which are the smallest meaningful units in
the code. Tokens can include constants like numbers and strings, operators like + and -,
punctuation marks like commas and semicolons, and keywords like “if” and “while.”
After the lexer scans the text, it creates a stream of tokens. This tokenized format is
necessary for the next steps in processing or compiling the program. Lexical analysis is
also a good time to clean up the text by removing extra spaces or organizing certain
types of data. You can think of lexical analysis as a preparation step before more
complicated NLP tasks begin.
NLP is a field in computer science and artificial intelligence focused on how computers
can understand and communicate with humans. The goal of NLP is to enable
computers to read, understand, and respond to human language in a way that makes
sense.
2. Token
A tokenizer is a tool that breaks down text into individual tokens. Each token has its own
meaning. The tokenizer identifies where one token ends and another begins, which can
change based on the context and rules of the language. Tokenization is usually the first
step in natural language processing.
A lexer, or lexical analyzer, is a more advanced tool that not only tokenizes text but also
classifies these tokens into categories. For example, a lexer can sort tokens into
keywords, operators, and values. It is important for the next stage, called parsing, as it
provides tokens to the parser for further analysis.
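The difference can be illustrated in a few lines of Python (a minimal sketch; the token categories and the regular expression are illustrative, not taken from any particular lexer):
import re

text = "if (x == 5) { y = 10; }"

# Tokenizer: split the text into tokens without classifying them.
tokens = re.findall(r"\w+|==|[^\s\w]", text)
print(tokens)   # ['if', '(', 'x', '==', '5', ')', '{', 'y', '=', '10', ';', '}']

# Lexer: attach a category to each token.
KEYWORDS = {"if", "else", "while"}
def classify(tok):
    if tok in KEYWORDS:
        return "KEYWORD"
    if tok.isdigit():
        return "NUMBER"
    if tok.isidentifier():
        return "IDENTIFIER"
    return "OPERATOR/PUNCTUATION"

print([(tok, classify(tok)) for tok in tokens])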
5. Lexeme
A lexeme is a basic unit of meaning in a language. It represents the core idea of a word.
For example, the words “write,” “wrote,” “writing,” and “written” all relate to the same
lexeme: “write.” In a simple expression like “5 + 6,” the individual elements “5,” “+,” and
“6” are separate lexemes.
The first step involved in lexical analysis is to identify individual symbols in the input.
These symbols include letters, numbers, operators, and special characters. Each
symbol or group of symbols is given a token type, like “number” or “operator.”
The lexer (a tool that performs lexical analysis) is set up to recognize and group specific
inputs. For example, if the input is “apple123,” the lexer may recognize “apple” as a
word token and “123” as a number token. Similarly, keywords like “if” or “while” are
categorized as specific token types.
The loop works like reading a book, one character at a time until it reaches the end of
the code. It goes through the code without missing any character or symbol, making
sure each part is captured.
The switch statement acts as a quick decision-maker. After the loop reads a character
or a group of characters, the switch statement decides what type of token it is, such as
a keyword, number, or operator. This is similar to organizing items into separate boxes
based on their type, making the code easier to understand.
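A minimal sketch of this loop-and-switch idea in Python, using a match statement as the "switch" (the token names and the tiny grammar are illustrative only):
def scan(code):
    tokens = []
    i = 0
    while i < len(code):                      # the loop: one character at a time
        ch = code[i]
        match ch:                             # the switch: decide the token type
            case " " | "\t" | "\n":
                i += 1                        # skip whitespace
            case "+" | "-" | "*" | "/" | "=":
                tokens.append(("OPERATOR", ch))
                i += 1
            case c if c.isdigit():            # collect a whole number
                start = i
                while i < len(code) and code[i].isdigit():
                    i += 1
                tokens.append(("NUMBER", code[start:i]))
            case c if c.isalpha() or c == "_":  # collect a whole word
                start = i
                while i < len(code) and (code[i].isalnum() or code[i] == "_"):
                    i += 1
                tokens.append(("WORD", code[start:i]))
            case _:
                tokens.append(("PUNCTUATION", ch))
                i += 1
    return tokens

print(scan("total = apple123 + 42;"))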
Regular expressions are rules that describe patterns in text. They help define how
different tokens should look, such as an email or a phone number format. The lexer
uses these rules to identify tokens by matching text against these patterns.
Finite automata are like small machines that follow instructions step-by-step. They take
the rules from regular expressions and apply them to the code. If a part of the code
matches a rule, it’s identified as a token. This makes the process of breaking down code
more efficient and accurate.
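For example, the regular expression for an unsigned integer, [0-9]+, corresponds to a two-state machine; a hand-written sketch of such an automaton (illustrative only) might look like this:
# States: "start" (nothing read yet) and "digits" (at least one digit read).
# "digits" is the only accepting state.
TRANSITIONS = {
    ("start", "digit"): "digits",
    ("digits", "digit"): "digits",
}

def is_integer(text):
    state = "start"
    for ch in text:
        kind = "digit" if ch.isdigit() else "other"
        state = TRANSITIONS.get((state, kind))
        if state is None:          # no transition: reject the input
            return False
    return state == "digits"       # accept only in the accepting state

print(is_integer("12345"))   # True
print(is_integer("12a45"))   # False
print(is_integer(""))        # False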
One of the main careers in this area is the NLP engineer. NLP engineers have various
responsibilities, such as creating NLP systems, working with speech systems in AI
applications, developing new NLP algorithms, and improving existing models.
According to Glassdoor’s October 2024 data, an NLP engineer’s average salary
in India is around ₹10 lakh per annum.
Apart from NLP engineers, there are many other related careers that may involve
lexical analysis, depending on your area of focus. These careers include:
Software Engineers
Data Scientists
Machine Learning Engineers
Artificial Intelligence Engineers
Language Engineers
Research Engineers
2. Lexical analysis
A Python program is read by a parser. Input to the parser is a stream
of tokens, generated by the lexical analyzer. This chapter describes how the
lexical analyzer breaks a file into tokens.
Python reads program text as Unicode code points; the encoding of a source
file can be given by an encoding declaration and defaults to UTF-8, see PEP
3120 for details. If the source file cannot be decoded, a SyntaxError is
raised.
2.1.3. Comments
A comment starts with a hash character (#) that is not part of a string literal, and ends at the end
of the physical line. A comment signifies the end of the logical line unless the implicit line
joining rules are invoked. Comments are ignored by the syntax.
2.1.4. Encoding declarations
A comment in the first or second line of the source file can declare the file’s encoding; one
recognized form (used by Vim) is:
# vim:fileencoding=<encoding-name>
If no encoding declaration is found, the default encoding is UTF-8. If the implicit or explicit
encoding of a file is UTF-8, an initial UTF-8 byte-order mark (b'\xef\xbb\xbf') is ignored rather
than being a syntax error.
If an encoding is declared, the encoding name must be recognized by Python (see Standard
Encodings). The encoding is used for all lexical analysis, including string literals, comments and
identifiers.
2.1.5. Explicit line joining
Two or more physical lines may be joined into logical lines using backslash characters: when a
physical line ends in a backslash that is not part of a string literal or comment, it is joined with
the following line to form a single logical line. A line ending in a backslash cannot carry a
comment. A backslash does not continue a comment.
A backslash does not continue a token except for string literals (i.e., tokens other than string
literals cannot be split across physical lines using a backslash). A backslash is illegal elsewhere
on a line outside a string literal.
2.1.6. Implicit line joining
Expressions in parentheses, square brackets or curly braces can be split over more than one
physical line without using backslashes. Implicitly continued lines can carry comments. The
indentation of the continuation lines is not
important. Blank continuation lines are allowed. There is no NEWLINE token between implicit
continuation lines. Implicitly continued lines can also occur within triple-quoted strings (see
below); in that case they cannot carry comments.
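Both joining styles are shown in the following short sketch (the example itself is not from the reference):
# Explicit joining: a backslash at the end of the physical line.
total = 1 + \
        2

# Implicit joining: expressions in parentheses, brackets or braces may span
# lines, and the continuation lines can carry comments.
values = [1,    # first value
          2,    # second value
          3]
print(total, values)   # 3 [1, 2, 3]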
2.1.8. Indentation
Leading whitespace (spaces and tabs) at the beginning of a logical line is used to compute the
indentation level of the line, which in turn is used to determine the grouping of statements.
Tabs are replaced (from left to right) by one to eight spaces such that the total number of
characters up to and including the replacement is a multiple of eight (this is intended to be the
same rule as used by Unix). The total number of spaces preceding the first non-blank character
then determines the line’s indentation. Indentation cannot be split over multiple physical lines
using backslashes; the whitespace up to the first backslash determines the indentation.
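The tab rule can be mimicked in a few lines of Python (a sketch of the calculation, not CPython’s actual implementation):
def indentation_width(line):
    col = 0
    for ch in line:
        if ch == " ":
            col += 1
        elif ch == "\t":
            col += 8 - (col % 8)   # advance to the next multiple of eight
        else:
            break
    return col

print(indentation_width("\tx = 1"))      # 8
print(indentation_width("   \tx = 1"))   # 8: three spaces, then the tab fills up to column 8
print(indentation_width("        \tx"))  # 16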
Indentation is rejected as inconsistent if a source file mixes tabs and spaces in a way that makes
the meaning dependent on the worth of a tab in spaces; a TabError is raised in that case.
A formfeed character may be present at the start of the line; it will be ignored for the indentation
calculations above. Formfeed characters occurring elsewhere in the leading whitespace have an
undefined effect (for instance, they may reset the space count to zero).
The indentation levels of consecutive lines are used to generate INDENT and DEDENT tokens,
using a stack, as follows.
Before the first line of the file is read, a single zero is pushed on the stack; this will never be
popped off again. The numbers pushed on the stack will always be strictly increasing from
bottom to top. At the beginning of each logical line, the line’s indentation level is compared to
the top of the stack. If it is equal, nothing happens. If it is larger, it is pushed on the stack, and
one INDENT token is generated. If it is smaller, it must be one of the numbers occurring on the
stack; all numbers on the stack that are larger are popped off, and for each number popped off a
DEDENT token is generated. At the end of the file, a DEDENT token is generated for each
number remaining on the stack that is larger than zero.
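This behaviour can be observed with the standard tokenize module (a small demonstration, not part of the reference text):
import io
import tokenize

source = (
    "def f(x):\n"
    "    if x:\n"
    "        return 1\n"
    "    return 0\n"
)
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type in (tokenize.INDENT, tokenize.DEDENT):
        print(tokenize.tok_name[tok.type], repr(tok.string))
# INDENT '    '
# INDENT '        '
# DEDENT ''
# DEDENT ''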
Here is an example of a correctly indented piece of Python code:
def perm(l):
    # Compute the list of all permutations of l
    if len(l) <= 1:
        return [l]
    r = []
    for i in range(len(l)):
        s = l[:i] + l[i+1:]
        p = perm(s)
        for x in p:
            r.append(l[i:i+1] + x)
    return r
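The reference pairs this with a second example showing several indentation errors, to which the parenthetical below refers; a reconstruction of that kind of code (the error annotations are comments, and the wording is not verbatim) is:
 def perm(l):                       # error: first line indented
for i in range(len(l)):             # error: not indented
    s = l[:i] + l[i+1:]
        p = perm(l[:i] + l[i+1:])   # error: unexpected indent
        for x in p:
                r.append(l[i:i+1] + x)
            return r                # error: inconsistent dedent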
(Actually, the first three errors are detected by the parser; only the last error is found by the
lexical analyzer — the indentation of return r does not match a level popped off the stack.)
2.3. Identifiers and keywords
The syntax of identifiers in Python is based on the Unicode standard annex UAX-31, with
elaboration and changes as defined below; see also PEP 3131 for further details.
Within the ASCII range (U+0001..U+007F), the valid characters for identifiers include the
uppercase and lowercase letters A through Z, the underscore _ and, except for the first character,
the digits 0 through 9. Python 3.0 introduced additional characters from outside the ASCII range
(see PEP 3131). For these characters, the classification uses the version of the Unicode
Character Database as included in the unicodedata module.
The Unicode category codes used in these definitions stand for:
Lu - uppercase letters
Ll - lowercase letters
Lt - titlecase letters
Lm - modifier letters
Lo - other letters
Nl - letter numbers
Mn - nonspacing marks
Mc - spacing combining marks
Nd - decimal numbers
Pc - connector punctuations
Other_ID_Start - explicit list of characters in PropList.txt to support backwards
compatibility
Other_ID_Continue - likewise
All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers
is based on NFKC.
A non-normative HTML file listing all valid identifier characters for Unicode 15.1.0 can be
found at https://www.unicode.org/Public/15.1.0/ucd/DerivedCoreProperties.txt
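A short illustration of this normalization (the ligature example is mine, not from the reference):
import unicodedata

# U+FB01 is the single character "ﬁ" (LATIN SMALL LIGATURE FI); under NFKC it
# normalizes to the two characters "fi", so the identifiers ﬁle and file name
# the same variable.
print(unicodedata.normalize("NFKC", "\ufb01le"))   # 'file'

namespace = {}
exec("ﬁle = 42", namespace)      # identifier written with the ligature
print(namespace["file"])         # 42: stored under its NFKC-normalized form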
2.3.1. Keywords
The following identifiers are used as reserved words, or keywords of the language, and cannot be
used as ordinary identifiers. They must be spelled exactly as written here:
False      await      else       import     pass
None       break      except     in         raise
True       class      finally    is         return
and        continue   for        lambda     try
as         def        from       nonlocal   while
assert     del        global     not        with
async      elif       if         or         yield
2.3.2. Soft Keywords
Some identifiers are only reserved under specific contexts; these are known as soft keywords.
The identifiers match, case, type and _ can syntactically act as keywords in certain contexts,
but this distinction is done at the parser level, not when tokenizing. As soft keywords, their use
in the grammar is possible while still preserving compatibility with existing code that uses these
names as identifier names.
match, case, and _ are used in the match statement. type is used in the type statement.
2.3.3. Reserved classes of identifiers
Certain classes of identifiers (besides keywords) have special meanings. These classes are
identified by the patterns of leading and trailing underscore characters:
_*
Not imported by from module import *.
_
In a case pattern within a match statement, _ is a soft keyword that denotes a wildcard.
Separately, the interactive interpreter makes the result of the last evaluation available in
the variable _. (It is stored in the builtins module, alongside built-in functions
like print.)
Elsewhere, _ is a regular identifier. It is often used to name “special” items, but it is not
special to Python itself.
Note
The name _ is often used in conjunction with internationalization; refer to the documentation
of the gettext module for more information on this convention.
__*
Class-private names. Names in this category, when used within the context of a class
definition, are re-written to use a mangled form to help avoid name clashes between
“private” attributes of base and derived classes. See section Identifiers (Names).
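A short sketch of the mangling (the class and attribute names are illustrative):
class Base:
    def __init__(self):
        self.__token = "base"        # stored as _Base__token

class Derived(Base):
    def __init__(self):
        super().__init__()
        self.__token = "derived"     # stored as _Derived__token, so no clash

d = Derived()
print(sorted(vars(d)))               # ['_Base__token', '_Derived__token']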
2.4. Literals
Literals are notations for constant values of some built-in types.
2.4.1. String and Bytes literals
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance
of the bytes type instead of the str type. They may only contain ASCII
characters; bytes with a numeric value of 128 or greater must be expressed
with escapes.
Added in version 3.3: The 'rb' prefix of raw bytes literals has been added as
a synonym of 'br'.
A string literal with 'f' or 'F' in its prefix is a formatted string literal;
see f-strings. The 'f' may be combined with 'r', but not with 'b' or 'u',
therefore raw formatted strings are possible, but formatted bytes literals are
not.
In triple-quoted literals, unescaped newlines and quotes are allowed (and are
retained), except that three unescaped quotes in a row terminate the literal. (A
“quote” is the character used to open the literal, i.e. either ' or ".)
Unless an 'r' or 'R' prefix is present, escape sequences in string and bytes
literals are interpreted according to rules similar to those used by Standard C.
The recognized escape sequences are:
Escape Sequence   Meaning                                        Notes
\<newline>        Backslash and newline ignored                  (1)
\\                Backslash (\)
\N{name}          Character named name in the Unicode database   (5)
\uxxxx            Character with 16-bit hex value xxxx           (6)
\Uxxxxxxxx        Character with 32-bit hex value xxxxxxxx       (7)
Notes:
(1) A backslash can be added at the end of a line to ignore the newline:
>>> 'This string will not include \
... backslashes or newline characters.'
'This string will not include backslashes or newline characters.'
Unlike Standard C, all unrecognized escape sequences are left in the string
unchanged, i.e., the backslash is left in the result. (This behavior is useful
when debugging: if an escape sequence is mistyped, the resulting output is
more easily recognized as broken.) It is also important to note that the escape
sequences only recognized in string literals fall into the category of
unrecognized escapes for bytes literals.
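For example (output shown in comments; recent Python versions also emit a SyntaxWarning for the unrecognized escape):
s = "\q"                      # not a recognized escape: the backslash is kept
print(len(s), s)              # 2 \q

print(len("\N{BULLET}"))      # 1: \N{...} is recognized in str literals
print(len(b"\N{BULLET}"))     # 10: ...but not in bytes literals, so it stays verbatim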
Even in a raw literal, quotes can be escaped with a backslash, but the
backslash remains in the result; for example, r"\"" is a valid string literal
consisting of two characters: a backslash and a double quote; r"\" is not a
valid string literal (even a raw string cannot end in an odd number of
backslashes). Specifically, a raw literal cannot end in a single
backslash (since the backslash would escape the following quote character).
Note also that a single backslash followed by a newline is interpreted as those
two characters as part of the literal, not as a line continuation.
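The same point, as runnable code:
print(r"\"")        # prints \" : two characters, a backslash and a double quote
print(len(r"\""))   # 2
# r"\" would be a SyntaxError: a raw literal cannot end in an odd number of backslashes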
2.4.2. String literal concatenation
Multiple adjacent string or bytes literals (delimited by whitespace), possibly using different
quoting conventions, are allowed, and their meaning is the same as their concatenation.
Note that this feature is defined at the syntactical level, but implemented at
compile time. The ‘+’ operator must be used to concatenate string expressions
at run time. Also note that literal concatenation can use different quoting
styles for each component (even mixing raw strings and triple quoted strings),
and formatted string literals may be concatenated with plain string literals.
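For example (a small sketch):
greeting = "Hello, " 'world' "!"      # adjacent literals, mixed quoting styles
path = r"C:\data" "\n"                # a raw literal next to an ordinary one
banner = f"{2 + 2}" " items"          # an f-string next to a plain literal
print(greeting)     # Hello, world!
print(repr(path))   # 'C:\\data\n'
print(banner)       # 4 items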
2.4.3. f-strings
Added in version 3.6.
Escape sequences are decoded like in ordinary string literals (except when a
literal is also marked as a raw string). After decoding, the contents of the
string are parsed according to the f-string grammar described below.
The parts of the string outside curly braces are treated literally, except that any
doubled curly braces '{{' or '}}' are replaced with the corresponding
single curly brace. A single opening curly bracket '{' marks a replacement
field, which starts with a Python expression. To display both the expression
text and its value after evaluation (useful in debugging), an equal
sign '=' may be added after the expression. A conversion field, introduced
by an exclamation point '!' may follow. A format specifier may also be
appended, introduced by a colon ':'. A replacement field ends with a closing
curly bracket '}'.
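For example (a small sketch of the pieces named above):
name = "world"
ratio = 2 / 3
print(f"Hello {name}!")        # a plain replacement field
print(f"{{literal braces}}")   # doubled braces produce single braces
print(f"{name!r:>12}")         # conversion field !r followed by format specifier :>12
print(f"{ratio:.3f}")          # format specifier only: 0.667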
Changed in version 3.12: Prior to Python 3.12, comments were not allowed
inside f-string replacement fields.
When the equal sign '=' is provided, the output will have the expression text,
the '=' and the evaluated value. Spaces after the opening brace '{', within
the expression and after the '=' are all retained in the output. By default,
the '=' causes the repr() of the expression to be provided, unless there is a
format specified. When a format is specified it defaults to the str() of the
expression unless a conversion '!r' is declared.
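For example:
x = 3.14159
print(f"{x=}")       # x=3.14159   (repr() of the expression by default)
print(f"{x = }")     # x = 3.14159 (spaces around '=' are retained)
print(f"{x=:.2f}")   # x=3.14      (a format specifier switches to formatted output)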
The result is then formatted using the format() protocol. The format
specifier is passed to the __format__() method of the expression or
conversion result. An empty string is passed when the format specifier is
omitted. The formatted result is then included in the final value of the whole
string.
Reusing the outer f-string quoting type inside a replacement field is permitted:
>>> a = dict(x=2)
>>> f"abc {a["x"]} def"
'abc 2 def'
Changed in version 3.12: Prior to Python 3.12, reuse of the same quoting type
of the outer f-string inside a replacement field was not possible.
Backslashes are also allowed in replacement fields and are evaluated the same
way as in any other context:
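For example (a minimal sketch; backslashes inside replacement fields require Python 3.12 or later):
names = ["spam", "eggs"]
print(f"{'\n'.join(names)}")   # the \n escape appears inside the replacement field
# spam
# eggs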
See also PEP 498 for the proposal that added formatted string literals,
and str.format(), which uses a related format string mechanism.
Note that numeric literals do not include a sign; a phrase like -1 is actually an
expression composed of the unary operator ‘-’ and the literal 1.
There is no limit for the length of integer literals apart from what can be
stored in available memory.
Underscores are ignored for determining the numeric value of the literal. They
can be used to group digits for enhanced readability. One underscore can
occur between digits, and after base specifiers like 0x.
Note that leading zeros in a non-zero decimal number are not allowed. This is
for disambiguation with C-style octal literals, which Python used before
version 3.0.
Some examples of integer literals:
7     2147483647                        0o177    0b100110111
3     79228162514264337593543950336     0o377    0xdeadbeef
      100_000_000_000                   0b_1110_0101
Changed in version 3.6: Underscores are now allowed for grouping purposes
in literals.
Note that the integer and exponent parts are always interpreted using radix 10.
For example, 077e010 is legal, and denotes the same number as 77e10. The
allowed range of floating-point literals is implementation-dependent. As in
integer literals, underscores are supported for digit grouping.
Changed in version 3.6: Underscores are now allowed for grouping purposes
in literals.
An imaginary literal yields a complex number with a real part of 0.0. Complex
numbers are represented as a pair of floating-point numbers and have the same
restrictions on their range. To create a complex number with a nonzero real
part, add a floating-point number to it, e.g., (3+4j). Some examples of
imaginary literals:
3.14j   10.j   10j   .001j   1e100j   3.14_15_93j
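A few of these literal forms, evaluated (an illustrative sketch):
print(100_000_000_000)      # underscores are ignored: 100000000000
print(0b_1110_0101)         # underscore after the base specifier: 229
print(077e010)              # leading zeros are allowed in float literals: 770000000000.0
print(3.14j, (3+4j).real)   # an imaginary literal, and a complex number with a real part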
2.5. Operators
The following tokens are operators:
+ - * ** / // %
@
<< >> & | ^ ~ :=
< > <= >= == !=
2.6. Delimiters
The following tokens serve as delimiters in the grammar:
( ) [ ] { }
, : ! . ; @ =
-> += -= *= /= //= %=
@= &= |= ^= >>= <<= **=
The period can also occur in floating-point and imaginary literals. A sequence
of three periods has a special meaning as an ellipsis literal. The second half of
the list, the augmented assignment operators, serve lexically as delimiters, but
also perform an operation.
The following printing ASCII characters have special meaning as part of other
tokens or are otherwise significant to the lexical analyzer:
' " # \
The following printing ASCII characters are not used in Python. Their
occurrence outside string literals and comments is an unconditional error:
$ ? `
C program to detect tokens in a C program
Here, we will create a C program to detect tokens in a C program. This is the
lexical analysis phase of the compiler. The lexical analyzer is the part of
the compiler that detects the tokens of the program and sends them to the syntax
analyzer.
A token is the smallest entity of the code; it is either a keyword, an identifier,
a constant, a string literal, or a symbol.
Examples of different types of tokens in C:
Example
Keywords: for, if, include, etc.
Identifiers: variables, functions, etc.
Separators: ‘,’, ‘;’, etc.
Operators: ‘-’, ‘=’, ‘++’, etc.
Example
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
bool isValidDelimiter(char ch) {
    if (ch == ' ' || ch == '+' || ch == '-' || ch == '*' ||
        ch == '/' || ch == ',' || ch == ';' || ch == '>' ||
        ch == '<' || ch == '=' || ch == '(' || ch == ')' ||
        ch == '[' || ch == ']' || ch == '{' || ch == '}')
        return (true);
    return (false);
}
bool isValidOperator(char ch) {
    if (ch == '+' || ch == '-' || ch == '*' ||
        ch == '/' || ch == '>' || ch == '<' ||
        ch == '=')
        return (true);
    return (false);
}
// Returns 'true' if the string is a VALID IDENTIFIER.
bool isvalidIdentifier(char* str) {
    if (str[0] >= '0' && str[0] <= '9')
        return (false);
    if (isValidDelimiter(str[0]) == true)
        return (false);
    return (true);
}
bool isValidKeyword(char* str) {
    if (!strcmp(str, "if") || !strcmp(str, "else") || !strcmp(str, "while") ||
        !strcmp(str, "do") || !strcmp(str, "break") || !strcmp(str, "continue") ||
        !strcmp(str, "int") || !strcmp(str, "double") || !strcmp(str, "float") ||
        !strcmp(str, "return") || !strcmp(str, "char") || !strcmp(str, "case") ||
        !strcmp(str, "sizeof") || !strcmp(str, "long") || !strcmp(str, "short") ||
        !strcmp(str, "typedef") || !strcmp(str, "switch") || !strcmp(str, "unsigned") ||
        !strcmp(str, "void") || !strcmp(str, "static") || !strcmp(str, "struct") ||
        !strcmp(str, "goto"))
        return (true);
    return (false);
}
bool isValidInteger(char* str) {
    int i, len = strlen(str);
    if (len == 0)
        return (false);
    for (i = 0; i < len; i++) {
        if ((str[i] < '0' || str[i] > '9') || (str[i] == '-' && i > 0))
            return (false);
    }
    return (true);
}
bool isRealNumber(char* str) {
    int i, len = strlen(str);
    bool hasDecimal = false;
    if (len == 0)
        return (false);
    for (i = 0; i < len; i++) {
        if (((str[i] < '0' || str[i] > '9') && str[i] != '.') ||
            (str[i] == '-' && i > 0))
            return (false);
        if (str[i] == '.')
            hasDecimal = true;
    }
    return (hasDecimal);
}
char* subString(char* str, int left, int right) {
    int i;
    // Allocate space for the characters in [left, right] plus the terminating '\0'.
    char* subStr = (char*)malloc(sizeof(char) * (right - left + 2));
    for (i = left; i <= right; i++)
        subStr[i - left] = str[i];
    subStr[right - left + 1] = '\0';
    return (subStr);
}
// Walks the string with two indices: 'left' marks the start of the current
// lexeme and 'right' scans ahead until a delimiter is found.
void detectTokens(char* str) {
    int left = 0, right = 0;
    int length = strlen(str);
    while (right <= length && left <= right) {
        if (isValidDelimiter(str[right]) == false)
            right++;
        if (isValidDelimiter(str[right]) == true && left == right) {
            if (isValidOperator(str[right]) == true)
                printf("Valid operator : '%c'\n", str[right]);
            right++;
            left = right;
        } else if (isValidDelimiter(str[right]) == true && left != right
                   || (right == length && left != right)) {
            char* subStr = subString(str, left, right - 1);
            if (isValidKeyword(subStr) == true)
                printf("Valid keyword : '%s'\n", subStr);
            else if (isValidInteger(subStr) == true)
                printf("Valid Integer : '%s'\n", subStr);
            else if (isRealNumber(subStr) == true)
                printf("Real Number : '%s'\n", subStr);
            else if (isvalidIdentifier(subStr) == true
                     && isValidDelimiter(str[right - 1]) == false)
                printf("Valid Identifier : '%s'\n", subStr);
            else if (isvalidIdentifier(subStr) == false
                     && isValidDelimiter(str[right - 1]) == false)
                printf("Invalid Identifier : '%s'\n", subStr);
            left = right;
        }
    }
    return;
}
int main() {
    char str[100] = "float x = a + 1b; ";
    printf("The Program is : '%s'\n", str);
    printf("All Tokens are :\n");
    detectTokens(str);
    return (0);
}
Output
The Program is : 'float x = a + 1b; '
All Tokens are :
Valid keyword : 'float'
Valid Identifier : 'x'
Valid operator : '='
Valid Identifier : 'a'
Valid operator : '+'
Invalid Identifier : '1b'
I'm building a basic compiler in C++. When I call the function Lex, it calls the
function Recognise_Identifier_Keyword twice and then the program ends. I don't
know why the function Lex doesn't call the other two functions. I'm testing the
code with the string if (x == 5) { y = 10 } else { y <= 20; }. Here is the
code:
C++:
#include <iostream>
#include <string>
struct Table
{
    int Position{ 0 };
    char Symbol[254]{ "" };
    int Code{ 0 };
    Table* next;
};
void printirane(point P) {
    cout << "position" << "\t" << "symbol" << "\t" << "token" << "\n";
void main() {
    system("chcp 1251");
    string symbol;
    int token;
    int number;
    do {
        cout << "1) fill the symbolTable. \n";
        cout << "2) print the symbolTable. \n";
        cout << "0) exit. \n";
        cout << "enter a number: ";
        cin >> number;
        cout << "\n";
        switch (number)
        {
        case 1:
            cout << "enter string: ";
            // getline(cin, symbol);
            cin >> symbol;
            Lex(symbol, symbolTable);
            break;
        case 2:
            printirane(symbolTable);
            cout << "\n";
            break;
        }