Lexical Analysis
Lexemes   Tokens
(         LPAREN
a         IDENTIFIER
b         IDENTIFIER
)         RPAREN
=         ASSIGNMENT
a         IDENTIFIER
2         INTEGER
;         SEMICOLON
The lexer then groups these characters into tokens, which are the smallest meaningful units in
the code. Tokens can include constants like numbers and strings, operators like + and -,
punctuation marks like commas and semicolons, and keywords like “if” and “while.”
After the lexer scans the text, it creates a stream of tokens. This tokenized format is
necessary for the next steps in processing or compiling the program. Lexical analysis is
also a good time to clean up the text by removing extra spaces or organizing certain
types of data. You can think of lexical analysis as a preparation step before more
complicated NLP tasks begin.
NLP is a field in computer science and artificial intelligence focused on how computers
can understand and communicate with humans. The goal of NLP is to enable
computers to read, understand, and respond to human language in a way that makes
sense.
2. Token
A tokenizer is a tool that breaks down text into individual tokens. Each token has its own
meaning. The tokenizer identifies where one token ends and another begins, which can
change based on the context and rules of the language. Tokenization is usually the first
step in natural language processing.
A lexer, or lexical analyzer, is a more advanced tool that not only tokenizes text but also
classifies these tokens into categories. For example, a lexer can sort tokens into
keywords, operators, and values. It is important for the next stage, called parsing, as it
provides tokens to the parser for further analysis.
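The difference can be illustrated in a few lines of Python (a minimal sketch; the token categories and the regular expression are illustrative, not taken from any particular lexer):
import re

text = "if (x == 5) { y = 10; }"

# Tokenizer: split the text into tokens without classifying them.
tokens = re.findall(r"\w+|==|[^\s\w]", text)
print(tokens)   # ['if', '(', 'x', '==', '5', ')', '{', 'y', '=', '10', ';', '}']

# Lexer: attach a category to each token.
KEYWORDS = {"if", "else", "while"}
def classify(tok):
    if tok in KEYWORDS:
        return "KEYWORD"
    if tok.isdigit():
        return "NUMBER"
    if tok.isidentifier():
        return "IDENTIFIER"
    return "OPERATOR/PUNCTUATION"

print([(tok, classify(tok)) for tok in tokens])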
5. Lexeme
A lexeme is a basic unit of meaning in a language. It represents the core idea of a word.
For example, the words “write,” “wrote,” “writing,” and “written” all relate to the same
lexeme: “write.” In a simple expression like “5 + 6,” the individual elements “5,” “+,” and
“6” are separate lexemes.
The first step involved in lexical analysis is to identify individual symbols in the input.
These symbols include letters, numbers, operators, and special characters. Each
symbol or group of symbols is given a token type, like “number” or “operator.”
The lexer (a tool that performs lexical analysis) is set up to recognize and group specific
inputs. For example, if the input is “apple123,” the lexer may recognize “apple” as a
word token and “123” as a number token. Similarly, keywords like “if” or “while” are
categorized as specific token types.
The loop works like reading a book, one character at a time until it reaches the end of
the code. It goes through the code without missing any character or symbol, making
sure each part is captured.
The switch statement acts as a quick decision-maker. After the loop reads a character
or a group of characters, the switch statement decides what type of token it is, such as
a keyword, number, or operator. This is similar to organizing items into separate boxes
based on their type, making the code easier to understand.
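A minimal sketch of this loop-and-switch idea in Python, using a match statement as the "switch" (the token names and the tiny grammar are illustrative only):
def scan(code):
    tokens = []
    i = 0
    while i < len(code):                      # the loop: one character at a time
        ch = code[i]
        match ch:                             # the switch: decide the token type
            case " " | "\t" | "\n":
                i += 1                        # skip whitespace
            case "+" | "-" | "*" | "/" | "=":
                tokens.append(("OPERATOR", ch))
                i += 1
            case c if c.isdigit():            # collect a whole number
                start = i
                while i < len(code) and code[i].isdigit():
                    i += 1
                tokens.append(("NUMBER", code[start:i]))
            case c if c.isalpha() or c == "_":  # collect a whole word
                start = i
                while i < len(code) and (code[i].isalnum() or code[i] == "_"):
                    i += 1
                tokens.append(("WORD", code[start:i]))
            case _:
                tokens.append(("PUNCTUATION", ch))
                i += 1
    return tokens

print(scan("total = apple123 + 42;"))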
Regular expressions are rules that describe patterns in text. They help define how
different tokens should look, such as an email or a phone number format. The lexer
uses these rules to identify tokens by matching text against these patterns.
Finite automata are like small machines that follow instructions step-by-step. They take
the rules from regular expressions and apply them to the code. If a part of the code
matches a rule, it’s identified as a token. This makes the process of breaking down code
more efficient and accurate.
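For example, the regular expression for an unsigned integer, [0-9]+, corresponds to a two-state machine; a hand-written sketch of such an automaton (illustrative only) might look like this:
# States: "start" (nothing read yet) and "digits" (at least one digit read).
# "digits" is the only accepting state.
TRANSITIONS = {
    ("start", "digit"): "digits",
    ("digits", "digit"): "digits",
}

def is_integer(text):
    state = "start"
    for ch in text:
        kind = "digit" if ch.isdigit() else "other"
        state = TRANSITIONS.get((state, kind))
        if state is None:          # no transition: reject the input
            return False
    return state == "digits"       # accept only in the accepting state

print(is_integer("12345"))   # True
print(is_integer("12a45"))   # False
print(is_integer(""))        # False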
One of the main careers in this area is the NLP engineer. NLP engineers have various
responsibilities, such as creating NLP systems, working with speech systems in AI
applications, developing new NLP algorithms, and improving existing models.
According to Glassdoor’s October 2024 data, an NLP engineer’s average salary
in India is around ₹10 lakh per annum.
Apart from NLP engineers, there are many other related careers that may involve
lexical analysis, depending on your area of focus. These careers include:
Software Engineers
Data Scientists
Machine Learning Engineers
Artificial Intelligence Engineers
Language Engineers
Research Engineers
2. Lexical analysis
A Python program is read by a parser. Input to the parser is a stream
of tokens, generated by the lexical analyzer. This chapter describes how the
lexical analyzer breaks a file into tokens.
Python reads program text as Unicode code points; the encoding of a source
file can be given by an encoding declaration and defaults to UTF-8, see PEP
3120 for details. If the source file cannot be decoded, a SyntaxError is
raised.
2.1.3. Comments
A comment starts with a hash character (#) that is not part of a string literal, and ends at the end
of the physical line. A comment signifies the end of the logical line unless the implicit line
joining rules are invoked. Comments are ignored by the syntax.
2.1.4. Encoding declarations
A comment in the first or second line of the source file can declare the file’s encoding; one
recognized form (used by Vim) is:
# vim:fileencoding=<encoding-name>
If no encoding declaration is found, the default encoding is UTF-8. If the implicit or explicit
encoding of a file is UTF-8, an initial UTF-8 byte-order mark (b'\xef\xbb\xbf') is ignored rather
than being a syntax error.
If an encoding is declared, the encoding name must be recognized by Python (see Standard
Encodings). The encoding is used for all lexical analysis, including string literals, comments and
identifiers.
2.1.5. Explicit line joining
Two or more physical lines may be joined into logical lines using backslash characters: when a
physical line ends in a backslash that is not part of a string literal or comment, it is joined with
the following line to form a single logical line. A line ending in a backslash cannot carry a
comment. A backslash does not continue a comment.
A backslash does not continue a token except for string literals (i.e., tokens other than string
literals cannot be split across physical lines using a backslash). A backslash is illegal elsewhere
on a line outside a string literal.
2.1.6. Implicit line joining
Expressions in parentheses, square brackets or curly braces can be split over more than one
physical line without using backslashes. Implicitly continued lines can carry comments. The
indentation of the continuation lines is not
important. Blank continuation lines are allowed. There is no NEWLINE token between implicit
continuation lines. Implicitly continued lines can also occur within triple-quoted strings (see
below); in that case they cannot carry comments.
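Both joining styles are shown in the following short sketch (the example itself is not from the reference):
# Explicit joining: a backslash at the end of the physical line.
total = 1 + \
        2

# Implicit joining: expressions in parentheses, brackets or braces may span
# lines, and the continuation lines can carry comments.
values = [1,    # first value
          2,    # second value
          3]
print(total, values)   # 3 [1, 2, 3]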
2.1.8. Indentation
Leading whitespace (spaces and tabs) at the beginning of a logical line is used to compute the
indentation level of the line, which in turn is used to determine the grouping of statements.
Tabs are replaced (from left to right) by one to eight spaces such that the total number of
characters up to and including the replacement is a multiple of eight (this is intended to be the
same rule as used by Unix). The total number of spaces preceding the first non-blank character
then determines the line’s indentation. Indentation cannot be split over multiple physical lines
using backslashes; the whitespace up to the first backslash determines the indentation.
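The tab rule can be mimicked in a few lines of Python (a sketch of the calculation, not CPython’s actual implementation):
def indentation_width(line):
    col = 0
    for ch in line:
        if ch == " ":
            col += 1
        elif ch == "\t":
            col += 8 - (col % 8)   # advance to the next multiple of eight
        else:
            break
    return col

print(indentation_width("\tx = 1"))      # 8
print(indentation_width("   \tx = 1"))   # 8: three spaces, then the tab fills up to column 8
print(indentation_width("        \tx"))  # 16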
Indentation is rejected as inconsistent if a source file mixes tabs and spaces in a way that makes
the meaning dependent on the worth of a tab in spaces; a TabError is raised in that case.
A formfeed character may be present at the start of the line; it will be ignored for the indentation
calculations above. Formfeed characters occurring elsewhere in the leading whitespace have an
undefined effect (for instance, they may reset the space count to zero).
The indentation levels of consecutive lines are used to generate INDENT and DEDENT tokens,
using a stack, as follows.
Before the first line of the file is read, a single zero is pushed on the stack; this will never be
popped off again. The numbers pushed on the stack will always be strictly increasing from
bottom to top. At the beginning of each logical line, the line’s indentation level is compared to
the top of the stack. If it is equal, nothing happens. If it is larger, it is pushed on the stack, and
one INDENT token is generated. If it is smaller, it must be one of the numbers occurring on the
stack; all numbers on the stack that are larger are popped off, and for each number popped off a
DEDENT token is generated. At the end of the file, a DEDENT token is generated for each
number remaining on the stack that is larger than zero.
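This behaviour can be observed with the standard tokenize module (a small demonstration, not part of the reference text):
import io
import tokenize

source = (
    "def f(x):\n"
    "    if x:\n"
    "        return 1\n"
    "    return 0\n"
)
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type in (tokenize.INDENT, tokenize.DEDENT):
        print(tokenize.tok_name[tok.type], repr(tok.string))
# INDENT '    '
# INDENT '        '
# DEDENT ''
# DEDENT ''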
Here is an example of a correctly indented piece of Python code:
def perm(l):
    # Compute the list of all permutations of l
    if len(l) <= 1:
        return [l]
    r = []
    for i in range(len(l)):
        s = l[:i] + l[i+1:]
        p = perm(s)
        for x in p:
            r.append(l[i:i+1] + x)
    return r
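The reference pairs this with a second example showing several indentation errors, to which the parenthetical below refers; a reconstruction of that kind of code (the error annotations are comments, and the wording is not verbatim) is:
 def perm(l):                       # error: first line indented
for i in range(len(l)):             # error: not indented
    s = l[:i] + l[i+1:]
        p = perm(l[:i] + l[i+1:])   # error: unexpected indent
        for x in p:
                r.append(l[i:i+1] + x)
            return r                # error: inconsistent dedent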
(Actually, the first three errors are detected by the parser; only the last error is found by the
lexical analyzer — the indentation of return r does not match a level popped off the stack.)
2.3. Identifiers and keywords
The syntax of identifiers in Python is based on the Unicode standard annex UAX-31, with
elaboration and changes as defined below; see also PEP 3131 for further details.
Within the ASCII range (U+0001..U+007F), the valid characters for identifiers include the
uppercase and lowercase letters A through Z, the underscore _ and, except for the first character,
the digits 0 through 9. Python 3.0 introduced additional characters from outside the ASCII range
(see PEP 3131). For these characters, the classification uses the version of the Unicode
Character Database as included in the unicodedata module.
The Unicode category codes used in these definitions stand for:
Lu - uppercase letters
Ll - lowercase letters
Lt - titlecase letters
Lm - modifier letters
Lo - other letters
Nl - letter numbers
Mn - nonspacing marks
Mc - spacing combining marks
Nd - decimal numbers
Pc - connector punctuations
Other_ID_Start - explicit list of characters in PropList.txt to support backwards
compatibility
Other_ID_Continue - likewise
All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers
is based on NFKC.
A non-normative HTML file listing all valid identifier characters for Unicode 15.1.0 can be
found at https://www.unicode.org/Public/15.1.0/ucd/DerivedCoreProperties.txt
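A short illustration of this normalization (the ligature example is mine, not from the reference):
import unicodedata

# U+FB01 is the single character "ﬁ" (LATIN SMALL LIGATURE FI); under NFKC it
# normalizes to the two characters "fi", so the identifiers ﬁle and file name
# the same variable.
print(unicodedata.normalize("NFKC", "\ufb01le"))   # 'file'

namespace = {}
exec("ﬁle = 42", namespace)      # identifier written with the ligature
print(namespace["file"])         # 42: stored under its NFKC-normalized form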
2.3.1. Keywords
The following identifiers are used as reserved words, or keywords of the language, and cannot be
used as ordinary identifiers. They must be spelled exactly as written here:
False      await      else       import     pass
None       break      except     in         raise
True       class      finally    is         return
and        continue   for        lambda     try
as         def        from       nonlocal   while
assert     del        global     not        with
async      elif       if         or         yield
2.3.2. Soft Keywords
Some identifiers are only reserved under specific contexts; these are known as soft keywords.
The identifiers match, case, type and _ can syntactically act as keywords in certain contexts,
but this distinction is done at the parser level, not when tokenizing. As soft keywords, their use
in the grammar is possible while still preserving compatibility with existing code that uses these
names as identifier names.
match, case, and _ are used in the match statement. type is used in the type statement.
2.3.3. Reserved classes of identifiers
Certain classes of identifiers (besides keywords) have special meanings. These classes are
identified by the patterns of leading and trailing underscore characters:
_*
Not imported by from module import *.
_
In a case pattern within a match statement, _ is a soft keyword that denotes a wildcard.
Separately, the interactive interpreter makes the result of the last evaluation available in
the variable _. (It is stored in the builtins module, alongside built-in functions
like print.)
Elsewhere, _ is a regular identifier. It is often used to name “special” items, but it is not
special to Python itself.
Note
The name _ is often used in conjunction with internationalization; refer to the documentation
of the gettext module for more information on this convention.
__*
Class-private names. Names in this category, when used within the context of a class
definition, are re-written to use a mangled form to help avoid name clashes between
“private” attributes of base and derived classes. See section Identifiers (Names).
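A short sketch of the mangling (the class and attribute names are illustrative):
class Base:
    def __init__(self):
        self.__token = "base"        # stored as _Base__token

class Derived(Base):
    def __init__(self):
        super().__init__()
        self.__token = "derived"     # stored as _Derived__token, so no clash

d = Derived()
print(sorted(vars(d)))               # ['_Base__token', '_Derived__token']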
2.4. Literals
Literals are notations for constant values of some built-in types.
2.4.1. String and Bytes literals
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance
of the bytes type instead of the str type. They may only contain ASCII
characters; bytes with a numeric value of 128 or greater must be expressed
with escapes.
Added in version 3.3: The 'rb' prefix of raw bytes literals has been added as
a synonym of 'br'.
A string literal with 'f' or 'F' in its prefix is a formatted string literal;
see f-strings. The 'f' may be combined with 'r', but not with 'b' or 'u',
therefore raw formatted strings are possible, but formatted bytes literals are
not.
In triple-quoted literals, unescaped newlines and quotes are allowed (and are
retained), except that three unescaped quotes in a row terminate the literal. (A
“quote” is the character used to open the literal, i.e. either ' or ".)
Unless an 'r' or 'R' prefix is present, escape sequences in string and bytes
literals are interpreted according to rules similar to those used by Standard C.
The recognized escape sequences are:
Escape Sequence   Meaning                                        Notes
\<newline>        Backslash and newline ignored                  (1)
\\                Backslash (\)
\N{name}          Character named name in the Unicode database   (5)
\uxxxx            Character with 16-bit hex value xxxx           (6)
\Uxxxxxxxx        Character with 32-bit hex value xxxxxxxx       (7)
Notes:
(1) A backslash can be added at the end of a line to ignore the newline:
>>> 'This string will not include \
... backslashes or newline characters.'
'This string will not include backslashes or newline characters.'
Unlike Standard C, all unrecognized escape sequences are left in the string
unchanged, i.e., the backslash is left in the result. (This behavior is useful
when debugging: if an escape sequence is mistyped, the resulting output is
more easily recognized as broken.) It is also important to note that the escape
sequences only recognized in string literals fall into the category of
unrecognized escapes for bytes literals.
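For example (output shown in comments; recent Python versions also emit a SyntaxWarning for the unrecognized escape):
s = "\q"                      # not a recognized escape: the backslash is kept
print(len(s), s)              # 2 \q

print(len("\N{BULLET}"))      # 1: \N{...} is recognized in str literals
print(len(b"\N{BULLET}"))     # 10: ...but not in bytes literals, so it stays verbatim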
Even in a raw literal, quotes can be escaped with a backslash, but the
backslash remains in the result; for example, r"\"" is a valid string literal
consisting of two characters: a backslash and a double quote; r"\" is not a
valid string literal (even a raw string cannot end in an odd number of
backslashes). Specifically, a raw literal cannot end in a single
backslash (since the backslash would escape the following quote character).
Note also that a single backslash followed by a newline is interpreted as those
two characters as part of the literal, not as a line continuation.
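The same point, as runnable code:
print(r"\"")        # prints \" : two characters, a backslash and a double quote
print(len(r"\""))   # 2
# r"\" would be a SyntaxError: a raw literal cannot end in an odd number of backslashes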
2.4.2. String literal concatenation
Multiple adjacent string or bytes literals (delimited by whitespace), possibly using different
quoting conventions, are allowed, and their meaning is the same as their concatenation.
Note that this feature is defined at the syntactical level, but implemented at
compile time. The ‘+’ operator must be used to concatenate string expressions
at run time. Also note that literal concatenation can use different quoting
styles for each component (even mixing raw strings and triple quoted strings),
and formatted string literals may be concatenated with plain string literals.
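For example (a small sketch):
greeting = "Hello, " 'world' "!"      # adjacent literals, mixed quoting styles
path = r"C:\data" "\n"                # a raw literal next to an ordinary one
banner = f"{2 + 2}" " items"          # an f-string next to a plain literal
print(greeting)     # Hello, world!
print(repr(path))   # 'C:\\data\n'
print(banner)       # 4 items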
2.4.3. f-strings
Added in version 3.6.
Escape sequences are decoded like in ordinary string literals (except when a
literal is also marked as a raw string). After decoding, the contents of the
string are parsed according to the f-string grammar described below.
The parts of the string outside curly braces are treated literally, except that any
doubled curly braces '{{' or '}}' are replaced with the corresponding
single curly brace. A single opening curly bracket '{' marks a replacement
field, which starts with a Python expression. To display both the expression
text and its value after evaluation (useful in debugging), an equal
sign '=' may be added after the expression. A conversion field, introduced
by an exclamation point '!' may follow. A format specifier may also be
appended, introduced by a colon ':'. A replacement field ends with a closing
curly bracket '}'.
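For example (a small sketch of the pieces named above):
name = "world"
ratio = 2 / 3
print(f"Hello {name}!")        # a plain replacement field
print(f"{{literal braces}}")   # doubled braces produce single braces
print(f"{name!r:>12}")         # conversion field !r followed by format specifier :>12
print(f"{ratio:.3f}")          # format specifier only: 0.667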
Changed in version 3.12: Prior to Python 3.12, comments were not allowed
inside f-string replacement fields.
When the equal sign '=' is provided, the output will have the expression text,
the '=' and the evaluated value. Spaces after the opening brace '{', within
the expression and after the '=' are all retained in the output. By default,
the '=' causes the repr() of the expression to be provided, unless there is a
format specified. When a format is specified it defaults to the str() of the
expression unless a conversion '!r' is declared.
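For example:
x = 3.14159
print(f"{x=}")       # x=3.14159   (repr() of the expression by default)
print(f"{x = }")     # x = 3.14159 (spaces around '=' are retained)
print(f"{x=:.2f}")   # x=3.14      (a format specifier switches to formatted output)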
The result is then formatted using the format() protocol. The format
specifier is passed to the __format__() method of the expression or
conversion result. An empty string is passed when the format specifier is
omitted. The formatted result is then included in the final value of the whole
string.
Reusing the outer f-string quoting type inside a replacement field is permitted:
>>> a = dict(x=2)
>>> f"abc {a["x"]} def"
'abc 2 def'
Changed in version 3.12: Prior to Python 3.12, reuse of the same quoting type
of the outer f-string inside a replacement field was not possible.
Backslashes are also allowed in replacement fields and are evaluated the same
way as in any other context:
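For example (a minimal sketch; backslashes inside replacement fields require Python 3.12 or later):
names = ["spam", "eggs"]
print(f"{'\n'.join(names)}")   # the \n escape appears inside the replacement field
# spam
# eggs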
See also PEP 498 for the proposal that added formatted string literals,
and str.format(), which uses a related format string mechanism.
Note that numeric literals do not include a sign; a phrase like -1 is actually an
expression composed of the unary operator ‘-’ and the literal 1.
There is no limit for the length of integer literals apart from what can be
stored in available memory.
Underscores are ignored for determining the numeric value of the literal. They
can be used to group digits for enhanced readability. One underscore can
occur between digits, and after base specifiers like 0x.
Note that leading zeros in a non-zero decimal number are not allowed. This is
for disambiguation with C-style octal literals, which Python used before
version 3.0.
Some examples of integer literals:
7     2147483647                        0o177    0b100110111
3     79228162514264337593543950336     0o377    0xdeadbeef
      100_000_000_000                   0b_1110_0101
Changed in version 3.6: Underscores are now allowed for grouping purposes
in literals.
Note that the integer and exponent parts are always interpreted using radix 10.
For example, 077e010 is legal, and denotes the same number as 77e10. The
allowed range of floating-point literals is implementation-dependent. As in
integer literals, underscores are supported for digit grouping.
Changed in version 3.6: Underscores are now allowed for grouping purposes
in literals.
An imaginary literal yields a complex number with a real part of 0.0. Complex
numbers are represented as a pair of floating-point numbers and have the same
restrictions on their range. To create a complex number with a nonzero real
part, add a floating-point number to it, e.g., (3+4j). Some examples of
imaginary literals:
3.14j   10.j   10j   .001j   1e100j   3.14_15_93j
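A few of these literal forms, evaluated (an illustrative sketch):
print(100_000_000_000)      # underscores are ignored: 100000000000
print(0b_1110_0101)         # underscore after the base specifier: 229
print(077e010)              # leading zeros are allowed in float literals: 770000000000.0
print(3.14j, (3+4j).real)   # an imaginary literal, and a complex number with a real part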
2.5. Operators
The following tokens are operators:
+ - * ** / // %
@
<< >> & | ^ ~ :=
< > <= >= == !=
2.6. Delimiters
The following tokens serve as delimiters in the grammar:
( ) [ ] { }
, : ! . ; @ =
-> += -= *= /= //= %=
@= &= |= ^= >>= <<= **=
The period can also occur in floating-point and imaginary literals. A sequence
of three periods has a special meaning as an ellipsis literal. The second half of
the list, the augmented assignment operators, serve lexically as delimiters, but
also perform an operation.
The following printing ASCII characters have special meaning as part of other
tokens or are otherwise significant to the lexical analyzer:
' " # \
The following printing ASCII characters are not used in Python. Their
occurrence outside string literals and comments is an unconditional error:
$ ? `
C program to detect tokens in a C program
Here, we will create a C program to detect tokens in a C program. This is the
lexical analysis phase of the compiler. The lexical analyzer is the part of
the compiler that detects the tokens of the program and sends them to the syntax
analyzer.
A token is the smallest entity of the code; it is either a keyword, an identifier,
a constant, a string literal, or a symbol.
Examples of different types of tokens in C:
Example
Keywords: for, if, include, etc.
Identifiers: variables, functions, etc.
Separators: ‘,’, ‘;’, etc.
Operators: ‘-’, ‘=’, ‘++’, etc.
Example
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
bool isValidDelimiter(char ch) {
    if (ch == ' ' || ch == '+' || ch == '-' || ch == '*' ||
        ch == '/' || ch == ',' || ch == ';' || ch == '>' ||
        ch == '<' || ch == '=' || ch == '(' || ch == ')' ||
        ch == '[' || ch == ']' || ch == '{' || ch == '}')
        return (true);
    return (false);
}
bool isValidOperator(char ch) {
    if (ch == '+' || ch == '-' || ch == '*' ||
        ch == '/' || ch == '>' || ch == '<' ||
        ch == '=')
        return (true);
    return (false);
}
// Returns 'true' if the string is a VALID IDENTIFIER.
bool isvalidIdentifier(char* str) {
    if (str[0] >= '0' && str[0] <= '9')
        return (false);
    if (isValidDelimiter(str[0]) == true)
        return (false);
    return (true);
}
bool isValidKeyword(char* str) {
    if (!strcmp(str, "if") || !strcmp(str, "else") || !strcmp(str, "while") ||
        !strcmp(str, "do") || !strcmp(str, "break") || !strcmp(str, "continue") ||
        !strcmp(str, "int") || !strcmp(str, "double") || !strcmp(str, "float") ||
        !strcmp(str, "return") || !strcmp(str, "char") || !strcmp(str, "case") ||
        !strcmp(str, "sizeof") || !strcmp(str, "long") || !strcmp(str, "short") ||
        !strcmp(str, "typedef") || !strcmp(str, "switch") || !strcmp(str, "unsigned") ||
        !strcmp(str, "void") || !strcmp(str, "static") || !strcmp(str, "struct") ||
        !strcmp(str, "goto"))
        return (true);
    return (false);
}
bool isValidInteger(char* str) {
    int i, len = strlen(str);
    if (len == 0)
        return (false);
    for (i = 0; i < len; i++) {
        if ((str[i] < '0' || str[i] > '9') || (str[i] == '-' && i > 0))
            return (false);
    }
    return (true);
}
bool isRealNumber(char* str) {
    int i, len = strlen(str);
    bool hasDecimal = false;
    if (len == 0)
        return (false);
    for (i = 0; i < len; i++) {
        if (((str[i] < '0' || str[i] > '9') && str[i] != '.') ||
            (str[i] == '-' && i > 0))
            return (false);
        if (str[i] == '.')
            hasDecimal = true;
    }
    return (hasDecimal);
}
char* subString(char* str, int left, int right) {
    int i;
    // Allocate space for the characters in [left, right] plus the terminating '\0'.
    char* subStr = (char*)malloc(sizeof(char) * (right - left + 2));
    for (i = left; i <= right; i++)
        subStr[i - left] = str[i];
    subStr[right - left + 1] = '\0';
    return (subStr);
}
// Walks the string with two indices: 'left' marks the start of the current
// lexeme and 'right' scans ahead until a delimiter is found.
void detectTokens(char* str) {
    int left = 0, right = 0;
    int length = strlen(str);
    while (right <= length && left <= right) {
        if (isValidDelimiter(str[right]) == false)
            right++;
        if (isValidDelimiter(str[right]) == true && left == right) {
            if (isValidOperator(str[right]) == true)
                printf("Valid operator : '%c'\n", str[right]);
            right++;
            left = right;
        } else if (isValidDelimiter(str[right]) == true && left != right
                   || (right == length && left != right)) {
            char* subStr = subString(str, left, right - 1);
            if (isValidKeyword(subStr) == true)
                printf("Valid keyword : '%s'\n", subStr);
            else if (isValidInteger(subStr) == true)
                printf("Valid Integer : '%s'\n", subStr);
            else if (isRealNumber(subStr) == true)
                printf("Real Number : '%s'\n", subStr);
            else if (isvalidIdentifier(subStr) == true
                     && isValidDelimiter(str[right - 1]) == false)
                printf("Valid Identifier : '%s'\n", subStr);
            else if (isvalidIdentifier(subStr) == false
                     && isValidDelimiter(str[right - 1]) == false)
                printf("Invalid Identifier : '%s'\n", subStr);
            left = right;
        }
    }
    return;
}
int main() {
    char str[100] = "float x = a + 1b; ";
    printf("The Program is : '%s'\n", str);
    printf("All Tokens are :\n");
    detectTokens(str);
    return (0);
}
Output
The Program is : 'float x = a + 1b; '
All Tokens are :
Valid keyword : 'float'
Valid Identifier : 'x'
Valid operator : '='
Valid Identifier : 'a'
Valid operator : '+'
Invalid Identifier : '1b'
I'm building a basic compiler in C++. When I call the function Lex, it calls the
function Recognise_Identifier_Keyword twice and then the program ends. I don't
know why the function Lex doesn't call the other two functions. I'm testing the
code with the string if (x == 5) { y = 10 } else { y <= 20; }. Here is the
code:
C++:
#include <iostream>
#include <string>
struct Table
{
    int Position{ 0 };
    char Symbol[254]{ "" };
    int Code{ 0 };
    Table* next;
};
void printirane(point P) {
    cout << "position" << "\t" << "symbol" << "\t" << "token" << "\n";
void main() {
    system("chcp 1251");
    string symbol;
    int token;
    int number;
    do {
        cout << "1) fill the symbolTable. \n";
        cout << "2) print the symbolTable. \n";
        cout << "0) exit. \n";
        cout << "enter a number: ";
        cin >> number;
        cout << "\n";
        switch (number)
        {
        case 1:
            cout << "enter string: ";
            // getline(cin, symbol);
            cin >> symbol;
            Lex(symbol, symbolTable);
            break;
        case 2:
            printirane(symbolTable);
            cout << "\n";
            break;
        }