
BCSE307L_COMPILER DESIGN

INTRODUCTION TO COMPILATION AND LEXICAL ANALYSIS


Dr. B.V. Baiju
Assistant Professor, SCOPE
VIT, Vellore
Structure and Phases of a Compiler
Compiler
• A compiler is a translator that converts a high-level language into machine language.
• The high-level language is written by a developer, and the machine language can be understood by the processor.
• The compiler also reports errors in the source program to the programmer.
• The main purpose of a compiler is to translate code written in one language into another without changing the meaning of the program.

• The compilation is divided into two phases:


– Analysis (Machine Independent/Language Dependent)
– Synthesis (Machine Dependent/Language-Independent)
1. Analysis (Front end of a compiler)
• The analysis phase reads the source program, splits it into multiple tokens, and constructs an intermediate representation of the source program.
• It also checks the program for errors such as lexical and syntax errors.
• The analysis part also collects information about the source program and stores it in a data structure called a symbol table.
• The symbol table is used throughout the compilation process.
2. Synthesis (Back end of a compiler)
• It takes the output of the analysis phase (the intermediate representation and the symbol table) and produces the target machine-level code.
Phases of compilation
• The compilation process consists of a sequence of phases.
• Each phase takes the source program in one representation and produces output in another representation.
• Each phase takes its input from the previous stage.
• The symbol table, which stores information about the entire source program, is used by all phases of the compiler.
a. Lexical Analysis
• The first phase of a compiler is called lexical analysis or scanning.
• The lexical analyzer reads the stream of characters making up the source
program and groups the characters into meaningful sequences called
lexemes.
• Scanning proceeds left to right, character by character.
• The primary functions of this phase are:
– Identify the lexical units in the source code
– Classify lexical units into classes such as constants and reserved words, and enter them in different tables; comments in the source program are ignored
– Report any token that is not part of the language
• For each lexeme, the lexical analyzer produces as output a token of the
form
(token-name, attribute-value)
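As an illustration of this grouping, here is a minimal sketch in C (not the course's code) of a scanner loop that splits a hard-coded input string into lexemes and prints (token-name, attribute-value) pairs; the token names id, number, and op are assumptions for this example:

#include <stdio.h>
#include <ctype.h>

int main(void) {
    const char *p = "total = count + rate * 5";          /* sample source text */
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* skip whitespace */
        if (isalpha((unsigned char)*p)) {                    /* identifier lexeme */
            char lexeme[64]; int n = 0;
            while (isalnum((unsigned char)*p) && n < 63) lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("(id, \"%s\")\n", lexeme);   /* attribute: symbol-table key */
        } else if (isdigit((unsigned char)*p)) {             /* number lexeme */
            int val = 0;
            while (isdigit((unsigned char)*p)) val = val * 10 + (*p++ - '0');
            printf("(number, %d)\n", val);      /* attribute: integer value */
        } else {
            printf("(op, '%c')\n", *p++);       /* single-character operator */
        }
    }
    return 0;
}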
b. Syntax Analysis
• The second phase of the compiler is syntax analysis or parsing.
• It determines whether or not the text follows the expected format.
• The main aim of this phase is to check whether the source code written by the programmer is syntactically correct.
• Syntax analysis applies the grammar rules of the specific programming language, constructing a parse tree with the help of the tokens.
• List of tasks performed in this phase:
– Obtain tokens from the lexical analyzer
– Check whether the expression is syntactically correct or not
– Report all syntax errors
– Construct a hierarchical structure known as a parse tree
c. Semantic Analysis
• Semantic analysis checks the semantic consistency of the code.
• It uses the syntax tree of the previous phase along with the symbol
table to verify that the given source code is semantically consistent.
• It also checks whether the code is conveying an appropriate meaning.
• An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands.
• Functions of the semantic analysis phase are:
– Store the type information gathered in the symbol table or the syntax tree
– Perform type checking
– Report a semantic error when there is a type mismatch and no type-conversion rule satisfies the desired operation
– Collect type information and check for type compatibility
– Check whether the source language permits the operands or not
Example
float x = 20.2;
float y = x*30;
The semantic analyzer will typecast the integer 30 to float 30.0 before
multiplication
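A minimal sketch in C of the type-checking and coercion step described above; the Type enum and the check_mul function are illustrative assumptions, not the course's code:

#include <stdio.h>

typedef enum { T_INT, T_FLOAT } Type;

/* Result type of lhs * rhs: when the operand types differ, a conceptual
   int-to-float conversion is inserted and the result is widened to float. */
Type check_mul(Type lhs, Type rhs) {
    if (lhs == rhs) return lhs;
    printf("coercion: int operand converted to float\n");
    return T_FLOAT;
}

int main(void) {
    /* float y = x * 30;  -> x is float, the constant 30 is int */
    Type result = check_mul(T_FLOAT, T_INT);
    printf("result type: %s\n", result == T_FLOAT ? "float" : "int");
    return 0;
}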
d. Intermediate Code Generation
• In intermediate code generation, the compiler translates the source code into an intermediate code.
• The intermediate code lies between the high-level language and the machine language.
• The intermediate code should be generated in such a way that it can easily be translated into the target machine code.
• The two most important kinds of intermediate representations are:
– Trees, i.e. "parse trees" and "syntax trees".
– A linear representation, i.e. "three-address code".
Example
total = count + rate * 5
The intermediate code, using the three-address code method, is:
t1 := int_to_float(5)
t2 := rate * t1
t3 := count + t2
total := t3
e. Code Optimization
• Code optimization is an optional phase.
• It improves the intermediate code so that the output program runs faster and takes less space.
• It removes unnecessary lines of code and rearranges the sequence of statements to speed up program execution, for example:
– Removing unreachable code and getting rid of unused variables
– Moving loop-invariant statements out of the loop
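As a sketch of what this phase could do to the running example from the previous slide (a typical textbook improvement, not a prescribed output), the compile-time conversion of the constant 5 can be folded and the copy through t3 eliminated:

t1 := rate * 5.0
total := count + t1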
f. Code Generation
• Code generation is the last and final phase of a compiler.
• It takes its input from the code optimization phase and produces the target code or object code as a result.
• The objective of this phase is to allocate storage and generate relocatable machine code.
• It also allocates memory locations for the variables.
• The instructions in the intermediate code are converted into machine instructions.
• This phase converts the optimized intermediate code into the target language (machine code).
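Continuing the running example, the final target code might look like the following register-style sketch; the instruction names and registers are assumptions, since the real output depends entirely on the target machine:

LDF  R2, rate       ; load rate into register R2
MULF R2, R2, #5.0   ; R2 = rate * 5.0
LDF  R1, count      ; load count into R1
ADDF R1, R1, R2     ; R1 = count + rate * 5.0
STF  total, R1      ; store the result into total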
Use of symbol tables in different phases
Lexical Analysis
• It is the first phase of the compiler.
• The input to lexical analysis is the source code.
• It scans the high-level language (HLL) source code line by line.
• After taking the source code as input, it breaks it into valid tokens, removing whitespace and comments from the source code.
• If there are any invalid tokens present in the source code, it reports an error.
• The output of lexical analysis is a sequence of tokens, which is then sent to syntax analysis as input.

program gcd (input, output);
var i, j : integer;
begin
  read (i, j);
  while i <> j do
    if i > j then i := i - j else j := j - i;
  writeln (i)
end.

The same program as a stream of tokens after scanning:

program gcd ( input , output ) ;
var i , j : integer ; begin
read ( i , j ) ; while
i <> j do if i > j
then i := i - j else j
:= j - i ; writeln ( i
) end .
LEXEME
• The sequence of characters matched by a pattern to form the corresponding token, or the sequence of input characters that comprises a single token, is called a lexeme.
• Lexemes are the words of the source program.
• In preprocessing, comments, preprocessor directives, macros, and whitespace are removed.
• Example: in the expression
c = a + b * 5
the lexemes are c, =, a, +, b, * and 5.
TOKENS
• Token is a sequence of characters in the input that form a meaningful
word.
• A token is a pair consisting of a token name and an optional attribute
value.
• The token names are the input symbols that the parser processes
• In most languages, the tokens fall into these categories:
– Keywords: for, while, if, etc.
– Identifiers: variable names, function names
– Operators: '+', '++', '-', etc.
– Separators: ',' ';'
Example of Non-Tokens:
• Comments
• Preprocessor directive
• Macros
• Blanks
• Tabs
• Newline
Example: Consider the following code that is fed to the lexical analyzer:

#include <stdio.h>
int maximum(int x, int y)
{
  // This will compare 2 numbers
  if (x > y)
    return x;
  else {
    return y;
  }
}

Tokens:

Lexeme | Token
int | Keyword
maximum | Identifier
( | Operator
int | Keyword
x | Identifier
, | Operator
int | Keyword
y | Identifier
) | Operator
{ | Operator
if | Keyword

Non-Tokens:

Type | Examples
Comment | // This will compare 2 numbers
Pre-processor directive | #include <stdio.h>
Pre-processor directive | #define NUMS 8,9
Macro | NUMS
Whitespace | \n \b \t
• Example: Find out the lexemes and tokens for the given code.

#include<iostream>
// example
int main(){
  int a, b;
  a = 10;
  return 0;
}

TOKENS:

LEXEME | TOKEN
int | keyword
main | identifier
( | operator
) | operator
{ | operator
int | keyword
a | identifier
, | operator
b | identifier
; | operator
a | identifier
= | assignment symbol
10 | number
; | operator
return | keyword
0 | number
; | operator
} | operator

NON-TOKENS:

TYPE | EXAMPLE
preprocessor directive | #include<iostream>
comment | // example
Example
1. For the given program, find out the valid tokens.

int main()
{
  // 2 variables
  int a, b;
  a = 10;
  return 0;
}

Valid tokens:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';' 'a' '=' '10' ';' 'return' '0' ';' '}'
2. Count the number of tokens:

int main()
{
  int a = 10, b = 20;
  printf("sum is :%d",a+b);
  return 0;
}

Total number of tokens: 27
3. Count the number of tokens:

int max(int i);

Total number of tokens: 7
int, max, (, int, i, ), ;

4. Find the number of tokens in the following C statement:

printf("i=%d, &i=%x", i, &i);

Total number of tokens:
a. 3
b. 26
c. 10
d. 20

Answer: c. 10 (printf, (, "i=%d, &i=%x", comma, i, comma, &, i, ), ;)
PATTERN
• A pattern is a rule describing the set of character sequences (lexemes) that can form a particular token.
• It can be defined by regular expressions or grammar rules.
• It is the description of a class of tokens.
• In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword.

Token | Lexeme | Pattern
ID | x, y, n0 | Letter followed by letters and digits
NUM | -123, 1.456e-5 | Any numeric constant
IF | if | if
LPAREN | ( | (
LITERAL | "Hello" | Any string of characters (except ") between " and "
Criteria: Definition
– Token: a sequence of characters that is treated as a unit and cannot be broken down further.
– Lexeme: a sequence of characters in the source code that is matched by the predefined language rules for a lexeme to be specified as a valid token.
– Pattern: the set of rules that a scanner follows to create a token.

Criteria: Interpretation of type Keyword
– Token: all the reserved keywords of the language
– Lexeme: int, goto
– Pattern: the sequence of characters that make the keyword.

Criteria: Interpretation of type Identifier
– Token: name of a variable, function, etc.
– Lexeme: main, a
– Pattern: must start with a letter, followed by letters or digits.

Criteria: Interpretation of type Operator
– Token: all the operators are considered tokens.
– Lexeme: +, =
– Pattern: +, =

Criteria: Interpretation of type Punctuation
– Token: each kind of punctuation is considered a token (e.g. semicolon, bracket, comma).
– Lexeme: (, ), {, }
– Pattern: (, ), {, }

Criteria: Interpretation of type Literal
– Token: a grammar rule or boolean literal.
– Lexeme: "Welcome to GeeksforGeeks!"
– Pattern: any string of characters (except ' ') between " and "
Attributes for Tokens
• The lexical analyzer collects information about tokens into their
associated attributes.
Token <Token_name, optional_attribute>
• The tokens influence parsing decisions and attributes influence the
translation of tokens.
• Usually a token has a single attribute i.e. pointer to the symbol table
entry in which the information about the token is kept.
• The token names and associated attribute values for the Fortran statement
E = M * C ** 2
are written below as a sequence of pairs:

Lexeme | <Token, token_attribute>
E | <id, pointer to symbol-table entry for E>
= | <assign_op>
M | <id, pointer to symbol-table entry for M>
* | <mult_op>
C | <id, pointer to symbol-table entry for C>
** | <exp_op>
2 | <number, integer value 2>
SPECIFICATIONS OF TOKENS
• Specification of tokens depends on the pattern of the lexeme.
• Regular expressions are used to specify the different types of patterns that can actually form tokens.

Strings and Languages

Operation on Languages

Regular Expression

Regular Definition
Strings and Languages

1. String
• An alphabet (or character class) is a finite set of symbols, denoted Σ.

Example:
– The set of digits Σ = {0, 1} forms the binary alphabet.
– ASCII, used in almost every computer, encodes the letter A over the alphabet {0, 1} as A = 01000001.
– Σ = {a, b, …, z} is the set of lower-case letters.

• Symbols can be letters, digits, and punctuation.

• A string is a finite sequence of symbols drawn from Σ.
– From Σ = {a, b} we can derive any number of strings: a, ab, ba, aab, …
– From Σ = {0, 1} the possible strings are 0, 01, 1, 10, 100, 010, …
2. Language
• A language is a set, or a collection, of strings over some fixed alphabet.
L = {0, 01, 1, 10, 100, 010, …}
L = {a, b, ab, ba, abb, …}
• The empty set ∅ is a language, and so is {ε}, the set containing only the empty string.
Operations on strings
Length of a string
• The length of a string is the number of symbols in it.
• A string is represented by the letter 's', and |s| denotes its length.
s = banana, |s| = 6
s = 1100, |s| = 4
Empty string
• The empty string, i.e. the string of length 0, is represented by ε.
• It does not contain any characters.
|ε| = 0
Prefix of s: a string obtained by removing zero or more trailing symbols of string s.
Example: ban is a prefix of banana. For s = abcd the prefixes are ε, a, ab, abc, abcd.

Suffix of s: a string formed by deleting zero or more of the leading symbols of s.
Example: nana is a suffix of banana. For s = abcd the suffixes are ε, d, cd, bcd, abcd.

Substring of s: a string obtained by deleting a prefix and a suffix from s.
Example: nan is a substring of banana. For s = banana the substrings include ε, nan, na, anan.

Proper prefix, suffix, or substring of s: any nonempty string x that is a prefix, suffix, or substring of s such that s ≠ x.
Example: for s = abcd, the proper prefixes are a, ab, abc; the proper suffixes are d, cd, bcd; proper substrings include bcd, abc, cd, ab.

Subsequence of s: any string formed by deleting zero or more not necessarily contiguous symbols from s.
Example: baaa is a subsequence of banana. For s = abcd the subsequences include abd, bcd, bd.
Operations on Languages

Union of L and M (L ∪ M):
L ∪ M = { s | s is in L or s is in M }
Example: if L = {a, b} and M = {c, d}, then L ∪ M = {a, b, c, d}

Concatenation of L and M (LM):
LM = { st | s is in L and t is in M }
Example: if L = {a, b} and M = {c, d}, then LM = {ac, ad, bc, bd}

Kleene closure of L (L*):
L* denotes "zero or more concatenations of" L
Example: if L = {a, b}, then L* = {ε, a, b, aa, ab, ba, bb, aaa, …}

Positive closure of L (L+):
L+ denotes "one or more concatenations of" L
Example: if L = {a, b}, then L+ = {a, b, aa, ab, ba, bb, aaa, …}
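As a short worked example combining these operations: let L be the set of letters {A, …, Z, a, …, z} and D the set of digits {0, …, 9}. Then L ∪ D is the set of all letters and digits, LD is the set of all two-symbol strings consisting of a letter followed by a digit, D+ is the set of all strings of one or more digits, and L(L ∪ D)* is the set of all identifiers: strings of letters and digits beginning with a letter.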
Regular Expression

• A regular expression is a sequence of symbols used to specify lexeme patterns.
id = letter ( letter | digit) *
• A regular expression is helpful in describing the languages that can be
built using operators such as union, concatenation, and closure over the
symbols.
• A regular expression ‘r’ that denotes a language L(r) is built recursively
over the smaller regular expression using set of defined rules.
• Regular expressions arise from Chomsky type 3 grammars.
• A Chomsky type 3 grammar has productions of the form:
A → a or A →aB
Basis
• There are two rules that form the basis:
(i) ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is the empty string.
(ii) If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a},
that is, the language with one string, of length one, with ‘a’ in its
one position.
Induction:
• There are four parts to the induction whereby larger regular expressions
are built from smaller ones
• Suppose r and s are regular expressions denoting languages L(r) and L(s),
respectively.
(i) (r)|(s) is a regular expression denoting the language L(r) U L(s).
(ii) (r)(s) is a regular expression denoting the language L(r)L(s).
(iii) (r)* is a regular expression denoting (L(r))*.
(iv) (r) is a regular expression denoting L(r)
• A language denoted by a regular expression is said to be a regular set.
• A regular expression is made up of symbols of the language being
defined together with operators that support:
– concatenation (traditionally specified by symbol adjacency in the
regular expression)
– alternation (symbols or groups of symbols are separated by the |
operator) and
– repetition (symbols or groups of symbols are followed by the ∗
operator to signify zero or more repetitions).
• Parentheses can also be used to group symbols.
• The repetition operator has highest precedence, followed by
concatenation, with alternation having the lowest precedence.
Example
• abc denotes the set of strings with the single member {abc}.
• a|b|c denotes the set {a, b, c}.
• a∗ denotes {ε, a, aa, aaa, . . .}. ε is the empty string.
• ab∗ denotes the infinite set {a, ab, abb, abbb, . . .}.
• (a|b)∗ denotes the set of strings made up of zero or more a’s or b’s.
• a(bc|d)∗e denotes the set of strings including {ae, abce, abcde, ade, . . .}.
• The regular expression notation is simple yet powerful.
• It is compact and unambiguous.
• An integer constant can be defined as
digit→ 0|1|2|3|4|5|6|7|8|9
int constant→ digit digit∗
• An identifier as an initial letter followed by a sequence, possibly empty,
of letters or digits can be defined as
digit→ 0|1|2|3|4|5|6|7|8|9
letter→ a|b|c| . . . |z|A|B|C| . . . |Z
Identifier → letter (letter | digit)∗
• Unsigned numbers (integer or floating point), which are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4, can be defined as
digit → 0|1|2|3|4|5|6|7|8|9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
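For example, 6.336E4 matches this definition with digits = 6, optionalFraction = .336, and optionalExponent = E4, while a plain integer such as 5280 matches with both optional parts deriving ε.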
• Algebraic laws for regular expressions

AXIOM DESCRIPTION
r|s = s|r | is commutative
r|(s|t) = (r|s)|t | is associative
(rs)t = r(st) Concatenation is associative
r(s|t) = rs|rt Concatenation distributes over |
(s|t)r = sr|tr

εr = rε = r ε is the identity element for concatenation


r* = (r|ε)* Relation between * and ε
r** = r* * is idempotent
Regular Definition

• The regular definition is the name given to the regular expression.


• The regular definition (name) of a regular expression is used in the
subsequent expressions.
• The regular definition used in an expression appears as if it is a symbol.
• If ∑ is an alphabet of basic symbols, then a regular definition is a
sequence of definitions of the form
d1 → r1
d2 → r2
…
dn → rn
• where each di is a distinct name, and each ri is a regular expression
over the symbols in ∑ U {d1, d2, … , di-1}, i.e., the basic symbols and the
previously defined names.
• Regular expressions for identifiers:

Letter → a|b|c|…|z|A|B|…|Z
digit → 0|1|2|…|9
id → Letter (Letter | digit)*

or, equivalently, using character classes:

id → [a-zA-Z]([a-zA-Z]|[0-9])*


Extended Regular Expression
• Since Kleene introduced regular expressions with the basic operators for union, concatenation, and Kleene closure in the 1950s, many extensions have been added to enhance their ability to specify string patterns.
1. One or more instances(+)
• The unary postfix operator + represents the positive closure of a regular
expression and its language.
• If r is a regular expression, then r+ denotes the language (L(r))+.
• The operator + has the same precedence and associativity as the
operator *.
• Two useful algebraic laws,
r* = r+ | ε
r+ = rr* = r*r
relate the Kleene closure and positive closure.
2. Zero or one instance (?)
• The unary postfix operator ? denotes zero or one occurrence of the
regular expression r.
• The notation r? is shorthand for r | ε.
• If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}.
• The ? operator has the same precedence and associativity as * and +.
3. Character classes
• Consider a regular expression a1 | a2 | … | an, where each ai is a symbol of the alphabet. It can be replaced by the shorthand [a1a2…an], or by [a1-an] when the symbols form a consecutive range.
• We can describe identifiers as the strings generated by the regular expression [A-Za-z][A-Za-z0-9]*.
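To experiment with this identifier pattern, here is a minimal sketch that assumes a POSIX system providing <regex.h>; the test strings are arbitrary examples:

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    const char *tests[] = { "rate1", "1rate", "total" };
    /* ^ and $ anchor the pattern so the whole string must match */
    regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED);
    for (int i = 0; i < 3; i++)
        printf("%-6s -> %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0
                   ? "identifier" : "not an identifier");
    regfree(&re);
    return 0;
}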
LEX - A Lexical Analyzer Generator
• LEX (Lexical Analyzer Generator) is a tool that generates a lexical analyzer.
• It is used with the YACC (Yet Another Compiler Compiler) parser generator.
• The lexical analyzer is a program that transforms an input stream into a sequence of tokens.
• The lex compiler reads the lex specification and produces C source code implementing the lexical analyzer.
• Two Rules to remember of Lex:
1. Lex will always match the longest (number of characters) token
possible.
Example:
Input: abc
Then [a-z]+ matches abc rather than a or ab or bc.
2. If two or more possible tokens are of the same length, then the
token with the regular expression that is defined first in the lex
specification is favored.
1. Function of Lex
• lex.l is an input file written in a language that describes the generation of a lexical analyzer. The lex compiler transforms lex.l into a C program known as lex.yy.c.
• lex.yy.c is compiled by the C compiler to a file called a.out
• a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.

• yylval is a global variable shared by the lexical analyzer and the parser, used to return the name and attribute value of a token.
2. Structure of LEX
• A lex program has the following overall layout:

%{
declarations
%}
%%
translation rules
%%
auxiliary functions

(i) Declarations
• This section includes declarations of variables, constants, and regular definitions.
• It consists of two parts:
(a) Auxiliary declarations, bracketed with %{ and %}. They are not processed by the lex tool; instead they are copied verbatim into the output file lex.yy.c:

%{
#include<stdio.h>
int global_variable;
%}

(b) Regular definitions:

number [0-9]+
op [-+*/^=]
(ii) Translation rules
• This section contains regular expressions (patterns to be matched) and code segments (the corresponding code to be executed).
• Each rule has the form
Pattern {Action}
• Pattern is a regular expression or regular definition.
– It starts in the first column.
• Action refers to a segment of code.
– It must begin on the same line.
– Multiple statements are enclosed within braces ({}).

Example:
%%
{number} {printf(" number");}
{op} {printf(" operator");}
%%

INPUT: 13          OUTPUT: number
INPUT: +           OUTPUT: operator
INPUT: 13 + 17     OUTPUT: number operator number
(iii) Auxiliary functions
• LEX generates C code for the rules specified in the rules section and places this code into a single function called yylex().
• This section holds additional functions which are used in actions. These functions are compiled separately and loaded with the lexical analyzer.

yylval — global variable used to send a token's attribute value back to the parser
yytext — global variable that points to the character string of the matched lexeme
yyleng — global variable holding the length of the matched lexeme
yylex() — the function to call to invoke the lexical analyzer
3. Conflict resolution in lex
• yylex() function uses two important rules for selecting the right actions
for execution in case there exists more than one pattern matching a string
in a given input.
(i) Always prefer a longer prefix to a shorter one
“break” {return BREAK;}
[a-zA-Z][a-zA-Z0-9]* {return IDENTIFIER;}
• In this case, if 'break' is found in the input, it is matched with the first pattern and BREAK is returned by the yylex() function. If another word, e.g. 'random', is found, it is matched with the second pattern and yylex() returns IDENTIFIER.
(ii) If two or more patterns are matched for the longest prefix, then the first
pattern listed in lex program is preferred.
/* Declarations section */
%%
"-"  {return MINUS;}
"--" {return DECREMENT;}
%%
/* Auxiliary functions */

Input: -      Output: MINUS
Input: --     Output: DECREMENT
Input: ---    Output: DECREMENT MINUS
4. The Lookahead Operator
• The lookahead operator lets lex read additional context in order to distinguish the pattern for a token.
• The lexical analyzer reads ahead of the valid lexeme and then pulls back before producing the token.
• The / (slash) is placed in a pattern to mark the end of the part of the pattern that matches the lexeme; what follows the slash is lookahead context only.
• In some languages keywords are not reserved, so the statements
IF (I, J) = 5 and IF(condition) THEN
conflict over whether IF is an array name or a keyword. To resolve this, the lex rule for the keyword IF can be written as
IF/\(.*\){letter}
• LEX program to recognize numbers:

%{
#include <stdio.h>
%}
%%
[0-9]+ { printf("Saw an integer: %s\n", yytext); }
.      { ; }
%%
main()
{
  printf("Enter some input that consists of an integer number\n");
  yylex();
}
int yywrap()
{
  return 1;
}

Running the Lex program:

[student@localhost ~]$ lex 1a.l
[student@localhost ~]$ cc lex.yy.c
[student@localhost ~]$ ./a.out
Enter some input that consists of an integer number
hello 2345
Saw an integer: 2345
• LEX program to convert decimal numbers to hexadecimal:

%{
#include <stdlib.h>
#include <stdio.h>
int count = 0;   /* header files plus a global variable count */
%}
%%
[0-9]+ { int no = atoi(yytext);   /* expected tokens: decimal numbers made of digits 0-9 */
         printf("%x", no);
         count++;
       }
[\n]   return 0;
%%
int main(void)
{
  printf("Enter any number(s) to be converted to hexadecimal:\n");
  yylex();   /* main() invokes the analyzer by calling yylex() */
  printf("\n");
  return 0;
}
• LEX program to accept strings of letters followed by a digit:

%{
#include<stdio.h>
/* Identifier and error tokens */
#define ID 1
#define ER 2
%}
low [a-z]
upp [A-Z]
number [0-9]
%option noyywrap
%%
({low}|{upp})({low}|{upp})*({number}) return ID;
(.)* return ER;
%%
int main(){
  int token = yylex();
  if(token==ID)
    printf("Accept\n");
  else if(token==ER)
    printf("Reject\n");
  return 1;
}

Compilation:
lex filename.l
gcc lex.yy.c
./a.out

I/O:
Input: z0 → Output: Accept
Input: z  → Output: Reject
BCSE307L_COMPILER DESIGN

SYNTAX ANALYSIS
Dr. B.V. Baiju
Assistant Professor, SCOPE
VIT, Vellore
SYNTAX ANALYSIS
• In syntax analysis, the compiler checks the syntactic structure of the input
string, i.e., whether the given string follows the grammar or not.
• It uses a data structure called a parse tree or syntax tree to make
comparisons.
• It is used to check if the code is grammatically correct or not.
• It helps us to detect all types of syntax errors.
• It gives an exact description of the error.
• It rejects invalid code before actual compiling.
• Syntax Analyser Terminology

Sentence: a group of characters over some alphabet.
Lexeme: the lowest-level syntactic unit of a language (e.g., total, start).
Token: a category of lexemes.
Keywords and reserved words: identifiers used as a fixed part of the syntax of a statement; a reserved word cannot be used as a variable name or identifier.
Noise words: optional words inserted into a statement to enhance the readability of the sentence.
Delimiters: syntactic elements that mark the start or end of some syntactic unit.
Comments: a very important part of documentation, mostly displayed as /* */ or //.
Character set: ASCII, Unicode.
Identifiers: restrictions on identifier length can reduce the readability of the sentence.
Operator symbols: + and - perform two basic arithmetic operations.
Role of the parser
• The process of transforming data from one format to another is called parsing.
• This process is accomplished by the parser.
• The parser is the component of the translator that organises the linear text structure according to a set of defined rules known as a grammar.
• It reports any syntax errors in the program.
• The parser uses the formal grammar rules to verify whether the input text is syntactically valid.
• A parser is a program that generates a parse tree for the given string, if the string can be generated from the underlying grammar.
Top–Down Parsing
• The top–down parser starts by constructing the parse tree with a single node labelled with the start symbol.
• It then builds up the complete parse tree by creating the subtrees one by one, in left-to-right order.
• In building a subtree, the root node of that subtree is created first, and then all the sub-subtrees of that subtree are generated.
Parse Trees and the Leftmost Derivation
• Illustrate the parsing for the expression x+y*z.
1. The parser starts off by constructing a tree
containing just the starting symbol as the root
node.
2. The next step in the pre-order generation is to set up the leftmost subnode. This is done by looking at the grammar of the language and noting that <expr> is defined as
<expr> ::= <term> | <expr> + <term>
Here we use the production
<expr> ::= <expr> + <term>
3. Now deal with the nodes and their subtrees from left to right. This time we use the production given below and the tree becomes:
<expr> ::= <term>
4. The lower <term> node has to be tackled next. Use the production
<term> ::= <factor>
5. The <factor> node is given a single child via the production
<factor> ::= x
• Our input is x+y*z; the x is matched and the remaining input is +y*z.
6. The next step is to deal with the third node <term>. We use the production below and the tree becomes:
<term> ::= <term> * <factor>
7. The latest <term> is given the child <factor> from the production
<term> ::= <factor>
8. That <factor> corresponds to y. We use the production
<factor> ::= y
9. y is matched with the input and the remaining input is *z.
• Finally we use the production
<factor> ::= z
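Collecting these steps, the complete leftmost derivation for x+y*z is:
<expr> ⇒ <expr> + <term> ⇒ <term> + <term> ⇒ <factor> + <term> ⇒ x + <term> ⇒ x + <term> * <factor> ⇒ x + <factor> * <factor> ⇒ x + y * <factor> ⇒ x + y * z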
RECURSIVE DESCENT PARSING
• Recursive Descent Parser uses the technique of Top-Down Parsing
without backtracking.
• It can be defined as a parser that uses various recursive procedures to process the input string with no backtracking.
• Backtracking: Making repeated input scans until we find a correct path.
• For implementing a recursive descent parser for a grammar:
– The grammar must not be left recursive
– The grammar must be left factored that means it should not have
common prefixes for alternates
– Language should have a recursion facility.
• An example of a recursive descent parser is a parser for parsing
arithmetic expressions.
• Consider the following grammar:
E → iE′
E′ → +iE′ | ε
• E and E′ are non-terminals; '+' and 'i' are terminal characters.
• ε represents the end of the recursion.

# Procedure to match a character and input a new character
match(char t)
{
  if (l == t) {
    l = getchar();
  }
  else {
    printf("Error");
  }
}

# Procedure for E
E()
{
  # l is the lookahead
  if (l == 'i')
  {
    match('i');
    E'();
  }
}

# Procedure for E'
E'()
{
  if (l == '+')
  {
    match('+');
    match('i');
    E'();
  }
  else
    return;
}

main()
{
  E();
  if (l == '$') {
    printf("Parsing Successful");
  }
}
• Write down the algorithm using recursive procedures to implement the following grammar:
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id
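For reference, a minimal sketch of one possible solution in C. Assumptions beyond the slides: the letter 'i' stands for id, '$' marks the end of input, and the lookahead is kept in a global variable l, mirroring the earlier example.

#include <stdio.h>
#include <stdlib.h>

int l;                                /* lookahead symbol */

void error(void) { printf("Error\n"); exit(1); }

void match(int t) {                   /* consume t, read next symbol */
    if (l == t) l = getchar();
    else error();
}

void E(void); void Eprime(void); void T(void); void Tprime(void); void F(void);

void E(void)      { T(); Eprime(); }                     /* E  -> T E'         */
void Eprime(void) {                                      /* E' -> + T E' | eps */
    if (l == '+') { match('+'); T(); Eprime(); }
}
void T(void)      { F(); Tprime(); }                     /* T  -> F T'         */
void Tprime(void) {                                      /* T' -> * F T' | eps */
    if (l == '*') { match('*'); F(); Tprime(); }
}
void F(void) {                                           /* F  -> ( E ) | id   */
    if (l == '(') { match('('); E(); match(')'); }
    else if (l == 'i') match('i');
    else error();
}

int main(void) {
    l = getchar();                    /* e.g. input: i+i*i$ */
    E();
    if (l == '$') printf("Parsing Successful\n");
    else error();
    return 0;
}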
