Compiler Module 1 Important Questions
For example, the expression a := b + 10 consists of five tokens: two identifiers ( a and b ), an
assignment operator ( := ), an addition operator ( + ), and a numeric literal ( 10 ).
1. Lexical Analyzer
The statement is converted into the token stream id1 := id2 + id3 * 60, with symbol-table
entries:
1. position
2. initial
3. rate
2. Syntax Analyzer
A parse tree is built reflecting operator precedence:
:= ( id1, + ( id2, * ( id3, 60 ) ) )
3. Semantic Analyzer
Type checking finds that 60 is an integer while the variables are real, so a conversion
is inserted:
:= ( id1, + ( id2, * ( id3, int_to_real ( 60 ) ) ) )
4. Intermediate Code Generator
temp1 = int_to_real(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
5. Code Optimizer
The conversion is folded into the constant and the extra copies are eliminated:
temp1 = id3 * 60.0
id1 = id2 + temp1
6. Code Generator
MOVF id3, R2    ; move the value of id3 into register R2
MULF #60.0, R2  ; multiply R2 by the constant 60.0
MOVF id2, R1    ; move the value of id2 into register R1
ADDF R2, R1     ; add R2 to R1
MOVF R1, id1    ; store the result in id1
Scanner Generators:
Generate lexical analyzers (scanners) from a description of the tokens of a
programming language using regular expressions.
Lexical analyzers produced by scanner generators recognize patterns in the input
source code and generate tokens for further processing by the compiler.
Example: Lex
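The idea behind a scanner generator can be sketched by hand: a table of token names and regular expressions is compiled into one matcher that repeatedly finds the next lexeme. The sketch below is a minimal hand-written Python illustration (the token set is hypothetical, not actual Lex output):

```python
import re

# Hypothetical token specification: (TOKEN_NAME, regular expression).
# Keywords are listed before identifiers so they win the match.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|float|goto|continue|break)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"[0-9]+"),
    ("OPERATOR",   r"[+\-*/=]"),
    ("SEMICOLON",  r";"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (token_name, lexeme) pairs for the input string."""
    tokens = []
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":      # discard whitespace
            tokens.append((match.lastgroup, match.group()))
    return tokens
```

For instance, tokenize("int num = 10;") yields the keyword, identifier, operator, literal, and semicolon tokens in order.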
Parser Generators:
Parser generators, such as Yacc (Yet Another Compiler Compiler), automatically
generate syntax analyzers (parsers) from a grammatical description of a
programming language.
Syntax analyzers produced by parser generators parse the input source code
according to the specified grammar rules, generating parse trees or abstract syntax
trees (ASTs) that represent the syntactic structure of the code.
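What a generated parser does can be illustrated with a small hand-written recursive-descent parser. This is a sketch in Python, not Yacc output; the two-level expression grammar (expr of terms, term of factors) is a hypothetical example chosen to show how precedence shapes the AST:

```python
# Hypothetical grammar:
#   expr   -> term   (('+'|'-') term)*
#   term   -> factor (('*'|'/') factor)*
#   factor -> NUMBER | IDENTIFIER
# AST nodes are nested tuples: (operator, left_subtree, right_subtree).

def parse_expr(tokens, pos=0):
    """Parse an expression; return (ast, next_position)."""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)
    return node, pos

def parse_term(tokens, pos):
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        node = (op, node, right)
    return node, pos

def parse_factor(tokens, pos):
    # A leaf: a number or an identifier token.
    return tokens[pos], pos + 1
```

Parsing ["b", "+", "c", "*", "10"] produces ("+", "b", ("*", "c", "10")): the multiplication sits deeper in the tree, exactly the structure the semantic analyzer later walks.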
Syntax-Directed Translation Engines:
A Syntax-Directed Translation Engine is a tool used during the compilation process of
a programming language. Its main job is to help convert code from one form to
another.
It works from a set of instructions or rules that guide the translation process. These
rules are closely tied to the grammar of the programming language being compiled.
Produce collections of routines for walking a parse tree and generating intermediate
code.
Automatic code generators:
This generator takes intermediate code as input and converts each operation of the
intermediate code into the equivalent machine-language instructions.
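The operation-by-operation translation can be sketched as a table-driven mapping from three-address statements to instruction strings. The opcode names below follow the MOVF/ADDF/MULF style used earlier in these notes; the register choice and instruction set are an illustration, not a real ISA:

```python
# Hypothetical opcode table: three-address operator -> machine instruction.
OPCODES = {"+": "ADDF", "*": "MULF"}

def generate(three_address_code):
    """Translate [(dest, left, op, right), ...] into instruction strings.

    Each statement dest = left op right becomes a load, an operation,
    and a store, all through register R1.
    """
    instructions = []
    for dest, left, op, right in three_address_code:
        instructions.append(f"MOVF {left}, R1")            # load left operand
        instructions.append(f"{OPCODES[op]} {right}, R1")  # apply the operation
        instructions.append(f"MOVF R1, {dest}")            # store the result
    return instructions
```

For example, the statement temp1 = id3 * #60.0 expands to a MOVF, a MULF, and a closing MOVF.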
6. Trace the output after each phase of the compiler for the
assignment statement: a = b + c * 10, if variables given are of
float type.
Refer to Question 4; it uses the example Position := initial + rate * 60.
7. Regular Expressions
What are regular expressions?
Write a regular expression for the language over {0,1} accepting all strings that start
with 1 and end with 0.
The answer is
1(0+1)*0
We start with 1, so 1 is put first.
After the 1 we can have any combination of 0s and 1s, so we write (0+1)*.
At the end we need a 0, so we put 0 last.
Write a regular expression for the language L over {0,1} such that no string contains
the substring 01.
Our language is
L = {ε, 0, 1, 00, 11, 10, 100, ...}
The expression is
1*0*
All 1s must come before all 0s, since any 0 followed later by a 1 would create the
substring 01.
Design a regular expression to accept all possible combinations of a's and b's.
The expression is
(a + b)*
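The three expressions above can be checked directly with Python's re module, translating the union operator + into | (or a character class) and using fullmatch so the whole string must be accepted:

```python
import re

# 1(0+1)*0 : starts with 1, ends with 0
starts1_ends0 = re.compile(r"1[01]*0")
# 1*0*     : no "01" substring -- all 1s (if any) precede all 0s
no_01 = re.compile(r"1*0*")
# (a+b)*   : every combination of a's and b's, including the empty string
all_ab = re.compile(r"[ab]*")

def accepts(pattern, s):
    """True when the entire string s is in the pattern's language."""
    return pattern.fullmatch(s) is not None
```

For example, accepts(starts1_ends0, "1010") is True while accepts(starts1_ends0, "011") is False.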
Suppose we want to write a compiler for a new programming language called X. We need
to consider three languages involved in this process:
Source Language (X): The language for which we are writing the compiler.
Object Language (or Target Language) (Z): The language that the compiler will
produce code for.
Implementation Language (Y): The language in which the compiler itself is written.
Input buffering employs two buffers of the same size N (the size of a disk block) to handle
input efficiently. While one buffer is being processed, the other is filled with characters
from the input file.
Input buffering allows for lookahead, which is essential for identifying certain lexemes that
require examining characters beyond the current position, such as identifiers and
compound operators.
Examples of identifiers: variableName , function_name , ClassName123 , etc.
Examples of compound operators: += , -= , <= , >= , && , || , etc.
The lexical analyzer maintains two pointers:
lexemeBegin , marking the start of the current lexeme
forward , which scans ahead until the pattern for the lexeme is found
Once a lexeme is identified, the forward pointer is updated, and the lexeme is recorded
as a token. The lexemeBegin pointer is then positioned to the character immediately
following the lexeme.
Let's say we have the following line of code in a programming language:
int variableName = 42;
Here's how the input buffering scheme in a lexical analyzer would identify the
lexeme "variableName":
Initial State:
The forward pointer and the lexemeBegin pointer are both pointing
to the beginning of the line.
Scanning:
The lexical analyzer starts scanning characters from the beginning of
the line.
It encounters the characters "i", "n", "t", which form the keyword "int".
This is recognized as a token representing the data type.
The forward pointer moves to the character immediately after
"t".
The lexemeBegin pointer is typically updated after a complete lexeme
has been identified and recorded as a token. It marks the beginning of
the next lexeme to be analyzed in the input stream.
Identifying the Identifier:
The lexical analyzer encounters the characters "v", "a", "r", "i",
"a", "b", "l", "e", "N", "a", "m", "e", which form the identifier
"variableName".
The lexical analyzer continues scanning characters until the end of the line,
identifying and recording tokens for other lexemes like the assignment
operator "=", the integer literal "42", and the semicolon ";".
When the forward pointer reaches the end of a buffer, the lexical analyzer reloads the
other buffer from the input file and moves the forward pointer to the beginning of the
newly loaded buffer.
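The alternating-buffer scheme above can be sketched in a few lines of Python. This is a simplified illustration: the buffer size N is tiny here so the reload logic is exercised, whereas a real lexer would use a disk-block size (e.g. 4096) and sentinel characters to cheapen the end-of-buffer test:

```python
N = 4  # illustrative buffer size; real lexers use a disk-block size

class TwoBufferReader:
    """Feed characters through two alternating buffers of size N."""

    def __init__(self, source):
        self.source = source   # stands in for the input file
        self.pos = 0           # how much of the input has been loaded
        self.buffers = ["", ""]
        self.current = 0       # index of the buffer being scanned
        self.forward = 0       # forward pointer within the current buffer
        self._reload(0)

    def _reload(self, which):
        # Fill one buffer with the next N characters of input.
        self.buffers[which] = self.source[self.pos:self.pos + N]
        self.pos += N

    def next_char(self):
        """Return the next character, or "" at end of input."""
        if self.forward == len(self.buffers[self.current]):
            if len(self.buffers[self.current]) < N:
                return ""      # a short buffer means the input is exhausted
            # Forward reached the end of a full buffer: reload the *other*
            # buffer and move the forward pointer to its beginning.
            self.current = 1 - self.current
            self._reload(self.current)
            self.forward = 0
            if not self.buffers[self.current]:
                return ""
        ch = self.buffers[self.current][self.forward]
        self.forward += 1
        return ch
```

Reading "int variableName = 42;" through this reader crosses several buffer boundaries yet yields the characters in order, which is all the lexical analyzer needs.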
Tokens are sequences of characters that are treated as a single unit; they cannot be
meaningfully broken down further.
Tokens are the building blocks of a program and are used to convey information to the
compiler or interpreter
Examples of tokens include:
identifiers (user-defined names)
keywords (int, float, goto, continue, break)
operators (+, -, /, *)
literals
punctuation symbols
In the statement int num = 10; , the tokens are:
Keyword token: int
Identifier token: num
Assignment operator token: =
Integer literal token: 10
Semicolon token: ;
Lexemes
A lexeme is a sequence of characters in the source program that matches the
pattern for a token
Lexemes are the actual occurrences of the tokens in the source code.
Example:
In the statement int num = 10; , each of int , num , = , 10 , and ; is a lexeme;
; is the lexeme matching the semicolon token.
Pattern
A pattern is a set of rules that the scanner or the lexical analyzer follows to create a token
For example, in the case of keywords, the pattern is just the sequence of characters
that forms the keyword.
Regular expression pattern for integer literals in a programming language: [0-9]+
This pattern specifies that an integer literal lexeme consists of one or more digits
(0-9) in sequence.
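The pattern can be exercised directly; in Python's re syntax it is the same string, and re.fullmatch enforces that the entire lexeme matches:

```python
import re

INTEGER_LITERAL = re.compile(r"[0-9]+")  # one or more digits in sequence

def is_integer_literal(lexeme):
    """True when the whole lexeme matches the integer-literal pattern."""
    return INTEGER_LITERAL.fullmatch(lexeme) is not None
```

So "42" and "007" match, while "4a" and the empty string do not.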