01 Compiler_Scanner intro
Introduction
• Transform a stream of characters into a stream of words in the input
language
• Each word must be classified into a syntactic category, or “part of
speech.”
• The only pass in the compiler that touches every character in the input
program
• Aggregates characters to form words and applies a set of rules to
determine whether each word is legal in the source language
• If the word is valid, the scanner assigns it a syntactic category, or part
of speech
Overview
• Compiler’s scanner reads an input stream that consists of characters
and produces an output stream that contains words, each labelled
with its syntactic category
• Classification of words according to their grammatical usage
• Scanner applies a set of rules, called the microsyntax, that describe
the lexical structure of the input programming language
• specifies how to group characters into words and, conversely, how to separate
words that run together
note: punctuation marks and other symbols also count as words
Converting the DFA to executable code:
• Table-driven scanner
• Direct-coded scanner
• Hand-coded scanner
• All of these scanners operate in the same manner, by
simulating the DFA
• Repeatedly read the next character in the input and
simulate the DFA transition caused by that character
• Process stops when the DFA recognizes a word
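The simulation loop above can be sketched as follows; the particular DFA (one that recognizes register names such as r17) and its dictionary encoding are assumptions for illustration:

```python
# Minimal DFA simulation: repeatedly read the next character and
# follow the transition it causes; stop when no transition exists.
# Assumed example DFA: recognizes 'r' followed by one or more digits.
delta = {
    ("s0", "r"): "s1",       # initial 'r' moves to s1
    ("s1", "digit"): "s2",   # first digit reaches the accepting state
    ("s2", "digit"): "s2",   # further digits stay in the accepting state
}
accepting = {"s2"}

def simulate(stream: str) -> bool:
    state = "s0"
    for ch in stream:
        cat = "digit" if ch.isdigit() else ch
        nxt = delta.get((state, cat))
        if nxt is None:      # no outbound transition on this character
            return False
        state = nxt
    # a word is recognized iff the DFA halts in an accepting state
    return state in accepting
```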
Implementation strategies
• When the current state, s, has no outbound transition on the
current input character:
• If s is an accepting state, the scanner recognizes the word and
returns the lexeme and its syntactic category to the calling
procedure
• If s is a nonaccepting state, the scanner must determine whether
it passed through an accepting state on the way to s
• If the scanner did encounter an accepting state, it should roll
back its internal state and its input stream to that point and report
success
• If it did not, it should report failure
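The decision logic above (accept in place, roll back to the most recent accepting state, or fail) can be sketched as a longest-match routine; the tiny DFA at the end, which accepts "a" or "abc", is an assumption chosen so that a roll back actually occurs:

```python
def next_word(stream, pos, delta, accepting, start="s0"):
    """Return (lexeme, end_pos) for the longest prefix of stream[pos:]
    the DFA accepts, or (None, pos) on failure.  A sketch."""
    state = start
    last_accept = None            # position just past the last accepting state
    i = pos
    while i < len(stream):
        nxt = delta.get((state, stream[i]))
        if nxt is None:
            break                 # no outbound transition on this character
        state = nxt
        i += 1
        if state in accepting:
            last_accept = i       # remember passing through an accepting state
    if state in accepting:        # stopped in an accepting state: success
        return stream[pos:i], i
    if last_accept is not None:   # roll back to the most recent accepting state
        return stream[pos:last_accept], last_accept
    return None, pos              # never saw an accepting state: failure

# Assumed example DFA accepting the words "a" and "abc".
delta = {("s0", "a"): "A1", ("A1", "b"): "s2", ("s2", "c"): "A3"}
accepting = {"A1", "A3"}
```

On input "abd" the DFA blocks in a nonaccepting state after "ab", so the routine rolls back and returns "a".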
Table-Driven Scanners
• Uses a skeleton scanner for control and a set of
generated tables that encode language-specific
knowledge
• Skeleton scanner uses the variable state to hold the current state of the simulated
DFA
• Updates state using a two-step, table-lookup process
• Classifies char into one of a small set of categories using the CharCat
table.
The example scanner has three categories: Register, Digit, or Other.
• Uses the current state and the character category as indices into the transition
table
• The two-step translation (character to category, then state and category
to new state) lets the scanner use a compressed transition table
• Tradeoff between direct access into a larger table and indirect access into the
compressed table is straightforward
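The two-step lookup can be sketched as follows, using the Register/Digit/Other categories from the example; the dictionary encoding of the compressed table is an assumption for illustration:

```python
# Step 1: CharCat maps each character to a small category.
def char_cat(ch):
    if ch == "r":
        return "Register"
    if ch.isdigit():
        return "Digit"
    return "Other"

# Step 2: the compressed transition table is indexed by
# (state, category) rather than (state, character).
transition = {
    ("s0", "Register"): "s1",
    ("s1", "Digit"): "s2",
    ("s2", "Digit"): "s2",
}   # every missing entry stands for the error state

def step(state, ch):
    return transition.get((state, char_cat(ch)), "s_error")
```

Because the table is indexed by category, it has one column per category instead of one per character in Σ, which is what makes the compression pay off.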
Table-Driven Scanners
• Complete table would eliminate the mapping through CharCat, but
would increase the memory footprint of the table
• Uncompressed transition table grows as the product of the number
of states in the DFA and the number of characters in Σ; it can grow
to the point where it will not stay in cache
• With a small, compact character set, such as ASCII, CharCat can be
represented as a simple table lookup
• Relevant portions of CharCat should stay in the cache
• In that case, table compression adds one cache reference per input
character
• As the character set grows (e.g. Unicode), more complex implementations of
CharCat may be needed
• The precise tradeoff between the per-character costs of both
compressed and uncompressed tables will depend on properties of
both the language and the computer that runs the scanner
Table-Driven Scanners
• To provide a character-by-character interface to the input stream,
Skeleton scanner uses
• NextChar macro, which sets its sole parameter to contain the
next character in the input stream
• RollBack macro moves the input stream back by one
character
• If the scanner reads too far, state will not contain an accepting
state at the end of the first while loop
• In that case, the second while loop uses the state trace from the
stack to roll the state, lexeme, and input stream back to the most
recent accepting state
• In most languages, the scanner’s overshoot will be limited
• Pathological behavior, however, can cause the scanner to
examine individual characters many times, significantly
increasing the overall cost of scanning
• In most programming languages, the amount of roll back is small
relative to the word lengths
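The two-loop structure above can be sketched as follows; NextChar and RollBack are modeled with an index into the input string, and the state trace is an explicit Python list (the goto-free control flow is an approximation of a generated skeleton):

```python
def skeleton_scan(stream, delta, accepting):
    """Skeleton scanner with an explicit state trace.  A sketch."""
    state, lexeme, stack = "s0", "", []
    i = 0
    # First while loop: run the DFA until it blocks, tracing states.
    while state != "bad" and i < len(stream):
        ch = stream[i]; i += 1            # NextChar
        lexeme += ch
        if state in accepting:
            stack.clear()                 # never roll back past this point
        stack.append(state)
        state = delta.get((state, ch), "bad")
    # Second while loop: use the trace to roll the state, lexeme,
    # and input stream back to the most recent accepting state.
    while state not in accepting and stack:
        state = stack.pop()
        lexeme = lexeme[:-1]
        i -= 1                            # RollBack
    if state in accepting:
        return lexeme, i
    return None, i

# Assumed example DFA accepting the words "a" and "abc".
delta = {("s0", "a"): "A1", ("A1", "b"): "s2", ("s2", "c"): "A3"}
accepting = {"A1", "A3"}
```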
Avoiding Excess Roll Back
• Some regular expressions can cause quadratic amounts of roll back
in the scanner
• The problem arises from the requirement that the scanner return the
longest word that is a prefix of the input stream
• RE ab | (ab)*c: the corresponding DFA recognizes either ab or
any number of occurrences of ab followed by a final c
• On the input string “ababababc”, a scanner built from the DFA
will read all the characters and return the entire string as a single
word
• If, however, the input is “abababab”, it must scan all of the
characters before it can determine that the longest prefix is “ab”
• On the next invocation, it will scan “ababab” to return “ab”
• Third call will scan “abab” to return “ab”, and the final call will
simply return “ab” without any roll back
• In the worst case, the scanner can spend quadratic time reading the
input stream
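The quadratic behavior can be demonstrated by instrumenting a rollback scanner for ab | (ab)*c to count every character examination; the hand-built DFA below is an assumption for illustration:

```python
# Hand-built DFA for the RE  ab | (ab)*c.
delta = {
    ("s0", "a"): "s1", ("s0", "c"): "s5",
    ("s1", "b"): "s2",
    ("s2", "a"): "s3", ("s2", "c"): "s5",
    ("s3", "b"): "s4",
    ("s4", "a"): "s3", ("s4", "c"): "s5",
}
accepting = {"s2", "s5"}    # s2 accepts "ab", s5 accepts (ab)*c

def scan_all(stream):
    """Tokenize stream, counting every character examination."""
    pos, examined, words = 0, 0, []
    while pos < len(stream):
        state, last, i = "s0", None, pos
        while i < len(stream):
            examined += 1
            nxt = delta.get((state, stream[i]))
            if nxt is None:
                break
            state, i = nxt, i + 1
            if state in accepting:
                last = i
        end = i if state in accepting else last
        if end is None:
            break                       # no word recognized
        words.append(stream[pos:end])   # roll back to last accepting state
        pos = end
    return words, examined
```

On "ababababc" the scanner examines each of the 9 characters once and returns one word; on "abababab" each invocation re-reads the remaining suffix before rolling back to "ab", so the 8-character input costs 8 + 6 + 4 + 2 = 20 examinations.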
Direct-Coded Scanners
• Direct-coded scanners reduce the cost of computing DFA transitions by replacing the
explicit representation of the DFA’s state and transition graph with an implicit one
• The implicit representation simplifies the two-step, table-lookup computation
• Eliminates the memory references entailed in that computation and allows other
specializations.
• Resulting scanner has the same functionality as the table-driven scanner, but with a
lower overhead per character
• A direct-coded scanner is no harder to generate than the equivalent table-driven
scanner
• The table-driven scanner spends most of its time inside the central while loop; thus, the
heart of a direct-coded scanner is an alternate implementation of that while loop.
• With some detail abstracted, that loop performs the following actions:
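As a hedged illustration, a direct-coded recognizer for the register-name example (r followed by one or more digits, the assumed DFA from earlier) replaces both table lookups with state-specific tests; generated scanners use goto-style transfers between state fragments, approximated here with straight-line Python:

```python
# Direct-coded recognizer for register names r[0-9]+ (a sketch).
# Each DFA state becomes a block of code that tests the next
# character directly, eliminating the CharCat lookup and the
# transition-table reference per character.
def recognize_register(stream: str) -> bool:
    i, n = 0, len(stream)
    # state s0: expect 'r'
    if i >= n or stream[i] != "r":
        return False
    i += 1
    # state s1: expect at least one digit
    if i >= n or not stream[i].isdigit():
        return False
    i += 1
    # state s2 (accepting): consume any remaining digits
    while i < n and stream[i].isdigit():
        i += 1
    return i == n    # accept only if the whole input was consumed
```

The functionality matches the table-driven version; only the per-character overhead changes.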