Chapter 2 Lexical Analysis (Scanning) (1)
Lexical Analysis
Objective
At the end of this session students will be able to:
Understand the basic roles of the lexical analyzer (LA): lexical analysis versus parsing; tokens, patterns, and lexemes; attributes for tokens; and lexical errors.
Understand the specification of tokens: strings and languages, operations on languages, regular expressions, regular definitions, and extensions of regular expressions.
Lexical Analysis
Contd.
The parser uses tokens produced by the LA to create a tree-like intermediate representation that depicts the grammatical structure of the token stream.
The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition. An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands.
In the process of translating a source program into target code, a compiler may construct one or more intermediate, machine-independent representations, which are easy to produce and easy to translate into the target machine.
The code-optimization phase attempts to improve the intermediate code so that better target code will result. Usually better means faster, but other objectives may be desired, such as shorter code, or target code that consumes less power.
The Role of Lexical Analyzer
[Figure: the lexical analyzer and the parser interacting with the symbol table.]
Example token class: Separators = { ; , , }
Consideration for a simple design of Lexical Analyzer
A Lexical Analyzer can allow the source program to be free-format:
1. Free-Format Input: the alignment of lexemes should not be necessary in determining the correctness of the source program; such a restriction puts an extra load on the Lexical Analyzer.
2. Blanks Significance: treating blanks as significant simplifies the task of identifying tokens.
E.g. "Int a" indicates <Int is a keyword> <a is an identifier>, whereas "Inta" indicates <Inta is an identifier>.
3. Keywords must be reserved: keywords should be reserved; otherwise the LA will have to predict whether a given lexeme should be treated as a keyword or as an identifier.
E.g. if then then then = else; else else = then;
The above statement is misleading when the keywords then and else are not reserved (a small reserved-word lookup sketch is given below).

Approaches to implementation:
Use assembly language: most efficient, but most difficult to implement.
Use a high-level language like C: efficient, but difficult to implement.
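To make the reserved-word point concrete, here is a minimal sketch (my own illustration, not from these slides) in which the scanner looks each lexeme up in a set of reserved words before classifying it as an identifier:

import java.util.Set;

public class ReservedWords {
    // Hypothetical reserved-word set; a real scanner would list every keyword of the language.
    private static final Set<String> KEYWORDS = Set.of("if", "then", "else", "while", "int", "float");

    // Classify a lexeme: if it is reserved, it can never be an identifier.
    static String classify(String lexeme) {
        return KEYWORDS.contains(lexeme) ? "<keyword, " + lexeme + ">" : "<id, " + lexeme + ">";
    }

    public static void main(String[] args) {
        System.out.println(classify("else"));   // <keyword, else>
        System.out.println(classify("count"));  // <id, count>
    }
}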
Lexical Errors
Lexical errors are primarily of two kinds:
1. Lexemes whose length exceeds the bound specified by the language.
2. Illegal characters and ill-formed numeric constants.
Possible recovery actions include deleting an extraneous character, inserting a missing character, replacing an incorrect character, transposing two adjacent characters, and pre-scanning the input before reporting an error.
Input Buffering
There are some ways that the task of reading the source program can be speeded up.
This task is made difficult by the fact that we often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme.
In the C language, we need to look at the character after -, = or < to decide what token to return.
We shall introduce a two-buffer scheme that handles large look-aheads safely.
We then consider an improvement involving sentinels that saves time checking for the ends of buffers.
Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character.
Contd.
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.
Using one system read command we can read N characters into a buffer, rather than using one system call per character.
If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file; it is distinct from any possible character of the source program.
Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
Contd.
Once the lexeme is determined, forward is set to the character at its right end (this may involve retracting forward).
Then, after the lexeme is recorded as an attribute value of the token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.
Advancing forward requires that we first test whether we have reached the end of one of the buffers; if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.
Sentinels
If we use the previous scheme, we must check, each time we advance forward, that we have not moved off one of the buffers; if we have, then we must also reload the other buffer.
Thus, for each character read, we must make two tests: one for the end of the buffer, and one to determine which character was read.
We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character (eof) at its end.
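Below is a rough sketch of the buffer-pair scheme with sentinels described above, assuming a NUL character as the sentinel and hypothetical helper names (reload, nextChar); only the advancing of forward is shown, and lexemeBegin is omitted:

import java.io.IOException;
import java.io.InputStream;

public class DoubleBuffer {
    static final int N = 4096;                // buffer size, typically one disk block
    static final char EOF_SENTINEL = '\0';    // assumed sentinel, distinct from source characters

    private final char[] buf = new char[2 * N + 2];  // two buffers, each followed by a sentinel slot
    private int forward = -1;                         // scanning pointer
    private final InputStream in;

    DoubleBuffer(InputStream in) throws IOException {
        this.in = in;
        reload(0);                            // fill the first buffer before scanning starts
    }

    // Read up to N characters into the buffer half starting at 'start', then place a sentinel.
    private void reload(int start) throws IOException {
        int i = 0, c;
        while (i < N && (c = in.read()) != -1) buf[start + i++] = (char) c;
        buf[start + i] = EOF_SENTINEL;        // marks end of valid data (or end of input)
    }

    // Advance forward and return the next character; only sentinel hits pay for the extra tests.
    char nextChar() throws IOException {
        char ch = buf[++forward];
        if (ch != EOF_SENTINEL) return ch;    // the common case: one test per character
        if (forward == N) {                   // sentinel at the end of the first buffer
            reload(N + 1);                    // refill the second buffer
            return nextChar();
        }
        if (forward == 2 * N + 1) {           // sentinel at the end of the second buffer
            reload(0);                        // refill the first buffer
            forward = -1;                     // continue scanning from its beginning
            return nextChar();
        }
        forward--;                            // sentinel not at a boundary: real end of input
        return EOF_SENTINEL;
    }
}

In the common case only one comparison is made per character; the buffer-boundary tests run only when the sentinel is actually seen.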
Contd.
Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.
A language is any countable set of strings over some fixed alphabet. Abstract languages like ∅, the empty set, or {Ɛ}, the set containing only the empty string, are languages under this definition. So too is the set of all syntactically well-formed Java programs.
Terms for Parts of Strings
The following string-related terms are commonly used (a small Java illustration follows the list):
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of s. For example, ban, banana, and Ɛ are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana, banana, and Ɛ are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For instance, banana, nan, and Ɛ are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not Ɛ and not equal to s itself.
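As a small illustration (my own, not from the slides), these terms correspond directly to familiar Java String operations:

public class StringParts {
    public static void main(String[] args) {
        String s = "banana";
        System.out.println(s.startsWith("ban"));   // true:  "ban" is a prefix of s
        System.out.println(s.endsWith("nana"));    // true:  "nana" is a suffix of s
        System.out.println(s.contains("nan"));     // true:  "nan" is a substring of s
        // A proper prefix is a prefix that is neither the empty string nor s itself.
        String p = "banana";
        System.out.println(s.startsWith(p) && !p.isEmpty() && !p.equals(s)); // false: not a proper prefix
    }
}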
Example: The Kleene closure of L, denoted L*, is the set of strings obtained by concatenating L zero or more times.
Note that L0, the "concatenation of L zero times," is defined to be {Ɛ}.
Contd.
Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. We may think of L and D in two, essentially equivalent, ways.
One way is that L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits; the other is that L and D are languages all of whose strings have length one. New languages can then be built from L and D with the operators of union, concatenation, closure (*), and +; a small sketch of the first two operators follows.
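A minimal sketch (my own illustration) of the union and concatenation operators applied to tiny finite stand-ins for L and D:

import java.util.LinkedHashSet;
import java.util.Set;

public class LanguageOps {
    // Union of two languages: L ∪ M
    static Set<String> union(Set<String> l, Set<String> m) {
        Set<String> r = new LinkedHashSet<>(l);
        r.addAll(m);
        return r;
    }

    // Concatenation of two languages: LM = { st | s in L, t in M }
    static Set<String> concat(Set<String> l, Set<String> m) {
        Set<String> r = new LinkedHashSet<>();
        for (String s : l)
            for (String t : m)
                r.add(s + t);
        return r;
    }

    public static void main(String[] args) {
        Set<String> L = Set.of("a", "b");            // a tiny stand-in for the set of letters
        Set<String> D = Set.of("0", "1");            // a tiny stand-in for the set of digits
        System.out.println(union(L, D));             // L ∪ D: length-one strings that are a letter or a digit
        System.out.println(concat(L, D));            // LD: a letter followed by a digit, e.g. a0, a1, b0, b1
        System.out.println(concat(L, union(L, D)));  // L(L ∪ D): a letter followed by a letter or digit
    }
}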
3. Character classes: a regular expression a1|a2|···|an, where the ai's are each symbols of the alphabet, can be replaced by the shorthand [a1a2···an]; when the symbols form a logical sequence, such as consecutive letters or digits, we can write [a1-an].
Example
Using the above shorthands we can rewrite the regular expressions for examples (a) and (c) in slide number 29 as follows:
Examples
(a) Regular definition for Java identifiers:
letter → [A-Za-z_]
digit → [0-9]
ID → letter (letter | digit)*

(c) Regular definition for unsigned numbers such as 5280, 0.01234, 6.336E4, or 1.89E-4:
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
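Assuming java.util.regex as a stand-in notation, the two regular definitions above can be written roughly as follows (the pattern strings are my translation, not part of the slides):

import java.util.regex.Pattern;

public class TokenPatterns {
    // letter (letter | digit)*  -- Java identifiers (simplified)
    static final Pattern ID = Pattern.compile("[A-Za-z_][A-Za-z_0-9]*");
    // digits (. digits)? (E [+-]? digits)?  -- unsigned numbers
    static final Pattern NUMBER = Pattern.compile("[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?");

    public static void main(String[] args) {
        System.out.println(ID.matcher("count_1").matches());     // true
        System.out.println(NUMBER.matcher("6.336E4").matches()); // true
        System.out.println(NUMBER.matcher("1.89E-4").matches()); // true
        System.out.println(NUMBER.matcher(".5").matches());      // false: a digit must precede the point
    }
}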
Exercises
1. Consult the language reference manuals to determine:
A. The sets of characters that form the input alphabet (excluding those that may only appear in character strings or comments),
B. The lexical form of numerical constants, and
C. The lexical form of identifiers, for the C++ and Java programming languages.
2. Describe the languages denoted by the following regular expressions:
A. a(a|b)*a
B. ((Ɛ|a)b*)*
Contd.
3. Write a regular definition for all strings of lowercase letters that contain the five vowels in order.
Recognition of Regular Expressions
1. The starting point is the language grammar, to understand the tokens:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | Ɛ
expr → term relop term
     | term
term → id
     | number

Fig. Transition diagram for reserved words and identifiers
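A hedged sketch of how such a transition diagram for identifiers and reserved words might be hand-coded: the start state requires a letter, a looping state consumes letters and digits, and the final state retracts one character and classifies the lexeme. The table and method names are illustrative only:

import java.util.Map;

public class IdRecognizer {
    // Hypothetical reserved-word table; a real compiler would install every keyword.
    static final Map<String, String> RESERVED = Map.of("if", "IF", "then", "THEN", "else", "ELSE");

    static String getToken(String input, int start) {
        int forward = start;
        if (forward >= input.length() || !Character.isLetter(input.charAt(forward)))
            return null;                                      // the diagram fails on a non-letter
        forward++;
        while (forward < input.length() && Character.isLetterOrDigit(input.charAt(forward)))
            forward++;                                        // stay in the looping state
        String lexeme = input.substring(start, forward);      // retract: the lexeme ends before forward
        return RESERVED.getOrDefault(lexeme, "ID(" + lexeme + ")");
    }

    public static void main(String[] args) {
        System.out.println(getToken("then x", 0));   // THEN  (reserved word)
        System.out.println(getToken("thenx ", 0));   // ID(thenx)
    }
}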
Contd.
[Figure: an example transition graph with states 0 and 1 and labeled edges.]
Nondeterministic Finite Automata (NFA)
A nondeterministic finite automaton (NFA) consists of:
1. A finite set of states S.
2. A set of input symbols ∑, the input alphabet. We assume that
Ɛ, which stands for the empty string, is never a member of ∑ .
3. A transition function that gives, for each state and for each symbol in ∑ ∪ {Ɛ}, a set of next states.
4. A state s0 from S that is distinguished as the start state (or initial state).
5. A set of states F, a subset of S, that is distinguished as the accepting states (or final states).
We can represent either an NFA or DFA by a transition graph,
where the nodes are states and the labeled edges represent the
transition function.
There is an edge labeled a from state s to state t if and
only if t is one of the next states for state s and input a.
This graph is very much like a transition diagram, except that:
a) the same symbol can label edges from one state to several different states, and
b) an edge may be labeled by Ɛ, the empty string, instead of, or in addition to, symbols from the input alphabet.
Contd.
An NFA can get into several states at once.
[Figure: an NFA with states Q1, Q2, Q3 and edges labeled a and b; Q3 is the accepting state.]
Example input: a b a
Transition Tables
We can also represent an NFA by a transition table, whose rows correspond to states and whose columns correspond to the input symbols and Ɛ.
The entry for a given state and input is the value of the transition function applied to those arguments.
If the transition function has no information about that state-input pair, we put ∅ in the table for the pair.
Example: the transition table for the NFA on the previous slide is represented as:

State | a     | b        | Ɛ
Q1    | {Q1}  | {Q1, Q2} | ∅
Q2    | {Q3}  | ∅        | ∅
Q3    | ∅     | ∅        | ∅

The transition table has the advantage that we can easily find the transitions on a given state and input. Its disadvantage is that it takes a lot of space when the input alphabet is large, yet most states do not have any moves on most of the input symbols.
Acceptance of Input Strings by Automata
An NFA accepts input string x if and only if there is some path in the transition graph from the start state to one of the accepting (final) states, such that the symbols along the path spell out x.
Note that Ɛ labels along the path are effectively ignored, since the empty string does not contribute to the string constructed along the path.
Example: strings aaba and bbbba are accepted by the NFA in slide 40.
The language defined (accepted) by an NFA is the set of strings labeling some path from the start to an accepting state.
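The acceptance test can be simulated by tracking the set of states the NFA may be in after each input symbol. The sketch below encodes the transition table shown above (state names as strings; this example NFA has no Ɛ-moves, so no Ɛ-closure step is needed); the names are illustrative:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NfaSimulation {
    // Transition table of the NFA: DELTA.get(state).get(symbol) is the set of next states.
    static final Map<String, Map<Character, Set<String>>> DELTA = Map.of(
        "Q1", Map.of('a', Set.of("Q1"), 'b', Set.of("Q1", "Q2")),
        "Q2", Map.of('a', Set.of("Q3")),
        "Q3", Map.of()
    );
    static final String START = "Q1";
    static final Set<String> ACCEPTING = Set.of("Q3");

    // The NFA accepts x iff some path labeled x leads from the start state to an accepting state.
    static boolean accepts(String x) {
        Set<String> current = Set.of(START);
        for (char c : x.toCharArray()) {
            Set<String> next = new HashSet<>();
            for (String s : current)
                next.addAll(DELTA.get(s).getOrDefault(c, Set.of()));
            current = next;                    // the NFA can be in several states at once
        }
        return current.stream().anyMatch(ACCEPTING::contains);
    }

    public static void main(String[] args) {
        System.out.println(accepts("aaba"));   // true
        System.out.println(accepts("bbbba"));  // true
        System.out.println(accepts("ab"));     // false
    }
}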
Deterministic Finite Automata (DFA)
A deterministic finite automaton (DFA) is a special case of an NFA where:
1. There are no moves on input Ɛ, and
2. For each state s and input symbol a, there is exactly one edge out of s labeled a.
If we are using a transition table to represent a DFA, then each entry is a single state; we may therefore represent this state without the curly braces, since braces denote sets of states.
[Figure: a DFA that accepts the strings which begin with a or b, or begin with c and contain at most one a.]
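Because every entry of a DFA's transition table is a single state, simulation is a simple loop that follows one edge per input character. The sketch below uses a toy DFA over {a, b} that accepts strings ending in ba (my own example, not the DFA in the figure above):

public class DfaSimulation {
    // Toy DFA: state 0 = start, state 1 = "last symbol was b", state 2 = "last two symbols were ba".
    static final int[][] MOVE = {
        //        a  b
        /* 0 */ { 0, 1 },
        /* 1 */ { 2, 1 },
        /* 2 */ { 0, 1 },
    };
    static final int START = 0;
    static final boolean[] ACCEPTING = { false, false, true };

    static boolean accepts(String x) {
        int s = START;
        for (char c : x.toCharArray()) {
            int col = (c == 'a') ? 0 : 1;   // map the input symbol to a table column
            s = MOVE[s][col];               // exactly one edge out of s labeled c
        }
        return ACCEPTING[s];
    }

    public static void main(String[] args) {
        System.out.println(accepts("aaba"));  // true
        System.out.println(accepts("ab"));    // false
    }
}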
JavaCC – A Lexical Analyzer and Parser Generator
JavaCC takes a single specification file (a .jj file) in which you describe the tokens and the grammar.
Flow for Using JavaCC
When there is more than one rule that matches the input, JavaCC resolves the choice as follows:
1. Use the rule that matches the longest possible string.
2. If two different rules match the same string, use the rule that appears first in the specification file.
For example, the "else" string matches both the ELSE and IDENTIFIER rules; because the ELSE rule is listed first, "else" is returned as the ELSE token.
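These two rules (longest match, then earliest rule) can be mimicked in plain Java; the sketch below is illustrative only and is not code generated by JavaCC:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatch {
    // Rules in specification order: ELSE is listed before IDENTIFIER, as in a .jj file.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ELSE", Pattern.compile("else"));
        RULES.put("IDENTIFIER", Pattern.compile("[A-Za-z_][A-Za-z_0-9]*"));
    }

    // Return the token kind for the longest match at position pos; ties go to the earlier rule.
    static String nextTokenKind(String input, int pos) {
        String bestKind = null;
        int bestLen = -1;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input).region(pos, input.length());
            if (m.lookingAt() && m.end() - pos > bestLen) {   // strictly longer only: the first rule wins ties
                bestKind = rule.getKey();
                bestLen = m.end() - pos;
            }
        }
        return bestKind;
    }

    public static void main(String[] args) {
        System.out.println(nextTokenKind("else", 0));    // ELSE (both rules match; ELSE is listed first)
        System.out.println(nextTokenKind("else1", 0));   // IDENTIFIER (the longer match wins)
    }
}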
/* Sample input file used to exercise the generated lexical analyzer */
public class testLA
{
int y=7;
float w+=2;
char z=t*8;
int x,y=2,z+=3;
if(x>0)
{
sum/=x;
}
else if(x==0)
{
sum=x;
}
else {}
}
while(x>=n)
{
float sum=2*3+4-6;
}
}
Tokens in JavaCC
From the token specification, JavaCC generates a token manager class whose method getNextToken() returns the next token from the input.
A constants interface, defining an integer constant for each token kind, is also created.
Using the Generated TokenManager in Java Code
To create a Java program that uses the lexical analyzer created by JavaCC, you need to instantiate a variable of type simpleTokenManager, where simple is the name of your .jj file.
The constructor for simpleTokenManager requires an object of type SimpleCharStream.
The constructor for SimpleCharStream requires a standard java.io.InputStream.
Thus, you can use the generated lexical analyzer as follows:

Token t;
simpleTokenManager tm;
java.io.InputStream infile;
infile = new java.io.FileInputStream("Inputfile.txt");
tm = new simpleTokenManager(new SimpleCharStream(infile));
t = tm.getNextToken();
while (t.kind != simpleConstants.EOF)
{
    // process the token t here (e.g., print t.image), then read the next one
    t = tm.getNextToken();
}
Project Work (15%)
Due:
Submit the commands used to generate and compile your lexical analyzer, and your testing results (outputs) with the javaTokenTest <filename> command.