
Chapter Two

Lexical Analysis

1
Objective
At the end of this session students will be able to:
• Understand the basic roles of the lexical analyzer (LA): Lexical Analysis versus Parsing; Tokens, Patterns, and Lexemes; Attributes for Tokens; and Lexical Errors.
• Understand the specification of tokens: Strings and Languages, Operations on Languages, Regular Expressions, Regular Definitions, and Extensions of Regular Expressions.
• Understand the generation of tokens: Transition Diagrams, Recognition of Reserved Words and Identifiers, Completion of the Running Example, and the Architecture of a Transition-Diagram-Based Lexical Analyzer.
• Understand the basics of automata: Nondeterministic Finite Automata (NFA) and Deterministic Finite Automata (DFA).
2


Introduction
• The lexical analysis phase of compilation breaks the text file (program) into smaller chunks called tokens.
• A token describes a pattern of characters having the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters, and so on).
• The lexical analysis phase of the compiler is often called tokenization.
• The lexical analyzer (LA) reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.
• For each lexeme, the lexical analyzer produces as output a token of the form {token-name, attribute-value} that it passes on to the subsequent phase, syntax analysis.
3
Example
• For example, suppose a source program contains the assignment statement
      position = initial + rate * 60
• The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token {id, 1}, where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position.
• The symbol-table entry for an identifier holds information about the identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token {=}. Since this token needs no attribute-value, the second component is omitted.
3. initial is a lexeme that is mapped into the token {id, 2}, where 2 points to the symbol-table entry for initial.
4


Contd.
4. + is a lexeme that is mapped into the token {+}.
5. rate is a lexeme that is mapped into the token {id, 3}, where 3 points to the symbol-table entry for rate.
6. * is a lexeme that is mapped into the token {*}.
7. 60 is a lexeme that is mapped into the token {60}.
• Blanks separating the lexemes would be discarded by the lexical analyzer.
5
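As an added illustration (not from the slides; the Token record and the symbol-table indices shown are assumptions), this Java sketch prints the token stream for the statement above:

// Java sketch of the token stream for: position = initial + rate * 60
public class TokenDemo {
    // A token is a pair: a token name plus an optional attribute value.
    record Token(String name, String attribute) {
        @Override public String toString() {
            return attribute == null ? "{" + name + "}"
                                     : "{" + name + ", " + attribute + "}";
        }
    }

    public static void main(String[] args) {
        Token[] stream = {
            new Token("id", "1"),   // position -> symbol-table entry 1
            new Token("=",  null),  // no attribute value needed
            new Token("id", "2"),   // initial  -> symbol-table entry 2
            new Token("+",  null),
            new Token("id", "3"),   // rate     -> symbol-table entry 3
            new Token("*",  null),
            new Token("60", null)   // the slides map 60 to the token {60}
        };
        for (Token t : stream) System.out.println(t);
    }
}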
Contd.
• The parser uses tokens produced by the LA to create a tree-like intermediate representation that depicts the grammatical structure of the token stream.
• The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition.
• An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands.
• In the process of translating a source program into target code, a compiler may construct one or more intermediate representations, which are machine-independent, easy to produce, and easy to translate into the target machine code.
• The code-optimization phase attempts to improve the intermediate code so that better target code will result. Usually better means faster, but other objectives may be desired, such as shorter code, or target code that consumes less power.
6
The Role of Lexical Analyzer

Fig. The lexical analyzer between the source program and the parser: the parser requests tokens via getNextToken, the lexical analyzer returns the next token, and both interact with the symbol table; the parser's output goes on to semantic analysis.

• It is common for the lexical analyzer to interact with the symbol table.
• When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.
• In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the proper token it must pass to the parser.
7
Other Tasks of an LA
• Since the LA is the part of the compiler that reads the source text, it may perform certain other tasks besides identification of lexemes, such as:
1. Stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).
2. Correlating error messages generated by the compiler with the source program.
• For instance, the LA may keep track of the number of newline characters seen, so it can associate a line number with each error message.
3. If the source program uses a macro-preprocessor, the expansion of macros may also be performed by the lexical analyzer.
• Sometimes, lexical analyzers are divided into a cascade of two processes: scanning, which performs simple tasks such as deleting comments and compacting consecutive whitespace characters, and lexical analysis proper, which produces tokens from the output of the scanner.
8
Lexical Analysis Versus Parsing
• There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases:
1. Simplicity of design:- the most important consideration; the separation often allows us to simplify one or the other of the compilation tasks.
• For example, a parser that had to deal with comments and whitespace as syntactic units would be considerably more complex than one that can assume comments and whitespace have already been removed by the lexical analyzer.
• If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.
2. Compiler efficiency is improved:- a separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing.
• In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
3. Compiler portability is enhanced:- input-device-specific peculiarities can be restricted to the lexical analyzer.
9
Tokens, Patterns, and Lexemes
• A lexeme is a sequence of characters in the source program that matches the pattern for a token.
Example: -5, i, sum, 100, for, ;, int, 10, +, -
• A token is a pair consisting of a token name and an optional attribute value.
Example:
Identifiers = { i, sum }
int_Constant = { 10, 100, -5 }
Oppr = { +, - }
rev_Words = { for, int }
Separators = { ;, , }
• Typical token classes:
• One token for each keyword. The pattern for a keyword is the same as the keyword itself.
• Tokens for the operators, either individually or in classes such as a single comparison token for the comparison operators.
• One token representing all identifiers.
• One or more tokens representing constants, such as numbers and literal strings.
• Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
• A pattern is a rule describing the set of lexemes that can represent a particular token in the source program.
• It is a description of the form that the lexemes of a token may take.
10
Contd.
• One efficient but complex brute-force approach is to read character by character, checking whether each one follows the right sequence of a token.
• The example below shows partial code for this brute-force lexical analyzer.
• What we'd like is a set of tools that will allow us to easily create and modify a lexical analyzer that has the same run-time efficiency as the brute-force method.
• The first tool that we use to attack this problem is Deterministic Finite Automata (DFA).

if (c = nextchar() == 'c') {
    if (c = nextchar() == 'l') {
        // code to handle the rest of either "class" or any identifier
        // that starts with "cl"
    } else if (c == 'a') {
        // code to handle the rest of either "case" or any identifier
        // that starts with "ca"
    } else {
        // code to handle any identifier that starts with c
    }
} else if (c = ...) {
    ...
}
11
Consideration for a simple design of Lexical Analyzer
A Lexical Analyzer can allow a source program to be:
1. Free-Format Input:- the alignment of lexemes should not be necessary in determining the correctness of the source program; such restrictions would put an extra load on the Lexical Analyzer.
2. Blanks Significance:- treating blanks as significant simplifies the task of identifying tokens.
E.g. Int a indicates <Int is keyword> <a is identifier>, while
     Inta indicates <Inta is identifier>
3. Keywords must be reserved:- keywords should be reserved; otherwise the LA will have to predict whether a given lexeme should be treated as a keyword or as an identifier.
E.g. if then then then = else;
     else else = then;
The above statements are misleading, as then and else are used as keywords but are not reserved.
Approaches to implementation
• Use assembly language:- most efficient, but most difficult to implement.
• Use high-level languages like C:- efficient, but difficult to implement.
• Use tools like Lex or Flex:- easy to implement, but not as efficient as the first two approaches.
13
Lexical Errors
• Lexical errors are primarily of two kinds:
1. Lexemes whose length exceeds the bound specified by the language.
• Most languages have a bound on the precision of numeric constants.
• A constant whose precision exceeds this bound is a lexical error.
2. Illegal characters in the program.
• Characters such as ~ (or any other symbol outside the language's alphabet) occurring in a given programming language (but not within a string or comment) are lexical errors.
14
Handling Lexical Errors
• It is hard for a LA to tell, without the aid of other components, that there is a source-code error.
• For instance, if the string fi is encountered for the first time in a Java program in the context: fi ( a == 2*4+3 ), a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
• However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.
• The simplest recovery strategy is "panic mode" recovery.
• We delete successive characters from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left.
15
Contd.
• Other possible error-recovery actions are:
1. Inserting a missing character into the remaining input.
2. Replacing an incorrect character by a correct character.
3. Transposing two adjacent characters (such as fi => if).
4. Deleting an extraneous character.
5. Pre-scanning.
16
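A minimal Java sketch of panic-mode recovery (added here; matchesSomeTokenPrefix is a hypothetical stand-in for the scanner's real pattern matcher):

// Panic mode: delete characters from the remaining input until some
// token pattern matches a prefix of what is left.
static String panicModeRecover(String remaining) {
    int pos = 0;
    while (pos < remaining.length()
            && !matchesSomeTokenPrefix(remaining.substring(pos))) {
        pos++;  // delete one more offending character
    }
    return remaining.substring(pos);
}

// Hypothetical stand-in: assume a well-formed token starts with a letter or digit.
static boolean matchesSomeTokenPrefix(String s) {
    return !s.isEmpty() && Character.isLetterOrDigit(s.charAt(0));
}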
Input Buffering
• There are some ways that the task of reading the source program can be sped up.
• This task is made difficult by the fact that we often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme.
• In the C language: we need to look after -, = or < to decide what token to return.
• We shall introduce a two-buffer scheme that handles large lookaheads safely.
• We then consider an improvement involving sentinels that saves time checking for the ends of buffers.
• Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character.
17
Contd.
• Buffer pairs:- two buffers of the same size are used, and they are alternately reloaded.
• Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.
• Using one system read command we can read N characters into a buffer, rather than using one system call per character.
• If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file and is different from any possible character of the source program.
• Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
18
Contd.
• Once the next lexeme is determined, forward is set to the character at its right end (this involves retracting).
• Then, after the lexeme is recorded as an attribute value of the token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.
• Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input, and move forward to the beginning of the newly loaded buffer.

Sentinels
• If we use the previous scheme, we must check, each time we advance forward, that we have not moved off one of the buffers; if we do, then we must also reload the other buffer.
• Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read. We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.
19
Contd.

Fig. Sentinels at the end of each buffer

• The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
• Note that eof retains its use as a marker for the end of the entire input.
• Any eof that appears other than at the end of a buffer means that the input is at an end.

switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}
20
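The same scheme can be written out in runnable Java. The sketch below is an added illustration (the class and field names are assumptions), using '\u0000' as the sentinel in place of eof and assuming that character never occurs in the source program:

import java.io.IOException;
import java.io.Reader;

// Two-buffer input with sentinels: an illustrative sketch, not the slides' code.
class DoubleBuffer {
    static final int N = 4096;                      // buffer size (one disk block)
    static final char EOF = '\u0000';               // sentinel character
    private final char[] buf = new char[2 * N + 2]; // two buffers + 2 sentinel slots
    private final Reader in;
    private int forward = -1;                       // the scanning pointer

    DoubleBuffer(Reader in) throws IOException {
        this.in = in;
        load(0);                                    // fill the first buffer
    }

    // Reload one buffer half and plant the sentinel after the valid characters.
    private void load(int start) throws IOException {
        int n = in.read(buf, start, N);
        buf[start + Math.max(n, 0)] = EOF;
    }

    // One test per character in the common case: only when the sentinel is
    // hit do we ask which buffer end we reached (or whether input is done).
    char nextChar() throws IOException {
        char c = buf[++forward];
        if (c != EOF) return c;                     // the usual, cheap path
        if (forward == N) {                         // end of first buffer
            load(N + 1);
        } else if (forward == 2 * N + 1) {          // end of second buffer
            load(0);
            forward = -1;
        } else {
            forward--;                              // real end of input: stay put
            return EOF;
        }
        return nextChar();
    }
}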
Specification of Tokens
• Regular expressions are an important notation for specifying lexeme patterns.
• While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.
• In the theory of compilation, regular expressions are used to formalize the specification of tokens.
• Regular expressions are means for specifying regular languages.

Strings and Languages
• An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation.
• The set {0, 1} is the binary alphabet.
21
Contd.
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
• In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
• The length of a string s, usually written |s|, is the number of occurrences of symbols in s.
• For example, banana is a string of length six.
• The empty string, denoted Ɛ, is the string of length zero.
• A language is any countable set of strings over some fixed alphabet.
• Abstract languages like ∅, the empty set, or {Ɛ}, the set containing only the empty string, are languages under this definition.
• So too are the set of all syntactically well-formed Java programs and the set of all grammatically correct English sentences.
22
Terms for Parts of Strings
• The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of s.
• For example, ban, banana, and Ɛ are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s.
• For example, nana, banana, and Ɛ are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s.
• For instance, banana, nan, and Ɛ are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are neither Ɛ nor equal to s itself.
23
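These definitions can be sanity-checked with plain java.lang.String operations (a small added example):

// Checking the banana examples with standard String methods.
public class StringTermsDemo {
    public static void main(String[] args) {
        String s = "banana";
        System.out.println(s.startsWith("ban"));   // true: ban is a prefix
        System.out.println(s.endsWith("nana"));    // true: nana is a suffix
        System.out.println(s.contains("nan"));     // true: nan is a substring
        // "ban" is also a *proper* prefix: it is neither Ɛ nor s itself.
        System.out.println(!"ban".isEmpty() && !"ban".equals(s));  // true
    }
}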


Operations on Languages
• In lexical analysis, the most important operations on languages are union, concatenation, and closure, which are defined formally below:
1. Union (U):- the familiar operation on sets.
Example: the union of L and M, L U M = { s | s is in L or s is in M }
2. Concatenation:- all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them.
Example: the concatenation of L and M, LM = { st | s is in L and t is in M }
3. Kleene closure:- the Kleene closure of a language L, denoted by L*, is the set of strings you get by concatenating L zero or more times.
Example: the Kleene closure of L, L* = L0 U L1 U L2 U …
Note that:- L0, the "concatenation of L zero times," is defined to be {Ɛ}, and inductively, Li is Li-1L.
24
Contd.
Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. We may think of L and D in two, essentially equivalent, ways.
• One way is that L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits.
• The second way is that L and D are languages, all of whose strings happen to be of length one.
• Here are some other languages that can be constructed from languages L and D, using the above operators:
1. L U D is the set of letters and digits - strictly speaking the language with 62 strings of length one, each of which strings is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
3. L* is the set of all strings of letters, including Ɛ, the empty string.
4. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
5. D+ is the set of all strings of one or more digits.
25
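The three operations are easy to experiment with in code. The following self-contained Java sketch (added here for illustration) applies them to small finite languages; since L* is infinite, the closure is enumerated only up to a bound:

import java.util.LinkedHashSet;
import java.util.Set;

// Union, concatenation, and a bounded Kleene closure over small languages.
public class LanguageOps {
    static Set<String> union(Set<String> l, Set<String> m) {
        Set<String> r = new LinkedHashSet<>(l);
        r.addAll(m);
        return r;
    }

    static Set<String> concat(Set<String> l, Set<String> m) {
        Set<String> r = new LinkedHashSet<>();
        for (String s : l)
            for (String t : m)
                r.add(s + t);             // every s from L followed by every t from M
        return r;
    }

    // L* is infinite, so we only enumerate L^0 U L^1 U ... U L^k.
    static Set<String> closureUpTo(Set<String> l, int k) {
        Set<String> r = new LinkedHashSet<>();
        Set<String> power = Set.of("");   // L^0 = {Ɛ}
        for (int i = 0; i <= k; i++) {
            r.addAll(power);
            power = concat(power, l);     // L^(i+1) = L^i L
        }
        return r;
    }

    public static void main(String[] args) {
        Set<String> L = Set.of("a", "b");
        Set<String> D = Set.of("0", "1");
        System.out.println(union(L, D));        // a, b, 0, 1 (in some order)
        System.out.println(concat(L, D));       // a0, a1, b0, b1 (in some order)
        System.out.println(closureUpTo(L, 2));  // Ɛ, a, b, aa, ab, ba, bb
    }
}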
Regular Expressions
• In the theory of compilation, regular expressions are used to formalize the specification of tokens.
• Regular expressions are means for specifying regular languages.
Example: ID → letter_(letter|digit)*
         letter → A|B|…|Z|a|b|…|z|_
         digit → 0|1|2|…|9
• Each regular expression is a pattern specifying the form of strings.
• The regular expressions are built recursively out of smaller regular expressions, using the following two base rules:
R1:- Ɛ is a regular expression, and L(Ɛ) = {Ɛ}, that is, the language whose sole member is the empty string.
R2:- If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.
Note:- By convention, we use italics for symbols, and boldface for their corresponding regular expression.
26
Contd.
• There are four parts to the induction whereby larger regular expressions are built from smaller ones.
• Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.
1. (r)|(s) is a regular expression denoting the language L(r) U L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r).
• This last rule says that we can add additional pairs of parentheses around expressions without changing the language they denote.
• Regular expressions often contain unnecessary pairs of parentheses.
• We may drop certain pairs of parentheses if we adopt the conventions that:
a. The unary operator * has highest precedence and is left associative.
b. Concatenation has second highest precedence and is left associative.
c. | has lowest precedence and is left associative.
27
Contd.
• A language that can be defined by a regular expression is called a regular set.
• If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s.
• For instance, (a|b) = (b|a).
• There are a number of algebraic laws for regular expressions; each law asserts that expressions of two different forms are equivalent.

LAW                              DESCRIPTION
r|s = s|r                        | is commutative
r|(s|t) = (r|s)|t                | is associative
r(st) = (rs)t                    Concatenation is associative
r(s|t) = rs|rt; (s|t)r = sr|tr   Concatenation distributes over |
Ɛr = rƐ = r                      Ɛ is the identity for concatenation
r* = (r|Ɛ)*                      Ɛ is guaranteed in a closure
r** = r*                         * is idempotent

Table: The algebraic laws that hold for arbitrary regular expressions r, s, and t.
28
Regular Definitions
• If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where:
1. Each di is a new symbol, not in ∑ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet ∑ U {d1, d2, ..., di-1}.
Examples
(a) Regular Definition for Java identifiers:
ID → letter(letter|digit)*
letter → A|B|…|Z|a|b|…|z|_
digit → 0|1|2|…|9
(b) Regular Definition for Java statements:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | Ɛ
expr → term relop term
     | term
term → id
     | number
(c) Regular Definition for unsigned numbers such as 5280, 0.01234, 6.336E4, or 1.89E-4:
digit → 0|1|2|…|9
digits → digit digit*
optFrac → . digits | Ɛ
optExp → (E (+|-|Ɛ) digits) | Ɛ
number → digits optFrac optExp
29
Extensions of Regular Expressions
• Many extensions have been added to regular expressions to enhance their ability to specify string patterns.
• Here are a few notational extensions:
1. One or more instances (+):- a unary postfix operator that represents the positive closure of a regular expression and its language. That is, if r is a regular expression, then (r)+ denotes the language (L(r))+.
• The operator + has the same precedence and associativity as the operator *.
• Two useful algebraic laws: r* = r+|Ɛ and r+ = rr* = r*r.
2. Zero or one instance (?):- r? is equivalent to r|Ɛ, or put another way, L(r?) = L(r) U {Ɛ}.
• The ? operator has the same precedence and associativity as * and +.
3. Character classes:- a regular expression a1|a2|···|an, where the ai's are each symbols of the alphabet, can be replaced by the shorthand [a1a2···an]. When a1, a2, ..., an form a logical sequence, we can replace them by the shorthand [a1-an]; for example, [a-z] denotes a|b|···|z.
30
Example
• Using the above shorthands we can rewrite the regular definitions of examples (a) and (c) on slide 29 as follows:
(a) Regular Definition for Java identifiers:
ID → letter(letter|digit)*
letter → [A-Za-z_]
digit → [0-9]
(c) Regular Definition for unsigned numbers such as 5280, 0.01234, 6.336E4, or 1.89E-4:
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
Exercises
1. Consult the language reference manuals to determine:
A. The sets of characters that form the input alphabet (excluding those that may only appear in character strings or comments),
B. The lexical form of numerical constants, and
C. The lexical form of identifiers, for the C++ and Java programming languages.
2. Describe the languages denoted by the following regular expressions:
A. a(a|b)*a
B. ((Ɛ|a)b*)*
31
Contd.
3. Write regular definitions for the following languages:
A. All strings of lowercase letters that contain the five vowels in order.
B. All strings of lowercase letters in which the letters are in ascending lexicographic order.
C. All strings of binary digits with no repeated digits.
D. All strings of binary digits with at most one repeated digit.
E. All strings of a's and b's where every a is preceded by b.
F. All strings of a's and b's that contain the substring abab.
32
Recognition of Regular Expressions
1. The starting point is the language grammar, to understand the tokens:
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | number
2. The next step is to formalize the patterns:
digit → [0-9]
digits → digit+
number → digits(.digits)?(E[+-]?digits)?
letter → [A-Za-z_]
id → letter(letter|digit)*
if → if
then → then
else → else
3. We also need to handle whitespace:
ws → (blank | tab | newline)+
33
Transition Diagram

Lexemes      Token Name   Attribute Value
Any ws       -            -
if           if           -
then         then         -
else         else         -
Any id       id           Pointer to table entry
Any number   number       Pointer to table entry
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
>            relop        GT
>=           relop        GE

Table: Tokens, their patterns, and attribute values

Fig. Transition diagram for relop
34
Fig. Transition diagram for reserved words and identifiers
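A hand-coded Java sketch of the relop diagram (added here; the state numbers follow the usual textbook drawing of this diagram, which is an assumption since the figure itself did not survive extraction):

// Simulating the relop transition diagram: a subset of states 0-8 as in the
// usual textbook figure. Returns the token with its attribute, or null.
public class RelopRecognizer {
    static String relop(String input) {
        int state = 0, pos = 0;
        while (true) {
            char c = pos < input.length() ? input.charAt(pos++) : '\0';
            switch (state) {
                case 0:
                    if (c == '<') state = 1;
                    else if (c == '=') return "<relop, EQ>";
                    else if (c == '>') state = 6;
                    else return null;              // fail: not a relop
                    break;
                case 1:                            // seen '<'
                    if (c == '=') return "<relop, LE>";
                    if (c == '>') return "<relop, NE>";
                    return "<relop, LT>";          // other input: retract one character
                case 6:                            // seen '>'
                    if (c == '=') return "<relop, GE>";
                    return "<relop, GT>";          // other input: retract one character
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(relop("<="));  // <relop, LE>
        System.out.println(relop("<5"));  // <relop, LT> (with retraction)
        System.out.println(relop(">"));   // <relop, GT>
    }
}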
Contd.

Fig. Transition diagram for unsigned numbers

Fig. Transition diagram for whitespace, where delim represents one or more whitespace characters
35
Finite Automata
• Lexical-analyzer generators use finite automata, at the heart of the transition from specification to implementation, to convert the input program into a lexical analyzer.
• These are essentially graphs, like transition diagrams, with a few differences:
1. Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.
2. Finite automata come in two flavors:
A. Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges. A symbol can label several edges out of the same state, and Ɛ, the empty string, is a possible label.
B. Deterministic finite automata (DFA) have, for each state, and for each symbol of its input alphabet, exactly one edge with that symbol leaving that state.
• Both deterministic and nondeterministic finite automata are capable of recognizing the same languages.
36
Finite Automata State Graphs
• Notation for state graphs:
• A state: a circle.
• The start state (initial state): a state with an incoming arrow.
• An accepting state (final state): a double circle.
• A transition: a labeled edge, e.g. an edge labeled a.

Example 1: A finite automaton that accepts only "1".
Example 2: A finite automaton accepting any number of 1's followed by a single 0.

Q: Check that "1110" is accepted but "110…" is not.
37
Contd.
Question: What language does this automaton recognize?

Fig. (transition diagram over the alphabet {0, 1})
38
Nondeterministic Finite Automata (NFA)
• A nondeterministic finite automaton (NFA) consists of:
1. A finite set of states S.
2. A set of input symbols ∑, the input alphabet. We assume that Ɛ, which stands for the empty string, is never a member of ∑.
3. A transition function that gives, for each state, and for each symbol in ∑ U {Ɛ}, a set of next states.
4. A state s0 from S that is distinguished as the start state (or initial state).
5. A set of states F, a subset of S, that is distinguished as the accepting states (or final states).
• We can represent either an NFA or DFA by a transition graph, where the nodes are states and the labeled edges represent the transition function.
• There is an edge labeled a from state s to state t if and only if t is one of the next states for state s and input a.
• This graph is very much like a transition diagram, except:
a. the same symbol can label edges from one state to several different states, and
b. an edge may be labeled by Ɛ, the empty string, instead of, or in addition to, symbols from the input alphabet.
39
Contd.
• An NFA can get into multiple states.

Fig. An NFA with states Q1, Q2, Q3: Q1 is the start state with self-loops on a and b, an edge labeled b from Q1 to Q2, and an edge labeled a from Q2 to the accepting state Q3.

Input: a b a

Rule: the NFA accepts if it can end up in a final state.
40
Transition Tables
• We can also represent an NFA by a transition table, whose rows correspond to states, and whose columns correspond to the input symbols and Ɛ.
• The entry for a given state and input is the value of the transition function applied to those arguments.
• If the transition function has no information about that state-input pair, we put ∅ in the table for the pair.
Example:- The transition table for the NFA on the previous slide is:

State   a       b           Ɛ
Q1      {Q1}    {Q1, Q2}    ∅
Q2      {Q3}    ∅           ∅
Q3      ∅       ∅           ∅

• The transition table has the advantage that we can easily find the transitions on a given state and input.
• Its disadvantage is that it takes a lot of space when the input alphabet is large, yet most states do not have any moves on most of the input symbols.
41
Acceptance of Input Strings by Automata
• An NFA accepts input string x if and only if there is some path in the transition graph from the start state to one of the accepting (final) states, such that the symbols along the path spell out x.
• Note that Ɛ labels along the path are effectively ignored, since the empty string does not contribute to the string constructed along the path.
Example:- The strings aaba and bbbba are accepted by the NFA on slide 40.
• The language defined (or accepted) by an NFA is the set of strings labeling some path from the start to an accepting state.
42
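The acceptance test can be run directly from the transition table. The added sketch below simulates the slide-40 NFA (which has no Ɛ-moves, so no Ɛ-closure step is needed) by tracking the set of states reachable after each input character; the Map encoding is an assumption:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Set-of-states simulation of the slide-40 NFA (accepts strings ending in "ba").
public class NfaSim {
    // MOVE.get(state) maps an input symbol to the set of next states.
    static final Map<Integer, Map<Character, Set<Integer>>> MOVE = Map.of(
        1, Map.of('a', Set.of(1), 'b', Set.of(1, 2)),
        2, Map.of('a', Set.of(3)),
        3, Map.of()
    );

    static boolean accepts(String input) {
        Set<Integer> current = Set.of(1);          // start in Q1
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : current)
                next.addAll(MOVE.get(s).getOrDefault(c, Set.of()));
            current = next;                        // all states reachable on c
        }
        return current.contains(3);                // Q3 is the accepting state
    }

    public static void main(String[] args) {
        System.out.println(accepts("aaba"));   // true  (slide 42 example)
        System.out.println(accepts("bbbba"));  // true
        System.out.println(accepts("abab"));   // false (does not end in "ba")
    }
}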
Deterministic Finite Automata (DFA)
• A deterministic finite automaton (DFA) is a special case of an NFA where:
1. There are no moves on input Ɛ, and
2. For each state s and input symbol a, there is exactly one edge out of s labeled a.
• If we are using a transition table to represent a DFA, then each entry is a single state.
• We may therefore represent this state without the curly braces that we use to form sets.
• While the NFA is an abstract representation of an algorithm to recognize the strings of a certain language, the DFA is a simple, concrete algorithm for recognizing strings.
43
Contd.
• A DFA is a collection of states and transitions:- given the input string, the transitions tell us how to move among the states.
• One of the states is denoted as the initial state, and a subset of the states are final states.
• We start from the initial state, move from state to state via the transitions, and check to see if we are in a final state when we have checked each character in the string.
• If we are, then the string is accepted; otherwise, the string is rejected.
• A DFA is a quintuple, a machine with five parameters, M = (Q, ∑, δ, q0, F), where
• Q is a finite set of states,
• ∑ is a finite set called the alphabet,
• δ: Q × ∑ → Q is the transition function,
• q0 ∈ Q is the start state, and
• F ⊆ Q is the set of final (accepting) states.
44
Example

1) Fig. A DFA that can accept the strings which begin with a or b, or begin with c and contain at most one a.

2) Fig. A DFA accepting (a|b)*abb
45
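A table-driven Java sketch of the second DFA, added for illustration (the four-state layout for (a|b)*abb is the standard one and is assumed here):

// Table-driven DFA for (a|b)*abb: state 3 is the accepting state.
public class DfaAbb {
    // rows: states 0..3; columns: input a (index 0) and b (index 1)
    static final int[][] DELTA = {
        {1, 0},  // state 0: on a go to 1, on b stay at 0
        {1, 2},  // state 1: seen "a"
        {1, 3},  // state 2: seen "ab"
        {1, 0}   // state 3: seen "abb" (accepting)
    };

    static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            if (c != 'a' && c != 'b') return false;  // outside the alphabet
            state = DELTA[state][c == 'a' ? 0 : 1];
        }
        return state == 3;
    }

    public static void main(String[] args) {
        System.out.println(accepts("abb"));    // true
        System.out.println(accepts("aabb"));   // true
        System.out.println(accepts("ababb"));  // true
        System.out.println(accepts("abab"));   // false
    }
}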


Introduction to Java Compiler Compiler
(JavaCC)

46
JavaCC – A Lexical Analyzer and Parser Generator
• Java Compiler Compiler (JavaCC) is the most popular lexical analyzer and parser generator for use with Java applications.
• In addition to this, JavaCC provides other standard capabilities related to parser generation such as tree building (via a tool called JJTree included with JavaCC), actions, debugging, etc.
• JavaCC takes a set of regular expressions as input that describe tokens,
• creates a DFA that recognizes the input set of tokens, and
• then creates a Java class that implements that DFA.
47
Flow for Using JavaCC
• javacc is a "top-down" parser generator.
• Some parser generators (such as yacc, bison, and JavaCUP) need a separate lexical-analyzer generator.
• With JavaCC, you can specify the tokens within the parser generator.
48
Structure of a JavaCC file
• Input files to JavaCC have the extension ".jj".
• A sample example, simple.jj, is shown on slides 51 and 52.
• Several tokens can be defined in the same TOKEN block, with the rules separated by a "|". (See the example on slide 51.)
• By convention, token names are in all UPPERCASE.
• When JavaCC runs on the file simple.jj, JavaCC creates a class simpleTokenManager, which contains a method getNextToken().
• The getNextToken() method implements the DFA that recognizes the tokens described by the regular expressions.
• Every time getNextToken is called, it finds the rule whose regular expression matches the next sequence of characters in the input file, and returns the token for that rule.
49
Contd.
• The file Inputfile (slide 53) is used as the input file for the simple.jj code.
• When there is more than one rule that matches the input, JavaCC uses the following strategy:
1. Always match the longest possible string.
2. If two different rules match the same string, use the rule that appears first in the .jj file.
• For example, the "else" string matches both the ELSE and IDENTIFIER rules; getNextToken() will return the ELSE token.
• But "else21" matches the IDENTIFIER token only.
50
Simple.jj

PARSER_BEGIN(simple)
public class simple
{
}
PARSER_END(simple)

TOKEN_MGR_DECLS:
{
    public static int count=0;
}

SKIP:
{
    <" ">
|   <"\n">
|   <"\t">
|   <"\r">
|   <"\b">
|   <"//"(~["\n"])*"\n">
|   <"/*"> {count++;} : INNER_COMMENT
}

<INNER_COMMENT>
SKIP:
{
    <"/*"> {count++;} : INNER_COMMENT
}

<INNER_COMMENT>
SKIP:
{
    <~[]>
|   <"*/"> { count--; if (count==0) SwitchTo(DEFAULT); }
}

TOKEN:
{
    <ELSE:"else">
|   <FOR:"for">
|   <AND:"and">
|   <CLASS:"class">
|   <PUBLIC:"public">
|   <PROTECTED:"protected">
|   <PRIVATE:"private">
|   <DO:"do">
|   <IF:"if">
|   <WHILE:"while">
|   <INT:"int">
|   <FLOAT:"float">
|   <CHAR:"char">
|   <VOID:"void">
}
51
Contd.

TOKEN:
{
    <PLUS:"+">
|   <MINUS:"-">
|   <TIMES:"*">
|   <DIVIDE:"/">
|   <SEMICOLON:";">
|   <LEFT_PARENTHESIS:"(">
|   <RIGHT_PARENTHESIS:")">
|   <LEFT_SQUARE_BRACKET:"[">
|   <RIGHT_SQUARE_BRACKET:"]">
|   <LEFT_BRACE:"{">
|   <RIGHT_BRACE:"}">
|   <DOT:".">
|   <EQUAL_TO:"==">
|   <NOT_EQUAL_TO:"!=">
|   <LESS_THAN:"<">
|   <GREATER_THAN:">">
|   <LESS_THAN_OR_EQUAL_TO:"<=">
|   <GREATER_THAN_OR_EQUAL_TO:">=">
|   <ASSIGNMENT:"=">
|   <LOGICAL_AND:"&&">
|   <LOGICAL_OR:"||">
|   <NOT:"!">
|   <DOLLAR:"$">
}

TOKEN:
{
    <IDENTIFIERS:["a"-"z","A"-"Z"](["a"-"z","A"-"Z","0"-"9"])*>
|   <INTEGER_LITERAL:(["0"-"9"])+>
}

void expression():
{}
{
    term() ((<PLUS>|<MINUS>) term())*
}

void term():
{}
{
    factor() ((<TIMES>|<DIVIDE>) factor())*
}

void factor():
{}
{
    <INTEGER_LITERAL>|<IDENTIFIERS>
}
52
Inputfile
/* this is an input file for simple.jj file
Prepared By:
Hailu.G

********************************************* */
public class testLA
{
int y=7;
float w+=2;
char z=t*8;
int x,y=2,z+=3;
if(x>0)
{
sum/=x;
}
else if(x==0)
{
sum=x;
}
else {}
}
while(x>=n)
{
float sum=2*3+4-6;
}
}
53
Tokens in JavaCC
• When we run JavaCC on the input file simple.jj, JavaCC creates several files to implement a lexical analyzer:
1. simpleTokenManager.java: implements the token manager class.
2. simpleConstants.java: an interface that defines a set of constants.
3. Token.java: describes the tokens returned by getNextToken().
4. In addition to these, the following additional files are also created: SimpleCharStream.java, TokenMgrError.java, and ParseException.java.
54

Using the Generated TokenManager in Java Code
• To create a Java program that uses the lexical analyzer created by JavaCC, you need to instantiate a variable of type simpleTokenManager, where simple is the name of your .jj file.
• The constructor for simpleTokenManager requires an object of type SimpleCharStream.
• The constructor for SimpleCharStream requires a standard java.io.InputStream.
• Thus, you can use the generated lexical analyzer as follows:

Token t;
simpleTokenManager tm;
java.io.InputStream infile;
infile = new java.io.FileInputStream("Inputfile.txt");
tm = new simpleTokenManager(new SimpleCharStream(infile));
t = tm.getNextToken();
while (t.kind != simpleConstants.EOF)
{
    /* process t, e.g., System.out.println(t.image); */
    t = tm.getNextToken();
}
55
Project Work (15%)

Due:

Project Title: A Lexical Analyzer for simpleJava tokens

You need to write a JavaCC file "SJava.jj" that creates a lexical analyzer for simpleJava tokens.

Study chapter 2 and JavaCC completely to figure out how to design, implement and test your lexical analyzer.

Submit:
• Your SJava.jj program
• Your compilation results (outputs) with the javacc and javac commands
• Your testing results (outputs) with the javaTokenTest <filename> command
56
