
Chapter Two

Lexical Analysis

1
Objective
At the end of this session students will be able to:
• Understand the basic roles of the lexical analyzer (LA): Lexical Analysis versus Parsing; Tokens, Patterns, and Lexemes; Attributes for Tokens; and Lexical Errors.
• Understand the specification of tokens: Strings and Languages, Operations on Languages, Regular Expressions, Regular Definitions, and Extensions of Regular Expressions.
• Understand the generation of tokens: Transition Diagrams, Recognition of Reserved Words and Identifiers, Completion of the Running Example, and the Architecture of a Transition-Diagram-Based Lexical Analyzer.
• Understand the basics of automata: Nondeterministic Finite Automata (NFA) and Deterministic Finite Automata (DFA).
2


Introduction
• The lexical analysis phase of compilation breaks the text file (program) into smaller chunks called tokens.
• A token describes a pattern of characters having the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters, and so on).
• The lexical analysis phase of the compiler is often called tokenization.
• The lexical analyzer (LA) reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.
• For each lexeme, the lexical analyzer produces as output a token of the form {token-name, attribute-value} that it passes on to the subsequent phase, syntax analysis.
3
Example
• For example, suppose a source program contains the assignment statement
      position = initial + rate * 60
• The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token {id, 1}, where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position.
• The symbol-table entry for an identifier holds information about the identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token {=}. Since this token needs no attribute-value, the second component is omitted.
3. initial is a lexeme that is mapped into the token {id, 2}, where 2 points to the symbol-table entry for initial.
4


Contd.
4. + is a lexeme that is mapped into the token {+}.
5. rate is a lexeme that is mapped into the token {id, 3}, where 3 points to the symbol-table entry for rate.
6. * is a lexeme that is mapped into the token {*}.
7. 60 is a lexeme that is mapped into the token {60}.
• Blanks separating the lexemes would be discarded by the lexical analyzer.
5
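As an added illustration (not from the slides; the Token record and the symbol-table indices shown are assumptions), this Java sketch prints the token stream for the statement above:

// Java sketch of the token stream for: position = initial + rate * 60
public class TokenDemo {
    // A token is a pair: a token name plus an optional attribute value.
    record Token(String name, String attribute) {
        @Override public String toString() {
            return attribute == null ? "{" + name + "}"
                                     : "{" + name + ", " + attribute + "}";
        }
    }

    public static void main(String[] args) {
        Token[] stream = {
            new Token("id", "1"),   // position -> symbol-table entry 1
            new Token("=",  null),  // no attribute value needed
            new Token("id", "2"),   // initial  -> symbol-table entry 2
            new Token("+",  null),
            new Token("id", "3"),   // rate     -> symbol-table entry 3
            new Token("*",  null),
            new Token("60", null)   // the slides map 60 to the token {60}
        };
        for (Token t : stream) System.out.println(t);
    }
}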
Contd.
• The parser uses tokens produced by the LA to create a tree-like intermediate representation that depicts the grammatical structure of the token stream.
• The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition.
• An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands.
• In the process of translating a source program into target code, a compiler may construct one or more intermediate representations, which are machine-independent, easy to produce, and easy to translate into the target machine code.
• The code-optimization phase attempts to improve the intermediate code so that better target code will result. Usually better means faster, but other objectives may be desired, such as shorter code, or target code that consumes less power.
6
The Role of Lexical Analyzer

Fig. The lexical analyzer between the source program and the parser: the parser requests tokens via getNextToken, the lexical analyzer returns the next token, and both interact with the symbol table; the parser's output goes on to semantic analysis.

• It is common for the lexical analyzer to interact with the symbol table.
• When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.
• In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the proper token it must pass to the parser.
7
Other Tasks of an LA
• Since the LA is the part of the compiler that reads the source text, it may perform certain other tasks besides identification of lexemes, such as:
1. Stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).
2. Correlating error messages generated by the compiler with the source program.
• For instance, the LA may keep track of the number of newline characters seen, so it can associate a line number with each error message.
3. If the source program uses a macro-preprocessor, the expansion of macros may also be performed by the lexical analyzer.
• Sometimes, lexical analyzers are divided into a cascade of two processes: scanning, which performs simple tasks such as deleting comments and compacting consecutive whitespace characters, and lexical analysis proper, which produces tokens from the output of the scanner.
8
Lexical Analysis Versus Parsing
• There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases:
1. Simplicity of design:- the most important consideration; the separation often allows us to simplify one or the other of the compilation tasks.
• For example, a parser that had to deal with comments and whitespace as syntactic units would be considerably more complex than one that can assume comments and whitespace have already been removed by the lexical analyzer.
• If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.
2. Compiler efficiency is improved:- a separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing.
• In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
3. Compiler portability is enhanced:- input-device-specific peculiarities can be restricted to the lexical analyzer.
9
Tokens, Patterns, and Lexemes
• A lexeme is a sequence of characters in the source program that matches the pattern for a token.
Example: -5, i, sum, 100, for, ;, int, 10, +, -
• A token is a pair consisting of a token name and an optional attribute value.
Example:
Identifiers = { i, sum }
int_Constant = { 10, 100, -5 }
Oppr = { +, - }
rev_Words = { for, int }
Separators = { ;, , }
• Typical token classes:
• One token for each keyword. The pattern for a keyword is the same as the keyword itself.
• Tokens for the operators, either individually or in classes such as a single comparison token for the comparison operators.
• One token representing all identifiers.
• One or more tokens representing constants, such as numbers and literal strings.
• Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
• A pattern is a rule describing the set of lexemes that can represent a particular token in the source program.
• It is a description of the form that the lexemes of a token may take.
10
Contd.
• One efficient but complex brute-force approach is to read character by character, checking whether each one follows the right sequence of a token.
• The example below shows partial code for this brute-force lexical analyzer.
• What we'd like is a set of tools that will allow us to easily create and modify a lexical analyzer that has the same run-time efficiency as the brute-force method.
• The first tool that we use to attack this problem is Deterministic Finite Automata (DFA).

if (c = nextchar() == 'c') {
    if (c = nextchar() == 'l') {
        // code to handle the rest of either "class" or any identifier
        // that starts with "cl"
    } else if (c == 'a') {
        // code to handle the rest of either "case" or any identifier
        // that starts with "ca"
    } else {
        // code to handle any identifier that starts with c
    }
} else if (c = ...) {
    ...
}
11
Consideration for a simple design of Lexical Analyzer
A Lexical Analyzer can allow a source program to be:
1. Free-Format Input:- the alignment of lexemes should not be necessary in determining the correctness of the source program; such restrictions would put an extra load on the Lexical Analyzer.
2. Blanks Significance:- treating blanks as significant simplifies the task of identifying tokens.
E.g. Int a indicates <Int is keyword> <a is identifier>, while
     Inta indicates <Inta is identifier>
3. Keywords must be reserved:- keywords should be reserved; otherwise the LA will have to predict whether a given lexeme should be treated as a keyword or as an identifier.
E.g. if then then then = else;
     else else = then;
The above statements are misleading, as then and else are used as keywords but are not reserved.
Approaches to implementation
• Use assembly language:- most efficient, but most difficult to implement.
• Use high-level languages like C:- efficient, but difficult to implement.
• Use tools like Lex or Flex:- easy to implement, but not as efficient as the first two approaches.
13
Lexical Errors
• Lexical errors are primarily of two kinds:
1. Lexemes whose length exceeds the bound specified by the language.
• Most languages have a bound on the precision of numeric constants.
• A constant whose precision exceeds this bound is a lexical error.
2. Illegal characters in the program.
• Characters such as ~ (or any other symbol outside the language's alphabet) occurring in a given programming language (but not within a string or comment) are lexical errors.
14
Handling Lexical Errors
• It is hard for a LA to tell, without the aid of other components, that there is a source-code error.
• For instance, if the string fi is encountered for the first time in a Java program in the context: fi ( a == 2*4+3 ), a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
• However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.
• The simplest recovery strategy is "panic mode" recovery.
• We delete successive characters from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left.
15
Contd.
• Other possible error-recovery actions are:
1. Inserting a missing character into the remaining input.
2. Replacing an incorrect character by a correct character.
3. Transposing two adjacent characters (such as fi => if).
4. Deleting an extraneous character.
5. Pre-scanning.
16
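A minimal Java sketch of panic-mode recovery (added here; matchesSomeTokenPrefix is a hypothetical stand-in for the scanner's real pattern matcher):

// Panic mode: delete characters from the remaining input until some
// token pattern matches a prefix of what is left.
static String panicModeRecover(String remaining) {
    int pos = 0;
    while (pos < remaining.length()
            && !matchesSomeTokenPrefix(remaining.substring(pos))) {
        pos++;  // delete one more offending character
    }
    return remaining.substring(pos);
}

// Hypothetical stand-in: assume a well-formed token starts with a letter or digit.
static boolean matchesSomeTokenPrefix(String s) {
    return !s.isEmpty() && Character.isLetterOrDigit(s.charAt(0));
}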
Input Buffering
• There are some ways that the task of reading the source program can be sped up.
• This task is made difficult by the fact that we often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme.
• In the C language: we need to look after -, = or < to decide what token to return.
• We shall introduce a two-buffer scheme that handles large lookaheads safely.
• We then consider an improvement involving sentinels that saves time checking for the ends of buffers.
• Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character.
17
Contd.
• Buffer pairs:- two buffers of the same size are used, and they are alternately reloaded.
• Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.
• Using one system read command we can read N characters into a buffer, rather than using one system call per character.
• If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file and is different from any possible character of the source program.
• Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
18
Contd.
• Once the next lexeme is determined, forward is set to the character at its right end (this involves retracting).
• Then, after the lexeme is recorded as an attribute value of the token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.
• Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input, and move forward to the beginning of the newly loaded buffer.

Sentinels
• If we use the previous scheme, we must check, each time we advance forward, that we have not moved off one of the buffers; if we do, then we must also reload the other buffer.
• Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read. We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.
19
Contd.

Fig. Sentinels at the end of each buffer

• The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
• Note that eof retains its use as a marker for the end of the entire input.
• Any eof that appears other than at the end of a buffer means that the input is at an end.

switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}
20
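The same scheme can be written out in runnable Java. The sketch below is an added illustration (the class and field names are assumptions), using '\u0000' as the sentinel in place of eof and assuming that character never occurs in the source program:

import java.io.IOException;
import java.io.Reader;

// Two-buffer input with sentinels: an illustrative sketch, not the slides' code.
class DoubleBuffer {
    static final int N = 4096;                      // buffer size (one disk block)
    static final char EOF = '\u0000';               // sentinel character
    private final char[] buf = new char[2 * N + 2]; // two buffers + 2 sentinel slots
    private final Reader in;
    private int forward = -1;                       // the scanning pointer

    DoubleBuffer(Reader in) throws IOException {
        this.in = in;
        load(0);                                    // fill the first buffer
    }

    // Reload one buffer half and plant the sentinel after the valid characters.
    private void load(int start) throws IOException {
        int n = in.read(buf, start, N);
        buf[start + Math.max(n, 0)] = EOF;
    }

    // One test per character in the common case: only when the sentinel is
    // hit do we ask which buffer end we reached (or whether input is done).
    char nextChar() throws IOException {
        char c = buf[++forward];
        if (c != EOF) return c;                     // the usual, cheap path
        if (forward == N) {                         // end of first buffer
            load(N + 1);
        } else if (forward == 2 * N + 1) {          // end of second buffer
            load(0);
            forward = -1;
        } else {
            forward--;                              // real end of input: stay put
            return EOF;
        }
        return nextChar();
    }
}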
Specification of Tokens
• Regular expressions are an important notation for specifying lexeme patterns.
• While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.
• In the theory of compilation, regular expressions are used to formalize the specification of tokens.
• Regular expressions are means for specifying regular languages.

Strings and Languages
• An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation.
• The set {0, 1} is the binary alphabet.
21
Contd.
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
• In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
• The length of a string s, usually written |s|, is the number of occurrences of symbols in s.
• For example, banana is a string of length six.
• The empty string, denoted Ɛ, is the string of length zero.
• A language is any countable set of strings over some fixed alphabet.
• Abstract languages like ∅, the empty set, or {Ɛ}, the set containing only the empty string, are languages under this definition.
• So too are the set of all syntactically well-formed Java programs and the set of all grammatically correct English sentences.
22
Terms for Parts of Strings
• The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of s.
• For example, ban, banana, and Ɛ are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s.
• For example, nana, banana, and Ɛ are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s.
• For instance, banana, nan, and Ɛ are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are neither Ɛ nor equal to s itself.
23
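These definitions can be sanity-checked with plain java.lang.String operations (a small added example):

// Checking the banana examples with standard String methods.
public class StringTermsDemo {
    public static void main(String[] args) {
        String s = "banana";
        System.out.println(s.startsWith("ban"));   // true: ban is a prefix
        System.out.println(s.endsWith("nana"));    // true: nana is a suffix
        System.out.println(s.contains("nan"));     // true: nan is a substring
        // "ban" is also a *proper* prefix: it is neither Ɛ nor s itself.
        System.out.println(!"ban".isEmpty() && !"ban".equals(s));  // true
    }
}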


Operations on Languages
• In lexical analysis, the most important operations on languages are union, concatenation, and closure, which are defined formally below:
1. Union (U):- the familiar operation on sets.
Example: the union of L and M, L U M = { s | s is in L or s is in M }
2. Concatenation:- all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them.
Example: the concatenation of L and M, LM = { st | s is in L and t is in M }
3. Kleene closure:- the Kleene closure of a language L, denoted by L*, is the set of strings you get by concatenating L zero or more times.
Example: the Kleene closure of L, L* = L0 U L1 U L2 U …
Note that:- L0, the "concatenation of L zero times," is defined to be {Ɛ}, and inductively, Li is Li-1L.
24
Contd.
Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. We may think of L and D in two, essentially equivalent, ways.
• One way is that L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits.
• The second way is that L and D are languages, all of whose strings happen to be of length one.
• Here are some other languages that can be constructed from languages L and D, using the above operators:
1. L U D is the set of letters and digits - strictly speaking the language with 62 strings of length one, each of which strings is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
3. L* is the set of all strings of letters, including Ɛ, the empty string.
4. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
5. D+ is the set of all strings of one or more digits.
25
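The three operations are easy to experiment with in code. The following self-contained Java sketch (added here for illustration) applies them to small finite languages; since L* is infinite, the closure is enumerated only up to a bound:

import java.util.LinkedHashSet;
import java.util.Set;

// Union, concatenation, and a bounded Kleene closure over small languages.
public class LanguageOps {
    static Set<String> union(Set<String> l, Set<String> m) {
        Set<String> r = new LinkedHashSet<>(l);
        r.addAll(m);
        return r;
    }

    static Set<String> concat(Set<String> l, Set<String> m) {
        Set<String> r = new LinkedHashSet<>();
        for (String s : l)
            for (String t : m)
                r.add(s + t);             // every s from L followed by every t from M
        return r;
    }

    // L* is infinite, so we only enumerate L^0 U L^1 U ... U L^k.
    static Set<String> closureUpTo(Set<String> l, int k) {
        Set<String> r = new LinkedHashSet<>();
        Set<String> power = Set.of("");   // L^0 = {Ɛ}
        for (int i = 0; i <= k; i++) {
            r.addAll(power);
            power = concat(power, l);     // L^(i+1) = L^i L
        }
        return r;
    }

    public static void main(String[] args) {
        Set<String> L = Set.of("a", "b");
        Set<String> D = Set.of("0", "1");
        System.out.println(union(L, D));        // a, b, 0, 1 (in some order)
        System.out.println(concat(L, D));       // a0, a1, b0, b1 (in some order)
        System.out.println(closureUpTo(L, 2));  // Ɛ, a, b, aa, ab, ba, bb
    }
}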
Regular Expressions
• In the theory of compilation, regular expressions are used to formalize the specification of tokens.
• Regular expressions are means for specifying regular languages.
Example: ID → letter_(letter|digit)*
         letter → A|B|…|Z|a|b|…|z|_
         digit → 0|1|2|…|9
• Each regular expression is a pattern specifying the form of strings.
• The regular expressions are built recursively out of smaller regular expressions, using the following two base rules:
R1:- Ɛ is a regular expression, and L(Ɛ) = {Ɛ}, that is, the language whose sole member is the empty string.
R2:- If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.
Note:- By convention, we use italics for symbols, and boldface for their corresponding regular expression.
26
Contd.
• There are four parts to the induction whereby larger regular expressions are built from smaller ones.
• Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.
1. (r)|(s) is a regular expression denoting the language L(r) U L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r).
• This last rule says that we can add additional pairs of parentheses around expressions without changing the language they denote.
• Regular expressions often contain unnecessary pairs of parentheses.
• We may drop certain pairs of parentheses if we adopt the conventions that:
a. The unary operator * has highest precedence and is left associative.
b. Concatenation has second highest precedence and is left associative.
c. | has lowest precedence and is left associative.
27
Contd.
• A language that can be defined by a regular expression is called a regular set.
• If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s.
• For instance, (a|b) = (b|a).
• There are a number of algebraic laws for regular expressions; each law asserts that expressions of two different forms are equivalent.

LAW                              DESCRIPTION
r|s = s|r                        | is commutative
r|(s|t) = (r|s)|t                | is associative
r(st) = (rs)t                    Concatenation is associative
r(s|t) = rs|rt; (s|t)r = sr|tr   Concatenation distributes over |
Ɛr = rƐ = r                      Ɛ is the identity for concatenation
r* = (r|Ɛ)*                      Ɛ is guaranteed in a closure
r** = r*                         * is idempotent

Table: The algebraic laws that hold for arbitrary regular expressions r, s, and t.
28
Regular Definitions
• If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where:
1. Each di is a new symbol, not in ∑ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet ∑ U {d1, d2, ..., di-1}.
Examples
(a) Regular Definition for Java identifiers:
ID → letter(letter|digit)*
letter → A|B|…|Z|a|b|…|z|_
digit → 0|1|2|…|9
(b) Regular Definition for Java statements:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | Ɛ
expr → term relop term
     | term
term → id
     | number
(c) Regular Definition for unsigned numbers such as 5280, 0.01234, 6.336E4, or 1.89E-4:
digit → 0|1|2|…|9
digits → digit digit*
optFrac → . digits | Ɛ
optExp → (E (+|-|Ɛ) digits) | Ɛ
number → digits optFrac optExp
29
Extensions of Regular Expressions
• Many extensions have been added to regular expressions to enhance their ability to specify string patterns.
• Here are a few notational extensions:
1. One or more instances (+):- a unary postfix operator that represents the positive closure of a regular expression and its language. That is, if r is a regular expression, then (r)+ denotes the language (L(r))+.
• The operator + has the same precedence and associativity as the operator *.
• Two useful algebraic laws: r* = r+|Ɛ and r+ = rr* = r*r.
2. Zero or one instance (?):- r? is equivalent to r|Ɛ, or put another way, L(r?) = L(r) U {Ɛ}.
• The ? operator has the same precedence and associativity as * and +.
3. Character classes:- a regular expression a1|a2|···|an, where the ai's are each symbols of the alphabet, can be replaced by the shorthand [a1a2···an]. When a1, a2, ..., an form a logical sequence, we can replace them by the shorthand [a1-an]; for example, [a-z] denotes a|b|···|z.
30
Example
• Using the above shorthands we can rewrite the regular definitions of examples (a) and (c) on slide 29 as follows:
(a) Regular Definition for Java identifiers:
ID → letter(letter|digit)*
letter → [A-Za-z_]
digit → [0-9]
(c) Regular Definition for unsigned numbers such as 5280, 0.01234, 6.336E4, or 1.89E-4:
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
Exercises
1. Consult the language reference manuals to determine:
A. The sets of characters that form the input alphabet (excluding those that may only appear in character strings or comments),
B. The lexical form of numerical constants, and
C. The lexical form of identifiers, for the C++ and Java programming languages.
2. Describe the languages denoted by the following regular expressions:
A. a(a|b)*a
B. ((Ɛ|a)b*)*
31
Contd.
3. Write regular definitions for the following languages:
A. All strings of lowercase letters that contain the five vowels in order.
B. All strings of lowercase letters in which the letters are in ascending lexicographic order.
C. All strings of binary digits with no repeated digits.
D. All strings of binary digits with at most one repeated digit.
E. All strings of a's and b's where every a is preceded by b.
F. All strings of a's and b's that contain the substring abab.
32
Recognition of Regular Expressions
1. The starting point is the language grammar, to understand the tokens:
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | number
2. The next step is to formalize the patterns:
digit → [0-9]
digits → digit+
number → digits(.digits)?(E[+-]?digits)?
letter → [A-Za-z_]
id → letter(letter|digit)*
if → if
then → then
else → else
3. We also need to handle whitespace:
ws → (blank | tab | newline)+
33
Transition Diagram

Lexemes      Token Name   Attribute Value
Any ws       -            -
if           if           -
then         then         -
else         else         -
Any id       id           Pointer to table entry
Any number   number       Pointer to table entry
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
>            relop        GT
>=           relop        GE

Table: Tokens, their patterns, and attribute values

Fig. Transition diagram for relop
34
Fig. Transition diagram for reserved words and identifiers
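A hand-coded Java sketch of the relop diagram (added here; the state numbers follow the usual textbook drawing of this diagram, which is an assumption since the figure itself did not survive extraction):

// Simulating the relop transition diagram: a subset of states 0-8 as in the
// usual textbook figure. Returns the token with its attribute, or null.
public class RelopRecognizer {
    static String relop(String input) {
        int state = 0, pos = 0;
        while (true) {
            char c = pos < input.length() ? input.charAt(pos++) : '\0';
            switch (state) {
                case 0:
                    if (c == '<') state = 1;
                    else if (c == '=') return "<relop, EQ>";
                    else if (c == '>') state = 6;
                    else return null;              // fail: not a relop
                    break;
                case 1:                            // seen '<'
                    if (c == '=') return "<relop, LE>";
                    if (c == '>') return "<relop, NE>";
                    return "<relop, LT>";          // other input: retract one character
                case 6:                            // seen '>'
                    if (c == '=') return "<relop, GE>";
                    return "<relop, GT>";          // other input: retract one character
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(relop("<="));  // <relop, LE>
        System.out.println(relop("<5"));  // <relop, LT> (with retraction)
        System.out.println(relop(">"));   // <relop, GT>
    }
}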
Contd.

Fig. Transition diagram for unsigned numbers

Fig. Transition diagram for whitespace, where delim represents one or more whitespace characters
35
Finite Automata
• Lexical-analyzer generators use finite automata, at the heart of the transition from specification to implementation, to convert the input program into a lexical analyzer.
• These are essentially graphs, like transition diagrams, with a few differences:
1. Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.
2. Finite automata come in two flavors:
A. Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges. A symbol can label several edges out of the same state, and Ɛ, the empty string, is a possible label.
B. Deterministic finite automata (DFA) have, for each state, and for each symbol of its input alphabet, exactly one edge with that symbol leaving that state.
• Both deterministic and nondeterministic finite automata are capable of recognizing the same languages.
36
Finite Automata State Graphs
• Notation for state graphs:
• A state: a circle.
• The start state (initial state): a state with an incoming arrow.
• An accepting state (final state): a double circle.
• A transition: a labeled edge, e.g. an edge labeled a.

Example 1: A finite automaton that accepts only "1".
Example 2: A finite automaton accepting any number of 1's followed by a single 0.

Q: Check that "1110" is accepted but "110…" is not.
37
Contd.
Question: What language does this automaton recognize?

Fig. (transition diagram over the alphabet {0, 1})
38
Nondeterministic Finite Automata (NFA)
• A nondeterministic finite automaton (NFA) consists of:
1. A finite set of states S.
2. A set of input symbols ∑, the input alphabet. We assume that Ɛ, which stands for the empty string, is never a member of ∑.
3. A transition function that gives, for each state, and for each symbol in ∑ U {Ɛ}, a set of next states.
4. A state s0 from S that is distinguished as the start state (or initial state).
5. A set of states F, a subset of S, that is distinguished as the accepting states (or final states).
• We can represent either an NFA or DFA by a transition graph, where the nodes are states and the labeled edges represent the transition function.
• There is an edge labeled a from state s to state t if and only if t is one of the next states for state s and input a.
• This graph is very much like a transition diagram, except:
a. the same symbol can label edges from one state to several different states, and
b. an edge may be labeled by Ɛ, the empty string, instead of, or in addition to, symbols from the input alphabet.
39
Contd.
• An NFA can get into multiple states.

Fig. An NFA with states Q1, Q2, Q3: Q1 is the start state with self-loops on a and b, an edge labeled b from Q1 to Q2, and an edge labeled a from Q2 to the accepting state Q3.

Input: a b a

Rule: the NFA accepts if it can end up in a final state.
40
Transition Tables
• We can also represent an NFA by a transition table, whose rows correspond to states, and whose columns correspond to the input symbols and Ɛ.
• The entry for a given state and input is the value of the transition function applied to those arguments.
• If the transition function has no information about that state-input pair, we put ∅ in the table for the pair.
Example:- The transition table for the NFA on the previous slide is:

State   a       b           Ɛ
Q1      {Q1}    {Q1, Q2}    ∅
Q2      {Q3}    ∅           ∅
Q3      ∅       ∅           ∅

• The transition table has the advantage that we can easily find the transitions on a given state and input.
• Its disadvantage is that it takes a lot of space when the input alphabet is large, yet most states do not have any moves on most of the input symbols.
41
Acceptance of Input Strings by Automata
• An NFA accepts input string x if and only if there is some path in the transition graph from the start state to one of the accepting (final) states, such that the symbols along the path spell out x.
• Note that Ɛ labels along the path are effectively ignored, since the empty string does not contribute to the string constructed along the path.
Example:- The strings aaba and bbbba are accepted by the NFA on slide 40.
• The language defined (or accepted) by an NFA is the set of strings labeling some path from the start to an accepting state.
42
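The acceptance test can be run directly from the transition table. The added sketch below simulates the slide-40 NFA (which has no Ɛ-moves, so no Ɛ-closure step is needed) by tracking the set of states reachable after each input character; the Map encoding is an assumption:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Set-of-states simulation of the slide-40 NFA (accepts strings ending in "ba").
public class NfaSim {
    // MOVE.get(state) maps an input symbol to the set of next states.
    static final Map<Integer, Map<Character, Set<Integer>>> MOVE = Map.of(
        1, Map.of('a', Set.of(1), 'b', Set.of(1, 2)),
        2, Map.of('a', Set.of(3)),
        3, Map.of()
    );

    static boolean accepts(String input) {
        Set<Integer> current = Set.of(1);          // start in Q1
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : current)
                next.addAll(MOVE.get(s).getOrDefault(c, Set.of()));
            current = next;                        // all states reachable on c
        }
        return current.contains(3);                // Q3 is the accepting state
    }

    public static void main(String[] args) {
        System.out.println(accepts("aaba"));   // true  (slide 42 example)
        System.out.println(accepts("bbbba"));  // true
        System.out.println(accepts("abab"));   // false (does not end in "ba")
    }
}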
Deterministic Finite Automata (DFA)
• A deterministic finite automaton (DFA) is a special case of an NFA where:
1. There are no moves on input Ɛ, and
2. For each state s and input symbol a, there is exactly one edge out of s labeled a.
• If we are using a transition table to represent a DFA, then each entry is a single state.
• We may therefore represent this state without the curly braces that we use to form sets.
• While the NFA is an abstract representation of an algorithm to recognize the strings of a certain language, the DFA is a simple, concrete algorithm for recognizing strings.
43
Contd.
• A DFA is a collection of states and transitions:- given the input string, the transitions tell us how to move among the states.
• One of the states is denoted as the initial state, and a subset of the states are final states.
• We start from the initial state, move from state to state via the transitions, and check to see if we are in a final state when we have checked each character in the string.
• If we are, then the string is accepted; otherwise, the string is rejected.
• A DFA is a quintuple, a machine with five parameters, M = (Q, ∑, δ, q0, F), where
• Q is a finite set of states,
• ∑ is a finite set called the alphabet,
• δ: Q × ∑ → Q is the transition function,
• q0 ∈ Q is the start state, and
• F ⊆ Q is the set of final (accepting) states.
44
Example

1) Fig. A DFA that can accept the strings which begin with a or b, or begin with c and contain at most one a.

2) Fig. A DFA accepting (a|b)*abb
45
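A table-driven Java sketch of the second DFA, added for illustration (the four-state layout for (a|b)*abb is the standard one and is assumed here):

// Table-driven DFA for (a|b)*abb: state 3 is the accepting state.
public class DfaAbb {
    // rows: states 0..3; columns: input a (index 0) and b (index 1)
    static final int[][] DELTA = {
        {1, 0},  // state 0: on a go to 1, on b stay at 0
        {1, 2},  // state 1: seen "a"
        {1, 3},  // state 2: seen "ab"
        {1, 0}   // state 3: seen "abb" (accepting)
    };

    static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            if (c != 'a' && c != 'b') return false;  // outside the alphabet
            state = DELTA[state][c == 'a' ? 0 : 1];
        }
        return state == 3;
    }

    public static void main(String[] args) {
        System.out.println(accepts("abb"));    // true
        System.out.println(accepts("aabb"));   // true
        System.out.println(accepts("ababb"));  // true
        System.out.println(accepts("abab"));   // false
    }
}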


Introduction to Java Compiler Compiler
(JavaCC)

46
JavaCC – A Lexical Analyzer and Parser Generator
• Java Compiler Compiler (JavaCC) is the most popular lexical analyzer and parser generator for use with Java applications.
• In addition to this, JavaCC provides other standard capabilities related to parser generation such as tree building (via a tool called JJTree included with JavaCC), actions, debugging, etc.
• JavaCC takes a set of regular expressions as input that describe tokens,
• creates a DFA that recognizes the input set of tokens, and
• then creates a Java class that implements that DFA.
47
Flow for Using JavaCC
• javacc is a "top-down" parser generator.
• Some parser generators (such as yacc, bison, and JavaCUP) need a separate lexical-analyzer generator.
• With JavaCC, you can specify the tokens within the parser generator.
48
Structure of a JavaCC file
• Input files to JavaCC have the extension ".jj".
• A sample example, simple.jj, is shown on slides 51 and 52.
• Several tokens can be defined in the same TOKEN block, with the rules separated by a "|". (See the example on slide 51.)
• By convention, token names are in all UPPERCASE.
• When JavaCC runs on the file simple.jj, JavaCC creates a class simpleTokenManager, which contains a method getNextToken().
• The getNextToken() method implements the DFA that recognizes the tokens described by the regular expressions.
• Every time getNextToken is called, it finds the rule whose regular expression matches the next sequence of characters in the input file, and returns the token for that rule.
49
Contd.
• The file Inputfile (slide 53) is used as the input file for the simple.jj code.
• When there is more than one rule that matches the input, JavaCC uses the following strategy:
1. Always match the longest possible string.
2. If two different rules match the same string, use the rule that appears first in the .jj file.
• For example, the "else" string matches both the ELSE and IDENTIFIER rules; getNextToken() will return the ELSE token.
• But "else21" matches the IDENTIFIER token only.
50
Simple.jj

PARSER_BEGIN(simple)
public class simple
{
}
PARSER_END(simple)

TOKEN_MGR_DECLS:
{
    public static int count=0;
}

SKIP:
{
    <" ">
|   <"\n">
|   <"\t">
|   <"\r">
|   <"\b">
|   <"//"(~["\n"])*"\n">
|   <"/*"> {count++;} : INNER_COMMENT
}

<INNER_COMMENT>
SKIP:
{
    <"/*"> {count++;} : INNER_COMMENT
}

<INNER_COMMENT>
SKIP:
{
    <~[]>
|   <"*/"> { count--; if (count==0) SwitchTo(DEFAULT); }
}

TOKEN:
{
    <ELSE:"else">
|   <FOR:"for">
|   <AND:"and">
|   <CLASS:"class">
|   <PUBLIC:"public">
|   <PROTECTED:"protected">
|   <PRIVATE:"private">
|   <DO:"do">
|   <IF:"if">
|   <WHILE:"while">
|   <INT:"int">
|   <FLOAT:"float">
|   <CHAR:"char">
|   <VOID:"void">
}
51
Contd.

TOKEN:
{
    <PLUS:"+">
|   <MINUS:"-">
|   <TIMES:"*">
|   <DIVIDE:"/">
|   <SEMICOLON:";">
|   <LEFT_PARENTHESIS:"(">
|   <RIGHT_PARENTHESIS:")">
|   <LEFT_SQUARE_BRACKET:"[">
|   <RIGHT_SQUARE_BRACKET:"]">
|   <LEFT_BRACE:"{">
|   <RIGHT_BRACE:"}">
|   <DOT:".">
|   <EQUAL_TO:"==">
|   <NOT_EQUAL_TO:"!=">
|   <LESS_THAN:"<">
|   <GREATER_THAN:">">
|   <LESS_THAN_OR_EQUAL_TO:"<=">
|   <GREATER_THAN_OR_EQUAL_TO:">=">
|   <ASSIGNMENT:"=">
|   <LOGICAL_AND:"&&">
|   <LOGICAL_OR:"||">
|   <NOT:"!">
|   <DOLLAR:"$">
}

TOKEN:
{
    <IDENTIFIERS:["a"-"z","A"-"Z"](["a"-"z","A"-"Z","0"-"9"])*>
|   <INTEGER_LITERAL:(["0"-"9"])+>
}

void expression():
{}
{
    term() ((<PLUS>|<MINUS>) term())*
}

void term():
{}
{
    factor() ((<TIMES>|<DIVIDE>) factor())*
}

void factor():
{}
{
    <INTEGER_LITERAL>|<IDENTIFIERS>
}
52
Inputfile
/* this is an input file for simple.jj file
Prepared By:
Hailu.G

********************************************* */
public class testLA
{
int y=7;
float w+=2;
char z=t*8;
int x,y=2,z+=3;
if(x>0)
{
sum/=x;
}
else if(x==0)
{
sum=x;
}
else {}
}
while(x>=n)
{
float sum=2*3+4-6;
}
}
53
Tokens in JavaCC
• When we run JavaCC on the input file simple.jj, JavaCC creates several files to implement a lexical analyzer:
1. simpleTokenManager.java: implements the token manager class.
2. simpleConstants.java: an interface that defines a set of constants.
3. Token.java: describes the tokens returned by getNextToken().
4. In addition to these, the following additional files are also created: SimpleCharStream.java, TokenMgrError.java, and ParseException.java.
54

Using the Generated TokenManager in Java Code
• To create a Java program that uses the lexical analyzer created by JavaCC, you need to instantiate a variable of type simpleTokenManager, where simple is the name of your .jj file.
• The constructor for simpleTokenManager requires an object of type SimpleCharStream.
• The constructor for SimpleCharStream requires a standard java.io.InputStream.
• Thus, you can use the generated lexical analyzer as follows:

Token t;
simpleTokenManager tm;
java.io.InputStream infile;
infile = new java.io.FileInputStream("Inputfile.txt");
tm = new simpleTokenManager(new SimpleCharStream(infile));
t = tm.getNextToken();
while (t.kind != simpleConstants.EOF)
{
    /* process t, e.g., System.out.println(t.image); */
    t = tm.getNextToken();
}
55
Project Work (15%)

Due:

Project Title: A Lexical Analyzer for simpleJava tokens

You need to write a JavaCC file "SJava.jj" that creates a lexical analyzer for simpleJava tokens.

Study chapter 2 and JavaCC completely to figure out how to design, implement and test your lexical analyzer.

Submit:
• Your SJava.jj program
• Your compilation results (outputs) with the javacc and javac commands
• Your testing results (outputs) with the javaTokenTest <filename> command
56
