
unit5

The document outlines the function and implementation of a scanner in programming languages, detailing its role in delivering tokens and skipping irrelevant characters. It explains the structure of tokens, the use of finite automata for token recognition, and the distinction between identifiers and keywords. Additionally, it discusses scanner generators like FLEX and the process of building a scanner from regular expressions.

Uploaded by

tahailuan9

Unit 5

Scanner

Task of a scanner
• Deliver tokens
• Skip meaningless characters
  • Blanks
  • Tab characters
  • End-of-line characters (CR, LF)
  • Comments
• Lexeme: a sequence of one or more characters that has a meaning (a word of the language)
• Token: a terminal symbol of the language's grammar
Tokens have a syntactic structure

Why is scanning not a part of parsing?

• The parser would become complicated
  • It would have to distinguish between identifiers and keywords
  • It would need rules to skip unnecessary strings such as spaces and comments

Token classes of KPL

• Unsigned integer
• Identifier
• Keywords: begin, end, if, then, while, do, call, const, var, procedure, program, type, function, of, integer, char, else, for, to, array
• Character constant
• Operators
  • Arithmetic: + - * /
  • Relational: = != < > <= >=
  • Assignment: :=
• Separators: ( ) . : ; (. .)

Finite Automata

• A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly.
• Finite automata are a part of everyday life.
• Have you ever seen a vending machine, or any equipment controlled by an automaton, such as a washing machine, a traffic light, or an elevator?
• In a compiler, finite automata can be used to recognize whether a string adheres to the syntax of a language. A finite automaton can also be used to build a syntax tree in bottom-up parsing.
Representations of finite automata

• Informal method: state diagram
  • Intuitive and easy to understand
  • Convenient for humans writing code by hand
• Formal method: mathematical model
  • Machine-readable form
  • Used in code generation
State diagrams of finite automata

• State diagrams are directed graphs whose nodes are states and whose arcs are labeled by one or more symbols from some alphabet Σ.
• One state is initial (denoted by a short incoming arrow).
• Several states are final/accepting (denoted by a double circle).
• DFA: for every symbol a ∈ Σ, exactly one arc labeled a emanates from every state.
Formal Definition of a Deterministic Finite Automaton (DFA)

A DFA is represented as the five-tuple M = (Q, Σ, δ, q0, F), where

1. Q is a finite set of states,
2. Σ is the alphabet of input symbols,
3. q0 ∈ Q is the start/initial state,
4. F ⊆ Q is the set of final states,
5. δ: Q × Σ → Q is the transition function.

The transition function
• takes a state and an input symbol as arguments,
• returns a state.
• One “rule” is written δ(q, a) = p, where q and p are states and a is an input symbol.
• Intuitively: if the DFA is in state q and input a is received, then the DFA goes to state p (note: q = p is allowed).
Simple Example – One way door

• Consider a one-way automatic door.
• This door has two pads that can sense when someone is standing on them: a front pad and a rear pad.
• We want people to walk through the front and toward the rear, but not allow someone to walk in the other direction.

(Figure: the door with its front pad and rear pad)

One Way Door

• Let’s assign codes to the different input cases:
  a: Nobody on either pad
  b: Person on front pad
  c: Person on rear pad
  d: Person on front and rear pad
• We can design an automaton so that the door doesn’t open while someone is still on the rear pad, so that it cannot hit them.
• State diagram (C = closed, O = open; the start state is C):
  C → C on a, c, d
  C → O on b
  O → O on b, c, d
  O → C on a
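The door automaton can be simulated in a few lines of C. This is a minimal sketch that assumes the codes a to d are assigned in the order the input cases are listed (a = nobody, b = front only, c = rear only, d = both pads). The names DoorState, door_step and door_run are made up for illustration.

```c
typedef enum { DOOR_CLOSED, DOOR_OPEN } DoorState;

/* One transition of the door automaton.
   Assumed input codes: 'a' nobody, 'b' front pad only,
   'c' rear pad only, 'd' both pads. */
DoorState door_step(DoorState s, char input) {
    if (s == DOOR_CLOSED)
        return input == 'b' ? DOOR_OPEN : DOOR_CLOSED;  /* open only for front pad alone */
    return input == 'a' ? DOOR_CLOSED : DOOR_OPEN;      /* close only when both pads are free */
}

/* Feeds a whole sequence of input codes, starting from the closed state. */
DoorState door_run(const char *inputs) {
    DoorState s = DOOR_CLOSED;
    for (; *inputs != '\0'; inputs++)
        s = door_step(s, *inputs);
    return s;
}
```

For example, door_run("bca") models a person stepping on the front pad, moving to the rear pad, and leaving; the door ends up closed again.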
DFA Example D1

• An automaton with a set of final states recognizes a language.
• D1 recognizes the set of all strings over {a, b} that contain three consecutive a’s.
• Formal definition
  M = (Q, Σ, δ, q0, F)
  Q = {q0, q1, q2, q3}, Σ = {a, b}, F = {q3}

  δ  | a  | b
  q0 | q1 | q0
  q1 | q2 | q0
  q2 | q3 | q0
  q3 | q3 | q3
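The transition table of D1 translates directly into a table-driven recognizer. The sketch below is one possible encoding; the function name d1_accepts and the choice to reject any character outside {a, b} are assumptions, not part of the slides.

```c
#include <stdbool.h>

/* States q0..q3 of D1; q3 is the only accepting state. */
enum { Q0, Q1, Q2, Q3, NUM_STATES };

/* Transition table from the slide: column 0 is input 'a', column 1 is input 'b'. */
static const int delta[NUM_STATES][2] = {
    /* q0 */ { Q1, Q0 },
    /* q1 */ { Q2, Q0 },
    /* q2 */ { Q3, Q0 },
    /* q3 */ { Q3, Q3 },
};

/* Returns true if s, a string over {a, b}, contains three consecutive a's.
   Characters outside the alphabet cause rejection (an assumed policy). */
bool d1_accepts(const char *s) {
    int state = Q0;
    for (; *s != '\0'; s++) {
        if (*s != 'a' && *s != 'b')
            return false;
        state = delta[state][*s == 'a' ? 0 : 1];
    }
    return state == Q3;
}
```

The loop never looks back at earlier input: the current state alone records how many consecutive a's have just been seen.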
Input and output of a lexical analyzer (scanner)

• Input: source program

Program Example1; (* Example 1*)
Begin
End. (* Example1*)

• Output: list of tokens

1-1:KW_PROGRAM
1-9:TK_IDENT(Example1)
1-17:SB_SEMICOLON
2-1:KW_BEGIN
3-1:KW_END
3-4:SB_PERIOD

• Note: TK_IDENT is a token, while example1, a1, and writeI are lexemes.

Recognizing KPL’s tokens

• All of KPL’s tokens make up a regular language.
• They can be described with a regular grammar or with regular expressions.
• They can be recognized by a Deterministic Finite Automaton (DFA).
• The scanner is one big DFA.

The scanner as a Deterministic Finite Automaton

• After every recognized token, the scanner starts in state 0 again.
• If an illegal character is met, the scanner changes to state 30 or 38, which tell the scanner to stop scanning and return an error message.

(Figure: state diagram of the KPL scanner. Notice the yellow states.)
Scanner Implementation

• Character classification

• Data structure for tokens

• Token recognition

Lexical rules of KPL

• Only unsigned integers are used. Range: 0 to 2^31 - 1.
• A KPL identifier is a combination of lowercase or uppercase letters and digits. An identifier must start with a letter. Its length is at most 15.
• Only character constants are allowed. A character constant is enclosed in a pair of single quote marks, for example '''.
• The language does not use string constants.
• - is used for subtraction only. The language does not allow unary minus or negative numbers.
• The relational operator “not equal to” is represented by !=
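Two of these rules, the integer range and the identifier shape, are easy to check in code. The following is a minimal sketch, not the course's implementation; the helper names scan_unsigned and is_valid_ident are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>

#define MAX_IDENT_LEN 15   /* identifier length limit from the lexical rules */

/* Parses a run of decimal digits at the start of s into *value.
   Returns the number of characters consumed, or 0 if s does not start
   with a digit or the number would exceed 2^31 - 1. */
int scan_unsigned(const char *s, long *value) {
    long long v = 0;
    int n = 0;
    while (isdigit((unsigned char)s[n])) {
        v = v * 10 + (s[n] - '0');
        if (v > 2147483647LL)
            return 0;                 /* outside the KPL integer range */
        n++;
    }
    if (n == 0)
        return 0;
    *value = (long)v;
    return n;
}

/* Returns true if s is a letter followed only by letters and digits,
   with a total length of at most MAX_IDENT_LEN. */
bool is_valid_ident(const char *s) {
    if (!isalpha((unsigned char)s[0]))
        return false;
    int n = 1;
    while (isalnum((unsigned char)s[n]))
        n++;
    return s[n] == '\0' && n <= MAX_IDENT_LEN;
}
```

The range check runs after every digit, so the accumulator never grows far beyond the limit even on very long digit strings.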

Classification of characters based on their ASCII code
typedef enum {
  CHAR_SPACE,         // spaces (space, tab, backspace, ...)
  CHAR_LETTER,        // letters
  CHAR_DIGIT,         // digits
  CHAR_PLUS,          // '+'
  CHAR_MINUS,         // '-'
  CHAR_TIMES,         // '*'
  CHAR_SLASH,         // '/'
  CHAR_LT,            // '<'
  CHAR_GT,            // '>'
  CHAR_EXCLAIMATION,  // '!'
  CHAR_EQ,            // '='
  CHAR_COMMA,         // ','
  CHAR_PERIOD,        // '.'
  CHAR_COLON,         // ':'
  CHAR_SEMICOLON,     // ';'
  CHAR_SINGLEQUOTE,   // '\''
  CHAR_LPAR,          // '('
  CHAR_RPAR,          // ')'
  CHAR_UNKNOWN        // invalid characters
} CharCode;
CharCode charCodes[256] = { …… };
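The slide elides the contents of charCodes. One way to obtain such a table, shown here only as a sketch, is to fill it at start-up instead of writing out 256 initializers. The function initCharCodes is a hypothetical helper, and treating space, tab, CR and LF as CHAR_SPACE is an assumption.

```c
#include <ctype.h>

typedef enum {
    CHAR_SPACE, CHAR_LETTER, CHAR_DIGIT, CHAR_PLUS, CHAR_MINUS, CHAR_TIMES,
    CHAR_SLASH, CHAR_LT, CHAR_GT, CHAR_EXCLAIMATION, CHAR_EQ, CHAR_COMMA,
    CHAR_PERIOD, CHAR_COLON, CHAR_SEMICOLON, CHAR_SINGLEQUOTE, CHAR_LPAR,
    CHAR_RPAR, CHAR_UNKNOWN
} CharCode;

static CharCode charCodes[256];

/* Fills the classification table once (hypothetical helper, not from the slides). */
void initCharCodes(void) {
    for (int c = 0; c < 256; c++) {
        if (c < 128 && isalpha(c))
            charCodes[c] = CHAR_LETTER;
        else if (c < 128 && isdigit(c))
            charCodes[c] = CHAR_DIGIT;
        else if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
            charCodes[c] = CHAR_SPACE;      /* assumed whitespace set */
        else
            charCodes[c] = CHAR_UNKNOWN;
    }
    /* Single-character classes override the defaults set above. */
    charCodes['+'] = CHAR_PLUS;       charCodes['-'] = CHAR_MINUS;
    charCodes['*'] = CHAR_TIMES;      charCodes['/'] = CHAR_SLASH;
    charCodes['<'] = CHAR_LT;         charCodes['>'] = CHAR_GT;
    charCodes['!'] = CHAR_EXCLAIMATION;
    charCodes['='] = CHAR_EQ;         charCodes[','] = CHAR_COMMA;
    charCodes['.'] = CHAR_PERIOD;     charCodes[':'] = CHAR_COLON;
    charCodes[';'] = CHAR_SEMICOLON;  charCodes['\''] = CHAR_SINGLEQUOTE;
    charCodes['('] = CHAR_LPAR;       charCodes[')'] = CHAR_RPAR;
}
```

With the table in place, classifying a character is a single array lookup, charCodes[(unsigned char)ch], which keeps the DFA's inner loop cheap.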

Data structure for list of tokens

enum {
  TK_NONE, TK_IDENT, TK_NUMBER, TK_CHAR, TK_EOF,

  KW_PROGRAM, KW_CONST, KW_TYPE, KW_VAR,
  KW_INTEGER, KW_CHAR, KW_ARRAY, KW_OF,
  KW_FUNCTION, KW_PROCEDURE,
  KW_BEGIN, KW_END, KW_CALL,
  KW_IF, KW_THEN, KW_ELSE,
  KW_WHILE, KW_DO, KW_FOR, KW_TO,

  SB_SEMICOLON, SB_COLON, SB_PERIOD, SB_COMMA,
  SB_ASSIGN, SB_EQ, SB_NEQ, SB_LT, SB_LE, SB_GT, SB_GE,
  SB_PLUS, SB_MINUS, SB_TIMES, SB_SLASH,
  SB_LPAR, SB_RPAR, SB_LSEL, SB_RSEL
};
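The scanner output shown earlier (for example 1-9:TK_IDENT(Example1)) suggests that each token carries its type, its position, and, for identifiers, its lexeme. A possible record for this is sketched below. The struct layout and the helpers makeToken and formatToken are assumptions, and only a small subset of the token types is included to keep the example short.

```c
#include <stdio.h>
#include <string.h>

#define MAX_IDENT_LEN 15

/* Subset of the token types, enough for the demonstration. */
typedef enum { TK_NONE, TK_IDENT, TK_NUMBER, KW_PROGRAM, SB_SEMICOLON } TokenType;

static const char *const tokenNames[] = {
    "TK_NONE", "TK_IDENT", "TK_NUMBER", "KW_PROGRAM", "SB_SEMICOLON"
};

typedef struct {
    TokenType type;
    char string[MAX_IDENT_LEN + 1];  /* lexeme, used by identifiers and numbers */
    int lineNo;                      /* position of the first character */
    int colNo;
} Token;

/* Builds a token; the lexeme is truncated to MAX_IDENT_LEN characters. */
Token makeToken(TokenType type, int lineNo, int colNo, const char *lexeme) {
    Token t;
    t.type = type;
    t.lineNo = lineNo;
    t.colNo = colNo;
    strncpy(t.string, lexeme, MAX_IDENT_LEN);
    t.string[MAX_IDENT_LEN] = '\0';
    return t;
}

/* Writes "line-col:NAME" or "line-col:NAME(lexeme)" into buf. */
int formatToken(const Token *t, char *buf, size_t size) {
    if (t->type == TK_IDENT || t->type == TK_NUMBER)
        return snprintf(buf, size, "%d-%d:%s(%s)",
                        t->lineNo, t->colNo, tokenNames[t->type], t->string);
    return snprintf(buf, size, "%d-%d:%s",
                    t->lineNo, t->colNo, tokenNames[t->type]);
}
```

Storing the position in the token lets later phases report errors at the right line and column without rescanning the source.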

Scanner implementation based on DFA

state = 0;
currentChar = readChar();
token = getToken();
while (token != TK_EOF)   // stop at the end-of-file token
{
  state = 0;              // every token starts from state 0
  token = getToken();
}
Token recognizer
switch (state)
{
case 0:
  switch (currentChar)
  {
  case space:
    state = 2; break;
  case lpar:
    state = 38; break;
  case letter:
    state = 3; break;
  case digit:
    state = 7; break;
  case plus:
    state = 9; break;
  case lt:
    state = 13; break;
  ……
  }
Token recognizer (cont’d)

case 9:
  readChar();
  return SB_PLUS;

case 13:
  readChar();
  if (currentChar == EQ) state = 14; else state = 15;
  return getToken();   // continue from the new state
case 14:
  readChar();
  return SB_LE;
case 15:
  return SB_LT;
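States 9 and 13 to 15 can also be written without an explicit state variable. The sketch below scans from a string instead of a file; the pointer-based interface, the type OpToken and the function scan_operator are assumptions made to keep the example self-contained.

```c
typedef enum { OP_NONE, OP_PLUS, OP_LT, OP_LE } OpToken;

/* Recognizes '+', '<' or '<=' at *p and advances *p past the lexeme.
   Returns OP_NONE and leaves *p unchanged if none of them matches. */
OpToken scan_operator(const char **p) {
    const char *s = *p;
    OpToken result = OP_NONE;
    if (*s == '+') {                 /* state 9 */
        s++;
        result = OP_PLUS;
    } else if (*s == '<') {          /* state 13: one character of lookahead */
        s++;
        if (*s == '=') {             /* state 14 */
            s++;
            result = OP_LE;
        } else {                     /* state 15: the lookahead is not consumed */
            result = OP_LT;
        }
    }
    *p = s;
    return result;
}
```

Note the asymmetry between states 14 and 15: '<=' consumes both characters, while '<' leaves the following character for the next token.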
Token recognizer (cont’d)
case 2:
  while (currentChar == space) // skip blanks
    readChar();
  state = 0;                   // restart on the first non-blank character
  return getToken();
case 35:
  readChar();
  if (currentChar == EOF) state = 41;
  else
    switch (currentChar)
    {
    case period:
      state = 36; break; // token lsel
    case times:
      state = 37; break; // skip comment
    default:
      state = 41;        // token lpar
    }
  return getToken();
}
Skip comments
case 37: // skip comment
  readChar();
  while (currentChar != times)
  {
    state = 37;
    readChar();
  }
  state = 38;
  // fall through to state 38
case 38:
  readChar();
  while (currentChar == times)
  {
    state = 38;
    currentChar = readChar();
  }
  // a comment ends with "*)", so the closing character is rpar
  if (currentChar == rpar) state = 39; else state = 40;
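The two comment states can be tried out in isolation on a string. This is a minimal sketch, assuming comments have the form (* ... *) as in the example program; skip_comment is a hypothetical helper that also reports an unterminated comment, a case the pseudocode above does not handle.

```c
#include <stddef.h>

/* s points just after the opening "(*".
   Returns a pointer just past the closing "*)", or NULL if the
   input ends before the comment is closed. */
const char *skip_comment(const char *s) {
    /* IN_BODY plays the role of state 37, SEEN_STAR that of state 38. */
    enum { IN_BODY, SEEN_STAR } state = IN_BODY;
    for (; *s != '\0'; s++) {
        if (state == IN_BODY) {
            if (*s == '*')
                state = SEEN_STAR;
        } else {
            if (*s == ')')
                return s + 1;        /* found "*)": comment ends here */
            if (*s != '*')
                state = IN_BODY;     /* a lone '*' inside the comment */
        }
    }
    return NULL;                     /* end of input inside a comment */
}
```

Staying in SEEN_STAR on repeated stars is what makes a comment ending in "**)" terminate correctly.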
Distinction between identifiers and keywords

• Variable ch is assigned the first character of the lexeme.
• Read all digits and letters into string t.
• Use binary search to check whether the keyword table has an entry for that string.
  • If found, t.kind = order of the keyword
  • Otherwise, t.kind = ident
• At the end, variable ch contains the first character of the next lexeme.
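The binary search over the keyword table can be sketched as follows. The table holds the twenty KPL keywords in sorted order. Lower-casing the lexeme before the lookup is an assumption, based on the earlier example where Program and Begin are reported as keywords. The function name check_keyword and its return convention (index in the table, or -1 for an identifier) are illustrative only.

```c
#include <ctype.h>
#include <string.h>

/* KPL keywords in ascending order, as binary search requires. */
static const char *const keywords[] = {
    "array", "begin", "call", "char", "const", "do", "else", "end",
    "for", "function", "if", "integer", "of", "procedure", "program",
    "then", "to", "type", "var", "while"
};
#define NUM_KEYWORDS ((int)(sizeof keywords / sizeof keywords[0]))

/* Returns the index of lexeme in the keyword table, or -1 if it is
   not a keyword (and is therefore an identifier). */
int check_keyword(const char *lexeme) {
    char lower[16];
    size_t n = strlen(lexeme);
    if (n >= sizeof lower)
        return -1;                       /* longer than any keyword */
    for (size_t i = 0; i <= n; i++)      /* copies the terminating '\0' too */
        lower[i] = (char)tolower((unsigned char)lexeme[i]);

    int lo = 0, hi = NUM_KEYWORDS - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(lower, keywords[mid]);
        if (cmp == 0)
            return mid;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}
```

With twenty keywords the search needs at most five string comparisons per lexeme.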

Distinction between identifiers and keywords (cont’d)

case 4:
  if (checkKeyword(token) == TK_NONE) state = 5;
  else state = 6;
  return getToken();            // continue from the new state
case 5:
  install_ident();              // save to symbol table
  return TK_IDENT;
case 6:
  return checkKeyword(token);
…………
Initialize a symbol table

• The following information about identifiers is saved:
  • Name: string
  • Attribute: type name, variable name, constant name, ...
  • Data type
  • Scope
  • Address and size of the memory where the lexeme is located
  • ...
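The fields listed above map naturally onto a record type. The sketch below stores entries in a fixed-size array with a linear lookup; all names (Symbol, install_symbol, lookup_symbol), the enum values and the array-based storage are assumptions for illustration, not the course's symbol table.

```c
#include <string.h>

#define MAX_IDENT_LEN 15
#define MAX_SYMBOLS   256

typedef enum { OBJ_CONSTANT, OBJ_VARIABLE, OBJ_TYPE,
               OBJ_FUNCTION, OBJ_PROCEDURE } ObjectKind;   /* attribute */
typedef enum { TP_INT, TP_CHAR, TP_ARRAY } DataType;

typedef struct {
    char name[MAX_IDENT_LEN + 1];
    ObjectKind kind;      /* type name, variable name, constant name, ... */
    DataType type;
    int scopeLevel;       /* nesting depth of the enclosing block */
    int address;          /* offset of the object's storage */
    int size;             /* size of that storage */
} Symbol;

static Symbol symtab[MAX_SYMBOLS];
static int symCount = 0;

/* Returns the index of name in the given scope, or -1 if absent. */
int lookup_symbol(const char *name, int scopeLevel) {
    for (int i = symCount - 1; i >= 0; i--)
        if (symtab[i].scopeLevel == scopeLevel &&
            strcmp(symtab[i].name, name) == 0)
            return i;
    return -1;
}

/* Adds an entry and returns its index; returns -1 if the table is
   full or the name is already declared in the same scope. */
int install_symbol(const char *name, ObjectKind kind, DataType type,
                   int scopeLevel, int address, int size) {
    if (symCount == MAX_SYMBOLS || lookup_symbol(name, scopeLevel) != -1)
        return -1;
    Symbol *s = &symtab[symCount];
    strncpy(s->name, name, MAX_IDENT_LEN);
    s->name[MAX_IDENT_LEN] = '\0';
    s->kind = kind;
    s->type = type;
    s->scopeLevel = scopeLevel;
    s->address = address;
    s->size = size;
    return symCount++;
}
```

A real compiler would more likely use a hash table or one table per scope; the linear array is chosen here only to keep the sketch short.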

Scanner Generators

• Compiler compiler (generator)


• Regular expressions
• Model of a scanner generator

• A popular scanner generator: FLEX

Model of a compiler

Source program (stream of characters)
→ Scanner → tokens
→ Parser → syntax structure
→ Semantic analyzer
→ Intermediate code generator → intermediate code
→ Code optimizer
→ Target code generator → assembly code

Symbol table: used alongside these phases.

Scanner and parser generator

• FLEX (LEX)
– Generates C code for the scanner
– Lexical rules are expressed by a set of regular expressions

• BISON (YACC)
– Generates C code for a bottom-up LR parser (LALR(1) by default)
– The grammar is expressed in BNF

Input and output of a scanner generator

• Input: a set of regular expressions built with the regular operators union, concatenation, and star (plus)

NUMBER [0-9]+
DELIMITER [ \n\t\r]
CHAR \'[[:print:]]\'
IDENT [a-zA-Z][a-zA-Z0-9]*
COMMENT \(\*([^*]|(\*+[^*)]))*\*+\)
ERROR [^+\-*/,;.:()=a-zA-Z0-9<>]

• Output: a scanner written in C.
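Patterns such as NUMBER and IDENT can be tried out without a scanner generator by using the POSIX regular expression functions from <regex.h>. The sketch below is only an experiment aid and assumes a POSIX system; full_match is a hypothetical helper, and the ^ and $ anchors are added by the caller to force a whole-string match, which flex itself does not require.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if text matches the POSIX extended regular expression
   pattern. Returns false if the pattern fails to compile. */
bool full_match(const char *pattern, const char *text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    bool ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

For example, full_match("^[a-zA-Z][a-zA-Z0-9]*$", "Example1") checks the IDENT pattern against a candidate lexeme.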

Model of a scanner generator

• Input: a regular expression R
• Output: the scanner program P

  R → Scanner generator → P

• Given a string S, the scanner P
  • accepts a token if S ∈ L(R)
  • returns an error message if S ∉ L(R)
Regular expression

• Have you ever seen a regular expression?

(Figure: regular expressions in JavaScript)
Regular expressions

• Similar to arithmetic expressions, we can use the regular operations to build up expressions describing languages, which are called regular expressions.
• An example is (0 ∪ 1)0*.
• Applications
  • Patterns for searching
  • Description of tokens for scanner generators
  • Patterns in programming languages such as Python and in UNIX tools such as awk and grep
Formal definition of a regular expression

Say that R is a regular expression if R is

1. a, for some a in the alphabet Σ,
2. ε,
3. ∅.

Assume r1 and r2 are regular expressions denoting the languages R1 and R2. Then
4. (r1 + r2) is the regular expression denoting R1 ∪ R2,
5. (r1 r2) is the regular expression denoting R1 ◦ R2,
6. (r1*) is the regular expression denoting R1*.
How is a scanner program built?

• The scanner program is built from a deterministic finite automaton.
• We need a process to convert a set of regular expressions into a deterministic finite automaton:

Regular expression → ε-NFA → NFA → DFA → (minimization) → minimum DFA
FLEX: The fast lexical analyzer generator

• A tool for generating scanners.


• Reads the given input files for a description of a scanner to
generate.
• The description is in the form of pairs of regular expressions
and C code, called rules.
• flex generates as output a C source file, lex.yy.c by default,
which defines a routine yylex().
• This file can be compiled and linked with the flex runtime
library to produce an executable.
• When the executable is run, it analyzes its input for
occurrences of the regular expressions. Whenever it finds
one, it executes the corresponding C code.
Steps for using flex

1. Create a description (rules) file for flex to operate on.

2. Run flex on the input file. flex produces a C file called lex.yy.c with the scanning function yylex().

3. Run the C compiler on the C file to produce a lexical analyzer.

Flex Files and Procedure

Rule file (*.l) → Flex compiler → lex.yy.c (scanner in C code)
lex.yy.c → C compiler → scanner.exe
Test file → scanner.exe → tokens


Flex Programs

The flex input file consists of three sections, separated by lines containing just %%

%{
auxiliary declarations
%}
regular definitions
%%
translation rules
%%
auxiliary procedures
Auxiliary declarations and regular definitions

Translation rules

Auxiliary procedures
