
Lexical analysis. Scanner example.

October 21, 2023

1 Lexical analysis
1.1 Regular definitions

ident → [a−zA−Z][a−zA−Z0−9]∗
int−const → 0
int−const → [1−9][0−9]∗
real−const → int−const ′.′ int−const
real−const → int−const ′.′
real−const → ′.′ int−const
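Before building the automata, these definitions can be sanity-checked on sample lexemes. The sketch below is illustrative only and is not part of the scanner sources in Section 1.3; it transcribes the definitions into std::regex patterns, and merging the three real−const alternatives into a single alternation is our own choice here.

// regex_check.cpp -- illustrative only, not used by the scanner below
#include <iostream>
#include <regex>
#include <string>

int main() {
    // direct transcriptions of the regular definitions above
    std::regex ident("[a-zA-Z][a-zA-Z0-9]*");
    std::regex intConst("0|[1-9][0-9]*");
    // the three real-const rules merged into one alternation
    std::regex realConst("(0|[1-9][0-9]*)\\.(0|[1-9][0-9]*)?|\\.(0|[1-9][0-9]*)");

    for (std::string s : {"x1", "0", "42", "3.14", "5.", ".5", "007"}) {
        std::cout << s << ": "
                  << (std::regex_match(s, ident) ? "ident " : "")
                  << (std::regex_match(s, intConst) ? "int-const " : "")
                  << (std::regex_match(s, realConst) ? "real-const " : "")
                  << "\n";
    }
}

For example, "5." and ".5" match only real−const, while "007" matches none of the categories.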

1.2 Finite automata used for lexical analysis


The finite automaton for the identifier lexical category is presented in Figure 1: from the start state 0, a letter (a−z, A−Z) leads to the final state 1, which then loops on letters and digits (a−z, A−Z, 0−9).

Figure 1: Finite automaton for the identifier lexical category
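This two-state automaton can also be simulated directly in a few lines of C++. The function below is an illustrative sketch, separate from the scanner sources in Section 1.3.

#include <cctype>
#include <string>

// returns true if the whole string is accepted by the automaton of Figure 1
bool isIdentifier(const std::string& s) {
    int state = 0;                        // start state
    for (unsigned char ch : s) {
        if (state == 0 && std::isalpha(ch))
            state = 1;                    // first character: a letter
        else if (state == 1 && std::isalnum(ch))
            state = 1;                    // following characters: letters or digits
        else
            return false;                 // no transition -> reject
    }
    return state == 1;                    // accept only in the final state
}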

The finite automaton for the integer−constant lexical category is presented in Figure 2: from the start state, the digit 0 leads to one final state, while a digit 1−9 leads to another final state that loops on 0−9.

Figure 2: Finite automaton for the integer−constant lexical category (denoted by MCI)

The non-deterministic finite automaton for the real−constant lexical category is presented in Figure 3: ϵ-transitions from the start state lead to three branches, one for each real−constant rule, each built from copies of MCI and a transition on ′.′.

Figure 3: Non-deterministic finite automaton for the real−constant lexical category

The deterministic finite automaton for both the real−constant and integer−constant lexical categories is presented in Figure 4; its final states are labelled CI (integer constant) and CR (real constant).

Figure 4: Deterministic finite automaton for the real−constant and integer−constant lexical categories

Figure 5 presents a deterministic finite automaton for three lexical categories: ident, real−constant and integer−constant. It extends the automaton of Figure 4 with an identifier branch whose final state is labelled ID.

Figure 5: Deterministic finite automaton for the ident, real−constant and integer−constant lexical categories
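A deterministic automaton of this kind is straightforward to simulate with a switch over the current state. The sketch below is illustrative only: its state numbering is our own and need not match the figures, and, like the scanner code in Section 1.3, it accepts any digit sequence as the fractional part of a real constant.

#include <cctype>
#include <string>

// lexical categories recognized by the combined automaton
enum class Cat { None, Ident, IntConst, RealConst };

// classify a whole lexeme with a hand-built DFA (illustrative state numbering):
//   0: start      1: "0" seen (int)     2: [1-9][0-9]* seen (int)
//   3: int '.'    4: digits after '.'   5: '.' seen, no digits yet
//   6: identifier
Cat classify(const std::string& s) {
    int state = 0;
    for (unsigned char ch : s) {
        switch (state) {
        case 0:
            if (ch == '0') state = 1;
            else if (std::isdigit(ch)) state = 2;
            else if (ch == '.') state = 5;
            else if (std::isalpha(ch)) state = 6;
            else return Cat::None;
            break;
        case 1:
            if (ch == '.') state = 3; else return Cat::None;
            break;
        case 2:
            if (std::isdigit(ch)) state = 2;
            else if (ch == '.') state = 3;
            else return Cat::None;
            break;
        case 3: case 4: case 5:
            if (std::isdigit(ch)) state = 4; else return Cat::None;
            break;
        case 6:
            if (std::isalnum(ch)) state = 6; else return Cat::None;
            break;
        }
    }
    if (state == 1 || state == 2) return Cat::IntConst;
    if (state == 3 || state == 4) return Cat::RealConst;  // "5." and "5.25" are reals
    if (state == 6) return Cat::Ident;
    return Cat::None;                                     // e.g. "." alone, "007"
}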

Finally, Figure 6 presents a deterministic finite automaton for all lexical categories; it extends the previous automaton with final states for the separators and operators of the language.

Figure 6: Deterministic finite automaton for all lexical categories

The names of some final states in this DFA are:

• State 7: COMMA,
• State 8: SEMICOLON,
• State 9: LPAREN,
• State 10: RPAREN,
• State 11: LBRACE,
• State 12: RBRACE,
• State 13: PLUS,
• State 14: TIMES,
• State 15: ASSIGN,
• State 16: LT,
• State 17: GT,
• State 19: NEQ,
• State 20: GTE,
• State 21: LTE,
• State 22: EQ.

1.3 Source files

exceptions.h

#pragma once

#ifndef EXP
#define EXP

#include <string>

class Exception {
    std::string mess;
public:
    Exception(std::string str) : mess(str) {}
    void print();
};

#endif

exceptions.cpp

#include "exceptions.h"
#include <iostream>

void Exception::print() {
    std::cout << mess;
}

toktype.h

#pragma once

#ifndef TKTYPE
#define TKTYPE

#include <string>
#include <map>

// enum to store the type of a token
enum class TokenType {
    INTCONST, REALCONST, IDENT, INT, DOUBLE, RETURN, IF, ELSE,
    PLUS, TIMES, ASSIGN, LPAREN, RPAREN, LBRACE, RBRACE, COMMA,
    SEMICOLON, EQ, NEQ, LT, LTE, GT, GTE, END
};

// map used to associate the token type of a keyword
typedef std::map<std::string, TokenType> KeywordMap;

// map used to write the type of a token
typedef std::map<TokenType, std::string> TokenTypeMap;

#endif

token.h

#pragma once

#ifndef TOK
#define TOK

#include <fstream>

#include "toktype.h"

class Token {
    TokenType type;
    std::string lexeme;
    union {
        int intConst;
        double realConst;
        std::string ident;
        short oth;
    };
public:
    Token(TokenType t, std::string word, int n) :
        type(t), lexeme(word), intConst(n) {}
    Token(TokenType t, std::string word, double x) :
        type(t), lexeme(word), realConst(x) {}
    Token(TokenType t, std::string word, std::string id) :
        type(t), lexeme(word), ident(id) {}
    Token(TokenType t, std::string word) :
        type(t), lexeme(word), oth(1) {}
    // the union contains a std::string, so the copy operations and the
    // destructor cannot be implicitly defaulted; they are user-defined
    Token(const Token& tok);
    ~Token() {
        // the active string member must be destroyed explicitly
        if (type == TokenType::IDENT)
            ident.~basic_string();
    }
    Token& operator=(const Token& tok);
    TokenType getType() const { return type; }
    int getInt() const { return intConst; }
    double getReal() const { return realConst; }
    std::string getIdent() const { return ident; }
    void print(std::ofstream& out) const;
};

#endif
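Because the union contains a std::string, the active member has to be constructed with placement new and destroyed with an explicit destructor call, which is exactly what token.cpp below does. The following stand-alone sketch (the names here are ours, not from the sources) shows the same pattern in isolation.

#include <iostream>
#include <new>        // placement new
#include <string>

// minimal tagged union illustrating the technique used by Token
struct Cell {
    bool holdsText;
    union {
        int number;
        std::string text;             // non-trivial member
    };
    Cell(int n) : holdsText(false), number(n) {}
    Cell(const std::string& s) : holdsText(true), text(s) {}
    ~Cell() {
        if (holdsText)
            text.~basic_string();     // explicit destruction of the active member
    }
    void setText(const std::string& s) {
        if (!holdsText) {
            new (&text) std::string(s);   // placement new activates the string
            holdsText = true;
        } else {
            text = s;
        }
    }
};

int main() {
    Cell c(42);
    c.setText("hello");
    std::cout << c.text << "\n";
}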

token.cpp

#include "token.h"

#include <iostream>
#include <string>
#include <map>

// map used to write the type of a token
TokenTypeMap tokenTypeMap {
    { TokenType::INTCONST, "INTCONST" }, { TokenType::REALCONST, "REALCONST" },
    { TokenType::IDENT, "IDENT" }, { TokenType::INT, "INT" },
    { TokenType::DOUBLE, "DOUBLE" }, { TokenType::RETURN, "RETURN" },
    { TokenType::IF, "IF" }, { TokenType::ELSE, "ELSE" },
    { TokenType::PLUS, "PLUS" }, { TokenType::TIMES, "TIMES" },
    { TokenType::ASSIGN, "ASSIGN" }, { TokenType::LPAREN, "LPAREN" },
    { TokenType::RPAREN, "RPAREN" }, { TokenType::LBRACE, "LBRACE" },
    { TokenType::RBRACE, "RBRACE" }, { TokenType::COMMA, "COMMA" },
    { TokenType::SEMICOLON, "SEMICOLON" }, { TokenType::EQ, "EQ" },
    { TokenType::NEQ, "NEQ" }, { TokenType::LT, "LT" },
    { TokenType::LTE, "LTE" }, { TokenType::GT, "GT" },
    { TokenType::GTE, "GTE" }, { TokenType::END, "END" }
};

Token::Token(const Token& tok) {
    lexeme = tok.lexeme;
    // copy operations
    switch (tok.type) {
    case TokenType::INTCONST:
        intConst = tok.intConst;
        break;
    case TokenType::REALCONST:
        realConst = tok.realConst;
        break;
    case TokenType::IDENT:
        new (&ident) std::string(tok.ident);   // activate the string member
        break;
    default:
        oth = tok.oth;
        break;
    }
    type = tok.type;
}

Token& Token::operator=(const Token& tok) {
    lexeme = tok.lexeme;
    if (type == TokenType::IDENT && tok.type == TokenType::IDENT) {
        ident = tok.ident;
        return *this;
    }
    // the current token holds a string that is no longer needed
    if (type == TokenType::IDENT) {
        ident.~basic_string();   // destroy explicitly
    }
    // copy operations
    switch (tok.type) {
    case TokenType::INTCONST:
        intConst = tok.intConst;
        break;
    case TokenType::REALCONST:
        realConst = tok.realConst;
        break;
    case TokenType::IDENT:
        new (&ident) std::string(tok.ident);   // activate the string member
        break;
    default:
        oth = tok.oth;
        break;
    }
    type = tok.type;
    return *this;
}

void Token::print(std::ofstream& out) const {
    out << "Token {\n";
    out << "\tLexeme = " << lexeme << std::endl;
    out << "\tToken type = " << tokenTypeMap[type] << std::endl;
    switch (type) {
    case TokenType::INTCONST:
        out << "\tLValue = ";
        out << intConst << std::endl;
        break;
    case TokenType::REALCONST:
        out << "\tLValue = ";
        out << realConst << std::endl;
        break;
    case TokenType::IDENT:
        out << "\tLValue = ";
        out << ident << std::endl;
        break;
    default:
        break;
    }
    out << "}\n";
}

scanner.h

#pragma once

#ifndef SCAN
#define SCAN

#include <fstream>
#include <string>
#include <vector>

#include "token.h"

class Scanner {
    std::string fileName;
    std::ifstream inputFile;
    long currPosition;
public:
    Scanner(std::string name);
    Token nextToken();
};

#endif

scanner.cpp

#include <algorithm>
#include <cctype>      // isdigit, isalpha, isalnum
#include <iostream>
#include <string>

#include "scanner.h"
#include "token.h"
#include "toktype.h"
#include "exceptions.h"

// map used to associate the token type of a keyword
KeywordMap keywordMap {
    { "int", TokenType::INT }, { "double", TokenType::DOUBLE },
    { "return", TokenType::RETURN }, { "if", TokenType::IF },
    { "else", TokenType::ELSE },
};

Scanner::Scanner(std::string name) : fileName(name) {
    inputFile.open(name);
    if (!inputFile.is_open()) {
        throw Exception("Input file does not exist");
    }
    currPosition = inputFile.tellg();
}

Token Scanner::nextToken() {
    char c, c1;
    // read from the current position
    inputFile.seekg(currPosition, std::ios::beg);
    while ((c = inputFile.get()) && inputFile.good()) {
        if ((c == ' ') || (c == '\n') || (c == '\t')) {  // skip whitespace
            while ((c = inputFile.get()) && (c == ' ' || c == '\n' || c == '\t'))
                ;
            if (!inputFile.good())
                break;  // only whitespace remained before the end of the file
        }
        switch (c) {
        case '+':
        {
            // lexeme
            std::string tLexeme;
            tLexeme.push_back(c);
            // token
            Token t(TokenType::PLUS, tLexeme);
            // save the current position
            currPosition = inputFile.tellg();
            return t;
        }
        case '*':
        {
            // lexeme
            std::string tLexeme;
            tLexeme.push_back(c);
            // token
            Token t(TokenType::TIMES, tLexeme);
            // save the current position
            currPosition = inputFile.tellg();
            return t;
        }
        case ',':
        {
            // lexeme
            std::string tLexeme;
            tLexeme.push_back(c);
            // token
            Token t(TokenType::COMMA, tLexeme);
            // save the current position
            currPosition = inputFile.tellg();
            return t;
        }
        case ';':
        {
            // lexeme
            std::string tLexeme;
            tLexeme.push_back(c);
            // token
            Token t(TokenType::SEMICOLON, tLexeme);
            // save the current position
            currPosition = inputFile.tellg();
            return t;
        }
        case '(':
        {
            // lexeme
            std::string tLexeme;
            tLexeme.push_back(c);
            // token
            Token t(TokenType::LPAREN, tLexeme);
            // save the current position
            currPosition = inputFile.tellg();
            return t;
        }
        case ')':
        {
            // lexeme
            std::string tLexeme;
            tLexeme.push_back(c);
            // token
            Token t(TokenType::RPAREN, tLexeme);
            // save the current position
            currPosition = inputFile.tellg();
            return t;
        }
        case '{':
        {
            // lexeme
            std::string tLexeme;
            tLexeme.push_back(c);
            // token
            Token t(TokenType::LBRACE, tLexeme);
            // save the current position
            currPosition = inputFile.tellg();
            return t;
        }
        case '}':
        {
            // lexeme
            std::string tLexeme;
            tLexeme.push_back(c);
            // token
            Token t(TokenType::RBRACE, tLexeme);
            // save the current position
            currPosition = inputFile.tellg();
            return t;
        }
        case '!':
        {
            // lexeme
            std::string tLexeme;
            tLexeme.push_back(c);
            inputFile >> c1;
            if (c1 == '=') {
                tLexeme.push_back(c1);
                // token
                Token t(TokenType::NEQ, tLexeme);
                // save the current position
                currPosition = inputFile.tellg();
                return t;
            }
            else {
                throw Exception("Illegal character after !");
            }
        }
        case '=':
        {
            // lexeme
            std::string tLexeme;
            tLexeme.push_back(c);
            inputFile >> c1;
            if (c1 == '=') {
                tLexeme.push_back(c1);
                // token
                Token t(TokenType::EQ, tLexeme);
                // save the current position
                currPosition = inputFile.tellg();
                return t;
            }
            else {
                // assignment operator
                // push back the last char in the stream
                inputFile.putback(c1);
                // token
                Token t(TokenType::ASSIGN, tLexeme);
                // save the current position
                currPosition = inputFile.tellg();
                return t;
            }
        }
        case '<':
        {
            // lexeme
            std::string tLexeme;
            tLexeme.push_back(c);
            inputFile >> c1;
            if (c1 == '=') {
                tLexeme.push_back(c1);
                // token
                Token t(TokenType::LTE, tLexeme);
                // save the current position
                currPosition = inputFile.tellg();
                return t;
            }
            else {
                // < operator
                // push back the last char in the stream
                inputFile.putback(c1);
                // token
                Token t(TokenType::LT, tLexeme);
                // save the current position
                currPosition = inputFile.tellg();
                return t;
            }
        }
        case '>':
        {
            // lexeme
            std::string tLexeme;
            tLexeme.push_back(c);
            inputFile >> c1;
            if (c1 == '=') {
                tLexeme.push_back(c1);
                // token
                Token t(TokenType::GTE, tLexeme);
                // save the current position
                currPosition = inputFile.tellg();
                return t;
            }
            else {
                // > operator
                // push back the last char in the stream
                inputFile.putback(c1);
                // token
                Token t(TokenType::GT, tLexeme);
                // save the current position
                currPosition = inputFile.tellg();
                return t;
            }
        }
        case '.':  // can be a constant .y
        {
            // lexeme
            std::string tLexeme;
            tLexeme.push_back(c);
            // numerical value
            double xVal = 0.0;
            // fractional part
            double fPart = 0.0;
            double y = 1.0 / 10.0;
            while ((c1 = inputFile.get()) && isdigit(c1)) {
                fPart = fPart + (c1 - '0') * y;
                y /= 10.0;
                tLexeme.push_back(c1);
            }
            // the real value
            xVal = xVal + fPart;
            // push back the last char in the stream
            inputFile.putback(c1);
            // the token
            Token t(TokenType::REALCONST, tLexeme, xVal);
            // save the current position
            currPosition = inputFile.tellg();
            return t;
        }
        case '0':  // 0 constant or 0.x constant
        {
            std::string tLexeme;
            tLexeme.push_back(c);
            // numerical values are stored in these variables
            int iVal = 0;
            double xVal = 0.0;
            inputFile >> c1;
            if (isdigit(c1)) {
                throw Exception("A decimal constant must start with 1-9");
            }
            else if (c1 == '.') {  // a real number 0.x
                // the fractional part
                double fPart = 0.0;
                double y = 1.0 / 10.0;
                tLexeme.push_back(c1);
                while ((c1 = inputFile.get()) && isdigit(c1)) {
                    fPart = fPart + (c1 - '0') * y;
                    y /= 10.0;
                    tLexeme.push_back(c1);
                }
                // the real value
                xVal = xVal + fPart;
                // push back the last char in the stream
                inputFile.putback(c1);
                // the token
                Token t(TokenType::REALCONST, tLexeme, xVal);
                // save the current position
                currPosition = inputFile.tellg();
                return t;
            }
            else {  // 0 constant
                // push back the last char in the stream
                inputFile.putback(c1);
                // the token
                Token t(TokenType::INTCONST, tLexeme, iVal);
                // save the current position
                currPosition = inputFile.tellg();
                return t;
            }
        }
        default:
        {
            if (isdigit(c)) {  // an n constant or an x.y constant
                // lexeme
                std::string tLexeme;
                tLexeme.push_back(c);
                // numerical values are stored in these variables
                int iVal = 0;
                double xVal = 0.0;
                // update the values
                iVal = 10 * iVal + c - '0';
                xVal = 10 * xVal + c - '0';
                while ((c1 = inputFile.get()) && isdigit(c1)) {  // decimal part
                    iVal = 10 * iVal + c1 - '0';
                    xVal = 10 * xVal + c1 - '0';
                    tLexeme.push_back(c1);
                }
                if (c1 != '.') {  // an integer constant
                    // push back the last char in the stream
                    inputFile.putback(c1);
                    // the token
                    Token t(TokenType::INTCONST, tLexeme, iVal);
                    // save the current position
                    currPosition = inputFile.tellg();
                    return t;
                }
                else {  // a real constant x.y
                    // the fractional part
                    double fPart = 0.0;
                    double y = 1.0 / 10.0;
                    tLexeme.push_back(c1);
                    while ((c1 = inputFile.get()) && isdigit(c1)) {
                        fPart = fPart + (c1 - '0') * y;
                        y /= 10.0;
                        tLexeme.push_back(c1);
                    }
                    // the real value
                    xVal = xVal + fPart;
                    // push back the last char in the stream
                    inputFile.putback(c1);
                    // the token
                    Token t(TokenType::REALCONST, tLexeme, xVal);
                    // save the current position
                    currPosition = inputFile.tellg();
                    return t;
                }
            }
            else if (isalpha(c)) {  // keyword or identifier
                // lexeme
                std::string tLexeme;
                tLexeme.push_back(c);
                // store lexeme characters
                while ((c1 = inputFile.get()) && isalnum(c1)) {
                    tLexeme.push_back(c1);
                }
                // push back the last char in the stream
                inputFile.putback(c1);
                // keyword test
                if (keywordMap.find(tLexeme) != keywordMap.end()) {
                    // keyword detected - use the keyword's token type
                    Token t(keywordMap[tLexeme], tLexeme);
                    // save the current position
                    currPosition = inputFile.tellg();
                    return t;
                }
                else {
                    // identifier - the lexeme is also the identifier value
                    Token t(TokenType::IDENT, tLexeme, tLexeme);
                    // save the current position
                    currPosition = inputFile.tellg();
                    return t;
                }
            }
            else {
                throw Exception("Unknown character");
            }
        }
        }
    }
    // end of the file - the special token END
    std::string s("END");
    Token t(TokenType::END, s);
    return t;
}

Source0.cpp

#include <fstream>
#include <iostream>
#include <string>

#include "scanner.h"
#include "token.h"
#include "exceptions.h"

std::ofstream outputFile;

int main() {
    std::string fileName;
    std::string outFileName;
    TokenType type;
    std::cout << "Input file name: ";
    std::cin >> fileName;
    std::cout << "Output file name: ";
    std::cin >> outFileName;
    outputFile.open(outFileName, std::ofstream::out | std::ofstream::app);

    try {
        Scanner sc(fileName);
        do {
            Token t = sc.nextToken();
            t.print(outputFile);
            type = t.getType();
        } while (type != TokenType::END);
    }
    catch (Exception e) {
        e.print();
        return 1;
    }

    return 0;
}
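As an illustration (this example is ours, not from the original sources), running the program on an input file containing

int main() {
    double x;
    x = 0.5 + 2;
    return 0;
}

should write, one per call to nextToken(), tokens of the types INT, IDENT, LPAREN, RPAREN, LBRACE, DOUBLE, IDENT, SEMICOLON, IDENT, ASSIGN, REALCONST, PLUS, INTCONST, SEMICOLON, RETURN, INTCONST, SEMICOLON, RBRACE and finally END to the output file.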
