Flex Coursz
    int num_lines = 0, num_chars = 0;
%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%
int main()
{
    yylex();
    printf( "# of lines = %d, # of chars = %d\n",
            num_lines, num_chars );
}
This scanner counts the number of characters and the number of lines in its input. It
produces no output other than the final report on the character and line counts. The
first line declares two globals, num_lines and num_chars, which are accessible both inside
yylex() and in the main() routine declared after the second ‘%%’. There are two rules, one
which matches a newline (‘\n’) and increments both the line count and the character count,
and one which matches any character other than a newline (indicated by the ‘.’ regular
expression).
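For comparison, the same counting can be sketched in plain C without flex. This is only a model of what the two rules compute, not the code flex generates; the helper name count_input and the use of a string in place of the input stream are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts characters and newlines in a string,
 * mirroring the two rules of the scanner above. */
static void count_input(const char *s, int *num_lines, int *num_chars)
{
    assert(s != NULL && num_lines != NULL && num_chars != NULL);

    *num_lines = 0;
    *num_chars = 0;
    for (; *s != '\0'; ++s) {
        if (*s == '\n')
            ++*num_lines;   /* the newline rule also counts a line */
        ++*num_chars;       /* both rules count one character */
    }
}
```

Calling count_input("one\ntwo\n", &lines, &chars) leaves lines at 2 and chars at 8, since each newline is counted as a character as well as a line.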
A somewhat more complicated example:
/* scanner for a toy Pascal-like language */
%{
/* need this for the calls to atoi() and atof() below */
#include <stdlib.h>
%}
DIGIT [0-9]
ID [a-z][a-z0-9]*
%%
{DIGIT}+    {
            printf( "An integer: %s (%d)\n", yytext,
                    atoi( yytext ) );
            }
{DIGIT}+"."{DIGIT}*        {
            printf( "A float: %s (%g)\n", yytext,
                    atof( yytext ) );
            }
if|then|begin|end|procedure|function        {
            printf( "A keyword: %s\n", yytext );
            }
%%
int main()
{
    yylex();
}
This is the beginning of a simple scanner for a language like Pascal. It identifies different
types of tokens and reports on what it has seen.
The details of this example will be explained in the following sections.
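To make the three visible patterns concrete, here is a rough plain-C sketch that classifies one complete lexeme as an integer, a float, or a keyword. It is an illustration only: the names token_kind and classify are hypothetical, and the sketch does not model how a generated scanner splits a stream into tokens.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum token_kind { TOK_INTEGER, TOK_FLOAT, TOK_KEYWORD, TOK_OTHER };

/* Hypothetical helper: classifies one complete lexeme the way the three
 * rules shown above would describe it. */
static enum token_kind classify(const char *s)
{
    static const char *const keywords[] = {
        "if", "then", "begin", "end", "procedure", "function"
    };
    size_t i = 0, k;

    assert(s != NULL);

    /* {DIGIT}+ : one or more leading digits */
    while (isdigit((unsigned char)s[i]))
        ++i;
    if (i > 0) {
        if (s[i] == '\0')
            return TOK_INTEGER;
        /* {DIGIT}+"."{DIGIT}* : digits, a dot, then optional digits */
        if (s[i] == '.') {
            ++i;
            while (isdigit((unsigned char)s[i]))
                ++i;
            if (s[i] == '\0')
                return TOK_FLOAT;
        }
        return TOK_OTHER;
    }

    /* if|then|begin|end|procedure|function */
    for (k = 0; k < sizeof keywords / sizeof keywords[0]; ++k)
        if (strcmp(s, keywords[k]) == 0)
            return TOK_KEYWORD;

    return TOK_OTHER;
}
```

Note that "3." is classified as a float because the digits after the dot are optional, while ".5" is not, because the pattern requires at least one digit before the dot.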
Chapter 5: Format of the Input File 6
%top{
    #include <stdint.h>
    #include <inttypes.h>
}
Multiple %top blocks are allowed, and their order is preserved.
/* Definitions Section */
%x STATE_X
%%
    /* Rules Section */
ruleA   /* after regex */ { /* code block */ } /* after code block */
        /* Rules Section (indented) */
<STATE_X>{
ruleC ECHO;
ruleD ECHO;
%{
/* code block */
%}
}
%%
/* User Code Section */
6 Patterns
The patterns in the input (see Section 5.2 [Rules Section], page 7) are written using an
extended set of regular expressions. These are:
‘x’ match the character ’x’
‘.’ any character (byte) except newline
‘[xyz]’ a character class; in this case, the pattern matches either an ’x’, a ’y’, or a ’z’
‘[abj-oZ]’
a "character class" with a range in it; matches an ’a’, a ’b’, any letter from ’j’
through ’o’, or a ’Z’
‘[^A-Z]’ a "negated character class", i.e., any character but those in the class. In this
case, any character EXCEPT an uppercase letter.
‘[^A-Z\n]’
any character EXCEPT an uppercase letter or a newline
‘[a-z]{-}[aeiou]’
the lowercase consonants
‘r*’ zero or more r’s, where r is any regular expression
‘r+’ one or more r’s
‘r?’ zero or one r’s (that is, “an optional r”)
‘r{2,5}’ anywhere from two to five r’s
‘r{2,}’ two or more r’s
‘r{4}’ exactly 4 r’s
‘{name}’ the expansion of the ‘name’ definition (see Chapter 5 [Format], page 6).
‘"[xyz]\"foo"’
the literal string: ‘[xyz]"foo’
‘\X’ if X is ‘a’, ‘b’, ‘f’, ‘n’, ‘r’, ‘t’, or ‘v’, then the ANSI-C interpretation of ‘\x’.
Otherwise, a literal ‘X’ (used to escape operators such as ‘*’)
‘\0’ a NUL character (ASCII code 0)
‘\123’ the character with octal value 123
‘\x2a’ the character with hexadecimal value 2a
‘(r)’ match an ‘r’; parentheses are used to override precedence (see below)
‘(?r-s:pattern)’
apply option ‘r’ and omit option ‘s’ while interpreting pattern. Options may
be zero or more of the characters ‘i’, ‘s’, or ‘x’.
‘i’ means case-insensitive. ‘-i’ means case-sensitive.
‘s’ alters the meaning of the ‘.’ syntax to match any single byte whatsoever.
‘-s’ alters the meaning of ‘.’ to match any byte except ‘\n’.
• If your scanner is case-insensitive (the ‘-i’ flag), then ‘[:upper:]’ and ‘[:lower:]’ are
equivalent to ‘[:alpha:]’.
• Character classes with ranges, such as ‘[a-Z]’, should be used with caution in a case-
insensitive scanner if the range spans upper or lowercase characters. Flex does not
know if you want to fold all upper and lowercase characters together, or if you want the
literal numeric range specified (with no case folding). When in doubt, flex will assume
that you meant the literal numeric range, and will issue a warning. The exception to
this rule is a character range such as ‘[a-z]’ or ‘[S-W]’ where it is obvious that you
want case-folding to occur. Here are some examples with the ‘-i’ flag enabled:
Range      Result      Literal Range         Alternate Range
‘[a-t]’    ok          ‘[a-tA-T]’
‘[A-T]’    ok          ‘[a-tA-T]’
‘[A-t]’    ambiguous   ‘[A-Z\[\\\]_`a-t]’    ‘[a-tA-T]’
‘[_-{]’    ambiguous   ‘[_`a-z{]’            ‘[_`a-zA-Z{]’
‘[@-C]’    ambiguous   ‘[@ABC]’              ‘[@A-Z\[\\\]_`abc]’
• A negated character class such as the example ‘[^A-Z]’ above will match a newline
unless ‘\n’ (or an equivalent escape sequence) is one of the characters explicitly present
in the negated character class (e.g., ‘[^A-Z\n]’). This is unlike how many other regular
expression tools treat negated character classes, but unfortunately the inconsistency is
historically entrenched. Matching newlines means that a pattern like ‘[^"]*’ can match
the entire input unless there’s another quote in the input.
Flex allows negation of character class expressions by prepending ‘^’ to the POSIX
character class name.
[:^alnum:] [:^alpha:] [:^blank:]
[:^cntrl:] [:^digit:] [:^graph:]
[:^lower:] [:^print:] [:^punct:]
[:^space:] [:^upper:] [:^xdigit:]
Flex will issue a warning if the expressions ‘[:^upper:]’ and ‘[:^lower:]’ appear in
a case-insensitive scanner, since their meaning is unclear. The current behavior is to
skip them entirely, but this may change without notice in future revisions of flex.
• The ‘{-}’ operator computes the difference of two character classes. For example,
‘[a-c]{-}[b-z]’ represents all the characters in the class ‘[a-c]’ that are not in the
class ‘[b-z]’ (which in this case, is just the single character ‘a’). The ‘{-}’ operator
is left associative, so ‘[abc]{-}[b]{-}[c]’ is the same as ‘[a]’. Be careful not to
accidentally create an empty set, which will never match.
• The ‘{+}’ operator computes the union of two character classes. For example,
‘[a-z]{+}[0-9]’ is the same as ‘[a-z0-9]’. This operator is useful when preceded
by the result of a difference operation, as in, ‘[[:alpha:]]{-}[[:lower:]]{+}[q]’,
which is equivalent to ‘[A-Zq]’ in the "C" locale.
• A rule can have at most one instance of trailing context (the ‘/’ operator or the ‘$’
operator). The start condition, ‘^’, and ‘<<EOF>>’ patterns can only occur at the
beginning of a pattern, and, as well as with ‘/’ and ‘$’, cannot be grouped inside
parentheses. A ‘^’ which does not occur at the beginning of a rule or a ‘$’ which does
not occur at the end of a rule loses its special properties and is treated as a normal
character.
• The following are invalid:
foo/bar$
<sc1>foo<sc2>bar
Note that the first of these can be written ‘foo/bar\n’.
• The following will result in ‘$’ or ‘^’ being treated as a normal character:
foo|(bar$)
foo|^bar
If the desired meaning is a ‘foo’ or a ‘bar’-followed-by-a-newline, the following could
be used (the special | action is explained below, see Chapter 8 [Actions], page 15):
foo |
bar$ /* action goes here */
A similar trick will work for matching a ‘foo’ or a ‘bar’-at-the-beginning-of-a-line.
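The ‘{-}’ and ‘{+}’ character class operators described in the notes above behave like set difference and set union over byte values. The following plain-C sketch models a class as a 256-entry membership table. All names in it (char_class, class_range, class_diff, class_union, class_has, class_count) are hypothetical and are not part of flex.

```c
#include <assert.h>
#include <string.h>

#define CLASS_SIZE 256

/* A character class modelled as a membership table indexed by byte value. */
typedef struct { unsigned char member[CLASS_SIZE]; } char_class;

/* Builds a class from an inclusive byte range, e.g. 'a' through 'z'. */
static char_class class_range(unsigned char lo, unsigned char hi)
{
    char_class c;
    int ch;

    assert(lo <= hi);
    memset(&c, 0, sizeof c);
    for (ch = lo; ch <= hi; ++ch)
        c.member[ch] = 1;
    return c;
}

/* Models {-}: the bytes in a that are not in b. */
static char_class class_diff(char_class a, char_class b)
{
    int ch;

    for (ch = 0; ch < CLASS_SIZE; ++ch)
        a.member[ch] = a.member[ch] && !b.member[ch];
    return a;
}

/* Models {+}: the bytes in a or in b. */
static char_class class_union(char_class a, char_class b)
{
    int ch;

    for (ch = 0; ch < CLASS_SIZE; ++ch)
        a.member[ch] = a.member[ch] || b.member[ch];
    return a;
}

/* Returns 1 if byte ch belongs to the class, 0 otherwise. */
static int class_has(char_class c, unsigned char ch)
{
    return c.member[ch];
}

/* Number of bytes in the class; 0 means the class can never match. */
static int class_count(char_class c)
{
    int ch, n = 0;

    for (ch = 0; ch < CLASS_SIZE; ++ch)
        n += c.member[ch];
    return n;
}
```

With this model, class_diff(class_range('a', 'c'), class_range('b', 'z')) contains only 'a', matching the ‘[a-c]{-}[b-z]’ example, and a difference whose count is zero corresponds to the empty set that will never match.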
Chapter 7: How the Input Is Matched 14
8 Actions
Each pattern in a rule has a corresponding action, which can be any arbitrary C statement.
The pattern ends at the first non-escaped whitespace character; the remainder of the line
is its action. If the action is empty, then when the pattern is matched the input token is
simply discarded. For example, here is the specification for a program which deletes all
occurrences of ‘zap me’ from its input:
%%
"zap me"
This example will copy all other characters in the input to the output since they will be
matched by the default rule.
Here is a program which compresses multiple blanks and tabs down to a single blank,
and throws away whitespace found at the end of a line:
%%
[ \t]+        putchar( ' ' );
[ \t]+$       /* ignore this token */
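The effect of these two rules, together with the default rule that copies everything else, can be modelled in plain C roughly as follows. This is a sketch under assumptions: the helper name compress_blanks is hypothetical, and it works on a string rather than on the scanner's input stream.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes the filtered copy of `in` to `out`, which must
 * have room for as many bytes as `in` plus the terminating NUL.
 * Returns the length of the output. */
static size_t compress_blanks(const char *in, char *out)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);

    while (in[i] != '\0') {
        if (in[i] == ' ' || in[i] == '\t') {
            /* consume the whole run of blanks and tabs */
            while (in[i] == ' ' || in[i] == '\t')
                ++i;
            /* a run directly before a newline is dropped;
             * any other run becomes a single blank */
            if (in[i] != '\n')
                out[n++] = ' ';
        } else {
            out[n++] = in[i++];   /* everything else is copied unchanged */
        }
    }
    out[n] = '\0';
    return n;
}
```

As in the flex version, a run of blanks at the very end of the input that is not followed by a newline is kept as a single blank, because ‘$’ only matches before a newline.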
If the action contains a ‘{’, then the action spans till the balancing ‘}’ is found, and
the action may cross multiple lines. flex knows about C strings and comments and won’t
be fooled by braces found within them, but also allows actions to begin with ‘%{’ and will
consider the action to be all the text up to the next ‘%}’ (regardless of ordinary braces inside
the action).
An action consisting solely of a vertical bar (‘|’) means “same as the action for the next
rule”. See below for an illustration.
Actions can include arbitrary C code, including return statements to return a value
to whatever routine called yylex(). Each time yylex() is called it continues processing
tokens from where it last left off until it either reaches the end of the file or executes a
return.
Actions are free to modify yytext except for lengthening it (adding characters to its
end, which would overwrite later characters in the input stream). This restriction does not apply
when using %array (see Chapter 7 [Matching], page 14). In that case, yytext may be freely
modified in any way.
Actions are free to modify yyleng except they should not do so if the action also includes
use of yymore() (see below).
There are a number of special directives which can be included within an action:
ECHO copies yytext to the scanner’s output.
BEGIN followed by the name of a start condition places the scanner in the corresponding
start condition (see below).
REJECT directs the scanner to proceed on to the “second best” rule which matched
the input (or a prefix of the input). The rule is chosen as described above in
Chapter 7 [Matching], page 14, and yytext and yyleng set up appropriately.
It may either be one which matched as much text as the originally chosen rule
but came later in the flex input file, or one which matched less text. For
example, the following will both count the words in the input and call the
routine special() whenever ‘frob’ is seen:
    int word_count = 0;
%%
[a-z]+ ECHO;
An argument of 0 to yyless() will cause the entire current input string to be scanned
again. Unless you’ve changed how the scanner will subsequently process its input (using
BEGIN, for example), this will result in an endless loop.
Note that yyless() is a macro and can only be used in the flex input file, not from
other source files.
unput(c) puts the character c back onto the input stream. It will be the next character
scanned. The following action will take the current token and cause it to be rescanned
enclosed in parentheses.
{
    int i;
    /* Copy yytext because unput() trashes yytext */
    char *yycopy = strdup( yytext );
    unput( ')' );
    for ( i = yyleng - 1; i >= 0; --i )
        unput( yycopy[i] );
    unput( '(' );
    free( yycopy );
}
Note that since each unput() puts the given character back at the beginning of the input
stream, pushing back strings must be done back-to-front.
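Why the push-back must run back-to-front can be seen in a small plain-C model in which the pushed-back characters form a stack. This is a sketch only: the struct pushback and the functions push_back, next_char, and push_back_parenthesized are hypothetical stand-ins, not flex's own buffer handling.

```c
#include <assert.h>
#include <string.h>

#define PUSHBACK_MAX 64

/* Hypothetical stand-in for the scanner's input: a stack of pushed-back
 * characters, where the most recently pushed character is read first. */
struct pushback {
    char buf[PUSHBACK_MAX];
    size_t len;
};

/* Models unput(c): c becomes the next character to be read. */
static void push_back(struct pushback *pb, char c)
{
    assert(pb != NULL && pb->len < PUSHBACK_MAX);
    pb->buf[pb->len++] = c;
}

/* Models input(): returns the next character, or -1 when none remain. */
static int next_char(struct pushback *pb)
{
    assert(pb != NULL);
    return pb->len > 0 ? (unsigned char)pb->buf[--pb->len] : -1;
}

/* Pushes back the token enclosed in parentheses so that it is later
 * read front-to-back. */
static void push_back_parenthesized(struct pushback *pb, const char *token)
{
    size_t i = strlen(token);

    push_back(pb, ')');              /* read last, so pushed first */
    while (i > 0)
        push_back(pb, token[--i]);   /* last character of the token first */
    push_back(pb, '(');              /* read first, so pushed last */
}
```

After push_back_parenthesized(&pb, "ab") on an empty buffer, successive calls to next_char() return '(', 'a', 'b', and ')' in that order.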
An important potential problem when using unput() is that if you are using %pointer
(the default), a call to unput() destroys the contents of yytext, starting with its rightmost
character and devouring one character to the left with each call. If you need the value of
yytext preserved after a call to unput() (as in the above example), you must either first
copy it elsewhere, or build your scanner using %array instead (see Chapter 7 [Matching],
page 14).
Finally, note that you cannot put back ‘EOF’ to attempt to mark the input stream with
an end-of-file.
input() reads the next character from the input stream. For example, the following is
one way to eat up C comments:
%%
"/*"        {
            register int c;
            for ( ; ; )
                {
                while ( (c = input()) != '*' &&
                        c != EOF )
                    ;    /* eat up text of comment */
                if ( c == '*' )
                    {
                    while ( (c = input()) == '*' )
                        ;
                    if ( c == '/' )
                        break;    /* found the end */
                    }
                if ( c == EOF )
                    {
                    error( "EOF in comment" );
                    break;
                    }
                }
            }
(Note that if the scanner is compiled using C++, then input() is instead referred to as
yyinput(), in order to avoid a name clash with the C++ stream by the name of input.)
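The control flow of the comment-eating loop can be tried outside a scanner with a plain-C sketch that reads from a string instead of calling input(). The names next_byte and skip_comment are hypothetical, and -1 plays the role of EOF here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for input(): returns the next byte of a
 * NUL-terminated string, or -1 at its end. */
static int next_byte(const char **p)
{
    assert(p != NULL && *p != NULL);
    return **p != '\0' ? (unsigned char)*(*p)++ : -1;
}

/* Call with *p positioned just past the comment's opening delimiter.
 * Returns 1 and leaves *p just past the closing delimiter, or returns 0
 * if the input ends before the comment is closed. */
static int skip_comment(const char **p)
{
    int c;

    for (;;) {
        /* eat up the text of the comment */
        while ((c = next_byte(p)) != '*' && c != -1)
            ;
        if (c == '*') {
            /* skip a run of stars, then check for the slash */
            while ((c = next_byte(p)) == '*')
                ;
            if (c == '/')
                return 1;   /* found the end */
        }
        if (c == -1)
            return 0;       /* input ended inside the comment */
    }
}
```

The inner loop over consecutive stars is what lets a comment that ends in several stars followed by a slash terminate correctly.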
YY_FLUSH_BUFFER; flushes the scanner’s internal buffer so that the next time the scanner
attempts to match a token, it will first refill the buffer using YY_INPUT() (see Chapter 9
[Generated Scanner], page 19). This action is a special case of the more general
yy_flush_buffer() function, described below (see Chapter 11 [Multiple Input Buffers], page 27).
yyterminate() can be used in lieu of a return statement in an action. It terminates
the scanner and returns a 0 to the scanner’s caller, indicating “all done”. By default,
yyterminate() is also called when an end-of-file is encountered. It is a macro and may be
redefined.