100% found this document useful (2 votes)

346 views21 pages

Flex and Bison

This document provides an introduction and overview of Flex and Bison, which are tools for generating lexical analyzers and parsers. Flex is used to generate a lexer that breaks input into tokens, while Bison generates a parser based on a context-free grammar. The document then demonstrates a simple example of using Flex alone to lex a made-up file format called "Snazzle" without parsing. It walks through the Flex file that defines tokens for integers, floats, and strings and prints them out. This highlights the basic structure of Flex files and how they are compiled into a lexer.

Uploaded by

Mark Rudson

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

100% found this document useful (2 votes)

346 views21 pages

Flex and Bison

Uploaded by

Mark Rudson

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

www.aquamentus.

com
What are Flex and Bison?
lex vs. flex, yacc vs. bison
what we'll do!
Flex
reading from a file
Bison
a makefile
forcing carriage returns
line numbers
Tips
directly returning terminal characters
actions before the end of the grammar
whitespace in flex/bison files
avoiding -lfl and link problems with yywrap
renaming identifiers
moving output
stalling in Windows
Advanced
start states
reading gzipped input
Regular expression overview
Acknowledgements

WHAT ARE FLEX AND BISON?

Flex and Bison are aging unix utilities that help you write very fast parsers for almost arbitrary file formats. Formally, they implement Look-
Ahead-Left-Right (as opposed to "recursive descent") parsing of non-ambiguous context-free (as opposed to "natural language") grammars.

Why should you learn Flex/Bison pattern syntax when you could just write your own parser? Well, several reasons. First, Flex and Bison will
generate a parser that is virually guaranteed to be faster than anything you could write manually in a reasonable amount of time. Second,
updating and fixing Flex and Bison source files is a lot easier than updating and fixing custom parser code. Third, Flex and Bison have mechanisms
for error handling and recovery, which is something you definitely don't want to try to bolt onto a custom parser. Finally, Flex and Bison have been
around for a long time, so they far freer from bugs than newer code.

This webpage is supposed to be a tutorial for complete novices needing to use Flex and Bison for some real project. I wrote it as I was
learning them, so hopefully it explains things right as you need to know them. (The Flex and Bison tutorials I found Googling around tended to be
really heavy on theory and arcane details.) However, to really learn these tools you should take a look at a few books:

Flex & Bison by John Levine. This is the definitive text for learning these tools.
Effective Flex & Bison, which covers advanced topics and good practice.

lex vs. flex, yacc vs. bison

In addition to hearing about "flex and bison", you will also hear about "lex and yacc". "lex and yacc" are the original tools; "flex and bison" are
their almost completely compatible newer versions. Only very old code uses lex and yacc; most of the world has moved on to Flex and Bison.

All four of the above are C-based tools; they're written in C, but more important their output is C code. However, my project was in C++ -- so
this is also a tutorial on how to use C++ with Flex and Bison.

what we'll do!

Let's create a Flex/Bison parser for a silly format that I'll call "snazzle". Snazzle files have four general sections:

a header that just says "sNaZZle" followed by a version number,

a section that declares type names, like "type foo" and "type bas",
a section with actual data consisting of exactly 4 numbers and one of the aforementioned type names,
and a footer that just says "end".

(I don't have a particular file format or program in mind, I'm just coming up with this.) Here's an example of our silly format:

sNaZZle 1.3
type foo
type bar
type bas
0 0 10 5 foo
20 10 30 20 foo
7 8 12 13 bas
78 124 100 256 bar
end

For making parsers, you want to start at the high level because that's how Bison thinks about them. Happily, we already started at the high
level by saying that a "snazzlefile" is comprised of four general parts; we'll call them "header", "typedefs", "body", and "footer". Then we break
down those four general parts into the pieces that comprise them, and so on until we get all the way down what are called terminal symbols.
Terminal symbols are the smallest units of your grammar -- for our example here, integers like "15" are one of our terminal symbols. (But not the
individual characters "1" and "5" that make it up.) Terminal symbols are the boundary between Flex and Bison: Flex sees the individual "1" and "5"
characters but combines them into "15" before giving it to Bison.

FLEX
For our silly snazzle files, we'll make 3 terminal symbol types: a NUMBER that's basically an integer, a FLOAT that's a floating-point number
(which we only need for the version), and a STRING which is pretty much everything else. Furthermore, since this is a tutorial, let's show off Flex's
power by making a program that just lexes our input without using Bison to parse it. Here's the first whack:

snazzle.l

1 %{
2 #include <iostream>
3 using namespace std;
4 extern int yylex();
5 %}
6 %option noyywrap
7 %%
8 [ \t\n] ;
9 [0-9]+\.[0-9]+ { cout << "Found a floating-point number:" << yytext << endl; }
10 [0-9]+ { cout << "Found an integer:" << yytext << endl; }
11 [a-zA-Z0-9]+ { cout << "Found a string: " << yytext << endl; }
12 %%
13 int main(int, char**) {
14 // lex through the input:
15 while (yylex());
16 }

Flex and Bison files have three sections:

1. the first is sort of "control" information,

2. the second is the actual token (Flex) or grammar (Bison) definitions,
3. the last is C code to be copied verbatim to the output.

These sections are divided by %%, which you see on lines 7 and 12. Let's go through this line by line.

Lines 1 (and 5) are delimeters that tell Flex that lines 2 through 4 are C code to be copied directly to the generated lexer.
Lines 2 and 3 are to get access to cout.
Line 4 is to declare the yylex function that we're going to call later. The yylex function is actually what Flex generates, so it will magically
appear when we run our snazzle.l through Flex. Note that mixing C and C++ has several delicate points, one of which is "name-mangling" -
C++ compilation of a function named yylex will result in a symbol named something like _Z5yylexv. That's fine as long as everyone
calling yylex is also compiled in C++ mode, so that their linkage is similarly mangled. However, if you want to compile any part of it in C
mode, you will probably have to turn off name-mangling for it like so:
declare with extern "C" int yylex();
define it with #define YY_DECL extern "C" int yylex()
Line 6 is a handy Flex option that keeps it from looking for (or trying to use) the yywrap function. We're not going to use yywrap at all
here, and I don't want to have to add "-lfl" to the compilation just to pull in the default empty definition of it.
Line 7 is %%, which means we're done with the control section and moving on to the token section. Notice that we don't have much in our
control section -- the Bison control section gets a lot more usage than the Flex one.
Lines 8-11 are all the same (simple) format: a regular expression (a topic in its own right, which I touch on not at all exhaustively at the end
of this page) and an action. When Flex is reading through an input file and can match one of the regular expressions , it executes the action
.
The regular expression is not Perl's idea of a regular expression, so you can't use "\d", but the normal stuff is all available. The action is just
C code that is copied into the eventual flex output; accordingly, you can have a single statement or you can have curly braces with a whole
bunch of statements. Some specifics on the file format I discovered: the action has to be left-justified (if there's whitespace beginning a line
where a pattern is expected, the line is considered a comment!); the separation between the pattern and the action is just whitespace (even
just a single space will do); the action is not limited to a single line if you use curly braces.
Line 12 is another %% delimiter, meaning we're done with the second section and we can go onto the third.
Lines 13-16 are the third section, which is exclusively for copied C code. (Notice that, unlike the control section at the top, there is no "%{"
or "%}".) We don't normally need to put anything in this section for the Flex file -- but for this example, if we put the main() function in
here, then all of our code is in one file and then we don't have to link a separate main.o.

This example can be compiled by running this:

% flex snazzle.l

This will produce the file "lex.yy.c", which we can then compile with g++:

% g++ lex.yy.c -o snazzle

If all went as planned, you should be able to run it and enter stuff on STDIN to be lexed:

% ./snazzle
90
Found an integer:90
23.4
Found a floating-point number:23.4
4 5 6
Found an integer:4
Found an integer:5
Found an integer:6
this is text!
Found a string: this
Found a string: is
Found a string: text
!

Pretty cool! Notice that the exclamation mark at the very end was just echoed: when Flex finds something that doesn't match any of the
regexs it echos it to STDOUT. Usually this is a good indication that your token definitions aren't complete enough, but to get rid of it for now you
could just add . ; to the token section -- that will match any one character (the ".") and do nothing with it (the empty C statement ";").

Now for a little more detail on the syntax for the middle section. In general, it is really as simple as "a regex match" followed by "what to do
with it". The "what to do with it" can vary a lot though - for example, most parsers completely ignore whitespace, which you can do by making the
"action" just a semicolon:

[ \t] ; // ignore all whitespace

Most parsers also want to keep track of the line number, which you would do by catching all the carriage returns and incrementing a line counter.
However, you want the carriage return itself to be ignored as if it were whitespace, so even though you have an action performed here you want
tonot put a return statement in it:

\n { ++linenum; } // increment line count, but don't return a token

Pretty much everything else will need to return a token to Bison, but more on that later.

reading from a file

OK, first big upgrade: reading from a file. Right now we're reading from STDIN, which is kind of annoying, because most of the time you'd
really like to pick a file to read from. Flex reads its input from a global pointer to a C FILE variable called yyin, which is set to STDIN by default.
All you have to do is set that pointer to your own file handle, and it'll read from there instead. That changes our example as follows (differences
are in bold):

snazzle.l
%{
#include <iostream>
using namespace std;
extern int yylex();
%}
%%
[ \t\n] ;
[0-9]+\.[0-9]+ { cout << "Found a floating-point number:" << yytext << endl; }
[0-9]+ { cout << "Found an integer:" << yytext << endl; }
[a-zA-Z0-9]+ { cout << "Found a string: " << yytext << endl; }
. ;
%%
int main(int, char**) {

// open a file handle to a particular file:

FILE *myfile = fopen("a.snazzle.file", "r");
// make sure it's valid:
if (!myfile) {
cout << "I can't open a.snazzle.file!" << endl;
return -1;
}
// set lex to read from it instead of defaulting to STDIN:
yyin = myfile;

// lex through the input:

while(yylex());
}

Here's how to run our new code, though there's nothing earth-shattering:

% flex snazzle.l
% g++ lex.yy.c -lfl -o snazzle
% cat a.snazzle.file
90
23.4
4 5 6
this is text!
% ./snazzle
Found an integer:90
Found a floating-point number:23.4
Found an integer:4
Found an integer:5
Found an integer:6
Found a string: this
Found a string: is
Found a string: text
%

BISON
The first thing we have to do in our silly parser is to start defining the terminal tokens; for us, those are ints, floats, and strings. Even though
it is Flex that figures out and returns tokens by type, defining those types is done in the Bison file. Also, let's move main() into the third section
("copied-C-code") of the Bison file:

snazzle.y
%{
#include <cstdio>
#include <iostream>
using namespace std;

// Declare stuff from Flex that Bison needs to know about:

extern int yylex();
extern int yyparse();
extern FILE *yyin;

void yyerror(const char *s);

// Bison fundamentally works by asking flex to get the next token, which it
// returns as an object of type "yystype". Initially (by default), yystype
// is merely a typedef of "int", but for non-trivial projects, tokens could
// be of any arbitrary data type. So, to deal with that, the idea is to
// override yystype's default typedef to be a C union instead. Unions can
// hold all of the types of tokens that Flex could return, and this this means
// we can return ints or floats or strings cleanly. Bison implements this
// mechanism with the %union directive:
%union {
int ival;
float fval;
char *sval;
}

// Define the "terminal symbol" token types I'm going to use (in CAPS
// by convention), and associate each with a field of the %union:
%token <ival> INT
%token <fval> FLOAT
%token <sval> STRING

%%
// This is the actual grammar that bison will parse, but for right now it's just
// something silly to echo to the screen what bison gets from flex. We'll
// make a real one shortly:
snazzle:
INT snazzle {
cout << "bison found an int: " << $1 << endl;
}
| FLOAT snazzle {
cout << "bison found a float: " << $1 << endl;
}
| STRING snazzle {
cout << "bison found a string: " << $1 << endl; free($1);
}
| INT {
cout << "bison found an int: " << $1 << endl;
}
| FLOAT {
cout << "bison found a float: " << $1 << endl;
}
| STRING {
cout << "bison found a string: " << $1 << endl; free($1);
}
;
%%

int main(int, char**) {

// Open a file handle to a particular file:
FILE *myfile = fopen("a.snazzle.file", "r");
// Make sure it is valid:
if (!myfile) {
cout << "I can't open a.snazzle.file!" << endl;
return -1;
}
// Set Flex to read from it instead of defaulting to STDIN:
yyin = myfile;

// Parse through the input:

yyparse();

void yyerror(const char *s) {

cout << "EEK, parse error! Message: " << s << endl;
// might as well halt now:
exit(-1);
}

Now that the Bison file has declared the terminal types (%tokens), we can implement them in the Flex file:

snazzle.l
%{
#include <cstdio>

#include "snazzle.tab.h" // to get the token types from Bison

%}
%%
[ \t\n] ;
[0-9]+\.[0-9]+ { yylval.fval = atof(yytext); return FLOAT; }
[0-9]+ { yylval.ival = atoi(yytext); return INT; }
[a-zA-Z0-9]+ {
// We have to strdup yytext because Flex will change it for the next token.
// Note that this memory must be freed somewhere, so that's why we call
// free() above in the Bison section. (Aside: we use free() instead of
// delete because strdup is a C function that uses malloc, not a C++
// function that uses new.)
yylval.sval = strdup(yytext);
return STRING;
}
%%

Since the Flex file now has to #include a file generated by Bison (snazzle.tab.h), we have to run Bison first. Also note that bison must be
run with -d to create a .h file for us.

% bison -d snazzle.y
% flex snazzle.l
% g++ snazzle.tab.c lex.yy.c -lfl -o snazzle
% ./snazzle
bison found a string: text
bison found a string: is
bison found a string: this
bison found an int: 6
bison found an int: 5
bison found an int: 4
bison found a float: 23.4
bison found an int: 90
%

You'll notice that the input is echoed out backward! This is because for each of the rules you define, Bison does not perform the action until it
matches the complete rule . In the above example, all of the recursive rules were right-recursive (i.e. they looked like "foo: bar foo", instead of
"foo: foo bar"). The right-recursive search will print the output backwards since it has to match EVERYTHING before it can start figuring out what's
what, AND it has another major drawback: if your input is big enough, right-recursive rules will overflow Bison's internal stack! So a better
implementation of the Bison grammar would be left-recursion:

snazzle.y
%{
#include <cstdio>
#include <iostream>
using namespace std;

// stuff from flex that bison needs to know about:

extern int yylex();
extern int yyparse();
extern FILE *yyin;

void yyerror(const char *s);

%union {
int ival;
float fval;
char *sval;
}

// define the "terminal symbol" token types I'm going to use (in CAPS
// by convention), and associate each with a field of the union:
%token <ival> INT
%token <fval> FLOAT
%token <sval> STRING

%%
// this is the actual grammar that bison will parse, but for right now it's just
// something silly to echo to the screen what bison gets from flex. We'll
// make a real one shortly:
snazzle:
snazzle INT {
cout << "bison found an int: " << $2 << endl;
}
| snazzle FLOAT {
cout << "bison found a float: " << $2 << endl;
}
| snazzle STRING {
cout << "bison found a string: " << $2 << endl; free($2);
}
| INT {
cout << "bison found an int: " << $1 << endl;
}
| FLOAT {
cout << "bison found a float: " << $1 << endl;
}
| STRING {
cout << "bison found a string: " << $1 << endl; free($1);
}
;
%%

int main(int, char**) {

// open a file handle to a particular file:
FILE *myfile = fopen("a.snazzle.file", "r");
// make sure it is valid:
if (!myfile) {
cout << "I can't open a.snazzle.file!" << endl;
return -1;
}
// Set flex to read from it instead of defaulting to STDIN:
yyin = myfile;
// Parse through the input:
yyparse();
}

void yyerror(const char *s) {

cout << "EEK, parse error! Message: " << s << endl;
// might as well halt now:
exit(-1);
}

(Notice that we also had to change the $1 in the cout to a $2 since the thing we wanted to print out is now the second element in the rule.)
This gives us the output we would hope for:

% bison -d snazzle.y
% g++ snazzle.tab.c lex.yy.c -lfl -o snazzle
% ./snazzle
bison found an int: 90
bison found a float: 23.4
bison found an int: 4
bison found an int: 5
bison found an int: 6
bison found a string: this
bison found a string: is
bison found a string: text
%

Now that the groundwork is completed, we can finally define the real file format in the Bison file. First we make the easy tokens in Flex: the
ones that are used to represent constant strings like SNAZZLE, TYPE, and END, respectively representing the strings "sNaZZle", "type", and "end".
Then we flesh out the actual rules in the Bison grammar, and end up with this beautiful object d'art:
snazzle.l
%{
#include "snazzle.tab.h"
%}
%%
[ \t\n] ;
sNaZZle { return SNAZZLE; }
type { return TYPE; }
end { return END; }
[0-9]+\.[0-9]+ { yylval.fval = atof(yytext); return FLOAT; }
[0-9]+ { yylval.ival = atoi(yytext); return INT; }
[a-zA-Z0-9]+ {
yylval.sval = strdup(yytext);
return STRING;
}
. ;
%%

snazzle.y
%{
#include <cstdio>
#include <iostream>
using namespace std;

// stuff from flex that bison needs to know about:

extern int yylex();
extern int yyparse();
extern FILE *yyin;

void yyerror(const char *s);

%union {
int ival;
float fval;
char *sval;
}

// define the constant-string tokens:

%token SNAZZLE TYPE
%token END

// define the "terminal symbol" token types I'm going to use (in CAPS
// by convention), and associate each with a field of the union:
%token <ival> INT
%token <fval> FLOAT
%token <sval> STRING

// the first rule defined is the highest-level rule, which in our

// case is just the concept of a whole "snazzle file":
snazzle:
header template body_section footer {
cout << "done with a snazzle file!" << endl;
}
;
header:
SNAZZLE FLOAT {
cout << "reading a snazzle file version " << $2 << endl;
}
;
template:
typelines
;
typelines:
typelines typeline
| typeline
;
typeline:
TYPE STRING {
cout << "new defined snazzle type: " << $2 << endl;
free($2);
}
;
body_section:
body_lines
;
body_lines:
body_lines body_line
| body_line
;
body_line:
INT INT INT INT STRING {
cout << "new snazzle: " << $1 << $2 << $3 << $4 << $5 << endl;
free($5);
}
;
footer:
END
;

int main(int, char**) {

// open a file handle to a particular file:
FILE *myfile = fopen("in.snazzle", "r");
// make sure it's valid:
if (!myfile) {
cout << "I can't open a.snazzle.file!" << endl;
return -1;
}
// Set flex to read from it instead of defaulting to STDIN:
yyin = myfile;

// Parse through the input:

yyparse();
}

void yyerror(const char *s) {

cout << "EEK, parse error! Message: " << s << endl;
// might as well halt now:
exit(-1);
}

This is compiled and run just like the others:

% bison -d snazzle.y
% flex snazzle.l
% g++ snazzle.tab.c lex.yy.c -lfl -o snazzle
% cat in.snazzle
sNaZZle 1.3
type foo
type bar
type bas
0 0 10 5 foo
20 10 30 20 foo
7 8 12 13 bas
78 124 100 256 bar
end

% ./snazzle
reading a snazzle file version 1.3
new defined snazzle type: foo
new defined snazzle type: bar
new defined snazzle type: bas
new snazzle: 00105foo
new snazzle: 20103020foo
new snazzle: 781213bas
new snazzle: 78124100256bar
done with a snazzle file!
%

So, that concludes this little tutorial, or at least the baseline part of it. I'm now going to pick a few upgrades at random and show you how
they might be done.

a makefile
When you finally get sick of manually rerunning Flex and Bison, and also forgetting in which order to do it, I heartily recommend setting up a
makefile. Here's a sample one I made:

% cat Makefile
snazzle.tab.c snazzle.tab.h: snazzle.y
bison -d snazzle.y

lex.yy.c: snazzle.l snazzle.tab.h

flex snazzle.l

snazzle: lex.yy.c snazzle.tab.c snazzle.tab.h

g++ snazzle.tab.c lex.yy.c -lfl -o snazzle
% make snazzle
bison -d snazzle.y
flex snazzle.l
g++ snazzle.tab.c lex.yy.c -lfl -o snazzle
%

forcing carriage returns

Next tweak: you'll notice that this completely ignores whitespace, so that you could put the entire snazzle file on a single line and it'll still be
okay. You may or may not want this behaviour, but let's assume that's bad, and require there to be carriage returns after the lines just like there is
in the sample file "in.snazzle". To do this, we need to do two things: recognize the '\n' token (flex), and add it to the grammar (bison):

snazzle.l
%{
#include "snazzle.tab.h"
%}
%%
[ \t] ;
sNaZZle { return SNAZZLE; }
type { return TYPE; }
end { return END; }
[0-9]+\.[0-9]+ { yylval.fval = atof(yytext); return FLOAT; }
[0-9]+ { yylval.ival = atoi(yytext); return INT; }
[a-zA-Z0-9]+ {
yylval.sval = strdup(yytext);
return STRING;
}
\n return ENDL;
. ;
%%

snazzle.y
%{
#include <cstdio>
#include <iostream>
using namespace std;

// stuff from flex that bison needs to know about:

extern int yylex();
extern int yyparse();
extern FILE *yyin;
void yyerror(const char *s);
%}

%union {
int ival;
float fval;
char *sval;
}

// define the constant-string tokens:

%token SNAZZLE TYPE
%token END ENDL

// define the "terminal symbol" token types I'm going to use (in CAPS
// by convention), and associate each with a field of the union:
%token <ival> INT
%token <fval> FLOAT
%token <sval> STRING
%%
// the first rule defined is the highest-level rule, which in our
// case is just the concept of a whole "snazzle file":
snazzle:
header template body_section footer {
cout << "done with a snazzle file!" << endl;
}
;
header:
SNAZZLE FLOAT ENDL {
cout << "reading a snazzle file version " << $2 << endl;
}
;
template:
typelines
;
typelines:
typelines typeline
| typeline
;
typeline:
TYPE STRING ENDL {
cout << "new defined snazzle type: " << $2 << endl; free($2);
}
;
body_section:
body_lines
;
body_lines:
body_lines body_line
| body_line
;
body_line:
INT INT INT INT STRING ENDL {
cout << "new snazzle: " << $1 << $2 << $3 << $4 << $5 << endl;
free($5);
}
;
footer:
END ENDL
;
%%

int main(int, char**) {

// open a file handle to a particular file:
FILE *myfile = fopen("a.snazzle.file", "r");
// make sure it's valid:
if (!myfile) {
cout << "I can't open a.snazzle.file!" << endl;
return -1;
}
// Set lex to read from it instead of defaulting to STDIN:
yyin = myfile;

// Parse through the input:

yyparse();
}

void yyerror(const char *s) {

cout << "EEK, parse error! Message: " << s << endl;
// might as well halt now:
exit(-1);
}

But when we go to run it, we get a parse error:

% make
% ./snazzle
reading a snazzle file version 1.3
new defined snazzle type: foo
new defined snazzle type: bar
new defined snazzle type: bas
new snazzle: 00105foo
new snazzle: 20103020foo
new snazzle: 781213bas
new snazzle: 78124100256bar
done with a snazzle file!
EEK, parse error! Message: syntax error
%

Why?? Well, it turns out that in.snazzle has an extra carriage return at the end:

% cat in.snazzle
sNaZZle 1.3
type foo
type bar
type bas
0 0 10 5 foo
20 10 30 20 foo
7 8 12 13 bas
78 124 100 256 bar
end

And if we remove it, snazzle would be happy:

% cat in.snazzle
sNaZZle 1.3
type foo
type bar
type bas
0 0 10 5 foo
20 10 30 20 foo
7 8 12 13 bas
78 124 100 256 bar
end
% ./snazzle
reading a snazzle file version 1.3
new defined snazzle type: foo
new defined snazzle type: bar
new defined snazzle type: bas
new snazzle: 00105foo
new snazzle: 20103020foo
new snazzle: 781213bas
new snazzle: 78124100256bar
done with a snazzle file!
%

But that really isn't the best solution, as requiring all input files to have exactly one carriage return after data is a little unreasonable. We need
to make it more general. Specifically, we should allow any number of carriage returns between lines, which we'll do by defining an ENDLS rule that
matches one or more ENDL tokens, and swapping our grammar over to using it instead of using ENDL directly:

snazzle.y
%{
#include <cstdio>
#include <iostream>
using namespace std;

// stuff from flex that bison needs to know about:

extern int yylex();
extern int yyparse();
extern FILE *yyin;

void yyerror(const char *s);

%union {
int ival;
float fval;
char *sval;
}

// define the constant-string tokens:

%token SNAZZLE TYPE
%token END ENDL

// define the "terminal symbol" token types I'm going to use (in CAPS
// by convention), and associate each with a field of the union:
%token <ival> INT
%token <fval> FLOAT
%token <sval> STRING

%%
// the first rule defined is the highest-level rule, which in our
// case is just the concept of a whole "snazzle file":
snazzle:
header template body_section footer {
cout << "done with a snazzle file!" << endl;
}
;
header:
SNAZZLE FLOAT ENDLS {
cout << "reading a snazzle file version " << $2 << endl;
}
;
template:
typelines
;
typelines:
typelines typeline
| typeline
;
typeline:
TYPE STRING ENDLS {
cout << "new defined snazzle type: " << $2 << endl;
free($2);
}
;
body_section:
body_lines
;
body_lines:
body_lines body_line
| body_line
;
body_line:
INT INT INT INT STRING ENDLS {
cout << "new snazzle: " << $1 << $2 << $3 << $4 << $5 << endl;
free($5);
}
;
footer:
END ENDLS
;
ENDLS:
ENDLS ENDL
| ENDL ;
%%

Note that this didn't require any changes to the flex file -- the underlying tokens didn't change, just how we used them. And happily, this
works perfectly:

% make
flex snazzle.l
g++ snazzle.tab.c lex.yy.c -lfl -o snazzle
% cat in.snazzle
sNaZZle 1.3
type foo
type bar
type bas

0 0 10 5 foo
20 10 30 20 foo
7 8 12 13 bas

78 124 100 256 bar

end

line numbers
Next little tweak: wouldn't it have been nice if that parse error had given us the line to look at, so that we didn't have to guess-and-check the
grammar? Unfortunately Flex has no guaranteed way to get the line number (well, there's yylineno, but it's almost completely manual, and in
some situations makes your parser very slow, so you might as well just do it yourself). The best way to keep track of the line number is to have a
global variable that you update whenever you see a carriage return:

snazzle.l
%{
#include "snazzle.tab.h"
int line_num = 1;
%}
%%
[ \t] ;
sNaZZle { return SNAZZLE; }
type { return TYPE; }
end { return END; }
[0-9]+\.[0-9]+ { yylval.fval = atof(yytext); return FLOAT; }
[0-9]+ { yylval.ival = atoi(yytext); return INT; }
[a-zA-Z0-9]+ {
yylval.sval = strdup(yytext);
return STRING;
}
\n { ++line_num; return ENDL; }
. ;
%%

snazzle.y
%{
#include <cstdio>
#include <iostream>
using namespace std;

// stuff from flex that bison needs to know about:

extern int yylex();
extern int yyparse();
extern FILE *yyin;
extern int line_num;

void yyerror(const char *s);

%union {
int ival;
float fval;
char *sval;
}

// define the constant-string tokens:

%token SNAZZLE TYPE
%token END ENDL

// define the "terminal symbol" token types I'm going to use (in CAPS
// by convention), and associate each with a field of the union:
%token <ival> INT
%token <fval> FLOAT
%token <sval> STRING

int main(int, char**) {

// open a file handle to a particular file:
FILE *myfile = fopen("in.snazzle", "r");
// make sure it's valid:
if (!myfile) {
cout << "I can't open a.snazzle.file!" << endl;
return -1;
}
// set lex to read from it instead of defaulting to STDIN:
yyin = myfile;

// parse through the input until there is no more:

do {
yyparse();
} while (!feof(yyin));

void yyerror(const char *s) {

cout << "EEK, parse error on line " << line_num << "! Message: " << s << endl;
// might as well halt now:
exit(-1);
}

Which is pretty cool when you make a "type oops" definition in the middle of the body instead of where it's supposed to be:

% make
bison -d snazzle.y
flex snazzle.l
g++ snazzle.tab.c lex.yy.c -lfl -o snazzle
% cat in.snazzle
sNaZZle 1.3
type foo
type bar
type bas
0 0 10 5 foo
20 10 30 20 foo
type oops
7 8 12 13 bas
78 124 100 256 bar
end
% ./snazzle
reading a snazzle file version 1.3
new defined snazzle type: foo
new defined snazzle type: bar
new defined snazzle type: bas
new snazzle: 00105foo
new snazzle: 20103020foo
EEK, parse error on line 7! Message: syntax error
%

Of course, "syntax error" isn't very helpful, but at least you've got the line number. :)

TIPS
directly returning terminal characters
If you have individual characters that are also terminal symbols, you can have Flex return them directly instead of having to create a %token
for them:

\( { return '('; }
; { return ';'; }

You could also list a bunch of them in a single line with this shortcut:

[\{\}:;=,] { return yytext[0]; }

On the Bison side, you can also use them directly:

thing: FOO '(' BAR ';'

actions before the end of the grammar

The coolest thing I've discovered about Bison so far is that you can put action statements in the middle of the grammarinstead of just at the
end. This means you can get information as soon as it's read, which I find is very useful. For example, a C++ function call might look like this:

functioncall:
returntype name '(' args ')' body

You can always add actions at the end of the line, like this:

functioncall:
returntype name '(' args ')' body {
cout << "defined function '" << $2 << "'' << endl;
}

But then you get the function name after 'body' is evaluated, which means the whole thing has been parsed before you find out what it was!
However, you can embed the code block in the middle of the grammar and it will be run before 'body' is evaluated:

functioncall:
returntype name {
cout << "defining function '" << $2 << << endl;
}
'(' args ')' body

whitespace in flex/bison files

I've discovered that, unlike most unix tools, Flex and Bison are surprisingly lenient about whitespacing. This is really nice, because it's hard
enough to understand their syntax without the ability to reindent things. The following are all equivalent:

thing: FOO ':' BAR { cout << "found a foo-bar!" << endl; }
| BAS ':' BAR { cout << "found a bas-bar!" << endl; }

thing:
FOO ':' BAR {
cout << "found a foo-bar!" << endl;
}
|
BAS ':' BAR {
cout << "found a bas-bar!" << endl;
}

thing:
FOO ':' BAR {
cout << "found a foo-bar!" << endl; }
| BAS ':' BAR {
cout << "found a bas-bar!" << endl; }

avoiding -lfl and link problems with yywrap

If you know that you only want to parse one file at a time (usually a good assumption), you don't need to provide your own yywrap function
that would just return 1. Inside your flex file, you can say %option noyywrap and the generated output will define yywrap as a local macro
returning 1. This has the added bonus of removing the requirement to link with the flex libraries, since they're only needed to get the default
yywrap defined as a function returning 1.

renaming identifiers
Flex and Bison generate code that uses the global namespace, which means that if you ever try to have more than one in a single program
you're going to have identifier collisions. To get around that, you can tell both Flex and Bison to prefix their identifiers with a custom string you
specify.

In Flex, you can use either the command line option -P foo or the option syntax %option prefix="foo" in the first section of the file.

In Bison, you can use either the command line option -p foo or the option syntax %name-prefix "foo".

moving output
Similar to why you rename identifiers, you usually don't want to use the default output file names because multiple parsers will all step on
each other.

In Flex, you can specify either -o lex.foo.cc on the command line (it has to be before your input file!) or %option outfile="lex.foo.cc" in
the first section of the .l file.

In Bison, you can specify either -o "foo.tab.cc" on the command line or %output "foo.tab.cc" in the first section of the .y file. However,
Bison usually names its output after its input anyway; an input file named "foo.y" will already generate "foo.tab.c".

stalling in Windows
I have heard from a loyal reader (oh hai there Alena) that running these examples from Windows presents a logistical problem -- Visual C++
runs it in its own terminal, and when the program is done it immediately closes the terminal, so you can't see any output.

The suggestion is to add a call to the Windows-specific pause program, which stalls until you press enter. Thus, add this to your main
function after yylex/yyparse and before return:

int main(int, char**) {

...
yyparse();
system("pause");
return 0;
}

ADVANCED
start states
In a Flex/Bison grammer, it is almost impossible to allow multiline commenting such as C's /* .. */. What you really want is for the parser
go into a sort of "ignore everything" state when it sees "/*", and go back to normal when it sees "*/". Clearly you wouldn't want Bison to do this,
or else you'd have to put optional 'comment' targets all over every contruct in your syntax; you can see how it fell to Flex to implement some way
to do this.

Flex allows you to specify start states

, which are just regex pattern rules like any other but they're only matched if you're in a particular
state. The syntax for a pattern that should only be matched if we're in state 'FOO' looks like this:

<FOO>bar ; // we're in state FOO and we saw "bar"

Note that you come up with the state names -- they're not predefined or anything. Though when you create a state, you do have to declare it
in your Flex file's control section (that's the first section, before the first "%%"):

%x FOO

So how do you get into the state? There's a special Flex construct (I think it's ultimately a C macro) that goes into any regular code block to
get there:

bar { BEGIN(FOO); } // we saw a "bar", so go into the "FOO" state

And how about getting back out of that state and going back to where you were initially? Instead of somthing obvious (say, "END()"?), they
decided to make a default state called "INITIAL." Any Flex match pattern that doesn't have a state in front of it is assumed to be in the INITIAL
state. To get back to it, you just jump to it:

bas { BEGIN(INITIAL); } // jump back to normal pattern matching

Note that the BEGIN thing is for all intents and purposes normal C code, despite the fact that it's not. :) What that means is that you can treat it
like code and put it anywhere you want -- you can even make it conditional:

bar { do_something(); BEGIN(FOO); do_something_else(); }

..
bar { if (some_value == 42) BEGIN(FOO); }
..

Back to the original problem -- multiline commenting. Here's a way to do C-style block commenting using start states:

\/* { // start of a comment: go to a 'COMMENTS' state.

BEGIN(COMMENTS);
}
<COMMENTS>*\/ { // end of a comment: go back to normal parsing.
BEGIN(INITIAL);
}
<COMMENTS>\n { ++linenum; } // still have to increment line numbers inside of comments!
<COMMENTS>. ; // ignore every other character while we're in this state

Note that you can also have a Flex match pertain to multiple states by listing them all:
<MYSTATE,ANOTHERSTATE>foo { ... }

This occasionally comes in handy, such as keeping you from having to duplicate the line-counting pattern/code above.

reading gzipped input

This is, surprisingly, rather easy. The basic idea is to filter Flex's input through libz. libz is very nice because it will pass through, unchanged,
any input which isn't zipped -- so we don't need to check if the input is zipped and handle it differently!

For reference, the API to libz is available here (last I looked). I have been consistently surprised with how easy it is to use libz directly to avoid
silly hacks like "popen"ing "gunzip -c" and redirecting its output. Kudos to the libz team!

In the first section of your Flex file, you'll want to declare that you have a better input function, and then tell Flex it has to use yours instead:
extern int my_abstractread(char *buff, int buffsize);
#define YY_INPUT(buff, res, buffsize) (res = my_abstractread(buff, buffsize))

Somewhere or another (I put it in my top-level files because it doesn't have to go in with the Flex/Bison source) you need to define your input
function. Mine looks like this:

#include <zlib.h> // direct access to the gzip API

// (ew!) global variable for the gzip stream.

gzFile my_gzfile = 0;

int my_abstractread(char *buff, int buffsize) {

// called on a request by Flex to get the next 'buffsize' bytes of data

// from the input, and return the number of bytes read.

int res = gzread(my_gzfile, buff, buffsize);

if (res == -1) {
// file error!
abort();
}

return res;
}

Then, instead of using fopen, you use libz:

// libz is very nice in that it handles unzipped files as well, which

// means that no matter what the input file is we can just pass it
// through gzopen and not worry about it:
my_gzfile = gzopen(filename.c_str(), "r");
if (!my_gzfile) {
// file cannot be opened
abort();
}

REGULAR EXPRESSION OVERVIEW

You've almost used regexs before without knowing it, for example when you use wildcards on file names:

% ls *asdf*
asdf
asdf2
asdf3
new.asdf
new.asdf.mangled
%

The shell's idea of regular expressions isn't quite accurate with the "real" definition of regular expressions, but at least the idea is the same.
Here, we're telling ls to list all the files that have "asdf" somewhere in the name. We could ask for just the filesstarting with asdf by saying
"asdf*", or all the files ending with asdf with "*asdf". The asterisk basically means "anything can go here".

Regular expressions are really just the scientific deconstruction of such pattern matching. There's actually an entire O'Reilly book dedicated to
them, if you really want to see what they're all about. (I'd love to make a wisecrack about being great for insomnia, but right now it's ranked
#5,076 on Amazon's sales rank. Guess it can't be too unbearable!)

With that said, my little overview here is clearly not going to be exhaustive, but should give you the general idea. Flex's regular expressions
consist of "characters" and "meta-characters". "Characters" are interpreted literally as real text, whereas "meta-characters" change how the search
works. For example, listing "foo" as a regular expression will match exactly one string, which is "foo". But if you add the metacharacter "+" (which
means "one or more of the previous character") after the "f" to get "f+oo", it will match "foo", "ffoo", and "ffffffffffffffffoo". A table of meta-
characters follows:

metacharacter description

+ previous expression can match one or more times

* previous expression can match zero or more times

? previous expression can match zero or one time

. can match any character except the carriage return '\n'

I say "expression" above instead of just "character" because you can also make groups of things by enclosing whatever you want in
parentheses.

Brackets are very cool -- by saying "[abcde]", it will match any oneof the characters in the brackets, so you'll match "a" and "b" but not "ac".
If you add a plus after the closing bracket, though, then you willmatch "ac" as well. Brackets also allow negation: "[^abc]+" will match anything
that's not in the brackets, so "d" and "foo" would pass but not "b".

Most useful of all, brackets allow for ranges: "[a-e]" is the same as "[abcde]". So you will frequently see stuff like "[0-9]+" to match integer
values, or "[a-zA-Z]+" to match words, etc.

So how do you match an actual bracket, or an actual period, if the regular expression searcher is immediately going to think you're defining a
meta-character? By adding a backslash (\) in front of it! So "a\.b" will match the string "a.b" but not "acb", whereas "a.b" as a regular expression
will match both. And to be complete, backslash itself can be backslashed, so that "a\\." will match the string "a\b". (What fun!)

This stuff just takes some playing with to get used to. It's not really that bad. :)

ACKNOWLEDGEMENTS
Many people over the years have contacted me with valuable suggestions for this tutorial. In no particular order:

Pieter Nobel
Robert Selberg
Brian Zarnce
Philip Mulcahy
Federico Tomassetti
Brandon Ortiz
Bill Evans
Alena Kirillova
Brandon Finley
Ofer Biran

Chris verBurg
2018-12-16

(2019) An Introduction To Functional Programming With Go
No ratings yet
(2019) An Introduction To Functional Programming With Go
255 pages
Ada Syntax Card
No ratings yet
Ada Syntax Card
2 pages
PPT
No ratings yet
PPT
17 pages
An Introduction To Telephone Line Interfacing Using The PIC Microcontroller PDF
No ratings yet
An Introduction To Telephone Line Interfacing Using The PIC Microcontroller PDF
46 pages
2.15-2.16-LINEAR-REGRESSION
No ratings yet
2.15-2.16-LINEAR-REGRESSION
27 pages
GUI Application Development Using
No ratings yet
GUI Application Development Using
204 pages
(F) Lex & Bison/Yacc: Language Tools For C/C++ CS 550 Programming Languages
No ratings yet
(F) Lex & Bison/Yacc: Language Tools For C/C++ CS 550 Programming Languages
18 pages
Course Book Sde 26042020r3 Final
No ratings yet
Course Book Sde 26042020r3 Final
81 pages
Ada Reference Manual
100% (1)
Ada Reference Manual
582 pages
3.3 Prove lines are Perpendicular
No ratings yet
3.3 Prove lines are Perpendicular
12 pages
ALGEBRA (Practice Sheet) PDF
No ratings yet
ALGEBRA (Practice Sheet) PDF
132 pages
Huw Collingbourne - The Little Book of Delphi Programming - Learn To Program With Object Pascal (2020, Dark Neon)
No ratings yet
Huw Collingbourne - The Little Book of Delphi Programming - Learn To Program With Object Pascal (2020, Dark Neon)
194 pages
Trigo Identities and Equations and Graph
No ratings yet
Trigo Identities and Equations and Graph
16 pages
Algorithms - Assignment - UNIT 4
No ratings yet
Algorithms - Assignment - UNIT 4
28 pages
KPSEA 2023 Mathematics
No ratings yet
KPSEA 2023 Mathematics
12 pages
All Answers
67% (3)
All Answers
63 pages
Java Fundamentals
No ratings yet
Java Fundamentals
41 pages
Screenshot 2024-04-10 at 18.20.23
No ratings yet
Screenshot 2024-04-10 at 18.20.23
23 pages
Matrix Laboratory
100% (1)
Matrix Laboratory
66 pages
Web Programming Lab MCA 4TH Sem
No ratings yet
Web Programming Lab MCA 4TH Sem
72 pages
CSE322 Formal Languages and Automation Theory 150+ MCQ Questions
No ratings yet
CSE322 Formal Languages and Automation Theory 150+ MCQ Questions
70 pages
Ada Programming Language
100% (1)
Ada Programming Language
32 pages
Java Reflection Tutorial PDF
100% (2)
Java Reflection Tutorial PDF
29 pages
World Distribution of Thunderstrom Days Part2
No ratings yet
World Distribution of Thunderstrom Days Part2
100 pages
Psycho Py Manual
No ratings yet
Psycho Py Manual
162 pages
Programming in C&DS TextBOOK 1
No ratings yet
Programming in C&DS TextBOOK 1
129 pages
Programming in Java
No ratings yet
Programming in Java
359 pages
C Language Introduction
100% (1)
C Language Introduction
192 pages
ANTLR Mega Tutorial
100% (1)
ANTLR Mega Tutorial
70 pages
Python 101 - Introduction To Python
No ratings yet
Python 101 - Introduction To Python
81 pages
Java Reflection Tutorial
100% (2)
Java Reflection Tutorial
40 pages
1869-Article Text-5585-1-10-20220804
No ratings yet
1869-Article Text-5585-1-10-20220804
9 pages
Components of Android
No ratings yet
Components of Android
36 pages
University of Cambridge Department of Physics
No ratings yet
University of Cambridge Department of Physics
50 pages
BS en Iso 7096-2008
No ratings yet
BS en Iso 7096-2008
30 pages
C# Is Modern and Easy
No ratings yet
C# Is Modern and Easy
83 pages
CPT 168 HW#13 Answer Key
0% (1)
CPT 168 HW#13 Answer Key
6 pages
Details of Using SIMULINK
No ratings yet
Details of Using SIMULINK
54 pages
Data Structures Using PDF
No ratings yet
Data Structures Using PDF
81 pages
Introduction To Compiler Design-Unit I
No ratings yet
Introduction To Compiler Design-Unit I
30 pages
F5 AM Chap 4 Vector 03
No ratings yet
F5 AM Chap 4 Vector 03
23 pages
Borland Object Pascal Language Guide
No ratings yet
Borland Object Pascal Language Guide
250 pages
C++ Chapter 01 - Introduction
100% (1)
C++ Chapter 01 - Introduction
23 pages
1720 Chapter 10 Notes - Stewart
No ratings yet
1720 Chapter 10 Notes - Stewart
19 pages
Assembly Programming Journal 7
100% (1)
Assembly Programming Journal 7
108 pages
Kotlin Vs Java
No ratings yet
Kotlin Vs Java
3 pages
Aiza Chinchin Angel Concon Aiza Chinchin Angel Concon
No ratings yet
Aiza Chinchin Angel Concon Aiza Chinchin Angel Concon
3 pages
Sat Math Practice Test 13
No ratings yet
Sat Math Practice Test 13
8 pages
Return To Basic Reference Manual: Gordon Henderson C 2012
No ratings yet
Return To Basic Reference Manual: Gordon Henderson C 2012
73 pages
HTML Internet Programing Cse 326 PDF
No ratings yet
HTML Internet Programing Cse 326 PDF
6 pages
PSR-2 - Coding Style Guide - PHP-FIG
No ratings yet
PSR-2 - Coding Style Guide - PHP-FIG
24 pages
Pli New 22012006
No ratings yet
Pli New 22012006
119 pages
311 Fall 2017 Final Exam
No ratings yet
311 Fall 2017 Final Exam
10 pages
Be Sharp With C# (Chapter 1, Controls and Code)
0% (1)
Be Sharp With C# (Chapter 1, Controls and Code)
21 pages
DUDLEYs HANDBOOK OF PRACTICAL GEAR DESIG Part189
No ratings yet
DUDLEYs HANDBOOK OF PRACTICAL GEAR DESIG Part189
1 page
Chapter 6 Dummy Variable Assignment
100% (1)
Chapter 6 Dummy Variable Assignment
4 pages
Ubd
No ratings yet
Ubd
8 pages
Laura Lynch: Objective
No ratings yet
Laura Lynch: Objective
2 pages
Notes:: National Exams 2016
No ratings yet
Notes:: National Exams 2016
6 pages
Good Questions To Ask Assembly
100% (2)
Good Questions To Ask Assembly
12 pages
C Programming On Linux
100% (3)
C Programming On Linux
18 pages
LECTURE 1 - Inroduction To OOPs
100% (1)
LECTURE 1 - Inroduction To OOPs
22 pages
Computer Programming 1st Year 3rd Unit
No ratings yet
Computer Programming 1st Year 3rd Unit
55 pages
XDS - Modula 2.IDE - User.guide - en
No ratings yet
XDS - Modula 2.IDE - User.guide - en
61 pages
Foundation - Apple Developer Documentation
100% (1)
Foundation - Apple Developer Documentation
4 pages
Graphing and Solving Inequalities
No ratings yet
Graphing and Solving Inequalities
9 pages
Class 9 - SCT
No ratings yet
Class 9 - SCT
9 pages
Getting Started With Arduino and Go by Agus Kurniawan
No ratings yet
Getting Started With Arduino and Go by Agus Kurniawan
79 pages
Workshop Calculation Science
No ratings yet
Workshop Calculation Science
28 pages
Fresnel Biprism
No ratings yet
Fresnel Biprism
3 pages
Beginners Python Cheat Sheet PCC BW
No ratings yet
Beginners Python Cheat Sheet PCC BW
2 pages
List of Practicals - CSHP101 S/W Lab Based On CSHT 101 Programming Fundamentals - C++
No ratings yet
List of Practicals - CSHP101 S/W Lab Based On CSHT 101 Programming Fundamentals - C++
4 pages
GTK+ GUI Programming: Ori Idan Helicon Technologies
100% (1)
GTK+ GUI Programming: Ori Idan Helicon Technologies
19 pages
DLL Quarter 4 Week 2 MATHEMATICS 3
No ratings yet
DLL Quarter 4 Week 2 MATHEMATICS 3
3 pages
Getting Full Speed With Delphi
No ratings yet
Getting Full Speed With Delphi
25 pages
Programming Paradigm
No ratings yet
Programming Paradigm
14 pages
Esc - 1991 - Vol2 - Page710 - Ward - Manipulating Hardware With C
100% (1)
Esc - 1991 - Vol2 - Page710 - Ward - Manipulating Hardware With C
6 pages
Kinetic and Potential Energy HW 1 Answers
No ratings yet
Kinetic and Potential Energy HW 1 Answers
3 pages
C# Tutorial
No ratings yet
C# Tutorial
5 pages
Tutorial Regression
No ratings yet
Tutorial Regression
2 pages
Lab Work: Lab 2: Structured Program Development in C
No ratings yet
Lab Work: Lab 2: Structured Program Development in C
5 pages
Part 1 - Introduction
No ratings yet
Part 1 - Introduction
5 pages
Practical Exercise1
No ratings yet
Practical Exercise1
3 pages
Android Studio 3.2 Development Essentials - Kotlin Edition: Developing Android 9 Apps Using Android Studio 3.2, Kotlin and Android Jetpack
From Everand
Android Studio 3.2 Development Essentials - Kotlin Edition: Developing Android 9 Apps Using Android Studio 3.2, Kotlin and Android Jetpack
Neil Smyth
No ratings yet
BlueJ Programming: learn lots of logic based skill of BlueJ
From Everand
BlueJ Programming: learn lots of logic based skill of BlueJ
SK Wasim Ali
No ratings yet
Android Studio 3.6 Development Essentials - Kotlin Edition: Developing Android 10 (Q) Apps Using Android Studio 3.6, Kotlin and Android Jetpack
From Everand
Android Studio 3.6 Development Essentials - Kotlin Edition: Developing Android 10 (Q) Apps Using Android Studio 3.6, Kotlin and Android Jetpack
Neil Smyth
No ratings yet
Java EE 7 Development with WildFly
From Everand
Java EE 7 Development with WildFly
Francesco Marchioni
No ratings yet
Learn C Programming through Nursery Rhymes and Fairy Tales: Classic Stories Translated into C Programs
From Everand
Learn C Programming through Nursery Rhymes and Fairy Tales: Classic Stories Translated into C Programs
Shari Eskenas
No ratings yet
Building Websites with VB.NET and DotNetNuke 4
From Everand
Building Websites with VB.NET and DotNetNuke 4
Daniel N. Egan
1/5 (1)
Google Cloud Platform Complete Self-Assessment Guide
From Everand
Google Cloud Platform Complete Self-Assessment Guide
Gerardus Blokdyk
1/5 (1)

Flex and Bison

Uploaded by

Flex and Bison

Uploaded by

www.aquamentus.

WHAT ARE FLEX AND BISON?

lex vs. flex, yacc vs. bison

what we'll do!

a header that just says "sNaZZle" followed by a version number,

Flex and Bison files have three sections:

1. the first is sort of "control" information,

This example can be compiled by running this:

% g++ lex.yy.c -o snazzle

[ \t] ; // ignore all whitespace

\n { ++linenum; } // increment line count, but don't return a token

reading from a file

// open a file handle to a particular file:

// lex through the input:

// Declare stuff from Flex that Bison needs to know about:

void yyerror(const char *s);

int main(int, char**) {

// Parse through the input:

void yyerror(const char *s) {

#include "snazzle.tab.h" // to get the token types from Bison

// stuff from flex that bison needs to know about:

void yyerror(const char *s);

int main(int, char**) {

void yyerror(const char *s) {

// stuff from flex that bison needs to know about:

void yyerror(const char *s);

// define the constant-string tokens:

// the first rule defined is the highest-level rule, which in our

int main(int, char**) {

// Parse through the input:

void yyerror(const char *s) {

This is compiled and run just like the others:

lex.yy.c: snazzle.l snazzle.tab.h

snazzle: lex.yy.c snazzle.tab.c snazzle.tab.h

forcing carriage returns

// stuff from flex that bison needs to know about:

// define the constant-string tokens:

int main(int, char**) {

// Parse through the input:

void yyerror(const char *s) {

But when we go to run it, we get a parse error:

And if we remove it, snazzle would be happy:

// stuff from flex that bison needs to know about:

void yyerror(const char *s);

// define the constant-string tokens:

78 124 100 256 bar

// stuff from flex that bison needs to know about:

void yyerror(const char *s);

// define the constant-string tokens:

int main(int, char**) {

// parse through the input until there is no more:

void yyerror(const char *s) {

[\(\)\{\}:;=,] { return yytext[0]; }

On the Bison side, you can also use them directly:

thing: FOO '(' BAR ';'

actions before the end of the grammar

whitespace in flex/bison files

avoiding -lfl and link problems with yywrap

int main(int, char**) {

Flex allows you to specify start states

<FOO>bar ; // we're in state FOO and we saw "bar"

bar { BEGIN(FOO); } // we saw a "bar", so go into the "FOO" state

bas { BEGIN(INITIAL); } // jump back to normal pattern matching

bar { do_something(); BEGIN(FOO); do_something_else(); }

\/* { // start of a comment: go to a 'COMMENTS' state.

reading gzipped input

#include <zlib.h> // direct access to the gzip API

// (ew!) global variable for the gzip stream.

int my_abstractread(char *buff, int buffsize) {

// called on a request by Flex to get the next 'buffsize' bytes of data

int res = gzread(my_gzfile, buff, buffsize);

Then, instead of using fopen, you use libz:

// libz is very nice in that it handles unzipped files as well, which

REGULAR EXPRESSION OVERVIEW

+ previous expression can match one or more times

? previous expression can match zero or one time