assignment1
assignment1
• Do not plagiarize or turn in solutions from other sources. You will be penalized if caught.
Submission
• Your solution should use Flex. We recommend using flex and not flex++. You can include
C++ STL code in the flex specification, and use g++ to compile the generated C file.
Do NOT use alternate software like RE/flex, JFlex, and PLY for this assignment.
• Upload a zip file named “hrolli-assign1.zip”. The zipped file should contain a directory
hrolli-assign1, two sub-directories problem1 and problem2, and the following files:
• We encourage you to use the LATEX typesetting system for generating the PDF file.
• You will get up to two late days to submit your assignment, with a 25% penalty for each
day.
Evaluation
• Write your code such that the exact output format (if any) is respected.
• We will evaluate your implementations on a Unix-like system, for example, a recent Debian-
based distribution. Check for compilation and linking errors.
• We will evaluate the implementations with our own inputs and test cases (where applicable),
so remember to test thoroughly.
1
Problem 1 [40 marks]
The following is the lexical specification for a new language “Kanpur”. The file extension for
Kanpur programs is .knp. Write a lexical analyzer (i.e., scanner) for Kanpur programs using Flex.
Name the flex specification prob1.l. Any illegal character encountered should be reported along
with the line number in the source code.
Keywords ARRAY BEGIN BOOLEAN COMMENT CONTINUE DO DOUBLE ELSE END FALSE FOR IF INTEGER
LABEL LIST LONG OWN PROCEDURE STEP SWITCH THEN TRUE UNTIL VALUE WHILE
Operators AND OR LEQ LT GEQ GT NOT EQL NEQ := + − * / % ^ | &
« »
Identifier A valid identifier begins with a letter and is followed by any number of occurrences of
digits and letters.
Strings A string literal is a collection of one or more ASCII characters encapsulated either in
single quotes or double quotes. A string literal with double quotes cannot have single quotes
included in the literal, and vice versa. Strings can be multiline.
Delimiters ; : , ' " [ ] { } ( )
The character ' is a single quote and " is a double quote.
Numeric literals Numeric literals can be decimal (base 10) or hexadecimal (base 16), and are
whole numbers. Leading 0s are not allowed. For example, 0 is valid but numbers of the form
0001 and 00 are invalid. Hex literals start with 0x or 0X followed by one or more hex digits.
As before, 0x0 is valid, but 0x00 is invalid.
A floating point number is formed by one or more digits before the decimal point followed
by at least one and up to six digits after the decimal. For example, floating point numbers
of the form .01, 1., and 123.4567890 are ill-formed. Floating point numbers are only in base
10.
Comments Comments are any text between { and }. The first occurrence of the terminating
character } ends the comment; comments can be multiline but cannot be nested.
White Space White space is defined as the ASCII space character, horizontal tab character, form
feed character, and line terminator characters.
Blanks, tabs, ends of lines, and comments are considered to be delimiters. The Lexical
Analyzer consumes these but does not return anything.
Other rules • Keywords and operators are case-insensitive (i.e., UNTIL and until have the
same meaning).
• All the numbers are unsigned (i.e., whole numbers).
The full set of tokens are KEYWORD, OPERATOR, IDENTIFIER, STRING, DELIMITER, IN-
TEGER, FLOATING_POINT, and HEXADECIMAL. Check the public test cases to learn about the
usage. The meaning of the programs is not important for this assignment.
• The output should list and classify all unique lexemes into proper syntactic categories (as
enumerated above).
2
• The output should be sorted by LEXEME (do not change case while printing). Follow the
conventions in the sample output. Not abiding by the output rules will attract a minimum
of 20% marks as a penalty.
• Your scanner can terminate after reporting the first error. You may optionally continue
beyond the first error.
Write a lexer for Fortran 2008. Name the flex specification prob2.l. Letters, digits, underscore,
and special characters (Table 3.1) make up the Fortran character set. Any illegal character en-
countered should be reported along with the line number in the source code.
We list the constraints on the specification in the following. Refer to Chapter 3 in the linked
PDF for more details.
Keywords allocatable allocate assign associate asynchronous backspace bind call case
class close common contains continue cycle codimension contiguous critical data
deallocate deferred dimension do else elsewhere end endfile endif entry equivalence
exit? external enum enumerator extends forall final flush format function goto
generic import if implicit inquire intrinsic include interface intent lock module
nopass namelist nullify open only operator optional parameter pause print program
pointer private pass protected procedure public read return recursive result rewind
rewrite save stop subroutine select sequence submodule target then use value unlock
volatile while write.
3
Names A valid name begins with a letter and is followed by more letters, digits, or underscore.
The maximum length of a name is 63.
Literals Consider only the rules for int-literal-constant, real-literal-constant, and logical-literal-
constant (Section 3.2.3). A string literal is a collection of one or more characters encapsulated
in double quotes. A string literal with double quotes cannot have single quotes included in
the literal. Strings can be multiline.
Delimiters ( ) / [ ] (/ /)
Comments If the first non-blank character on a line is an “!”, the line is a comment (Section
3.3.2.3). The character “!” initiates a comment except where it appears within a string. The
comment extends to the end of the line.
Special Characters All other characters that are listed in Table 3.1 but not mentioned above.
Other rules • Assume that keywords are reserved, and cannot be used as identifiers.
• Extend the operators to include && « » ||
• Fortran is case-insensitive, i.e., keywords and variables written in any case combination
are equivalent. Only strings are case-sensitive.
• Ignore labels (Section 3.2.5).
• Ignore nuances based on fixed-format layout.
Your lexer should tokenize input Fortran programs into the given token categories, and print
the count of each token along with the category. The full set of tokens are KEYWORD, NAME,
INT_LITERAL, REAL_LITERAL, LOGICAL_LITERAL, CHAR_LITERAL, OPERATOR, LABEL, DE-
LIMITER, and SPECIAL_CHAR.
The output requirements are the same as in Problem 1.