COMPILER CONSTRUCTION
Tapodhan Singla*, Varun Vashishtha, Sumeet Singh
Computer Science and Engineering Department, Maharishi Dayanand University, Rohtak, Haryana, India
*Corresponding Author
Abstract:
Compiler construction is a widely used software engineering exercise, but because most students will not become compiler
writers, care must be taken to make it relevant in a core curriculum. The course is suitable for advanced undergraduate
and beginning graduate students. Auxiliary tools, such as generators and interpreters, often hinder learning: students
have to fight tool idiosyncrasies, mysterious errors, and other issues of little educational value. The course is intended both to provide
general knowledge of compiler design and implementation and to serve as a springboard to more advanced courses.
Although this paper concentrates on the implementation of a compiler, an outline for an advanced topics course that
builds upon the compiler is also presented. We introduce a set of tools especially designed or improved for educational
compiler-construction projects in C.
1. INTRODUCTION
A good course in compiler construction is hard to design. The main problem is time. Many courses assume C or some
similarly low-level language as both the source and implementation language. This assumption leads in one of two
directions. Either a rich source language is defined and the compiler is not completed, or the source and target languages
are drastically simplified in order to finish the compiler. Neither solution is particularly satisfying. If the compiler is not
completed, the course cannot be considered a success: some topics are left untaught, and the students are left unsatisfied.
If the compiler is completed with an oversimplified source language, the compiler is unrealistic on theoretical grounds
since the semantics of the language are weak, and if the compiler generates code for a simplified target language, the
compiler is unrealistic on practical grounds since the emitted code does not run on real hardware.
A computer, however, interprets sequences of particular instructions, not program texts. Therefore, the program text
must be translated into a suitable instruction sequence before it can be processed by a computer. This translation can be
automated, which implies that it can be formulated as a program itself. The translation program is called a compiler, and
the text to be translated is called source code. Compilers and operating systems constitute the basic interfaces between a
programmer and the machine. A compiler is a program that converts a high-level programming language into a low-level
programming language, that is, source code into machine code. It focuses attention on the basic relationships between languages
and machines. Understanding these relationships eases the inevitable transitions to new hardware and programming
languages and improves a person's ability to make appropriate trade-offs in design and implementation. Many of the
techniques used to construct a compiler are useful in a wide variety of applications involving symbolic data.
The term compilation denotes the conversion of an algorithm expressed in a human-oriented source language to an
equivalent algorithm expressed in a hardware-oriented target language. We shall be concerned with the engineering of
compilers: their organization, algorithms, data structures and user interfaces. Programming languages are tools used to
construct formal descriptions of finite computations (algorithms). Each computation consists of operations that transform
a given initial state into some final state. A programming language provides essentially three components for describing
such computations:
▪ Data types, objects and values with operations defined upon them.
▪ Rules fixing the chronological relationships among specified operations.
▪ Rules fixing the static structure of a program.
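As a concrete illustration (our own, not taken from the original text), even a tiny C fragment exhibits all three components:

#include <stdio.h>

int main(void)
{
    int i, sum = 0;              /* data types, objects and values */
    for (i = 1; i <= 3; i++)     /* rules fixing the chronological order of operations */
        sum = sum + i;           /* an operation defined upon the type int */
    printf("%d\n", sum);         /* prints 6 */
    return 0;                    /* the function body is part of the program's static structure */
}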
A. Availability:
Lex and yacc were both developed at Bell Laboratories in the 1970s. Yacc was the first of the two, developed by
Stephen C. Johnson. Lex was designed by Mike Lesk and Eric Schmidt to work with yacc. Both lex and yacc have been
standard UNIX utilities since 7th Edition UNIX. System V and older versions of BSD use the original AT&T versions,
while the latest version of BSD uses flex and Berkeley yacc. The articles written by the developers remain the primary
source of information on lex and yacc.
During the first phase the compiler reads the input and converts strings in the source to tokens. With regular expressions
we can specify patterns to lex so it can generate code that will allow it to scan and match strings in the input. Each pattern
specified in the input to lex has an associated action. Typically an action returns a token that represents the matched string
for subsequent use by the parser. Initially we will simply print the matched string rather than return a token value.
The following represents a simple pattern, composed of a regular expression, that scans for identifiers. Lex will read this
pattern and produce C code for a lexical analyzer that scans for identifiers.

letter(letter|digit)*

This pattern matches a string of characters that begins with a single letter followed by zero or more letters or digits. This
example nicely illustrates operations allowed in regular expressions:
▪ repetition, expressed by the "*" operator
▪ alternation, expressed by the "|" operator
▪ concatenation
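A minimal, complete lex specification built around this pattern might look as follows; the printing action and the catch-all rule are our own additions, and %option noyywrap assumes flex:

%{
/* Sketch: a lex specification for the identifier pattern above.
 * Each matched identifier is printed; everything else is skipped. */
#include <stdio.h>
%}
%option noyywrap
letter  [A-Za-z]
digit   [0-9]
%%
{letter}({letter}|{digit})*   { printf("identifier: %s\n", yytext); }
.|\n                          { /* skip non-identifier input */ }
%%
int main(void)
{
    yylex();    /* scan stdin until end of file */
    return 0;
}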
Any regular expression may be expressed as a finite state automaton (FSA). We can represent an FSA using
states and transitions between states. There is one start state and one or more final or accepting states.
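To make the correspondence concrete, here is a hand-coded two-state FSA in C for the identifier pattern; the function name and the state encoding are our own illustrative choices:

#include <ctype.h>
#include <stdio.h>

/* Hand-coded FSA for letter(letter|digit)*.
 * State 0 is the start state; state 1 is the single accepting state. */
static int is_identifier(const char *s)
{
    int state = 0;
    for (; *s; s++) {
        if (state == 0 && isalpha((unsigned char)*s))
            state = 1;                 /* transition on the leading letter */
        else if (state == 1 && (isalpha((unsigned char)*s) ||
                                isdigit((unsigned char)*s)))
            state = 1;                 /* loop in the accepting state */
        else
            return 0;                  /* no transition defined: reject */
    }
    return state == 1;                 /* accept iff we end in a final state */
}

int main(void)
{
    printf("%d %d %d\n",
           is_identifier("x42"),       /* 1: a letter, then letters/digits */
           is_identifier("42x"),       /* 0: starts with a digit */
           is_identifier(""));         /* 0: never leaves the start state */
    return 0;
}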
B. Grammar:
For some applications, the simple kind of word recognition we've already done may be more than adequate; others need
to recognize specific sequences of tokens and perform appropriate actions. Traditionally, a description of such a set of
actions is known as a grammar.
When you use a lex scanner and a yacc parser together, the parser is the higher-level routine. It calls the lexer, yylex(),
whenever it needs a token from the input. The lexer then scans through the input recognizing tokens. As soon as it finds
a token of interest to the parser, it returns to the parser, returning the token's code as the value of yylex(). Not all tokens
are of interest to the parser; in most programming languages the parser doesn't want to hear about comments and white
space, for example. For these ignored tokens, the lexer doesn't return to the parser, so that it can continue on to the next token without
bothering the parser. The lexer and the parser have to agree on what the token codes are. We solve this problem by letting
yacc define the token codes. The tokens in our grammar are the parts of speech: NOUN, PRONOUN, VERB, ADVERB,
ADJECTIVE, PREPOSITION, and CONJUNCTION. Yacc defines each of these as a small integer using a preprocessor
#define; here are the definitions it used in this example:
#define NOUN 257
#define PRONOUN 258
#define VERB 259
#define ADVERB 260
#define ADJECTIVE 261
#define PREPOSITION 262
#define CONJUNCTION 263
Token code zero is always returned for the logical end of the input. Yacc doesn't define a symbol for it, but you can
define one yourself if you want.
... same add_word() and lookup_word() as before ...
There are several important differences here. We've changed the part-of-speech names used in the lexer to agree with
the token names in the parser. We have also added return statements to pass to the parser the token codes for the words
that it recognizes. There aren't any return statements for the tokens that define new words to the lexer, since the parser
doesn't care about them.
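A sketch of such a lexer is shown below; the hard-coded word lists stand in for the example's add_word()/lookup_word() word table, and y.tab.h is the header that yacc -d generates from the token definitions shown above:

%{
/* Sketch: a lexer that returns token codes to the yacc parser.
 * Fixed word lists replace the example's dynamic word table. */
#include "y.tab.h"    /* token codes: NOUN, PRONOUN, VERB, ... */
%}
%option noyywrap
%%
[ \t]+          ;                 /* ignored tokens: no return, keep scanning */
dog|cat|bird    { return NOUN; }
he|she|it       { return PRONOUN; }
runs|eats|flies { return VERB; }
\n              { return 0; }     /* logical end of input for the parser */
.               ;                 /* skip anything else in this sketch */
%%

Running yacc -d on the grammar and lex on this specification, then compiling y.tab.c and lex.yy.c together, yields the complete recognizer.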
A Yacc Parser
Example 1-7 introduces our first cut at the yacc grammar.
Example 1-7: Simple yacc sentence parser
%{
/*
 * A parser for the basic grammar to use for recognizing English sentences.
 */
#include <stdio.h>
%}

%token NOUN PRONOUN VERB ADVERB ADJECTIVE PREPOSITION CONJUNCTION

%%
sentence: subject VERB object { printf("Sentence is valid.\n"); }
        ;
subject:  NOUN | PRONOUN
        ;
object:   NOUN
        ;
%%

extern FILE *yyin;

main()
{
    while (!feof(yyin)) {
        yyparse();
    }
}

yyerror(s)
char *s;
{
    fprintf(stderr, "%s\n", s);
}
The structure of a yacc parser is, not by accident, similar to that of a lex lexer. Our first section, the definition section, has
a literal code block, enclosed in "%{" and "%}". We use it here for a C comment (as with lex, C comments belong inside
C code blocks, at least within the definition section) and a single include file.
C. Storage Management:
In this section we shall discuss management of storage for collections of objects, including temporary variables, during
their lifetimes. The important goals are the most economical use of memory and the simplicity of access functions to
individual objects. Source language properties govern the possible approaches, as indicated by the following questions:
▪ Is the exact number and size of all objects known at compilation time?
▪ Is the extent of an object restricted, and what relationships hold between the extents of distinct objects (e.g., are they
nested)?
▪ Does the static nesting of the program text control a procedure's access to global objects, or is access dependent upon
the dynamic nesting of calls?
An object with automatic extent is allocated storage in the activation record of the syntactic construct with which it is associated. The position of the object is
characterized by the base address, b, of the activation record and the relative location (offset), R, of its storage within the
activation record. R must be known at compile time, but b cannot be known (otherwise we would have static storage
allocation). To access the object, b must be determined at run time and placed in a register. R is then either added to the
register and the result used as an indirect address, or R appears as the constant in a direct access function of the
form 'register + constant'. The extension, which may vary in size from activation to activation, is often called the second-order
storage of the activation record. Storage within the extension is always accessed indirectly via information held in
the static part; in fact, the static part of an object may consist solely of a pointer to the dynamic part.
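A minimal C sketch of this 'register + constant' access function, with b and R named after the text (the frame layout is an invented stand-in for a real activation record):

#include <stdio.h>

int main(void)
{
    /* Pretend activation record, aligned for any scalar member. */
    union { char bytes[64]; long align; } frame;
    size_t R = 8;                   /* relative location, fixed at compile time */
    char *b = frame.bytes;          /* base address, determined at run time */
    int *obj = (int *)(b + R);      /* access function: base + offset */
    *obj = 42;
    printf("object at b+%zu = %d\n", R, *obj);
    return 0;
}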
D. Error Handling:
Error handling is concerned with failures due to many causes: errors in the compiler or its
environment (hardware, operating system), design errors in the program being compiled, an
incomplete understanding of the source language, transcription errors, incorrect data, etc. The tasks of the error handling
process are to detect each error, report it to the user, and possibly make some repair to allow processing to continue. It
cannot generally determine the cause of the error, but can only diagnose the visible symptoms. Similarly, any repair cannot
be considered a correction (in the sense that it carries out the user's intent); it merely neutralizes the symptom so that
processing may continue. The purpose of error handling is to aid the programmer by highlighting inconsistencies. It has
a low frequency in comparison with other compiler tasks, and hence the time required to complete it is largely irrelevant,
but it cannot be regarded as an 'add-on' feature of a compiler. Its influence upon the overall design is pervasive, and it is a
necessary debugging tool during construction of the compiler itself. Proper design and implementation of an error handler,
however, depends strongly upon complete understanding of the compilation process. This is why we have deferred
consideration of error handling until now. It is perhaps useful to make a distinction between the correctness of a system
and its reliability. The former property is derived from certain assumptions regarding both the primitives upon which the
system is based and the inputs that drive it. For example, program verification techniques might be used to prove that a
certain compiler will produce correct object programs for all source programs obeying the rules of the source language.
This would not be a useful property, however, if the compiler collapsed whenever some illegal source program was
presented to it. Thus we are more interested in the reliability of the compiler: its ability to produce useful results under the
weakest possible assumptions about the quality of the environment, input data and human operator.
Proper error handling techniques contribute to the reliability of a system by providing it with a means for dealing with
violations of some assumptions on which its design was based.
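The report-and-continue policy described above can be sketched as follows; the helper name and the line tracking are our own assumptions, not taken from any particular compiler:

#include <stdio.h>

static int error_count = 0;   /* number of symptoms reported so far */
static int line_no = 1;       /* position information for the report */

/* Report a symptom and let processing continue; no claim is made
 * that the underlying error has been corrected. */
static void report_error(const char *symptom)
{
    error_count++;
    fprintf(stderr, "line %d: error: %s\n", line_no, symptom);
}

int main(void)
{
    report_error("undeclared identifier");        /* a symptom, not a cause */
    line_no = 7;
    report_error("type mismatch in assignment");
    fprintf(stderr, "%d error(s); no code generated\n", error_count);
    return error_count != 0;
}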
Most compilers simply report the symptom and let the user perform the diagnosis. An error is detectable if and only if it
results in a symptom that violates the definition of the language. This means that the error handling procedure is dependent
upon the language definition, but independent of the particular source program being analyzed. For example, a spelling
error in an identifier will be detectable in LAX (provided that it does not result in another declared identifier) but not in
FORTRAN, which will simply treat the misspelling as a new implicit declaration. We shall use the term anomaly to denote
something that appears suspicious, but that we cannot be certain is an error. Anomalies cannot be derived mechanically
from the language definition, but require some exercise of judgement on the part of the implementor. As experience is
gained with users of a particular language, one can spot frequently occurring errors and report them as anomalies before
their symptoms arise.
7. CONCLUSION
This report outlines a course in compiler construction. The implementation and source language is Scheme, and the target
language is assembly code. This choice of languages allows a direct-style, stack-based compiler to be implemented by an
undergraduate in one semester that touches on more aspects of compilation than a student is likely to see in a compiler
course for more traditional languages. Furthermore, expressiveness is barely sacrificed; the compiler can be bootstrapped,
provided there is enough run-time support. Besides covering basic compilation issues, the course yields an implemented
compiler that can serve as a test bed for coursework in language implementation. The compiler has been used, for example,
to study advanced topics such as the implementation of first-class continuations and register allocation.