Low Level Virtual Machine C# Compiler Senior Project Proposal
GROUP MEMBERS
Prabir Shrestha (4915302)
Myo Min Zin (4845411)
Napaporn Wuthongcharernkun (4846824)
COMMITTEE MEMBERS
ADVISOR
1 Introduction
3 Scope
4 The Framework
6 References
7 Appendix
1 Introduction
could almost instantly convey to us information about the purpose of the system
design.
believe is due to its ability to effectively and easily model the real world objects and
their functionalities that we see around us in a way that machines can understand.
Modern high-level languages such as the source language we have focused on, C#,
more often than not contain a combination of all the above-listed programming
paradigms. In the newer versions that have been released, an increased ease of use in
However, the focus of our project will be primarily on the basic object-oriented
elements of the language, which capture the core constructs of the syntax and
Diversity in alternative usage is another factor of importance when there are large
project. Large existing compiler frameworks are widely in use for the C# language
such as Microsoft's .NET and Mono. These systems, however, are sometimes bulky
because of the feature sets they provide, including features that developers would
never use. Therefore, the practicality and usefulness of our project is seen as a small
The core objective of this project is to create a compiler for the C# language that
generates a portable, low-level intermediate representation, which can then be
used across a wide variety of architectures and operating systems with minimal or no
modification to the original source code. To accomplish this, the Low Level
Virtual Machine Intermediate Representation (LLVM IR) has been chosen as the
target output of the compiler because of its platform-independent nature.
1.1 Motivation
compiler.
Distributing binaries created by existing C# compilers requires the bulky
.NET Framework to be installed on the target machine. Even a traditional “hello world”
program requires the whole .NET Framework to be installed. To solve this problem we have taken the
approach of C and C++ which link the appropriate libraries required to the program
successfully.
The D language has also been one of our major inspirations: it provides programmers
with garbage collection and interfaces while still producing high-performance code suitable
for system programming [1], such as system drivers and even operating systems.
Operating system development has evolved over the past decades from
assembly code to high-level languages such as C and C++. Several
other projects, such as SharpOS [2], Cosmos and even Microsoft's research operating
system, Singularity [3], have taken a different approach by writing the kernel,
device drivers and applications in managed code. The compilers of these operating
systems have been the motivation to create a C# compiler that produces native code.
The “Write Once, Run Anywhere” (WORA) slogan from Sun Microsystems led us
to aim at generating portable code that can be used across a wide variety of operating
The way we write programs has been evolving ever since the beginning of the stored
program concept and continues to evolve today because of advances in
hardware and software. Since the introduction of Java, and now the .NET Framework,
the concepts of the virtual stack machine and Just-in-Time (JIT) compilation have been
gaining popularity. One of the notable languages that uses these concepts is C#. It
Even though Java bytecode and the Common Language Infrastructure (CLI) consist of
highly machine-independent code, they have not been candidates for system
languages such as C and C++, and due to the JIT. LLVM has a similar JIT concept:
the code is converted to compiled LLVM bitcode, which can then be executed on
other architectures and operating systems. In order to gain better performance for a
the popular architectures such as x86, x86-64, PowerPC, PowerPC-64, ARM, Thumb,
SPARC, Alpha, CellSPU, PIC16, MIPS, MSP430, SystemZ and XCore [4].
While languages such as C and C++ provide better execution speed than
C# and Java, programmers have to deal with unsafe code such as manual memory
management, which can lead to memory leaks or dangling pointers. These memory
problems are usually solved by garbage collection, as seen in C# and Java. C#
also introduces the concept of delegates, which avoids the use of unsafe function
pointers.
As developers have written code over the years, a set of common principles for
writing code has evolved. The use of accessors and mutators has become a
common way of accessing variables in the object-oriented world, rather than the use of
Because of features such as the memory management and the adhering to the
compiler.
code to each of those platforms. Languages such as C and C++ do not have a
straightforward way to know the size of an integer (32-bit or 64-bit), but C# provides an
1.3 Objectives
The objective of our project is to create a compiler for the C# language in which the
code called Low Level Virtual Machine Intermediate Representation (LLVM IR).
The focus of our project will be primarily on each phase of the compilation process,
from scanning the source language until target code generation. These phases include
Generator. Other phases such as assembling and linking will be handled by LLVM
tools.
The finalization and expected outcome of the project will be a compiler that is set to
The compiler will properly recognize the lexical structures of the C# language.
Check the syntax taking into account the correct grammar according to the
2 Literature Review
as well.
which provides a CTS (Common Type System) and CLS (Common Language
Language).
Low Level Virtual Machine (LLVM) is a compiler infrastructure that consists of two
optimizations of programs can occur at different phases of the program life such as
language containing a RISC-like instruction set that effectively captures the operations
physical registers and other low-level calling conventions. By increasing the layer of
abstraction apart from the hardware specifics in the code, the LLVM IR is in a sense,
hardware specifications.
The common code representation used throughout all phases of the LLVM
provides type safety, low-level operations and is flexible and capable of representing
A key factor contributing to the productivity of the LLVM system is its
virtual instruction set. The LLVM code is a low-level representation while being able
2.3 Contributions to C#
Other C# compiler projects besides Microsoft's .NET Framework
are discussed briefly here to give an overview of the relevant developments
in this field. These include Mono, Cosmos (IL2CPU) [6], Bartok
and Ensemble.
systems such as Linux, UNIX, Mac OS X and Solaris. The way it works is that
the C# code first gets compiled into MSIL, and then the Mono JIT translates the MSIL into
native code at run time, which is similar to the original implementation of the .NET
Framework by Microsoft.
translates the CIL into machine code by outputting raw assembly files which then get
Bartok was originally made for use in the Singularity operating system developed by Microsoft
Research. It translates CIL into native code using three intermediate
representations: HIR (High-level IR), MIR (Medium-level IR) and LIR (Low-level
IR). Starting from the high-level representation, it works its way down to
the low-level IR, gradually changing the code representation at each phase until it
reaches the lowest level, which is essentially assembly, and then a standard linker puts
3 Scope
The scope from the language specifications has been determined for our project
version 1.0. We have chosen version 1.0 rather than the newer versions of C# because
we will not be supporting most of the newer features such as generics,
3.1 Keywords
f(x) - testing += *
a[x] ! < -= /
==
!=
16-bit characters that can be used to represent most of the known written languages in
the world. For our C# compiler we will not implement this original definition;
instead, char will be 8 bits wide, the same as in standard C and C++. This
enumeration type such as byte, sbyte, short, ushort and int. The compiler will only be
Microsoft .NET has provided base class libraries, which are the classes, structures,
System.Enum System.String
We will provide our own libraries for the end user to assemble and link with the
output LLVM IR. The provided libraries will perform most of the
functions of the above .NET libraries. Should there be any exceptional cases, the
In our implementation of the compiler, using directives can be used only at the top
Optimization will not be taken into consideration during the generation of
LLVM IR.
4 The Framework
The compiler will be written in C# language using the Microsoft Visual Studio and
The scanner and parser are implemented using the automatic scanner and parser
generator Coco/R, which is also written in C#. To make the generation
of the scanner and parser easier, we have also created a Coco/R plugin which can be used
4.1 Scanner
Coco/R takes the attributed grammar of the source language and generates a
scanner and a recursive descent parser for that language. The scanner
generated by Coco/R reads the input stream and returns a stream of tokens to the
parser.
In the traditional view of compilation, scanning and parsing are seen as
two distinct processes occurring one after the other. With the
Coco/R tool, however, the scanner and parser are generated at the same time, because the
scanner and parser specifications are written in the same attributed grammar file ending
The purpose of the generated scanner is to perform lexical analysis on the source
program. It takes the text of the program as input, tokenizes it and
checks for lexical errors. Tokenization is the process of splitting the program text
into its basic building blocks, called tokens. Tokens usually include
identifiers, keywords, numbers and symbols; these are the fundamental building
blocks of a program.
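The tokenization step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the scanner that Coco/R generates: the keyword set, the token categories and the function name tokenize are assumptions made for this example.

```python
import re

# Hypothetical keyword subset; the real scanner derives its keywords
# from the attributed grammar file.
KEYWORDS = {"class", "int", "return", "if", "else", "while"}

# Each pair is (token kind, regular expression). Order matters: earlier
# alternatives win, and "error" catches any character nothing else accepts.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("ident", r"[A-Za-z_]\w*"),
    ("symbol", r"==|!=|<=|>=|\+=|-=|[-+*/=<>!(){};,.\[\]]"),
    ("skip", r"\s+"),
    ("error", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (kind, text) pairs; raise on an unrecognized character."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "skip":
            continue  # whitespace separates tokens but is not a token itself
        if kind == "error":
            raise SyntaxError(f"lexical error at offset {match.start()}: {text!r}")
        if kind == "ident" and text in KEYWORDS:
            kind = "keyword"
        tokens.append((kind, text))
    return tokens
```

For example, tokenize("int x = 10;") yields a keyword, an identifier, a symbol, a number and a final symbol, while a stray character such as @ is reported as a lexical error.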
4.2 Parser
The generated parser handles the syntax analysis for the source language. During the
syntax analysis phase, the concern is checking that the source program
adheres to the grammatical rules of the source language. There are two major
techniques for parsing: table-driven and recursive descent. The Coco/R tool deploys
simple, convenient and accomplishes the task efficiently so that the next
phase, semantic analysis, can begin. The top-down parsing technique, as the name
suggests, starts constructing the parse tree from the root and works
its way downwards, predicting for each incoming token which
production rule should be used and adding the result to the parse tree. The control flow
statements are used. However recursive subroutines are in effect as that is a primary
However in general for this parsing technique a basic requirement of the grammar is
LL(1) is an abbreviation for left-to-right scanning with leftmost derivation using only
one look-ahead symbol. The grammar of the source language that we have written
for our compiler, however, is not in LL(1) form. Coco/R offers
a number of solutions for grammars that are
not in LL(1) form. They are typically termed 'Conflict Resolvers' and include the
following.
2. Resolver Symbols
In this technique the Coco/R-generated parser uses two global variables that store the
last recognized terminal and the current look-ahead symbol. When the need arises to
look ahead more than one symbol, the generated scanner does this by using the
methods ResetPeek() and Peek(). The ResetPeek method resets peeking so that it
begins from the symbol after the current look-ahead symbol. The Peek method returns
the next symbol as a Token but does not remove it from the input stream, so these
To make it easier for us to look ahead more than one token, we have created a
which returns the n-th token after the current look-ahead token.
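The peeking behaviour described above can be illustrated with a small Python sketch. The Scanner class below is a toy stand-in for a generated scanner, and peek_n is a hypothetical name for a helper that returns the n-th token after the current look-ahead token; neither is taken from the project's actual code.

```python
class Scanner:
    """Toy stand-in for a generated scanner with Coco/R-style peeking."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._pos = 0       # index of the next token scan() will consume
        self._peek_pos = 0  # index of the next token peek() will return

    def scan(self):
        """Consume and return the next token (None at end of input)."""
        token = self._tokens[self._pos] if self._pos < len(self._tokens) else None
        self._pos += 1
        self._peek_pos = self._pos
        return token

    def reset_peek(self):
        """Restart peeking from the token after the last scanned one."""
        self._peek_pos = self._pos

    def peek(self):
        """Return the next peeked token without consuming any input."""
        in_range = self._peek_pos < len(self._tokens)
        token = self._tokens[self._peek_pos] if in_range else None
        self._peek_pos += 1
        return token


def peek_n(scanner, n):
    """Return the n-th token after the current look-ahead token (n >= 1)."""
    scanner.reset_peek()
    token = None
    for _ in range(n):
        token = scanner.peek()
    scanner.reset_peek()  # leave the scanner in a clean peeking state
    return token
```

Here the token most recently returned by scan() plays the role of the current look-ahead symbol, so peek_n(scanner, 1) is the token immediately after it, and peeking never changes what scan() returns next.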
Resolver Symbols
These are artificial tokens that are added into a separate section in the grammar to
help direct the parser in the correct way. They are inserted on the fly during parse
time as seen necessary by the resolution routine that is used by Coco/R. These
resolution routines are automatically put into the generated parser by Coco/R.
During the parsing phase, an Abstract Syntax Tree (AST) is generated. All the AST
nodes inherit from a common class called AstNode. Some AstNodes implement
AstBinaryExpression.
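As a rough illustration of such a node hierarchy, the Python sketch below models a common AstNode base class and an AstBinaryExpression node. The names AstNode and AstBinaryExpression come from the text; AstIntegerLiteral, AstIdentifier, the children method and count_nodes are assumptions introduced only for this example.

```python
class AstNode:
    """Common base class for all AST nodes."""

    def children(self):
        """Return the direct child nodes; leaves have none."""
        return []


class AstIntegerLiteral(AstNode):  # hypothetical leaf node
    def __init__(self, value):
        self.value = value


class AstIdentifier(AstNode):  # hypothetical leaf node
    def __init__(self, name):
        self.name = name


class AstBinaryExpression(AstNode):
    """An operator applied to a left and a right operand."""

    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        self.right = right

    def children(self):
        return [self.left, self.right]


def count_nodes(node):
    """Walk the tree recursively and count every node, including the root."""
    return 1 + sum(count_nodes(child) for child in node.children())
```

With this layout the expression n + 2 * 3 becomes a binary node whose right operand is itself a binary node, and later phases can walk the tree uniformly through the base class.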
Semantic analysis is the phase in the compilation process that follows the
parsing phase.
Once the scanning and parsing phases have been completed, the source
code has been checked for lexical and syntax errors. The next step is to check
that the program source code is semantically correct as well, as not all program
This task is aided by the semantic actions that are added onto the grammar in a format
For instance, types of errors that will be checked for during this phase are type
classes and methods within their respective scopes, initialization of variables and
fields.
Moreover, the source language C# does not allow the identifier to be used before it is
detected during compilation time; the compiler has to know the type information of a
information assigned to that identifier. In the later part of the program, when the
compiler examines the expression containing this identifier, it is verified by its type
In this example, the identifier x is used without being declared. When the compiler
encounters the expression x = 10, the types of the operands are compared and the
this expression, the compiler does not have the type information of x and will not be
able to perform any of these. It will then report a compile-time error to the
programmer.
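The check for use before declaration can be sketched as a symbol table lookup. The Python fragment below is a simplified illustration under the assumption of a single flat scope; SymbolTable, SemanticError and check_assignment are hypothetical names and do not describe the project's actual classes.

```python
class SemanticError(Exception):
    """Raised when the program violates a semantic rule."""


class SymbolTable:
    """Maps declared identifiers to their type names (single flat scope)."""

    def __init__(self):
        self._types = {}

    def declare(self, name, type_name):
        if name in self._types:
            raise SemanticError(f"'{name}' is already declared")
        self._types[name] = type_name

    def lookup(self, name):
        if name not in self._types:
            raise SemanticError(f"'{name}' is used before it is declared")
        return self._types[name]


def check_assignment(table, name, value_type):
    """Verify that 'name = <value of value_type>' is well typed."""
    target_type = table.lookup(name)  # fails if name was never declared
    if target_type != value_type:
        raise SemanticError(
            f"cannot assign {value_type} to '{name}' of type {target_type}"
        )
    return target_type
```

Checking x = 10 against an empty table raises a SemanticError because no type information exists for x, whereas the same check succeeds once x has been declared as int.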
Once the semantic analysis process has been completed the source program is ready
After the creation of AST and passing the semantic analysis, appropriate LLVM IR
following code.
Comments in LLVM IR begin with a semicolon and extend to the end of the line.
The line at the end of the sample generated LLVM IR contains the
declaration of the function called printf, whose first parameter is a pointer
As our generated code requires the use of system calls to the operating system to print
notify the operating system about writing the text in LLVM IR. Other features such as
returning the exit code to the operating system also require the use of system calls. This
assembly code in the LLVM IR. But to achieve portability among different systems
the code generator will make use of the Standard C Library which can be linked to the
Because the printf function exists in the Standard C Library, the body of the
printf function is not defined in the LLVM IR. Just as a function in C# can be called
before its declaration, LLVM IR allows a function to be called
before its declaration appears, as
shown in the generated LLVM IR, where the declaration is appended to the end of the code.
This code creates a global variable called .str, an array of 8-bit integers whose array
size is 4. @ denotes a global variable in LLVM. Since LLVM supports arbitrary bit
widths for integers, ranging from 1 bit to 2^23-1 bits (approximately 8 million), an explicit size
must be given in every integer type. (LLVM code generation does not support large
integer types as function return types. The specific limit on how large a
return type the code generator can currently handle is target dependent; currently it
is often 64 bits for 32-bit targets and 128 bits for 64-bit targets [8].) The string variable
uses 8-bit integers because the size of 'char' in standard C is 8 bits.
escaped using “\xx” where xx is the ASCII code for the character in hexadecimal.
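To make the escaping rule concrete, the Python sketch below builds such a global string constant. It is an illustration only: the function names are hypothetical, the internal linkage keyword and the trailing NUL byte are assumptions about the generated code, and the format string "%d\n" is an assumed value for .str chosen because it yields an array of size 4.

```python
def llvm_escape(text):
    """Escape an ASCII string for an LLVM c"..." literal and append a NUL byte.

    Returns the escaped text and the number of bytes it represents.
    """
    data = text.encode("ascii") + b"\x00"
    parts = []
    for byte in data:
        ch = chr(byte)
        if 32 <= byte < 127 and ch not in '"\\':
            parts.append(ch)  # printable characters are emitted as-is
        else:
            parts.append(f"\\{byte:02X}")  # everything else as \xx in hexadecimal
    return "".join(parts), len(data)


def llvm_string_constant(name, text):
    """Build the global variable definition for a constant string."""
    escaped, size = llvm_escape(text)
    return f'@{name} = internal constant [{size} x i8] c"{escaped}"'


print(llvm_string_constant(".str", "%d\n"))
# prints: @.str = internal constant [4 x i8] c"%d\0A\00"
```

The newline (ASCII 0A) and the terminating NUL (ASCII 00) are the two bytes that need escaping here, and together with the two printable characters they give the array size of 4.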
The above block of code contains the function definition for PrintSquare function
keyword is added to indicate that the function never returns with an unwind or exceptional
control flow. If the function does unwind, its runtime behavior is undefined.
The above statement creates a local variable named n_addr and allocates memory in
the stack frame, which is automatically released when the function returns to its caller.
The allocation returns a pointer to the allocated memory,
which is stored in the n_addr variable. The '%' sign indicates that the variable is local.
This statement copies the integer value of local value n to the memory location
The above code fragment copies the integer value from the
memory location stored in the n_addr variable to a local variable named 0 (zero). Variable
The “getelementptr” instruction performs the address calculation for the global variable .str
and doesn't access memory. The “call” instruction calls the function named printf and
passes the calculated address of the .str variable along with the integer value
After the LLVM IR has been generated, it is the user's responsibility to assemble and
link it into the appropriate binary executable. The LLVM IR generated by
our compiler can be compiled to LLVM bitcode. With the help of GNU binutils it
assembly code. These tools are open source and can be executed on a wide
variety of architectures and operating systems. For Windows, we will be using the
official LLVM tools, while for the GNU binutils we will be using the one from
5 Gantt Chart
6 References
http://research.microsoft.com/en-us/projects/singularity/.
[5]. Lattner, Chris and Adve, Vikram. The LLVM Compiler Infrastructure Project.
The LLVM Compiler Infrastructure Project. [Online] March 2004. [Cited: August 8,
2009.] http://llvm.org/pubs/2004-01-30-CGO-LLVM.pdf.
international.org/publications/files/ECMA-ST-WITHDRAWN/ECMA-334,%201st%20edition,%20December%202001.pdf.
http://llvm.org/docs/LangRef.html.
2009.] http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-334.pdf.
7 Appendix
NamespaceMember
= ("namespace" Qualident "{"
{NamespaceMember}"}"
| {TypeModifiers} TypeDecl).
TypeDecl
=
( "class" ident [ClassBase] ClassBody [";"]
| "struct" ident [Base] StructBody [";"]
| "enum" ident [":" IntType] EnumBody [";"]
)
.
ClassMember = StructMember.
StructMember
=
"const" Type ident "=" Expr { "," ident "=" Expr } ";"
| ident "(" [FormalParams] ")" [ConstructorCall] (Block | ";")
| ("implicit"|"explicit") "operator" Type "(" Type ident ")" (Block | ";")
| TypeDecl
| Type "operator" OverloadableOp "(" Type ident ("," Type ident |) ")" (Block|";")
| Field { "," Field } ";"
| Qualident "(" [FormalParams] ")" (Block|";")
| "{" Accessors "}"
.
Statement =
(
"const" Type ident "=" Expr { "," ident "=" Expr} ";"
| LocalVarDecl ";"
| EmbeddedStatement
)
.
EmbeddedStatement
= Block
| ";"
| StatementExpr ";"
| "if" "(" Expr ")" EmbeddedStatement ["else" EmbeddedStatement]
| "while" "(" Expr ")" EmbeddedStatement
| "do" EmbeddedStatement "while" "(" Expr ")" ";"
| "for" "(" [ForInit] ";" [Expr] ";" [ForInc] ")" EmbeddedStatement
| "break" ";"
| "continue" ";"
| "return" [Expr] ";"
.
Primary=
(
ident
| Literal
| "(" Expr ")"
| ( "bool" | "char" | "float" | "int" | "object" | "string" ) "." ident
| "this" | "base" ( "." ident | "[" Expr "]" )
| "new" Type ( "(" [ Argument {"," Argument}] ")" | ArrayInit )
| "typeof" "(" Type ")"
| "sizeof" "(" Type ")"
)
{
"++" | "--"
| "." ident
| "(" [Argument {"," Argument}] ")"
}
.
OverloadableOp
= "+" | "-" | "!" | "++" | "--" | "true" | "false"
| "*" | "/" | "==" | "!=" | ">" | "<" | ">=" | "<=".
Argument = Expr.