Lab Assgn4b
Faculty : Bikas Kanti Sarkar, Indrajit Mukherjee, Prashant Pranab, Supratim Biswas
Teaching Assistant : Swarna Aishwarya Twinkle (Research Scholar)
Lab Assignment 4 :
Objective : To write a lexical analyzer (or tokenizer) for x86-64 assembly language programs. The purpose
is to acquire skills in processing the assembly code of a contemporary real machine, such as the x86-64
architecture, which is probably the architecture of the machine on which you are performing your experiments.
General Instructions :
1. You will need to identify the tokens in an x86-64 assembly program. Read the handout to
understand the token structures so that you can write regexes for them in Lex.
2. Test your scanner with the given test cases, and then with test cases of your own.
Practice Problems
P1. You have to do the following activities: a) examine the assembly code generated by gcc, b) identify the patterns
of interest in the assembly code, c) write a lex script that identifies the patterns, and d) finally produce some
statistical data about the various elements in the assembly code.
You can start with a C program of your own or use the given program, “testprog1.c”. Name the generated
assembly code “testprog1_opt.s”. Check the assembly code generated on your server, as it may differ slightly
from the code given below, which was produced on the instructor’s laptop.
P2. Repeat the experiment of P1 on the unoptimized assembly code generated from the source,
testprog1.c, using “scanass1” to tokenize it. Determine whether any assembly instructions
are reported as unrecognized, and fix the lex script to remove these anomalies.
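The check asked for in P2 can be prototyped outside Lex before the script is edited. The following is a minimal Python sketch, assuming that the scanner knows only the mnemonics listed in this handout, that the first word of a line is either a directive, a label or a mnemonic, and that a label and an instruction never share a line. The function name, the sample text and the choice of which mnemonics accept a size suffix are illustrative assumptions; “cltq” and “leave” appear in the sample only as examples of names absent from the handout's lists.

```python
import re

# Mnemonics from the handout; the first group is assumed to accept a size suffix.
KNOWN = re.compile(
    r"(?:(?:sub|mov|xor|lea|add|cmp|imul|sal|shr|and|or|not"
    r"|push|pop|call|ret)[bwlq]?"
    r"|endbr64|jmp|je|jne|js|jns|jg|jge|jl|jle)"
)

def unrecognized_mnemonics(asm_text):
    """Return (line_number, mnemonic) pairs for mnemonics not in KNOWN."""
    found = []
    for number, line in enumerate(asm_text.splitlines(), start=1):
        words = line.split()
        if not words:
            continue  # blank line
        first = words[0]
        if first.startswith(".") or first.endswith(":"):
            continue  # directive or label line, not an instruction
        if not KNOWN.fullmatch(first):
            found.append((number, first))
    return found

# Hypothetical input, not taken from the attached files.
SAMPLE = """\
main:
    endbr64
    pushq %rbp
    movq %rsp, %rbp
    movl $0, %eax
    cltq
    leave
    ret
"""

for number, name in unrecognized_mnemonics(SAMPLE):
    print(f"line {number}: unrecognized instruction {name}")
```

Each name reported this way is a candidate for a new pattern in the lex script.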
Attachment : 1. testprog1.s 2. desired-output-unopt
64 bit Registers : %rip (instruction pointer), %rsp (stack pointer), %rbp (base pointer on
stack), %rax, %rdi, %rsi, %rdx, %rbx, %rcx, %r8, %r9, %r10, %r11,
%r12, %r13, %r14, %r15 (16 general purpose registers, plus the instruction pointer %rip)
32 bit Registers : %esp (stack pointer), %ebp (base pointer on stack), %eax, %edi, %esi,
%edx, %ebx (replace the prefix ‘r’ by ‘e’)
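The two register lists translate fairly directly into regular expressions. The sketch below uses Python's re module only as a way to try the patterns out; the same alternations could be carried over to Lex. It covers exactly the names listed above, and the function name and return values are assumptions made for illustration.

```python
import re

# Names copied from the two register rows; %r8 to %r15 are written as a range.
REG64 = re.compile(r"%(?:rip|rsp|rbp|rax|rdi|rsi|rdx|rbx|rcx|r(?:[89]|1[0-5]))")
REG32 = re.compile(r"%(?:esp|ebp|eax|edi|esi|edx|ebx)")

def register_width(token):
    """Return 64 or 32 for a listed register name, or None otherwise."""
    if REG64.fullmatch(token):
        return 64
    if REG32.fullmatch(token):
        return 32
    return None

for tok in ["%rax", "%r12", "%ebx", "%r16", "rax"]:
    print(tok, register_width(tok))
```

Register names that are not listed above, for example %ecx, would have to be added to the alternation before they are recognized.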
Common instruction op codes : sub, mov, xor, lea, add, cmp, imul, sal, shr, and, or, not
Other assembly instructions : endbr64, call, ret
Jump instructions : jmp, je, jne, js, jns, jg, jge, jl, jle
Stack instructions : push, pop, call, ret
Assembly instructions vary in operand length from 1 to 8 bytes : We add a suffix {‘b’, ‘w’, ‘l’ or ‘q’} to an
instruction to denote an operand length of {1, 2, 4, 8} bytes, respectively. For instance, movb, movw, movl, movq
are instances of the move instruction.
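One detail of the suffix rule is easy to get wrong: several base mnemonics already end in a suffix letter (sub, call, sal, imul, jl), so stripping the last character blindly would misread them. A possible way around this, sketched below in Python, is to test the whole name against the known bases first and only then try to split off a suffix. The split into suffixable and fixed mnemonics is an assumption of this sketch, not something the handout specifies.

```python
# Base mnemonics from the handout. Which of them accept a size suffix
# is an assumption made for this sketch.
SUFFIXABLE = {
    "sub", "mov", "xor", "lea", "add", "cmp", "imul", "sal", "shr",
    "and", "or", "not", "push", "pop", "call", "ret",
}
FIXED = {"endbr64", "jmp", "je", "jne", "js", "jns", "jg", "jge", "jl", "jle"}
SUFFIX_BYTES = {"b": 1, "w": 2, "l": 4, "q": 8}

def split_mnemonic(name):
    """Return (base, size_in_bytes), with size None when there is no suffix.

    Returns None when the name is not a known mnemonic.
    """
    if name in SUFFIXABLE or name in FIXED:
        return (name, None)  # whole name is a base, e.g. "sub" or "jl"
    base, last = name[:-1], name[-1:]
    if base in SUFFIXABLE and last in SUFFIX_BYTES:
        return (base, SUFFIX_BYTES[last])
    return None

for name in ["movq", "movb", "sub", "subl", "call", "jl"]:
    print(name, split_mnemonic(name))
```

In a lex script the same effect could be obtained by writing the pattern as the base alternation followed by an optional character class [bwlq].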
Assembly code starting with a dot (.) followed by some standard keywords, such as “file”, “string”,
and many others : These have to be recognized but no action is taken, except to count their occurrences
Labels of an assembly code : .LX, where X is a string of one or more alphanumeric characters, denotes a label
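Directives and .L labels both begin with a dot, so a scanner has to tell them apart. The Python sketch below is one way to do it, under the assumptions that a label definition is a line of the form .LX followed by a colon and that every other line beginning with a dot and a keyword is a directive. The function name and the sample text are hypothetical.

```python
import re
from collections import Counter

# A label definition: .L, one or more alphanumerics, then a colon.
LABEL_DEF = re.compile(r"\s*(\.L[A-Za-z0-9]+):\s*")
# A directive: a dot and a keyword that is not immediately followed by a colon.
DIRECTIVE = re.compile(r"\s*\.([A-Za-z_][A-Za-z0-9_]*)(?![A-Za-z0-9_:]).*")

def scan_dot_lines(asm_text):
    """Return (directive_counts, labels) for the lines that start with a dot."""
    directive_counts = Counter()
    labels = []
    for line in asm_text.splitlines():
        label = LABEL_DEF.fullmatch(line)
        if label:
            labels.append(label.group(1))
            continue
        directive = DIRECTIVE.fullmatch(line)
        if directive:
            directive_counts[directive.group(1)] += 1
    return directive_counts, labels

# Hypothetical input, not taken from the attached files.
SAMPLE = """\
    .file "testprog1.c"
    .text
    .section .rodata
.LC0:
    .string "hello"
    .text
    .globl main
main:
    endbr64
.L3:
    ret
"""

counts, labels = scan_dot_lines(SAMPLE)
print(dict(counts))
print(labels)
```

Only label definitions are collected here; a reference such as a jump to .L3 appears as an operand and would be handled with the other operand patterns.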
Operand formats : a few are given below. Identify operands that are not captured by these patterns.
1. $number : report as “immediate operand”
2. register : report as “register operand”, with special mention for %rsp, %rbp and %rip
3. (register) : report as “register indirect operand”
4. num(%register) : report as “register indirect displacement operand”
5. num(register, register, num) : report as “register indirect scaled index operand”
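The five operand formats can be checked against sample operands before they are written as Lex rules. The Python sketch below follows the list strictly, so anything else is reported as unrecognized, which helps with finding the operands that these patterns miss. The helper names, the acceptance of hexadecimal numbers and the restriction of the scale factor to 1, 2, 4 or 8 are assumptions of this sketch.

```python
import re

NUM = r"-?(?:0x[0-9a-fA-F]+|\d+)"   # decimal or hexadecimal number (assumed)
REG = r"%[a-z][a-z0-9]*"            # any register-like name
SPECIAL = {"%rsp", "%rbp", "%rip"}  # registers that get a special mention

PATTERNS = [
    (re.compile(rf"\${NUM}"), "immediate operand"),
    (re.compile(REG), "register operand"),
    (re.compile(rf"\({REG}\)"), "register indirect operand"),
    (re.compile(rf"{NUM}\({REG}\)"), "register indirect displacement operand"),
    (re.compile(rf"{NUM}\({REG},\s*{REG},\s*[1248]\)"),
     "register indirect scaled index operand"),
]

def classify_operand(text):
    """Return the report string for one operand, following the five formats."""
    text = text.strip()
    for pattern, kind in PATTERNS:
        if pattern.fullmatch(text):
            if kind == "register operand" and text in SPECIAL:
                return f"{kind} (special register {text})"
            return kind
    return "unrecognized operand"

for operand in ["$42", "%rax", "%rsp", "(%rdi)", "-8(%rbp)",
                "16(%rdi,%rax,4)", "(%rdi,%rax,4)", ".LC0(%rip)"]:
    print(operand, "->", classify_operand(operand))
```

The last two sample operands, a scaled index without a displacement and a label used as a displacement, fall outside the five formats and are therefore reported as unrecognized.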