
Chapter 1

COMPUTER SYSTEM
PERFORMANCE

Why study computer organization and architecture?

 Design better programs, including system


software such as compilers, operating systems,
and device drivers.
 Optimize program behavior.
 Evaluate (benchmark) computer system
performance.
 Understand time, space, and price tradeoffs.


 Computer organization
 Encompasses all physical aspects of computer systems.
 E.g., circuit design, control signals, memory types.
 How does a computer work?
 Computer architecture
 Logical aspects of system implementation as seen by
the programmer.
 E.g., instruction sets, instruction formats, data types,
addressing modes.
 How do I design a computer?

Computer Components

 There is no clear distinction between matters related to


computer organization and matters relevant to computer
architecture.
 Principle of Equivalence of Hardware and Software:
 Anything that can be done with software can also be done
with hardware, and anything that can be done with hard
ware can also be done with software.


 At the most basic level, a computer is a device consisting of


four pieces:
 A processor to interpret and execute programs
 A memory to store both data and programs
 A mechanism for transferring data to and from the outside
world, including I/O devices
 Bus (interconnection among parts)

An Example System

Consider this advertisement:

What does it all mean?? Go to Google and search


for PCI or USB port.


An Example System

The microprocessor is the “brain” of


the system. It executes program
instructions. This one is a Pentium III
(Intel) running at 667MHz.

A system bus moves data within the


computer. The faster the bus the better.
This one runs at 133MHz.

An Example System

 Computers with large main memory capacity can run larger


programs with greater speed than computers having small
memories.
 RAM is an acronym for random access memory. Random
access means that memory contents can be accessed directly
if you know its location.
 Cache is a type of temporary memory that can be accessed
faster than RAM.


An Example System
This system has 64MB of (fast)
synchronous dynamic RAM
(SDRAM) . . .
Google search for SDRAM?

… and two levels of cache memory, the level 1 (L1)


cache is smaller and (probably) faster than the L2 cache.
Note that these cache sizes are measured in KB.

An Example System
Hard disk capacity determines
the amount of data and size of
programs you can store.

This one can store 30GB. 7200 RPM is the rotational


speed of the disk. Generally, the faster a disk rotates,
the faster it can deliver data to RAM. (There are many
other factors involved.)



An Example System

EIDE stands for enhanced integrated drive electronics,


which describes how the hard disk interfaces with (or
connects to) other system components.

A CD-ROM can store about 650MB of data, making


it an ideal medium for distribution of commercial
software packages. 48x describes its speed.


An Example System
Ports allow movement of data
between a system and its external
devices.

This system has


four ports.



An Example System

 Serial ports send data as a series of pulses along one or


two data lines.
 Parallel ports send data as a single pulse along at least
eight data lines.
 USB, universal serial bus, is an intelligent serial
interface that is self-configuring. (It supports “plug and
play.”)


An Example System
System buses can be augmented by
dedicated I/O buses. PCI, peripheral
component interface, is one such bus.

This system has two PCI devices: a


sound card, and a modem for
connecting to the Internet.



An Example System
The number of times per second that the image on
the monitor is repainted is its refresh rate. The dot
pitch of a monitor tells us how clear the image is.

This monitor has a dot pitch of


0.24mm and a refresh rate of 85Hz.

The graphics card contains memory and


programs that support the monitor.
Google search for AGP?

The computer level hierarchy

 Computers consist of many things besides chips.


 Before a computer can do anything worthwhile, it
must also use software.
 Writing complex programs requires a “divide and
conquer” approach, where each program module
solves a smaller problem.
 Complex computer systems employ a similar
technique through a series of virtual machine
layers.


 Each virtual machine layer


is an abstraction of the level
below it.
 The machines at each level
execute their own particular
instructions, calling upon
machines at lower levels to
perform tasks as required.
 Computer circuits
ultimately carry out the
work.

 Level 6: The User Level


 Program execution and user interface level.
 The level with which we are most familiar.
 Level 5: High-Level Language Level
 The level with which we interact when we write
programs in languages such as C, Pascal, Lisp, and Java.
 Level 4: Assembly Language Level
 Acts upon assembly language produced from Level 5, as
well as instructions programmed directly at this level.


 Level 3: System Software Level


 Controls executing processes on the system.

 Protects system resources.

 Assembly language instructions often pass through Level 3

without modification.
 Level 2: Machine Level
 Also known as the Instruction Set Architecture (ISA) Level.
 Consists of instructions that are particular to the
architecture of the machine.
 Programs written in machine language need no compilers,
interpreters, or assemblers.

 Level 1: Control Level


 A control unit decodes and executes instructions and moves

data through the system.


 Control units can be microprogrammed or hardwired.

 A microprogram is a program written in a low-level

language that is implemented by the hardware.


 Hardwired control units consist of hardware that directly

executes machine instructions.


 Level 0: Digital Logic Level
 This level is where we find digital circuits (the chips).

 Digital circuits consist of gates and wires.

 These components implement the mathematical and control

logic of all other levels.


The von Neumann Model

 Inventors of the ENIAC, John Mauchley and J. Presper


Eckert, conceived of a computer that could store
instructions in memory.
 The invention of this idea has since been ascribed to a
mathematician, John von Neumann, who was a
contemporary of Mauchley and Eckert.
 Stored-program computers have become known as von
Neumann Architecture systems.


The von Neumann Model


 Today’s stored-program computers have the following
characteristics:
 Three hardware systems:

 A central processing unit (CPU)

 A main memory system

 An I/O system

 The capacity to carry out sequential instruction


processing.
 A single data path between the CPU and main
memory.
 This single path is known as the von Neumann

bottleneck.


The von Neumann Model

 This is a general
depiction of a von
Neumann system:

 These computers
employ a fetch-
decode-execute cycle
to run programs as
follows . . .


The von Neumann Model


 The control unit fetches the next instruction from
memory using the program counter to determine where
the instruction is located.



The von Neumann Model


 The instruction is decoded into a language that the ALU
can understand.


The von Neumann Model


 Any data operands required to execute the instruction
are fetched from memory and placed into registers
within the CPU.



The von Neumann Model


 The ALU executes the instruction and places results in
registers or memory.


The Modified von Neumann

Adding a system bus



Microprocessor speed Techniques

 Pipelining
 Branch prediction
 Data flow analysis
 Speculative execution

Pipelining

 Some CPUs divide the fetch-decode-execute cycle into


smaller steps.
 These smaller steps can often be executed in parallel to
increase throughput.
 Such parallel execution is called instruction-level
pipelining.
 This term is sometimes abbreviated ILP in the literature.


 Suppose a fetch-decode-execute cycle were broken into
the following smaller steps:
1. Fetch instruction.
2. Decode opcode.
3. Calculate effective address of operands.
4. Fetch operands.
5. Execute instruction.
6. Store result.

 Suppose we have a six-stage pipeline. S1 fetches the


instruction, S2 decodes it, S3 determines the address of
the operands, S4 fetches them, S5 executes the
instruction, and S6 stores the result.

 For every clock cycle, one small step is carried out, and
the stages are overlapped.

S1. Fetch instruction.
S2. Decode opcode.
S3. Calculate effective address of operands.
S4. Fetch operands.
S5. Execute.
S6. Store result.
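A quick way to quantify the benefit (an idealized estimate that ignores stalls
and hazards): a k-stage pipeline completes n instructions in about k + (n - 1)
cycles, versus n × k cycles without overlap. For the six-stage pipeline above
and 100 instructions:

without pipelining: 100 × 6 = 600 cycles
with pipelining:    6 + (100 - 1) = 105 cycles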


Real-World Examples of Pipelining

 We return briefly to the Intel and MIPS architectures


from the last chapter, using some of the ideas introduced
in this chapter.
 Intel introduced pipelining to their processor line with
its Pentium chip.
 The first Pentium had two five-stage pipelines. Each
subsequent Pentium processor had a longer pipeline
than its predecessor with the Pentium IV having a 24-
stage pipeline.
 The Itanium (IA-64) has only a 10-stage pipeline.

Branch Prediction

 Branch prediction is another approach to minimizing branch


penalties.
 Branch prediction tries to avoid pipeline stalls by guessing the
next instruction in the instruction stream.
 This is called speculative execution.
 Branch prediction techniques vary according to the type of
branching. If/then/else, loop control, and subroutine branching all
have different execution profiles.
 There are various ways in which a prediction can be made:
 Fixed predictions do not change over time.
 True predictions result in the branch being always taken or never
taken.
 Dynamic prediction uses historical information about the branch
and its outcomes.
 Static prediction does not use any history.


 Data flow analysis: The processor analyzes which


instructions are dependent on each other’s results, or
data, to create an optimized schedule of instructions.
 Speculative execution: Using branch prediction and
data flow analysis, some processors speculatively
execute instructions ahead of their actual appearance in
the program execution, holding the results in temporary
locations.

Performance

 Defining Performance : Which airplane has the


best performance?


Response Time and Throughput

 Response time
 How long it takes to do a task
 Throughput
 Total work done per unit time (tasks/transactions/…
per hour)
 How are response time and throughput affected
by
 Replacing the processor with a faster version?
 Adding more processors?
 We’ll focus on response time for now…

Relative Performance

 Define Performance = 1/Execution Time


 “X is n times faster than Y”

 Example: time taken to run a program is
10s on A, 15s on B
Execution TimeB / Execution TimeA = 15s / 10s = 1.5
So A is 1.5 times faster than B


CPU Clocking

 Operation of digital hardware governed by a


constant-rate clock

 Clock period: duration of a clock cycle
e.g., 250ps = 0.25ns = 250 × 10^-12 s
 Clock frequency (rate): cycles per second
e.g., 4.0GHz = 4000MHz = 4.0 × 10^9 Hz
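The two quantities are reciprocals, so either one determines the other:

Clock Rate = 1 / Clock Period
e.g., 1 / 250 ps = 1 / (250 × 10^-12 s) = 4.0 × 10^9 Hz = 4.0 GHz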

CPU Time

 Performance improved by
 Reducing number of clock cycles
 Increasing clock rate
 Hardware designer must often trade off clock rate
against cycle count
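The relationship behind this trade-off is the standard CPU time equation:

CPU Time = CPU Clock Cycles × Clock Cycle Time
         = CPU Clock Cycles / Clock Rate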


CPU Time Example

 Computer A: 2GHz clock, 10s CPU time


 Designing Computer B
 Aim for 6s CPU time
 Can do faster clock, but causes 1.2 × clock cycles
 How fast must Computer B clock be?
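Working the example through with the CPU time relation above:

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20 × 10^9 cycles
Clock Cycles_B = 1.2 × Clock Cycles_A = 24 × 10^9 cycles
Clock Rate_B   = Clock Cycles_B / CPU Time_B = 24 × 10^9 / 6 s = 4 GHz

So Computer B needs a 4 GHz clock, twice the clock rate of A.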

Instruction Count and CPI

 Instruction Count for a program


 Determined by program, ISA and compiler
 Average cycles per instruction
 Determined by CPU hardware
 If different instructions have different CPI
 Average CPI affected by instruction mix


CPI Example

 Computer A: Cycle Time = 250ps, CPI = 2.0


 Computer B: Cycle Time = 500ps, CPI = 1.2
 Same ISA
 Which is faster, and by how much?
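Working it through (same ISA, so the same program has the same instruction count):

Time per instruction on A = CPI_A × Cycle Time_A = 2.0 × 250 ps = 500 ps
Time per instruction on B = CPI_B × Cycle Time_B = 1.2 × 500 ps = 600 ps

A is faster, by 600 ps / 500 ps = 1.2 times.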

CPI in More Detail

 If different instruction classes take different


numbers of cycles

 Weighted average CPI
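The weighted average referred to here is:

CPU Clock Cycles = sum over classes i of (CPI_i × Instruction Count_i)
Average CPI      = CPU Clock Cycles / Total Instruction Count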


CPI Example

 Alternative compiled code sequences using


instructions in classes A, B, C

Performance Summary

 Performance depends on
 Algorithm: affects IC, possibly CPI
 Programming language: affects IC, CPI
 Compiler: affects IC, CPI
 Instruction set architecture: affects IC, CPI, Tc


MIPS as a Performance Metric

 MIPS: Millions of Instructions Per Second


 Doesn’t account for
 Differences in ISAs between computers
 Differences in complexity between instructions

 CPI varies between programs on a given CPU
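For reference, the metric is computed as:

MIPS = Instruction Count / (Execution Time × 10^6)
     = Clock Rate / (CPI × 10^6)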

Amdahl’s Law

 The overall performance of a system is a result of the


interaction of all of its components.
 System performance is most effectively improved when
the performance of the most heavily used components is
improved.
 This idea is quantified by Amdahl’s Law:
S = 1 / ((1 - f) + f / k)
where S is the overall speedup, f is the fraction of work performed
by the faster component, and k is the speedup of the faster component.


Amdahl’s Law

 Amdahl’s Law gives us a handy way to estimate the


performance improvement we can expect when we
upgrade a system component.
 On a large system, suppose we can upgrade a CPU to
make it 50% faster for $10,000 or upgrade its disk
drives for $7,000 to make them 250% faster.
 Processes spend 70% of their time running in the CPU
and 30% of their time waiting for disk service.
 An upgrade of which component would offer the greater
benefit for the lesser cost?


 The processor option offers a speedup of 1.3 times (S = 1.3), or 30%.
 The disk drive option gives a speedup of 1.22 times (S = 1.22), or 22%.
The calculations are shown below.
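Applying Amdahl's Law, and reading the CPU upgrade as k = 1.5 and the disk
upgrade as k = 2.5 (the values consistent with the 1.3 and 1.22 speedups
quoted above):

CPU upgrade:  f = 0.70, k = 1.5
S = 1 / ((1 - 0.70) + 0.70 / 1.5) = 1 / (0.30 + 0.467) ≈ 1.3

Disk upgrade: f = 0.30, k = 2.5
S = 1 / ((1 - 0.30) + 0.30 / 2.5) = 1 / (0.70 + 0.12) ≈ 1.22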

 Each 1% of improvement for the processor costs $333


(10000/30), and for the disk a 1% improvement costs
$318 (7000/22).

Performance and Power Trends


Evolution of Computer Technology

 The 0th generation
 The 1st generation
 The 2nd generation
 The 3rd generation
 Evolution of Intel processors

Generation Zero

 Generation Zero: Mechanical Calculating Machines (1642 -


1945)
 Calculating Clock - Wilhelm Schickard (1592 - 1635).

 Pascaline - Blaise Pascal (1623 - 1662).

 Difference Engine - Charles Babbage (1791 - 1871), also

designed but never built the Analytical Engine.


 Punched card tabulating machines - Herman Hollerith

(1860 - 1929).


The first generation

 The First Generation: Vacuum


Tube Computers (1945 - 1953)
 Atanasoff Berry Computer
(1937 - 1938) solved systems
of linear equations.
 John Atanasoff and Clifford
Berry of Iowa State
University.

The First Generation

 The ENIAC (Electronic Numerical Integrator And


Computer), designed and constructed at the University
of Pennsylvania: The world’s first general purpose
electronic digital computer.
 It weighed 30 tons, occupied 1500 square feet of floor
space, and contained more than 18,000 vacuum tubes.
When operating, it consumed 140 kilowatts of power. It
was capable of 5000 additions per second.
 The major drawback of the ENIAC was that it had to be
programmed manually by setting switches and plugging
and unplugging cables.


The First Generation

 In 1946, von Neumann and his colleagues began


the design of a new stored-program computer,
referred to as the IAS computer, at the Princeton
Institute for Advanced Studies.
 Although not completed until 1952, it is the prototype of
all subsequent general-purpose computers.

The First Generation: Commercial Computers

The UNIVAC I (Universal Automatic Computer) (1950)


was the first successful commercial computer. It was
intended for both scientific and commercial applications.
The UNIVAC II, which had greater memory capacity
and higher performance than the UNIVAC I, was
delivered in the late 1950s.
The IBM 701 (1953) was IBM's first electronic stored-program
computer (IBM was until then a maker of punched-card
processing equipment); it was intended primarily for
scientific applications.
The IBM 702 (1955) which had a number of hardware
features that suited it to business applications.


The second generation

 The Second Generation:


Transistorized Computers
(1954 - 1965)
 IBM 7094 (scientific) and
1401 (business)
 Digital Equipment
Corporation (DEC) PDP-1
 Univac 1100
 . . . and many others.

The Second Generation

The second generation saw the introduction of more


complex arithmetic and logic units and control units,
the use of high-level programming languages, and the
provision of system software with the computer.
In broad terms, system software provided the ability to

load programs, move data to peripherals, and libraries to


perform common computations, similar to what modern
OSes like Windows and Linux do.


The third Generation

 The Third Generation:


Integrated Circuit Computers
(1965 - 1980)
 IBM 360
 DEC PDP-8 and PDP-11
 Cray-1 supercomputer
 . . . and many others.

The Third Generation

MOORE'S LAW
Gordon Moore observed that the number of transistors that
could be put on a single chip was doubling every year, and he
correctly predicted that this pace would continue into the
near future. To the surprise of many, including Moore, the
pace continued year after year and decade after decade. The
pace slowed to a doubling every 18 months in the 1970s but
has sustained that rate ever since.


The fourth generation

 The Fourth Generation: VLSI


Computers (1980 - ?)
 Very large scale integrated circuits
(VLSI) have more than 10,000
components per chip.
 Enabled the creation of Intel microprocessors.

 The first was the 4-bit Intel 4004.


Later versions, such as the 8080,
8086, and 8088 spawned the idea
of “personal computing.”

The Microprocessors

In 1971, Intel developed 4004: the first chip to contain all
of the components of a CPU on a single chip. The 4004 can
add two 4-bit numbers and can multiply only by repeated
addition.
In 1972, Intel developed 8008. This was the first 8-bit

microprocessor and was almost twice as complex as the


4004.
In 1974, Intel developed 8080 (8-bit), which was designed

to be the CPU of a general-purpose microcomputer.


By the end of the 1970s, general-purpose 16-bit microprocessors had
appeared. One of these was the 8086.


Evolution of Intel Processors

 1970s

 1980s


 1990s

 Recent processors


Multicore, MICs, GPUs

 Improvements in Chip Organization and Architecture :


 Increasing the hardware speed (clock speed) of the
processor runs into obstacles:
 Increased heat dissipation (W/cm2)
 RC delay
 Memory latency
 The use of multiple processors on the same chip, also
referred to as multiple cores, or multicore, provides the
potential to increase performance without increasing the
clock rate.

 Chip manufacturers are now in the process of


making a huge leap forward in the number of
cores per chip (more than 50).
 The leap in performance as well as the
challenges in developing software to exploit such
a large number of cores have led to the
introduction of a new term: many integrated
core (MIC)


Evolution of Intel x86 Architecture

 Two processor families:


 Intel x86: the sophisticated design principles once
found on mainframes, supercomputers and servers
(CISC – Complex Instruction Set Computers),
 The ARM architecture is used in a wide variety of
embedded systems and is one of the most powerful
and best-designed RISC-based systems on the market
(RISC - Reduced Instruction Set Computers),

CISC and RISC

 RISC Machines
 The underlying philosophy of RISC machines is that a system
is better able to manage program execution when the program
consists of only a few different instructions that are the same
length and require the same number of clock cycles to decode
and execute.
 RISC systems access memory only with explicit load and
store instructions.
 In CISC systems, many different kinds of instructions access
memory, making instruction length variable and fetch-decode-
execute time unpredictable.


 The difference between CISC and RISC becomes
evident through the basic computer performance equation:
time/program = (instructions/program) × (average cycles/instruction) × (time/cycle)

 RISC systems shorten execution time by reducing the


clock cycles per instruction.
 CISC systems improve performance by reducing the
number of instructions per program.

 The simple instruction set of RISC machines enables


control units to be hardwired for maximum speed.
 The more complex-- and variable-- instruction set of
CISC machines requires microcode-based control units
that interpret instructions as they are fetched from
memory. This translation takes time.
 With fixed-length instructions, RISC lends itself to
pipelining and speculative execution.


 Consider the following program fragments:

CISC:            mov ax, 10
                 mov bx, 5
                 mul bx, ax

RISC:            mov ax, 0
                 mov bx, 10
                 mov cx, 5
         Begin:  add ax, bx
                 loop Begin

 The total clock cycles for the CISC version might be:
(2 movs × 1 cycle) + (1 mul × 30 cycles) = 32 cycles
 While the clock cycles for the RISC version is:
(3 movs × 1 cycle) + (5 adds × 1 cycle) + (5 loops × 1 cycle) = 13 cycles

 With RISC clock cycle being shorter, RISC gives us much


faster execution speeds.

 Because of their load-store ISAs, RISC architectures


require a large number of CPU registers.
 These registers provide fast access to data during
sequential program execution.
 They can also be employed to reduce the overhead
typically caused by passing parameters to subprograms.
 Instead of pulling parameters off of a stack, the
subprogram is directed to use a subset of registers.


 This is how
registers can be
overlapped in a
RISC system.
 The current
window pointer
(CWP) points to
the active register
window.

Flynn’s Taxonomy

 Many attempts have been made to come up with a way


to categorize computer architectures.
 Flynn’s Taxonomy has been the most enduring of these,
despite having some limitations.
 Flynn’s Taxonomy takes into consideration the number
of processors and the number of data paths incorporated
into an architecture.
 A machine can have one or many processors that
operate on one or many data streams.


 The four combinations of multiple processors and multiple


data paths are described by Flynn as:
 SISD: Single instruction stream, single data stream. These are
classic uniprocessor systems.
 SIMD: Single instruction stream, multiple data streams.
Execute the same instruction on multiple data values, as in
vector processors.
 MIMD: Multiple instruction streams, multiple data streams.
These are today’s parallel architectures.
 MISD: Multiple instruction streams, single data stream.


Chapter 2

TOP LEVEL VIEW


OF

COMPUTER
FUNCTION

Contents

 Understand the basic elements of an instruction cycle and


the role of interrupts.
 Describe the concept of interconnection within a
computer system (buses).
 Understand the difference between synchronous and
asynchronous bus timing.
 Explain the need for multiple buses arranged in a
hierarchy.
 Assess the relative advantages of point-to-point
interconnection compared to bus interconnection.
 Present an overview of QPI, PCIe


 Present an overview of the main characteristics of


computer memory systems and the use of a memory
hierarchy.
 Describe the basic concepts and intent of cache
memory, key elements of cache design.
 Distinguish between direct mapping and associative
mapping.
 Present an overview of the types of semiconductor main
memory, internal and external memory.

Computer Components

 Contemporary computer designs are based on concepts


developed by John von Neumann at the Institute for
Advanced Studies, Princeton
 Referred to as the von Neumann architecture and is
based on three key concepts:
 Data and instructions are stored in a single read-write memory
 The contents of this memory are addressable by location,
without regard to the type of data contained there
 Execution occurs in a sequential fashion (unless explicitly
modified) from one instruction to the next
 Hardwired program
 The result of the process of connecting the various components
in the desired configuration


Hardware and Software Approaches

 A computer consists of two parts: hardware and software
 Program concept
 Hardwired systems are
inflexible
 General purpose
hardware can do
different tasks, given
correct control signals
 Instead of re-wiring,
supply a new set of
control signals

 What is a program
 A sequence of steps
 For each step, an
arithmetic or logical
operation is done
 For each operation, a
different set of control
signals is needed
 Also need temp storage
(memory) and way to get
input and output


 Software
 A sequence of codes or instructions
 Part of the hardware interprets each instruction and generates
control signals
 Provide a new sequence of codes for each new program instead
of rewiring the hardware
 Major components:
 CPU
 Instruction interpreter
 Module of general-purpose arithmetic and logic functions
 I/O Components
 Input module : Contains basic components for accepting data and
instructions and converting them into an internal form of signals usable
by the system
 Output module : Means of reporting results

 Memory
• Memory address register (MAR): specifies the address in
memory for the next read or write.
• Memory buffer register (MBR): contains the data to be written
into memory or receives the data read from memory.
• I/O address register (I/OAR): specifies a particular I/O device.
• I/O buffer register (I/OBR): used for the exchange of data
between an I/O module and the CPU.


Computer system
 At the most basic level, a computer is a device consisting of
four parts:
 A processor to interpret and execute programs
 A memory to store both data and programs
 A mechanism for transferring data to and from the outside
world.
 Bus (interconnection among parts)

Computer Components: Top Level View


Function of some registers

 Program Counter (PC) : contain address of next


instruction
 Instruction Register (IR) : contain instruction
code
 Memory Address Register (MAR) - usually
connected directly to address lines of bus
 Memory Buffer Register (MBR) - usually
connected directly to data lines of bus

 Program Status Word (PSW) - also essential, common


fields or flags contained include:
 Sign - sign bit of last arithmetic operation
 Zero - set when result of last arithmetic operation is 0
 Carry - set if last op resulted in a carry into or borrow out of a
high-order bit
 Equal - set if a logical compare result is equality
 Overflow - set when last arithmetic operation caused overflow
 Interrupt Enable/Disable - used to enable or disable interrupts
 Supervisor - indicates if privileged ops can be used


 Other optional registers


 Pointer to a block of memory containing additional
status info (like process control blocks)
 An interrupt vector
 A system stack pointer
 A page table pointer
 I/O registers

CPU (Central Processing Unit)

 Performs data processing operations


 Consist of 4 main part :
 Control unit
 Register file
 ALU (Arithmetic and logic unit)
 Internal Bus


 The Control Unit and the Arithmetic and Logic


Unit constitute the Central Processing Unit
 Data and instructions need to get into the
system and results out
 Input/output
 Temporary storage of code and results is needed
 Main memory

Example of registers in CPU


 PC contains address of next instruction to be


fetched
 This address is moved to MAR and placed on
address bus
 Control unit requests a memory read
 Result is
 placed on data bus
 result copied to MBR
 then moved to IR
 Meanwhile, PC is incremented

Instruction Cycle

 INSTRUCTION FETCH & EXECUTE


 The basic function performed by a computer is execution of a
program, which consists of a set of instructions stored in
memory.
 Program execution consists of repeating the process of
instruction fetch and instruction execution. The instruction
execution may involve several operations and depends on the
nature of the instruction.
 The fetched instruction is loaded into a register in the
processor known as the instruction register (IR).


Fetch Cycle

 At the beginning of each instruction cycle the processor


fetches an instruction from memory
 The program counter (PC) holds the address of the
instruction to be fetched next
 Processor fetches instruction from memory
location pointed to by PC
 The processor increments the PC so that it will fetch the
next instruction in sequence
 The fetched instruction is loaded into the instruction
register (IR)
 The processor interprets the instruction and performs the
required action

Execute Cycle

 Four categories of actions


 1. Processor-memory : data transfer between CPU and
main memory
 2. Processor I/O: Data transfer between CPU and I/O
module
 3. Data processing: Some arithmetic or logical
operation on data
 4. Control: Alteration of sequence of operations e.g.
jump
 Instruction execution may involve a
combination of these


Example of program execution

Explain

 The PC contains 300, the address of the first instruction.


The instruction (the value 1940 in hex) is loaded into IR
and PC is incremented. This process involves the use of
MAR and MBR.
 The first hexadecimal digit in IR indicates that the AC is
to be loaded. The remaining three hexadecimal digits
specify the address (940) from which data are to be
loaded.
 The next instruction (5941) is fetched from location 301
and PC is incremented.


 The old contents of AC and the contents of


location 941 are added and the result is stored
in the AC.
 The next instruction (2941) is fetched from
location 302 and the PC is incremented
 The contents of the AC are stored in location
941.
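A minimal C sketch of this walkthrough, assuming the simple hypothetical
machine the example uses (4-bit opcode, 12-bit address; opcode 1 = load AC,
5 = add to AC, 2 = store AC; the operand values 3 and 2 are illustrative):

#include <stdio.h>

#define OP_LOAD  0x1   /* AC <- MEM[addr]      */
#define OP_STORE 0x2   /* MEM[addr] <- AC      */
#define OP_ADD   0x5   /* AC <- AC + MEM[addr] */

int main(void) {
    static unsigned short mem[4096];
    unsigned short pc = 0x300, ir, ac = 0;

    mem[0x300] = 0x1940;   /* load AC from location 940      */
    mem[0x301] = 0x5941;   /* add contents of 941 to AC      */
    mem[0x302] = 0x2941;   /* store AC into location 941     */
    mem[0x940] = 3;        /* illustrative operand values    */
    mem[0x941] = 2;

    for (int i = 0; i < 3; i++) {
        ir = mem[pc++];                 /* fetch, then increment PC  */
        unsigned op   = ir >> 12;       /* first hex digit = opcode  */
        unsigned addr = ir & 0x0FFF;    /* remaining digits = address */
        switch (op) {                   /* execute                   */
        case OP_LOAD:  ac = mem[addr];        break;
        case OP_ADD:   ac = ac + mem[addr];   break;
        case OP_STORE: mem[addr] = ac;        break;
        }
    }
    printf("MEM[941] = %u\n", (unsigned)mem[0x941]);   /* 3 + 2 = 5 */
    return 0;
}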

Instruction Cycle State Diagram


Interrupt

 Mechanism by which other modules (e.g. I/O)


may interrupt normal sequence of processing
 Program/CPU
 e.g. overflow, division by zero
 Timer
 Generated by internal processor timer
 Used in pre-emptive multi-tasking
 I/O
 from I/O controller
 Hardware failure
 e.g. memory parity error

Classes of Interrupts


Software Interrupts

 Some processors support “Software Interrupts”


 In particular, both the Intel x86 family that we will use for
assembler and the ARM family use them extensively
 Software interrupts are not really interrupts at
all.
 A software interrupt is a machine instruction that
causes a transfer of control through the same
mechanism used by true interrupts
 Typically used for low-level calls to the operating
system or components such as device drivers

Why use interrupts?

 I/O Interrupts are used to improve CPU


utilization
 Most I/O devices are relatively slow compared
to the CPU
 Human interface devices and printers are
especially slow
 Keyboard: at best still fewer than 10 keystrokes per
second
 Printer: sending a single byte with the value 12
decimal causes a page eject (several seconds)


Processing without interrupts

 Figure (a) above has three code segments (1, 2, 3) that
do not perform I/O
 WRITE calls the OS to perform an I/O Write
 Code sequence 4 prepares for the I/O transfer (check
device status, copy data to buffer, etc.)
 OS issues I/O command after seq 4.
 OS then has to wait and poll device status until I/O
completes
 Code seq 5 is post I/O processing; e.g., set status flag
 The user’s program is suspended until I/O
completes


Processing with interrupts

 Fig 3.7b shows processing with interrupts


 The WRITE call again transfers control to OS
 After write preparation in block 4, control returns to
user program
 I/O proceeds concurrently with user program
 When I/O completes, device issues an interrupt
request
 OS interrupts user program (marked with *) and
executes post I/O code in block 5

Interrupt processing

 When the external device needs to be serviced—the I/O


module for that device sends an interrupt request signal
to the processor. The processor responds by suspending
operation of the current program, branching off to a
program to service that particular I/O device, known as
an interrupt handler, and resuming the original
execution after the device is serviced.


Interrupt Cycle

 Added to instruction cycle


 Processor checks for interrupt
 Indicated by an interrupt signal
 If no interrupt, fetch next instruction
 If interrupt pending:
 Suspend execution of current program
 Save context
 Set PC to start address of interrupt handler routine
 Process interrupt
 Restore context and continue interrupted program
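A schematic sketch in C of how the interrupt check extends the basic cycle
(the functions declared extern, such as interrupt_pending and save_context,
are placeholders, not a real API):

/* One pass of the instruction cycle with the interrupt check added. */
extern int      interrupt_pending(void);        /* interrupt signal asserted?      */
extern void     save_context(void);             /* save PC, PSW, registers         */
extern void     run_interrupt_handler(void);    /* process the pending interrupt   */
extern void     restore_context(void);          /* resume the interrupted program  */
extern unsigned fetch(unsigned pc);
extern unsigned execute(unsigned ir, unsigned pc);  /* returns the next PC */

unsigned instruction_cycle(unsigned pc)
{
    unsigned ir = fetch(pc);        /* fetch the instruction pointed to by PC   */
    pc = execute(ir, pc + 1);       /* execute; a branch may change the next PC */

    if (interrupt_pending()) {      /* added interrupt cycle                    */
        save_context();             /* suspend current program, save its state  */
        run_interrupt_handler();    /* PC is set to the handler's start address */
        restore_context();          /* continue the interrupted program         */
    }
    return pc;                      /* PC of the next instruction               */
}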

Transfer of Control via Interrupts

 From the point of view of the user


program, an interrupt is just an
interruption of the normal sequence
of execution.
 When the interrupt processing is
completed, execution resumes.
Thus, the user program does not
have to contain any special code to
accommodate interrupts; the
processor and the operating system
are responsible for suspending the
user program and then resuming it at
the same point.


Interrupts and the instruction cycle

 An interrupt cycle is added to the instruction cycle in


which the processor checks to see if any interrupts have
occurred, indicated by the presence of an interrupt signal.
If no interrupts are pending, the processor proceeds to the
fetch cycle and fetches the next instruction of the current
program.
 If an interrupt is pending, the processor:
 suspends execution of the current program being executed and
saves the address of the next instruction to be executed and any
other data relevant to the processor’s current activity.
 Sets the program counter to the starting address of an interrupt
handler routine.

Interrupts and the instruction cycle

 Instruction cycle with interrupt


Program Timing: Short I/O Wait

 Program timing

Program Timing: Long I/O Wait

 Long IO wait


Instruction Cycle State Diagram with Interrupts

Multiple Interrupts

 Two strategies for handling multiple interrupts:


 1. Disable interrupts
 Processor will ignore further interrupts while
processing one interrupt
 Interrupts remain pending and are checked after
first interrupt has been processed
 Interrupts handled in sequence as they occur
 2. Define priorities
 Low priority interrupts can be interrupted by higher
priority interrupts
 When higher priority interrupt has been processed,
processor returns to previous interrupt


Transfer of Control-Multiple Interrupts

 Sequential interrupt
Processing
 Nested interrupt

Processing

Time Sequence of Multiple Interrupts


Interrupt vector table

 INT and INT3 behave in a similar way.
 INT n:
 Calls the ISR located at vector n (address n*4).
 The INT instruction requires two bytes of memory: opcode
plus n.
 The interrupt vector table contains 256 address pointers.

Example of interrupt vector table


Exception Table

The exception table is an array indexed by exception number
(0, 1, 2, ..., n-1); each entry holds a pointer to the code for
the corresponding exception handler.

Exception Table (Excerpt)

 Example of interrupt vector table

Exception Number   Description                Exception Class
0                  Divide error               Fault
13                 General protection fault   Fault
14                 Page fault                 Fault
18                 Machine check              Abort
32-255             OS-defined                 Interrupt or trap


I/O Function

 An I/O device (e.g., disk controller) can exchange data


directly with the processor
 Just as the processor can read data from memory and
write data to memory, it can also read data from I/O
devices and write data to I/O devices

Direct Memory Access (DMA)

 In some cases it may be desirable to allow I/O


devices to exchange data directly with memory
 The processor will “grant permission” for this
exchange to take place
 Processor can then proceed to other work
(provided that it does not use the bus granted
to the I/O device)
 This operation is called Direct Memory Access
(DMA)


I/O Module

 I/O is functionally similar to memory, but usually much


slower
 Like memory can read and write, but a single I/O
module may handle more than one device
 Each interface of an I/O device is referred to as a port
and given a unique address
 I/O devices also have external connections
 Ports numbered 0 to M-1 for M ports
 Think of port as an address in I/O space
 I/O devices can also generate interrupts

IO systems

 IO peripherals


 Can be input or output peripherals


 Can be delay sensitive or throughput sensitive
 Can be controlled by a human or by a machine

 Questions to investigate:
 How does the CPU communicate with I/O
devices?
 How do I/O devices communicate with the CPU?
 How to transmit data efficiently, without errors?
 How to connect the I/O devices to the CPU?
 IO addressing
 According to address separation there are two
possibilities:
 Separate I/O and Memory address space
 Shared I/O and Memory address space


Separate memory and IO addresses

 Two separate address spaces:


 x86: memory: 0 – 4GB, I/O: 0 – 64kB
 Separate instructions for I/O and memory operations
 R0 ← MEM[0x60]: data movement from memory
 R0 ← IO[0x60]: data movement from I/O peripheral

 Alternative implementation
 The CPU has a shared bus for the I/O and the
memory
 A selector signal determines the target of the
communication
 More cost effective (less wires)
 Example: x86


Memory mapped IO

 The CPU has a single address space


 There are special memory addresses reserved for I/O
communication
 Memory read/write requests from/to these addresses are
answered by I/O devices
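A minimal C sketch of the idea, using a made-up device register address
(0x40001000 and the "UART" register layout are purely illustrative, not a
real mapping):

#include <stdint.h>

/* Hypothetical memory-mapped device registers (addresses are made up). */
#define UART_BASE   0x40001000u
#define UART_STATUS (*(volatile uint32_t *)(uintptr_t)(UART_BASE + 0x0))
#define UART_DATA   (*(volatile uint32_t *)(uintptr_t)(UART_BASE + 0x4))
#define TX_READY    0x1u

/* Ordinary load/store instructions reach the device: no special I/O opcodes. */
void mmio_putc(char c)
{
    while ((UART_STATUS & TX_READY) == 0)
        ;                      /* poll until the device can accept a byte */
    UART_DATA = (uint32_t)c;   /* a plain store is answered by the device */
}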

IO Address space on x86


Example IO address space on 8086

 I/O Space
 It is important to notice that these I/O addresses are
NOT memory-mapped addresses on the 80x86
machines.
 Special instructions (IN/OUT) are used to communicate
with the I/O devices.

Memory

 Memory consists of n words of equal length


numbered from 0 to n-1
 A word of data can be read or written
 Control signals specify R/W operation at
location specified by address
 Needs three sets of signal lines:
 Data
 Address
 Control (R/W and timing)


Interconnection Structures

 Computer modules
 Computer is a network of basic modules.
 There must be paths for connecting the modules.
 The collection of paths connecting the various modules is called
the interconnection structure. The design of this structure will
depend on the exchanges that must be made among modules.

Computer Modules

 Memory module:
 A memory module will
consist of N words of equal
length.
 Each word is assigned a
unique numerical address (0,
1, …, N - 1)
 A word of data can be read
from or written into the
memory.


 I/O module:
 Function similar to memory.
 An I/O module may send interrupt signals to the processor.
 Processor
 The processor reads in instructions and data, writes out data
after processing, and uses control signals to control the overall
operation of the system.
 The processor also receives interrupt signals.

Types of transfers

 The interconnection structure must support the


following types of transfers
 Memory ↔ processor: The processor reads/writes an
instruction or a unit of data from/to memory.
 I/O ↔ processor: The processor reads/sends data from/to
an I/O device via an I/O module.
 I/O to or from memory: an I/O module is allowed to
exchange data directly with memory, without going through
the processor, using direct memory access.


Bus interconnection

 A bus is a communication pathway connecting two or


more devices.
 Key characteristic of a bus: a shared transmission
medium. Multiple devices connect to the bus, and a
signal transmitted by any one device is available for
reception by all other devices attached to the bus. If two
devices transmit during the same time period, their
signals will overlap and become garbled. Thus, only one
device at a time can successfully transmit

 A bus consists of multiple lines. Each line is capable of


transmitting signals representing binary 1 and binary 0.
Several lines of a bus can be used to transmit binary
digits simultaneously (in parallel). For example, an 8-bit
unit of data can be transmitted over eight bus lines.
 Computer systems contain a number of different buses
that provide pathways between components.

32
02/03/2019

 A bus that connects major computer components


(processor, memory, I/O) is called a system bus.
 The most common computer interconnection structures
are based on the use of one or more system buses

System Bus

 Computers normally contain several buses


 The bus that interconnects major components
(processor, memory, I/O devices) is called the
system bus
 A system bus typically contains from 50 to
several hundred lines
 Lines are grouped
 Major groupings are data, address and control signals
 Power lines may not be shown in bus diagrams


 A communication pathway connecting two or more


devices
 Key characteristic is that it is a shared transmission medium
 Signals transmitted by any one device are available for
reception by all other devices attached to the bus
 If two devices transmit during the same time period their
signals will overlap and become garbled
 Typically consists of multiple communication lines
 Each line is capable of transmitting signals representing
binary 1 and binary 0

 Computer systems contain a number of different buses


that provide pathways between components at various
levels of the computer system hierarchy
 System bus
 A bus that connects major computer components (processor,
memory, I/O)
 The most common computer interconnection structures
are based on the use of one or more system buses


Bus Structure

 A system bus consists of from about fifty to hundreds of


separate lines and can be classified into three functional
groups: data, address, and control lines:
 Data bus: 32, 64, 128 lines (width → system performance)
 Address bus: 8, 16, 32 lines (width → max memory capacity)
 Control bus:
 Memory read/write
 I/O read/write
 Transfer ACK
 Bus request/grant
 Interrupt request/ACK
 Clock
 Reset

Multiple-Bus Hierarchies

 A great number of devices connected to one bus
degrades system performance (bottleneck).


Multiple-Bus Hierarchies

 High-performance architecture

Data Bus

 Data lines that provide a path for moving data among


system modules
 May consist of 32, 64, 128, or more separate lines
 The number of lines is referred to as the width of the
data bus
 The number of lines determines how many bits can be
transferred at a time
 The width of the data bus is a key factor in determining
overall system performance


Address Bus

 Used to designate the source or destination of the


data on the data bus
 If the processor wishes to read a word of data from
memory it puts the address of the desired word on the
address lines
 Width determines the maximum possible memory
capacity of the system
 Also used to address I/O ports
 The higher order bits are used to select a particular
module on the bus and the lower order bits select a
memory location or I/O port within the module

Control Bus

 Used to control the access and the use of the data and
address lines
 Because the data and address lines are shared by all
components there must be a means of controlling their
use
 Control signals transmit both command and timing
information among system modules
 Timing signals indicate the validity of data and address
information
 Command signals specify operations to be performed


Typical Control Signals

 Memory read/write signals


 I/O read/write signals
 Bus request/grant
 Transfer ACK (acknowledgement)
 Indicates that data have been accepted from or placed
on the bus
 Interrupt Request/ACK
 Clock signals synchronize operations
 Reset: initializes all modules

Basic Bus Operation

 Module that wishes to send data must


 Obtain use of the bus
 Then transfer data
 Module that requests data from another
module must
 Obtain use of the bus
 Transfer request to other module over bus
 Wait for data to be written to the bus


 What do buses look like?


 Parallel lines on circuit boards
 Ribbon cables
 Strip connectors on mother boards : e.g. PCI
 Sets of wires
 With VLSI, many components (such as L1
cache) are on the same chip as the processor
 An on-chip bus connects these components

Single Bus Problems

 Lots of devices on one bus leads to:


 Propagation delays
 Long data paths mean that co-ordination of bus use can adversely affect
performance
 Bottlenecks when aggregate data transfer approaches bus
capacity
 Most systems use multiple buses to overcome these
problems
 Hierarchical structure
 High-speed limited access buses close to the processor
 Slower-speed general access buses farther away from the
processor


Bus can be a bottleneck

 Can increase data rates and bus width, but


peripheral data rates are increasing rapidly
 Video and graphics controllers
 Network interfaces (1GB ethernet)
 High speed storage devices

Basic Elements of Bus Design

 These key elements serve to classify and


differentiate buses


Bus Types

 Dedicated (functional)
 Separate data & address lines
 Multiplexed (Time multiplexing)
 Shared lines
 Address valid or data valid control line
 Advantage - fewer lines
 Disadvantages
 More complex control
 Performance – cannot have address and data simultaneously on
bus
 Dedicated (physical)
 Bus connects subset of modules
 Example: all I/O devices on a slow bus
 Provides high throughput, but cost and complexity increase

Bus Arbitration

 Because only one unit at a time can successfully transmit


over the bus, some method of arbitration is needed.
 Two types of arbitration: centralized and distributed.
 Centralized scheme: a bus controller or arbiter, is
responsible for allocating time on the bus.
 Distributed scheme: No central controller, each module
contains access control logic and the modules act
together to share the bus.
 The device which initiates data transfer is called the
master, while the other device involves in the data
exchange is called the slave.


Timing

 Co-ordination of events on bus


 Synchronous
 Events determined by clock signals
 Control Bus includes clock line
 A single 1-0 is a bus cycle
 All devices can read clock line
 Usually sync on leading edge
 Usually a single cycle for an event

Synchronous Timing Diagram

 Synchronous diagram


Asynchronous Timing

 Occurrence of one event on bus follows and


depends on a previous event
 ACK signals are used to signal end of event
 Synchronous timing easier to implement and
test
 But all devices are limited to fixed clock rate
 Cannot take advantage of newer, faster devices
 Asynchronous timing allows mixture of slow and
fast devices to share bus comfortably

Timing of Asynchronous Bus Operations

 Asynchronous diagram


Bus Width

 Data width affects system performance


 Address width determines addressable memory
 Address: Width of address bus has an impact on
system capacity i.e. wider bus means greater the
range of locations that can be transferred.
 Data: width of data bus has an impact on system
performance i.e. wider bus means number of bits
transferred at one time.
 If the address bus is n bits wide, the CPU can address 2^n
memory cells (locations)

Example of Bus

 Address:
 If I/O, a value between 0000H and FFFFH is issued.
 If memory, it depends on the architecture:
 20 -bits (8086/8088)
 24 -bits (80286/80386SX)
 25 -bits (80386SL/SLC/EX)
 32 -bits (80386DX/80486/Pentium)
 36 -bits (Pentium Pro/II/III)


 Data:
 8 -bits (8088)
 16 -bits (8086/80286/80386SX/SL/SLC/EX)
 32 -bits (80386DX/80486/Pentium)
 64 -bits (Pentium/Pro/II/III)
 Control:
 Most systems have at least 4 control bus connections
(active low).
 MRDC (Memory ReaD Control), MWRC , IORC
(I/O Read Control), IOWC


Chapter 3

COMPUTER MEMORY
Part 1

Contents

 Master the concepts of hierarchical memory


organization.
 Understand how each level of memory
contributes to system performance, and how the
performance is measured.
 Master the concepts behind cache memory,
virtual memory, memory segmentation, paging
and address translation.


Hardware review

 Overview of computer systems: a CPU and memory connected by a
bus, with I/O devices (disks, network, USB, etc.) also on the bus
 CPU executes instructions; memory stores data
 To execute an instruction, the CPU must:
 fetch an instruction;
 fetch the data used by the instruction; and, finally,
 execute the instruction on the data…
 which may result in writing data back to memory

Byte-Oriented Memory Organization

 Conceptually, memory is a single, large array of bytes,


each with a unique address (index)
 The value of each byte in memory can be read and written
 Programs refer to bytes in memory by their addresses
 Domain of possible addresses = address space

 But not all values fit in a single byte… (e.g. 410)


 Many operations actually use multi-byte values
 We can store addresses as data to “remember” where other data is
in memory



Introduction

 Memory lies at the heart of the stored-program


computer.
 In previous chapters, we studied the components from
which memory is built and the ways in which memory
is accessed by various ISAs.
 In this chapter, we focus on memory organization. A
clear understanding of these ideas is essential for the
analysis of system performance.

 There are two kinds of main memory: random access


memory, RAM, and read-only-memory, ROM.
 There are two types of RAM, dynamic RAM (DRAM)
and static RAM (SRAM).
 Dynamic RAM consists of capacitors that slowly leak
their charge over time. Thus they must be refreshed
every few milliseconds to prevent data loss.
 DRAM is “cheap” memory owing to its simple design.


Characteristics of memory systems


 The memory system can be characterised with their
Location, Capacity, Unit of transfer,
Access method, Performance, Physical type, Physical
characteristics, Organisation
 Location
 Processor memory: memory such as registers included
within the processor.
 Internal memory: often termed main memory; the memory
the processor accesses directly.
 External memory: peripheral storage devices such as disk and
magnetic tape that are accessible to the processor via I/O
controllers.

 Capacity
 Word size: Capacity is expressed in terms of words or
bytes.The natural unit of organisation
 Number of words: Common word lengths are 8, 16, 32 bits
etc. or Bytes
 Unit of Transfer
 Internal: For internal memory, the unit of transfer is equal to
the number of data lines into and out of the memory module
 External: For external memory, they are transferred in block
which is larger than a word.
 Addressable unit : Smallest location which can be uniquely
addressed— Word internally— Cluster on Magnetic disks


 Access Method
 Sequential access. Example: tape
 Direct Access: Individual blocks of records have
unique address based on location. Access is
accomplished by jumping (direct access) to general
vicinity plus a sequential search to reach the final
location. Example disk
 Random access: example RAM
 Associative access: example cache

 Performance
 Access time
 Memory Cycle time
 Transfer Rate:
 Physical Types
 Semiconductor : RAM
 Magnetic : Disk & Tape
 Optical : CD & DVD
 Others


The Memory Hierarchy

 Generally speaking, faster memory is more expensive


than slower memory.
 To provide the best performance at the lowest cost,
memory is organized in a hierarchical fashion.
 Small, fast storage elements are kept in the CPU, larger,
slower main memory is accessed through the data bus.
 Larger, (almost) permanent storage in the form of disk
and tape drives is still further from the CPU.
 This storage organization can be thought of as a
pyramid:


Memory Hierarchy

An Example Memory Hierarchy

 Example levels and typical access times (smaller, faster, costlier per
byte toward the top; larger, slower, cheaper per byte toward the bottom):
 registers: < 1 ns
 on-chip L1 cache (SRAM): ~1 ns
 off-chip L2 cache (SRAM): 5-10 ns (scaled to human terms: ~1-2 min)
 main memory (DRAM): ~100 ns (~15-30 min)
 SSD: ~150,000 ns (~31 days)
 local secondary storage (local disks): ~10,000,000 ns = 10 ms (~66 months = 1.3 years)
 remote secondary storage (distributed file systems, web servers): 1-150 ms (~1-15 years)
(The human-scale column rescales each latency as if a register access took
roughly 5-10 seconds.)


Example Memory Hierarchy

 What each level holds (L0 is closest to the CPU; storage devices get
larger, slower, and cheaper per byte toward the bottom):
 L0: Registers - CPU registers hold words retrieved from the L1 cache.
 L1: L1 cache (SRAM) - holds cache lines retrieved from the L2 cache.
 L2: L2 cache (SRAM) - holds cache lines retrieved from the L3 cache
(or from main memory in systems without an L3).
 L3: L3 cache (SRAM) - holds cache lines retrieved from main memory.
 L4: Main memory (DRAM) - holds disk blocks retrieved from local disks.
 L5: Local secondary storage (local disks)
 L6: Remote secondary storage (distributed file systems, Web servers)


Intel Core i7 Cache Hierarchy


 Processor package with four cores (Core 0 ... Core 3), each with its own
registers, L1 caches, and L2 cache; all cores share the L3 cache and main
memory.
 Block size: 64 bytes for all caches
 L1 i-cache and d-cache: 32 KiB, 8-way, access: 4 cycles
 L2 unified cache: 256 KiB, 8-way, access: 11 cycles
 L3 unified cache (shared by all cores): 8 MiB, 16-way, access: 30-40 cycles

Speed of memory operations

 Memory is a serious bottleneck of von Neumann
computers, because it is slow.


 Programs do not refer to memory addresses randomly


 Memory addresses referenced by the programs show a
special pattern → we can utilize it!
 Locality of reference:
 Temporal: a memory content referenced will be referenced again
in the near future
 Spatial: if a memory content has been referenced, its neighborhood
will be referenced as well in the near future
 Algorithmic: some algorithms (like traversing linked lists) refer to
the memory in a systematic way
 Examples:
 Media players:
 Spatial locality: yes, temporal locality: no
 A "for" loop in the C language:
 Both temporal and spatial locality hold
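A small C illustration of the loop case (the array name and size are
arbitrary): the loop body is fetched repeatedly (temporal locality), and the
elements are touched at consecutive addresses (spatial locality).

int sum_array(const int *a, int n)
{
    int sum = 0;                 /* 'sum' and 'i' are reused every iteration */
    for (int i = 0; i < n; i++)
        sum += a[i];             /* neighbouring elements, accessed in order */
    return sum;
}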

Processor-Memory Gap

 1989: first Intel CPU with cache on chip
 1998: Pentium III has two cache levels on chip
 Processor performance has grown about 55%/year (2X/1.5yr), while
DRAM performance has grown only about 7%/year (2X/10yrs), so the
processor-memory performance gap grows by about 50%/year.

10
02/03/2019

General Cache Concept

 Example cache: a smaller, faster, more expensive memory that caches a subset of the blocks of a larger, slower, cheaper memory
 The larger memory is viewed as partitioned into fixed-size "blocks"; data is copied between the two in block-sized transfer units

Cache memory

 Small amount of fast memory


 Sits between normal main memory and CPU
 May be located on CPU chip or module
 What we are going to cover are:
 Cache organization:
 How to store data in the cache in an efficient way
 Cache content management:
 When to put data into the cache and when not
 What to evict from the cache when we want to put new data there

11
02/03/2019

Cache Memories

 Cache memories are small, fast SRAM-based


memories managed automatically in hardware
 Hold frequently accessed blocks of main memory
 CPU looks first for data in cache
 Typical system structure: the cache memory sits on the CPU chip next to the register file and ALU; the bus interface connects the CPU chip over the system bus and I/O bridge to the memory bus and main memory

Cache and Main Memory

12
02/03/2019

General Cache Organization (S, E, B)

 Model
 S = 2^s sets
 E = 2^e lines per set
 B = 2^b bytes per cache block (the data); each line also stores a valid bit and a tag
 Cache size: C = S x E x B data bytes

Cache Read

 The address of the word is split into three fields: tag (t bits), set index (s bits), block offset (b bits)
 Locate the set using the set index bits
 Check whether any line in that set has a matching tag
 Yes + line valid: hit
 Locate the data, which begins at the block offset within the line

13
02/03/2019

General Cache Mechanics

 Overview: the cache is a smaller, faster, more expensive memory that caches a subset of the blocks (a.k.a. lines) of main memory
 Main memory is larger, slower, and cheaper, and is viewed as partitioned into "blocks" or "lines"
 Data is copied between cache and memory in block-sized transfer units

General Cache Concepts: Hit

 1 - Data request: the CPU needs the data in block b (e.g., a request for block 14)
 Block b is in the cache: hit!
 The data is returned to the CPU

14
02/03/2019

General Cache Concepts: Miss

 The CPU needs the data in block b (e.g., a request for block 12)
 Block b is not in the cache: miss!
 Block b is fetched from memory
 Block b is stored in the cache
 Placement policy: determines where b goes
 Replacement policy: determines which block gets evicted (the victim)
 The data is then returned to the CPU

Why Caches Work

 Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently
 Temporal locality: recently referenced items are likely to be referenced again in the near future
 Spatial locality: items with nearby addresses tend to be referenced close together in time
 How do caches take advantage of this?

15
02/03/2019

Operations of cache

 When the CPU needs to access memory, the cache is examined first. If the word is found in the cache, it is read from the cache; if the word is not found, main memory is accessed to read it, and the block containing the word just accessed is then transferred from main memory to cache memory.
 The cache connects to the processor via data, control, and address lines. The data and address lines also attach to data and address buffers, which attach to the system bus from which main memory is reached.
 When a cache hit occurs, the data and address buffers are disabled and communication is only between processor and cache, with no system bus traffic. When a cache miss occurs, the desired word is first read into the cache and then transferred from cache to processor. In the latter case, the cache is physically interposed between the processor and main memory for all data, address, and control lines.

16
02/03/2019

Cache Operation – Overview

 The CPU generates the read address (RA) of a word to be read.
 Check whether the block containing RA is in the cache.
 If present, get the word from the cache (fast) and return it.
 If not present, access and read the required block from main memory into the cache.
 Allocate a cache line for this newly fetched block.
 Load the block into the cache line and deliver the word to the CPU.
 The cache includes tags to identify which block of main memory is in each cache slot.

17
02/03/2019

Elements of Cache Design

 Addressing
 Size
 Mapping Function
 Replacement Algorithm
 Write Policy
 Block Size
 Number of Caches

Cache Addressing

 Where does cache sit?


 Between processor and virtual memory management unit (MMU)
 Between MMU and main memory
 Logical cache (virtual cache) stores data using virtual addresses
 Processor accesses the cache directly, without going through the MMU
 Cache access is faster, since it happens before MMU address translation
 Virtual addresses use the same address space for different applications
 Must flush the cache on each context switch
 Physical cache stores data using main memory physical addresses

18
02/03/2019

Cache size

 Size of the cache to be small enough so that the overall


average cost per bit is close to that of main memory
alone and large enough so that the overall average access
time is close to that of the cache alone
 The larger the cache, the larger the number of gates
involved in addressing the cache
 Large caches tend to be slightly slower than small ones –
even when built with the same integrated circuit
technology and put in the same place on chip and circuit
board.
 The available chip and board area also limits cache size.

Mapping function

 The transformation of data from main memory to cache


memory is referred to as memory mapping process
 Because there are fewer cache lines than main memory
blocks, an algorithm is needed for mapping main
memory blocks into cache lines.
 Three mapping techniques are in common use: direct, associative, and set associative. All three are illustrated using the example parameters that follow.

19
02/03/2019

Information about examples

 Cache of 64 KByte
 Cache block of 4 bytes → the cache holds 16K (2^14) lines of 4 bytes each
 16 MByte main memory
 24-bit address (2^24 = 16M)
 Thus, for mapping purposes, we can consider main memory to consist of 4M blocks of 4 bytes each.

Block Number | Block Offset

20
02/03/2019

Cache organization

 Data units in the cache are called blocks
 Size of a block = 2^L bytes
 Lower L bits of a memory address: position inside the block
 Upper bits: number (ID) of the block
 Additional information stored in the cache with each block:
 Cache tag (which block of system memory is stored here)
 Valid bit: if = 1, this cache block stores valid data
 Dirty bit: if = 1, this cache block has been modified since it was brought into the cache
 The principal questions of cache organization are:
 How to store blocks in the cache
 To enable finding a block fast
 To make it simple and relatively cheap

21
02/03/2019

Checking for a Requested Address

Block Number

Direct Mapping

 Each block of main memory maps to only one


cache line
 i.e. if a block is in cache, it must be in one specific
place
 Address is in two parts
 Least Significant w bits identify unique word
 Most Significant s bits specify one memory block
 The MSBs are split into a cache line field r and a
tag of s-r (most significant)

22
02/03/2019

Direct Mapping Address Structure

Tag (s − r): 8 bits | Line or slot (r): 14 bits | Word (w): 2 bits

 24 bit address
 2 bit word identifier (4 byte block)
 22 bit block identifier
 8 bit tag (=22-14)
 14 bit slot or line
 No two blocks in the same line have the same Tag field
 Check contents of cache by finding line and checking Tag
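For the 64 KByte cache / 16 MByte memory example above, the 8/14/2 split can be checked with a small C sketch (the addresses used below are arbitrary and only chosen to show a line conflict):

#include <stdio.h>
#include <stdint.h>

/* 24-bit address, direct-mapped cache with 2^14 lines of 4 bytes:
 * tag = 8 bits, line = 14 bits, word = 2 bits. */
static void split_direct(uint32_t addr)
{
    unsigned word = addr & 0x3;            /* low 2 bits       */
    unsigned line = (addr >> 2) & 0x3FFF;  /* next 14 bits     */
    unsigned tag  = (addr >> 16) & 0xFF;   /* top 8 of 24 bits */
    printf("addr=0x%06X  tag=0x%02X  line=%u  word=%u\n",
           (unsigned)addr, tag, line, word);
}

int main(void)
{
    split_direct(0x16339C);   /* an arbitrary 24-bit address              */
    split_direct(0x00339C);   /* same line, different tag: would conflict */
    return 0;
}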

Direct Mapping from Cache to Main Memory

23
02/03/2019

Direct Mapping Cache Line Table

 Table mapping main memory blocks to cache lines (m = number of lines in the cache):

Cache line   Main memory blocks held
0            0, m, 2m, 3m, …, 2^s − m
1            1, m+1, 2m+1, …, 2^s − m + 1
…            …
m−1          m−1, 2m−1, 3m−1, …, 2^s − 1

Direct Mapping Cache Organization

24
02/03/2019

 Direct
cache
example

Direct Mapping Summary


 Address length = (s + w) bits
 Number of addressable units = 2^(s+w) words or bytes
 Block size = line size = 2^w words or bytes
 Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
 Number of lines in cache = m = 2^r
 Size of tag = (s − r) bits
 Simple
 Inexpensive
 Fixed location for a given block
 If a program repeatedly accesses 2 blocks that map to the same line, cache misses are very high

25
02/03/2019

Example: Direct Mapped Cache (E = 1)

 Direct mapped: one line per set (S = 2^s sets, E = 1)
 Assume: cache block size 8 bytes, so the address of an int splits into t tag bits, s set-index bits, and a 3-bit block offset (t bits | 0…01 | 100)
 Step 1: use the set-index bits to find the set
 Step 2: check the valid bit and compare the tag; valid + matching tag = hit
 Step 3: use the block offset to locate the data; the int (4 bytes) starts at that offset within the 8-byte block
 If the tag doesn't match: the old line is evicted and replaced

Direct-Mapped Cache Simulation

 Example of a direct mapped cache: M = 16 bytes (4-bit addresses), B = 2 bytes/block, S = 4 sets, E = 1 block/set
 Address bits: t = 1 (tag), s = 2 (set index), b = 1 (block offset)
 Address trace (reads, one byte per read):
 0 [0000₂]: miss; M[0-1] loaded into set 0
 1 [0001₂]: hit
 7 [0111₂]: miss; M[6-7] loaded into set 3
 8 [1000₂]: miss; M[8-9] replaces M[0-1] in set 0 (same set, different tag)
 0 [0000₂]: miss; M[0-1] loaded back into set 0
 Final contents: set 0 holds M[0-1] (tag 0), set 3 holds M[6-7] (tag 0); sets 1 and 2 stay empty
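A compact C sketch that replays this trace (parameters are hard-coded for the t = 1, s = 2, b = 1 example; the struct layout is my own, not the slides'):

#include <stdio.h>

#define SETS       4   /* S = 2^2 sets          */
#define SET_BITS   2
#define BLOCK_BITS 1   /* B = 2 bytes per block */

struct line { int valid; unsigned tag; };

int main(void)
{
    struct line cache[SETS] = { {0, 0} };
    unsigned trace[] = {0, 1, 7, 8, 0};            /* 4-bit read addresses */

    for (int i = 0; i < 5; i++) {
        unsigned addr = trace[i];
        unsigned set  = (addr >> BLOCK_BITS) & (SETS - 1);
        unsigned tag  = addr >> (BLOCK_BITS + SET_BITS);

        if (cache[set].valid && cache[set].tag == tag) {
            printf("%2u: hit  (set %u)\n", addr, set);
        } else {
            printf("%2u: miss (set %u), load block with tag %u\n", addr, set, tag);
            cache[set].valid = 1;                  /* direct mapped: overwrite the line */
            cache[set].tag   = tag;
        }
    }
    return 0;
}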

27
02/03/2019

Direct-Mapped Cache: Example

 Memory: 16 blocks with 4-bit block addresses 0000 … 1111; cache: 4 entries with 2-bit index 00 … 11
 Hash function: (block address) mod (# of blocks in cache)
 Each memory address maps to exactly one index in the cache
 Fast (and simpler) to find an address

 8 = 00 10 00
Direct-Mapped Cache Problem
 24=01 10 00
(t) (s) (k)
Memory Cache
Block Addr Block Data Index Tag Block Data
00 00 00 ??
00 01 01 ??
00 10 10
00 11 11 ??
01 00
01 01
01 10
01 11  What happens if we access the
10 00
10 01 following addresses?
10 10
 8, 24, 8, 24, 8, …?
10 11
11 00  Conflict in cache (misses!)
11 01
11 10
 Rest of cache goes unused
11 11
 Solution?

28
02/03/2019

Note that

 All locations in a single block of memory have the same


higher order bits (call them the block number), so the
lower order bits can be used to find a particular word in
the block.
 Within those higher-order bits, their lower-order bits
obey the modulo mapping given
above (assuming that the number of cache lines is a
power of 2), so they can be used to get the cache line for
that block
 The remaining bits of the block number become a tag,
stored with each cache line, and used to distinguish one
block from another that could fit into that same cache line

Associative Mapping

 It overcomes the disadvantage of direct mapping by


permitting each main memory block to be loaded into any
line of cache.
 Cache control logic interprets a memory address simply
as a tag and a word field
 Tag uniquely identifies block of memory
 Cache control logic must simultaneously examine every
line’s tag for a match which requires fully associative
memory
 very complex circuitry, complexity increases
exponentially with size
 Cache searching gets expensive

29
02/03/2019

Associative Mapping

 A main memory block can load into any line of


cache
 Memory address is interpreted as tag and word
 Tag uniquely identifies block of memory
 Every line’s tag is examined for a match
 Cache searching gets expensive

Associative Mapping from Cache to Main Memory

 Block of main memory can mapping to any line

30
02/03/2019

Fully Associative Cache Organization

Example

31
02/03/2019

Associative Mapping Address Structure

Tag: 22 bits | Word: 2 bits

 22 bit tag stored with each 32 bit block of data


 Compare tag field with tag entry in cache to check for hit
 Least significant 2 bits of the address identify which byte is required from the 4-byte (32-bit) data block
 e.g.
 Address Tag Data Cache line
 FFFFFC FFFFFC 24682468 3FFF

Associative Mapping Summary

 Address length = (s + w) bits
 Number of addressable units = 2^(s+w) words or bytes
 Block size = line size = 2^w words or bytes
 Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
 Number of lines in cache = undetermined (not fixed by the address format)
 Size of tag = s bits

32
02/03/2019

Set Associative Mapping

 It is a compromise between direct and associative


mappings that exhibits the strength and reduces the
disadvantages
 Cache is divided into v sets, each of which has k lines;
number of cache lines = vk
m = v × k
i = j modulo v
where i = cache set number; j = main memory block number; m = number of lines in the cache
 So a given block will map directly to a particular set, but
can occupy any line in that set (associative mapping is
used within the set)

 Cache control logic interprets a memory address simply


as three fields: tag, set, and word. The d set bits specify
one of v = 2^d sets. Thus the s bits of the tag and set fields specify
one of the 2^s blocks of main memory.
 The most common set associative mapping is 2 lines per
set, and is called two-way set associative. It significantly
improves hit ratio over direct mapping, and the
associative hardware is not too expensive.

33
02/03/2019

Set Associative Mapping

 Cache is divided into a number of sets


 Each set contains a number of lines
 A given block maps to any line in a given set
 e.g. Block B can be in any line of set i
 e.g. 2 lines per set
 2 way associative mapping
 A given block can be in one of 2 lines in only one set

Set Associative Mapping Example

 13 bit set number


 Block number in main memory is taken modulo 2^13
 000000, 00A000, 00B000, 00C000 … map to
same set

34
02/03/2019

Mapping From Main Memory to Cache: v Associative

Mapping From Main Memory to Cache: k-way Associative

35
02/03/2019

K-Way Set Associative Cache Organization

Set Associative Mapping Address Structure

Tag: 9 bits | Set: 13 bits | Word: 2 bits

 Use set field to determine cache set to look in


 Compare tag field to see if we have a hit
 Examples
 Address Tag Data Set number
 1FF 7FFC 1FF 12345678 1FFF
 001 7FFC 001 11223344 1FFF

36
02/03/2019

Two Way Set Associative Mapping Example

Set Associative Mapping Summary

 Address length = (s + w) bits
 Number of addressable units = 2^(s+w) words or bytes
 Block size = line size = 2^w words or bytes
 Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
 Number of lines per set = k
 Number of sets = v = 2^d
 Number of lines in cache = kv = k × 2^d
 Size of tag = (s − d) bits

37
02/03/2019

E-way Set Associative Cache (Here: E = 2)

 E = 2: two lines per set
 Assume: cache block size 8 bytes, so the address of a short int splits into t tag bits, set-index bits, and a 3-bit block offset (t bits | 0…01 | 100)
 Step 1: use the set-index bits to find the set
 Step 2: compare the tag against both lines of the set; valid + matching tag = hit
 Step 3: use the block offset to locate the data; the short int (2 bytes) starts at that offset
 No match: one line in the set is selected for eviction and replacement
 Replacement policies: random, least recently used (LRU), …

2-Way Set Associative Cache Simulation

 Example of a 2-way cache: M = 16 bytes (4-bit addresses), B = 2 bytes/block, S = 2 sets, E = 2 blocks/set
 Address bits: t = 2 (tag), s = 1 (set index), b = 1 (block offset)
 Address trace (reads, one byte per read):
 0 [0000₂]: miss; M[0-1] loaded into set 0 (tag 00)
 1 [0001₂]: hit
 7 [0111₂]: miss; M[6-7] loaded into set 1 (tag 01)
 8 [1000₂]: miss; M[8-9] loaded into the second line of set 0 (tag 10)
 0 [0000₂]: hit; M[0-1] is still resident, because each set holds two lines
 Final contents: set 0 holds M[0-1] (tag 00) and M[8-9] (tag 10); set 1 holds M[6-7] (tag 01)
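A corresponding C sketch of a 2-way lookup with a simple per-set LRU indicator (the LRU bookkeeping is an assumption for illustration; replacement policies are discussed on the next slides):

#include <stdio.h>
#include <string.h>

#define SETS 2   /* s = 1 */
#define WAYS 2   /* E = 2 */

struct line { int valid; unsigned tag; };
struct set  { struct line way[WAYS]; int lru; };   /* lru = index of least recently used way */

/* Returns 1 on hit, 0 on miss (filling the LRU way on a miss). */
static int cache_access(struct set *cache, unsigned addr)
{
    unsigned idx = (addr >> 1) & (SETS - 1);       /* b = 1, s = 1 */
    unsigned tag = addr >> 2;
    struct set *s = &cache[idx];

    for (int w = 0; w < WAYS; w++) {
        if (s->way[w].valid && s->way[w].tag == tag) {
            s->lru = 1 - w;                        /* the other way becomes LRU */
            return 1;
        }
    }
    int victim = s->lru;                           /* miss: evict the LRU way   */
    s->way[victim].valid = 1;
    s->way[victim].tag   = tag;
    s->lru = 1 - victim;
    return 0;
}

int main(void)
{
    struct set cache[SETS];
    unsigned trace[] = {0, 1, 7, 8, 0};
    memset(cache, 0, sizeof cache);

    for (int i = 0; i < 5; i++)
        printf("%2u: %s\n", trace[i], cache_access(cache, trace[i]) ? "hit" : "miss");
    return 0;
}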

39
02/03/2019

Replacement algorithm

 When all lines are occupied, bringing in a new block


requires that an existing line be overwritten.
 Direct mapping
 No choice possible with direct mapping
 Each block only maps to one line
 Replace that line
 Associative and Set Associative mapping
 Algorithms must be implemented in hardware for speed
 Least Recently used (LRU)
 replace that block in the set which has been in cache longest with no
reference to it

 Implementation: with 2-way set associative, have a USE bit


for each line in a set. When a block is read into cache, use
the line whose USE bit is set to 0, then set its USE bit to one
and the other line’s USE bit to 0.
 Probably the most effective method
 First in first out (FIFO)
 replace that block in the set which has been in the cache
longest
 Implementation: use a round-robin or circular buffer technique (keep track of which slot's "turn" is next)
 Least-frequently-used (LFU)
 replace that block in the set which has experienced the
fewest references or hits
 Implementation: associate a counter with each slot and
increment when used

40
02/03/2019

LRU

 Example: reference string B E E R B A R E B E A R, 3 page frames, LRU replacement
 A letter in the table marks a page being loaded into that frame (a page fault); a * marks a hit on the page already resident in that frame

Reference:  B  E  E  R  B  A  R  E  B  E  A  R
Frame 1:    B  .  .  .  *  .  .  E  .  *  .  .
Frame 2:    .  E  *  .  .  A  .  .  B  .  .  R
Frame 3:    .  .  .  R  .  .  *  .  .  .  A  .

 Result: 8 page faults
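A small C sketch that replays this reference string with 3 frames and LRU replacement; the bookkeeping is my own, but it reproduces the 8 page faults shown above:

#include <stdio.h>

#define FRAMES 3

int main(void)
{
    const char *ref = "BEERBAREBEAR";        /* the reference string above         */
    char frame[FRAMES]    = {0, 0, 0};       /* resident page per frame, 0 = empty */
    int  last_use[FRAMES] = {-1, -1, -1};    /* time of the most recent reference  */
    int  faults = 0;

    for (int t = 0; ref[t] != '\0'; t++) {
        char page = ref[t];
        int hit = -1, victim = 0;

        for (int f = 0; f < FRAMES; f++) {
            if (frame[f] == page) hit = f;
            if (last_use[f] < last_use[victim]) victim = f;   /* least recently used */
        }
        if (hit >= 0) {
            last_use[hit] = t;
        } else {                                 /* empty frames (last_use -1) fill first */
            frame[victim]    = page;
            last_use[victim] = t;
            faults++;
        }
    }
    printf("page faults: %d\n", faults);         /* prints 8 for this trace */
    return 0;
}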

LFU

 Example: the same reference string B E E R B A R E B E A R, 3 page frames, LFU replacement
 A letter marks a page being loaded into that frame (a page fault); a number marks a hit and shows the updated reference count of the resident page

Reference:  B  E  E  R  B  A  R  E  B  E  A  R
Frame 1:    B  .  .  .  2  .  .  .  3  .  .  .
Frame 2:    .  E  2  .  .  .  .  3  .  4  .  .
Frame 3:    .  .  .  R  .  A  R  .  .  .  A  R

 Result: 7 page faults

Cache write policy

 When a line is to be replaced, must update the original


copy of the line in main memory if any addressable unit in
the line has been changed
 If a block has been altered in cache, it is necessary to write
it back out to main memory before replacing it with
another block (writes are about 15% of memory
references)
 Must not overwrite a cache block unless main memory is
up to date
 I/O modules may be able to read/write directly to memory
 Multiple CPU’s may be attached to the same bus, each
with their own cache

55
02/03/2019

 Write Through
 All write operations are made to main memory as well as to cache,
so main memory is always valid
 Other CPU’s monitor traffic to main memory to update their
caches when needed
 This generates substantial memory traffic and may create a
bottleneck
 Anytime a word in cache is changed, it is also changed in main
memory
 Both copies always agree
 Generates lots of memory writes to main memory
 Multiple CPUs can monitor main memory traffic to keep local (to
CPU) cache up to date
 Lots of traffic
 Slows down writes
 Remember bogus write through caches!

 Write back
 When an update occurs, an UPDATE bit associated
with that slot is set, so when the block is replaced it is
written back first
 During a write, only change the contents of the cache
 Update main memory only when the cache line is to be
replaced
 Causes “cache coherency” problems -- different values
for the contents of an address are in the cache and the
main memory
 Complex circuitry to avoid this problem
 Accesses by I/O modules must occur through the cache
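A hedged sketch contrasting the two write policies on a single cache line (structure and function names are illustrative only; real controllers implement this in hardware):

#include <stdio.h>
#include <string.h>

#define BLOCK 4

struct cache_line {
    int valid, dirty;
    unsigned tag;
    unsigned char data[BLOCK];
};

static unsigned char main_memory[16];

/* Write one byte that is already cached in line l. */
static void cache_write(struct cache_line *l, unsigned addr, unsigned char value,
                        int write_through)
{
    l->data[addr % BLOCK] = value;
    if (write_through)
        main_memory[addr] = value;   /* write-through: memory is always up to date */
    else
        l->dirty = 1;                /* write-back: just mark the line as modified */
}

/* Called when the line is replaced. */
static void evict(struct cache_line *l, unsigned block_addr)
{
    if (l->dirty) {                  /* write-back only: flush the modified block  */
        memcpy(&main_memory[block_addr], l->data, BLOCK);
        l->dirty = 0;
    }
    l->valid = 0;
}

int main(void)
{
    struct cache_line line = { 1, 0, 0, {0} };

    cache_write(&line, 2, 0xAB, 0);  /* write-back: memory is stale for a while... */
    printf("before eviction: mem[2] = %02X\n", (unsigned)main_memory[2]);
    evict(&line, 0);                 /* ...until the dirty block is written back   */
    printf("after  eviction: mem[2] = %02X\n", (unsigned)main_memory[2]);
    return 0;
}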

56
02/03/2019

 Multiple caches still can become invalidated, unless


some cache coherency system is used. Such systems
include:
 Bus Watching with Write Through - other caches monitor
memory writes by other caches (using write through) and
invalidates their own cache line if a match
 Hardware Transparency - additional hardware links
multiple caches so that writes to one cache are made to the
others
 Non-cacheable Memory - only a portion of main memory
is shared by more than one processor, and it is non-
cacheable

57
02/03/2019

Chapter 3

COMPUTER MEMORY
Part 2
(Virtual memory)

Hmmm, How Does This Work?!


Process 1 Process 2 Process n

Solution: Virtual Memory (today and next lecture)

1
02/03/2019

A System Using Physical Addressing

 The CPU places the physical address (PA) of a word (e.g. 4) directly on the memory bus, and main memory returns the data word stored at that address
 Used in "simple" systems like embedded microcontrollers in devices such as cars, elevators, and digital picture frames

A System Using Virtual Addressing

 The CPU generates a virtual address (VA, e.g. 4100); the MMU on the CPU chip translates it into a physical address (PA, e.g. 4), which is placed on the memory bus to access main memory
 Used in all modern servers, laptops, and smartphones
 One of the great ideas in computer science

2
02/03/2019

Address Spaces

 Linear address space: ordered set of contiguous non-negative integer addresses {0, 1, 2, 3, …}
 Virtual address space: set of N = 2^n virtual addresses {0, 1, 2, 3, …, N−1}
 Physical address space: set of M = 2^m physical addresses {0, 1, 2, 3, …, M−1}

Why Virtual Memory (VM)?

 Uses main memory efficiently


 Use DRAM as a cache for parts of a virtual address
space
 Simplifies memory management
 Each process gets the same uniform linear address
space
 Isolates address spaces
 One process can’t interfere with another’s memory
 User program cannot access privileged kernel
information and code

3
02/03/2019

VM as a Tool for Caching

 Conceptually, virtual memory is an array of N contiguous bytes stored on disk.
 The contents of the array on disk are cached in physical memory (the DRAM cache)
 These cache blocks are called pages (size is P = 2^p bytes)
 Virtual pages (VPs) are stored on disk; each VP is either unallocated, cached in a physical page (PP) of DRAM, or uncached (on disk only)

DRAM Cache Organization

 DRAM cache organization driven by the enormous miss penalty


 DRAM is about 10x slower than SRAM

 Disk is about 10,000x slower than DRAM

 Time to load block from disk > 1ms (> 1 million clock cycles)

 CPU can do a lot of computation during that time

 Consequences
 Large page (block) size: typically 4 KB

 Linux “huge pages” are 2 MB (default) to 1 GB

 Fully associative

 Any VP can be placed in any PP

 Requires a “large” mapping function – different from cache memories

 Highly sophisticated, expensive replacement algorithms

 Too complicated and open-ended to be implemented in hardware

 Write-back rather than write-through

4
02/03/2019

Enabling Data Structure: Page Table


 A page table is an array of page table entries (PTEs) that maps virtual pages to physical pages.
 It is a per-process kernel data structure kept in DRAM
 Each PTE holds a valid bit plus either the physical page number (if the virtual page is cached in DRAM) or the page's disk address / null (if it is uncached or unallocated)

Page Hit

 Page hit: reference to a VM word that is in physical memory (a DRAM cache hit)
 The virtual address selects a PTE whose valid bit is set, so the word is read from the corresponding physical page in DRAM

5
02/03/2019

Page Fault

 Page fault: reference to a VM word that is not in physical memory (a DRAM cache miss)
 The virtual address selects a PTE whose valid bit is clear; the entry holds the page's disk address rather than a physical page number

Handling Page Fault

 A page miss causes a page fault (an exception)
 The page fault handler selects a victim page to be evicted (here VP 4)
 The requested page (here VP 3) is read from disk into the freed physical page, and the page table is updated
 The offending instruction is restarted: this time it is a page hit!
 Key point: waiting until the miss to copy the page to DRAM is known as demand paging

Allocating Pages

 Allocating a new page (VP 5) of virtual memory: space is created for it on disk and the corresponding PTE is made to point there; the page is not yet cached in DRAM

8
02/03/2019

Locality to the Rescue Again!

 Virtual memory seems terribly inefficient, but


it works because of locality.

 At any point in time, programs tend to access a


set of active virtual pages called the working
set
 Programs with better temporal locality will have
smaller working sets

 If (working set size < main memory size)
 Good performance for one process (after cold misses)
 If (working set size > main memory size)
 Thrashing: performance meltdown where pages are swapped in and out continuously

Motivation for Virtual Memory

 The physical main memory (RAM) is relatively limited in


capacity.
 It may not be big enough to store all the executing
programs at the same time.
 A program may need memory larger than the main
memory size, but the whole program doesn’t need to be
kept in the main memory at the same time.
 Virtual Memory takes advantage of the fact that at any
given instant of time, an executing program needs only a
fraction of the memory that the whole program occupies.
 The basic idea: Load only pieces of each executing
program which are currently needed.

9
02/03/2019

Virtual Memory - Objectives

 Allow program to be written without memory


constraints
 Program can exceed the size of the main memory
 Many Programs sharing DRAM Memory so that context
switches can occur
 Relocation: Parts of the program can be placed at
different locations in the memory instead of a big chunk
 Virtual Memory:
 Main Memory holds many programs running at same time
(processes)
 Use Main Memory as a kind of “cache” for disk

Characteristics of Paging and Segmentation

 Memory references are dynamically translated into physical


addresses at run time
 a process may be swapped in and out of main memory such

that it occupies different regions


 A process may be broken up into pieces (pages or segments) that
do not need to be located contiguously in main memory
 Hence: all pieces of a process do not need to be loaded in main
memory during execution
 computation may proceed for some time if the next instruction to be fetched (or the next data to be accessed) is in a piece located in main memory

10
02/03/2019

Real and Virtual Memory

 Concept real
and virtual
memory

Support Needed for Virtual Memory

 For virtual memory to be practical and effective :


 Hardware must support paging and segmentation
 Operating system must include software for
managing the movement of pages or segments
between secondary memory and main memory

11
02/03/2019

Paging of Memory

 Divide programs (processes) into equal sized, small


blocks, called pages.
 Divide the primary memory into equal sized, small
blocks called page frames.
 Allocate the required number of page frames to a
program.
 A program does not require continuous page frames!
 The operating system (OS) is responsible for:
 Maintaining a list of free frames.
 Using a page table to keep track of the mapping between pages
and page frames

Paging

 The term virtual memory is usually associated


with systems that employ paging
 Use of paging to achieve virtual memory was
first reported for the Atlas computer
 Each process has its own page table
 Each page table entry contains the frame number of
the corresponding page in main memory

12
02/03/2019

 Typically, each process has its own page table

 Each page table entry contains a present bit to indicate whether


the page is in main memory or not.
 If it is in main memory, the entry contains the frame number

of the corresponding page in main memory


 If it is not in main memory, the entry may contain the address

of that page on disk or the page number may be used to index


another table (often in the PCB) to obtain the address of that
page on disk

Paging

 A modified bit indicates if the page has been altered since it


was last loaded into main memory
 If no change has been made, the page does not have to be

written to the disk when it needs to be swapped out


 Other control bits may be present if protection is managed at
the page level
 a read-only/read-write bit

 protection level bit: kernel page or user page (more bits are

used when the processor supports more than 2 protection


levels)

13
02/03/2019

Page Table Structure

 Page tables are variable in length (depends on


process size)
 then must be in main memory instead of registers
 A single register holds the starting physical
address of the page table of the currently running
process

Address Translation in a Paging System

14
02/03/2019

Sharing Pages

 If we share the same code among different users, it is


sufficient to keep only one copy in main memory
 Shared code must be reentrant (ie: non self-modifying)
so that 2 or more processes can execute the same code
 If we use paging, each sharing process will have a page table whose entries point to the same frames: only one copy is in main memory
 But each user needs to have its own private data pages

Sharing Pages: a text editor

15
02/03/2019

Translation Lookaside Buffer

 Because the page table is in main memory, each


virtual memory reference causes at least two
physical memory accesses
 one to fetch the page table entry
 one to fetch the data
 To overcome this problem a special cache is set
up for page table entries
 called the TLB - Translation Lookaside Buffer
 Contains page table entries that have been most recently
used
 Works similar to main memory cache

 Given a logical address, the processor examines the


TLB
 If page table entry is present (a hit), the frame number is
retrieved and the real (physical) address is formed
 If page table entry is not found in the TLB (a miss), the
page number is used to index the process page table
 if present bit is set then the corresponding frame is accessed
 if not, a page fault is issued to bring in the referenced page in
main memory
 The TLB is updated to include the new page entry

16
02/03/2019

Use of a Translation Lookaside Buffer

TLB: further comments

 The TLB uses associative mapping hardware to simultaneously interrogate all TLB entries to find a match on the page number
 The TLB must be flushed each time a new process
enters the Running state
 The CPU uses two levels of cache on each virtual
memory reference
 first the TLB: to convert the logical address to the
physical address
 once the physical address is formed, the CPU then
looks in the cache for the referenced word

17
02/03/2019

Page Tables and Virtual Memory

 Most computer systems support a very large virtual


address space
 32 to 64 bits are used for logical addresses

 If (only) 32 bits are used with 4KB pages, a page table may have 2^20 entries
table may have 220 entries


 The entire page table may take up too much main
memory. Hence, page tables are often also stored in
virtual memory and subjected to paging
 When a process is running, part of its page table must

be in main memory (including the page table entry of


the currently executing page)

Multilevel Page Tables

 Since a page table will generally require several pages to be stored. One
solution is to organize page tables into a multilevel hierarchy
 When 2 levels are used (ex: 386, Pentium), the page number is split into
two numbers p1 and p2
 p1 indexes the outer page table (the directory) in main memory, whose entries point to a page containing page table entries, which is itself indexed by p2. Page tables, other than the directory, are swapped in and out as needed

18
02/03/2019

Segmentation

 Segmentation allows the programmer to view


memory as consisting of multiple address
spaces or segments
 Advantages:
 simplifies handling of growing data structures
 allows programs to be altered and recompiled
Independently
 lends itself to sharing data among processes
 lends itself to protection

Segmentation
 Typically, each process has its own segment table
 Similarly to paging, each segment table entry contains a
present bit and a modified bit
 If the segment is in main memory, the entry contains the
starting address and the length of that segment
 Other control bits may be present if protection and
sharing is managed at the segment level
 Logical to physical address translation is similar to
paging except that the offset is added to the starting
address (instead of being appended)

19
02/03/2019

Address Translation in a Segmentation System

Segmentation: comments

 In each segment table entry we have both the starting address and
length of the segment
 the segment can thus dynamically grow or shrink as needed

 address validity easily checked with the length field

 But variable length segments introduce external fragmentation


and are more difficult to swap in and out...
 It is natural to provide protection and sharing at the segment level
since segments are visible to the programmer (pages are not)
 Useful protection bits in segment table entry:
 read-only/read-write bit

 Supervisor/User bit

20
02/03/2019

Sharing in Segmentation Systems

 Segments are shared when entries in the segment


tables of 2 different processes point to the same
physical locations
 Ex: the same code of a text editor can be shared
by many users
 Only one copy is kept in main memory
 but each user would still need to have its own
private data segment

Sharing of Segments: text editor example

21
02/03/2019

Combined Segmentation and Paging

 To combine their advantages some processors and OS page the


segments.
 Several combinations exists. Here is a simple one
 Each process has:
 one segment table

 several page tables: one page table per segment

 The virtual address consist of:


 a segment number: used to index the segment table, whose entry gives the starting address of the page table for that segment
 a page number: used to index that page table to obtain the

corresponding frame number


 an offset: used to locate the word within the frame

Address Translation in a (simple) combined


Segmentation/Paging System

22
02/03/2019

Simple Combined Segmentation and Paging

 The Segment Base is the physical address of the page table of that
segment
 Present and modified bits are present only in page table entry
 Protection and sharing info most naturally resides in segment table
entry
 Ex: a read-only/read-write bit, a kernel/user bit...

Introduction

 Cache memory enhances performance by providing


faster memory access speed.
 Virtual memory enhances performance by providing
greater memory capacity, without the expense of adding
main memory.
 Instead, a portion of a disk drive serves as an extension
of main memory.
 If a system uses paging, virtual memory partitions main
memory into individually managed page frames, that
are written (or paged) to disk when they are not
immediately needed.

23
02/03/2019

 A physical address is the actual memory address of


physical memory.
 Programs create virtual addresses that are mapped to
physical addresses by the memory manager.
 Page faults occur when a logical address requires that a
page be brought in from disk.
 Memory fragmentation occurs when the paging process
results in the creation of small, unusable clusters of
memory addresses.

 Main memory and virtual memory are divided into


equal sized pages.
 The entire address space required by a process need not
be in memory at once. Some parts can be on disk, while
others are in main memory.
 Further, the pages allocated to a process do not need to
be stored contiguously -- either on disk or in memory.
 In this way, only the needed pages are in memory at any
time, the unnecessary pages are in slower disk storage.

24
02/03/2019

 Information concerning the location of each page,


whether on disk or in memory, is maintained in a data
structure called a page table (shown below).
 There is one page table for each active process.

 When a process generates a virtual address, the


operating system translates it into a physical memory
address.
 To accomplish this, the virtual address is divided into
two fields: A page field, and an offset field.
 The page field determines the page location of the
address, and the offset indicates the location of the
address within the page.
 The logical page number is translated into a physical
page frame through a lookup in the page table.

25
02/03/2019

 If the valid bit is zero in the page table entry for the logical
address, this means that the page is not in memory and must be
fetched from disk.
 This is a page fault.
 If necessary, a page is evicted from memory and is replaced
by the page retrieved from disk, and the valid bit is set to 1.
 If the valid bit is 1, the virtual page number is replaced by the
physical frame number.
 The data is then accessed by adding the offset to the physical
frame number.

 As an example, suppose a system has a virtual address space of 8K, each page has 1K, and a physical address space of 4K bytes. The system uses byte addressing.
 We have 2^13 / 2^10 = 2^3 = 8 virtual pages.
 A virtual address has 13 bits (8K = 2^13), with 3 bits for the page field and 10 for the offset, because the page size is 1024.
 A physical memory address requires 12 bits: the first two bits for the page frame and the trailing 10 bits for the offset.

26
02/03/2019

Virtual memory

 Suppose we have the page table shown below.
 What happens when the CPU generates address 5459 (decimal) = 1010101010011₂? It lies in page 5, since the first 3 bits are 101.
 The address 1010101010011₂ is converted to physical address 010101010011₂, because the page field 101 is replaced by frame number 01 (frame #1) through a lookup in the page table.
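A minimal C sketch of this translation (13-bit virtual addresses, 1 KB pages; the page-table contents below are assumed, except the page 5 → frame 1 entry used in the worked example):

#include <stdio.h>

#define PAGE_BITS 10                 /* 1 KB pages                 */
#define NUM_PAGES 8                  /* 8 KB virtual address space */

/* Valid flag + frame number per virtual page. The contents are assumptions,
 * except page 5 -> frame 1, which is the entry the worked example uses. */
struct pte { int valid; unsigned frame; };
static struct pte page_table[NUM_PAGES] = {
    {1, 2}, {0, 0}, {1, 0}, {0, 0}, {0, 0}, {1, 1}, {0, 0}, {1, 3}
};

int main(void)
{
    unsigned va     = 5459;                       /* 1010101010011 in binary */
    unsigned page   = va >> PAGE_BITS;            /* top 3 bits -> page 5    */
    unsigned offset = va & ((1u << PAGE_BITS) - 1);

    if (page_table[page].valid) {
        unsigned pa = (page_table[page].frame << PAGE_BITS) | offset;
        printf("VA %u -> page %u, offset %u -> PA %u\n", va, page, offset, pa);
    } else {
        printf("VA %u -> page fault on page %u\n", va, page);
    }
    return 0;
}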

27
02/03/2019

 What happens when the CPU generates address 1000000000100₂? (The first 3 bits are 100, so it lies in page #4.)

 We said earlier that effective access time (EAT) takes


all levels of memory into consideration.
 Thus, virtual memory is also a factor in the calculation,
and we also have to consider page table access time.
 Suppose a main memory access takes 200ns, the page
fault rate is 1%, and it takes 10ms to load a page from
disk. We have:
EAT = 0.99(200ns + 200ns) + 0.01(10ms) = 100,396ns.
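The same calculation as a tiny C check (all numbers come from the example above):

#include <stdio.h>

int main(void)
{
    double mem_access = 200.0;       /* ns: one physical memory access     */
    double fault_rate = 0.01;
    double fault_time = 10e6;        /* ns: 10 ms to load a page from disk */

    /* Every non-faulting access reads the page table and then the word. */
    double eat = (1.0 - fault_rate) * (mem_access + mem_access)
               + fault_rate * fault_time;

    printf("EAT = %.0f ns\n", eat);  /* prints 100396 */
    return 0;
}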

28
02/03/2019

 Even if we had no page faults, the EAT would be 400ns


because memory is always read twice: First to access
the page table, and second to load the page from
memory.
 Because page tables are read constantly, it makes sense
to keep most recent page lookup values in a special
cache called a translation look-aside buffer (TLB).
 TLBs are a special associative cache that stores the
mapping of virtual pages to physical pages.

Example of virtual address

29
02/03/2019

Flowchart of virtual address

 Another approach to virtual memory is the use of


segmentation.
 Instead of dividing memory into equal-sized pages,
virtual address space is divided into variable-length
segments, often under the control of the programmer.
 A segment is located through its entry in a segment
table, which contains the segment’s memory location
and a bounds limit that indicates its size.
 After a page fault, the operating system searches for a
location in memory large enough to hold the segment
that is retrieved from disk.

30
02/03/2019

 Both paging and segmentation can cause fragmentation.


 Paging is subject to internal fragmentation because a
process may not need the entire range of addresses
contained within the page. Thus, there may be many
pages containing unused fragments of memory.
 Segmentation is subject to external fragmentation,
which occurs when contiguous chunks of memory
become broken up as segments are allocated and
deallocated over time.

 Large page tables are cumbersome and slow, but with


its uniform memory mapping, page operations are fast.
Segmentation allows fast access to the segment table,
but segment loading is labor-intensive.
 Paging and segmentation can be combined to take
advantage of the best features of both by assigning
fixed-size pages within variable-sized segments.
 Each segment has a page table. This means that a
memory address will have three fields, one for the
segment, another for the page, and a third for the offset.

31
02/03/2019

Real-World Example

 The Pentium architecture supports both paging and


segmentation, and they can be used in various
combinations including unpaged unsegmented,
segmented unpaged, unsegmented paged, and
segmented paged memory (pp263-264).
 The processor supports two levels of cache (L1 and L2),
both having a block size of 32 bytes.
 The L1 cache is next to the processor, and the L2 cache
sits between the processor and memory.
 The L1 cache is in two parts: an instruction cache (I-cache) and a data cache (D-cache).

Real world example

32
02/03/2019

Chapter 3

COMPUTER MEMORY
Part 3
Internal memory

Basic Principles of Computers

 Virtually all modern computer


designs are based on the von
Neumann architecture principles:
 Data and instructions are stored in a
single read/write memory.
 The contents of this memory are addressable by location, without regard to what is stored there.
 Instructions are executed sequentially
(from one instruction to the next)
unless the order is explicitly modified

1
02/03/2019

Many Different Technologies

Internal and External Memories

2
02/03/2019

Main Memory Model

Byte-Oriented Memory Organization

 Conceptually, memory is a single, large array of bytes,


each with a unique address (index)
 The value of each byte in memory can be read and written
 Programs refer to bytes in memory by their addresses
 Domain of possible addresses = address space
 But not all values fit in a single byte… (e.g. 410)
 Many operations actually use multi-byte values
 We can store addresses as data to “remember” where other data is
in memory

•••

3
02/03/2019

Word-Oriented Memory Organization

 Addresses still specify locations of bytes in memory
 Addresses of successive words differ by the word size (in bytes): e.g. 4 (32-bit words) or 8 (64-bit words)
 Address of word 0, 1, 2, … is 0x00, 0x04, 0x08, … with 32-bit words (0x00, 0x08, 0x10, … with 64-bit words)
 Address of a word = address of its first byte; in general, the address of any chunk of memory is given by the address of its first byte
 Alignment: words are normally stored at addresses that are multiples of the word size

Byte Ordering

 How should bytes within a word be ordered in memory?


 Example: store the 4-byte (32-bit) int:
0x a1 b2 c3 d4
 By convention, ordering of bytes called endianness
 The two options are big-endian and little-endian
 Based on Gulliver’s Travels: tribes cut eggs on different sides
(big, little)

4
02/03/2019

 Big-endian (SPARC, z/Architecture)


 Least significant byte has highest address
 Little-endian (x86, x86-64)
 Least significant byte has lowest address
 Bi-endian (ARM, PowerPC)
 Endianness can be specified as big or little

 Example: 4-byte data 0xa1b2c3d4 stored at address 0x100

                0x100  0x101  0x102  0x103
Big-Endian:      a1     b2     c3     d4
Little-Endian:   d4     c3     b2     a1

Byte Ordering Examples: decimal 12345 = binary 0011 0000 0011 1001 = hex 0x3039
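A small C check of the two orderings on whatever machine runs it (it prints the bytes of 0xa1b2c3d4 from the lowest address upward):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t value = 0xa1b2c3d4;
    unsigned char *bytes = (unsigned char *)&value;

    printf("bytes at increasing addresses: ");
    for (int i = 0; i < 4; i++)
        printf("%02x ", bytes[i]);   /* d4 c3 b2 a1 on little-endian (x86) */
    printf("\n");                    /* a1 b2 c3 d4 on big-endian machines */

    printf("this machine is %s-endian\n", bytes[0] == 0xd4 ? "little" : "big");
    return 0;
}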

5
02/03/2019

Traditional Bus Structure Connecting CPU and Memory

 A bus is a collection of parallel wires that carry


address, data, and control signals.
 Buses are typically shared by multiple devices.
CPU chip

Register file

ALU

System bus Memory bus

I/O Main
Bus interface
bridge memory

Memory Read Transaction (1)

 CPU places address A on the memory bus.

Register file Load operation: movq A, %rax

ALU
%rax

Main memory
I/O bridge 0
A

Bus interface
x A

6
02/03/2019

Memory Read Transaction (2)

 Main memory reads A from the memory bus,


retrieves word x, and places it on the bus.

Register file Load operation: movq A, %rax

ALU
%rax

Main memory
I/O bridge x 0

Bus interface x A

Memory Read Transaction (3)

 CPU read word x from the bus and copies it into


register %rax.

Register file Load operation: movq A, %rax

ALU
%rax x

Main memory
I/O bridge 0

Bus interface
x A

7
02/03/2019

Memory Write Transaction (1)

 CPU places address A on bus. Main memory


reads it and waits for the corresponding data
word to arrive.
Register file Store operation: movq %rax, A

ALU
%rax y

Main memory
I/O bridge 0
A
Bus interface A

Memory Write Transaction (2)

 CPU places data word y on the bus.

Register file Store operation: movq %rax, A

ALU
%rax y

Main memory
I/O bridge 0
y

Bus interface
A

8
02/03/2019

Memory Write Transaction (3)

 Main memory reads data word y from the bus


and stores it at address A.
Register file Store operation: movq %rax, A

ALU
%rax y

main memory
I/O bridge 0

Bus interface y A

Physical types

 Semiconductor
 RAM
 Magnetic
 Disk & Tape
 Optical
 CD & DVD
 Others
 Bubble
 Hologram

9
02/03/2019

Storing data in main memory

 Possible types of memories:


 ROM: read-only
 Classical ROM: the content is stored during the manufacturing process
 PROM: one-time programmable
 EPROM: can be erased using ultraviolet light
 Etc.
 SRAM: Static Random Access Memory
 Can be read and modified any time
 It preserves data while power supply is present
 DRAM: Dynamic Random Access Memory
 Can be read and modified any time
 It forgets its content! Needs to be refreshed periodically

Nonvolatile Memories
 DRAM and SRAM are volatile memories
 Lose information if powered off.
 Nonvolatile memories retain value even if powered off
 Read-only memory (ROM): programmed during production
 Programmable ROM (PROM): can be programmed once
 Erasable PROM (EPROM): can be bulk erased (UV, X-ray)
 Electrically erasable PROM (EEPROM): electronic erase capability
 Flash memory: EEPROMs. with partial (block-level) erase capability
 Wears out after about 100,000 erasings
 Uses for Nonvolatile Memories
 Firmware programs stored in a ROM (BIOS, controllers for disks, network
cards, graphics accelerators, security subsystems,…)
 Solid state disks (replace rotating disks in thumb drives, smart phones, mp3
players, tablets, laptops,…)
 Disk caches

10
02/03/2019

Random-Access Memory (RAM)


 Key features
 RAM is traditionally packaged as a chip
 Basic storage unit is normally a cell (one bit per cell)
 Multiple RAM chips form a memory
 Static RAM (SRAM)
 Each cell stores a bit with a four- or six-transistor circuit
 Retains value indefinitely, as long as it is kept powered
 Relatively insensitive to electrical noise (EMI), radiation, etc.
 Faster and more expensive than DRAM
 Dynamic RAM (DRAM)
 Each cell stores bit with a capacitor; transistor is used for access
 Value must be refreshed every 10-100 ms
 More sensitive to disturbances (EMI, radiation,…) than SRAM
 Slower and cheaper than SRAM

SRAM vs DRAM Summary

 EDC = error detection and correction


 To cope with noise, etc.

11
02/03/2019

 How to create a memory system for the computer


 It must be: cheap, large, low latency, high throughput

DRAM

 Bits stored as charge in capacitors


 Charges leak
 Need refreshing even when powered
 Simpler construction
 Smaller per bit
 Less expensive
 Need refresh circuits
 Slower
 Main memory
 Essentially analogue
 Level of charge determines value

12
02/03/2019

Static RAM

 Bits stored as on/off switches


 No charges to leak
 No refreshing needed when powered
 More complex construction
 Larger per bit
 More expensive
 Does not need refresh circuits
 Faster
 Cache
 Digital
 Uses flip-flops

SRAM v DRAM

 Both volatile
 Power needed to preserve data
 Dynamic cell
 Simpler to build, smaller
 More dense
 Less expensive
 Needs refresh
 Larger memory units
 Static
 Faster
 Cache

13
02/03/2019

Summary: DRAM vs. SRAM

 DRAM (Dynamic RAM):
 Used mostly in main memory
 1 capacitor + 1 transistor per bit
 Needs refresh every 4-8 ms (≈ 5% of total time)
 Read is destructive (needs write-back)
 Access time < cycle time (because of the write-back)
 Density (25-50):1 compared to SRAM
 Address lines multiplexed (pins are scarce!)
 SRAM (Static RAM):
 Used mostly in caches (I, D, TLB, BTB)
 1 flip-flop (4-6 transistors) per bit
 Read is not destructive
 Access time = cycle time
 Speed (8-16):1 compared to DRAM
 Address lines not multiplexed (high speed of decoding important)

Chip Organization

 Chip capacity (= number of data bits)


 tends to quadruple
 1K, 4K, 16K, 64K, 256K, 1M, 4M, …
 In early designs, each data bit belonged to a different
address (x1 organization)
 Starting with 1Mbit chips, wider chips (4, 8, 16, 32 bits
wide) began to appear
 Advantage: Higher bandwidth
 Disadvantage: More pins, hence more expensive packaging

14
02/03/2019

DRAM bank

 Structure:
DRAM cells in a 2D grid
 Each cell in a row shares
the same word line
 Each cell in a column
shares the same bit line
 Reading:
 The row decoder selects (activates) a row
 The sense amplifiers detect and store the bits of the row
 The column multiplexer selects the desired column from the row
 Two-phase operations:
 To reduce the width of the address bus
 Address bus: row address → wait → address bus: column
address→ data bus: the desired data

16 X 1 as 4 X 4 Array

 Two decoders
 Row
 Column
 Address just broken
up
 Not visible from
outside

15
02/03/2019

DRAM Logical Diagram

Conventional DRAM Organization

 d x w DRAM:
 dw total bits organized as d supercells of size w bits

16
02/03/2019

Reading DRAM Supercell (2,1)

 Step 1(a): row access strobe (RAS) selects row 2


 Step 1(b): row 2 copied from DRAM array to row buffer

Reading DRAM Supercell (2,1)

 Step 2(a): column access strobe (CAS) selects column 1


 Step 2(b): supercell (2,1) copied from buffer to data
lines, and eventually back to the CPU

17
02/03/2019

Memory Modules

 Combine some of chip to create memory module

Enhanced DRAMs
 Basic DRAM cell has not changed since its invention in
1966
 Commercialized by Intel in 1970
 DRAMs with better interface logic and faster I/O:
 Synchronous DRAM (SDRAM)
 Uses a conventional clock signal instead of asynchronous control
 Allows reuse of the row addresses (e.g., RAS, CAS, CAS, CAS)
 Double data-rate synchronous DRAM (DDR SDRAM)
 DDR1 : twice as fast
 DDR2 : four times as fast
 DDR3 : eight times as fast

18
02/03/2019

Enhanced DRAMs

DRAM chips

19
02/03/2019

Organisation in detail

 A 16Mbit chip can be organised as 1M of 16 bit


words
 A bit per chip system has 16 lots of 1Mbit chip
with bit 1 of each word in chip 1 and so on
 A 16Mbit chip can be organised as a 2048 x 2048 x 4-bit array
 Reduces the number of address pins
 Multiplex row address and column address
 11 pins to address (2^11 = 2048)
 Adding one more pin doubles the range of values, so x4 capacity

Typical 16 Mb DRAM (4M x 4)

20
02/03/2019

Packaging

DRAM MEMORY MODULE

 A memory module consists of DRAM chips


 Command lines, bank selection lines, address lines:
shared
 Data lines: concatenated

 Each chip receives all commands


 Effect:
 Throughput increases 8x
 Delay: the same

21
02/03/2019

Memory Interleaving
 Goal: Try to take advantage of bandwidth of multiple
DRAMs in memory system
 Memory address A is converted into (b,w) pair, where
 b = bank index
 w = word index within bank
 Logically a wide memory
 Accesses to B banks staged over time to share internal resources
such as memory bus
 Interleaving can be on
 Low-order bits of address (cyclic)
 b = A mod B, w = A div B
 High-order bits of address (block)
 Combination of the two (block-cyclic)
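A C sketch of the two simple schemes for B banks (the bank count and bank size are assumptions for illustration):

#include <stdio.h>

#define BANKS          4
#define WORDS_PER_BANK 8   /* assumed bank size, needed for the high-order scheme */

/* Low-order (cyclic) interleaving: consecutive addresses go to consecutive banks. */
static void cyclic(unsigned a, unsigned *bank, unsigned *word)
{
    *bank = a % BANKS;     /* b = A mod B */
    *word = a / BANKS;     /* w = A div B */
}

/* High-order (block) interleaving: each bank holds one contiguous block of addresses. */
static void block(unsigned a, unsigned *bank, unsigned *word)
{
    *bank = a / WORDS_PER_BANK;
    *word = a % WORDS_PER_BANK;
}

int main(void)
{
    for (unsigned a = 0; a < 8; a++) {
        unsigned cb, cw, bb, bw;
        cyclic(a, &cb, &cw);
        block(a, &bb, &bw);
        printf("A=%u  cyclic:(bank %u, word %u)  block:(bank %u, word %u)\n",
               a, cb, cw, bb, bw);
    }
    return 0;
}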

Low-order Bit Interleaving

22
02/03/2019

Mixed Interleaving

 Memory address register is 6 bits wide
 Most significant 2 bits give the bank address
 Next 3 bits give the word address within the bank
 LSB gives (parity of) the module within the bank
 6 = 000110₂ = (00, 011, 0) = (0, 3, 0)
 41 = 101001₂ = (10, 100, 1) = (2, 4, 1)

23
02/03/2019

Chapter 3

COMPUTER MEMORY
Part 4
Storage devices

Data storage devices

 All computers have data storage devices


 Their performance is important for the overall
performance of the whole system
 They have a crucial role in virtual memory management
 We are going to cover:
 HDD: Hard disk drives
 SSD: Solid state drives
 There are others as well:
 Optical drives: similar to HDDs at several aspects
 Pendrives: are based on the same flash memory technology as
SSDs
 Etc

1
02/03/2019

Hard disk drives

 First HDD:
 1956, IBM (RAMAC 305)
 Features:
 Weight: about 1 ton
 50 double sided disks, 24" each
 Two read/write heads
 100 tracks/disk
 Access time: 1s
 Capacity: 5 million 7-bit characters
 Microdrive
 2006: 1", 8 GB capacity

What’s Inside A Disk Drive?

2
02/03/2019

Disk geometry

 Disks consist of platters, each with two surfaces


 Each surface consists of concentric rings called
tracks
 Each track consists of sectors separated by gaps

Disk geometry (Multiple – platter view)

 Aligned tracks form a cylinder

3
02/03/2019

Disk capacity

 Capacity: maximum number of bits that can be stored


 capacity is expressed in units of gigabytes (GB), where
1 GB = 2^30 bytes ≈ 10^9 bytes
 Capacity is determined by these technology factors:
 recording density (bits/in): number of bits that can be squeezed
into a 1 inch segment of a track
 track density (tracks/in): number of tracks that can be squeezed
into a 1 inch radial segment
 areal density (bits/in2): product of recording and track density
 Modern disks partition tracks into disjoint subsets called
recording zones
 each track in a zone has the same number of sectors,
determined by the circumference of innermost track
 each zone has a different number of sectors/track

Computing disk capacity


 Capacity = (# bytes/sector) x (avg. # sectors/track)
x(# tracks/surface) x
(#surfaces/platter) x(# platters/disk)
Example:
 512 bytes/sector
 600 sectors/track (on average)

 40,000 tracks/surface

 2 surfaces/platter

 5 platters/disk

 Capacity = 512 x 600 x 40,000 x 2 x 5 = 122,880,000,000 bytes


≈ 114.4 GB (with 1 GB = 2^30 bytes)
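As a quick check on the arithmetic, a small C sketch with the same parameters (variable names are mine):

#include <stdio.h>

int main(void)
{
    long long bytes_per_sector     = 512;
    long long sectors_per_track    = 600;    /* average */
    long long tracks_per_surface   = 40000;
    long long surfaces_per_platter = 2;
    long long platters_per_disk    = 5;

    long long capacity = bytes_per_sector * sectors_per_track *
                         tracks_per_surface * surfaces_per_platter *
                         platters_per_disk;

    /* prints 122880000000 bytes = 114.44 GB (1 GB = 2^30 bytes) */
    printf("%lld bytes = %.2f GB\n", capacity,
           (double)capacity / (1024.0 * 1024.0 * 1024.0));
    return 0;
}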

4
02/03/2019

Disk operation

Disk structure : top view of single platter

 Surface organized into tracks


 Tracks divided into sectors
 Disk access
 Head in position above a track

5
02/03/2019

Writing data to a magnetic surface

 The head is moved to the desired radial position → seek


 The disk is rotated to the desired angular position
 The head generates a local external magnetic field above
the disk
 The disk will be magnetized permanently (locally)

Reading data from a magnetic surface

 We need to detect the magnetic field of the disk


→ Not possible in a direct way!
 What is possible: to detect the change of the magnetic
field
 Magnetic field is changed: bit 1
 No change: bit 0
 Example: bit sequence "101"
 Consequences:
 Individual bits cannot be modified!
 Allowing it would require changing the direction of
the magnetic field at every subsequent bit position
 What we do instead: we introduce larger data units (called
sectors)
 Only whole sectors can be read or written

6
02/03/2019

Data organization
 Data units
 We can only read and write blocks (and not individual
bytes)
 Sector system
 Fixed data units – sectors (typically 512 bytes)
 Advantage: easier to handle, the free space is not fragmented
 Issue: the operating system has to map files of various
sizes onto the fixed-size sectors
 Components of a sector:

 Gap: leaves time to switch the read or write


head on and off
 Preamble: calibrates the head (adjusts signal strength and
data density)
 "Data start": indicates the end of calibration
 "Flush pad": gives time for the last bytes to leave the head

7
02/03/2019

Identifying a sector

 How to refer to a sector?


 Specifying the physical position:
 Track: the radial position of the data
 Specifying tracks:
 Cylinder: the set of tracks at the same radial position on all the platters
 Head: which platter (surface) within that cylinder
 Specifying the location of a sector:
→ CHS coordinates (cylinder-head-sector)
(On which cylinder, under which head, which sector)
 This is how the HDD identifies a sector internally

 And how does the external environment of the


HDD identify a sector?
 When the operating system wants to load a sector,
how does it refer to it?
→ By using logical addresses
 Logical addresses
 Why? Why doesn't the operating system use CHS?
 It did in the old days. Issues:
 The HDD cannot hide bad sectors from the operating
system
 The ATA standard could only address disks up to 8.4 GB using
CHS

8
02/03/2019

 They introduced the logical addressing: Logical Block


Address, LBA
 Sectors are identified by a single number (their index on
the disk)
 The operating system passes just a sector number to the HDD
 The HDD maps this logical address to a physical CHS
address
 The HDD is a black box now!
 The operating system does not need to know the internal structure
of the disk (number of heads, number of cylinders, etc.)
 The HDD can hide bad sectors on its own (it leaves them out
of the logical→physical mapping)
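For illustration only, a common textbook mapping between LBA and CHS, assuming an idealized fixed geometry (heads per cylinder, sectors per track, sectors numbered from 1). Real drives use zoned recording and remap bad sectors, so firmware tables replace this simple arithmetic:

#include <stdio.h>

struct chs { unsigned long c, h, s; };

/* Idealized CHS -> LBA mapping. */
static unsigned long chs_to_lba(struct chs p, unsigned long heads, unsigned long spt)
{
    return (p.c * heads + p.h) * spt + (p.s - 1);
}

/* Idealized LBA -> CHS mapping. */
static struct chs lba_to_chs(unsigned long lba, unsigned long heads, unsigned long spt)
{
    struct chs p;
    p.c = lba / (heads * spt);
    p.h = (lba / spt) % heads;
    p.s = (lba % spt) + 1;
    return p;
}

int main(void)
{
    unsigned long heads = 16, spt = 63;   /* classic ATA-style geometry, chosen for the example */
    struct chs p = lba_to_chs(2057, heads, spt);
    printf("LBA 2057 -> C=%lu H=%lu S=%lu -> LBA %lu\n",
           p.c, p.h, p.s, chs_to_lba(p, heads, spt));
    return 0;
}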

 Mapping logical addresses to physical CHS


addresses
 "Cylinder" strategy:
 "Serpentine" strategy:

9
02/03/2019

Disk access – service time components

Disk access time

 Average time to access some target sector approximated


by :
 Taccess = Tavg seek + Tavg rotation + Tavg transfer
 Seek time (Tavg seek)
 time to position heads over cylinder containing target sector
 typical Tavg seek is 3–9 ms
 Rotational latency (Tavg rotation)
 time waiting for first bit of target sector to pass under r/w
head
 typical rotation speed R = 7200 RPM
 Tavg rotation = 1/2 x 1/R x 60 sec/1 min
 Transfer time (Tavg transfer)
 time to read the bits in the target sector
 Tavg transfer = 1/R x 1/(avg # sectors/track) x 60 secs/1 min

10
02/03/2019

Example
 Given:
 rotational rate = 7,200 RPM
 average seek time = 9 ms
 avg # sectors/track = 600
 Derived:
 Tavg rotation = 1/2 x (60 secs/7200 RPM) x 1000 ms/sec = 4 ms
 Tavg transfer = 60/7200 RPM x 1/600 sects/track x 1000 ms/sec = 0.014 ms
 Taccess = 9 ms + 4 ms + 0.014 ms
 Important points:
 access time dominated by seek time and rotational latency
 first bit in a sector is the most expensive, the rest are free
 SRAM access time is about 4 ns/doubleword, DRAM about 60
ns
 disk is about 40,000 times slower than SRAM
 2,500 times slower than DRAM
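The same estimate as a short C sketch (the slide rounds Tavg rotation to 4 ms; the exact value is about 4.2 ms). Variable names are mine:

#include <stdio.h>

int main(void)
{
    double rpm = 7200.0;
    double avg_seek_ms = 9.0;
    double sectors_per_track = 600.0;

    double t_rotation_ms = 0.5 * (60.0 / rpm) * 1000.0;               /* ~4.17 ms  */
    double t_transfer_ms = (60.0 / rpm) / sectors_per_track * 1000.0; /* ~0.014 ms */
    double t_access_ms   = avg_seek_ms + t_rotation_ms + t_transfer_ms;

    printf("rotation %.3f ms, transfer %.3f ms, access %.3f ms\n",
           t_rotation_ms, t_transfer_ms, t_access_ms);
    return 0;
}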

Logical disk blocks


 Modern disks present a simpler abstract view of
the complex sector geometry:
 the set of available sectors is modeled as a sequence
of b-sized logical blocks (0, 1, 2, ...)
 Mapping between logical blocks and actual
(physical) sectors
 maintained by hardware/firmware device called disk
controller
 converts requests for logical blocks into (surface, track,
sector) triples
 Allows controller to set aside spare cylinders for
each zone
 accounts for the difference in “formatted capacity” and
“maximum capacity”

11
02/03/2019

IO Bus

Reading a disk sector - 1

12
02/03/2019

Reading a disk sector - 2

Reading a disk sector - 3

13
02/03/2019

Solid – State Disks (SSDs)

 Pages: 512 B to 4 KB; blocks: 32 to 128 pages


 Data read/written in units of pages
 Page can be written only after its block has been erased
 A block wears out after 100,000 repeated writes

SSD Performance Characteristics

 Why are random writes so slow?


 erasing a block is slow (around 1 ms)
 modifying a page triggers a copy of all useful pages in the
block
 find a used block (new block) and erase it
 write the page into the new block
 copy other pages from old block to the new block

14
02/03/2019

SSD Tradeoffs vs Rotating Disks

 Advantages
 no moving parts → faster, less power, more rugged
 Disadvantages
 have the potential to wear out
 mitigated by “wear-leveling logic” in flash translation layer
 e.g. Intel X25 guarantees 1 petabyte (10^15 bytes) of random
writes before they wear out
 in 2010, about 100 times more expensive per byte
 in 2017, about 6 times more expensive per byte
 Applications
 smart phones, laptops
 Apple “Fusion” drives

RAID

 Redundant Array of Independent Disks


 Redundant Array of Inexpensive Disks
 6 levels in common use
 Not a hierarchy
 Set of physical disks viewed as single logical
drive by O/S
 Data distributed across physical drives
 Can use redundant capacity to store parity
information

15
02/03/2019

 RAID, an acronym for Redundant Array of Independent


Disks was invented to address problems of disk
reliability, cost, and performance.
 In RAID, data is stored across many disks, with extra
disks added to the array to provide error correction
(redundancy).
 The inventors of RAID, David Patterson, Garth Gibson,
and Randy Katz, provided a RAID taxonomy that has
persisted for a quarter of a century, despite many efforts
to redefine it.

RAID 0

 RAID Level 0, also known as drive spanning, provides


improved performance, but no redundancy.
 Data is written in blocks across the entire array
 The disadvantage of RAID 0 is in its low reliability.
 No redundancy
 Data striped across all disks
 Round Robin striping
 Increase speed
 Multiple data requests probably not on same disk
 Disks seek in parallel
 A set of data is likely to be striped across multiple disks

16
02/03/2019

RAID 1
 Mirrored Disks, provides 100% redundancy, and good
performance.
 Data is striped across disks
 2 copies of each stripe on separate disks
 Read from either
 Write to both
 Recovery is simple
 Swap faulty disk & re-mirror
 No down time
 Expensive
 Two matched sets of disks contain the same data.
 The disadvantage of RAID 1 is cost.

RAID 2

 Disks are synchronized


 Very small stripes
 Often single byte/word
 Error correction calculated across corresponding bits on
disks
 Multiple parity disks store Hamming code error
correction in corresponding positions
 Lots of redundancy
 Expensive
 Not used
 A RAID Level 2 configuration consists of a set of data drives,
and a set of Hamming code drives.
 Hamming code drives provide error correction for the data drives.
 RAID 2 performance is poor (slow) and the cost is relatively high.

17
02/03/2019

RAID 3

 Similar to RAID 2
 Only one redundant disk, no matter how large the array
 Simple parity bit for each set of corresponding bits
 Data on failed drive can be reconstructed from surviving
data and parity info
 Very high transfer rates

 RAID Level 3 stripes bits across a set of data drives and


provides a separate disk for parity.
 Parity is the XOR of the data bits.
 RAID 3 is not suitable for commercial applications, but is good for
personal systems.
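A minimal sketch of the parity idea (not any particular RAID implementation): the parity block is the XOR of the corresponding data blocks, and a lost block is recovered by XOR-ing the surviving blocks with the parity.

#include <stdio.h>
#include <string.h>

#define NDATA 4   /* data disks              */
#define BLK   8   /* toy block size in bytes */

int main(void)
{
    unsigned char data[NDATA][BLK] = { "disk 0.", "disk 1.", "disk 2.", "disk 3." };
    unsigned char parity[BLK] = { 0 };

    /* Parity = XOR of all data blocks (what the parity drive stores). */
    for (int d = 0; d < NDATA; d++)
        for (int i = 0; i < BLK; i++)
            parity[i] ^= data[d][i];

    /* Pretend disk 2 failed: rebuild it from the survivors plus parity. */
    unsigned char rebuilt[BLK];
    memcpy(rebuilt, parity, BLK);
    for (int d = 0; d < NDATA; d++)
        if (d != 2)
            for (int i = 0; i < BLK; i++)
                rebuilt[i] ^= data[d][i];

    printf("recovered: %s\n", rebuilt);   /* prints "disk 2." */
    return 0;
}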

RAID 4
 Each disk operates independently
 Good for high I/O request rate

 Large stripes
 Bit by bit parity calculated across stripes on each disk
 Parity stored on parity disk
 RAID Level 4 is like adding parity disks to RAID 0.
 Data is written in blocks across the data disks, and a parity block is
written to the redundant drive.
 RAID 4 would be feasible if all record blocks were the same
size, such as audio/video data.
 Poor performance; no commercial implementations of RAID 4

18
02/03/2019

RAID 5
 Like RAID 4

 Parity striped across all disks


 Round robin allocation for parity stripe
 Avoids RAID 4 bottleneck at parity disk
 Commonly used in network servers
 N.B. DOES NOT MEAN 5 DISKS!!!!!
 RAID Level 5 is RAID 4 with distributed parity.
 With distributed parity, some accesses can be serviced concurrently,
giving good performance and high reliability.
 RAID 5 is used in many commercial systems.

RAID 6

 Two parity calculations


 Stored in separate blocks on different disks
 User requirement of N disks needs N+2
 High data availability
 Three disks need to fail for data loss
 Significant write penalty
 RAID Level 6 carries two levels of error protection over
striped data: Reed-Solomon and parity.
 It can tolerate the loss of two disks.
 RAID 6 is write-intensive, but highly fault-tolerant.

19
02/03/2019

Optical Disks

 Optical disks provide large storage capacities very


inexpensively.
 They come in a number of varieties including CD-
ROM, DVD, and WORM (write-once-read-many-
times).
 Many large computer installations produce document
output on optical disk rather than on paper. This idea is
called COLD-- Computer Output Laser Disk.
 It is estimated that optical disks can endure for a
hundred years. Other media are good for only a decade-
- at best.

 CD-ROMs were designed by the music industry in the


1980s, and later adapted to data.
 This history is reflected by the fact that data is recorded
in a single spiral track, starting from the center of the
disk and spanning outward.
 Binary ones and zeros are delineated by bumps in the
polycarbonate disk substrate. The transitions between
pits and lands define binary ones.
 If you could unravel a full CD-ROM track, it would be
nearly five miles long!

20
02/03/2019

 The logical data format for a CD-ROM is much more


complex than that of a magnetic disk. (See the text for
details.)
 Different formats are provided for data and music.
 Two levels of error correction are provided for the data
format.
 DVDs can be thought of as quad-density CDs.
 Where a CD-ROM can hold at most 650MB of data,
DVDs can hold as much as 8.54GB.
 It is possible that someday DVDs will make CDs
obsolete.

Optical Storage CD-ROM

 Originally for audio


 650Mbytes giving over 70 minutes audio
 Polycarbonate coated with highly reflective coat,
usually aluminium
 Data stored as pits
 Read by reflecting laser
 Constant packing density
 Constant linear velocity

21
02/03/2019

CD-ROM Format

 Mode 0=blank data field


 Mode 1=2048 byte data+error correction
 Mode 2=2336 byte data

22
02/03/2019

Chapter 4

THE CENTRAL
PROCESSING UNIT

Language levels

 High-Level Languages
 Characteristics
 Portable : to varying
degrees
 Complex:one statement
can do much work
 Expressive
 Human readable

1
02/03/2019

Machine languages

 Characteristics
 Not portable : specific to
hardware
 Simple : each instruction
does a simple task
 Not expressive : each
instruction performs little
work
 Not human readable :
requires losts of effort,
requires tool support

Assembly languages

 Characteristics
 Not portable : each
assembly language
instruction map to one
machine language
instruction
 Simple : each
instruction does a
simple task
 Not expressive
 Human readable

2
02/03/2019

Why learn assembly language ?

 Why learn assembly language ?


 Knowing assembly language helps you :
 Write fast code : in assembly language, in a high –
level language
 Understand what’s happening under the hood
 Someone needs to develop future computer systems
 Maybe that will be you

Why Learn x86-64 Assembly Lang?

 Pros
 X86-64 is popular
 CourseLab computers are x86-64 computers
 Program natively on CourseLab instead of using an emulator
 Cons
 X86-64 assembly language is big
 Each instruction is simple, but…
 There are many instructions
 Instructions differ widely

3
02/03/2019

Von Neumann Architecture

 Includes
 CPU : CU, ALU,
Registers
 Memory : RAM

RAM

4
02/03/2019

Intel Microprocessors

 Intel introduced the 8086 microprocessor in 1979


 8086, 8087, 8088, and 80186 processors
 16-bit processors with 16-bit registers
 16-bit data bus and 20-bit address bus
 Physical address space = 2^20 bytes = 1 MB
 8087 Floating-Point co-processor
 Uses segmentation and real-address mode to address
memory
 Each segment can address 2^16 bytes = 64 KB
 8088 is a less expensive version of 8086
 Uses an 8-bit data bus
 80186 is a faster version of 8086

Intel 80286 and 80386 Processors


 80286 was introduced in 1982
 24-bit address bus → 2^24 bytes = 16 MB address space
 Introduced protected mode
 Segmentation in protected mode is different from the real
mode
 80386 was introduced in 1985
 First 32-bit processor with 32-bit general-purpose
registers
 First processor to define the IA-32 architecture
 32-bit data bus and 32-bit address bus
 2^32 bytes → 4 GB address space
 Introduced paging, virtual memory, and the flat memory
model Segmentation can be turned off

5
02/03/2019

Intel 80486 and Pentium Processors

 80486 was introduced 1989


 Improved version of Intel 80386
 On-chip Floating-Point unit (DX versions)
 On-chip unified Instruction/Data Cache (8 KB)
 Uses Pipelining: can execute up to 1 instruction per clock
cycle
 Pentium (80586) was introduced in 1993
 Wider 64-bit data bus, but address bus is still 32 bits
 Two execution pipelines: U-pipe and V-pipe
 Superscalar performance: can execute 2 instructions per clock/c
 Separate 8 KB instruction and 8 KB data caches
 MMX instructions (later models) for multimedia
applications

Intel P6 Processor Family

 P6 Processor Family: Pentium Pro, Pentium II and III


 Pentium Pro was introduced in 1995
 Three-way superscalar: can execute 3 instructions per clock
cycle
 36-bit address bus → up to 64 GB of physical address space
 Introduced dynamic execution
 Out-of-order and speculative execution
 Integrates a 256 KB second level L2 cache on-chip
 Pentium II was introduced in 1997
 Added MMX instructions (already introduced on Pentium
MMX)
 Pentium III was introduced in 1999
 Added SSE instructions and eight new 128-bit XMM registers

6
02/03/2019

Pentium 4 and Xeon Family

 Pentium 4 is a seventh-generation x86 architecture


 Introduced in 2000
 New micro-architecture design called Intel Netburst
 Very deep instruction pipeline, scaling to very high frequencies
 Introduced the SSE2 instruction set (extension to SSE)
 Tuned for multimedia and operating on the 128-bit XMM registers
 In 2002, Intel introduced Hyper-Threading technology
 Allowed 2 programs to run simultaneously, sharing resources
 Xeon is Intel's name for its server-class microprocessors
 Xeon chips generally have more cache
 Support larger multiprocessor configurations

Pentium-M and EM64T


 Pentium M (Mobile) was introduced in 2003
 Designed for low-power laptop computers
 Modified version of Pentium III, optimized for power efficiency
 Large second-level cache (2 MB on later models)
 Runs at lower clock than Pentium 4, but with better performance
 Extended Memory 64-bit Technology (EM64T)
 Introduced in 2004
 64-bit superset of the IA-32 processor architecture
 64-bit general-purpose registers and integer support
 Number of general-purpose registers increased from 8 to 16
 64-bit pointers and flat virtual address space
 Large physical address space: up to 2^40 bytes = 1 terabyte

7
02/03/2019

64-bit Processors

 Intel64
 64-bit linear address space
 Intel: Pentium Extreme, Xeon, Celeron D, Pentium D,
Core 2, and Core i7
 IA-32e Mode
 Compatibility mode for legacy 16- and 32-bit
applications
 64-bit Mode uses 64-bit addresses and operands

CISC and RISC

 CISC – Complex Instruction Set Computer


 Large and complex instruction set
 Variable width instructions
 Requires microcode interpreter
 Each instruction is decoded into a sequence of micro-operations
 Example: Intel x86 family
 RISC – Reduced Instruction Set Computer
 Small and simple instruction set
 All instructions have the same width
 Simpler instruction formats and addressing modes
 Decoded and executed directly by hardware
 Examples: ARM, MIPS, PowerPC, SPARC, etc.

8
02/03/2019

Basic Program Execution Registers

 Registers are high speed memory inside the CPU


 Eight 32-bit general-purpose registers
 Six 16-bit segment registers
 Processor Status Flags (EFLAGS) and Instruction
Pointer (EIP)
32-bit General-Purpose Registers
EAX EBP
EBX ESP
ECX ESI
EDX EDI

16-bit Segment Registers


EFLAGS CS ES
SS FS
EIP
DS GS

General-Purpose Registers
 Used primarily for arithmetic and data movement
 mov eax, 10 move constant 10 into register eax
 Specialized uses of Registers
 EAX – Accumulator register
 Automatically used by multiplication and division instructions
 ECX – Counter register
 Automatically used by LOOP instructions
 ESP – Stack Pointer register
 Used by PUSH and POP instructions, points to top of stack
 ESI and EDI – Source Index and Destination Index register
 Used by string instructions
 EBP – Base Pointer register
 Used to reference parameters and local variables on the stack

9
02/03/2019

Accessing Parts of Registers


 EAX, EBX, ECX, and EDX are 32-bit Extended registers
 Programmers can access their 16-bit and 8-bit parts
 Lower 16-bit of EAX is named AX
 AX is further divided into
 AL = lower 8 bits
 AH = upper 8 bits
 ESI, EDI, EBP, ESP have only
16-bit names for lower half

Accessing Parts of Registers

 Different names for parts of the registers

10
02/03/2019

Special-Purpose & Segment Registers


 EIP = Extended Instruction Pointer
 Contains address of next instruction to be executed
 EFLAGS = Extended Flags Register
 Contains status and control flags
 Each flag is a single binary bit
 Six 16-bit Segment Registers
 Support segmented memory
 Six segments accessible at a time
 Segments contain distinct contents
 Code
 Data
 Stack

EFLAGS Register
 Status Flags
 Status of arithmetic and logical operations
 Control and System flags
 Control the CPU operation
 Programs can set and clear individual bits in the EFLAGS
register

11
02/03/2019

Status Flags

 Carry Flag
 Set when unsigned arithmetic result is out of range
 Overflow Flag
 Set when signed arithmetic result is out of range
 Sign Flag
 Copy of sign bit, set when result is negative
 Zero Flag
 Set when result is zero
 Auxiliary Carry Flag
 Set when there is a carry from bit 3 to bit 4
 Parity Flag
 Set when parity is even
 Least-significant byte in result contains even number of 1s

Floating-Point, MMX, XMM Registers

 Floating-point unit performs high speed FP


operations
 Eight 80-bit floating-point data registers
 ST(0), ST(1), . . . , ST(7)
 Arranged as a stack
 Used for floating-point arithmetic
 Eight 64-bit MMX registers
 Used with MMX instructions
 Eight 128-bit XMM registers
 Used with SSE instructions

12
02/03/2019

Registers in Intel Core Microarchitecture

X86 – 64 Registers

 There are sixteen, 64-bit General Purpose Registers


(GPRs)
 A GPR register can be accessed as a whole (all 64 bits) or as a
smaller portion or
subset (32, 16, or 8 bits).

13
02/03/2019

 General purpose registers

 When using data element sizes less than 64-bits (i.e.,


32-bit, 16-bit, or 8-bit), the lower portion of the register
can be accessed by using a different register name

14
02/03/2019

RSP Register

 RSP (Stack Pointer)


register
 Contains address of top (low
address) of current
function’s stack frame
 Allows use of the STACK
section of memory
 (See Assembly Language:
Function Calls lecture)

15
02/03/2019

 RBP (Base Pointer Register)


 is used as a base pointer during function calls
 The rbp register should not be used for data or other
uses.

 Flag Register (rFlags)

16
02/03/2019

RIP Register

 Special-purpose register…
 RIP (Instruction Pointer) register
 Stores the location of the next instruction
 Address (in TEXT section) of machine-language instructions to be
executed next
 Value changed:
 Automatically to implement sequential control flow
 By jump instructions to implement selection, repetition

Memory Segmentation
 Memory segmentation is necessary since the 20-bit memory
addresses cannot fit in the 16-bit CPU registers
 Since x86 registers are 16 bits wide, a memory segment is made of
2^16 consecutive locations (i.e. 64K)
 Each segment has a number identifier that is also a 16-bit number
(i.e. segments are numbered from 0 to 64K-1)
 A memory location within a memory segment is referenced by
specifying its offset from the start of the segment. Hence the first
word in a segment has an offset of 0 while the last one has an offset
of FFFFh
 To reference a memory location its logical address has to be
specified. The logical address is written as:
 Segment number:offset
 For example, A43F:3487h means offset 3487h within segment
A43Fh.

17
02/03/2019

Program Segments

 Machine language programs usually have 3 different parts stored


in different memory segments:
 Instructions: This is the code part and is stored in the code segment
 Data: This is the data part which is manipulated by the code and is stored
in the data segment
 Stack: The stack is a special memory buffer organized as Last-In-First-
Out (LIFO) structure used by the CPU to implement procedure calls and as
a temporary holding area for addresses and data. This data structure is
stored in the stack segment
 The segment numbers for the code segment, the data segment,
and the stack segment are stored in the segment registers CS, DS,
and SS, respectively.
 Program segments do not need to occupy the whole 64K
locations in a segment

Real Address Mode

 A program can access up to six segments at any time


 Code segment
 Stack segment
 Data segment
 Extra segments (up to 3)
 Each segment is 64 KB
 Logical address
 Segment = 16 bits
 Offset = 16 bits
 Linear (physical) address = 20 bits

18
02/03/2019

Logical to Linear Address Translation

Linear address = Segment × 10 (hex) + Offset


Example:
segment = A1F0 (hex)
offset = 04C0 (hex)
logical address = A1F0:04C0 (hex)
what is the linear address?
Solution:
A1F00 (add 0 to segment in hex)
+ 04C0 (offset in hex)
A23C0 (20-bit linear address in hex)
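The same translation as a one-line C sketch (real-address mode only; the helper name is mine):

#include <stdio.h>

/* Real-mode translation: linear = segment * 16 + offset. */
static unsigned long to_linear(unsigned segment, unsigned offset)
{
    return ((unsigned long)segment << 4) + offset;
}

int main(void)
{
    /* A1F0:04C0 -> A23C0, as in the example above */
    printf("%05lX\n", to_linear(0xA1F0, 0x04C0));
    return 0;
}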

Segment Overlap

 There is a lot of overlapping


between segments in the main
memory.
 A new segment starts every
10h locations (i.e. every 16
locations).
 Starting address of a segment
always has a 0h LSD.
 Due to segments overlapping
logical addresses are not
unique .

19
02/03/2019

Flat Memory Model

 Modern operating systems turn segmentation off


 Each program uses one 32-bit linear address space
 Up to 2^32 = 4 GB of memory can be addressed
 Segment registers are defined by the operating system
 All segments are mapped to the same linear address space
 In assembly language, we use .MODEL flat directive
 To indicate the Flat memory model
 A linear address is also called a virtual address
 Operating system maps virtual address onto physical addresses
 Using a technique called paging

Programmer View of Flat Memory

 Same base address (0) for all segments
 All segments are mapped to the same linear address space of the
program (up to 4 GB, 32-bit addresses)
 EIP Register
 Points at the next instruction in the CODE area
 ESI and EDI Registers
 Contain data addresses in the DATA area
 Used also to index arrays
 ESP and EBP Registers
 ESP points at the top of the STACK
 EBP is used to address parameters and
variables on the stack
 CS, DS, SS, ES all use base address = 0, so they refer to the same
linear address space

20
02/03/2019

Protected Mode Architecture

 Logical address consists of


 16-bit segment selector (CS, SS, DS, ES, FS, GS)
 32-bit offset (EIP, ESP, EBP, ESI ,EDI, EAX, EBX, ECX,
EDX)
 Segment unit translates logical address to linear address
 Using a segment descriptor table
 Linear address is 32 bits (called also a virtual address)
 Paging unit translates linear address to physical address
 Using a page directory and a page table

Logical to Linear Address Translation

Upper 13 bits of
segment selector are
used to index the
descriptor table

TI = Table Indicator
Select the descriptor table
0 = Global Descriptor Table
1 = Local Descriptor Table

21
02/03/2019

Segment Descriptor Tables

 Global descriptor table (GDT)


 Only one GDT table is provided by the operating system
 GDT table contains segment descriptors for all programs
 Also used by the operating system itself
 Table is initialized during boot up
 GDT table address is stored in the GDTR register
 Modern operating systems (Windows-XP) use one GDT table
 Local descriptor table (LDT)
 Another choice is to have a unique LDT table for each program
 LDT table contains segment descriptors for only one program
 LDT table address is stored in the LDTR register

Segment Descriptor Details

 Base Address
 32-bit number that defines the starting location of the segment
 32-bit Base Address + 32-bit Offset = 32-bit Linear Address
 Segment Limit
 20-bit number that specifies the size of the segment
 The size is specified either in bytes or multiple of 4 KB pages
 Using 4 KB pages, segment size can range from 4 KB to 4 GB
 Access Rights
 Whether the segment contains code or data
 Whether the data can be read-only or read & written
 Privilege level of the segment to protect its access

22
02/03/2019

Segment Visible and Invisible Parts

 Visible part = 16-bit Segment Register


 CS, SS, DS, ES, FS, and GS are visible to the programmer
 Invisible Part = Segment Descriptor (64 bits)
 Automatically loaded from the descriptor table

Paging

 Paging divides the linear address space into …


 Fixed-sized blocks called pages, Intel IA-32 uses 4 KB pages
 Operating system allocates main memory for pages
 Pages can be spread all over main memory
 Pages in main memory can belong to different programs
 If main memory is full then pages are stored on the hard disk
 OS has a Virtual Memory Manager (VMM)
 Uses page tables to map the pages of each running program
 Manages the loading and unloading of pages
 As a program is running, CPU does address translation
 Page fault: issued by CPU when page is not in memory

23
02/03/2019

Paging – cont’d

The operating system uses page tables to map the pages in the linear
virtual address space of each program (Program 1, Program 2, ...: Page 0,
Page 1, Page 2, ..., Page n) onto main memory. Each running program has
its own page table. Pages that cannot fit in main memory are stored on
the hard disk; the operating system swaps pages between memory and the
hard disk.

 As a program is running, the processor translates the linear virtual addresses


into real memory (also called physical) addresses

x86 Assembly Language Syntax

 An assembly statement
 Essentials: opcode and operands (dest, src)
 E.g.: add a, b, c => c = a + b
 Intel Syntax
 opcode dest, src
 no suffix on the opcode
 immediate value is a plain number
 plain register name
 [ ] for memory operand
 Example:
MOV EAX, 5
MOV EAX, [EBX+4]
 AT&T Syntax
 opcode src, dest
 opcode has a size suffix: b/w/l/q for 8/16/32/64 bits
 immediate value has $
 register name has %
 ( ) for memory operand
 Example:
MOVL $5, %EAX
MOVL 4(%EBX), %EAX

24
02/03/2019

x86 Operand Addressing Modes


 Addressing mode is about where the operands are.
 Immediate Operands – operand values are part of the
instruction
 movl $5, %eax
 Register Operands – operand values are in registers
 add %ebx, %eax
 Memory Operands – operand values are in memory
 Addressed with
 Specifying the segment
 Explicitly specify segments: movl %eax, %es:(%ebx)
 Implicitly specify segments

 Implicitly specified segments

25
02/03/2019

 Specifying the offset


 Offset = Base + Displacement + (Index * scale)
 Base: EAX, EBX, ECX, EDX, ESP, EBP, ESI, EDI
 Disp: an 8-bit, 16-bit, or 32-bit constant
 Index: EAX, EBX, ECX, EDX, EBP, ESI, EDI (not ESP)
 Scale: 1, 2, 4, or 8
 Displacement is a number
 Special rules about ESP and EBP
 Any component can be NULL
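A hedged C sketch of the effective-address rule; it mirrors the a[1][2] access shown on the next slide (base held in EBX, displacement 40, index 2, scale 4). The names are mine, not part of the instruction set:

#include <stdio.h>

/* Offset = Base + Displacement + (Index * Scale), scale in {1,2,4,8}. */
static unsigned effective_address(unsigned base, int disp,
                                  unsigned index, unsigned scale)
{
    return base + disp + index * scale;
}

int main(void)
{
    unsigned base_of_a = 0x1000;   /* pretend EBX holds the address of a  */
    unsigned addr = effective_address(base_of_a, 40, 2, 4);
    /* offset 48 from a = row 1 (10 ints = 40 bytes) + column 2 (2*4 bytes) */
    printf("effective address = 0x%X (offset %u)\n", addr, addr - base_of_a);
    return 0;
}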

Example memory operands


 Writing the memory operand
 Intel syntax: segreg:[base+index*scale+disp]
 AT&T syntax: %segreg:disp(base, index, scale)
 Example:
 int a[2][10]; and move a[1][2] into EAX
 Intel syntax
mov ebx, a;
mov ecx, 2;
mov eax, ds:[ebx + ecx * 4h + 40]
 AT&T syntax
movl $a, %ebx
movl $2, %ecx
movl %ds:40(%ebx, %ecx, 4), %eax

26
02/03/2019

From Assembly Language to Instructions


 Assembly language is for human
 Instructions are for machine
 One assembly language statement translates into
one instruction
 Why this translation is important
 x86 is CISC, i.e., instructions have variable lengths
 To properly deliver/inspect malicious code at the
correct memory location, attackers/defenders should be
aware of the length of the instructions

Summary of addressing modes

Assembler converts a variable name into a


constant offset (called also a displacement)

For indirect addressing, a base/index


register contains an address/index

CPU computes the effective


address of a memory operand

27
02/03/2019

Chapter 5

Data representation
Computer Arithmetic

Contents

 Number system
 Digital Number System
 Decimal, Binary, and Hexadecimal
 Base Conversion
 Binary Encoding
 IEC Prefixes

1
02/03/2019

Base (radix) of a number system

 Is the number of distinct digits used to represent values


 Some common number systems:
 Binary system
 Decimal system
 Octal system
 Hexadecimal system
 No reason that we can't also use base 7 or 19 but
they're obviously not very useful

Decimal Numbering System

2
02/03/2019

Octal Numbering System

Binary and Hexadecimal


3
02/03/2019

Base Conversion

 Is converting from one base to another


 Any non-negative number can be written in any
base
 Since most humans are used to the decimal
system and most computers use the binary
system it is important for people who work with
computers to understand how to convert
between binary and decimal

Converting from any base S to decimal

 Use the formula:
N_S = C_n·S^n + C_(n-1)·S^(n-1) + C_(n-2)·S^(n-2) + … + C_0·S^0 + C_(-1)·S^(-1) + …

 Or
N_S = Σ C_i·S^i
 In which:
0 ≤ C_i ≤ S-1
i is the position of the i-th digit; i = 0 is the first digit to the
left of the radix point

4
02/03/2019

Examples: convert the following numbers to decimal

 From B – D
 Example: 1001₂
1001₂ = 1×2^3 + 0×2^2 + 0×2^1 + 1×2^0 = 9
 From O – D
 Example: 162.43₈
162.43₈ = 1×8^2 + 6×8^1 + 2×8^0 + 4×8^(-1) + 3×8^(-2)
 From H – D
 Example: 1E4A.6B₁₆
= 1×16^3 + E×16^2 + 4×16^1 + A×16^0 + 6×16^(-1) + B×16^(-2)
= 1×16^3 + 14×16^2 + 4×16^1 + 10×16^0 + 6×16^(-1) + 11×16^(-2)
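The positional formula translates directly into C; a minimal sketch for integer digit strings (no fractional part; the helper name is mine):

#include <stdio.h>
#include <ctype.h>

/* Evaluate the digits of s in base `base`, i.e. N = sum of Ci * base^i. */
static long to_decimal(const char *s, int base)
{
    long value = 0;
    for (; *s; s++) {
        int digit = isdigit((unsigned char)*s)
                      ? *s - '0'
                      : toupper((unsigned char)*s) - 'A' + 10;
        value = value * base + digit;   /* Horner form of the positional sum */
    }
    return value;
}

int main(void)
{
    printf("%ld\n", to_decimal("1001", 2));    /* 9    */
    printf("%ld\n", to_decimal("162", 8));     /* 114  */
    printf("%ld\n", to_decimal("1E4A", 16));   /* 7754 */
    return 0;
}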

Examples :

5
02/03/2019

Converting from Decimal to Binary

Examples

 Convert 29 and 13 to Binary


Operation   Result   Remainder

29/2 14 1
14/2 7 0
7/2 3 1
3/2 1 1
1/2 0 1

 2910= 111012
 1310= 11012

6
02/03/2019

Convert decimal fraction to binary

 Convert 13.625 to binary


 Whole part: 13 = 1101₂
 Decimal fraction part: 0.625

Multiply by 2   Whole part   Fraction


 0.625 x 2 = 1.25   1   0.25
 0.25 x 2 = 0.5     0   0.5
 0.5 x 2 = 1.0      1   0 (stop)
 Binary fraction: 0.101₂
 Combine both parts: 1101.101₂
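A small C sketch of the two procedures above: repeated division by 2 for the whole part and repeated multiplication by 2 for the fraction (output formatting is mine):

#include <stdio.h>

int main(void)
{
    double value = 13.625;
    unsigned whole = (unsigned)value;     /* 13    */
    double   frac  = value - whole;       /* 0.625 */

    /* Whole part: repeated division by 2 yields the bits in reverse order. */
    char bits[32];
    int n = 0;
    do { bits[n++] = '0' + (whole % 2); whole /= 2; } while (whole);

    printf("13.625 = ");
    while (n) putchar(bits[--n]);         /* prints 1101 */

    /* Fraction: multiply by 2; the whole part of each product is the next bit. */
    putchar('.');
    for (int i = 0; i < 10 && frac > 0.0; i++) {
        frac *= 2.0;
        putchar('0' + (int)frac);
        frac -= (int)frac;
    }
    putchar('\n');                        /* prints .101, giving 1101.101 */
    return 0;
}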

Converting from Decimal to Base Hex

7
02/03/2019

 Examples : convert from Decimal to Hex


 1023 = 3FFH

Convert from Decimal to Octal

 Example : 153.513 to octal


 Whole part: 153 = 231₈
 Decimal fraction part: 0.513

Multiply by 8   Whole part   Fraction


 0.513 x 8 = 4.104   4   0.104
 0.104 x 8 = 0.832   0   0.832
 0.832 x 8 = 6.656   6   0.656
 0.656 x 8 = 5.248   5   0.248
 0.248 x 8 = 1.984   1   0.984…
 Fraction result: 0.40651₈ (approximately)
 Final: 231.40651₈ (approximately)

8
02/03/2019

Base 10 Base 2 Base 16


 0 0000 0
1 0001 1
2 0010 2
3 0011 3
4 0100 4
5 0101 5
6 0110 6
7 0111 7
8 1000 8
9 1001 9
10 1010 A
11 1011 B
12 1100 C
13 1101 D
14 1110 E
15 1111 F

 Convert 0b100110110101101
 How many digits?
 Pad:
 Substitute:
 Example: 3E8 H – B

9
02/03/2019

Converting Binary to Octal

 Using groups of four bits (hextets), the binary number


11010100011011₂ (= 13595₁₀) in hexadecimal is 351B₁₆

 Octal (base 8) values are derived from binary by using


groups of three bits (8 = 2^3): 11 010 100 011 011₂ = 32433₈

 Octal was very useful when computers used six-bit


words.

Base Comparison
(base 10 / base 2 / base 16 table as above)
 Why does all of this matter?
 Humans think about numbers in base 10, but computers "think"
about numbers in base 2
 Binary encoding is what allows computers to do all of the
amazing things that they do!
 You should have this table memorized by the end of the
class
 Might as well start now!

10
02/03/2019

Aside: Why Base 2?

 Electronic implementation
 Easy to store with bi-stable elements
 Reliably transmitted on noisy and inaccurate wires
(signal levels: voltages near 0.0–0.5 V read as 0, near 2.8–3.3 V read as 1)
 Other bases possible, but not yet viable:
 DNA data storage (base 4: A, C, G, T) is a hot topic
 Quantum computing

Binary Encoding

11
02/03/2019

So What’s It Mean?

Numerical Encoding

12
02/03/2019

Binary Encoding

Binary Encoding – Colors

13
02/03/2019

Binary Encoding – Characters/Text

 ASCII Encoding (www.asciitable.com)


 American Standard Code for Information Interchange

Binary Encoding – Video Games

 As programs run, in-game data is stored


somewhere
 In many old games, stats would go to a
maximum of 255
 Pacman “kill screen”
 http://www.numberphile.com/videos/255.html

14
02/03/2019

Binary Encoding – Files and Programs

 At the lowest level, all digital data is stored as


bits!
 Layers of abstraction keep everything
comprehensible
 Data/files are groups of bits interpreted by program
 Program is actually groups of bits being interpreted
by your CPU
 Computer Memory Demo
 From vim: %!xxd
 From emacs: M-x hexl-mode

Binary Encoding – Optical Disk Storage

 Data stored using tiny indentations in a spiral pattern


 Not a direct translation between 0/1 and bump/no bump
 https://en.wikipedia.org/wiki/Compact_disc#Physical_details

15
02/03/2019

Units and Prefixes

 Here focusing on large numbers (exponents > 0)


 Note that 103 ≈ 210
 SI prefixes are ambiguous if base 10 or 2
 IEC prefixes are unambiguously base 2

Arithmetic on the binary system

 Addition: four cases


 0 + 0 = 0
 0 + 1 = 1
 1 + 0 = 1
 1 + 1 = 10
 Example:
  1 1. 0 1 1 (3.375)
+ 1 0. 1 1 0 (2.750)
1 1 0. 0 0 1 (6.125)

16
02/03/2019

 Subtraction: has four cases


0 – 0 = 0
1 – 0 = 1
1 – 1 = 0
0 – 1 = 1, borrow 1
 Examples:
 Borrow        11       111
 Minuend      100    111001
 Subtrahend   011      1011
 Result       001    101110

 Multiply :
0x0=0
0x1=0
1x0=0
1x1=1
 Examples:

17
02/03/2019

Signed Integer Representation

 The conversions we have so far presented have involved


only positive numbers.
 To represent negative values, computer systems allocate the
high-order bit to indicate the sign of a value.
 The high-order bit is the leftmost bit in a byte. It is also called
the most significant bit.
 The remaining bits contain the value of the number.

 There are three ways in which signed binary numbers may


be expressed:
 Signed magnitude,
 One’s complement and
 Two’s complement.
 In an 8-bit word, signed magnitude representation places the
absolute value of the number in the 7 bits to the right of the
sign bit.

18
02/03/2019

Signed magnitude

 For example, in 8-bit signed magnitude,


 Positive 3 is: 00000011
 Negative 3 is: 10000011
 Computers perform arithmetic operations on signed
magnitude numbers in much the same way as humans
carry out pencil and paper arithmetic.
 Humans often ignore the signs of the operands while

performing a calculation, applying the appropriate


sign after the calculation is complete.

Comments

 Zero has two representations


 000000 (+0)
 100000 (-0)
 Naive addition of the bit patterns can give wrong results
 Example: 75 + 46
 Example: 107 + 46 = 153 overflows the 7-bit magnitude; the carry
spills into the sign bit, leaving -25 instead of 153

19
02/03/2019

 Signed magnitude representation is easy for people to


understand, but it requires complicated computer
hardware.
 Another disadvantage of signed magnitude is that it
allows two different representations for zero: positive
zero (00000000) and negative zero (10000000).
 Mathematically speaking, positive 0 and negative 0
simply shouldn’t happen.
 For these reasons (among others) computers systems
employ complement systems for numeric value
representation.

One’s complement

 The one's complement of a number is obtained by


inverting every bit: 1 → 0, 0 → 1
 Example: find the one's complement of
00000011
 Answer: 11111100

20
02/03/2019

Two’s complement

 The two's complement equals the one's complement plus


1
 Example: find the two's complement of 00000011
 One's complement:  11111100
 Add 1:            +       1
 Two's complement:  11111101

Signed Representation using one’s complement

 Rules:
 MSB is the sign bit: 0 – positive, 1 – negative
 The other bits hold the value of a positive number, or the one's
complement of a negative number
 With n bits, the representable values are -(2^(n-1) - 1) to (2^(n-1) - 1)
 Example : using 6 bits
17 : 010001 26 : 011010
-17 : 101110 -26 : 100101

21
02/03/2019

Signed Representation using two’s complement

 Rules:
 MSB is the sign bit: 0 – positive, 1 – negative
 The other bits hold the value of a positive number, or the two's
complement of a negative number
 With n bits, the representable values are -(2^(n-1)) to (2^(n-1) - 1)
 Examples:
17 : 010001 26 : 011010
-17 : 101111 -26 : 100110

Add 2 sign number using one’s complement

 With one’s complement addition, the carry bit is


“carried around” and added to the sum.
 Example: using one's complement binary arithmetic (6 bits),
find the sums 13 + 11 and (-13) + (-11)
  13    001101        -13    110010
 +11   +001011        -11   +110100
 +24    011000        -24    100110
                             +    1
                              100111

22
02/03/2019

Add 2 sign number using two’s complement

 With two’s complement arithmetic, all we do is add our


two binary numbers, and discard any carry coming out
of the high-order bit.
 Example: using two's complement binary arithmetic (6 bits),
find the sums 12 + 9 and (-12) + (-9)
  12    001100        -12    110100
 + 9   +001001        + -9  +110111
  21    010101        -21   1101011
discard the carry; the result is 101011
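A 6-bit two's complement sketch in C mirroring the example; the carry out of the high-order bit is discarded simply by masking (macro and function names are mine):

#include <stdio.h>

#define BITS 6
#define MASK ((1u << BITS) - 1)              /* keep only the low 6 bits */

static unsigned encode(int v) { return (unsigned)v & MASK; }

static int decode(unsigned p)                /* sign bit set => subtract 2^BITS */
{
    return (p & (1u << (BITS - 1))) ? (int)p - (1 << BITS) : (int)p;
}

static void print_bits(unsigned p)
{
    for (int i = BITS - 1; i >= 0; i--)
        putchar(((p >> i) & 1) ? '1' : '0');
}

int main(void)
{
    unsigned sum = (encode(-12) + encode(-9)) & MASK;   /* carry is discarded */
    print_bits(sum);
    printf(" = %d\n", decode(sum));                     /* prints 101011 = -21 */
    return 0;
}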

Sign Extension

23
02/03/2019

Examples

 Convert from smaller to larger integral data types


 C automatically performs sign extension
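A short C sketch showing both the automatic conversion and a portable manual version (the helper name is mine):

#include <stdio.h>
#include <stdint.h>

/* Manual sign extension of an 8-bit value to 32 bits:
   flip the sign bit, then subtract its weight (fully portable). */
static int32_t sext8(uint8_t b)
{
    return (int32_t)((b ^ 0x80) - 0x80);
}

int main(void)
{
    int8_t  small = -3;                  /* stored as 0xFD               */
    int32_t wide  = small;               /* C sign-extends automatically */
    printf("automatic: %d (0x%08X)\n", wide, (uint32_t)wide);
    printf("manual   : %d (0x%08X)\n", sext8(0xFDu), (uint32_t)sext8(0xFDu));
    return 0;
}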

BCD code (Binary coded Decimal)

 Encodes each decimal digit 0 – 9 with 4 bits


 Examples:
1941D = 11110010101₂ (pure binary)
1941D = 0001 1001 0100 0001 in BCD (using 16 bits)

24
02/03/2019

Arithmetic on BCD numbers

 A + B: use the following rules


 A carry out of a low decade propagates to the next decade, and the
low decade is corrected
 Any decade of the total greater than 9 is also corrected
 The correction is performed by adding 6
 Example (see the C sketch below):
  18     0001 1000
+ 26   + 0010 0110
  44     0011 1110
       +      0110   (correct decade S0)
         0100 0100
 Example:
             1       (carry from decade S0)
  28     0010 1000
+ 19   + 0001 1001
  47     0100 0001
       +      0110   (correct S0)
         0100 0111
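A sketch of two-digit packed-BCD addition in C, applying the +6 correction per decade exactly as above (the function name is mine):

#include <stdio.h>

/* Add two packed-BCD bytes (two decimal digits per byte). */
static unsigned bcd_add(unsigned a, unsigned b)
{
    unsigned sum = a + b;
    if (((a & 0x0F) + (b & 0x0F)) > 0x09)
        sum += 0x06;                       /* correct the low decade  */
    if ((sum & 0x1F0) > 0x90)
        sum += 0x60;                       /* correct the high decade */
    return sum;
}

int main(void)
{
    printf("0x%02X\n", bcd_add(0x18, 0x26));   /* prints 0x44 (18 + 26 = 44) */
    printf("0x%02X\n", bcd_add(0x28, 0x19));   /* prints 0x47 (28 + 19 = 47) */
    return 0;
}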

25
02/03/2019

Floating point numbers

 The signed magnitude, one’s complement, and two’s


complement representation that we have just presented
deal with integer values only.
 Without modification, these formats are not useful in
scientific or business applications that deal with real
number values.
 How do we encode the following:
 Real numbers (e.g. 3.14159)
 Very large numbers (e.g. 6.02×10^23)
 Very small numbers (e.g. 6.626×10^-34)
 Special numbers (e.g. ∞, NaN)
 Floating-point representation solves this problem.

Scientific Notation (Decimal)

mantissa: 6.02, exponent: 23, as in 6.02 × 10^23
(decimal point in the mantissa; radix/base 10)

 Normalized form: exactly one digit (non-zero)


to left of decimal point
 Alternatives to representing 1/1,000,000,000
 Normalized: 1.0×10^-9
 Not normalized: 0.1×10^-8, 10.0×10^-10

26
02/03/2019

Scientific Notation (Binary)

mantissa: 1.01₂, exponent: -1, as in 1.01₂ × 2^-1

(binary point in the mantissa; radix/base 2)

 Computer arithmetic that supports this called floating


point due to the “floating” of the binary point
 Declare such variable in C as float (or double)

 Computers use a form of scientific notation for floating-


point representation
 Numbers written in scientific notation have three
components:

27
02/03/2019

Scientific Notation Translation


Floating Point Encoding


 Use normalized, base 2 scientific notation:
 Value: ±1 × Mantissa × 2^Exponent
 Bit Fields: (-1)^S × 1.M × 2^(E–bias)
 Representation Scheme:
 Sign bit (0 is positive, 1 is negative)
 Mantissa (a.k.a. significand) is the fractional part of the number
in normalized form and encoded in bit vector M
 Exponent weights the value by a (possibly negative) power of 2
and encoded in the bit vector E

31 30 23 22 0
S E M
1 bit 8 bits 23 bits

28
02/03/2019

29
02/03/2019

31 30 23 22 0
S E M
1 bit 8 bits 23 bits
(-1)^S x (1.M) x 2^(E–bias)
 Note the implicit 1 in front of the M bit vector
 Example: 0b 0011 1111 1100 0000 0000 0000 0000
0000
is read as 1.1₂ = 1.5₁₀, not 0.1₂ = 0.5₁₀
 Gives us an extra bit of precision
 Mantissa "limits"
 Low values near M = 0b0…0 are close to 2^Exp
 High values near M = 0b1…1 are close to 2^(Exp+1)

 Precision is a count of the number of bits in a computer


word used to represent a value
 Capacity for accuracy
 Accuracy is a measure of the difference between the
actual value of a number and its computer representation

 High precision permits high accuracy but doesn’t guarantee it.


It is possible to have high precision but low accuracy.
 Example: float pi = 3.14;
 pi will be represented using all 24 bits of the mantissa

(highly precise), but is only an approximation (not accurate)

30
02/03/2019

 Double Precision (vs. Single Precision) in 64 bits


63 62 52 51 32
S E (11) M (20 of 52)
31 0
M (32 of 52)
 C variable declared as double
 Exponent bias is now 2^10 – 1 = 1023
 Advantages: greater precision (larger mantissa),
greater range (larger exponent)
 Disadvantages: more bits used,
slower to manipulate

 But wait… what happened to zero?


 Using the standard encoding, 0x00000000 would not be zero
(it would read as 1.0 × 2^-127)
 Special case: E and M all zeros = 0
 Two zeros! But at least 0x00000000 = 0 like integers
 New numbers closest to 0 leave gaps:
 a = 1.0…0₂ × 2^-126 = 2^-126
 b = 1.0…01₂ × 2^-126 = 2^-126 + 2^-149

 Normalization and implicit 1 are to blame


 Special case: E = 0, M ≠ 0 are denormalized numbers

31
02/03/2019

Floating-point binary fields

 Sign bit (1 bit): 0 for positive, 1 for negative

 Biased exponent (characteristic): c = e + 2^(t-1) - 1

e: exponent to be stored, t: bit size of this field

Ex: to store the exponent 4 in a 6-bit field, c = 4 + 2^5 - 1 = 35 = 100011₂

Converting decimal to floating-point


1. Convert the absolute value of the number to binary,
2. Append × 2^0 to the end of the binary number (which
does not change its value),
3. Normalize the number. Move the binary point so that it
is one bit from the left. Adjust the exponent of two so
that the value does not change,
4. Place the mantissa into the mantissa field of the number.
Omit the leading one, and fill with zeros on the right,
5. Add the bias to the exponent of two, and place it in the
exponent field,
6. Set the sign bit

32
02/03/2019

Examples

Convert 2.625 to 8-bit binary floating-point


1-bit sign 3-bit exponent 4-bit mantissa

Convert to binary: 2.625₁₀ = 10.101₂

1. Add exponent part: 10.101 = 10.101 x 2^0

2. Normalize: 10.101 x 2^0 = 1.0101 x 2^1

3. Mantissa: 0101

4. Exponent: 1 + 3 = 4 = 100₂

5. Sign bit is 0

6. The resulting number is 0100 0101₂ = 45₁₆

Examples

Convert -4.75 to 8-bit binary floating-point

1-bit sign 3-bit exponent 4-bit mantissa


Convert to binary: 4.75₁₀ = 100.11₂

1. Add exponent part: 100.11 = 100.11 x 2^0

2. Normalize: 100.11 x 2^0 = 1.0011 x 2^2

3. Mantissa: 0011

4. Exponent: 2 + 3 = 5 = 101₂

5. Sign bit is 1

6. The resulting number is 1101 0011₂ = D3₁₆

33
02/03/2019

Examples

Convert -1313.3125 to IEEE 32-bit binary floating-point


1-bit sign 8-bit exponent 23-bit mantissa

Convert to binary: 1313.3125₁₀ = 10100100001.0101₂

1. Add exponent part: 10100100001.0101₂ = 10100100001.0101₂ x 2^0

2. Normalize: 10100100001.0101₂ x 2^0 = 1.01001000010101₂ x 2^10

3. Mantissa: 01001000010101000000000

4. Exponent: 10 + 127 = 137 = 10001001₂

5. Sign bit is 1

6. The resulting number: 11000100101001000010101000000000 = C4A42A00₁₆

Examples

Convert 0.1015625 to IEEE 32-bit binary floating-point


1-bit sign 8-bit exponent 23-bit mantissa

Convert to binary: 0.1015625₁₀ = 0.0001101₂

1. Add exponent part: 0.0001101₂ = 0.0001101₂ x 2^0

2. Normalize: 0.0001101₂ x 2^0 = 1.101₂ x 2^-4

3. Mantissa: 10100000000000000000000

4. Exponent: -4 + 127 = 123 = 01111011₂

5. Sign bit is 0

6. The resulting number: 00111101110100000000000000000000 = 3DD00000₁₆
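These hand conversions can be checked in C by reinterpreting a float's bits; the union below is a common idiom (variable names are mine):

#include <stdio.h>
#include <stdint.h>

/* Print the raw IEEE-754 single-precision encoding of a float. */
static void show(float f)
{
    union { float f; uint32_t u; } bits;
    bits.f = f;
    printf("%12g -> 0x%08X\n", f, bits.u);
}

int main(void)
{
    show(-1313.3125f);   /* expect 0xC4A42A00, as derived above */
    show(0.1015625f);    /* expect 0x3DD00000                   */
    return 0;
}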

34
02/03/2019

Examples

 Express 32₁₀ in the revised 14-bit floating-point


model.
 We know that 32 = 1.0 x 2^5 = 0.1 x 2^6.
 To use our excess-16 biased exponent, we add 16 to 6, giving
22₁₀ (= 10110₂).
 Graphically:

 Express 0.0625₁₀ in the revised 14-bit floating-


point model.
 We know that 0.0625 is 2^-4. So in (binary) scientific notation
0.0625 = 1.0 x 2^-4 = 0.1 x 2^-3.
 To use our excess-16 biased exponent, we add 16 to -3, giving
13₁₀ (= 01101₂).

35
02/03/2019

 Express -26.625₁₀ in the revised 14-bit floating-


point model.
 We find 26.625₁₀ = 11010.101₂. Normalizing, we have: 26.625₁₀
= 0.11010101 x 2^5.
 To use our excess-16 biased exponent, we add 16 to 5, giving
21₁₀ (= 10101₂). We also need a 1 in the sign bit (for a negative
number).

 Find the sum of 12₁₀ and 1.25₁₀ using the 14-bit


floating-point model.
 We find 12₁₀ = 0.1100 x 2^4. And 1.25₁₀ = 0.101 x 2^1 = 0.000101
x 2^4.
 Thus, our sum is 0.110101 x 2^4.

36
02/03/2019

 Floating-point multiplication is also carried out in a


manner akin to how we perform multiplication using
pencil and paper.
 We multiply the two operands and add their exponents.
 If the exponent requires adjustment, we do so at the end
of the calculation.

Examples

 Find the product of 12₁₀ and 1.25₁₀ using the 14-


bit floating-point model.
 We find 12₁₀ = 0.1100 x 2^4. And 1.25₁₀ = 0.101 x 2^1.
 Thus, our product is 0.0111100 x 2^5 = 0.1111 x 2^4.
 The normalized product requires an excess-16 exponent of 20₁₀ (= 10100₂).

37
02/03/2019

38
02/03/2019

Chapter 6

ASSEMBLY
LANGUAGE

HW Interface Affects Performance


Source code → Compiler → Architecture → Hardware

 Source code: different applications or algorithms (C language
programs A, B, your program)
 Compiler: performs optimizations, generates instructions (GCC, Clang)
 Architecture: the instruction set (x86-64; ARMv8 / AArch64 / A64)
 Hardware: different implementations (Intel Pentium 4, Core i7, Xeon;
AMD Ryzen, Epyc; ARM Cortex-A53, Apple A7)

1
02/03/2019

Instruction Set Architectures

 The ISA defines:


 The system’s state (e.g. registers, memory, program counter)
 The instructions the CPU can execute
 The effect that each of these instructions will have on the
system state

CPU
PC
Memory
Registers

General ISA Design Decisions

 Instructions
 What instructions are available? What do they do?
 How are they encoded?
 Registers
 How many registers are there?
 How wide are they?
 Memory
 How do you specify a memory location?

2
02/03/2019

Mainstream ISAs

Macbooks & PCs Smartphone-like devices Digital home & networking


(Core i3, i5, i7, M) (iPhone, iPad, Raspberry Pi) equipment
x86-64 Instruction Set ARM Instruction Set (Blu-ray, PlayStation 2)
MIPS Instruction Set

Assembly Programmer’s View

CPU (PC, registers, condition codes) ↔ Memory (code, data, stack),
exchanging addresses, data, and instructions

 Programmer-visible state
 PC: the Program Counter (rip in x86-64)
 Address of next instruction
 Named registers
 Together in the "register file"
 Heavily used program data
 Condition codes
 Store status information about the most recent
arithmetic operation
 Used for conditional branching
 Memory
 Byte-addressable array
 Code and user data
 Includes the Stack (for supporting procedures)

3
02/03/2019

64 bit x86 systems (x86-64)

x86-64 Assembly “Data Types”

 Integral data of 1, 2, 4, or 8 bytes


 Data values
 Addresses (untyped pointers)
 Floating point data of 4, 8, 10 or 2x8 or 4x4 or 8x2
 Different registers for those (e.g. xmm1, ymm2)
 Come from extensions to x86 (SSE, AVX, …)
 No aggregate types such as arrays or structures
 Just contiguously allocated bytes in memory
 Two common syntaxes
 “AT&T”: used by our course, slides, textbook, gnu tools, …
 “Intel”: used by Intel documentation, Intel tools, …
 Must know which you’re reading

4
02/03/2019

x86-64 Integer Registers – 64 bits wide

rax eax r8 r8d

rbx ebx r9 r9d

rcx ecx r10 r10d

rdx edx r11 r11d

rsi esi r12 r12d

rdi edi r13 r13d

rsp esp r14 r14d

rbp ebp r15 r15d

 Can reference low-order 4 bytes (also low-order 2 & 1


bytes)

Some History: IA32 Registers – 32 bits wide

eax ax ah al accumulate

ecx cx ch cl counter
general purpose

edx dx dh dl data

ebx bx bh bl base

esi si source index

edi di destination index

esp sp stack pointer

ebp bp base pointer

16-bit virtual registers Name Origin


(backwards compatibility) (mostly obsolete)

5
02/03/2019

What is an Assembler?

 An assembler is a program that translates an
assembly language program into binary code
 Major Assemblers
 Microsoft Assembler (MASM)
 GNU Assembler (GAS)
 Flat Assembler (FASM)
 Turbo Assembler (TASM)
 Netwide Assembler (NASM)

Our platform

 Hardware: 80x86 processor (32, 64 bit)


 OS: Linux
 Assembler: Netwide Assembler (NASM)
 C Compiler: GNU C Compiler (GCC)
 Linker: GNU Linker (LD)
 We will use the NASM assembler, as it is:
 Free. You can download it from various web sources.
 Well-documented and you will get lots of information
on net.
 Could be used on both Linux and Windows.

6
02/03/2019

Introduction to NASM assembler

 NASM Command Line Options


 -h for usage instructions
 -o output file name
 -f output file format
 Must be coff always
 -l generate listing file, i.e. file with code generated
 -e preprocess only
 -g enable debugging information
 Example
nasm -g -f coff foo.asm -o foo.o

Base elements of NASM Assemble

 Character Set
 Letters a..z A..Z
 Digits 0..9
 Special characters ? _ @ $ . ~
 NASM (unlike most assemblers) is case-sensitive
with respect to labels and variables
 It is not case-sensitive with respect to keywords,
mnemonics, register names, directives, etc.

7
02/03/2019

Literals

 Literals are values that are known or calculated at


assembly time. Examples:
 'This is a string constant'
 "So is this"
 `Backquoted strings can use escape chars\n`
 123
 1.2
 0FAAh
 $1A01
 0x1A01

Integers
 Numeric digits (including A..F) with no decimal point
 may include radix specifier at end:
 b or y binary
 d decimal
 h hexadecimal
 q octal
 Examples
 200 decimal (default)
 200d decimal
 200h hex
 200q octal
 10110111b binary

8
02/03/2019

NASM Syntax
 In order to refer to the contents of a memory location, use square
brackets.
 In order to refer to the address of a variable, leave them out, e.g.,
 mov eax, bar ;Refers to the address of bar
 mov eax, [bar] ;Refers to the contents of bar
No need for the OFFSET directive.
 NASM does not support the hybrid syntaxes such as:
 mov eax,table[ebx] ;ERROR
 mov eax,[table+ebx] ;O.K
 mov eax,[es:edi] ;O.K
 NASM does NOT remember variable types:
 data dw 0 ;Data type defined as a word.
 mov [data], 2 ;Doesn’t work.
 mov word [data], 2 ;O.K

 NASM does NOT remember variable types. Therefore, un-typed


operations are not supported, e.g.
LODS, MOVS, STOS, SCAS, CMPS, INS, and OUTS.
 You must use instead:
LODSB, MOVSW, and SCASD, etc.
 NASM does not support ASSUME.
It will not keep track of what values you choose to put in your
segment registers.
 NASM does not support memory models.
 The programmer is responsible for coding CALL FAR instructions
where necessary when calling external functions.
call (seg procedure):proc ;call segment:offset
 seg returns the segment base of procedure proc.

9
02/03/2019

 NASM does not support memory models.


 The programmer has to keep track of which functions are
supposed to be called with a far call and which with a near
call, and is responsible for putting the correct form of RET
instruction (RETN or RETF).
 NASM uses the names st0, st1, etc. to refer to floating
point registers.
 NASM’s declaration syntax for un-initialized storage is
different.
 stack DB 64 DUP (?) ;ERROR
 stack resb 64 ;Reserve 64 bytes
 Macros and directives work differently than they do in
MASM

Statemenmts

 Syntax:
[label[:]] [mnemonic] [operands] [;comment]
 [ ] indicates optionality
 Note that all parts are optional  blank lines are legal
 [label] can also be [name]
 Variable names are used in data definitions
 Labels are used to identify locations in code
 Statements are free form; they need not be formed
into columns
 Statement must be on a single line, max 128 chars

10
02/03/2019

 Example:
 L100: add eax, edx ; add subtotal to total
 Labels often appear on a separate line for code
clarity:
 L100:
add eax, edx ; add subtotal to total

Labels and Names

 Names identify labels, variables, symbols, and keywords


 May contain:
 letters: a..z A..Z
 digits: 0..9
 special chars: ? _ @ $ . ~
 NASM is case-sensitive (unlike most x86 assemblers)
 First character must be a letter, _ or . (which has a
special meaning in NASM as a “local label” indicating
it can be redefined)
 Names cannot match a reserved word (and there are
many reserved words!)

11
02/03/2019

Type of statements
 1. Directives
 limit EQU 100 ; defines a symbol limit
 %define limit 100 ; like C #define
 2. Data Definitions
 msg db 'Welcome to Assembler!'
 db 0Dh, 0Ah
 count dd 0
 mydat dd 1,2,3,4,5
 resd 100 ; reserves 400 bytes
 3. Instructions
 mov eax, ebx
 add ecx, 10

Directives

 A directive is an instruction to the assembler,


not the CPU
 A directive is not an executable instruction
 A directive can be used to
 define a constant
 define memory for data
 include source code & other file
 They are similar to C’s #include and #define

12
02/03/2019

 equ directive: EQU defines a symbol as a constant


 format: symbol equ value

 Defines a symbol

 Cannot be redefined later

 Examples : message db 'hello, world'


msglen equ $-message
 %define directive
 format: %define symbol value
 Similar to #define in C
 Example : %define N 100
mov eax , N

 Including files
 %include “some_file”
 If you know the C preprocessor, these are the
same ideas as
 #define SIZE 100 or #include "stdio.h"

13
02/03/2019

Data formats

 Defines storage for initialized or uninitialized


data
 Double and single quotes are treated the same

There are two kinds of data directives


 RESx directive; x is one of b, w, d, q, t REServe
memory (uninitialized data)
 Dx directive; x is one of b, w, d, q, t Define
memory (initialized data)
 Example :
 L1 db 0 ;defines a byte and initializes to 0
 L2 dw FF0Fh ;define a word and initialize to FF0Fh
 L3 db "A" ;byte holding ASCII value of A
 L4 resd 100 ;reserves space for 100 double words
 L5 times 100 db 0 ;defines 100 bytes init. to 0
 L6 db "s","t","r","i","n","g",0 ;defines "string“
 L7 db ’string’,0 ;same as above
 L8 resb 10 ; reserves 10 bytes

14
02/03/2019

The DX data directives


 One declares a zone of initialized memory using three
elements:
 Label: the name used in the program to refer to that zone of
memory
 A pointer to the zone of memory, i.e., an address
 DX, where X is the appropriate letter for the size of the data
being declared
 Initial value, with encoding information
 default: decimal
 b: binary
 h: hexadecimal
 o: octal
 quoted: ASCII

 Example : L8 db 0, 1, 2, 3

 Examples
 mov al , [L2] ;move a byte at L2 to al
 mov eax, L2 ;move the address of L2 to eax
 mov [L1], ah ;move ah to the byte pointed to by L1
 mov eax, dword 5
 add [L2], eax ;double word at L2 now contains [L2]+eax
 mov [L2], 1 ;does not work, why?
 mov dword [L2], 1 ;works, why?

15
02/03/2019

DX with the times qualifier

 Say you want to declare 100 bytes all initialized to 0


 NASM provides a nice shortcut to do this, the
“times” qualifier
 L11 times 100 db 0
 Equivalent to L11 db 0,0,0,....,0 (100 times)

NASM directives

 BITS 32 generate code for 32 bit processor mode


 CPU 386 | 686 | ... restrict assembly to the specified
processor
 SECTION <section_name>
specifies the section the assembly code will be assembled
into. For COFF can be one of:
 .text code (program) section
 .data initialized data section
 .bss uninitialized data section
 EXTERN <symbol> declare <symbol> as declared
elsewhere, allowing it to be used in the module;
 GLOBAL <symbol> declare <symbol> as global so that it
can be used in other modules that import it via EXTERN
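A minimal sketch of how GLOBAL and EXTERN pair up across two modules; the file names and labels (util.asm, prog.asm, double_it, compute) are hypothetical:
; file: util.asm -- exports a symbol
global double_it
segment .text
double_it:
    add eax, eax        ; EAX = 2 * EAX
    ret

; file: prog.asm -- imports and calls it
extern double_it
segment .text
global compute
compute:
    mov eax, 21
    call double_it      ; EAX = 42
    ret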

16
02/03/2019

Examples using $

 message db ’hello, world’


 msglen equ $-message
 Note
 The msglen is evaluated once using the value of $ at
the point of definition
 $ evaluates to the assembly position at the beginning
of the line containing the expression

NASM Program Structure

17
02/03/2019

Data segment example
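The original slide shows a screenshot; a minimal sketch of what a NASM data segment typically contains (labels and values are illustrative):
segment .data
msg     db 'Hello', 0Dh, 0Ah, 0   ; string followed by CR, LF and a 0 terminator
count   dd 0                      ; one initialized double word
table   dw 1, 2, 3, 4             ; four initialized words
buffer  times 16 db 0             ; 16 bytes, all initialized to 0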


18
02/03/2019

Example

Uninitialized Data

 The RESX directive is very similar to the DX


directive, but always specifies the number of
memory elements
 L20 resw 100
 100 uninitialized 2-byte words
 L20 is a pointer to the first word
 L21 resb 1
 1 uninitialized byte named L21

19
02/03/2019

Moving immediate values

 Consider the instruction: mov [L], 1


 The assembler will give us an error: “operation size not
specified”!
 This is because the assembler has no idea whether we mean
for “1” to be 01h, 0001h, 00000001h, etc.
 Labels have no type (they’re NOT variables)
 Therefore the assembler must provide us with a way
to specify the size of immediate operands
 mov dword [L], 1
 4-byte double-word
 5 size specifiers: byte, word, dword, qword, tword

Size Specifier Examples

 mov [L1], 1 ; Error


 mov byte [L1], 1 ; 1 byte
 mov word [L1], 1 ; 2 bytes
 mov dword [L1], 1 ; 4 bytes
 mov [L1], eax ; 4 bytes
 mov [L1], ax ; 2 bytes
 mov [L1], al ; 1 byte
 mov eax, [L1] ; 4 bytes
 mov ax, [L1] ; 2 bytes
 mov ax, 12 ; 2 bytes

20
02/03/2019

Program structure
 SECTION .data ;data section
msg: db "Hello World",10 ;the string to print 10=newline
len: equ $-msg ;len is value, not an addr.
 SECTION .text ;code section
global main ;for linker
main: ;standard gcc entry point
mov edx, len ;arg3, len of str. to print
mov ecx, msg ;arg2, pointer to string
mov ebx, 1 ;arg1, write to screen
mov eax, 4 ;write sysout command to int 80 hex
int 0x80 ;interrupt 80 hex, call kernel
mov ebx, 0 ;exit code, 0=normal
mov eax, 1 ;exit command to kernel
int 0x80 ;interrupt 80 hex, call kernel

 To produce hello.o object file:


nasm -f elf hello.asm
 To produce hello ELF executable:
ld -s -o hello hello.o

21
02/03/2019

Program layout

 Consists of 3 parts:
 Text
 Data
 Bss

NASM Program Structure

 ; include directives
segment .data
; DX directives

segment .bss
; RESX directives

segment .text
global asm_main
asm_main:
; instructions

22
02/03/2019

More on the text segment


 Before and after running the instructions of your program
there is a need for some “setup” and “cleanup”
 We'll understand this later, but for now, let's just accept the fact that your text segment will always look like this:
enter 0,0
pusha
;
; Your program here
;
popa
mov eax, 0
leave
ret

Assembly program structure


 %include "asm_io.inc"
segment .data ;initialized data

segment .bss ;uninitialized data

segment .text
global asm_main
asm_main:
enter 0,0 ;setup
pusha ;save all registers
;put your code here
popa ;restore all registers
mov eax, 0 ;return value
leave
ret

23
02/03/2019

OR NASM Skeleton File

 ; include directives
segment .data
; DX directives
segment .bss
; RESX directives
segment .text
global asm_main
asm_main:
enter 0,0
pusha
; Your program here
popa
mov eax, 0
leave
ret

Example
segment .data
integer1 dd 15 ; first int
integer2 dd 6 ; second int
segment .bss
result resd 1 ; result
segment .text
global asm_main
asm_main:
enter 0,0
pusha

24
02/03/2019

mov eax, [integer1] ; eax = int1


add eax, [integer2] ; eax = int1 + int2
mov [result], eax ; result = int1 + int2
popa
mov eax, 0
leave
ret

I/O?

 This is all well and good, but it’s not very interesting if we can’t
“see” anything
 We would like to:
 Be able to provide input to the program
 Be able to get output from the program
 Also, debugging will be difficult, so it would be nice if we could
tell the program to print out all register values, or to print out the
content of some zones of memory
 Doing all this requires quite a bit of assembly code and
requires techniques that we will not see for a while
 The author of our textbook provides a nice I/O package that
we can just use, without understanding how it works for now

25
02/03/2019

asm_io.asm and asm_io.inc


 The “PC Assembly Language” book comes with
many add-ons and examples
 Downloadable from the course’s Web site
 A very useful one is the I/O package, which comes
as two files:
 asm_io.asm (assembly code)
 asm_io.inc (macro code)
 Simple to use:
 Assemble asm_io.asm into asm_io.o
 Put %include "asm_io.inc" at the top of your assembly code
 Link everything together into an executable

I/O

 C: I/O done through the standard C library


 NASM: I/O through the standard C library
%include "asm_io.inc"
 Contains routines for I/O
 print_int prints EAX
 print_char prints ASCII value of AL
 print_string prints the string stored at the address
stored in EAX; must be 0 terminated
 print_nl prints newline
 read_int reads an integer into EAX
 read_char reads a character into AL
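A brief sketch of how these routines are typically combined (illustrative only; the label prompt is assumed to be a 0-terminated string defined in the data segment):
mov  eax, prompt      ; EAX = address of the prompt string
call print_string
call read_int         ; EAX = integer typed by the user
add  eax, 1
call print_int        ; print the value in EAX
call print_nl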

26
02/03/2019

Examples

Modified example

27
02/03/2019

The Big Picture

28
02/03/2019

How do we run the program?


 Now that we have written our program, say in file vd1.asm using a
text editor, we need to assemble it
 When we assemble a program we obtain an object file (vd1.o file)
 We use NASM to produce the .o file:
% nasm -f elf vd1.asm -o vd1.o
 So now we have vd1.o file, that is a machine code translation of our
assembly code
 We also need driver.o file for the C driver:
% gcc -m32 -c driver.c -o driver.o
 We generate a 32-bit object (our machines are likely 64-bit)
 We also create asm_io.o by assembling asm_io.asm
 Now we have three .o files.
 We link them together to create an executable:
% gcc driver.o vd1.o asm_io.o -o first_vd1

First program

 ;
; file: first.asm
; First assembly program. This program asks for two
integers as
; input and prints out their sum.
;
; To create executable:
;
; Using Linux and gcc:
; nasm -f elf first.asm
; gcc -o first first.o driver.c asm_io.o

29
02/03/2019

 %include "asm_io.inc" ;
; initialized data is put in the .data segment
segment .data
;
; These labels refer to strings used for output
prompt1 db "Enter a number: ", 0 ; don’t forget null
prompt2 db "Enter another number: ", 0
outmsg1 db "You entered ", 0
outmsg2 db " and ", 0
outmsg3 db ", the sum of these is ", 0
; uninitialized data is put in the .bss segment
;
segment .bss

 ;
; These labels refer to double words used to store the inputs;
;
input1 resd 1
input2 resd 1
; code is put in the .text segment
segment .text
global asm_main
asm_main:
enter 0,0 ; setup routine
pusha
mov eax, prompt1 ; print out prompt
call print_string
call read_int ; read integer
mov [input1], eax ; store into input1
mov eax, prompt2 ; print out prompt

30
02/03/2019

 call print_string
call read_int ; read integer
mov [input2], eax ; store into input2
mov eax, [input1] ; eax = dword at input1
add eax, [input2] ; eax += dword at input2
mov ebx, eax ; ebx = eax
dump_regs 1 ; dump out register values
dump_mem 2, outmsg1, 1 ; dump out memory
; next print out result message as series of steps
mov eax, outmsg1
call print_string ; print out first message
mov eax, [input1]
call print_int ; print out input1
mov eax, outmsg2
call print_string ; print out second message
mov eax, [input2]

call print_int ; print out input2


mov eax, outmsg3
call print_string ; print out third message
mov eax, ebx
call print_int ; print out sum (ebx)
call print_nl ; print new-line
popa
mov eax, 0 ; return value
leave
ret

31
02/03/2019

C driver
 #include "cdecl.h"
int PRE_CDECL asm_main( void ) POST_CDECL;
int main() {
int ret_status;
ret_status = asm_main();
return ret_status;
}
 All segments and registers are initialized by the C system
 I/O is done through the C standard library
 Initialized data in .data
 Uninitialized data in .bss
 Code in .text
 Stack later

The Big Picture

32
02/03/2019

Compiling

 nasm -f elf first.asm


produces first.o
 ELF: executable and linkable format
 gcc -c driver.c
produces driver.o
option -c means compile only
 We need to compile asm_io.asm:
nasm -f elf -d ELF_TYPE asm_io.asm
produces asm_io.o
 On 64-bit machines, add the option -m32 to generate
32-bit code, e.g. gcc -m32 -c driver.c

Linking

 Linker: combines machine code & data in object


files and libraries together to create an executable
 gcc -o first driver.o first.o asm_io.o
 On 64-bit machines,
gcc -m32 -o first driver.o first.o asm_io.o
 -o outputfile specifies the output file
 gcc driver.o first.o asm_io.o
produces a.out by default

33
02/03/2019

The Big Picture

Assembling/Linking Process

34
02/03/2019

Assembling/Linking Process

Assembling/Linking Process

35
02/03/2019

dump_regs and dump_mem

 The macro dump_regs prints out the bytes stored in all the
registers (in hex), as well as the bits in the FLAGS register
(only if they are set to 1)
dump_regs 13
 '13' above is an arbitrary integer that can be used to distinguish outputs from multiple calls to dump_regs
 The macro dump_mem prints out the bytes stored in memory (in hex). It takes three arguments:
 An arbitrary integer for output identification purposes
 The address at which memory should be displayed
 The number of 16-byte paragraphs to display, minus one
 For instance
dump_mem 29, integer1, 3
 prints out "29", and then (3+1)*16 bytes

 To demonstrate the usage of these two macros, let’s just


write a program that highlights the fact that the Intel
x86 processors use Little Endian encoding
 We will do something ugly using 4 bytes
 Store a 4-byte hex quantity that corresponds to
the ASCII codes: “live”
 “l” = 6Ch
 “i” = 69h
 “v” = 76h
 “e” = 65h
 Print that 4-byte quantity as a string
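The actual program and its output appear on the next slides as screenshots; a minimal sketch of the idea (label names are illustrative, and the original program may arrange the bytes differently) could look like this:
segment .data
mystr   dd 6C697665h    ; the codes of 'l','i','v','e' written as one 32-bit number
        db 0            ; string terminator
segment .text
        ; ... inside asm_main ...
        mov  eax, mystr ; EAX = address of those 4 bytes
        call print_string
        ; prints "evil": the low byte 65h ('e') is stored first in memory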

36
02/03/2019

Example

Output of the program

37
02/03/2019

Chapter 7

INSTRUCTION SET
OVERVIEW

Contents

 Operand Types
 Data Transfer Instructions
 Addition and Subtraction
 Addressing Modes
 Jump and Loop Instructions
 Copying a String
 Summing an Array of Integers

1
02/03/2019

Three Basic Types of Operands

 Immediate
 Constant integer (8, 16, or 32 bits)
 Constant value is stored within the instruction
 Register
 Name of a register is specified
 Register number is encoded within the instruction
 Memory
 Reference to a location in memory
 Memory address is encoded within the instruction, or
 Register holds the address of a memory location
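For example (an illustrative MASM-style fragment; count is assumed to be a DWORD variable):
mov eax, 25       ; 25 is an immediate operand
mov ebx, eax      ; EAX is a register operand
mov ecx, count    ; count is a memory operand (a named variable)
mov edx, [esi]    ; memory operand whose address is held in register ESI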

Instruction Operand Notation

2
02/03/2019

Data Transfer Instructions

 MOV Instruction : Move source operand to destination


mov destination, source
 Source and destination operands can vary
mov reg, reg
mov mem, reg
mov reg, mem
mov mem, imm
mov reg, imm
mov r/m16, sreg
mov sreg, r/m16
Rules:
• Both operands must be of the same size
• No memory-to-memory moves
• No immediate-to-segment moves
• No segment-to-segment moves
• Destination cannot be CS

MOV Examples
 .DATA
 count BYTE 100
 bVal BYTE 20
 wVal WORD 2
 dVal DWORD 5
 .CODE

 mov bl, count ; bl = count = 100


 mov ax, wVal ; ax = wVal = 2
 mov count,al ; count = al = 2
 mov eax, dval ; eax = dval = 5
Assembler will not accept the following moves
 mov ds, 45 ; immediate move to DS not permitted
 mov esi, wVal ; size mismatch
 mov eip, dVal ; EIP cannot be the destination
 mov 25, bVal ; immediate value cannot be destination
 mov bVal,count ; memory-to-memory move not permitted

3
02/03/2019

Zero Extension
 MOVZX Instruction
 Fills (extends) the upper part of the destination with zeros
 Used to copy a small source into a larger destination
 Destination must be a register
 movzx r32, r/m8
 movzx r32, r/m16
 movzx r16, r/m8
Example:
mov bl, 8Fh    ; source BL      =          10001111
movzx ax, bl   ; destination AX = 00000000 10001111 (upper bits zero-filled)

Sign Extension

 MOVSX Instruction
 Fills (extends) the upper part of the destination register with a
copy of the source operand's sign bit
 Used to copy a small source into a larger destination
 movsx r32, r/m8
 movsx r32, r/m16
 movsx r16, r/m8
Example:
mov bl, 8Fh    ; source BL      =          10001111
movsx ax, bl   ; destination AX = 11111111 10001111 (sign bit copied into upper bits)

4
02/03/2019

XCHG Instruction
 XCHG exchanges the values of two operands
 xchg reg, reg
 xchg reg, mem
 xchg mem, reg
 Rules:
• Operands must be of the same size
• At least one operand must be a register
• No immediate operands are permitted
 Examples
 .DATA
 var1 DWORD 10000000h
 var2 DWORD 20000000h
 .CODE
 xchg ah, al ; exchange 8-bit regs
 xchg ax, bx ; exchange 16-bit regs
 xchg eax, ebx; exchange 32-bit regs
 xchg var1,ebx; exchange mem, reg
 xchg var1,var2; error: two memory operands

Direct Memory Operands

 Variable names are references to locations in memory


 Direct Memory Operand: Named reference to a memory
location
 Assembler computes address (offset) of named variable
.DATA
var1 BYTE 10h
.CODE
mov al, var1  ; AL = var1 = 10h (direct memory operand)
mov al,[var1] ; AL = var1 = 10h (alternate format)

5
02/03/2019

Direct-Offset Operands

 Direct-Offset Operand: Constant offset is added to a


named memory location to produce an effective address
 Assembler computes the effective address
 Lets you access memory locations that have no name
.DATA
arrayB BYTE 10h,20h,30h,40h
.CODE
mov al, arrayB+1 ; AL = 20h
mov al,[arrayB+1] ; alternative notation
mov al, arrayB[1] ; yet another notation

Q: Why doesn't arrayB+1 produce 11h?

Direct-Offset Operands - Examples


.DATA
arrayW WORD 1020h, 3040h, 5060h
arrayD DWORD 1, 2, 3, 4
.CODE
mov ax, arrayW+2 ; AX = 3040h
mov ax, arrayW[4] ; AX = 5060h
mov eax,[arrayD+4] ; EAX = 00000002h
mov eax,[arrayD-3] ; EAX = 01506030h
mov ax, [arrayW+9] ; AX = 0200h
mov ax, [arrayD+3] ; Error: Operands are not same size
mov ax, [arrayW-2] ; AX = ? Out-of-range address
mov eax,[arrayD+16] ; EAX = ? MASM does not detect error

Memory layout (little endian byte order):
arrayW (values 1020h, 3040h, 5060h), bytes at offsets +0..+5: 20 10 40 30 60 50
arrayD (values 1, 2, 3, 4), bytes at offsets +0..+15: 01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 00

6
02/03/2019

Examples

 Given the following definition of arrayD


.DATA
arrayD DWORD 1,2,3
 Rearrange the three values in the array as: 3, 1, 2
 Solution:
; Copy first array value into EAX
mov eax, arrayD ; EAX = 1
; Exchange EAX with second array element
xchg eax, arrayD[4] ; EAX = 2, arrayD = 1,1,3
; Exchange EAX with third array element
xchg eax, arrayD[8] ; EAX = 3, arrayD = 1,1,2
; Copy value in EAX to first array element
mov arrayD, eax ; arrayD = 3,1,2

Addition and Subtraction

 ADD destination, source


destination = destination + source
 SUB destination, source
destination = destination – source
 Destination can be a register or a memory location
 Source can be a register, memory location, or a constant
 Destination and source must be of the same size
 Memory-to-memory arithmetic is not allowed

7
02/03/2019

Examples
 Write a program that adds the following three words:
.DATA
array WORD 890Fh,1276h,0AF5Bh
Solution: Accumulate the sum in the AX register
mov ax, array
add ax,[array+2]
add ax,[array+4] ; what if sum cannot fit in AX?

Solution 2: Accumulate the sum in the EAX register


movzx eax, array ; error to say: mov eax,array
movzx ebx, array[2] ; use movsx for signed integers
add eax, ebx ; error to say: add eax,array[2]
movzx ebx, array[4]
add eax, ebx

Flags Affected

 ADD and SUB affect all the six status flags:


 Carry Flag: Set when unsigned arithmetic result is out of range
 Overflow Flag: Set when signed arithmetic result is out of range
 Sign Flag: Copy of sign bit, set when result is negative
 Zero Flag: Set when result is zero
 Auxiliary Carry Flag: Set when there is a carry from bit 3 to bit 4
 Parity Flag: Set when parity in least-significant byte is even

8
02/03/2019

More on Carry and Overflow

 Addition: A + B
 The Carry flag is the carry out of the most significant bit
 The Overflow flag is only set when . . .
 Two positive operands are added and their sum is negative
 Two negative operands are added and their sum is positive
 Overflow cannot occur when adding operands of opposite signs

 Subtraction: A – B
 For Subtraction, the carry flag becomes the borrow flag
 Carry flag is set when A has a smaller unsigned value than B
 The Overflow flag is only set when . . .
 A and B have different signs and sign of result ≠ sign of A
 Overflow cannot occur when subtracting operands of the same sign

Hardware Viewpoint

 CPU cannot distinguish signed from unsigned integers


 YOU, the programmer, give a meaning to binary numbers
 How the ADD instruction modifies OF and CF:
 CF = (carry out of the MSB)   [MSB = Most Significant Bit]
 OF = (carry out of the MSB) XOR (carry into the MSB)   [XOR = eXclusive-OR operation]
 Hardware does SUB by ADDing the destination to the 2's complement of the source operand
 How the SUB instruction modifies OF and CF:
 Negate (2's complement) the source and ADD it to destination
 OF = (carry out of the MSB) XOR (carry into the MSB)
 CF = INVERT (carry out of the MSB)

9
02/03/2019

ADD and SUB Examples


 For each of the following marked entries, show the values of the
destination operand and the six status flags:

mov al,0FFh ; AL=-1


add al,1 ; AL=00h CF= 1 OF= 0 SF= 0 ZF= 1 AF= 1 PF= 1
sub al,1 ; AL=FFh CF= 1 OF= 0 SF= 1 ZF= 0 AF= 1 PF= 1
mov al,+127 ; AL=7Fh
add al,1 ; AL=80h CF= 0 OF= 1 SF= 1 ZF= 0 AF= 1 PF= 0
mov al,26h
sub al,95h ; AL=91h CF= 1 OF= 1 SF= 1 ZF= 0 AF= 0 PF= 0

Worked example of the last case (26h - 95h), shown as a subtraction and as the equivalent addition of the 2's complement (6Bh is the 2's complement of 95h):
  00100110  26h (+38)        00100110  26h (+38)
- 10010101  95h (-107)     + 01101011  6Bh (+107)
= 10010001  91h (-111)     = 10010001  91h (-111)

INC, DEC, and NEG Instructions

 INC destination
 destination = destination + 1
 More compact (uses less space) than: ADD destination, 1
 DEC destination
 destination = destination – 1
 More compact (uses less space) than: SUB destination, 1
 NEG destination
 destination = 2's complement of destination
 Destination can be 8-, 16-, or 32-bit operand
 In memory or a register
 NO immediate operand

10
02/03/2019

Affected Flags

 INC and DEC affect five status flags


 Overflow, Sign, Zero, Auxiliary Carry, and Parity
 Carry flag is NOT modified
 NEG affects all the six status flags
 Any nonzero operand causes the carry flag to be set
.DATA
B SBYTE -1 ; 0FFh
C SBYTE 127 ; 7Fh
.CODE
inc B ; B=0 OF=0 SF=0 ZF=1 AF=1 PF=1
dec B ; B=-1=FFh OF=0 SF=1 ZF=0 AF=1 PF=1
inc C ; C=-128=80h OF=1 SF=1 ZF=0 AF=1 PF=0
neg C ; C=-128 CF=1 OF=1 SF=1 ZF=0 AF=0 PF=0

ADC and SBB Instruction

 ADC Instruction: Addition with Carry


ADC destination, source
destination = destination + source + CF
 SBB Instruction: Subtract with Borrow
SBB destination, source
destination = destination - source – CF
 Destination can be a register or a memory location
 Source can be a register, memory location, or a constant
 Destination and source must be of the same size
 Memory-to-memory arithmetic is not allowed

11
02/03/2019

Extended Arithmetic

 ADC and SBB are useful for extended arithmetic


 Example: 64-bit addition
 Assume first 64-bit integer operand is stored in EBX:EAX
 Second 64-bit integer operand is stored in EDX:ECX
 Solution:
add eax, ecx ;add lower 32 bits
adc ebx, edx ;add upper 32 bits + carry
64-bit result is in EBX:EAX
 STC and CLC Instructions
 Used to Set and Clear the Carry Flag

Addressing Modes

 Two Basic Questions


 Where are the operands?
 How memory addresses are computed?
 Intel IA-32 supports 3 fundamental addressing modes
 Register addressing: operand is in a register
 Immediate addressing: operand is stored in the instruction
itself
 Memory addressing: operand is in memory
 Memory Addressing
 Variety of addressing modes
 Direct and indirect addressing
 Support high-level language constructs and data structures

12
02/03/2019

Register and Immediate Addressing

 Register Addressing
 Most efficient way of specifying an operand: no memory
access
 Shorter Instructions: fewer bits are needed to specify register
 Compilers use registers to optimize code
 Immediate Addressing
 Used to specify a constant
 Immediate constant is part of the instruction
 Efficient: no separate operand fetch is needed
 Examples
mov eax, ebx ; register-to-register move
add eax, 5 ; 5 is an immediate constant

Direct Memory Addressing


 Used to address simple variables in memory
 Variables are defined in the data section of the program
 We use the variable name (label) to address memory directly
 Assembler computes the offset of a variable
 The variable offset is specified directly as part of the
instruction
 Example
.data
var1 DWORD 100
var2 DWORD 200
sum DWORD ?
.code
mov eax, var1
add eax, var2
mov sum, eax
 var1, var2, and sum are direct memory operands

13
02/03/2019

Register Indirect Addressing

 Problem with Direct Memory Addressing


 Causes problems in addressing arrays and data structures
 Does not facilitate using a loop to traverse an array
 Indirect memory addressing solves this problem
 Register Indirect Addressing
 The memory address is stored in a register
 Brackets [ ] used to surround the register holding the address
 For 32-bit addressing, any 32-bit register can be used
 Example
mov ebx, OFFSET array ; ebx contains the address
mov eax, [ebx] ; [ebx] used to access memory

EBX contains the address of the operand, not the operand itself

Array Sum Example


 Indirect addressing is ideal for traversing an array
.data
array DWORD 10000h,20000h,30000h
.code
mov esi, OFFSET array ; esi = array address
mov eax,[esi] ; eax = [array] = 10000h
add esi,4 ; why 4?
add eax,[esi] ; eax = eax + [array+4]
add esi,4 ; why 4?
add eax,[esi] ; eax = eax + [array+8]

 Note that ESI register is used as a pointer to array


 ESI must be incremented by 4 to access the next array element
 Because each array element is 4 bytes (DWORD) in memory

14
02/03/2019

Ambiguous Indirect Operands

 Consider the following instructions:


mov [EBX], 100
add [ESI], 20
inc [EDI]

 Where EBX, ESI, and EDI contain memory addresses


 The size of the memory operand is not clear to the assembler
 EBX, ESI, and EDI can be pointers to BYTE, WORD, or DWORD

 Solution: use PTR operator to clarify the operand size


mov BYTE PTR [EBX], 100 ; BYTE operand in memory
add WORD PTR [ESI], 20 ; WORD operand in memory
inc DWORD PTR [EDI] ; DWORD operand in memory

Indexed Addressing
 Combines a displacement (name±constant) with an
index register
 Assembler converts displacement into a constant offset
 Constant offset is added to register to form an effective address

 Syntax: [disp + index] or disp [index]


.data
array DWORD 10000h,20000h,30000h
.code
mov esi, 0 ; esi = array index
mov eax,array[esi] ; eax = array[0] = 10000h
add esi,4
add eax,array[esi] ; eax = eax + array[4]
add esi,4
add eax,[array+esi] ; eax = eax + array[8]

15
02/03/2019

Index Scaling
 Useful to index array elements of size 2, 4, and 8 bytes
 Syntax: [disp + index * scale] or disp [index * scale]

 Effective address is computed as follows:


 Disp.'s offset + Index register * Scale factor

.DATA
arrayB BYTE 10h,20h,30h,40h
arrayW WORD 100h,200h,300h,400h
arrayD DWORD 10000h,20000h,30000h,40000h
.CODE
mov esi, 2
mov al, arrayB[esi] ; AL = 30h
mov ax, arrayW[esi*2] ; AX = 300h
mov eax, arrayD[esi*4] ; EAX = 30000h

Based Addressing
 Syntax: [Base + disp.]
 Effective Address = Base register + Constant Offset
 Useful to access fields of a structure or an object
 Base Register → points to the base address of the structure
 Constant Offset → relative offset within the structure

.DATA
mystruct WORD 12    ; mystruct is a structure consisting of 3 fields:
         DWORD 1985 ; a word, a double word, and a byte
         BYTE 'M'
.CODE
mov ebx, OFFSET mystruct
mov eax, [ebx+2] ; EAX = 1985
mov al, [ebx+6] ; AL = 'M'

16
02/03/2019

Based-Indexed Addressing

 Syntax: [Base + (Index * Scale) + disp.]


 Scale factor is optional and can be 1, 2, 4, or 8
 Useful in accessing two-dimensional arrays
 Offset: array address => we can refer to the array by name
 Base register: holds row address => relative to start of array
 Index register: selects an element of the row => column index
 Scaling factor: when array element size is 2, 4, or 8 bytes
 Useful in accessing arrays of structures (or objects)
 Base register: holds the address of the array
 Index register: holds the element address relative to the base
 Offset: represents the offset of a field within a structure

Based-Indexed Examples
.data
matrix DWORD 0, 1, 2, 3, 4 ; 4 rows, 5 cols
DWORD 10,11,12,13,14
DWORD 20,21,22,23,24
DWORD 30,31,32,33,34

ROWSIZE EQU SIZEOF matrix ; 20 bytes per row

.code
mov ebx, 2*ROWSIZE ; row index = 2
mov esi, 3 ; col index = 3
mov eax, matrix[ebx+esi*4] ; EAX = matrix[2][3]

mov ebx, 3*ROWSIZE ; row index = 3


mov esi, 1 ; col index = 1
mov eax, matrix[ebx+esi*4] ; EAX = matrix[3][1]

17
02/03/2019

LEA Instruction

 LEA = Load Effective Address


LEA r32, mem (Flat-Memory)
LEA r16, mem (Real-Address Mode)
 Calculate and load the effective address of a memory operand
 Flat memory uses 32-bit effective addresses
 Real-address mode uses 16-bit effective addresses
 LEA is similar to MOV … OFFSET, except that:
 OFFSET operator is executed by the assembler
 Used with named variables: address is known to the assembler
 LEA instruction computes effective address at runtime
 Used with indirect operands: effective address is known at runtime

LEA Examples
.data
array WORD 1000 DUP(?)

.code ; Equivalent to . . .
lea eax, array ; mov eax, OFFSET array

lea eax, array[esi] ; mov eax, esi


; add eax, OFFSET array

lea eax, array[esi*2] ; mov eax, esi


; add eax, eax
; add eax, OFFSET array

lea eax, [ebx+esi*2] ; mov eax, esi


; add eax, eax
; add eax, ebx

18
02/03/2019

Summary of Addressing Modes


 The assembler converts a variable name into a constant offset (also called a displacement)
 For indirect addressing, a base/index register contains an address/index
 The CPU computes the effective address of a memory operand

Registers Used in 32-Bit Addressing

 32-bit addressing modes use the following 32-bit registers:
Effective address = Base + (Index * Scale) + Displacement
 Base: EAX, EBX, ECX, EDX, ESI, EDI, EBP, or ESP
 Index: EAX, EBX, ECX, EDX, ESI, EDI, or EBP (ESP can be used as a base register, but not as an index; only the index register can have a scale factor)
 Scale: 1, 2, 4, or 8
 Displacement: none, 8-bit, or 32-bit

19
02/03/2019

16-bit Memory Addressing


 Old 16-bit addressing mode, used with real-address mode
 Only 16-bit registers are used
 No scale factor
 Only BX or BP can be the base register
 Only SI or DI can be the index register
 Displacement can be 0, 8, or 16 bits
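A few hedged examples of what these 16-bit rules do and do not allow (not from the original slides):
mov ax, [bx]        ; legal: BX as base
mov ax, [bp+6]      ; legal: BP base + 8-bit displacement (relative to SS)
mov ax, [bx+si+2]   ; legal: base BX, index SI, and a displacement
; mov ax, [cx]      ; illegal: CX cannot be used to address memory
; mov ax, [bx+bp]   ; illegal: two base registers
; mov ax, [si*2]    ; illegal: no scale factor in 16-bit addressing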

Default Segments

 When 32-bit register indirect addressing is used …


 Address in EAX, EBX, ECX, EDX, ESI, and EDI is relative to DS
 Address in EBP and ESP is relative to SS
 In flat-memory model, DS and SS are the same segment
 Therefore, no need to worry about the default segment
 When 16-bit register indirect addressing is used …
 Address in BX, SI, or DI is relative to the data segment DS
 Address in BP is relative to the stack segment SS
 In real-address mode, DS and SS can be different segments
 We can override the default segment using segment prefix
 mov ax, ss:[bx] ; address in bx is relative to stack segment
 mov ax, ds:[bp] ; address in bp is relative to data segment

20
02/03/2019

JMP Instruction
 JMP is an unconditional jump to a destination instruction
 Syntax: JMP destination
 JMP causes the modification of the EIP register: EIP ← destination address
 A label is used to identify the destination address
 Example: top:
. . .
jmp top
 JMP provides an easy way to create a loop
 Loop will continue endlessly unless we find a way to terminate it

LOOP Instruction
 The LOOP instruction creates a counting loop
 Syntax: LOOP destination
 Logic: ECX ← ECX – 1
if ECX != 0, jump to destination label
 ECX register is used as a counter to count the iterations
 Example: calculate the sum of integers from 1 to 100
mov eax, 0 ; sum = eax
mov ecx, 100 ; count = ecx
L1:
add eax, ecx ; accumulate sum in eax
loop L1 ; decrement ecx until 0

21
02/03/2019

Your turn . . .

 Example 1:
mov eax,6
mov ecx,4
L1:
inc eax
loop L1
What will be the final value of EAX? Solution: 10

 Example 2:
mov eax,1
mov ecx,0
L2:
dec eax
loop L2
How many times will the loop execute? Solution: 2^32 = 4,294,967,296 (ECX wraps around from 0)
What will be the final value of EAX? Solution: the same value, 1

Nested Loop

If you need to code a loop within a loop, you must


save the outer loop counter's ECX value
.DATA
count DWORD ?
.CODE
mov ecx, 100 ; set outer loop count to 100
L1:
mov count, ecx ; save outer loop count
mov ecx, 20 ; set inner loop count to 20
L2: .
.
loop L2 ; repeat the inner loop
mov ecx, count ; restore outer loop count
loop L1 ; repeat the outer loop

22
02/03/2019

Copying a String

The following code copies a string from source to target

.DATA
source BYTE "This is the source string",0
target BYTE SIZEOF source DUP(0)
.CODE
main PROC
mov esi,0 ; index register
mov ecx, SIZEOF source ; loop counter (good use of SIZEOF)
L1:
mov al,source[esi] ; get char from source
mov target[esi],al ; store it in the target
inc esi ; increment index
loop L1 ; loop for entire string
exit
main ENDP
END main
 ESI is used to index the source & target strings

Summing an Integer Array

This program calculates the sum of an array of 16-bit integers

.DATA
intarray WORD 100h,200h,300h,400h,500h,600h
.CODE
main PROC
mov esi, OFFSET intarray ; address of intarray
mov ecx, LENGTHOF intarray ; loop counter
mov ax, 0 ; zero the accumulator
L1:
add ax, [esi] ; accumulate sum in ax
add esi, 2 ; point to next integer
loop L1 ; repeat until ecx = 0
exit
main ENDP
END main
 ESI is used as a pointer; it contains the address of an array element

23
02/03/2019

Summing an Integer Array – cont'd

This program calculates the sum of an array of 32-bit integers

.DATA
intarray DWORD 10000h,20000h,30000h,40000h,50000h,60000h
.CODE
main PROC
mov esi, 0 ; index of intarray
mov ecx, LENGTHOF intarray ; loop counter
mov eax, 0 ; zero the accumulator
L1:
add eax, intarray[esi*4] ; accumulate sum in eax
inc esi ; increment index
loop L1 ; repeat until ecx = 0
exit
main ENDP
END main
 ESI is used as a scaled index

PC-Relative Addressing

The following loop calculates the sum: 1 to 1000

Offset Machine Code Source Code


00000000 B8 00000000 mov eax, 0
00000005 B9 000003E8 mov ecx, 1000
0000000A L1:
0000000A 03 C1 add eax, ecx
0000000C E2 FC loop L1
0000000E . . . . . .

When LOOP is assembled, the label L1 in LOOP is translated as FC


which is equal to –4 (decimal). This causes the loop instruction to
jump 4 bytes backwards from the offset of the next instruction. Since
the offset of the next instruction = 0000000E, adding –4 (FCh) causes
a jump to location 0000000A. This jump is called PC-relative.

24
02/03/2019

PC-Relative Addressing – cont'd

Assembler:
Calculates the difference (in bytes), called PC-relative offset, between
the offset of the target label and the offset of the following instruction

Processor:
Adds the PC-relative offset to EIP when executing LOOP instruction

If the PC-relative offset is encoded in a single signed byte,


(a) what is the largest possible backward jump?
(b) what is the largest possible forward jump?

Answers: (a) –128 bytes and (b) +127 bytes

Summary

 Data Transfer
 MOV, MOVSX, MOVZX, and XCHG instructions
 Arithmetic
 ADD, SUB, INC, DEC, NEG, ADC, SBB, STC, and CLC
 Carry, Overflow, Sign, Zero, Auxiliary and Parity flags
 Addressing Modes
 Register, immediate, direct, indirect, indexed, based-indexed
 Load Effective Address (LEA) instruction
 32-bit and 16-bit addressing
 JMP and LOOP Instructions
 Traversing and summing arrays, copying strings
 PC-relative addressing

25
Stack and Procedures

Computer Organization
&
Assembly Language Programming

Dr Adnan Gutub
aagutub ‘at’ uqu.edu.sa
[Adapted from slides of Dr. Kip Irvine: Assembly Language for Intel-Based Computers]
Most Slides contents have been arranged by Dr Muhamed Mudawar & Dr Aiman El-Maleh from Computer Engineering Dept. at KFUPM

Presentation Outline

 Runtime Stack
 Stack Operations
 Defining and Using Procedures
 Program Design Using Procedures

Stack and Procedures Computer Organization & Assembly Language Programming slide 2/46

1
What is a Stack?
 Stack is a Last-In-First-Out (LIFO) data structure
 Analogous to a stack of plates in a cafeteria
 Plate on Top of Stack is directly accessible
 Two basic stack operations
 Push: inserts a new element on top of the stack
 Pop: deletes top element from the stack
 View the stack as a linear array of elements
 Insertion and deletion is restricted to one end of array
 Stack has a maximum capacity
 When stack is full, no element can be pushed
 When stack is empty, no element can be popped
Stack and Procedures Computer Organization & Assembly Language Programming slide 3/46

Runtime Stack
 Runtime stack: array of consecutive memory locations
 Managed by the processor using two registers
 Stack Segment register SS
 Not modified in protected mode, SS points to segment descriptor
 Stack Pointer register ESP
 For 16-bit real-address mode programs, SP register is used

 ESP register points to the top of stack


 Always points to last data item placed on the stack
 Only words and doublewords can be pushed and popped
 But not single bytes
 Stack grows downward toward lower memory addresses
Stack and Procedures Computer Organization & Assembly Language Programming slide 4/46

2
Runtime Stack Allocation
 .STACK directive specifies a runtime stack
 Operating system allocates memory for the stack
high
 Runtime stack is initially empty ESP = 0012FFC4 address
?
 The stack size can change dynamically at runtime ?
?
?
 Stack pointer ESP ?
?
 ESP is initialized by the operating system ?
?
 Typical initial value of ESP = 0012FFC4h .
.
.
 The stack grows downwards ?
 The memory below ESP is free low
address
 ESP is decremented to allocate stack memory
Stack and Procedures Computer Organization & Assembly Language Programming slide 5/46

Presentation Outline

 Runtime Stack
 Stack Operations
 Defining and Using Procedures
 Program Design Using Procedures

Stack and Procedures Computer Organization & Assembly Language Programming slide 6/46

3
Stack Instructions
 Two basic stack instructions:
 push source
 pop destination
 Source can be a word (16 bits) or doubleword (32 bits)
 General-purpose register
 Segment register: CS, DS, SS, ES, FS, GS
 Memory operand, memory-to-stack transfer is allowed
 Immediate value
 Destination can be also a word or doubleword
 General-purpose register
 Segment register, except that pop CS is NOT allowed
 Memory, stack-to-memory transfer is allowed
Stack and Procedures Computer Organization & Assembly Language Programming slide 7/46

Push Instruction
 Push source32 (r/m32 or imm32)
 ESP is first decremented by 4
 ESP = ESP – 4 (stack grows by 4 bytes)
 32-bit source is then copied onto the stack at the new ESP
 [ESP] = source32
 Push source16 (r/m16)
 ESP is first decremented by 2
 ESP = ESP – 2 (stack grows by 2 bytes)
 16-bit source is then copied on top of stack at the new ESP
 [ESP] = source16
 Operating system puts a limit on the stack capacity
 Push can cause a Stack Overflow (stack cannot grow)
Stack and Procedures Computer Organization & Assembly Language Programming slide 8/46

4
Examples on the Push Instruction
 Suppose we execute:
PUSH EAX ; EAX = 125C80FFh
PUSH EBX ; EBX = 2Eh
PUSH ECX ; ECX = 9B61Dh
 The stack grows downwards; the area below ESP is free
 Before: ESP = 0012FFC4 (stack empty)
 After: [0012FFC0] = 125C80FF, [0012FFBC] = 0000002E, [0012FFB8] = 0009B61D, ESP = 0012FFB8

Stack and Procedures Computer Organization & Assembly Language Programming slide 9/46

Pop Instruction
 Pop dest32 (r/m32)
 32-bit doubleword at ESP is first copied into dest32
 dest32 = [ESP]
 ESP is then incremented by 4
 ESP = ESP + 4 (stack shrinks by 4 bytes)
 Pop dest16 (r/m16)
 16-bit word at ESP is first copied into dest16
 dest16 = [ESP]
 ESP is then incremented by 2
 ESP = ESP + 2 (stack shrinks by 2 bytes)
 Popping from an empty stack causes a stack underflow
Stack and Procedures Computer Organization & Assembly Language Programming slide 10/46

5
Examples on the Pop Instruction
 Suppose we execute:
POP SI ; SI = B61Dh
POP DI ; DI = 0009h
 The stack shrinks upwards; the area at & above ESP remains allocated
 Before: ESP = 0012FFB8, and the stack holds (top to bottom) 0009B61D, 0000002E, 125C80FF
 After: SI = B61Dh, DI = 0009h, ESP = 0012FFBC; 125C80FF and 0000002E are still on the stack

Stack and Procedures Computer Organization & Assembly Language Programming slide 11/46

Uses of the Runtime Stack


 Runtime Stack can be utilized for
 Temporary storage of data and registers
 Transfer of program control in procedures and interrupts
 Parameter passing during a procedure call
 Allocating local variables used inside procedures

 Stack can be used as temporary storage of data


 Example: exchanging two variables in a data segment
push var1 ; var1 is pushed
push var2 ; var2 is pushed
pop var1 ; var1 = var2 on stack
pop var2 ; var2 = var1 on stack

Stack and Procedures Computer Organization & Assembly Language Programming slide 12/46

6
Temporary Storage of Registers
 Stack is often used to free a set of registers
push EBX ; save EBX
push ECX ; save ECX
. . .
; EBX and ECX can now be modified
. . .
pop ECX ; restore ECX first, then
pop EBX ; restore EBX

 Example on moving DX:AX into EBX


push DX ; push most significant word first
push AX ; then push least significant word
pop EBX ; EBX = DX:AX

Stack and Procedures Computer Organization & Assembly Language Programming slide 13/46

Example: Nested Loop


When writing a nested loop, push the outer loop counter
ECX before entering the inner loop, and restore ECX after
exiting the inner loop and before repeating the outer loop

mov ecx, 100 ; set outer loop count


L1: . . . ; begin the outer loop
push ecx ; save outer loop count

mov ecx, 20 ; set inner loop count


L2: . . . ; begin the inner loop
. . . ; inner loop
loop L2 ; repeat the inner loop

. . . ; outer loop
pop ecx ; restore outer loop count
loop L1 ; repeat the outer loop

Stack and Procedures Computer Organization & Assembly Language Programming slide 14/46

7
Push/Pop All Registers
 pushad
 Pushes all the 32-bit general-purpose registers
 EAX, ECX, EDX, EBX, ESP, EBP, ESI, and EDI in this order
 Initial ESP value (before pushad) is pushed
 ESP = ESP – 32
 pusha
 Same as pushad but pushes all 16-bit registers AX through DI
 ESP = ESP – 16
 popad
 Pops into registers EDI through EAX in reverse order of pushad
 ESP is not read from stack. It is computed as: ESP = ESP + 32
 popa
 Same as popad but pops into 16-bit registers. ESP = ESP + 16
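A small illustrative use (not from the slides): preserving everything around a block of code that clobbers several registers:
pushad              ; save EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI
mov eax, 0          ; these registers are now free to modify
mov ecx, 100
mov esi, 0
; ... work that changes many registers ...
popad               ; all general-purpose registers restored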
Stack and Procedures Computer Organization & Assembly Language Programming slide 15/46

Stack Instructions on Flags


 Special Stack instructions for pushing and popping flags
 pushfd
 Push the 32-bit EFLAGS

 popfd
 Pop the 32-bit EFLAGS

 No operands are required


 Useful for saving and restoring the flags
 For 16-bit programs use pushf and popf
 Push and Pop the 16-bit FLAG register
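For instance (an illustrative sketch; SomeProc is a hypothetical procedure that may change the flags):
pushfd              ; save EFLAGS, including the Carry flag
call SomeProc
popfd               ; restore EFLAGS exactly as they were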

Stack and Procedures Computer Organization & Assembly Language Programming slide 16/46

8
Next . . .

 Runtime Stack
 Stack Operations
 Defining and Using Procedures
 Program Design Using Procedures

Stack and Procedures Computer Organization & Assembly Language Programming slide 17/46

Procedures
 A procedure is a logically self-contained unit of code
 Called sometimes a function, subprogram, or subroutine
 Receives a list of parameters, also called arguments
 Performs computation and returns results
 Plays an important role in modular program development
 Example of a procedure (called function) in C language
int sumof ( int x, int y, int z ) {   // int = result type; (x, y, z) = formal parameter list
    int temp;
    temp = x + y + z;
    return temp;                      // return function result
}
 The above function sumof can be called as follows:
sum = sumof( num1, num2, num3 );      // (num1, num2, num3) = actual parameter list
Stack and Procedures Computer Organization & Assembly Language Programming slide 18/46

9
Defining a Procedure in Assembly
 Assembler provides two directives to define procedures
 PROC to define name of procedure and mark its beginning
 ENDP to mark end of procedure
 A typical procedure definition is

procedure_name PROC
. . .
; procedure body
. . .
procedure_name ENDP

 procedure_name should match in PROC and ENDP

Stack and Procedures Computer Organization & Assembly Language Programming slide 19/46

Documenting Procedures
 Suggested Documentation for Each Procedure:
 Does: Describe the task accomplished by the procedure

 Receives: Describe the input parameters

 Returns: Describe the values returned by the procedure

 Requires: List of requirements called preconditions

 Preconditions
 Must be satisfied before the procedure is called

 If a procedure is called without its preconditions satisfied, it will


probably not produce the expected output

Stack and Procedures Computer Organization & Assembly Language Programming slide 20/46

10
Example of a Procedure Definition
 The sumof procedure receives three integer parameters
 Assumed to be in EAX, EBX, and ECX
 Computes and returns result in register EAX
;------------------------------------------------
; Sumof: Calculates the sum of three integers
; Receives: EAX, EBX, ECX, the three integers
; Returns: EAX = sum
; Requires: nothing
;------------------------------------------------
sumof PROC
add EAX, EBX ; EAX = EAX + second number
add EAX, ECX ; EAX = EAX + third number
ret ; return to caller
sumof ENDP

 The ret instruction returns control to the caller


Stack and Procedures Computer Organization & Assembly Language Programming slide 21/46

The Call Instruction


 To invoke a procedure, the call instruction is used
 The call instruction has the following format
call procedure_name
 Example on calling the procedure sumof
 Caller passes actual parameters in EAX, EBX, and ECX
 Before calling procedure sumof
mov EAX, num1 ; pass first parameter in EAX
mov EBX, num2 ; pass second parameter in EBX
mov ECX, num3 ; pass third parameter in ECX
call sumof ; result is in EAX
mov sum, EAX ; save result in variable sum
 call sumof will call the procedure sumof
Stack and Procedures Computer Organization & Assembly Language Programming slide 22/46

11
How a Procedure Call / Return Works
 How does a procedure know where to return?
 There can be multiple calls to same procedure in a program
 Procedure has to return differently for different calls
 It knows by saving the return address (RA) on the stack
 This is the address of next instruction after call
 The call instruction does the following
 Pushes the return address on the stack
 Jumps into the first instruction inside procedure
 ESP = ESP – 4; [ESP] = RA; EIP = procedure address
 The ret (return) instruction does the following
 Pops return address from stack
 Jumps to return address: EIP = [ESP]; ESP = ESP + 4
Stack and Procedures Computer Organization & Assembly Language Programming slide 23/46

Details of CALL and Return


Address    Machine Code      Assembly Language
           .CODE
           main PROC
00401020   A1 00405000       mov EAX, num1
00401025   8B 1D 00405004    mov EBX, num2
0040102B   8B 0D 00405008    mov ECX, num3
00401031   E8 0000004B       call sumof
00401036   A3 0040500C       mov sum, EAX
                             . . .
                             exit
           main ENDP
           sumof PROC
00401081   03 C3             add EAX, EBX
00401083   03 C1             add EAX, ECX
00401085   C3                ret
           sumof ENDP
           END main
 IP-relative call: EIP = 00401036 (address of next instruction) + 0000004B (displacement) = 00401081
 Before call: ESP = 0012FFC4. After call: ESP = 0012FFC0, and the top of the stack holds the return address RA = 00401036. After ret: ESP = 0012FFC4 again
Stack and Procedures Computer Organization & Assembly Language Programming slide 24/46

12
Don’t Mess Up the Stack !
 Just before returning from a procedure
 Make sure the stack pointer ESP is pointing at return address
 Example of a messed-up procedure
 Pushes EAX on the stack before returning
 Stack pointer ESP is NOT pointing at return address!
main PROC
call messedup
. . .
exit
main ENDP

messedup PROC
push EAX
ret              ; ESP now points at the pushed EAX value, NOT the return address!
messedup ENDP
 When ret executes, the EAX value on top of the stack is used as the return address. Where will it return?
Stack and Procedures Computer Organization & Assembly Language Programming slide 25/46

Nested Procedure Calls


main PROC
. . .
call Sub1
exit
main ENDP

Sub1 PROC
. . .
call Sub2
ret
Sub1 ENDP

Sub2 PROC
. . .
call Sub3
ret
Sub2 ENDP

Sub3 PROC
. . .
ret
Sub3 ENDP

 By the time Sub3 is called, the stack contains all three return addresses, with ESP pointing at the most recently pushed one:
return address of call Sub1
return address of call Sub2
return address of call Sub3  <-- ESP

Stack and Procedures Computer Organization & Assembly Language Programming slide 26/46

13
Parameter Passing
 Parameter passing in assembly language is different
 More complicated than that used in a high-level language
 In assembly language
 Place all required parameters in an accessible storage area
 Then call the procedure
 Two types of storage areas used
 Registers: general-purpose registers are used (register method)
 Memory: stack is used (stack method)
 Two common mechanisms of parameter passing
 Pass-by-value: parameter value is passed
 Pass-by-reference: address of parameter is passed
Stack and Procedures Computer Organization & Assembly Language Programming slide 27/46

Passing Parameters in Registers


;-----------------------------------------------------
; ArraySum: Computes the sum of an array of integers
; Receives: ESI = pointer to an array of doublewords
; ECX = number of array elements
; Returns: EAX = sum
;-----------------------------------------------------
ArraySum PROC
mov eax,0 ; set the sum to zero
L1: add eax, [esi] ; add each integer to sum
add esi, 4 ; point to next integer
loop L1 ; repeat for array size
ret
ArraySum ENDP

ESI: Reference parameter = array address


ECX: Value parameter = count of array elements

Stack and Procedures Computer Organization & Assembly Language Programming slide 28/46

14
Preserving Registers
 Need to preserve the registers across a procedure call
 Stack can be used to preserve register values
 Which registers should be saved?
 Those registers that are modified by the called procedure
 But still used by the calling procedure
 We can save all registers using pusha if we need most of them
 However, better to save only needed registers when they are few

 Who should preserve the registers?


 Calling procedure: saves and frees registers that it uses
 Registers are saved before procedure call and restored after return
 Called procedure: preferred method for modular code
 Register preservation is done in one place only (inside procedure)
Stack and Procedures Computer Organization & Assembly Language Programming slide 29/46

Example on Preserving Registers


;-----------------------------------------------------
; ArraySum: Computes the sum of an array of integers
; Receives: ESI = pointer to an array of doublewords
; ECX = number of array elements
; Returns: EAX = sum
;-----------------------------------------------------
ArraySum PROC
push esi ; save esi, it is modified
push ecx ; save ecx, it is modified
mov eax,0 ; set the sum to zero
L1: add eax, [esi] ; add each integer to sum
add esi, 4 ; point to next integer
loop L1 ; repeat for array size
pop ecx ; restore registers
pop esi ; in reverse order
ret
ArraySum ENDP
 Note: there is no need to save EAX. Why? (It carries the returned sum back to the caller.)

Stack and Procedures Computer Organization & Assembly Language Programming slide 30/46

15
USES Operator
 The USES operator simplifies the writing of a procedure
 Registers are frequently modified by procedures
 Just list the registers that should be preserved after USES
 Assembler will generate the push and pop instructions

; Source code written with USES:
ArraySum PROC USES esi ecx
mov eax,0
L1: add eax, [esi]
add esi, 4
loop L1
ret
ArraySum ENDP

; Equivalent code generated by the assembler:
ArraySum PROC
push esi
push ecx
mov eax,0
L1: add eax, [esi]
add esi, 4
loop L1
pop ecx
pop esi
ret
ArraySum ENDP
Stack and Procedures Computer Organization & Assembly Language Programming slide 31/46

Next . . .

 Runtime Stack
 Stack Operations
 Defining and Using Procedures
 Program Design Using Procedures

Stack and Procedures Computer Organization & Assembly Language Programming slide 32/46

16
Program Design using Procedures
 Program Design involves the Following:
 Break large tasks into smaller ones
 Use a hierarchical structure based on procedure calls
 Test individual procedures separately

Integer Summation Program:


Write a program that prompts the user for multiple 32-bit integers,
stores them in an array, calculates the array sum, and displays the
sum on the screen.

 Main steps:
1. Prompt user for multiple integers
2. Calculate the sum of the array
3. Display the sum
Stack and Procedures Computer Organization & Assembly Language Programming slide 33/46

Structure Chart
Summation Program (main)
 main calls: Clrscr, PromptForIntegers, ArraySum, DisplaySum
 PromptForIntegers calls: WriteString, ReadInt
 DisplaySum calls: WriteString, WriteInt
The diagram above is called a structure chart. It describes the program structure, its division into procedures, and the call sequence. Link library procedures (Clrscr, WriteString, ReadInt, WriteInt) are shown in grey in the original chart.
Stack and Procedures Computer Organization & Assembly Language Programming slide 34/46

17
Integer Summation Program – 1 of 4
INCLUDE Irvine32.inc
ArraySize EQU 5
.DATA
prompt1 BYTE "Enter a signed integer: ",0
prompt2 BYTE "The sum of the integers is: ",0
array DWORD ArraySize DUP(?)
.CODE
main PROC
call Clrscr ; clear the screen
mov esi, OFFSET array
mov ecx, ArraySize
call PromptForIntegers ; store input integers in array
call ArraySum ; calculate the sum of array
call DisplaySum ; display the sum
exit
main ENDP
Stack and Procedures Computer Organization & Assembly Language Programming slide 35/46

Integer Summation Program – 2 of 4


;-----------------------------------------------------
; PromptForIntegers: Read input integers from the user
; Receives: ESI = pointer to the array
; ECX = array size
; Returns: Fills the array with the user input
;-----------------------------------------------------
PromptForIntegers PROC USES ecx edx esi
mov edx, OFFSET prompt1
L1:
call WriteString ; display prompt1
call ReadInt ; read integer into EAX
call Crlf ; go to next output line
mov [esi], eax ; store integer in array
add esi, 4 ; advance array pointer
loop L1
ret
PromptForIntegers ENDP
Stack and Procedures Computer Organization & Assembly Language Programming slide 36/46

18
Integer Summation Program – 3 of 4
;-----------------------------------------------------
; ArraySum: Calculates the sum of an array of integers
; Receives: ESI = pointer to the array,
; ECX = array size
; Returns: EAX = sum of the array elements
;-----------------------------------------------------
ArraySum PROC USES esi ecx
mov eax,0 ; set the sum to zero
L1:
add eax, [esi] ; add each integer to sum
add esi, 4 ; point to next integer
loop L1 ; repeat for array size

ret ; sum is in EAX


ArraySum ENDP

Stack and Procedures Computer Organization & Assembly Language Programming slide 37/46

Integer Summation Program – 4 of 4


;-----------------------------------------------------
; DisplaySum: Displays the sum on the screen
; Receives: EAX = the sum
; Returns: nothing
;-----------------------------------------------------
DisplaySum PROC
mov edx, OFFSET prompt2
call WriteString ; display prompt2
call WriteInt ; display sum in EAX
call Crlf
ret
DisplaySum ENDP
END main

Stack and Procedures Computer Organization & Assembly Language Programming slide 38/46

19
Sample Output

Enter a signed integer: 550


Enter a signed integer: -23
Enter a signed integer: -96
Enter a signed integer: 20
Enter a signed integer: 7
The sum of the integers is: +458

Stack and Procedures Computer Organization & Assembly Language Programming slide 39/46

Parameter Passing Through Stack


 Parameters can be saved on the stack before a
procedure is called.
 The called procedure can easily access the parameters
using either the ESP or EBP registers without altering
ESP register.
 Example: suppose you want to implement the following pseudo-code:
i = 25;
j = 4;
Test(i, j, 1);
Then the assembly language code fragment looks like:
mov i, 25
mov j, 4
push 1
push j
push i
call Test

Stack and Procedures Computer Organization & Assembly Language Programming slide 40/46

20
Parameter Passing Through Stack

Example: Accessing parameters on the stack
Test PROC
mov AX, [ESP + 4]  ; get i
add AX, [ESP + 8]  ; add j
sub AX, [ESP + 12] ; subtract parameter 3 (the value 1) from the sum
ret
Test ENDP
 Stack layout, from lower to higher addresses: [ESP] = return address, [ESP+4] = i, [ESP+8] = j, [ESP+12] = 1

Stack and Procedures Computer Organization & Assembly Language Programming slide 41/46

Call & Return Instructions


Instruction               Operation
CALL label name           Push IP; IP = IP + displacement relative to next instruction
CALL r/m                  Push IP; IP = [r/m]
CALL label name (FAR)     Push CS; Push IP; CS:IP = address of label name
CALL m (FAR)              Push CS; Push IP; CS:IP = [m]
RET                       Pop IP
RET imm                   Pop IP; SP = SP + imm
RET (FAR)                 Pop IP; Pop CS
RET imm (FAR)             Pop IP; Pop CS; SP = SP + imm

Stack and Procedures Computer Organization & Assembly Language Programming slide 42/46

21
Freeing Passed Parameters From Stack
 Use RET N instruction to free parameters from stack

Example: Accessing parameters on the stack and freeing them on return
Test PROC
mov AX, [ESP + 4]  ; get i
add AX, [ESP + 8]  ; add j
sub AX, [ESP + 12] ; subtract parameter 3 (the value 1) from the sum
ret 12             ; return and remove the three 4-byte parameters from the stack
Test ENDP

Stack and Procedures Computer Organization & Assembly Language Programming slide 43/46

Local Variables
 Local variables are dynamic data whose values must be
preserved over the lifetime of the procedure, but not
beyond its termination.
 At the termination of the procedure, the current
environment disappears and the previous environment
must be restored.
 Space for local variables can be reserved by subtracting
the required number of bytes from ESP.
 Offsets from ESP are used to address local variables.
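A minimal sketch of ESP-based local variables as described above (MyProc is an illustrative name; the example on the next slide shows the more common EBP-based form):
MyProc PROC
    sub  esp, 8                  ; reserve two 4-byte local variables
    mov  DWORD PTR [esp], 0      ; local1 = 0
    mov  DWORD PTR [esp+4], 10   ; local2 = 10
    ; ... use the locals via [esp] and [esp+4] ...
    add  esp, 8                  ; release the local variables
    ret
MyProc ENDP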

Stack and Procedures Computer Organization & Assembly Language Programming slide 44/46

22
Local Variables

Pseudo-code (Java-like):
void Test(int i){
    int k;
    k = i+9;
    ……
}

Assembly Language:
Test PROC
push EBP
mov EBP, ESP
sub ESP, 4                 ; reserve space for local k at [EBP-4]
push EAX
mov DWORD PTR [EBP-4], 9   ; k = 9
mov EAX, [EBP + 8]         ; EAX = i (parameter passed on the stack)
add [EBP-4], EAX           ; k = i + 9
……
pop EAX
mov ESP, EBP               ; release local variables
pop EBP
ret 4                      ; return and free the parameter
Test ENDP

Stack and Procedures Computer Organization & Assembly Language Programming slide 45/46

Summary
 Procedure – Named block of executable code
 CALL: call a procedure, push return address on top of stack
 RET: pop the return address and return from procedure
 Preserve registers across procedure calls
 Runtime stack – LIFO structure – Grows downwards
 Holds return addresses, saved registers, etc.
 PUSH – insert value on top of stack, decrement ESP
 POP – remove top value of stack, increment ESP

Stack and Procedures Computer Organization & Assembly Language Programming slide 46/46

23
Conditional Processing

Computer Organization
&
Assembly Language Programming

Dr Adnan Gutub
aagutub ‘at’ uqu.edu.sa
[Adapted from slides of Dr. Kip Irvine: Assembly Language for Intel-Based Computers]
Most Slides contents have been arranged by Dr Muhamed Mudawar & Dr Aiman El-Maleh from Computer Engineering Dept. at KFUPM

Presentation Outline

 Boolean and Comparison Instructions


 Conditional Jumps
 Conditional Loop Instructions
 Translating Conditional Structures
 Indirect Jump and Table-Driven Selection
 Application: Sorting an Integer Array

Conditional Processing Computer Organization & Assembly Language Programming slide 2/55

1
AND Instruction
 Bitwise AND between each pair of matching bits
AND destination, source
 Following operand combinations are allowed AND
AND reg, reg
Operands can be
AND reg, mem
8, 16, or 32 bits
AND reg, imm and they must be
AND mem, reg of the same size
AND mem, imm
 AND instruction is 00111011
often used to AND 00001111
cleared unchanged
clear selected bits 00001011

Conditional Processing Computer Organization & Assembly Language Programming slide 3/55

Converting Characters to Uppercase


 AND instruction can convert characters to uppercase
'a' = 0 1 1 0 0 0 0 1 'b' = 0 1 1 0 0 0 1 0
'A' = 0 1 0 0 0 0 0 1 'B' = 0 1 0 0 0 0 1 0

 Solution: Use the AND instruction to clear bit 5


mov ecx, LENGTHOF mystring
mov esi, OFFSET mystring
L1: and BYTE PTR [esi], 11011111b ; clear bit 5
inc esi
loop L1

Conditional Processing Computer Organization & Assembly Language Programming slide 4/55

2
OR Instruction
 Bitwise OR operation between each pair of matching bits
OR destination, source
 Following operand combinations are allowed:
OR reg, reg
OR reg, mem
OR reg, imm
OR mem, reg
OR mem, imm
Operands can be 8, 16, or 32 bits and they must be of the same size
 OR instruction is often used to set selected bits:
   00111011
OR 11110000
 = 11111011    (upper 4 bits set, lower 4 bits unchanged)

Conditional Processing Computer Organization & Assembly Language Programming slide 5/55

Converting Characters to Lowercase


 OR instruction can convert characters to lowercase
'A' = 0 1 0 0 0 0 0 1 'B' = 0 1 0 0 0 0 1 0
'a' = 0 1 1 0 0 0 0 1 'b' = 0 1 1 0 0 0 1 0

 Solution: Use the OR instruction to set bit 5


mov ecx, LENGTHOF mystring
mov esi, OFFSET mystring
L1: or BYTE PTR [esi], 20h ; set bit 5
inc esi
loop L1

Conditional Processing Computer Organization & Assembly Language Programming slide 6/55

3
Converting Binary Digits to ASCII
 OR instruction can convert a binary digit to ASCII
0 =00000000 1 =00000001
'0' = 0 0 1 1 0 0 0 0 '1' = 0 0 1 1 0 0 0 1

 Solution: Use the OR instruction to set bits 4 and 5


or al,30h ; Convert binary digit 0 to 9 to ASCII

 What if we want to convert an ASCII digit to binary?

 Solution: Use the AND instruction to clear bits 4 to 7


and al,0Fh ; Convert ASCII '0' to '9' to binary
Conditional Processing Computer Organization & Assembly Language Programming slide 7/55

XOR Instruction
 Bitwise XOR between each pair of matching bits
XOR destination, source
 Following operand combinations are allowed:
XOR reg, reg
XOR reg, mem
XOR reg, imm
XOR mem, reg
XOR mem, imm
Operands can be 8, 16, or 32 bits and they must be of the same size
 XOR instruction is often used to invert selected bits:
    00111011
XOR 11110000
  = 11001011    (upper 4 bits inverted, lower 4 bits unchanged)

Conditional Processing Computer Organization & Assembly Language Programming slide 8/55

4
Affected Status Flags

The six status flags are affected


1. Carry Flag: Cleared by AND, OR, and XOR
2. Overflow Flag: Cleared by AND, OR, and XOR
3. Sign Flag: Copy of the sign bit in result
4. Zero Flag: Set when result is zero
5. Parity Flag: Set when parity in least-significant byte is even
6. Auxiliary Flag: Undefined by AND, OR, and XOR
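For instance, a single AND gives a concrete picture of these rules (values chosen arbitrarily for illustration):

    mov al, 0F0h
    and al, 0Fh     ; result = 0: ZF=1, SF=0, PF=1, and CF=OF=0 (always cleared)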
Conditional Processing Computer Organization & Assembly Language Programming slide 9/55

String Encryption Program


 Tasks:
 Input a message (string) from the user
 Encrypt the message
 Display the encrypted message
 Decrypt the message
 Display the decrypted message

 Sample Output

Enter the plain text: Attack at dawn.


Cipher text: «¢¢Äîä-Ä¢-ïÄÿü-Gs
Decrypted: Attack at dawn.

Conditional Processing Computer Organization & Assembly Language Programming slide 10/55

5
Encrypting a String
KEY = 239 ; Can be any byte value
BUFMAX = 128
.data
buffer BYTE BUFMAX+1 DUP(0)
bufSize DWORD BUFMAX

The following loop uses the XOR instruction to


transform every character in a string into a new value

mov ecx, bufSize ; loop counter


mov esi, 0 ; index 0 in buffer
L1:
xor buffer[esi], KEY ; translate a byte
inc esi ; point to next byte
loop L1

Conditional Processing Computer Organization & Assembly Language Programming slide 11/55

TEST Instruction
 Bitwise AND operation between each pair of bits
TEST destination, source
 The flags are affected similar to the AND Instruction
 However, TEST does NOT modify the destination operand
 TEST instruction can check several bits at once
 Example: Test whether bit 0 or bit 3 is set in AL
 Solution: test al, 00001001b ; test bits 0 & 3
 We only need to check the zero flag
; If zero flag => both bits 0 and 3 are clear
; If Not zero => either bit 0 or 3 is set

Conditional Processing Computer Organization & Assembly Language Programming slide 12/55

6
NOT Instruction
 Inverts all the bits in a destination operand
NOT destination
 Result is called the 1's complement
 Destination can be a register or memory:
NOT reg
NOT mem
Example: NOT 00111011 = 11000100    (all bits inverted)

 None of the Flags is affected by the NOT instruction

Conditional Processing Computer Organization & Assembly Language Programming slide 13/55

CMP Instruction
 CMP (Compare) instruction performs a subtraction
Syntax: CMP destination, source
Computes: destination – source
 Destination operand is NOT modified
 All six flags: OF, CF, SF, ZF, AF, and PF are affected
 CMP uses the same operand combinations as SUB
 Operands can be 8, 16, or 32 bits and must be of the same size
 Examples: assume EAX = 5, EBX = 10, and ECX = 5
cmp eax, ebx ; OF=0, CF=1, SF=1, ZF=0
cmp eax, ecx ; OF=0, CF=0, SF=0, ZF=1

Conditional Processing Computer Organization & Assembly Language Programming slide 14/55

7
Unsigned Comparison
 CMP can perform unsigned and signed comparisons
 The destination and source operands can be unsigned or signed

 For unsigned comparison, we examine ZF and CF flags


Unsigned Comparison                          Flags
unsigned destination < unsigned source       CF = 1
unsigned destination > unsigned source       ZF = 0, CF = 0
destination = source                         ZF = 1
To check for equality, it is enough to check the ZF flag.

 CMP does a subtraction and CF is the borrow flag


CF = 1 if and only if unsigned destination < unsigned source

 Assume AL = 5 and BL = -1 = FFh


cmp al, bl ; Sets carry flag CF = 1
Conditional Processing Computer Organization & Assembly Language Programming slide 15/55

Signed Comparison
 For signed comparison, we examine SF, OF, and ZF
Signed Comparison Flags
signed destination < signed source SF ≠ OF
signed destination > signed source SF = OF, ZF = 0
destination = source ZF = 1

 Recall for subtraction, the overflow flag is set when …


 Operands have different signs and result sign ≠ destination sign
 CMP AL, BL (consider the four cases shown below)
Case 1 AL = 80 BL = 50 OF = 0 SF = 0 AL > BL
Case 2 AL = -80 BL = -50 OF = 0 SF = 1 AL < BL
Case 3 AL = 80 BL = -50 OF = 1 SF = 1 AL > BL
Case 4 AL = -80 BL = 50 OF = 1 SF = 0 AL < BL

Conditional Processing Computer Organization & Assembly Language Programming slide 16/55

8
Next . . .

 Boolean and Comparison Instructions


 Conditional Jumps
 Conditional Loop Instructions
 Translating Conditional Structures
 Indirect Jump and Table-Driven Selection
 Application: Sorting an Integer Array

Conditional Processing Computer Organization & Assembly Language Programming slide 17/55

Conditional Structures
 No high-level control structures in assembly language
 Comparisons and conditional jumps are used to …
 Implement conditional structures such as IF statements
 Implement conditional loops

 Types of Conditional Jump Instructions


 Jumps based on specific flags
 Jumps based on equality
 Jumps based on the value of CX or ECX
 Jumps based on unsigned comparisons
 Jumps based on signed comparisons
Conditional Processing Computer Organization & Assembly Language Programming slide 18/55

9
Jumps Based on Specific Flags
 Conditional Jump Instruction has the following syntax:
Jcond destination ; cond is the jump condition

 Destination: a label
 Prior to the 386: the jump must be within –128 to +127 bytes of the current location
 IA-32: a 32-bit offset permits a jump anywhere in memory

Conditional Processing Computer Organization & Assembly Language Programming slide 19/55

Jumps Based on Equality

 JE is equivalent to JZ; JNE is equivalent to JNZ
 JECXZ terminates a loop if ECX is zero; the check happens once at the beginning

    jecxz L2        ; exit loop
L1: . . .           ; loop body
    loop L1
L2:

Conditional Processing Computer Organization & Assembly Language Programming slide 20/55

10
Examples of Jump on Zero
 Task: Check whether integer value in EAX is even
Solution: TEST whether the least significant bit is 0
If zero, then EAX is even, otherwise it is odd

test eax, 1 ; test bit 0 of eax


jz EvenVal ; jump if Zero flag is set

 Task: Jump to label L1 if bits 0, 1, and 3 in AL are all set


Solution:
and al,00001011b ; clear bits except 0,1,3
cmp al,00001011b ; check bits 0,1,3
je L1 ; all set? jump to L1

Conditional Processing Computer Organization & Assembly Language Programming slide 21/55

Jumps Based on Unsigned Comparison

Task: Jump to a label if unsigned EAX is less than EBX

Solution:
    cmp eax, ebx
    jb IsBelow          ; JB condition: CF = 1
Conditional Processing Computer Organization & Assembly Language Programming slide 22/55

11
Jumps Based on Signed Comparisons

Task: Jump to a label if signed EAX is less than EBX

Solution:
    cmp eax, ebx
    jl IsLess           ; JL condition: SF ≠ OF
Conditional Processing Computer Organization & Assembly Language Programming slide 23/55

Compare and Jump Examples


Jump to L1 if unsigned EAX is greater than Var1

Solution:
    cmp eax, Var1
    ja L1               ; JA condition: CF = 0, ZF = 0

Jump to L1 if signed EAX is greater than Var1

Solution:
    cmp eax, Var1
    jg L1               ; JG condition: OF = SF, ZF = 0

Jump to L1 if signed EAX is greater than or equal to Var1

Solution:
    cmp eax, Var1
    jge L1              ; JGE condition: OF = SF
Conditional Processing Computer Organization & Assembly Language Programming slide 24/55

12
Computing the Max and Min
 Compute the Max of unsigned EAX and EBX
Solution:
    mov Max, eax        ; assume Max = eax
    cmp Max, ebx
    jae done
    mov Max, ebx        ; Max = ebx
done:

 Compute the Min of signed EAX and EBX

Solution:
    mov Min, eax        ; assume Min = eax
    cmp Min, ebx
    jle done
    mov Min, ebx        ; Min = ebx
done:
Conditional Processing Computer Organization & Assembly Language Programming slide 25/55

Application: Sequential Search


; Receives: esi = array address
; ecx = array size
; eax = search value
; Returns: esi = address of found element
search PROC USES ecx
jecxz notfound
L1:
cmp [esi], eax ; array element = search value?
je found ; yes? found element
add esi, 4 ; no? point to next array element
loop L1
notfound:
mov esi, 0 ; if not found then esi = 0
found:
ret ; if found, esi = element address
search ENDP
Conditional Processing Computer Organization & Assembly Language Programming slide 26/55

13
BT Instruction
 BT = Bit Test Instruction
 Syntax:
BT r/m16, r16
BT r/m32, r32
BT r/m16, imm8
BT r/m32, imm8

 Copies bit n from an operand into the Carry flag


 Example: jump to label L1 if bit 9 is set in AX register

bt AX, 9 ; CF = bit 9
jc L1 ; jump if Carry to L1

Conditional Processing Computer Organization & Assembly Language Programming slide 27/55

Next . . .

 Boolean and Comparison Instructions


 Conditional Jumps
 Conditional Loop Instructions
 Translating Conditional Structures
 Indirect Jump and Table-Driven Selection
 Application: Sorting an Integer Array

Conditional Processing Computer Organization & Assembly Language Programming slide 28/55

14
LOOPZ and LOOPE
 Syntax:
LOOPE destination
LOOPZ destination

 Logic:
 ECX = ECX – 1
 if ECX > 0 and ZF=1, jump to destination

 Useful when scanning an array for the first element that


does not match a given value.

Conditional Processing Computer Organization & Assembly Language Programming slide 29/55

LOOPNZ and LOOPNE


 Syntax:
LOOPNZ destination
LOOPNE destination
 Logic:
 ECX  ECX – 1;
 if ECX > 0 and ZF=0, jump to destination

 Useful when scanning an array for the first element that


matches a given value.

Conditional Processing Computer Organization & Assembly Language Programming slide 30/55

15
LOOPZ Example
The following code finds the first negative value in an array
.data
array SWORD 17,10,30,40,4,-5,8
.code
mov esi, OFFSET array – 2 ; start before first
mov ecx, LENGTHOF array ; loop counter
L1:
add esi, 2 ; point to next element
test WORD PTR [esi], 8000h ; test sign bit
loopz L1 ; ZF = 1 if value >= 0
jnz found ; found negative value
notfound:
. . . ; ESI points to last array element
found:
. . . ; ESI points to first negative value

Conditional Processing Computer Organization & Assembly Language Programming slide 31/55

Your Turn . . .
Locate the first zero value in an array
If none is found, let ESI be initialized to 0

.data
array SWORD -3,7,20,-50,10,0,40,4
.code
mov esi, OFFSET array – 2 ; start before first
mov ecx, LENGTHOF array ; loop counter
L1:
add esi, 2 ; point to next element
cmp WORD PTR [esi], 0 ; check for zero
loopne L1 ; continue if not zero
JE Found
XOR ESI, ESI
Found:

Conditional Processing Computer Organization & Assembly Language Programming slide 32/55

16
Next . . .

 Boolean and Comparison Instructions


 Conditional Jumps
 Conditional Loop Instructions
 Translating Conditional Structures
 Indirect Jump and Table-Driven Selection
 Application: Sorting an Integer Array

Conditional Processing Computer Organization & Assembly Language Programming slide 33/55

Block-Structured IF Statements
 IF statement in high-level languages (such as C or Java)
 Boolean expression (evaluates to true or false)
 List of statements performed when the expression is true
 Optional list of statements performed when expression is false
 Task: Translate IF statements into assembly language
 Example:

if( var1 == var2 )
    X = 1;
else
    X = 2;

    mov eax,var1
    cmp eax,var2
    jne elsepart
    mov X,1
    jmp next
elsepart:
    mov X,2
next:
Conditional Processing Computer Organization & Assembly Language Programming slide 34/55

17
Your Turn . . .
 Translate the IF statement to assembly language
 All values are unsigned

if( ebx <= ecx )
{
    eax = 5;
    edx = 6;
}

    cmp ebx,ecx
    ja next
    mov eax,5
    mov edx,6
next:

There can be multiple correct solutions

Conditional Processing Computer Organization & Assembly Language Programming slide 35/55

Your Turn . . .
 Implement the following IF in assembly language
 All variables are 32-bit signed integers
if (var1 <= var2) {
    var3 = 10;
}
else {
    var3 = 6;
    var4 = 7;
}

    mov eax,var1
    cmp eax,var2
    jle ifpart
    mov var3,6
    mov var4,7
    jmp next
ifpart:
    mov var3,10
next:

There can be multiple correct solutions


Conditional Processing Computer Organization & Assembly Language Programming slide 36/55

18
Compound Expression with AND
 HLLs use short-circuit evaluation for logical AND
 If first expression is false, second expression is skipped
if ((al > bl) && (bl > cl)) {X = 1;}

; One Possible Implementation ...


cmp al, bl ; first expression ...
ja L1 ; unsigned comparison
jmp next
L1: cmp bl,cl ; second expression ...
ja L2 ; unsigned comparison
jmp next
L2: mov X,1 ; both are true
next:
Conditional Processing Computer Organization & Assembly Language Programming slide 37/55

Better Implementation for AND


if ((al > bl) && (bl > cl)) {X = 1;}

The following implementation uses less code


By reversing the relational operator, we allow the program to
fall through to the second expression
The number of instructions is reduced from 7 to 5

cmp al,bl ; first expression...


jbe next ; quit if false
cmp bl,cl ; second expression...
jbe next ; quit if false
mov X,1 ; both are true
next:
Conditional Processing Computer Organization & Assembly Language Programming slide 38/55

19
Your Turn . . .
 Implement the following IF in assembly language
 All values are unsigned

if ((ebx <= ecx) &&
    (ecx > edx))
{
    eax = 5;
    edx = 6;
}

    cmp ebx,ecx
    ja next
    cmp ecx,edx
    jbe next
    mov eax,5
    mov edx,6
next:

Conditional Processing Computer Organization & Assembly Language Programming slide 39/55

Application: IsDigit Procedure


Receives a character in AL
Sets the Zero flag if the character is a decimal digit
if (al >= '0' && al <= '9') {ZF = 1;}

IsDigit PROC
cmp al,'0' ; AL < '0' ?
jb L1 ; yes? ZF=0, return
cmp al,'9' ; AL > '9' ?
ja L1 ; yes? ZF=0, return
test al, 0 ; ZF = 1
L1: ret
IsDigit ENDP

Conditional Processing Computer Organization & Assembly Language Programming slide 40/55

20
Compound Expression with OR
 HLLs use short-circuit evaluation for logical OR
 If first expression is true, second expression is skipped
if ((al > bl) || (bl > cl)) {X = 1;}

 Use fall-through to keep the code as short as possible

cmp al,bl ; is AL > BL?


ja L1 ; yes, execute if part
cmp bl,cl ; no: is BL > CL?
jbe next ; no: skip if part
L1: mov X,1 ; set X to 1
next:

Conditional Processing Computer Organization & Assembly Language Programming slide 41/55

WHILE Loops
A WHILE loop can be viewed as
IF statement followed by
The body of the loop, followed by
Unconditional jump to the top of the loop

while( eax < ebx) { eax = eax + 1; }

This is a possible implementation:

top: cmp eax,ebx ; eax < ebx ?


jae next ; false? then exit loop
inc eax ; body of loop
jmp top ; repeat the loop
next:
Conditional Processing Computer Organization & Assembly Language Programming slide 42/55

21
Your Turn . . .
Implement the following loop, assuming unsigned integers

while (ebx <= var1) {


ebx = ebx + 5;
var1 = var1 - 1
}

top: cmp ebx,var1 ; ebx <= var1?


ja next ; false? exit loop
add ebx,5 ; execute body of loop
dec var1
jmp top ; repeat the loop
next:

Conditional Processing Computer Organization & Assembly Language Programming slide 43/55

Yet Another Solution for While


Check the loop condition at the end of the loop
No need for JMP, loop body is reduced by 1 instruction
while (ebx <= var1) {
ebx = ebx + 5;
var1 = var1 - 1
}

cmp ebx,var1 ; ebx <= var1?


ja next ; false? exit loop
top: add ebx,5 ; execute body of loop
dec var1
cmp ebx, var1 ; ebx <= var1?
jbe top ; true? repeat the loop
next:
Conditional Processing Computer Organization & Assembly Language Programming slide 44/55

22
Next . . .

 Boolean and Comparison Instructions


 Conditional Jumps
 Conditional Loop Instructions
 Translating Conditional Structures
 Indirect Jump and Table-Driven Selection
 Application: Sorting an Integer Array

Conditional Processing Computer Organization & Assembly Language Programming slide 45/55

Indirect Jump
 Direct Jump: Jump to a Labeled Destination
 Destination address is a constant
 Address is encoded in the jump instruction
 Address is an offset relative to EIP (Instruction Pointer)

 Indirect jump
 Destination address is a variable or register
 Address is stored in memory/register
 Address is absolute

 Syntax: JMP mem32/reg32


 32-bit absolute address is stored in mem32/reg32 for FLAT
memory
 Indirect jump is used to implement switch statements
Conditional Processing Computer Organization & Assembly Language Programming slide 46/55

23
Switch Statement
 Consider the following switch statement:
Switch (ch) {
case '0': exit();
case '1': count++; break;
case '2': count--; break;
case '3': count += 5; break;
case '4': count -= 5; break;
default : count = 0;
}

 How to translate above statement into assembly code?


 We can use a sequence of compares and jumps
 A better solution is to use the indirect jump
Conditional Processing Computer Organization & Assembly Language Programming slide 47/55

Implementing the Switch Statement


There are many case labels. How do we jump to the correct one?
Answer: Define a jump table and use an indirect jump to reach the correct label.

case0:
    exit
case1:
    inc count
    jmp exitswitch
case2:
    dec count
    jmp exitswitch
case3:
    add count, 5
    jmp exitswitch
case4:
    sub count, 5
    jmp exitswitch
default:
    mov count, 0
exitswitch:
Conditional Processing Computer Organization & Assembly Language Programming slide 48/55

24
Jump Table and Indirect Jump
 Jump Table is an array of double words
 Contains the case labels of the switch statement
 Can be defined inside the same procedure of switch statement
jumptable DWORD case0, case1, case2, case3, case4
; The assembler converts the labels to addresses

 Indirect jump uses jump table to jump to selected label


movzx eax, ch ; move ch to eax
sub eax, '0' ; convert ch to a number
cmp eax, 4 ; eax > 4 ?
ja default ; default case
jmp jumptable[eax*4] ; Indirect jump
Conditional Processing Computer Organization & Assembly Language Programming slide 49/55

Next . . .

 Boolean and Comparison Instructions


 Conditional Jumps
 Conditional Loop Instructions
 Translating Conditional Structures
 Indirect Jump and Table-Driven Selection
 Application: Sorting an Integer Array

Conditional Processing Computer Organization & Assembly Language Programming slide 50/55

25
Bubble Sort
 Consider sorting an array of 5 elements: 5 1 3 2 4
First Pass (4 comparisons) 5 1 3 2 4
Compare 5 with 1 and swap: 1 5 3 2 4 (swap)
Compare 5 with 3 and swap: 1 3 5 2 4 (swap)
Compare 5 with 2 and swap: 1 3 2 5 4 (swap)
Compare 5 with 4 and swap: 1 3 2 4 5 (swap)
Second Pass (3 comparisons) largest
Compare 1 with 3 (No swap): 1 3 2 4 5 (no swap)
Compare 3 with 2 and swap: 1 2 3 4 5 (swap)
Compare 3 with 4 (No swap): 1 2 3 4 5 (no swap)
Third Pass (2 comparisons)
Compare 1 with 2 (No swap): 1 2 3 4 5 (no swap)
Compare 2 with 3 (No swap): 1 2 3 4 5 (no swap)
No swapping during 3rd pass → array is now sorted
Conditional Processing Computer Organization & Assembly Language Programming slide 51/55

Bubble Sort Algorithm


 Algorithm: Sort array of given size
bubbleSort(array, size) {
comparisons = size
do {
comparisons--;
sorted = true; // assume initially
for (i = 0; i<comparisons; i++) {
if (array[i] > array[i+1]) {
swap(array[i], array[i+1]);
sorted = false;
}
}
} while (! sorted)
}
Conditional Processing Computer Organization & Assembly Language Programming slide 52/55

26
Bubble Sort Procedure – Slide 1 of 2
;---------------------------------------------------
; bubbleSort: Sorts a DWORD array in ascending order
; Uses the bubble sort algorithm
; Receives: ESI = Array Address
; ECX = Array Length
; Returns: Array is sorted in place
;---------------------------------------------------
bubbleSort PROC USES eax ecx edx
outerloop:
dec ECX ; ECX = comparisons
jz sortdone ; if ECX == 0 then we are done
mov EDX, 1 ; EDX = sorted = 1 (true)
push ECX ; save ECX = comparisons
push ESI ; save ESI = array address

Conditional Processing Computer Organization & Assembly Language Programming slide 53/55

Bubble Sort Procedure – Slide 2 of 2


innerloop:
mov EAX,[ESI]
cmp EAX,[ESI+4] ; compare [ESI] and [ESI+4]
jle increment ; [ESI]<=[ESI+4]? don’t swap
xchg EAX,[ESI+4] ; swap [ESI] and [ESI+4]
mov [ESI],EAX
mov EDX,0 ; EDX = sorted = 0 (false)
increment:
add ESI,4 ; point to next element
loop innerloop ; end of inner loop
pop ESI ; restore ESI = array address
pop ECX ; restore ECX = comparisons
cmp EDX,1 ; sorted == 1?
jne outerloop ; No? loop back
sortdone:
ret ; return
bubbleSort ENDP
Conditional Processing Computer Organization & Assembly Language Programming slide 54/55

27
Summary
 Bitwise instructions (AND, OR, XOR, NOT, TEST)
 Manipulate individual bits in operands
 CMP: compares operands using implied subtraction
 Sets condition flags for later conditional jumps and loops
 Conditional Jumps & Loops
 Flag values: JZ, JNZ, JC, JNC, JO, JNO, JS, JNS, JP, JNP
 Equality: JE(JZ), JNE (JNZ), JCXZ, JECXZ
 Signed: JG (JNLE), JGE (JNL), JL (JNGE), JLE (JNG)
 Unsigned: JA (JNBE), JAE (JNB), JB (JNAE), JBE (JNA)
 LOOPZ (LOOPE), LOOPNZ (LOOPNE)
 Indirect Jump and Jump Table
Conditional Processing Computer Organization & Assembly Language Programming slide 55/55

28
Integer Arithmetic

Computer Organization
&
Assembly Language Programming

Dr Adnan Gutub
aagutub ‘at’ uqu.edu.sa
[Adapted from slides of Dr. Kip Irvine: Assembly Language for Intel-Based Computers]
Most Slides contents have been arranged by Dr Muhamed Mudawar & Dr Aiman El-Maleh from Computer Engineering Dept. at KFUPM

Outline
 Shift and Rotate Instructions
 Shift and Rotate Applications
 Multiplication and Division Instructions
 Translating Arithmetic Expressions
 Decimal String to Number Conversions

Integer Arithmetic COE 205 – KFUPM slide 2

1
SHL Instruction
 SHL is the Shift Left instruction
 Performs a logical left shift on the destination operand
 Fills the lowest bit with zero
 The last bit shifted out from the left becomes the Carry Flag

0
CF

 Operand types for SHL:
SHL reg,imm8
SHL mem,imm8
SHL reg,CL
SHL mem,CL
The shift count is either an 8-bit immediate (imm8) or stored in register CL; only the least significant 5 bits of the count are used.
Integer Arithmetic COE 205 – KFUPM slide 3

Fast Multiplication
Shifting left 1 bit multiplies a number by 2

mov dl,5 Before: 00000101 =5


shl dl,1 After: 00001010 = 10

 Shifting left n bits multiplies the operand by 2^n
 For example, 5 * 2^2 = 20

mov dl,5 ; DL = 00000101b


shl dl,2 ; DL = 00010100b = 20, CF = 0

Integer Arithmetic COE 205 – KFUPM slide 4

2
SHR Instruction
 SHR is the Shift Right instruction
 Performs a logical right shift on the destination operand
 The highest bit position is filled with a zero
 The last bit shifted out from the right becomes the Carry Flag
 SHR uses the same instruction format as SHL

0
CF

 Shifting right n bits divides the operand by 2^n


mov dl,80 ; DL = 01010000b
shr dl,1 ; DL = 00101000b = 40, CF = 0
shr dl,2 ; DL = 00001010b = 10, CF = 0
Integer Arithmetic COE 205 – KFUPM slide 5

Logical versus Arithmetic Shifts


 Logical Shift
 Fills the newly created bit position with zero

0
CF

 Arithmetic Shift
 Fills the newly created bit position with a copy of the sign bit
 Applies only to Shift Arithmetic Right (SAR)

CF

Integer Arithmetic COE 205 – KFUPM slide 6

3
SAL and SAR Instructions
 SAL: Shift Arithmetic Left is identical to SHL
 SAR: Shift Arithmetic Right
 Performs a right arithmetic shift on the destination operand

CF

 SAR preserves the number's sign

mov dl,-80 ; DL = 10110000b


sar dl,1 ; DL = 11011000b = -40, CF = 0
sar dl,2 ; DL = 11110110b = -10, CF = 0

Integer Arithmetic COE 205 – KFUPM slide 7

Your Turn . . .
Indicate the value of AL and CF after each shift

mov al,6Bh ; al = 01101011b


shr al,1 ; al = 00110101b = 35h, CF = 1
shl al,3 ; al = 10101000b = A8h, CF = 1
mov al,8Ch ; al = 10001100b
sar al,1 ; al = 11000110b = C6h, CF = 0
sar al,3 ; al = 11111000b = F8h, CF = 1

Integer Arithmetic COE 205 – KFUPM slide 8

4
Effect of Shift Instructions on Flags
 The CF is the last bit shifted
 The OF is defined for single bit shift only
 It is 1 if the sign bit changes
 The ZF, SF and PF are affected according to the result
 The AF is unaffected

Integer Arithmetic COE 205 – KFUPM slide 9

ROL Instruction
 ROL is the Rotate Left instruction
 Rotates each bit to the left, according to the count operand
 Highest bit is copied into the Carry Flag and into the Lowest Bit

 No bits are lost

CF

mov al,11110000b
rol al,1 ; AL = 11100001b, CF = 1
mov dl,3Fh ; DL = 00111111b
rol dl,4 ; DL = 11110011b = F3h, CF = 1

Integer Arithmetic COE 205 – KFUPM slide 10

5
ROR Instruction
 ROR is the Rotate Right instruction
 Rotates each bit to the right, according to the count operand
 Lowest bit is copied into the Carry flag and into the highest bit

 No bits are lost

CF

mov al,11110000b
ror al,1 ; AL = 01111000b, CF = 0
mov dl,3Fh ; DL = 00111111b
ror dl,4 ; DL = F3h, CF = 1

Integer Arithmetic COE 205 – KFUPM slide 11

RCL Instruction
 RCL is the Rotate Carry Left instruction
 Rotates each bit to the left, according to the count operand
 Copies the Carry flag to the least significant bit
 Copies the most significant bit to the Carry flag
 As if the carry flag is part of the destination operand
CF

clc ; clear carry, CF = 0


mov bl,88h ; BL = 10001000b
rcl bl,1 ; CF = 1, BL = 00010000b
rcl bl,2 ; CF = 0, BL = 01000010b
Integer Arithmetic COE 205 – KFUPM slide 12

6
RCR Instruction
 RCR is the Rotate Carry Right instruction
 Rotates each bit to the right, according to the count operand
 Copies the Carry flag to the most significant bit
 Copies the least significant bit to the Carry flag
 As if the carry flag is part of the destination operand
CF

stc ; set carry, CF = 1


mov ah,11h ; AH = 00010001b
rcr ah,1 ; CF = 1, AH = 10001000b
rcr ah,3 ; CF = 0, AH = 00110001b
Integer Arithmetic COE 205 – KFUPM slide 13

Effect of Rotate Instructions on Flags


 The CF is the last bit shifted
 The OF is defined for single bit rotates only
 It is 1 if the sign bit changes
 The ZF, SF, PF and AF are unaffected

Integer Arithmetic COE 205 – KFUPM slide 14

7
SHLD Instruction
 SHLD is the Shift Left Double instruction
 Syntax: SHLD destination, source, count
 Shifts a destination operand a given count of bits to the left

 The rightmost bits of destination are filled by the leftmost


bits of the source operand
 The source operand is not modified
 Operand types:

SHLD reg/mem16, reg16, imm8/CL


SHLD reg/mem32, reg32, imm8/CL

Integer Arithmetic COE 205 – KFUPM slide 15

SHLD Example
Shift variable var1 4 bits to the left
Replace the lowest 4 bits of var1 with the high 4 bits of AX

.data
var1 WORD 9BA6h
.code
mov ax, 0AC36h
shld var1, ax, 4        ; destination = var1, source = AX, count = 4

            var1    AX
Before:     9BA6    AC36
After:      BA6A    AC36

Only the destination is modified, not the source


Integer Arithmetic COE 205 – KFUPM slide 16

8
SHRD Instruction
 SHRD is the Shift Right Double instruction
 Syntax: SHRD destination, source, count
 Shifts a destination operand a given count of bits to the right

 The leftmost bits of destination are filled by the rightmost


bits of the source operand
 The source operand is not modified
 Operand types:

SHRD reg/mem16, reg16, imm8/CL


SHRD reg/mem32, reg32, imm8/CL

Integer Arithmetic COE 205 – KFUPM slide 17

SHRD Example
Shift AX 4 bits to the right
Replace the highest 4 bits of AX with the low 4 bits of DX

mov ax,234Bh
mov dx,7654h
shrd ax, dx, 4          ; destination = AX, source = DX, count = 4

            DX      AX
Before:     7654    234B
After:      7654    4234

Only the destination is modified, not the source


Integer Arithmetic COE 205 – KFUPM slide 18

9
Your Turn . . .
Indicate the values (in hex) of each destination operand

mov ax,7C36h
mov dx,9FA6h
shld dx,ax,4 ; DX = FA67h
shrd ax,dx,8 ; AX = 677Ch

Integer Arithmetic COE 205 – KFUPM slide 19

Next . . .
 Shift and Rotate Instructions
 Shift and Rotate Applications
 Multiplication and Division Instructions
 Translating Arithmetic Expressions
 Decimal String to Number Conversions

Integer Arithmetic COE 205 – KFUPM slide 20

10
Shifting Bits within an Array
 Sometimes, we need to shift all bits within an array
 Example: moving a bitmapped image from one screen to another

 Task: shift an array of bytes 1 bit right

.data
ArraySize EQU 100
array BYTE ArraySize DUP(9Bh)
.code
    mov ecx, ArraySize
    mov esi, 0
    clc                     ; clear carry flag
L1:
    rcr array[esi], 1       ; propagate the carry flag
    inc esi                 ; does not modify carry
    loop L1                 ; does not modify carry

               [0] [1] [2] ... [99]
array before:  9B  9B  9B  ...  9B
array after:   4D  CD  CD  ...  CD
Integer Arithmetic COE 205 – KFUPM slide 21

Binary Multiplication
 You know that SHL performs multiplication efficiently
 When the multiplier is a power of 2

 You can factor any binary number into powers of 2


 Example: multiply EAX by 36
 Factor 36 into (4 + 32) and use distributive property of multiplication

 EAX * 36 = EAX * (4 + 32) = EAX * 4 + EAX * 32

mov ebx, eax ; EBX = number


shl eax, 2 ; EAX = number * 4
shl ebx, 5 ; EBX = number * 32
add eax, ebx ; EAX = number * 36

Integer Arithmetic COE 205 – KFUPM slide 22

11
Your Turn . . .
Multiply EAX by 26, using shifting and addition instructions
Hint: 26 = 2 + 8 + 16
mov ebx, eax ; EBX = number
shl eax, 1 ; EAX = number * 2
shl ebx, 3 ; EBX = number * 8
add eax, ebx ; EAX = number * 10
shl ebx, 1 ; EBX = number * 16
add eax, ebx ; EAX = number * 26

Multiply EAX by 31, Hint: 31 = 32 – 1


mov ebx, eax ; EBX = number
shl eax, 5 ; EAX = number * 32
sub eax, ebx ; EAX = number * 31
Integer Arithmetic COE 205 – KFUPM slide 23

Convert Number to Binary String


Task: Convert Number in EAX to an ASCII Binary String
Receives: EAX = Number
ESI = Address of binary string
Returns: String is filled with binary characters '0' and '1'
ConvToBinStr PROC USES ecx esi
    mov ecx,32
L1: rol eax,1                   ; rotate MSB of EAX into the Carry flag
    mov BYTE PTR [esi],'0'      ; if CF = 0, append a '0' character
    jnc L2
    mov BYTE PTR [esi],'1'      ; otherwise, append a '1' character
L2: inc esi
    loop L1                     ; repeat 32 times for all bits of EAX
    mov BYTE PTR [esi], 0       ; append a NULL byte
    ret
ConvToBinStr ENDP
Integer Arithmetic COE 205 – KFUPM slide 24

12
Convert Number to Hex String
Task: Convert EAX to a Hexadecimal String pointed by ESI
Receives: EAX = Number, ESI= Address of hex string
Returns: String pointed by ESI is filled with hex characters '0' to 'F'
ConvToHexStr PROC USES ebx ecx esi
mov ecx, 8 ; 8 iterations, why?
L1: rol eax, 4 ; rotate upper 4 bits
mov ebx, eax
and ebx, 0Fh ; keep only lower 4 bits
mov bl, HexChar[ebx] ; convert to a hex char
mov [esi], bl ; store hex char in string
inc esi
loop L1 ; loop 8 times
mov BYTE PTR [esi], 0 ; append a null byte
ret
HexChar BYTE "0123456789ABCDEF"
ConvToHexStr ENDP
Integer Arithmetic COE 205 – KFUPM slide 25

Isolating a Bit String


 MS-DOS date packs the year, month, & day into 16 bits
 Year is relative to 1980

DX (DH:DL) = 0010011 0011 01010b
Field:         Year  Month  Day
Bit numbers:   9-15   5-8   0-4

In this example: Day = 10, Month = 3, Year = 1980 + 19, so the date is March 10, 1999

Isolate the Month field:


mov ax,dx ; Assume DX = 16-bit MS-DOS date
shr ax,5 ; shift right 5 bits
and al,00001111b ; clear bits 4-7
mov month,al ; save in month variable

Integer Arithmetic COE 205 – KFUPM slide 26

13
Next . . .
 Shift and Rotate Instructions
 Shift and Rotate Applications
 Multiplication and Division Instructions
 Translating Arithmetic Expressions
 Decimal String to Number Conversions

Integer Arithmetic COE 205 – KFUPM slide 27

MUL Instruction
 The MUL instruction is used for unsigned multiplication
 Multiplies 8-, 16-, or 32-bit operand by AL, AX, or EAX
 The instruction formats are:
MUL r/m8 ; AX = AL * r/m8
MUL r/m16 ; DX:AX = AX * r/m16
MUL r/m32 ; EDX:EAX = EAX * r/m32

Integer Arithmetic COE 205 – KFUPM slide 28

14
MUL Examples
Example 1: Multiply 16-bit var1 (2000h) * var2 (100h)

.data
var1 WORD 2000h
var2 WORD 100h
.code
mov ax,var1
mul var2 ; DX:AX = 00200000h, CF = OF = 1
The Carry and Overflow flags are set if the upper half of the product is non-zero

Example 2: Multiply EAX (12345h) * EBX (1000h)

mov eax,12345h
mov ebx,1000h
mul ebx ; EDX:EAX = 0000000012345000h, CF=OF=0

Integer Arithmetic COE 205 – KFUPM slide 29

Your Turn . . .
What will be the hexadecimal values of DX, AX, and the
Carry flag after the following instructions execute?
mov ax, 1234h
mov bx, 100h Solution

mul bx DX = 0012h, AX = 3400h, CF = 1

What will be the hexadecimal values of EDX, EAX, and the


Carry flag after the following instructions execute?

mov eax,00128765h Solution


mov ecx,10000h EDX = 00000012h,
mul ecx EAX = 87650000h, CF = OF = 1
Integer Arithmetic COE 205 – KFUPM slide 30

15
IMUL Instruction
 The IMUL instruction is used for signed multiplication
 Preserves the sign of the product by sign-extending it
 One-Operand formats, as in MUL
IMUL r/m8 ; AX = AL * r/m8
IMUL r/m16 ; DX:AX = AX * r/m16
IMUL r/m32 ; EDX:EAX = EAX * r/m32

 Two-Operand formats:
IMUL r16, r16/m16/imm8/imm16
IMUL r32, r32/m32/imm8/imm32
 Three-Operand formats:
IMUL r16, r16/m16, imm8/imm16
IMUL r32, r32/m32, imm8/imm32
The Carry and Overflow flags are set if the upper half of the product is not a sign extension of the lower half

Integer Arithmetic COE 205 – KFUPM slide 31

IMUL Examples
 Multiply AL = 48 by BL = 4
mov al,48
mov bl,4
imul bl ; AX = 00C0h, CF = OF = 1

OF = 1 because AH is not a sign extension of AL


 Your Turn: What will be DX, AX and OF ?

mov ax,8760h
mov bx,100h
imul bx

DX = FF87h, AX = 6000h, OF = CF = 1
Integer Arithmetic COE 205 – KFUPM slide 32

16
Two and Three Operand Formats
.data
wval SWORD -4
dval SDWORD 4
.code
mov ax, -16
mov bx, 2
imul bx, ax ; BX = BX * AX = -32
imul bx, 2 ; BX = BX * 2 = -64
imul bx, wval ; BX = BX * wval = 256
imul bx, 5000 ; OF = CF = 1
mov edx,-16
imul edx,dval ; EDX = EDX * dval = -64
imul bx, wval,-16 ; BX = wval * -16 = 64
imul ebx,dval,-16 ; EBX = dval * -16 = -64
imul eax,ebx,2000000000 ; OF = CF = 1
Integer Arithmetic COE 205 – KFUPM slide 33

DIV Instruction
 The DIV instruction is used for unsigned division
 A single operand (divisor) is supplied
 Divisor is an 8-bit, 16-bit, or 32-bit register or memory
 Dividend is implicit and is either AX, DX:AX, or EDX:EAX

 The instruction formats are:


DIV r/m8   ; AL = AX / r/m8,          AH = remainder
DIV r/m16  ; AX = DX:AX / r/m16,      DX = remainder
DIV r/m32  ; EAX = EDX:EAX / r/m32,   EDX = remainder

Integer Arithmetic COE 205 – KFUPM slide 34

17
DIV Examples
Divide AX = 8003h by CX = 100h
mov dx,0 ; clear dividend, high
mov ax,8003h ; dividend, low
mov cx,100h ; divisor
div cx ; AX = 0080h, DX = 3 (Remainder)

Your turn: what will be the hexadecimal values of DX


and AX after the following instructions execute?

mov dx,0087h
mov ax,6023h
mov bx,100h
div bx Solution: DX = 0023h, AX = 8760h

Integer Arithmetic COE 205 – KFUPM slide 35

Divide Overflow
 Divide Overflow occurs when …
 Quotient cannot fit into the destination operand, or when
 Dividing by Zero
 Divide Overflow causes a CPU interrupt
 The current program halts and an error dialog box is produced
 Example of a Divide Overflow
mov dx,0087h
mov ax,6002h
mov bx,10h
div bx          ; Divide overflow: quotient = 87600h cannot fit in AX
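One way to avoid the divide-by-zero case is to test the divisor before dividing, as in this sketch (the labels are illustrative; it does not catch a quotient that is merely too large):

    cmp  bx, 0          ; is the divisor zero?
    je   divError       ; yes: skip the division
    div  bx
    jmp  next
divError:
    . . .               ; handle the error (e.g., display a message)
next: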

Integer Arithmetic COE 205 – KFUPM slide 36

18
Signed Integer Division
 Signed integers must be sign-extended before division
 Fill high byte, word, or double-word with a copy of the sign bit

 CBW, CWD, and CDQ instructions


 Provide important sign-extension operations before division
 CBW:Convert Byte to Word, sign-extends AL into AH
 CWD:Convert Word to Double, sign-extends AX into DX
 CDQ: Convert Double to Quad, sign-extends EAX into EDX

 Example:
mov ax, 0FE9Bh ; AX = -357
cwd ; DX:AX = FFFFFE9Bh

Integer Arithmetic COE 205 – KFUPM slide 37

IDIV Instruction
 IDIV performs signed integer division
 Same syntax and operands as DIV instruction
IDIV r/m8
IDIV r/m16
IDIV r/m32

 Example: divide eax (-503) by ebx (10)


mov eax, -503 All status flags
cdq are undefined
mov ebx, 10 after executing
idiv ebx ; EAX = -50, EDX = -3 DIV and IDIV

Integer Arithmetic COE 205 – KFUPM slide 38

19
IDIV Examples
Example: Divide DX:AX (-48) by BX (-5)
mov ax,-48
cwd ; sign-extend AX into DX
mov bx,-5
idiv bx ; AX = 9, DX = -3

Example: Divide EDX:EAX (48) by EBX (-5)

mov eax,48
cdq ; sign-extend EAX into EDX
mov ebx,-5
idiv ebx ; EAX = -9, EDX = 3

Integer Arithmetic COE 205 – KFUPM slide 39

Next . . .
 Shift and Rotate Instructions
 Shift and Rotate Applications
 Multiplication and Division Instructions
 Translating Arithmetic Expressions
 Decimal String to Number Conversions

Integer Arithmetic COE 205 – KFUPM slide 40

20
Translating Arithmetic Expressions
 Some good reasons to translate arithmetic expressions
 Learn how compilers do it
 Test your understanding of MUL, IMUL, DIV, and IDIV
 Check for Carry and Overflow flags

 Two Types of Arithmetic Expressions


 Unsigned arithmetic expressions
 Unsigned variables and values are used only
 Use MUL and DIV for unsigned multiplication and division

 Signed arithmetic expressions


 Signed variables and values
 Use IMUL and IDIV for signed multiplication and division
Integer Arithmetic COE 205 – KFUPM slide 41

Unsigned Arithmetic Expressions


 Example: var4 = (var1 + var2) * var3
 All variables are 32-bit unsigned integers
 Translation:
mov eax, var1
add eax, var2 ; EAX = var1 + var2
jc tooBig ; check for carry
mul var3 ; EAX = EAX * var3
jc tooBig ; check for carry
mov var4, eax ; save result
jmp next
tooBig:
. . . ; display error message
next:
Integer Arithmetic COE 205 – KFUPM slide 42

21
Signed Arithmetic Expressions
Example: var4 = (-var1 * var2) + var3
mov eax, var1
neg eax
imul var2 ; signed multiplication
jo tooBig ; check for overflow
add eax, var3
jo tooBig ; check for overflow
mov var4, eax ; save result

Example: var4 = (var1 * 5) / (var2 – 3)


mov eax, var1
mov ebx, 5
imul ebx ; EDX:EAX = product
mov ebx, var2 ; right side
sub ebx, 3
idiv ebx ; EAX = quotient
mov var4, eax
Integer Arithmetic COE 205 – KFUPM slide 43

Your Turn . . .
Translate: var5 = (var1 * -var2)/(var3 – var4)
Assume signed 32-bit integers

mov eax, var1


mov edx, var2
neg edx
imul edx ; EDX:EAX = product
mov ecx, var3
sub ecx, var4
idiv ecx ; EAX = quotient
mov var5, eax

Integer Arithmetic COE 205 – KFUPM slide 44

22
Next . . .
 Shift and Rotate Instructions
 Shift and Rotate Applications
 Multiplication and Division Instructions
 Translating Arithmetic Expressions
 Decimal String to Number Conversions

Integer Arithmetic COE 205 – KFUPM slide 45

Convert Decimal String to Number


Task: Convert decimal string pointed by ESI to a number
Receives: ESI = address of decimal string
Returns: EAX = number in binary format
Algorithm:
Start by initializing EAX to 0
For each decimal character in string (example: "1083")
Move one decimal character of string into EDX
Convert EDX to digit (0 to 9): EDX = EDX – '0'
Compute: EAX = EAX * 10 + EDX
Repeat until end of string (NULL char)
Integer Arithmetic COE 205 – KFUPM slide 46

23
Convert Decimal String – cont'd
; Assumes: String should contain only decimal chars
; String should not be empty
; Procedure does not detect invalid input
; Procedure does not skip leading spaces

ConvDecStr PROC USES edx esi


mov eax, 0 ; Initialize EAX
L1: imul eax, 10 ; EAX = EAX * 10
movzx edx, BYTE PTR [esi] ; EDX = '0' to '9'
sub edx, '0' ; EDX = 0 to 9
add eax, edx ; EAX = EAX*10 + EDX
inc esi ; point at next char
cmp BYTE PTR [esi],0 ; NULL byte?
jne L1
ret ; return
ConvDecStr ENDP
Integer Arithmetic COE 205 – KFUPM slide 47

Convert Number to Decimal String


Task: Convert Number in EAX to a Decimal String
Receives: EAX = Number, ESI = String Address
Returns: String is filled with decimal characters '0' to '9'
Algorithm: Divide EAX by 10 (Example: EAX = 1083)
mov EBX, 10 ; divisor = EBX = 10
mov EDX, 0 ; dividend = EDX:EAX
div EBX ; EDX (rem) = 3, EAX = 108
add dl, '0' ; DL = '3'
Repeat division until EAX becomes 0
Remainder chars are computed backwards: '3', '8', '0', '1'
Store characters in reverse order in string pointed by ESI
Integer Arithmetic COE 205 – KFUPM slide 48

24
Convert to Decimal String – cont'd
ConvToDecStr PROC
pushad ; save all since most are used
mov ecx, 0 ; Used to count decimal digits
mov ebx, 10 ; divisor = 10
L1: mov edx, 0 ; dividend = EDX:EAX
div ebx ; EDX = remainder = 0 to 9
add dl, '0' ; convert DL to '0' to '9'
push dx ; save decimal character
inc ecx ; and count it
cmp eax, 0
jnz L1 ; loop back if EAX != 0
L2: pop dx ; pop in reverse order
mov [esi], dl ; store decimal char in string
inc esi
loop L2
mov BYTE PTR [esi], 0 ; Terminate with a NULL char
popad ; restore all registers
ret ; return
ConvToDecStr ENDP
Integer Arithmetic COE 205 – KFUPM slide 49

Summary
 Shift and rotate instructions
 Provide finer control over bits than high-level languages
 Can shift and rotate more than one bit left or right
 SHL, SHR, SAR, SHLD, SHRD, ROL, ROR, RCL, RCR
 Shifting left by n bits is a multiplication by 2^n
 Shifting right does integer division (use SAR to preserve sign)

 MUL, IMUL, DIV, and IDIV instructions


 Provide signed and unsigned multiplication and division
 One operand format: one of the operands is always implicit
 Two and three operand formats for IMUL instruction only
 CBW, CDQ, CWD: extend AL, AX, and EAX for signed division
Integer Arithmetic COE 205 – KFUPM slide 50

25
02/03/2019

Chapter 8

MACRO
FUNCTION

Macro

 Definition: a macro is a predefined set of instructions that can easily be inserted wherever needed
 Once defined, a macro can be used as many times as necessary
 A macro must be defined before it is used
 Macros can be used in the text section
 There are 2 types of macros: single-line macros and multi-line macros

1
02/03/2019

Single – line macro

 Single-line macros are defined using the %define


directive.
 Example: %define mulby4(x) shl x, 2
 Use the macro by writing: mulby4(rax)
 Explanation: this multiplies the contents of the rax register by 4 (by shifting left two bits).

Multi-Line Macros
 Multi-line macros can include a varying number of
lines (including one). The multi-line macros are
more useful and the following sections will focus
primarily on multi-line macros.
 Macro Definition : before using
 Syntax :
 %macro <name> <number of arguments>
; [body of macro]
%endmacro
 The arguments can be referenced within the macro by
%<number>, with %1 being the first argument, and
%2 the second argument, and so forth.

2
02/03/2019

 In order to use labels, the labels within the macro


must be prefixing the label name with a %%.
 This will ensure that calling the same macro
multiple times will use a different label each time.
 For example, a macro definition for the absolute
value function would be as follows:
 %macro abs 1
cmp %1, 0
jge %%done
neg %1
%%done:
%endmacro

Using a Macro

 Example : given declaration as follows


 qVar dq 4
 Invoke (call) abs macro (twice)
 mov eax, -3
 abs eax
 abs qword [qVar]
 The list file will display the code as follows (for
the first invocation):

3
02/03/2019

 27 00000000 B8FDFFFFFF mov eax, -3


28 abs eax
29 00000005 3D00000000 <1> cmp %1, 0
30 0000000A 7D02 <1> jge %%done
31 0000000C F7D8 <1> neg %1
32 <1> %%done:
The macro will be copied from the definition into the
code, with the appropriate arguments replaced in the body
of the macro, each time it is used. The <1> indicates
code copied from a macro definition. In both cases, the
%1 argument was replaced with the given argument; eax
in this example.

Macro Example
 ; Example Program to demonstrate a simple macro
 ;****************************************
; Define the macro
 ; called with three arguments:
 ; aver <lst>, <len>, <ave>
%macro aver 3
mov eax, 0
mov ecx, dword [%2] ; length
mov r12, 0
lea rbx, [%1]

4
02/03/2019

%%sumLoop:
add eax, dword [rbx+r12*4] ; get list[n]
inc r12
loop %%sumLoop
cdq
idiv dword [%2]
mov dword [%3], eax
%endmacro

 ;***************************************;
Data declarations
section .data
; -----
; Define constants
EXIT_SUCCESS equ 0 ; success code
SYS_exit equ 60 ; code for terminate
; Define Data.
section .data
list1 dd 4, 5, 2, -3, 1
len1 dd 5
ave1 dd 0

5
02/03/2019

list2 dd 2, 6, 3, -2, 1, 8, 19
len2 dd 7
ave2 dd 0
 ;***************************************section
.text
global _start
_start:
; Use the macro in the program
aver list1, len1, ave1 ; 1st, data set 1
aver list2, len2, ave2
last:
mov rax, SYS_exit ; exit
mov rdi, EXIT_SUCCESS ; success
syscall

Functions
 Functions and procedures (i.e., void functions),
help break-up a program into smaller parts
making it easier to code, debug, and maintain.
 Function calls involve two main actions:
 Linkage : Since the function can be called from
multiple different places in the code, the function must
be able to return to the correct place in which it was
originally called.
 Argument Transmission : The function must be able to
access parameters to operate on or to return
results (i.e., access call-by-reference parameters).

6
02/03/2019

Function Declaration

 A function must be written before it can be used.


Functions are located in the code segment. The
general format is:
 global <procName>
<procName>:
; function body
ret
 A function may be defined only once.
 Functions cannot be nested.
 A function definition must be started and ended before the next function's definition can be started.

Linkage
 The linkage is about getting to and returning from
a function call correctly. There are two instructions
that handle the linkage, call <funcName> and ret
instructions.
 The call transfers control to the named function,
and ret returns control back to the calling routine.
 The call instruction works as follows:
 Push RIP
 Jump to the label
 The ret instruction works as follows:
 Pop RIP
 Jump to that address

7
02/03/2019

 The function calling or linkage instruction is


summarized as follows:
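A minimal sketch of the two linkage instructions working together (myFunc is an illustrative name):

; caller
    call myFunc         ; push rip, then jump to myFunc
    ...                 ; execution resumes here after ret

; callee
myFunc:
    ...                 ; body of the function
    ret                 ; pop the return address back into rip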

Argument Transmission
 Argument transmission refers to sending information
(variables, etc.) to a function and obtaining a result as
appropriate for the specific function.
 Transmitting values to a function is referred to as call-by-value.
 Transmitting addresses to a function is referred to as call-by-reference.
 There are various ways to pass arguments to and/or from a
function
 Placing values in register
 Easiest, but has limitations (i.e., the number of registers).
 Used for first six integer arguments.
 Used for system calls.

8
02/03/2019

 Globally defined variables


 Generally poor practice, potentially confusing, and will
not work in many cases.
 Occasionally useful in limited circumstances.
 Putting values and/or addresses on stack
 No specific limit to count of arguments that can be
passed.
 Incurs higher run-time overhead.
 In general, the calling routine is referred to as the
caller and the routine being called is referred to as
the callee.

Parameter Passing
 As noted, a combination of registers and the stack is used
to pass parameters to and/or from a function.The first six
integer arguments are passed in registers as follows:

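For reference, the standard x86-64 System V register assignment (assumed throughout these examples) is:

; Integer/pointer argument registers, in order:
;   argument 1 -> rdi    argument 4 -> rcx
;   argument 2 -> rsi    argument 5 -> r8
;   argument 3 -> rdx    argument 6 -> r9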
 The seventh and any additional arguments are passed on


the stack.

9
02/03/2019

 when the function is completed, the calling routine is


responsible for clearing the arguments from the stack
 Instead of doing a series of pop instructions, the
stack pointer, rsp, is adjusted as necessary to clear the
arguments off the stack.
 Since each argument is 8 bytes, the adjustment would be
adding [(number of arguments) * 8] to the rsp
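For example, if two extra arguments had been pushed for the call, the caller might clear them as in this sketch (someFunc, arg7, and arg8 are illustrative names):

    push qword [arg8]   ; 8th argument (pushed first)
    push qword [arg7]   ; 7th argument
    call someFunc
    add  rsp, 16        ; caller clears 2 stack arguments * 8 bytes each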
 For value returning functions, the result is placed in the
A register based on the size of the value being returned.
Specifically, the values are returned as follows:

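A short reference for the standard size-to-register mapping (assumed here):

;   byte result        -> al
;   word result        -> ax
;   double-word result -> eax
;   quad-word result   -> rax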
 The rax register may be


used in the function as
needed as long as the
return value is set
appropriately before
returning.

10
02/03/2019

Register Usage
 some registers are expected to be preserved across a
function call. That means that if a value is placed in a
preserved register or saved register and the function must
use that register, the original value must be preserved by
placing it on the stack, altered as needed, and then
restored to its original value before returning to the
calling routine

 The temporary registers (r10 and r11) and the argument registers (rdi, rsi, rdx, rcx, r8, and r9) are not preserved across a function call. This means that any of these registers may be used in the function without the need to preserve the original value.
 None of the floating-point registers are preserved across a function call.
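A sketch of preserving a saved register inside a function (myFunc is an illustrative name):

myFunc:
    push r12            ; r12 is callee-saved, so preserve it
    ...                 ; r12 may now be used freely
    pop  r12            ; restore the original value
    ret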

11
02/03/2019

Call Frame

 The items on the stack as part of a function call


are referred to as a call frame (also referred to as
an activation record or stack frame).
 The possible items in the call frame include:
 Return address (required).
 Preserved registers (if any).
 Passed arguments (if any).
 Stack dynamic local variables (if any).

 For example, assuming a function call has eight (8)


arguments and assuming the function uses rbx, r12, and
r13 registers (and thus must be pushed), the call frame
would be as follows:

12
02/03/2019

Red Zone

 In the Linux standard calling convention, the first 128 bytes beyond the stack pointer, rsp, are reserved (the red zone). For example, extending the previous example, the call frame would be as follows:

Example, Statistical Function 1 (leaf)

 Example will demonstrate calling a simple void


function to find the sum and average of an array
of numbers
 The High-Level Language (HLL) call for C/C++
is as follows:
stats1(arr, len, sum, ave);
 The array, arr, is call-by-reference and the length,
len, is call-by-value. The arguments for sum and
ave are both call-by-reference (since there are no
values as yet)

13
02/03/2019

Caller
 There are 4 arguments, and all arguments are passed in
registers in accordance with the standard calling
convention. The assembly language code in the calling
routine for the call to the stats function would be as
follows:
 ; stats1(arr, len, sum, ave);
 mov rcx, ave ; 4th arg, addr of ave
 mov rdx, sum ; 3rd arg, addr of sum
 mov esi, dword [len] ; 2nd arg, value of len
 mov rdi, arr ; 1st arg, addr of arr
 call stats1

Callee

 The function being called, the callee, must perform the


prologue and epilogue operations (as specified by the
standard calling convention) before and after the code to
perform the function goal
 For this example, the function must perform the
summation of values in the array, compute the integer
average, return the sum and average values

14
02/03/2019

Example, Statistical Function2 (non-leaf)

 This extended example will demonstrate calling a


simple void function to find the minimum, median,
maximum, sum and average of an array of numbers.
 The HighLevel Language (HLL) call for C/C++ is as
follows:
stats2(arr, len, min, med1, med2, max, sum, ave);
 For this example, it is assumed that the array is sorted in
ascending order
 the median will be the middle value. For an even length
list, there are two middle values, med1 and med2, both
of which are returned

15
02/03/2019

Caller

 There are 8 arguments and only the first six can be passed
in registers. The last two arguments are passed on the stack
 The assembly language code in the calling routine for the
call to the stats function would be as follows:

Callee

 The function must perform the summation of values in


the array, find the minimum, medians, and maximum,
compute the average, return all the values.
 When call-by-reference arguments are passed on the
stack, two steps are required to return the value.
 Get the address from the stack.
 Use that address to return the value.

16
02/03/2019

17
02/03/2019

 The call frame for this


function would be as
follows:
 In this example, the preserved registers rbp and then r12 are pushed. When popped, they must be popped in the exact reverse order, r12 and then rbp, in order to correctly restore their original values.

18
02/03/2019

19
02/03/2019

Chapter 9

SYSTEM SERVICES

Introduction

 There are many operations that an application program


must use the operating system to perform
 Such operations include console output, keyboard input,
file services (open,read, write, close, etc.), obtaining the
time or date, requesting memory allocation, and many
others.
 Accessing system services is how the application requests
that the operating system perform some specific operation
(on behalf of the process)
 More specifically, the system call is the interface between
an executing process and the operating system

1
02/03/2019

Calling System Services

 A system service call is logically similar to calling a


function, where the function code is located within the
operating system
 The function may require privileges to operate which is
why control must be transferred to the operating system.
 When calling system services, arguments are placed in
the standard argument registers.
 System services do not typically use stack-based arguments. This limits the arguments of a system service to six (6).
 To call a system service, the first step is to determine
which system service is desired

 The general process is that the system service call code is


placed in the rax register.
 The call code is a number that has been assigned for the
specific system service being requested.
 These are assigned as part of the operating system and
cannot be changed by application programs
 If any are needed, the arguments for system services are
placed in the rdi, rsi, rdx, r10, r8, and r9 registers (in that
order).
 The following table shows the argument locations
which are consistent with the standard calling convention.

2
02/03/2019

 Each system call will use a different number of arguments (from none up to 6). However, the system service call code is always required.
 After the call code and any arguments are set, the syscall instruction is executed.

System Calls

 Each x86-64 system call has a unique ID number


 Examples:

3
02/03/2019

Newline Character
 In the context of output, a newline means move the
cursor to the start of the next line.
 In many languages, including C, it is often written as “\n” as part of a string
 Nothing is displayed for the newline, but the cursor is
moved to the start of the next line.
 In Unix/Linux systems, the linefeed, abbreviated LF with
an ASCII value of 10 (or 0x0A), is used as the newline
character.
 In Windows systems, the newline is carriage
return, abbreviated as CR with an ASCII value 13 (or
0x0D) followed by the LF.
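For example, a NASM data declaration that embeds the newline might look like this (the name and text are illustrative):

message db "Sample output.", 10, 0      ; text, LF (10), NULL terminator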

Console Output
 The system service to output characters to the console is
the system write (SYS_write).
 Like in a high-level language, characters are written to standard output (STDOUT), which is the console
 STDOUT is the default file descriptor for the console
 The arguments for the write system service are as follows:


 Assuming the declarations shown in the sketch below, to output “Hello World” (it’s traditional) to the console, the system write (SYS_write) would be used. The code would be as follows:
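 The slide's code is not included in this copy; the following is a minimal sketch, assuming NASM/YASM syntax on 64-bit Linux, with illustrative names (message, msgLen):

    ; illustrative declarations (assumed, not from the original slide)
    SYS_write  equ  1                    ; write system service call code
    STDOUT     equ  1                    ; console output file descriptor

    section .data
    message    db   "Hello World", 10    ; string followed by a newline (LF)
    msgLen     equ  $ - message          ; number of characters in the string

    ; code fragment to display the string on the console
    section .text
        mov  rax, SYS_write              ; system service call code
        mov  rdi, STDOUT                 ; file descriptor (console)
        mov  rsi, message                ; address of the characters
        mov  rdx, msgLen                 ; number of characters to write
        syscall                          ; request the write operation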

Example, Console Output

 This example is a complete program to output some strings to the console.
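 The original program is not included in this copy; the sketch below is a complete program under the same assumptions (NASM/YASM syntax, 64-bit Linux), with illustrative strings:

    section .data
    SYS_write    equ 1
    SYS_exit     equ 60
    STDOUT       equ 1
    EXIT_SUCCESS equ 0

    msg1         db  "Hello World.", 10
    msg1Len      equ $ - msg1
    msg2         db  "This is the console output example.", 10
    msg2Len      equ $ - msg2

    section .text
    global _start
    _start:
        ; write the first string
        mov  rax, SYS_write
        mov  rdi, STDOUT
        mov  rsi, msg1
        mov  rdx, msg1Len
        syscall

        ; write the second string
        mov  rax, SYS_write
        mov  rdi, STDOUT
        mov  rsi, msg2
        mov  rdx, msg2Len
        syscall

        ; terminate the program
        mov  rax, SYS_exit
        mov  rdi, EXIT_SUCCESS
        syscall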


Console Input
 The system service to read characters from the console is the system read (SYS_read).
 As in a high-level language, console characters are read from standard input (STDIN).
 We will need to declare an appropriate amount of space to store the characters being read.
 If we request 10 characters to read and the user types more than 10, the additional characters will be lost.
 If the user types fewer than 10 characters, for example 5 characters, all five characters will be read plus the newline (LF), for a total of six characters.

 The arguments for the read system service are as follows, assuming the declarations shown in the sketch below:
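 The original table and declarations are not reproduced here; for the 64-bit Linux read call the layout, with illustrative declarations, is:

    ;   rax  =  call code for SYS_read (0)
    ;   rdi  =  input location, a file descriptor (STDIN for the keyboard)
    ;   rsi  =  starting address of the buffer to place the characters in
    ;   rdx  =  maximum number of characters to read

    ; illustrative declarations (assumed names)
    SYS_read   equ  0
    STDIN      equ  0

    section .bss
    inBuffer   resb 10            ; space for up to 10 characters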


Example

 To read a single character from the keyboard, the system read (SYS_read) would be used. The code would be as follows:
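 A minimal sketch (assumed NASM syntax, using the SYS_read and STDIN definitions from the previous sketch):

    section .bss
    inChar     resb 1             ; space for the single character

    section .text
        mov  rax, SYS_read        ; call code for read (0)
        mov  rdi, STDIN           ; file descriptor, standard input (0)
        mov  rsi, inChar          ; address to place the character
        mov  rdx, 1               ; read at most one character
        syscall                   ; rax = characters actually read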

Example, Console Input


 Read a line of 50 characters from the keyboard, and then echo the input back to the console to verify that the input was read correctly.
 Since space for the newline (LF) along with a final NULL termination is included, an input array allowing 52 bytes would be required.
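 The original program is not included in this copy; the sketch below follows the description above (NASM/YASM syntax, 64-bit Linux, illustrative names; error checking omitted for brevity):

    section .data
    SYS_read     equ 0
    SYS_write    equ 1
    SYS_exit     equ 60
    STDIN        equ 0
    STDOUT       equ 1
    EXIT_SUCCESS equ 0
    BUFF_SIZE    equ 51           ; 50 characters plus the newline (LF)

    section .bss
    inLine       resb 52          ; 50 characters + LF + NULL

    section .text
    global _start
    _start:
        ; read the line (the user may type fewer than 50 characters)
        mov  rax, SYS_read
        mov  rdi, STDIN
        mov  rsi, inLine
        mov  rdx, BUFF_SIZE
        syscall                   ; rax = characters actually read

        ; NULL-terminate what was read (stays within the 52-byte array)
        mov  byte [inLine+rax], 0

        ; echo back exactly the characters that were read
        mov  rdx, rax             ; count returned by SYS_read
        mov  rax, SYS_write
        mov  rdi, STDOUT
        mov  rsi, inLine
        syscall

        mov  rax, SYS_exit
        mov  rdi, EXIT_SUCCESS
        syscall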


File Open Operations

 In order to perform file operations such as read and write, the file must first be opened.
 There are two file open operations, open and open/create.
 After the file is opened, in order to perform file read or write operations, the operating system needs detailed information about the file, including the complete status and current read/write location.
 If the file open operation fails, an error code will be returned.
 If the file open operation succeeds, a file descriptor is returned.

 The complete set of information about an open file is stored in an operating system data structure named the File Control Block (FCB).
 File Open
 The file open requires that the file exist in order to be opened. If the file does not exist, it is an error.
 The file open operation also requires a flag parameter to specify the access mode. The access mode must include one of the following:
 Read-Only Access → O_RDONLY
 Write-Only Access → O_WRONLY
 Read/Write Access → O_RDWR


 The arguments for the file open system service, and some sample declarations, are shown in the sketch below.
 After the system call, the rax register will contain the return value.
 If the file open operation fails, rax will contain a negative value (i.e., < 0).
 If the file open operation succeeds, rax contains the file descriptor.
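 A minimal sketch of opening an existing file read-only (assumed NASM syntax; call code 2 and the flag value are 64-bit Linux conventions; the names fileName, fileDesc, and openError are illustrative):

    SYS_open   equ  2             ; file open call code
    O_RDONLY   equ  0             ; read-only access flag

    section .data
    fileName   db   "input.txt", 0    ; NULL-terminated file name

    section .bss
    fileDesc   resq 1             ; saved file descriptor

    section .text
        mov  rax, SYS_open        ; call code for open
        mov  rdi, fileName        ; address of the file name string
        mov  rsi, O_RDONLY        ; access mode flag
        syscall
        cmp  rax, 0               ; negative value means the open failed
        jl   openError            ; illustrative error-handling label
        mov  qword [fileDesc], rax    ; success: save the file descriptor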
 File Open/Create
 A file open/create operation will create a file.
 If the file does not exist, a new file will be created.
 If the file already exists, it will be erased and a new file created.
 Since the file is being created, the access mode must include the file permissions that will be set when the file is created.


 The arguments for the file open/create system service, and some sample declarations, are shown in the sketch below.
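 A minimal sketch of creating (or truncating) a file (assumed NASM syntax; call code 85 and the permission value are 64-bit Linux conventions; the names are illustrative):

    SYS_creat  equ  85            ; file open/create call code
    NEW_PERMS  equ  420           ; 420 decimal = 0644 octal (rw-r--r--)

    section .data
    newFile    db   "output.txt", 0   ; NULL-terminated file name

    section .bss
    fileDesc   resq 1

    section .text
        mov  rax, SYS_creat       ; call code for open/create
        mov  rdi, newFile         ; address of the file name string
        mov  rsi, NEW_PERMS       ; permissions for the newly created file
        syscall
        cmp  rax, 0               ; negative value means the create failed
        jl   createError          ; illustrative error-handling label
        mov  qword [fileDesc], rax    ; save the file descriptor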

 File Read
 A file must be opened with the appropriate file access flags before it can be read.
 The arguments for the file read system service, and some sample declarations, are shown in the sketch after this list.

 If the file read operation does not succeed, a negative value is returned in the rax register. If the file read operation succeeds, the number of characters actually read is returned.
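 A minimal sketch of the read call (assumed names; SYS_read is 0, and fileDesc holds the descriptor saved by the open sketch above):

    BUFF_SIZE  equ  256

    section .bss
    readBuffer resb BUFF_SIZE     ; space for the characters read

    section .text
        mov  rax, SYS_read        ; call code for read (0)
        mov  rdi, qword [fileDesc]    ; file descriptor from the open
        mov  rsi, readBuffer      ; address of the buffer to fill
        mov  rdx, BUFF_SIZE       ; maximum number of characters to read
        syscall                   ; rax = characters read, or < 0 on error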


 File Write
 The arguments for the file write system service, and some sample declarations, are shown in the sketch after this list.

 If the file write operation does not succeed, a negative value is returned in the rax register. If the file write operation does succeed, the number of characters actually written is returned.
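 A minimal sketch of the write call (assumed names; SYS_write is 1, and fileDesc holds the descriptor from an earlier open or create):

    section .data
    outMsg     db  "sample text", 10
    outMsgLen  equ $ - outMsg

    section .text
        mov  rax, SYS_write       ; call code for write (1)
        mov  rdi, qword [fileDesc]    ; file descriptor of the open file
        mov  rsi, outMsg          ; address of the characters to write
        mov  rdx, outMsgLen       ; number of characters to write
        syscall                   ; rax = characters written, or < 0 on error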

Example, File Write

 This program writes a short message to a file.
 The file created contains a simple message, a URL in this example.
 The file name and message to be written to the file are hard-coded.
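 The original program is not included in this copy; the sketch below follows the description (NASM/YASM syntax, 64-bit Linux call codes). The file name and URL are placeholders, since the actual values from the slides are not shown:

    section .data
    SYS_creat    equ 85
    SYS_write    equ 1
    SYS_close    equ 3
    SYS_exit     equ 60
    EXIT_SUCCESS equ 0
    NEW_PERMS    equ 420              ; 0644 octal: rw-r--r--

    fileName     db  "url.txt", 0     ; hard-coded file name (placeholder)
    message      db  "http://www.example.com", 10
    messageLen   equ $ - message

    section .bss
    fileDesc     resq 1

    section .text
    global _start
    _start:
        ; create the file
        mov  rax, SYS_creat
        mov  rdi, fileName
        mov  rsi, NEW_PERMS
        syscall
        cmp  rax, 0
        jl   done                     ; give up on error (sketch only)
        mov  qword [fileDesc], rax

        ; write the message to the file
        mov  rax, SYS_write
        mov  rdi, qword [fileDesc]
        mov  rsi, message
        mov  rdx, messageLen
        syscall

        ; close the file
        mov  rax, SYS_close
        mov  rdi, qword [fileDesc]
        syscall

    done:
        mov  rax, SYS_exit
        mov  rdi, EXIT_SUCCESS
        syscall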


Example, File Read
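 The code for this example is not included in this copy; a minimal sketch under the same assumptions would open the file read-only, read a block of characters, echo them to the console, and close the file (names reuse the earlier sketches; error handling is simplified):

        mov  rax, SYS_open        ; open the file read-only
        mov  rdi, fileName
        mov  rsi, O_RDONLY
        syscall
        cmp  rax, 0
        jl   readDone             ; give up on error (sketch only)
        mov  qword [fileDesc], rax

        mov  rax, SYS_read        ; read up to BUFF_SIZE characters
        mov  rdi, qword [fileDesc]
        mov  rsi, readBuffer
        mov  rdx, BUFF_SIZE
        syscall
        cmp  rax, 0
        jle  readDone             ; nothing read, or an error occurred
        mov  rdx, rax             ; characters actually read

        mov  rax, SYS_write       ; echo the characters to the console
        mov  rdi, STDOUT
        mov  rsi, readBuffer
        syscall

        mov  rax, SYS_close       ; close the file
        mov  rdi, qword [fileDesc]
        syscall
    readDone: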
