Pranabananda Chakraborty, Operating Systems: Evolutionary Concepts and Modern Design Principles, CRC Press (2024)
This text demystifies the subject of operating systems by using a simple step-by-step approach,
from fundamentals to modern concepts of traditional uniprocessor operating systems, in addition to
advanced operating systems on various multiple-processor platforms and also real-time operating
systems (RTOSs). While giving insight into the generic operating systems of today, its primary
objective is to integrate concepts, techniques, and case studies into cohesive chapters that provide
a reasonable balance between theoretical design issues and practical implementation details. It
addresses most of the issues that need to be resolved in the design and development of continuously
evolving, rich, diversified modern operating systems and describes successful implementation
approaches in the form of abstract models and algorithms. This book is primarily intended for use
in undergraduate courses in any discipline and also for a substantial portion of postgraduate courses
that include the subject of operating systems. It can also be used for self-study.
and negotiate the grand challenges of the future. To acquaint readers with the design principles and
implementation issues of every relevant aspect, each topic, after a description of the theory, is
mapped to various types of contemporary real-world design methodologies wherever applicable,
using directions that are exploited in the actual development of almost all widely used representa-
tive modern operating systems, including Windows, UNIX, Linux, and Solaris. Placing real-life
implementations after a theoretical description of each respective relevant topic is thought to be a
much better method for understanding and is thus used throughout the text instead of compiling
them separately in a single additional chapter or appendix.
Part One: The first three chapters deal with basic principles and fundamental issues, detail-
ing the objectives and functions of an operating system in general.
Part Two: The next four chapters (Chapters 4, 5, 6, and 7) together describe all the operat-
ing system modules individually and cover almost all the fundamental OS principles and
techniques that are used by generic operating systems of today with additional topic-wise
emphasis on how they are actually realized in representative modern widely used com-
mercial operating systems.
Part Three: This part includes only Chapter 8, which is entirely dedicated to briefly intro-
ducing one of the most challenging areas linked to today’s operating systems, security and
protection.
• Chapter 8 gives an overview of the objectives of security and protection needed for an
OS to negotiate different types of security threats. Different types of active and passive
security attacks and how to counter them are explained, along with the required security
policies, mechanisms, and different proven methods of prevention. A spectrum of approaches to pro-
vide appropriate protections to the system is shown. Different types of malicious programs
(malware), worms, and viruses and various schemes and mechanisms to restrict them from
entering the system are described. The actual implementation of security and protection
carried out in UNIX, Linux, and Windows in real-life situations is explained in brief.
Part Four: This part consists of only Chapter 9 and presents the introductory concepts and
fundamental issues of advanced operating systems, describing the different topics involved
in these systems.
• Chapter 9 briefly explains advanced operating systems, including the major issues of
multiprocessor operating systems, multicomputer operating systems, and distributed
operating systems, and also highlights the differences between them as well as from a
conventional uniprocessor operating system. Since the computing resources and con-
trols of these OSs are distributed and geographically separated, it gives rise to many
fundamental issues concerning the efficiency, consistency, reliability, and security of
the computations as well as of the OS itself. This chapter addresses many of the most
common issues and puts less emphasis on their actual implementations, mostly due to
page-count constraints. A brief discussion of design issues of distributed shared mem-
ory (DSM) and different aspects in the design of distributed file systems (DFSs) is
given, with examples of real-life implementation of the Windows distributed file system,
SUN NFS, and Linux GPFS. The cluster architecture, a modern approach in distributed
computing system design to form a base substitute for distributed systems, is explained
here, along with its advantages, classifications, and different methods of clustering.
Part Five: Chapter 10 simply describes in brief an increasingly important emerging area, the
real-time operating system (RTOS).
• Chapter 10 attempts to explain the RTOS, indicating why it is said to be a very differ-
ent kind of system that belongs to a special class, distinct from all the other traditional
operating systems running either on uniprocessor or multiple-processor computing sys-
tems. The description here mainly provides fundamental topics, useful concepts, and
major issues, including kernel structure, signals in the form of software interrupts, the
role of clocks and timers, scheduling mechanisms, implementation of required synchro-
nization and communication, memory management, and many other similar distinct
aspects. Finally, real-life implementations of RTOS are shown, with a brief description
of the Linux real-time extension, KURT system, RT Linux system, Linux OS, pSOSys-
tem, and VxWorks used in Mars Pathfinder.
Despite all-out sincere efforts, many topics still remain untouched, and some are not adequately described.
This is mostly due to their being outside the scope of this book or limitations on the text’s length.
THE PREREQUISITES
The design of this book begins with an introduction to operating systems and gradually steps
towards the state of the art of implementation issues and ultimate design of versatile modern oper-
ating systems, in addition to the concepts and important criteria of different types of advanced
operating systems of today. So, to start with, this book does not assume any specific background.
However, today’s students know many things related to computers from many different sources but
often severely suffer from a lack of fundamental concepts or, more precisely, wrong concepts. Thus,
it is suggested to at least go through the basic, familiar material, which is presented here over the
first three chapters in a concept-based manner. Also, it is assumed that students and readers have
some knowledge of logic design, algorithmic approaches, the fundamentals of computer hardware,
and a little bit of C-like programming. It is hoped that users of this book will, after all, find it useful
even if they do not have all the prerequisites.
Author Bio
Pranabananda Chakraborty has strong, diversified experience in the information technology industry
over the last 45 years, covering system analysis, design, and implementation of system software
(such as operating systems and compiler design) for various types of large mainframe computing
systems with giant multinationals, as well as re-engineering and project monitoring and management,
in areas including banking, insurance, and state-based public examination processing systems; production
planning and survey; demographic census (Government of India); different areas in postal systems,
Ministry of Posts, Government of India; staff selection systems, Government of India; and many
other real-time projects in India and abroad.
As an academician, for the last 45 years, he has also been affiliated with several prominent institutes,
including reputed engineering colleges and universities. Recently, he was a senior visiting professor
at the Government Engineering Colleges, Kolkata, West Bengal, India, and also guest faculty
at the Birla Institute of Technology and Science (BITS), Pilani, India, on a regular basis. During this
period, he also conducted corporate and institutional training on various academic subjects of core
computer science and IT disciplines for large, reputed multinationals, including IBM, and R&D
organizations using contemporary large systems, as well as seminars and management development
programs in India and abroad sponsored by different corporate bodies in the information technology-
based industry.
Although he has extensive research experience in theoretical computer science and software
development, his work is mainly focused on operating systems and real-time operating systems. He
has also authored a textbook on computer organization and architecture published by CRC Press,
USA.
1 Computers and Software
Learning Objectives
• To define different types of generic system software and their relative hierarchical position
on a common computer hardware platform.
• To illustrate the evolution of operating systems with their basic functions and role in com-
puter operation.
• To describe the different generations of computers and the salient features of the corre-
sponding generations of operating systems up to modern operating systems.
• To provide an overview of networked operating systems running on computer networks.
• To provide a general idea of distributed operating systems running on multiple-processor
machines (multiprocessors and multicomputer systems) separately.
• To explain the cluster architecture of computers, its classification, different methods of
clustering, and the role of the operating system in distributed computing.
• To give an overview of real-time operating systems (RTOSs), with a few of their distinct
features and characteristics.
• To show the genesis of modern operating systems and their grand challenges.
1.1 INTRODUCTION
A brief history of the evolution of operating systems is not only interesting but also instructive, because
it reveals how the concept of operating systems has evolved and subsequently provides a com-
prehensive overview of operating system principles, design issues, the different forms of their
structures, and their functions and activities. It is also observed how the different generations of
operating systems, starting from a bare primitive form to today’s most modern systems, gradually
progressed over a period of the last six-odd decades in order to manage constantly emerging more
intelligent and sophisticated computer hardware. A different form of operating system, known
as a real-time operating system (RTOS), has also been introduced; it evolved to meet certain
specific demands of different kinds. Most of the concepts mentioned briefly in this chapter are,
however, thoroughly explained in later relevant chapters. This chapter finally concludes by
describing the stage-wise, generation-wise development of operating systems, from their birth in
primitive form to the ultimate design and development of the most advanced operating systems:
in other words, the genesis of modern operating systems.
system can be used. The joint effort of the software and hardware of a computer system provides a
tool that precisely solves numerous problems, performing logical decisions and various mathemati-
cal calculations with astonishing speed.
Software is differentiated according to its purpose and broadly classified into two types, applica-
tion software and system software. In the early days, there was only one class of software, applica-
tion software, which was designed and developed by the user, writing lines of code to solve his or
her specific problem and also several additional instructions that were required to keep track of and
control the associated machine operations.
In order to release the programmer from this tedious task of writing frequently needed com-
mon codes to drive the machine to implement the application software every time, a set of codes
in the form of a program could be developed and tested and stored permanently on a storage
medium for common use by all users. Any application program could then exploit the service
of these common programs by issuing an appropriate call whenever required. These common
programs intended for all users of the computer hardware were developed to drive, control, and
monitor the operations of computing system resources as and when they were required. These
programs together are historically called system software, and it essentially hides the details of
how the hardware operates, thereby making computer hardware relatively easy and better adapted
to the needs of the users. It provides a general programming environment to programmers for the
mechanics of preparing their specific applications using the underlying hardware appropriately.
This environment, in turn, often provides new functions that are not available at the hardware
level and offers facilities for tasks related to the creation and effective use of application software.
Common system software, in particular, is very general and covers a broad spectrum of func-
tionalities. It mainly comprises three major subsystems: (i) language translators (compilers,
interpreters, assemblers) and runtime systems (linkers, loaders, etc.) for a programming lan-
guage, (ii) utility systems, and (iii) operating systems (OSs). Numerous input/output (I/O)
devices also require device-dependent programs that control and monitor their smooth operation
during an I/O operation. These programs are essentially system software known as device driv-
ers, sometimes called input–output control systems (IOCSs). All these programs are mostly
written in low-level languages, such as Assembly language, binary language, and so on, which are
very close to the machine’s (hardware’s) own language or have patterns so that machine resources
can be directly accessed from the user level. Nowadays, they are also often developed with the
high-level language C.
Some system software, namely graphics libraries, artificial intelligence, image processing, expert
systems, and so on, are specific to a particular application area and are not very common in oth-
ers. The OS, compiler, assembler, loader, and to some extent the utilities (DBMS) are required to
commit physical hardware (machine) resources to bind with the application program for execution.
The optimal design of this software, based on the architecture and organization of the underlying
hardware, its offered facilities, and finally its effectiveness, ultimately determines the efficiency of
the hardware and the programmability of the computer system as a whole. Figure 1.1 is a conceptual
representation of an overall computing environment when viewed from the user’s end with respect
to the relative placement of hardware and the different types of software, as already mentioned,
including the operating system.
Modern computers use many such system programs as an integral part of them. They are often
designed and developed by the manufacturer of the hardware and are supplied along with the
machine as inseparable components. The use of third-party–developed system software against
additional cost is already a common practice today.
For more details about system software, see the Support Material at www.routledge.com/
9781032467238.
FIGURE 1.1 The level-wise position of operating system, system software and hardware organization in a
generic computer system.
also be used to control other devices. That is why, for reasons of economy, the common hardware
(i.e. the common control circuitry) is sometimes separated out into a device called a control unit.
During World War II, an electronic computer, COLOSSUS, was built using 1,500 electronic valves and a fast
photo-electric tape reader for input, mainly to crack secret German codes. However, no machine
introduced over this period had an operating system, even in concept.
For more details about the evolution of operating system software, see the Support Material at www.
routledge.com/9781032467238.
different types of operating systems used in mini, supermini, and mainframe large systems, even
up to today’s small microcomputers.
Second-generation computers with single-user batch operating systems in the early 1960s were
mostly used for both commercial processing using COBOL or some form of dedicated language and
scientific/engineering computations using FORTRAN and Assembly languages. Typical operating sys-
tems were developed by different computer manufacturers exclusively for their own systems. Since
batch processing of bulk information does not require any human interaction while it is executed,
all modern operating systems usually incorporate facilities to support batch-style processing.
For more details about second-generation OSes, see the Support Material at: www.routledge.
com/9781032467238.
system) because not one but many jobs will suffer if it is damaged. The maximum number of
programs allowed in main memory at any instant, actively competing for resources, is called the
degree of multiprogramming. Intuitively, the higher the degree, of course, up to a certain extent,
the higher the resource utilization. A multiprogrammed operating system supervises and monitors
the state of all active programs and system resources, provides resource isolation and sharing with
proper scheduling, manages memory with protection, and deals with several other issues related to
directly supporting multiple simultaneously active users in the machine. As a result, this operating
system became comparatively complex and fairly sophisticated. However, this OS was quite able to
handle both sophisticated scientifc applications and massive volumes of commercial data process-
ing, which is considered today the central theme of all modern operating systems.
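The claim that a higher degree of multiprogramming yields higher resource utilization can be made concrete with a commonly cited probabilistic approximation (an illustrative model, not taken from this text): if each resident program spends a fraction p of its time waiting for I/O, and the waits are assumed independent, the CPU is idle only when all n programs wait at once, giving a utilization of roughly 1 - p^n. The short C sketch below, with an assumed I/O-wait fraction of 0.8, shows how utilization rises, with diminishing returns, as the degree of multiprogramming grows.

    #include <stdio.h>

    /* Illustrative model only (a standard textbook approximation): with n
       resident programs each waiting for I/O a fraction p of the time, and
       the waits assumed independent, CPU utilization is about 1 - p^n.     */
    int main(void)
    {
        double p = 0.8;     /* assumed I/O-wait fraction of each program      */
        double idle = 1.0;  /* probability that all n programs wait at once   */
        for (int n = 1; n <= 6; n++) {
            idle *= p;                                   /* idle = p^n        */
            printf("degree of multiprogramming %d: CPU utilization ~ %2.0f%%\n",
                   n, (1.0 - idle) * 100.0);
        }
        return 0;
    }

With p = 0.8, utilization grows from about 20 percent with one resident program to roughly 74 percent with six, which is why raising the degree of multiprogramming pays off only up to a certain extent.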
However, the major drawback of this system was that there was no provision for the user to
interact with the computer system during runtime to study the behavior of the executing program in
order to avoid unpleasant situations if they arose. Consequently, this caused serious inconvenience
for professional users, particularly when technological development began to rapidly progress and
numerous software designs and development processes started to emerge in more and more new
areas as major activities in computer usage. Moreover, for some types of jobs, such as transaction
processing, user interaction with the system during runtime is mandatory.
Interactive multiprogramming: To resolve all these limitations, time-shared operating sys-
tems were further enhanced to empower them to facilitate interactive support to the user, known as
interactive multiprogramming. By this arrangement, the user could now interact with the system
during runtime from an interactive terminal to execute specific types of jobs, such as online trans-
action processing (OLTP). The principal user-oriented I/O devices now changed from cards or tape
to the interactive terminal (keyboard and display device CRT), and data were then fed from any
input devices, especially from the terminal according to the status and demands of the executing
program. This summarily made computer users more productive, since they could directly interact
with the computer system on an as-needed basis during runtime. With continuous technological
development, the tangible benefts of interactive computing facilities, however, have been extended,
and are often extracted today by the use of dedicated microcomputers used as terminals attached to
large main systems. Today, the term multiprogramming simply implies multiprogramming together
with its other aspects.
Interactive computing resulted in a revolution in the way computers were used. Instead of being
treated as number crunchers, systems became information manipulators. Interactive text editors
allowed users to construct files representing programs, documents, or data online. Instead of speak-
ing of a job composed of steps, interactive multiprogramming (also called “timesharing”) deals
with sessions that continue from initial connection (begin/logon) to the point at which that connec-
tion is broken (end/logoff).
Machines from IBM and from other giant manufacturers, such as NCR, Burroughs, DEC, and
UNIVAC, with almost compatible architectures implemented all these ideas in their designed
operating systems. Of all of them, the most notable came from IBM with a series of models in the
System/360 family running under operating systems like DOS/360, OS/360, etc., and also many
others.
Multiuser systems: This form of multiprogramming operating systems in single-processor
hardware with some sort of attached computer terminal enables multiple users to interact with the
system during runtime in a centralized fashion instead of traditional advanced batch processing.
Here, the main objective is that the system be responsive to the needs of each user and yet, for cost
reasons, able to support many users simultaneously. Precisely, this type of operating system essen-
tially provides facilities for the allocation and maintenance of each individual user environment,
requires user identification to authenticate for security and protection, preserves system integrity
with good performance, and offers accounting of per-user resource usage. With the advent of the
tremendous power of today’s 32- or even 64-bit microprocessors, which rival yesterday’s main-
frames and minicomputers in speed, memory capacity, and hardware sophistication, this multiuser
approach with modern hardware facilities gradually became more versatile, opening new dimen-
sions in application areas providing a graphical user interface (GUI).
Multiaccess systems: A multi-access operating system allows simultaneous access to a single pro-
gram by the users (not multi-users) of a computer system with the help of two or more terminals. In
general, multi-access operation is limited mainly to one or in some cases a few applications at best and
does not necessarily imply multiprogramming. This approach, however, opened an important line of
development in the area of online transaction processing (OLTP), such as railway reservation, banking
systems, etc. Under a dedicated transaction-processing system, users enter queries or updates against a
database through hundreds of active terminals supported under the control of a single program. Thus,
the key difference between this transaction processing system and the multiprogramming system is
that the former is restricted mainly to one application, whereas users of a multiprogramming system
can be involved in different types of activities like program development, job execution, and the use of
numerous other applications. Of course, the system response time is of paramount interest in both cases.
For more details about third-generation OSes, see the Support Material at: www.routledge.
com/9781032467238.
job at the time of its submission, the time-sharing approach usually needs short commands to be
entered at the terminal. However, purely batch-style multiprogramming with no interactive comput-
ing can also be implemented with a time-sharing option.
Time-sharing systems demand sophisticated processor scheduling to fulfill certain requirements
of the environment ensuring system balance. Numerous approaches in this regard have been thus
developed to meet certain goals, which are discussed in the following chapters. Memory manage-
ment should ensure perfect isolation and protection of co-resident programs. Of course, some form
of controlled sharing is offered in order to conserve memory space and possibly to exchange data
between active programs. Generally, programs from different users running under time-sharing
systems do not usually have much need to communicate with each other. Device management in
time-sharing systems should be adequately equipped to handle multiple active users while nego-
tiating their simultaneous requests to access numerous devices. Allocation of devices and later
their deallocation must be done keeping in view safeguarding the users’ interest, preserving system
integrity, and at the same time optimizing device utilization. File management in a time-sharing
system should be able to resolve conflicting attempts made by different active users over a shared
file and should ensure full protection and access control at the time of concurrent accesses. In many
cases, it is desirable for a user to create files that are not to be modified by other users, or even
sometimes not read by other users. Protection and security became major issues in the early days of
timesharing, even though these issues were equally applied to batch-style multiprogramming. These
aspects are still crucial in today’s systems in multiuser environments and are continually growing
in importance due to the ubiquity of computers. All these issues will be discussed in detail later in
respective chapters.
1.6.3.3 Multics
As a successor to CTSS, Bell Labs and General Electric (GE, a major computer manufacturer in those
days) developed Multics (MULTiplexed Information and Computing Service) with many sensa-
tional, innovative ideas that could support hundreds of timesharing users concurrently. It was not
just years but decades ahead of its time. Even up to the mid-1980s, almost 20 years after it became
operational, Multics continued and was able to hold its market share in the midst of steep compe-
tition with other emerging advanced operating systems enriched with many novel ideas in their
design. The design of Multics was realized by organizing it as a series of concentric rings instead of
layers (to be discussed in Chapter 3). The inner ring was more privileged than the outer ones, and
any attempt to call a procedure in an inner ring required the equivalent of a system call, called a trap
instruction, with its own valid parameters.
Multics had superior security features and greater sophistication in the user interface and also
in other areas than all contemporary comparable mainframe operating systems. But it was gigantic
and slow, and it also required complicated and difficult mechanisms to implement. Moreover, it was
written in PL/1, and the PL/1 compiler was years late and hardly worked at all when it was finally
released.
As a result, Bell Labs withdrew from the project, and General Electric totally closed down
its computer activities. Yet Multics ran well enough at a few dozen sites, including MIT. Later,
Multics was transferred to Honeywell and went on to modest commercial success. Had Honeywell
not had two other mainframe operating systems, one of which was marketed very aggressively,
Multics might have had greater success. Nevertheless, Multics remained a Honeywell product with
a small but trusted customer base until Honeywell got out of the computer business in the late
1980s. However, Multics ultimately fizzled out, leaving behind enormous influence and an immense
impact on the design and development of the following modern systems, especially in the areas of
virtual memory, protection and security that were implemented on a number of operating systems
developed in the 1970s, including the operating systems developed to drive commercially popular
minicomputers.
For more details about Multics, see the Support Material at www.routledge.com/9781032467238.
1.6.3.4 Mainframes
By this time, the steady development of IC technology ultimately led to the emergence of large-
scale integration (LSI) circuits containing thousands of transistors on a square centimeter of sili-
con. Consequently, new machines with compatible architectures belonging to the third generation
with these powerful components were launched by different manufacturers. Of these, a notable
one, the S/360 series (System/360 series, 3 stands for third generation and 60 for the 1960s) came
from IBM, the industry’s first planned computers with a family concept, comprising different soft-
ware-compatible machines that offered time-shared multiprogramming with telecommunication
facilities using a video display unit (VDU) for interactive use. To drive this potential hardware,
appropriate advanced operating systems were developed by different manufacturers implement-
ing the ideas mostly tested by Multics. Two of the most popular operating systems were ultimately
released by IBM, multiprogramming with fixed tasks (MFT) and multiprogramming with variable
tasks (MVT) for their large S/360 and early large S/370 (3 stands for third generation and 70 for the
1970s) systems.
The versatile operating systems DOS/360 and OS/360 were developed later by IBM with the
main goal of driving their latest System/360 series but also for their existing small systems like
1401s and large systems like 7094s. As a result, an extraordinarily complex large operating system
evolved with thousands of bugs that were ultimately rectified and also modified after several ver-
sions. This operating system, however, was the first of this kind that was finally able to operate on
all different machines belonging to the S/360 family. The awesome success of DOS/360 and OS/360
provoked the other contemporary leading manufacturers like Burroughs, UNIVAC, NCR, CDC,
and ICL to come out with their own operating systems, and those were developed along the same
lines as DOS/360 and OS/360.
DOS/360 and OS/360, however, were further enhanced by IBM, primarily to accommodate
more users of the machine at a time, and also to include some deserving enhanced features. The
main bottleneck in this context was the limited capacity of main memory, which was really not
enough to simultaneously accommodate all the executing programs with their whole data sets.
It was thus finally planned to allocate memory space only dynamically (during runtime) among
different competing executing programs and move or “swap” them back and forth between main
and secondary memory as and when needed to mitigate this scarcity of memory space. This strat-
egy, however, was ultimately implemented within the operating system as an additional major pri-
mary function to automate memory management.
The ultimate outcome of this idea was the concept of virtual memory, an additional facility
implemented in third-generation computers that offered the user an illusion of having essentially
unlimited addressable memory for use with unrestricted access. IBM successfully implemented the
virtual memory mechanism in its line of 360 series machines, and the existing operating systems
were also upgraded accordingly to drive these architecturally modifed machines. Consequently,
operating systems like OS/VS1 and OS/VS2 (VS stands for virtual storage) came out, and later the
operating system OS/SVS (single virtual storage) was introduced in 1972 using a 24-bit addressing
scheme, a virtual space of 16 MB (2^24 bytes = 16 MB) for the older S/370 machine architecture. But this
24-bit address space along with separate virtual memory for each job also quickly became inad-
equate for some situations and the constantly widening spectrum of the S/370 family. Moreover, as
newer different application areas were constantly emerging, and there was an upsurge in the mem-
ory requirements from the user end, IBM thus introduced multiple virtual storage (MVS), the top
of the line and one of the most complex operating systems ever developed, for its mainframes to
manage such situations. With MVS, the limit was a dedicated 16 MB memory per job, where each
job actually got something less than half of this assigned virtual memory; the remaining space was
for the use of the operating system.
IBM extended the architecture of its underlying processor to handle 31-bit addresses, a facility
known as extended addressing (XA). To extract the potential of this new hardware, a new version of
MVS known as MVS/XA was launched in 1983 that summarily increased the per-job address space
to a maximum of 2 GB (gigabytes). Still, this was found not at all sufficient for some applications
and environments. As a result, IBM introduced the last major extension of the 370 architecture, a
new form known as enterprise system architecture (ESA), and the corresponding enhanced version
of the operating system, known as MVS/ESA, emerged in the late 1980s/early 1990s. Out of many
distinguishing features of this OS, one was that there were up to 15 additional 2-GB address spaces
for data available only to a specifc job, apart from the 2 GB address space per job that was already
available in MVS/XA. Consequently, the maximum addressable virtual memory per job was now
32 GB, one step further in the implementation of virtual storage.
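The address-space figures quoted above follow directly from the width of the address. The short C listing below simply works through the arithmetic; it is purely illustrative and not part of any of the operating systems named here.

    #include <stdio.h>

    int main(void)
    {
        /* 24-bit addressing (OS/SVS): 2^24 bytes per address space             */
        unsigned long long svs = 1ULL << 24;         /* 16,777,216 bytes = 16 MB */
        /* 31-bit extended addressing (MVS/XA): 2^31 bytes per address space     */
        unsigned long long xa  = 1ULL << 31;         /* 2,147,483,648 bytes = 2 GB */
        /* MVS/ESA: the 2 GB per-job space plus 15 additional 2 GB data spaces   */
        unsigned long long esa = xa + 15ULL * xa;    /* 16 x 2 GB = 32 GB        */

        printf("OS/SVS  address space : %llu bytes (16 MB)\n", svs);
        printf("MVS/XA  address space : %llu bytes (2 GB)\n",  xa);
        printf("MVS/ESA per-job total : %llu bytes (32 GB)\n", esa);
        return 0;
    }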
With the development of semiconductor IC RAM, memory size and speed notably increased,
and the price of the memory dropped drastically. The cost of sizable memory then came down to
an affordable range that ultimately inspired designers to have multiple memory modules instead of
just one. A relatively large high-speed buffer memory is thus provided between the CPU and main
memory to act as intermediate storage for quick access to information by the CPU that summarily
reduces the impact of the speed gap between the fast CPU and relatively slow main memory, thereby
improving overall execution speed. This memory is now called a cache and is in common use by
modern CPUs to access both data and instructions.
As the hardware was relentlessly upgraded with the continuous advancement of electronic tech-
nology, computer architecture was also constantly enhanced to make use of modern hardware to
fulfll the various needs and continually increasing demands of the users. Consequently, new oper-
ating systems were required to drive the changed hardware systems, and these new OSs were natu-
rally realized either by developing new ones or by repeated modification of existing older ones.
This, in turn, invited another problem: a user’s program, while executable under an old operating
system, became unusable under a new operating system (enhanced version) without modification,
and this modification sometimes could be quite extensive and also expensive. This situation was
faced by IBM in particular, since it introduced many different versions of OSs in quick succession
for its 360 and 370 series of machines, and those were very similar but not fully compatible. An
IBM installation with different versions of operating systems faced a lot of difficulties while chang-
ing or switching the operating system periodically to meet all the user needs and demands. To avoid
the operational difficulties arising from frequent switching of operating systems caused by frequent
updates in hardware architecture, IBM developed a special form of system architecture for its S/360
and S/370 series, popularly known as virtual machines (VMs).
Many installations thus resorted to VM, under which the operating system was designed to emulate multiple
computer systems. The ultimate objective of a virtual machine was to multiplex all system resources
between the users in such a way that each user was under the illusion of having undivided access to
all of the machine’s resources. In other words, each user believed they had a separate copy of the entire
machine of their own. Each such copy was termed a virtual machine. Each virtual machine was
logically separated from all others; consequently it could be controlled and run by its own separate
operating system. This led to innovative system organization (discussed in detail in Chapter 3),
where several different OSs were used concurrently over a single piece of hardware. The heart of the
system, known as virtual machine monitor (VMM), ran on the bare hardware (physical hardware)
and created the required virtual machine interface. The operating system accordingly was upgraded
to drive these architecturally modifed machines. The new operating systems thus released based on
the existing ones were called DOS/VM and OS/VM. A VM could even be configured to take advan-
tage of systems with multiple processors, but it was unable to provide all of the controls needed to
extract the full strength of modern multiprocessor configurations.
In spite of having several merits, two major drawbacks were observed in the design of virtual
machines. First of all, the cost of the hardware and hardware interface were very high in those days.
Second, any fault or failure in the common hardware interface (VMM) or the single piece of basic
hardware on which the different virtual machines with their own operating systems were running
concurrently would cause a severe breakdown in the operation of every machine, leading to a total
collapse of the entire system. In addition, the VMM interface was a complex program and was
not simple to realize while obtaining reasonable performance. As a result, virtual machines gradually
fizzled out, but the concept and its successful implementation eventually had an immense impact
that opened new horizons in the architectural design of computer systems and their organization
(especially the innovation of computer networks) in the days to come.
For more details about mainframes, see the Support Material at www.routledge.com/
9781032467238.
1.6.3.5 Minicomputers
During the third generation, another major development was the introduction of minicomputers.
The Digital Equipment Corporation (DEC) in 1961 launched its PDP-1, which had only 4K memory
of 18-bit words but with a drastic reduction in price and size, much less than its contemporary, IBM
7094. The performance of the PDP-1 was up to the level of the IBM 7094, and hence it became
very popular. A whole new industry came up, with DEC introducing a series of other PDPs: PDP-7
and PDP-8. The block diagram of the organization of PDP-8 is shown in Figure 1.2. Soon, DEC
launched its enhanced version, the 16-bit successor PDP-11, a machine totally compatible with
the other members of the family, in the early 1970s.
FIGURE 1.2 Operating systems used to drive minicomputers and the hardware structure of such a representative minicomputer organization (DEC PDP-8).
The IBM 360 and PDP-11 had a very close
resemblance: both had byte-oriented memory and word-oriented registers. The parameters of the
PDP-11, like the cost, performance, size, and overhead expenditure, were so attractive that it became
immensely successful both in the commercial and academic arenas, and it was also in wide use until
the 2000s for industrial process control applications.
DEC developed many operating systems for its various computer lines, including the simple
RT-11 operating system for its 16-bit PDP-11-class machines. For PDP-10–class systems (36-bit),
the time-sharing operating systems TOPS-10 and TOPS-20 were developed. In fact, prior to the
widespread use of UNIX, TOPS-10 was a particularly popular system in universities and also in the
early ARPANET community.
Technological advancements led to the increasing power of minicomputers’ functional opera-
tions. The availability of low-cost and larger-capacity main memory attached to the minicomputer
allowed a multi-user, shared system to run. Within a short period, the successor to the PDP-11, the
first versatile 32-bit minicomputer, VAX, was launched with a powerful VMS operating system that
offered a multi-user shared environment.
Minicomputers also came from many other companies with various configurations, but a mini-
computer technically consists of a 16- or 32-bit microprocessor or a similar processor, a
comfortable amount of memory, and a few input-output support chips interconnected with each other
or mounted on a single motherboard. DEC with its PDP family soon took a formidable lead over
the other manufacturers. The PDP-11 became the computer of choice at nearly all computer science
departments. Commercial organizations were also able to afford a computer of this type for their
own dedicated applications.
For more details about minicomputers, see the Support Material at www.routledge.com/
9781032467238.
1.6.3.6 UNICS/UNIX
Multics eventually fizzled out. But Ken Thompson, one of the computer scientists who worked on the
Multics project at Bell Labs, remained under the strong influence of the design approach used
by Multics. He wanted to continue with this idea and ultimately decided to write a new, stripped-down,
one-user version of Multics using an obsolete and discarded PDP-7 minicomputer. His developed
system actually worked and met the predefined goal in spite of the tiny size of the PDP-7 computer.
One of the other researchers and his friend, Brian Kernighan, somewhat jokingly called it UNICS
(UNiplexed Information and Computing Service), although the spelling was eventually changed
to UNIX after a series of modifications and enhancements. Historically, UNIX appeared at that
point as a savior of the popular PDP-11 series, which was in search of a simpler, efficient operating
system, since it was running at that time under a dreadful operating system with literally no other
alternatives.
Looking at the initial success of UNIX, and after being totally convinced of its bright future,
Dennis Ritchie, a renowned computer scientist at Bell Labs and a colleague of Thompson, joined
in this project with his whole team. Two major steps were immediately taken in this development
process. The first was to migrate the project from the platform of the obsolete PDP-7 to the more
modern and larger PDP-11/20, then to the even more advanced PDP-11/45, and finally to the most
modern system of those days, the PDP-11/70. The second step was in regard to the language used
in writing UNIX. Thompson, at this juncture, decided to rewrite UNIX afresh in a high-level lan-
guage, leaving its existing line to avoid having to rewrite the entire system whenever the underlying
hardware platform was upgraded. He thus used his own designed language, called B, a simplified
form of BCPL, which in turn was a form of CPL that never worked. Unfortunately, the structure
of the B language was not well equipped enough to support his approaches in designing UNIX, and
consequently, his strategy for realizing this operating system could not be implemented. At this
stage, Ritchie came to the rescue and designed an appropriate successor to the B language called C,
and then wrote a fabulous compiler for it. Later, Ritchie and Thompson jointly rewrote UNIX once
again using C. Coincidences sometimes shape history, and the emergence of C at that critical time
was precisely the correct approach for that implementation. Since then, the C language has been
constantly modified and enhanced, and remains an important language platform in the area of
system program development even today.
UNIX developed using C has been widely accepted in the academic world, and ultimately in
1974, the UNIX system was first described in a technical journal. However, the first commonly
available version outside Bell Labs, released in 1976, became the first de facto standard, named
Version 6, so called because it was described in the sixth edition of the UNIX programmer’s man-
ual. Soon, it was upgraded to Version 7, introduced in 1978, with a simple file system, pipes, a clean
user interface (the shell), and an extensible design. Apart from many other contemporary systems,
the main non-AT&T UNIX system developed at the University of California, Berkeley was called
UNIX BSD and ran first on PDP and then on VAX machines. In the meantime, AT&T repeatedly
refined its own systems, constantly adding many more features, and by 1982, Bell Labs had com-
bined several such variants of AT&T UNIX into a single system that was marketed commercially
as UNIX system III. Later, this operating system, after several upgrades, incorporated a number of
important new features in a commercially viable integrated fashion that was eventually introduced
as UNIX system V.
By the late 1980s, however, the situation was horrible. Virtually every vendor by this time had
started to regularly include many extra nonstandard features as enhancements and part of its own
upgrades. As a result, there were no standards for binary program formats, and the world of UNIX
was split into many dissimilar ones that greatly inhibited the expected commercial success of
UNIX. In fact, two different and quite incompatible versions of UNIX, 4.3 BSD and System V
Release 3, were in widespread use. It was then difficult, indeed practically impossible, for software vendors
to write and package UNIX programs that would run on any UNIX system, as could be done with
other contemporary operating systems. The net outcome was that standardization in different ver-
sions of UNIX, including the ones from these two different camps, was immediately needed and
accordingly demanded. Many attempts in this regard initially failed. For example, AT&T issued its
System V Interface Definition (SVID), which defined all the system calls, file formats, and many
other components. The ultimate objective of this release was to keep all the System V vendors in
line, but it failed to have any impact on the enemy camp (BSD), who just ignored it.
However, the first serious attempt to reconcile the different flavors of UNIX was initiated through
a project named POSIX (the first three letters refer to portable operating system, and the last two let-
ters were added to make the name UNIXish), carried out by a collective effort under the auspices of
the IEEE standards board, involving hundreds of people from industry, academia, and government. After
a great deal of debate, with arguments and counterarguments, the POSIX committee finally pro-
duced a standard known as 1003.1 that eventually broadened the base of OS implementation beyond
that of pure UNIX by standardizing the user interface to the OS rather than merely organizing its
implementation. This standard actually defines a set of library procedures that every conformant
UNIX system must provide. Most of these procedures invoke a system call, but a few can be imple-
mented outside the kernel. Typical procedures are open, read, fork, and so on. The 1003.1 document is
written in such a way that both operating system implementers and software developers can under-
stand it, another novelty in the world of standards. In fact, all manufacturers are now committed to
provide standardized communication software that behaves in conformance with certain predefined
rules to provide their customers the ability to communicate with other open systems.
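As a small illustration of what this portability means in practice, the following C fragment uses only calls named in the 1003.1 standard (open, read, close, fork, wait). The file path is a hypothetical example chosen for the sketch, and error handling is kept minimal for brevity, but the same code compiles unchanged on any conformant UNIX system.

    #include <fcntl.h>     /* open()                    */
    #include <stdio.h>
    #include <sys/wait.h>  /* wait()                    */
    #include <unistd.h>    /* read(), close(), fork()   */

    int main(void)
    {
        char buf[128];
        int fd = open("/etc/hostname", O_RDONLY);   /* hypothetical example file */
        if (fd >= 0) {
            ssize_t n = read(fd, buf, sizeof buf - 1);
            if (n > 0) {
                buf[n] = '\0';
                printf("read %ld bytes: %s", (long)n, buf);
            }
            close(fd);
        }

        pid_t pid = fork();          /* create a new process, POSIX-style      */
        if (pid == 0) {
            printf("hello from the child process\n");
            _exit(0);
        } else if (pid > 0) {
            wait(NULL);              /* parent waits for the child to finish   */
            printf("parent resumes after child exits\n");
        }
        return 0;
    }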
The triumph of UNIX had already begun, and by the mid-1980s, UNIX nearly monopolized
the commercial as well as scientifc environment, with workstations running on machines ranging
from 32-bit microprocessors up to supercomputers. Although UNIX was designed as a time-sharing
system, its multiprogramming ability as well as the extensibility function inherent in its design
naturally ft into the workstations used in network environments. UNIX gradually became popular
in the workstation market and ultimately started to support high-resolution graphics. Today, main-
frame environments and even supercomputers are managed by UNIX or a UNIX-like (or a variant
of UNIX) operating system.
Another system that evolved during this period was the Pick operating system, which initially started
as a database application support program and ultimately graduated to carrying out system work,
and it is still in use as an add-on database system across a wide variety of systems supported on most
UNIX systems. Other database packages, such as Oracle, Sybase, and Ingres, came at a later
stage and were primarily middleware that contained many of the features of operating systems, by
which they can support large applications running on many different hardware platforms.
For more details about UNICS/UNIX, see the Support Material at www.routledge.com/
9781032467238.
multitasking in single-user environment. Windows NT was one such operating system that exploited
the tremendous power of contemporary 32-bit microprocessors and provided full multitasking with
ease-of-use features in a single-user environment. This indicates that a multiprogramming operat-
ing system is necessarily a multitasking operating system, but the converse is not always true.
typical clock rates exceeding a few gigahertz (GHz) today. Other manufacturers started produc-
ing PCs in the same line as IBM, and all their machines were compatible with IBM-PCs, but the
components used by different manufacturers were obviously different. The most popular single-
user highly interactive operating system for early personal computers was CP/M, which was then
converted to PC-DOS by IBM and was finally displaced by MS-DOS, which was developed based on
the existing PC-DOS by the Microsoft Corporation under license from IBM.
Although single-user MS-DOS was not very intelligent, it was well suited to the organization
of the PC machine and had excellent performance. Its design actually put more emphasis on user-
friendliness, sacrificing its other vital objectives, one of which is resource utilization. All PCs, irrespec-
tive of the brand (manufacturer), ultimately used MS-DOS, which eventually became a de facto
standard. It has been constantly upgraded on a regular basis, ending every time with a release of a
newer version with more advanced features to fulfill most of the users’ requirements and, of course,
developed in the same line of the existing system, maintaining its family concept. Consequently,
MS-DOS on personal computers ultimately dominated other operating system products and finally
consolidated its position in the entire personal computer market. MS-DOS is traditionally a single-
user system but does provide multitasking by way of coupling available hardware facilities (DMA)
with existing software support. It was not at all a true multiprogramming system, although it pro-
vided a file system similar to the one offered by UNIX.
Due to the constantly decreasing cost of hardware resources, it became feasible to provide graph-
ical user interfaces (GUIs) for many operating systems. The original GUI was developed at Xerox’s
Palo Alto Research Center (XEROX PARC) in the early 1970s (the Alto computer system), and
then many others were created, including Apple’s Mac OS and also IBM’s OS/2. Microsoft finally
added this excellent GUI feature in the early 1990s in the form of Windows as a platform to the user
running on the existing MS-DOS operating system. Windows 3.1, 95, 98, Windows 98 Second Edition,
and even Windows Millennium were platforms of this type released by Microsoft running on
MS-DOS. Finally, Microsoft launched a Windows NT-based OS, a full-fledged standalone operat-
ing system providing an extensive graphical user interface.
Hardware technology rapidly advances, offering more tiny, low-cost, sophisticated components
for use in all aspects of computers to make the system more powerful and versatile. Side by side,
both the system software and third-party–created application software development tools progressed
remarkably, yielding lots of different types of useful software for users to fulfill their everyday
requirements. As a result, personal computer culture gradually became widespread and matured to
ultimately grow into more sophisticated systems, even moving one step forward to almost replace
existing larger minicomputer machines.
For more details about personal computers, see the Support Material at www.routledge.com/
9781032467238.
Both of these issues, cost and fault tolerance of the entire system (or at least a part of it) in the
event of critical hardware failure, have been successfully addressed by exploiting a
different approach, while preserving the central theme of allowing multiple operating systems
to run concurrently.
Now all the costly centralized hardware has been replaced, and the cost was then distributed to
realize a collection of low-cost standalone autonomous computer systems, like microcomputers;
each one was then run under its own operating system to support its own local users, and all such
systems in this domain were then interconnected with one another via their hardware and soft-
ware tools to enable them to intercommunicate and cooperate. This fulfills the first requirement:
cost. Since each such small system in this arrangement can use its own different operating system
to support its own local users, this arrangement appeared to have multiple operating systems run-
ning simultaneously, thus fulfilling the second requirement. In this way, the cost was substantially
brought down to an affordable limit, and since the failure of any system or any fault in intercon-
nection hardware in this arrangement will not affect the other systems, the fault-tolerance issue has
been solved to a large extent.
The emergence of this design concept of interconnecting a collection of autonomous computer
systems capable of communication and cooperation between one another by way of communi-
cation links and protocols is popularly known as a computer network. Each such autonomous
machine running under its own operating system is able to execute its own application programs
to support its own local users and also offers computational resources to the network; such a machine is usually
called a host. Computer networks could, however, be considered an immediate predecessor of a
true distributed computing system, which will be discussed later in detail. Sometimes computer
networks are loosely called distributed computer systems, since they carry a flavor of a true dis-
tributed computing system comprising hardware composed of loosely bound multiple processors
(not multiprocessors).
Apart from the local operating system installed in each machine to drive it, computer networks
are managed by a different type of operating system installed additionally in each individual com-
puter to allow the local user to use and access information stored in another computer via high-
speed communication facilities. This operating system is known as a network operating system
(NOS) or sometimes network file system (NFS), a successor of KRONOS, which was developed
by Control Data Corporation during the 1970s. In the late 1970s, Control Data Corporation and the
University of Illinois jointly developed what was then the innovative PLATO operating system that
featured real-time chat and multi-user graphical games using long-distance time-sharing networks.
In the 1970s, UNIVAC produced the Real-Time Basic (RTB) system to support a large-scale time-
sharing environment.
A NOS enables users to be aware of the presence of other computers, login to a remote computer,
and transmit and receive data/files from one computer to another. From its beginnings, a NOS supported
multiple interacting users. However, it has come a long way, with numerous enhancements and
modifications in its various forms, and ultimately evolved into multiuser interactive operating sys-
tems, known as the client–server model, a widely used form observed today.
FIGURE 1.3 Network operating system, a different kind of operating system used in computer networks organized in the form of the workstation–server (client–server) model.
to share both the hardware and software resources available in the entire low-cost arrangement. The
environment thus formed is depicted in Figure 1.3. The outcome was a remarkable one in the design
of computer hardware confgurations that led computing practices to follow a completely different
path. To make this approach operative, the entire system naturally requires the service of a different
type of more sophisticated and complex operating system to manage all the resources. The result-
ing operating system developed along the lines of time-sharing technology became well-suited to
managing LAN-based computing environments.
The operating system eventually evolved with an innovative concept in the design that structured
it as a group of cooperating processes called servers that offer services to their users, called clients.
This client–server model of the operating system was able to manage many cooperating machines
with related network communications. It also provided client and server resource management strat-
egies, new forms of memory management and file management strategies, and many other similar
aspects. In fact, client and server machines usually all run the same microkernel (the inner level of
the operating system), with both the clients and servers running as user processes.
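The cooperating-process structure described above can be illustrated with a deliberately minimal sketch: a "server" process that offers one trivial service (echoing whatever a "client" sends) over a POSIX TCP socket. The port number is an arbitrary choice for the example, error checking is omitted for brevity, and the sketch is illustrative only; it is not code from any of the systems named in the text.

    #include <arpa/inet.h>    /* htonl(), htons()                     */
    #include <netinet/in.h>   /* struct sockaddr_in, INADDR_ANY       */
    #include <string.h>       /* memset()                             */
    #include <sys/socket.h>   /* socket(), bind(), listen(), accept() */
    #include <unistd.h>       /* read(), write(), close()             */

    #define PORT 5000         /* arbitrary port chosen for the example */

    int main(void)
    {
        int srv = socket(AF_INET, SOCK_STREAM, 0);      /* listening socket     */
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(PORT);

        bind(srv, (struct sockaddr *)&addr, sizeof addr);
        listen(srv, 5);

        for (;;) {                                      /* serve clients forever */
            int cli = accept(srv, NULL, NULL);          /* wait for a client     */
            char buf[256];
            ssize_t n = read(cli, buf, sizeof buf);     /* receive the request   */
            if (n > 0)
                write(cli, buf, n);                     /* reply: echo it back   */
            close(cli);                                 /* done with this client */
        }
    }

A client would simply connect() to the server's address and port and then read and write on its own socket; in the client–server operating systems discussed here, the same request–reply pattern is applied to file service, printing, and other system services rather than to a toy echo.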
These operating systems running on distributed hardware configurations are well-suited to han-
dle distributed applications offering coarse-grained distribution. Moreover, they provide distributed
file systems that facilitate system-wide access to files and I/O devices. Some systems also provide
migration of objects such as files and processes for the sake of improved fault tolerance and load
distribution. Another nice feature of this system is its scalability, which means that the system
confguration can be gradually incremented on demand in size as the workload regularly grows,
of course without affecting its high reliability. In short, these systems exhibit tremendous strength,
offering an excellent cost/performance ratio when benchmarked.
The major drawbacks of this system are its complex software, weak security, and, above all, potential communication bottlenecks. Since the distributed file system is exposed to the user and commonly runs as a user process, it is often under potential threat in the form of unauthorized access or snooping by active/passive intruders, casual prying by non-technical users, or determined
attempts by unscrupulous users. To protect information from such malicious activities, the system
has to provide more elaborate forms of protection mechanisms, including user authentication and,
optionally, data encryption for increased security.
For more details about the client–server model, see the Support Material at www.routledge.com/
9781032467238.
1.6.4.4 Superminis
The power of minicomputers was constantly increasing, and at the same time, numerous types
of diverse application areas started to evolve, particularly in the area of large applications. In this
situation, the existing 16-bit minicomputers were thus observed to have certain limitations from an
architectural point of view. A few of these to mention are: limited addressing capability, a limited
number of operation codes due to existing 16-bit instructions, limited scope to use numbers with
higher precision, and many others. The solution to all these drawbacks ultimately led to the introduction of 32-bit minicomputers in the late 1970s, along the same lines as their 16-bit predecessors; such a machine was popularly known as a supermini or mid-range system.
The 32-bit supermini computer supported more users working simultaneously, providing more
memory and more peripheral devices. These machines were considered far more powerful than
the gigantic IBM 360/75 mainframe. DEC launched its most powerful supermini family, popularly known as the VAX series; the VAX 8842, with the VMS operating system, was one of the most popular and widely used superminis. IBM introduced its supermini AS/400 series of machines with the OS/400 operating system, which is still dominant; at present its upgraded version, known as the P series, is also widely in use in many application areas. A schematic diagram of the AS/400 is shown
in Figure 1.4. These machines today also provide support to larger computers when hooked up to
them. Most superminis, including the AS/400 (P series), are today commonly used as powerful
FIGURE 1.4 Operating systems used in a supermini computer system and the hardware design of such a representative supermini system (IBM AS/400).
standalone systems for both scientific and commercial applications as well as servers in a network
of computers.
For additional details about superminis, see the Support Material at www.routledge.com/
9781032467238.
A distributed operating system (DOS) offers several well-recognized advantages:
• Resource sharing
• Communication
• Reliability
• Computation speed-up
• Scalability (incremental growth)
Besides, a DOS often requires personal IDs and passwords for users to strengthen security mecha-
nisms to provide guaranteed authenticity of communication and at the same time permit the users
to remain mobile within the domain of the distributed system.
However, the advantages of DOSs may sometimes be negated by certain factors, above all their ultimate dependence on communication networks: any problem in the network, or its saturation at any time, may create havoc. Moreover, easy sharing of resources, including data, though advantageous, may turn out to be a double-edged sword, since the exposure it entails may pose a security threat even with the provision of user IDs and passwords in communications. We will now separately discuss in brief the DOS when used in multiprocessor and multicomputer systems.
For more details about common functionalities of DOS, see the Support Material at www.routledge.
com/9781032467238.
FIGURE 1.5 Operating system used in advanced multiprocessor system and the hardware organization of
such a representative multiprocessor system (DEC VAX 9000).
A multiprocessor in the form of true distributed hardware requires a different type of operating system, designed with an innovative approach and a new set of modules to meet its objectives; this ultimately gives rise to the concept of a DOS. Such a system creates an abstract environment in which machine boundaries are not visible to the programmer. A DOS of this type uses a single centralized control system that truly converts the existing collection of hardware and software into a single integrated system.
When a multiprocessor is managed by a true DOS, a user who submits a job is not aware of which processor the job will run on, where the needed files are physically located, or whether the job will be shifted from one processor to another during execution for the sake of load balancing. In fact, the user has no choice and is not informed, because everything should be efficiently handled in a transparent way by the operating system in an automated manner. To achieve this, one option among many is to have more complex processor scheduling mechanisms that synchronize the activities of several coexisting processors in order to realize the highest degree of parallelism. Hence, a true DOS should not be considered a mere extension or addition of code providing new features over existing traditional uniprocessor operating systems or NOSs. Though a DOS appears to its users very similar to a traditional uniprocessor system, offering system services that may qualify it as time-sharing, real-time, or any combination of them for the benefit of local clients, it may also facilitate shared access to remote hardware and software resources.
The versatile third-generation OS/360 from IBM gradually evolved through the MFT, MVT, and SVS systems and then to higher-generation (fourth-generation) DOSs like MVS, MVS/XA, MVS/ESA, OS/390, and z/OS. Although these operating systems include the essentials of the UNIX kernel, a huge number of new functions were added. These functions provide numerous supports required by modern mission-critical applications running on large distributed computer systems like z-Series mainframes. It is worthwhile to mention that IBM maintained total compatibility with past releases, giving rise to a full family concept, so that programs developed in the 1960s can still run under the modern DOS z/OS with almost no change. Although z/OS runs UNIX applications, it is a proprietary OS, as opposed to an open system.
participation of all associated computers in the control functions of the distributed OS. However,
there always exists a possibility of network failure or even failure of an individual computer system
in the collection that may complicate the normal functioning of the operating system. To negotiate
such a situation, special techniques must be present in the design of the operating system that will
ultimately permit the users to somehow access the resources over the network.
However, the special techniques used in the design of a DOS mostly include distributed control,
transparency, and remote procedure calls (RPCs).
Brief details of these special techniques are given on the Support Material at www.routledge.
com/9781032467238.
FIGURE 1.6 Hardware model of representative real-time system controlled by Real-Time Operating System
(RTOS).
to meet the underlying response requirement of a real-time application but cannot guarantee it will
meet it under all conditions. Typically, it meets the response requirements in a probabilistic manner,
say, 95% of the time. Control applications like nuclear reactor control, flight control in aviation, and guided control of a missile fail miserably if they cannot meet the response time requirement. Hence, they must be serviced using hard real-time systems. Other types of real-time applications, like reservation systems and banking operations, do not have any such notion of failure; hence they may be serviced using soft real-time systems. Hard real-time systems should therefore avoid using features whose performance cannot be predicted precisely.
The provision of domain-specific interrupts and associated interrupt-servicing actions enables a real-time system to respond quickly to special conditions and events in the external system within a fixed deadline. When resources are overloaded, hard real-time systems sometimes partition resources and allocate them permanently to competing processes in applications. This reduces OS overhead by avoiding repeated execution of the allocation mechanisms, which allows processes to meet their response time requirements, though at the cost of compromising the resource utilization target of the system.
An RTOS is thus valued more for efficiency, that is, how quickly and/or predictably it can respond to a particular event, than for the amount of work it can perform over time. An early example of a large-scale RTOS was the so-called "control program" developed by American Airlines and IBM for the Sabre Airline Reservations System.
The key to the success of an RTOS lies in the design of its scheduler. Since every event in the system gives rise to a separate process, and a large number of almost-concurrent events must be handled, the corresponding processes are arranged in order of priority by the scheduler. The operating system thus emphasizes processor management and scheduling and concentrates less on memory and
file management. The RTOS usually allocates the processor to the highest-priority process among those in a ready state. Higher-priority processes are normally allowed to preempt the execution of lower-priority processes when required, that is, to force them from the executing state back to the ready state at any point as the situation demands.
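A minimal sketch of this selection rule, under the assumption that a larger number means a higher priority, is given below; the process table and function names are invented for illustration and are not taken from any specific RTOS. Whenever a higher-priority process becomes ready, it is the one selected, which is exactly the preemption behavior just described.

/* Illustrative priority-driven selection, not taken from any real RTOS. */
#include <stdio.h>

enum state { READY, RUNNING, BLOCKED };

struct process {
    int id;
    int priority;          /* larger value = higher priority (an assumption) */
    enum state st;
};

/* Return the index of the highest-priority process that is not blocked. */
static int pick_highest(struct process p[], int n) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (p[i].st == BLOCKED) continue;
        if (best == -1 || p[i].priority > p[best].priority)
            best = i;
    }
    return best;
}

int main(void) {
    struct process table[] = {
        {1, 10, RUNNING},   /* low-priority task currently on the CPU */
        {2, 50, BLOCKED},   /* high-priority task waiting for an event */
        {3, 20, READY},
    };
    int n = 3;

    printf("running: process %d\n", table[pick_highest(table, n)].id);

    table[1].st = READY;    /* the event arrives: process 2 becomes ready */
    int next = pick_highest(table, n);
    printf("preempt: process %d now runs\n", table[next].id);  /* prints 2 */
    return 0;
}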
Many such processes simultaneously exist, and these must be permanently available in main
memory with needed protection, not only for realizing a quick response but also to enable them to
closely cooperate with one another at certain times to manage current needs. The processes are,
however, seldom swapped to and from main memory.
Another area RTOSs focus on is time-critical device management. For example, a telecommunication system carries out an exchange of information on communication lines by transmitting sequences of bits. For error-free message transmission, each and every bit must be received correctly. A bit, while on the line, remains intact only for a very short time known as the bit period. The system must respond within this critical period before the bit is lost; otherwise erroneous message transmission may occur.
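The bit period is simply the reciprocal of the line's bit rate, so the deadline the system must meet shrinks as the line gets faster. The short calculation below illustrates the point for a few example line speeds; the figures are illustrative only.

/* Bit period = 1 / bit rate: the window within which each bit must be read. */
#include <stdio.h>

int main(void) {
    double rates_bps[] = {9600.0, 1e6, 1e9};   /* example line speeds */
    for (int i = 0; i < 3; i++) {
        double bit_period_us = 1e6 / rates_bps[i];  /* microseconds per bit */
        printf("%12.0f bit/s -> bit period = %10.4f microseconds\n",
               rates_bps[i], bit_period_us);
    }
    return 0;
}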
File management is not very important, and not even present, in most real-time systems. For example, many embedded real-time systems used as controllers, such as aviation controllers used in flight control and ballistic missile control, have no provision of secondary storage for storing files and hence no file management. However, larger installations with an RTOS do have this management, with almost the same requirements as found in conventional multiprogramming and time-sharing systems. The main objective of this file management is again to meet time-critical criteria: more emphasis is placed on the speed of file access than on building an efficient file system for optimal utilization of secondary storage space and user convenience.
RTOS maintains the continuity of its operation even when faults occur. In fact, RTOS usu-
ally employs two techniques to negotiate such situations: fault tolerance and graceful degradation.
Fault tolerance is realized by using redundancy of resources (the presence of more resources than
the actual requirement) to ensure that the system will keep functioning even if a fault occurs; for
example, the system may have two disks even though the application actually requires only one.
The other one will actually take charge when a fault occurs in the operating disk. Graceful degradation implies that when a fault occurs, the system can fall back to a reduced level of service and subsequently revert when the fault is rectified. In this situation, the user assigns high priorities to critical functions so that those will be performed in a time-bound manner, within a specified time, even when the system runs in degraded mode.
Designers of RTOSs are quite aware of these and other facts and have accordingly developed these systems to achieve all these objectives and many others. Different types of RTOSs are thus designed and developed with numerous important features so as to attain their different respective goals, although certain common features must be present in any type of RTOS. More recent RTOSs, however, are found to have almost invariably implemented time-sharing scheduling in addition to priority-driven preemptive scheduling.
electronic technology in building the hardware of computer systems, discarding existing mechani-
cal technology, and the concept of the operating system then started to bloom. Computer architec-
ture then progressed through generation after generation, with the sixth generation at present; each
one is, however, distinguished by its own major characteristics. To manage these constantly evolv-
ing, more advanced computer systems, different generations of operating systems have continuously
been developed. Sometimes an innovation in design of an operating system awaits the arrival of a
suitable technology for implementation. Market forces also play an important role in encouraging
a particular design feature. Large manufacturers, by their dominance in the market, also promote
certain features. It is interesting to note that despite rapid technological advances, both in hardware
and in software, the design of the logical structure of the operating system as proposed in the 1960s
has progressed rather slowly.
Any change in the design of an operating system demands a high cost in related program development and consequently has an impact on the application software that runs on it. Once the operating system and related system software of a particular computer become popular and widely accepted, users are observed to be very reluctant to switch to other computers requiring radically different software. Moreover, if the software system comes from a giant manufacturer, a worldwide standard is effectively enforced by them, and newer emerging techniques and methodologies are then implemented along the same lines in forthcoming members of their family. This has been observed in operating systems developed by IBM and Microsoft and also in different versions of Linux/UNIX and Solaris.
The track of development of operating systems is from single-user operating systems used in
mainframe systems; to batch multiprogramming systems; to timesharing multitasking systems; to
single-user, single-task personal computer-based systems and thereby workstations interconnected
with various forms of networks; and finally DOSs to manage distributed hardware with multiple
processors. An operating system that manages a single computer on its own is popularly known as a
centralized, single‑CPU, single‑processor, or even traditional/conventional operating system. One of the important aspects of today's computer usage is the handling of high volumes of information, in numerous forms, generated by individual computers (users); such information requires proper sharing and deliberate exchange as well as fast simultaneous access whenever required by other computers.
That is why it is observed that essentially all computers in industries, educational institutions, and
government organizations are networked and running under an OS providing adequate protection
from unauthorized access and ensuring free access to all shared resources. This leads us to believe
that the DOS is in coarse-grained form at present, and its fine-grained form is in the future. These
systems also implement multiprogramming techniques inherited from traditional batch and then
timesharing and multitasking systems. The growing complexity of embedded devices has ultimately
led to increasing use of embedded operating systems (RTOS). Figure 1.7 exhibits a schematic evolu-
tionary track in the development of modern operating systems.
Modern operating systems have a GUI using a mouse or stylus for input in addition to using
a command-line interface (or CLI), typically with only the keyboard for input. Both models are
centered around a “shell” that accepts and executes commands from the user (e.g. clicking on a but-
ton or a typed command at a prompt). Choosing an OS mainly depends on the hardware architecture and the associated application environment, but only Linux and BSD run on almost any CPU and support nearly all environments. All Windows versions (both Professional and Server) are
mainly for Intel CPUs, but some of them can be ported to a few other CPUs (DEC Alpha and MIPS
Magnum). Since the early 1990s, the choice for personal computers has largely been limited to the
Microsoft Windows family and the UNIX-like family, of which Linux and Mac OS X are becom-
ing the major alternatives. Mainframe computers and embedded systems use a variety of different
operating systems, many with no direct relation to Windows or UNIX but typically more similar to
UNIX than Windows.
UNIX systems run on a wide variety of machine architectures. They are heavily in use as
server systems in commercial organizations, as well as workstations in academic and engineering
FIGURE 1.7 Graphical presentation of stage-wise evolution of operating systems from its very inception
(primitive one) to most modern forms.
environments. Free software UNIX variants, such as Linux and BSD, are increasingly popular and
are mostly used in multiuser environments.
The UNIX‑like family, commonly used to refer to the large set of operating systems that resemble
the original UNIX (also an open system), is a diverse group of operating systems, with several
major sub-categories, including System V, BSD, and Linux. Some UNIX variants like HP’s HP-
UX and IBM’s AIX are designed to run only on that vendor’s proprietary hardware. Others, such
as Solaris, can run on both proprietary hardware (SUN systems) and on commodity Intel x86 PCs.
Apple’s Mac OS X, a microkernel BSD variant derived from NeXTSTEP, Mach, and FreeBSD, has
replaced Apple’s earlier (non-UNIX) Mac OS. Over the past several years, free UNIX systems have
supplemented proprietary ones in most instances. For instance, scientifc modeling and computer
animation were once the province of SGI’s IRIX. Today, they are mostly dominated by Linux-based
or Plan 9 clusters.
Plan 9 and Inferno were designed by Bell Labs for modern distributed environments and later had graphics built into their design. Plan 9 did not become popular because it was originally not
free. It has since been released under the Free Software and Open Source Lucent Public License and
gradually gained an expanding community of developers. Inferno was sold to Vita Nuova and has
been released under a GPL/MIT license.
The Microsoft Windows family of operating systems initially originated as a graphical layer
on top of the older MS-DOS environment, mainly for IBM PCs, but also for DEC Alpha, MIPS,
and PowerPC, and, as of 2004, it ultimately held a near-monopoly of around 90% of the worldwide
desktop market share. Modern standalone Windows versions were based on the newer Windows NT core, which first took shape in OS/2 and borrowed from OpenVMS. They were also found in use
on low-end and mid-range servers, supporting applications such as web servers, mail servers, and
database servers as well as enterprise applications. The next addition to the Microsoft Windows
family was Microsoft Windows XP, released on October 25, 2001, and then many others, and
finally its next generation of Windows, named Windows Vista (formerly Windows Longhorn),
adding new functionality in security and network administration, and a completely new front-end
known as Windows‑Black‑Glass. Microsoft then constantly kept releasing newer and newer ver-
sions of Windows, upgrading the latest member of its family with many novel, distinct, important
features applicable to both individual computing as well as server-based distributed computing
environments.
Older operating systems, however, are also in use in niche markets that include the versatile
Windows-like system OS/2 from IBM; Mac OS, the non-UNIX precursor to Apple’s Mac OS X;
BeOS; RISC OS; XTS-300; Amiga OS; and many more.
As one of the most open platforms today, the mainframe fosters a tighter integration between
diverse applications and provides a strong basis for an organization’s service-oriented architecture
(SOA) deployment. Mainframes today not only provide the most secure, scalable and reliable plat-
form but also demonstrate a lower total cost of ownership (TCO) when compared to a true distrib-
uted system. Today, most business data reside on mainframes. Little wonder that it continues to be
the preferred platform for large organizations across the globe. The most widely used notable mainframe operating systems from IBM, such as S/390 and z/OS, as well as embedded operating systems such as VxWorks, eCos, and Palm OS, are usually unrelated to UNIX and Windows, except for Windows CE, Windows NT Embedded 4.0, and Windows XP Embedded, which are descendants of Windows, and several *BSDs and Linux distributions tailored for embedded systems. OpenVMS from Hewlett-Packard (formerly DEC) has already contributed a lot and is still
in heavy use. However, research and development of new operating systems continue both in the
area of large mainframes and in the minicomputer environment, including DOSs.
Although the operating system drives the underlying hardware while residing on it, only a small
fraction of the OS code actually depends directly on this hardware. Still, the design of the OS varies across different hardware platforms, which, in turn, critically restricts its portability from
one system to another. Nevertheless, it will be an added advantage if the same developed code can
be used directly or by using an interface to produce a multi-modal operating system, such as gen-
eral-purpose, real-time, or embedded. Moreover, if the design of the OS could be made so flexible that it allowed the system policy to be modified at will to fulfill its different targeted objectives, then developers could take advantage of this by varying compile-time directives and installation-time parameters to customize the operating system, an important aspect in the development of operating system code. Consequently, it could then fulfill the differing operating system requirements of a diverse spectrum of user environments, even those with conflicting needs. For example, an OS developed for a desktop computer could be ported to a cell phone by using appropriate installation-time parameters, if provided.
The introduction of cluster/server architecture eventually gave the needed impetus for radical
improvement both in hardware architecture and organization as well as in sophisticated OS and
related system software development for distributed computing (cloud computing) to users on single-
user workstations and personal computers. This architecture, however, provides a blend of distributed
decentralized and centralized computing using resources that are shared by all clients and main-
tained on transparent server systems. Resurgence continues in this domain, maintaining a high pace
of enhancements culminating in more refined technological and application developments, and these
innovative improvements and upgrades are expected to continue relentlessly in the days to come.
SUMMARY
This chapter explains the need for operating systems in computers and in what ways they can help
users handle computers with relative ease. The concept of the operating system and its subsequent
continuous evolution through different generations, starting from its bare primitive form to today’s
most modern versions, to manage different, constantly emerging more powerful and sophisticated
hardware platforms gradually progressed over a period of the last six decades. The evolution took
place from batch processing (resident monitor) and variants of multiprogramming, multitasking,
multiuser, multi-access, virtual machine OSs to ultimately distributed systems of different kinds
along with other types: real-time systems and embedded systems. Side by side, system software in
its primitive form emerged earlier, and its continuous evolution in different forms has met the rising
potential of constantly emerging more advanced hardware and operating systems. In this chapter, an
overview of the generational developments of generic operating systems was given chronologically
and illustrated with representative systems, mentioning each one’s salient features and also draw-
backs. With the introduction of tiny, powerful microprocessors as well as small, fast, capacious main memories, more advanced forms of hardware organization and architecture constantly evolved using multiple processors; the significant hardware outcomes are powerful multiprocessor and multicomputer systems. Owing to the immense success of networks of computers using sophisticated hardware technology, two different types of multicomputer models emerged: computer networks and true distributed systems. Each requires the service of a different type of sophisticated and complex operating system to manage resources. This gave rise to two different types of operating systems, NOSs and true DOSs, each with its own targets and objectives. In recent years, a third alternative, clustering, built on the premise of low-cost computer networks, emerged; it is essentially a substitute for an expensive true distributed system and provides enormous computational flexibility and versatility. This system, however, needs a completely different type of operating system to work. Another form of operating system, called an RTOS, evolved to meet demands of different
kinds, known as real-time applications. This chapter concludes by discussing the genesis of modern
operating systems and the grand challenges of the days to come. Overall, this chapter is a short
journey through the evolution of the operating system that lays the foundation of operating system
concepts that will be discussed in the rest of this book.
EXERCISES
1. “In the early days of computers, there was no operating system, but there did exist a form
of operating system”: Justify this statement.
2. What is meant by generations of an operating system? In what ways have they been classified and defined?
3. State and explain the two main functions that an operating system performs. Discuss the
role of the system software in the operation of a computer system.
4. Explain the various functions of the different components of the resident monitor.
5. Why was timesharing not widespread on second-generation computers?
6. Why is spooling considered a standard feature in almost all modern computer systems?
7. How might a timesharing processor scheduler’s policy differ from a policy used in a batch
system?
8. State and differentiate between multiprogramming, multitasking, and multiprocessing.
Multitasking is possible in a single-user environment; explain with an example.
9. What is meant by the degree of multiprogramming? Discuss some factors that must be considered in determining the degree of multiprogramming for a particular system. You may assume a batch system with the same number of processes as jobs.
10. A multiprogramming system uses a degree of multiprogramming m ≥ 1. It is proposed to double the throughput of the system by modification/replacement of its hardware components. Give your views on the following three proposals in this context.
a. Replace the CPU with a CPU with double the speed.
b. Expand the main memory to double its present size.
c. Replace the CPU with a CPU with double the speed and expand the main memory to
double its present size.
11. Three persons using the same time-sharing system at the same time notice that the response
times to their programs differ widely. Discuss the possible reasons for this difference.
12. What is meant by interactive systems? What are the salient features of an interactive mul-
tiprogramming system? Write down the differences between multi-user and multi-access
systems.
13. Discuss the impact and contributions of MULTICS in the development of UNIX operating
systems.
14. “The third-generation mainframe operating system is a milestone in the evolution of mod-
ern operating systems”—justify the statement, giving the special characteristics of such
systems.
15. What is meant by computer networks? “A network operating system is a popular distrib-
uted system”: justify the statement with reference to the special characteristics of such a
system. Discuss the main features of a client–server model.
16. Discuss in brief the salient features and key advantages of the distributed operating system
used in a multiprocessor platform.
17. In order to speed up computation in a distributed system, an application is coded as four
parts that can be executed on four computer systems under the control of a distributed
operating system. However, the speedup obtained is < 4. Give all possible reasons that could lead to such a drop in speedup.
18. What are the main characteristics that differentiate a real-time operating system from
a conventional uniprocessor operating system? In a multiprogramming system, an I/O-
bound activity is given higher priority than non-I/O-bound activities; however, in real-time
applications, an I/O-bound activity will be given a lower priority. Why is this so?
19. A real-time application requires a response time of 3 seconds. Discuss the feasibility of
using a time-sharing system for the real-time application if the average response time in
the time-sharing system is: (a) 12 seconds, (b) 3 seconds, or (c) 0.5 seconds.
20. An application program is developed to control the operation of an automobile. The pro-
gram is required to perform the following functions:
a. Monitor and display the speed of the automobile.
b. Monitor the fuel level and raise an alarm, if necessary.
c. Monitor the state of the running car and issue an alarm if an abnormal condition arises.
d. Periodically record auxiliary information like speed, temperature, and fuel level (similar
to a black box in an aircraft).
Comment on the following questions with reasons in regard to this application:
i. Is the application a real-time one? Explain with justifications.
ii. What are the different processes that must be created in order to reduce the response
time of the application? What would be their priorities?
iii. Is it necessary to include any application-specific interrupts? If so, specify the interrupts, when they would appear, and their priorities.
WEBSITES
www.gnu.org/philosophy/free-software-for-freedom.html
2 Operating Systems: Concepts and Issues
Learning Objectives
• To explain the need for an operating system and to give an overview of the objectives and functions of a generic operating system, including its main role as a resource manager.
• To describe in brief the concepts and the general characteristics of the overall organization
of an operating system.
• To give an introductory concept of process and its different types, along with the different
views when it is observed from different angles.
• To describe the major issues in the design of generic operating systems, including interrupts and their processing, resource sharing and protection, scheduling, and many others.
• To describe interrupts and their different types, along with their working and servicing and
their differences from traps.
• To narrate the different types of schedulers and their level-wise organization.
• To articulate the various supports and services offered by an operating system in making
use of hardware resources providing the system calls, procedure calls, signals, message
passing, pipes, etc. required for various types of processing that help the user control the
working environment.
• To introduce the most important common factors that impact the design of generic operating systems.
To address all these issues while negotiating such complex requirements, a modular approach to operating system design was initially proposed, which resulted in a group-wise division of all the programs of the operating system into four resource categories, each with its respective major functions, as detailed below.
1. Keep track of the resources (processors and the status of the process). The program that performs this task is called the traffic scheduler.
2. Decide who will have a chance to use the processor. The job scheduler first chooses from all the jobs submitted to the system and decides which one will be allowed into the system. If multiprogramming, decide which process gets the processor, when, and for how long. This responsibility is carried out by a module known as the process (or processor) scheduler.
3. Allocate the resource (processor) to a process by setting up necessary hardware registers.
This is often called the dispatcher.
4. Reclaim the resource (processor) when the process relinquishes processor usage, termi-
nates, or exceeds the allowed amount of usage.
It is to be noted that the job scheduler is exclusively a part of process (processor) management,
mainly because the record-keeping operations for job scheduling and processor scheduling are very
similar (job versus process).
1. Keep track of the resources (memory). What parts are in use and by whom? What parts are
not in use (called free)?
2. If multiprogramming, decide which process gets memory, when it gets it, and how much.
3. Allocate the resource (memory) when the processes request it and the policy of item 2
allows it.
4. Reclaim the resource (memory) when the process no longer needs it or has been terminated.
1. Keep track of the resources (devices, channels, buses, control units); this program is typically called the I/O traffic controller.
2. Decide what an efficient way is to allocate the resource (device). If it is to be shared, then decide who gets it and for how long (duration); this is called I/O scheduling.
3. Allocate the resource and initiate the I/O operation.
4. Reclaim resource. In most cases, the I/O terminates automatically.
1. Keep track of the resources (information) and their location, use, status, and so on. These collective facilities are often called the file system.
2. Decide who gets the resources, enforce protection requirements, and provide accessing routines.
3. Allocate the resource (information); for example, open a file.
4. Deallocate the resource; for example, close a file.
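All four of the lists above follow the same four-step pattern: keep track of the resource, decide who gets it, allocate it, and reclaim it. A minimal sketch of that shared shape is given below as a small C interface; the struct and function names are invented purely for illustration and are not drawn from any actual operating system.

/* A hypothetical "resource manager" interface reflecting the four common
 * functions: keep track, decide, allocate, reclaim/deallocate. */
#include <stdio.h>

struct resource_manager {
    const char *name;                    /* e.g. "processor", "memory" */
    void (*track)(void);                 /* 1. keep track of the resource */
    int  (*decide)(int requester);       /* 2. decide who gets it and for how long */
    void (*allocate)(int requester);     /* 3. allocate it */
    void (*reclaim)(int requester);      /* 4. reclaim/deallocate it */
};

/* Trivial stand-ins so the sketch runs; a real manager would keep tables,
 * queues, and policies behind each of these calls. */
static void mem_track(void)     { printf("memory: scanning free list\n"); }
static int  mem_decide(int r)   { (void)r; return 1; /* always grant, for illustration */ }
static void mem_allocate(int r) { printf("memory: allocated to requester %d\n", r); }
static void mem_reclaim(int r)  { printf("memory: reclaimed from requester %d\n", r); }

int main(void) {
    struct resource_manager memory = {
        "memory", mem_track, mem_decide, mem_allocate, mem_reclaim
    };
    memory.track();
    if (memory.decide(42))
        memory.allocate(42);
    memory.reclaim(42);
    return 0;
}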
This approach is the central theme and was sufficient to derive the conceptual design of an operating system in a multiprogramming environment. Optimal management of available hardware resources with almost no human intervention during runtime is a major function that has been achieved with this operating system design. The introduction of faster hardware technologies and techniques, innovative designs with ever-increasing processor speed, memory, and other resources, and finally a sharp drop in
the cost of computer hardware paved the way for the emergence of a new concept of parallel architec-
ture in the design and organization of computer systems using many CPUs and larger memories. As
a result, additional complexities are imposed in the conceptual design of existing operating systems
to properly drive this more powerful hardware organization. Further improvements in different direc-
tions in the design of operating systems also have occurred in order to provide a multi‑user environ-
ment that requires sharing and separation of local and global hardware resources. A multiprocessor
computer system, on the other hand, demands an altogether different type of operating system, a
different concept in the design of the operating system that provides a multiprocessing environment.
Since the type of fundamental resources present in modern computers remains unchanged, and only
the speed, capacity, and number of resources attached have signifcantly increased, the central theme
of resource management by and large remains the same and has been treated as a backbone for all
types of emerging concepts in the design of all modern operating systems.
Since many different modules are present in the operating system for the management of various types of resources, one of the major problems faced by operating system designers is how to manage and resolve the complexity of these numerous management functions at many different levels of detail while offering a product that will be sufficiently efficient, reasonably reliable, easy to maintain, and above all convenient for the user. Operating system design in its early stages was proposed and realized in the form of monolithic structures. Later, for larger systems, improved versions of this concept were developed in terms of a hierarchy of levels (layers) of abstraction, with an aim to hide information: the details of the algorithms and related data structures used in each manager are confined within the respective module. Each module is entrusted to perform a set of specific functions on certain objects of a given type. However, the details of any module's operation and the services it provides are neither visible/available nor of concern to its users. The details of this approach are cited and explained in later sections.
Now, the obvious question is how the operating system handles a user's job when it is submitted, and in what way, and at what instant, each of these different resource managers will come into action.
kind. Each such activity comprises one or more operations. To realize each operation, one or more instructions must be executed. Thus, a process comes into being when such a set of instructions is executed, and these instructions are primitive (machine) instructions, not the user's. In fact, a process is a fundamental entity that requires resources (hardware and software) to accomplish its task. Alternatively, a process is a piece of computation that can be considered the basis of the execution of a program. In brief, a process is considered the smallest unit of work that is individually created, controlled, and scheduled by the operating system.
Thus, a process is precisely an instance of a program in execution. For example, the “copy” pro-
gram is simply a collection of instructions stored in the system. But when this “copy” program runs
on a computer, it gives rise to a process, the “copy” process. Multiple processes can be executing
the same program. Processes are considered a primary operating-system mechanism for defning,
determining, and managing concurrent execution of multiple programs.
A different approach in expressing processes is to consider them agents representing the intent
of users. For example, when a user wants to compile a program, a process (compilation process)
runs a different specifc program (compiler) that accepts the user program as data and converts the
user program into another form, known as an object program. When the user wants to execute the
object program, a process (perhaps a new one) runs a different program (linker) that knows how to
convert the object program into a new form to make it runnable (executable). In general, processes
run specifc appropriate programs that help the user achieve a goal. While performing their task,
the processes may also need help from the operating system for such operations as calling specifc
programs and placing the converted program in long-term storage. They require resources, such as
space in the main storage and machine cycles. The resource principle says that the operating system
is the owner while allocating such resources.
Another view of a process is the locus of points of a processor (CPU or I/O) executing a collec-
tion of programs. The operation of the processor on a program is a process. The collection of pro-
grams and data that are accessed in a process forms an address space. An address space of a job is
defined as the area of main memory occupied by the job. Figure 2.1 depicts the relationship between
user, job, process, and address space, with two sample address spaces, one for the CPU process,
the other for an I/O process. The operating system must map the address spaces of processes into
physical memory. This task may be assisted by special hardware (e.g. a paged system), or it may be
primarily performed by software (e.g. a swapping system).
However, we now start with the CPU, the most important resource, required by every user, so we
need an abstraction of CPU usage. We define a process as the OS's representation of an executing
program so that we can allocate CPU time to it. Apart from executing user jobs, the CPU is also
often engaged in performing many other responsibilities, mainly servicing all types of operating
system requests by way of executing related OS programs to manage all the existing resources and
to create the appropriate environment for the job under execution to continue. All these CPU activi-
ties are, by and large, identical and are called processes but give rise to different types of processes,
namely user processes and OS processes.
This process abstraction turns out to be convenient for other resource management and usage
besides the CPU, such as memory usage, file system usage (disk/tape), network usage, and number
of sub-processes (i.e. further CPU utilization), which are all assigned on a per-process basis.
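Since all of these usages are accounted on a per-process basis, an operating system typically groups them in a single per-process record (commonly called a process control block). The fields below are a simplified, hypothetical illustration of such per-process accounting, not the layout of any particular system.

/* A simplified, hypothetical per-process record grouping the resources that
 * are assigned and accounted on a per-process basis. */
#include <stdio.h>
#include <stddef.h>

enum proc_state { P_READY, P_RUNNING, P_BLOCKED, P_TERMINATED };

struct process_record {
    int             pid;            /* identity of the process */
    int             owner_uid;      /* the user on whose behalf it runs */
    enum proc_state state;

    unsigned long   cpu_time_used;  /* CPU accounting (e.g. in ticks) */
    size_t          memory_bytes;   /* main-memory usage */
    int             open_files;     /* file-system usage */
    unsigned long   net_bytes;      /* network usage */
    int             num_children;   /* sub-processes created */
};

int main(void) {
    struct process_record p = {
        .pid = 101, .owner_uid = 1000, .state = P_READY,
        .cpu_time_used = 0, .memory_bytes = 64 * 1024,
        .open_files = 3, .net_bytes = 0, .num_children = 0
    };
    printf("process %d owned by uid %d uses %zu bytes of memory\n",
           p.pid, p.owner_uid, p.memory_bytes);
    return 0;
}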
When a particular program code resident in memory is shared by more than one user at the
same time, this memory sharing will result in different processes displaced in time and probably in
data. Thus processes are truly a unit of isolation. Two processes cannot directly affect each other’s
behavior without making explicit arrangements to do so. This isolation extends itself to more gen-
eral security concerns. Processes are tied to users, and the credentials of running on a user’s behalf
determine what resources the process can use.
It should be noted that there exists a clean and clear distinction between a program and a pro-
cess. A program consists of static instructions which can be stored in memory. A program is thus
basically a static, inanimate entity that defines process behavior when executed on some set of data.
FIGURE 2.1 In a generic operating system, the relationship between User, Job, and Process when mapped by
the operating system in two sample address spaces; one for the CPU process, and the other for an I/O process.
A process, on the other hand, is a dynamic entity consisting of a sequence of events that result from the execution of the program's instructions. A process has a beginning, undergoes frequent changes in state and attributes during its lifetime, and finally has an end. The same program code can result in many processes. While a program with branch instructions is under execution, the CPU may leave
the normal sequential path and traverse different paths in the address space depending on the out-
come of the execution of branch instructions present in the program. Each locus of the CPU, while
traversing these different paths in address space, is a different process. A single executable program
thus may give rise to one or more processes.
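This distinction can be observed directly on a UNIX-like system, where one program, once run, may give rise to several processes. The short POSIX sketch below is illustrative only: after fork(), the same program text is executed by two distinct processes, each following its own locus of execution.

/* One program, two processes: after fork(), parent and child execute the
 * same program text but are distinct, independently scheduled processes. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    printf("program started as process %d\n", (int)getpid());

    pid_t pid = fork();                 /* create a second process */
    if (pid == 0) {
        /* child: same program, different process and different path */
        printf("child  process %d doing its own work\n", (int)getpid());
    } else if (pid > 0) {
        printf("parent process %d continues separately\n", (int)getpid());
        wait(NULL);                     /* reap the child */
    } else {
        perror("fork");
        return 1;
    }
    return 0;
}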
Although process is considered the classic unit of computation, some modern operating systems
have a specialized concept of one or both of two additional fundamental units of computation:
threads and objects. We will discuss these two units later in this chapter. It is true that there exists
no explicit relationship between threads and objects, although some designers have used threads to
realize (implement) objects. While most operating systems use the process concept as the basis of
threads or objects, a few systems, however, implement these alternatives directly in the design of the
operating system as well as in higher-level system software.
A further description of the process when viewed from different angles, such as the user, operat-
ing system, and system programmer views, is given on the Support Material at www.routledge.com/
9781032467238.
A system process (OS process) results when the operating system program is under execution to initiate and manage computer-system resources. A user process results when a user program is under execution to solve its own problems. A user process may, however, be of two types: a CPU process or an I/O process.
User processes compete with one another to gain control of system resources by requesting operating-system services and are assigned resources accordingly. System processes, on the other hand, are assigned an initial set of resources, and these pre-assigned resources remain unaltered during the lifetime of the process. A user process cannot create another process directly; it can only do so by issuing requests to operating system services. A user process always has a program data block (PDB) for its own use, offered by the operating system, but a system process never requires one. User processes need the assistance of system processes when issuing system service requests for many reasons, for example, at the instant of a transition from a CPU process to an I/O process.
• Processor/process management
• Memory management
• Device management
• File management
The remainder of the main memory contains other user programs and data. The allocation of this
memory space (resource) is controlled jointly by the operating system and memory-management
hardware located mainly within the processor. When the operating system program is executed by
the processor, the result is that it directs the processor to use other system resources and controls the
timing of the processor’s execution with other programs.
Since the processor itself is a rich resource and has to do many different things, it must cease
executing the operating system program and execute other programs. Thus, the operating system
has to relinquish control for the processor to allow it to do useful and productive work for the user,
and then it resumes control once again at the appropriate time to prepare the processor to do the next
piece of work. While the user program is under execution, the different management functions of
the operating system will come into operation at different times according to the intent of the user
program. For example, the normal activity of any job could be halted by the occurrence of a defined event, such as an I/O instruction or an instruction issued by the user seeking a system service. The operating system decides when an I/O device can be allocated to an executing program and controls access to it (device management) and allows use of files (file management). Because of switching from one management function to another while the user program is under execution by the processor (processor management), or when a request is issued from the user program for the processor to execute the operating-system program, and for many other similar reasons, the normal processing of the user program by the processor is frequently suspended. Such an event is known as an interrupt, since it simply interrupts the ongoing processing; in this situation, the operating system must have the provision to intervene and restore normalcy by resolving the interrupt so that the ongoing processing can continue. This principal tool was extensively used by system programmers in the early days of developing multiprogramming and multi-user interactive systems.
FIGURE 2.2 Location and logical view of functioning of the operating system when it is event-driven in a
computing environment.
care of the situation, and does what is needed to service the event. Consequently, an event handler
routine (interrupt servicing routine) provided by the OS is executed by the CPU to resolve the event.
This physical view is one of the basics for developing a concept about the operating system and its
working that will help to formulate the set of functions required in its design.
User applications always start with the CPU in control. Interrupts are the primary means by which
I/O systems obtain the services of the CPU. With an interrupt, the performance of the computer is
greatly increased by allowing I/O devices to continue with their own operations while the processor
can be engaged in executing other instructions in parallel.
Figure 2.3 shows the interrupt in an application program from the system point of view. Here, the
user program has a WRITE instruction interleaved with processing. This WRITE instruction is basi-
cally a call to a WRITE program (an I/O program), which is a system utility that will perform the actual
I/O operation. When the user program encounters this WRITE instruction during execution, as shown
in Figure 2.3, normal execution is interrupted; it makes a call to the operating system in the form of a
WRITE call. In this case, the WRITE program is invoked that consists of a set of instructions for prepa-
ration (initiation) of the real I/O operation and the actual I/O command which drives the I/O device to
perform the requested functions. After execution of a few of these instructions, control returns to the
user program, allowing the CPU to execute other instructions. Meanwhile, the I/O device remains busy
accepting data from computer memory and printing it. In this way, this I/O operation is carried out
simultaneously and overlaps with the CPU’s instruction executions in the user program.
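On a UNIX-like system, the WRITE call of this example corresponds, for instance, to the write() system call. The sketch below is illustrative and assumes typical kernel buffering: the call traps into the operating system, which initiates the transfer, and ordinarily returns once the data have been copied to a kernel buffer, so the program continues computing while the device completes the output.

/* Illustration of an I/O request interleaved with processing: the write()
 * system call hands the data to the OS, and the program goes on computing. */
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *msg = "report line 1\n";

    /* WRITE call: trap into the OS, which initiates the device operation.
     * Usually the data are buffered in the kernel and the call returns,
     * letting the CPU execute the instructions that follow. */
    if (write(STDOUT_FILENO, msg, strlen(msg)) < 0)
        return 1;

    /* ... processing continues here while the device works on the output ... */
    long sum = 0;
    for (long i = 0; i < 1000000; i++)
        sum += i;

    return (sum < 0);   /* keep the compiler from optimizing the loop away */
}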
When the I/O device has completed its scheduled operation or is ready to accept more data from
the CPU, the I/O module for that device then sends an interrupt‑request signal to the processor. The
FIGURE 2.3 In an executing program, the program fow of control when an interrupt occurs, indicated by
an asterisk (*) in the respective instruction of the program.
processor immediately responds by suspending its own operation on the current program, branch-
ing off to execute the interrupt handler program to service that particular I/O device, and then back
once again to resume its original execution after the device is serviced. The point at which the inter-
rupt occurs is indicated by an asterisk (*) in Figure 2.3.
From the user program point of view, an interrupt is an event that breaks the normal sequence
of program execution, and the CPU is then temporarily diverted to execute the corresponding ISR.
When the execution of this ISR routine is over, the interrupt processing (servicing) is completed,
and control once again comes back to resume the original interrupted program from the point where
control was transferred, as illustrated in Figure 2.4. Thus, the user program need not have any spe-
cial code to accommodate interrupts; the processor and operating system are jointly responsible for
suspending the user program and subsequently resuming it from the point it left off. In brief, inter-
rupts are used primarily to request the CPU to initiate a new operation, to signal the completion of
an I/O operation, and to signal the occurrences of hardware and software errors or failures.
A key concept related to interrupts is transparency. When an interrupt happens, actions are taken and a program (the ISR) runs, but when everything is finished, the computer should be returned to
exactly the same state as it was before the occurrence of the interrupt. An interrupt routine that has
this property is said to be transparent. Having all interrupts be transparent makes the entire inter-
rupt process a lot easier to understand.
Traps are essentially interrupts but are generated internally by a CPU and associated with the
execution of the current instruction and result from programming errors or exceptional conditions
such as an attempt to:
1. Divide by zero
2. Floating-point overflow
3. Integer overflow
4. Protection violation
5. Undefined op-code
6. Stack overflow
7. Start non-existent I/O device
8. Execute a privileged instruction (system call) when not in a privileged (supervisor mode) state
With a trap, the operating system determines whether the error is fatal. If so, the currently running
process is abandoned, and switching to a new process occurs. If not, then the action of the operat-
ing system will depend on the nature of the error and the design of the operating system. It may go
on to attempt a recovery procedure, or it may simply alert the user. It may even carry out a switch
of process, or it may resume the currently running process (see also processor modes: mode bit).
The essential difference between interrupts and traps is that traps are synchronous with the pro-
gram, and interrupts are asynchronous. If the program is rerun a million times with the same input,
traps will reoccur in the same place each time, but interrupts may vary, depending on the run-time
environment. The reason for the reproducibility of traps and irreproducibility of interrupts is that
traps are caused directly by the program and solved by jumping to a procedure called a trap handler,
and interrupts are, at best, indirectly caused by the program.
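On UNIX-like systems, many of these traps are reflected back to the offending process as signals; on many platforms (x86 Linux, for example) an integer division by zero is delivered as SIGFPE. The POSIX sketch below is illustrative only: the handler reports the trap and abandons the process, mirroring the fatal-error path described above, and the trap recurs at the same instruction on every run, illustrating the reproducibility of traps.

/* A divide-by-zero trap delivered to the process as SIGFPE (POSIX). */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void on_trap(int sig) {
    /* Resuming after a hardware arithmetic fault is not meaningful here,
     * so the handler reports the trap and abandons the process. */
    (void)sig;
    const char msg[] = "trap: arithmetic fault (SIGFPE), terminating\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(EXIT_FAILURE);
}

int main(void) {
    signal(SIGFPE, on_trap);

    volatile int divisor = 0;        /* deliberately provoke the trap      */
    int x = 10 / divisor;            /* synchronous: same place on each run */

    printf("never reached: %d\n", x);
    return 0;
}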
Hardware Actions
1. The device controller issues an interrupt signal to tell the processor to start the interrupt
sequence.
2. The processor completes its execution on the current instruction before responding to the
interrupt. However, an immediate response is sometimes required with no waiting for
the completion of the current instruction to service time-critical interrupts. Such an imme-
diate response will result in the loss of the current instruction processing.
3. As soon as the CPU is prepared to handle the interrupt, it asserts an interrupt acknowledge
signal on the bus to the device that issued the interrupt. This acknowledgement ensures the
device removes its interrupt signal.
4. When the device controller finds that its interrupt signal has been acknowledged, it puts a
small integer on the data line to identify itself. This number is called the interrupt vector.
5. The CPU takes the interrupt vector from the bus and saves it temporarily.
6. The CPU now prepares to transfer control to the ISR. It needs to save the information
needed in the future to resume the current program again after servicing the interrupt. The
minimum information required to save is the program status word (PSW) and the address
of the next instruction to be executed, which is contained in the program counter (PC). The
CPU saves the PC and PSW onto the system control stack (see Figure 2.5a).
7. The CPU then locates a new PC by using the interrupt vector as an index in a table at the
bottom of memory. If the PC is 4 bytes, for example, then interrupt vector n corresponds
to address 4n in the table. This new PC points to the start of the ISRs for the device caus-
ing the interrupt. Loading the PC with the starting address of the appropriate interrupt-
handling program that will respond to the specifed interrupt depends on the architecture
of the computer and the design of the operating system, because there may be different
programs for different types of interrupts and even for different types of devices.
Once the PC has been loaded, the current content of the PC eventually results in the transfer of con-
trol to the beginning of the interrupt-handling program. The start of the execution of this program
begins the software actions resulting in the following operations:
FIGURE 2.5 When an interrupt is serviced, the changes made in memory, registers and stack.
Software Actions
1. The first thing the ISR does is to save all the processor registers on a system control stack
or in a system table so that they can be restored later, because these registers may be used
by the current program (interrupt handler). Any other “state” information may also need to
be saved. For example, as shown in Figure 2.5, assume that a user program is interrupted
after the instruction at location I. The contents of all of the registers and the address of the
next instruction are now pushed on to the system control stack. The stack pointer is gradu-
ally changed due to this pushing and accordingly updated from its initial content to the new
top of the stack. The PC is now updated to point to the beginning address of the ISR.
2. Each interrupt vector is generally shared by all devices of a given type (such as a terminal),
so it is not yet known which terminal caused the interrupt. The terminal number can be
found by reading a device register.
3. Any other information about the interrupt, such as status codes, can now be read in.
4. If an I/O error occurs, it can be handled here. If required, a special code is output to tell the
device or the interrupt controller that the interrupt has been processed.
5. Restore all the saved registers (Figure 2.5b).
6. The fnal step is to restore the PSW and PC values from the stack. This ensures that the
next instruction to be executed will be the instruction from the previously interrupted
program.
7. Execute the RETURN FROM INTERRUPT instruction, putting the CPU back into the
mode and state it had just before the interrupt happened. The computer then continues as
though nothing had happened.
Like subroutines, interrupts have linkage information, such that a return to the interrupted program
can be made, but more information is actually necessary for an interrupt than a subroutine because
of the random nature of interrupts.
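The sequence above can be summed up as a skeleton interrupt service routine. The following C-style pseudocode is only a conceptual sketch: every helper name (save_registers, read_device_register, and so on) is a hypothetical placeholder for work that is actually done by hardware or by low-level assembly stubs, not a real API.

    /* C-style pseudocode of an ISR skeleton for a terminal-type device;
       all helper names are hypothetical placeholders. */
    void terminal_isr(void)
    {
        save_registers();                   /* step 1: preserve processor registers */
        int unit = read_device_register();  /* step 2: which terminal interrupted   */
        int status = read_status_code();    /* step 3: read further status codes    */
        if (status_is_error(status))
            handle_io_error(unit, status);  /* step 4: deal with any I/O error      */
        restore_registers();                /* step 5: restore the saved registers  */
        return_from_interrupt();            /* steps 6-7: pop PSW and PC, resume    */
    }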
When an ISR completes its execution, the processor checks to see if other interrupts have already
occurred. If so, the queued interrupts are then handled in strict sequential order, as shown in
Figure 2.6.
This approach is nice and simple, but the drawback is that it does not consider the relative priority
or time-critical needs of interrupts waiting in the queue. For example, when an interrupt is issued by
a communication line at the time of input arrival, it may need to be serviced immediately to allow
the next input to come. If the frst set of input has not been processed before the second set arrives,
data may be lost.
So another approach in interrupt processing is to accommodate the priorities of interrupts. This
means that while a lower-priority interrupt processing is under execution, a higher-priority interrupt
FIGURE 2.6 All the interrupts lying in queue, when the current interrupt is under processing, are then ser-
viced in strict sequential order one after another.
FIGURE 2.7 A higher-priority interrupt interrupts the ongoing lower-priority interrupt processing, and the
control is transferred accordingly.
can interrupt the ongoing interrupt processing, and the higher-priority interrupt processing will be
started. When this servicing is over, the interrupted lower-priority interrupt processing will once
again resume, and when this processing completes, control fnally returns to the user program. The
fow of control of this approach is shown in Figure 2.7.
The identifying number supplied by the interrupting device is also called the interrupt vector. This
is the fastest and most flexible response to interrupts, since it causes a direct, hardware-implemented
transition to the correct interrupt-handling routine. This technique, called vectoring, can be implemented in a number of
ways. In some computers, the interrupt vector is the frst address of the I/O service routine.
In other computers, the interrupt vector is an address that points to a location in memory
where the beginning address of the I/O service routine is stored. This is illustrated in
Figure 2.8. In Intel 80386, the interrupt vectors are 8-byte segment descriptors, and the
table containing the address of the ISRs can begin anywhere in memory.
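In C terms, such a vectored dispatch can be pictured as a table of function pointers indexed by the interrupt vector. The sketch below is purely illustrative: the handlers and the table layout are invented and do not correspond to any particular processor, on which the table is consulted by hardware rather than by C code.

    /* Conceptual model of an interrupt vector table: vector n selects the
       n-th entry, which holds the starting address of the matching ISR. */
    typedef void (*isr_t)(void);

    static void clock_isr(void)    { /* service the clock interrupt    */ }
    static void disk_isr(void)     { /* service the disk interrupt     */ }
    static void terminal_isr(void) { /* service the terminal interrupt */ }

    static isr_t interrupt_vector_table[] = {
        clock_isr,      /* vector 0 */
        disk_isr,       /* vector 1 */
        terminal_isr,   /* vector 2 */
    };

    /* On a real machine this indexing and jump are performed by hardware. */
    void dispatch(int vector)
    {
        interrupt_vector_table[vector]();   /* load PC with the handler address */
    }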
FIGURE 2.8 Interrupt vectors in memory point to locations in memory where the beginning addresses of
the respective interrupt service routines are stored.
In space-multiplexed sharing, a resource is divided into units so that several processes can be given
control of different units of the resource at the same time. Memory and I/O devices are examples
of space-multiplexed resources. In the case of time-multiplexed sharing, a resource will not be
divided into units. Instead a program or a process will be allocated to get exclusive control of the
entire resource for a short specifed period of time. After this specifed time, the resource will be
de-allocated from the currently executing process and will then be allocated to another. Time-
multiplexing is usually done with the processor resource. Under this scheme, a single processor
in the machine is switched among processes that also hold other resources, like memory
space and I/O devices. In this way, it creates an illusion to the user that the concurrently execut-
ing processes are really running simultaneously, although the fact is that the execution is strictly
sequential. Hence, while referencing concurrent execution, it means that either the execution may
actually be simultaneous (in the case of a multiprocessor) or that the single processor is time-
multiplexed based on certain pre-defned allocation policy (scheduling) across a set of processes
holding space-multiplexed resources.
While encouraging and allowing resource sharing, the operating system should enforce resource
isolation. This means that the system must be able to reliably isolate resource accesses with sufficient
protection, based on a certain pre-defined allocation policy (scheduling). The system must also be
able to allow resources to be shared co-operatively when required, without causing damage. For
example, the operating system must provide a memory isolation and protection mechanism that
should ensure loading of two or more programs in different parts of memory at the same time. It
should not only provide suffcient protection to prevent any unauthorized access but ensure that
neither program will be able to change or reference the memory contents being used by the other
program. The operating system must guarantee that the OS codes are kept well protected, allowing
sharing while in main memory, but are not overwritten by any user program. Protection hardware
is often used by the operating system to implement such control access to parts of memory.
Similarly, the processor isolation mechanism should ensure that processes share the processor
sequentially according to a pre-defined allocation policy (scheduling). The processor should also be
protected from being indefinitely monopolized by any user program for any unforeseen reason.
For example, due to an error in a user program, the CPU may enter an infinite loop and get stuck
without returning control to the operating system. In this situation, the system must have some mechanism,
typically with the help of a timer, to interrupt the CPU and regain control in order to restore normal operation.
The OS software in all such cases really depends on the available hardware support to implement
key parts of the mechanism to fully ensure resource isolation with protection. While the operating
system implements the abstraction directly from the physical resources, it provides the basic trusted
mechanisms to realize and manage resource sharing.
A scheduler is an OS module that decides which job is to be selected next and elects the next process
to run. The scheduler is concerned with deciding on and enforcing policy, the strategy that determines
who gets what, when, and how much, but it never provides the implementation mechanism itself.
We can distinguish several classes of scheduling based on how decisions must be made. Four
types of scheduling are typically involved, as shown in Figure 2.9. One of these is I/O scheduling
that takes the decision as to which process’s pending I/O request shall be handled by an available
I/O device. Each device has a device scheduler that selects a process’s I/O request from a pool of
processes waiting for the availability of that particular device. This issue is discussed in more detail
in the following chapter, “Device Management”.
The remaining three types of scheduling are types of processor scheduling that are concerned
with the assignment of processor/processors to processes in such a way as to attain certain sys-
tem objectives and performance criteria, such as processor utilization, effciency, throughput, and
response time. This scheduling activity is once again broken down into three separate functions:
long-term, medium-term, and short-term scheduling, and the corresponding three types of
schedulers are long-term, medium-term, and short-term schedulers. All of them may sometimes
simultaneously exist in a complex operating system, as depicted in Figure 2.9.
FIGURE 2.9 Four types of schedulers in a modern uniprocessor operating system and their operational
interactions and relationships.
(memory requirement), device requirements, expected execution time, and other related information
about the job. Being equipped with such knowledge beforehand, the job scheduler could then select
a job from its queue that would ensure system balance, that is, to maintain a proper mix of a desired
proportion of processor- and I/O-bound jobs in the system depending on the current environment of
the system and availability of resources at that moment.
The long-term scheduler acts here as a frst-level regulatory valve in attempting to always keep
resource utilization at the desired level. If the scheduler at any point in time detects that proces-
sor utilization has fallen below the desired level, it may admit more processor-bound jobs into the
system to increase the number of processes to attain system balance. Conversely, when the utiliza-
tion factor becomes high due to the presence of too many processor-bound jobs in the system, as
refected in the response time, it may opt to reduce the admission rate of batch-jobs accordingly.
Since this scheduler is usually invoked whenever a job is completed and departs the system, the
frequency of invocation is thus both system- and workload-dependent, but it is generally much lower
than that of the other two types of scheduler. The invocation of this scheduler usually occurs after a
relatively long time, hence its name.
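As a rough illustration of this regulatory-valve role, the fragment below sketches how a long-term scheduler might decide whether to admit another job; the thresholds, limits, and helper names are invented for the example and do not come from any particular system.

    /* Hypothetical admission check, run whenever a job completes and departs. */
    #define TARGET_CPU_UTIL  0.85    /* desired processor utilization        */
    #define MAX_MULTIPROG    20      /* upper bound on admitted processes    */

    int should_admit(double cpu_util, int degree_of_multiprogramming,
                     int job_is_cpu_bound)
    {
        if (degree_of_multiprogramming >= MAX_MULTIPROG)
            return 0;                      /* system already saturated        */
        if (cpu_util < TARGET_CPU_UTIL)
            return job_is_cpu_bound;       /* utilization low: admit more     */
                                           /* processor-bound work            */
        return !job_is_cpu_bound;          /* utilization high: throttle      */
                                           /* processor-bound admissions      */
    }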
Since an exact estimate of the workload’s characteristics is not regularly available, and the pres-
ence of several parameters and their various combinations often create situations in the system, all
these together ultimately require the scheduler to incorporate rather complex and computationally
intensive algorithms while selecting a job to admit into the system. Once scheduled for execution,
a job or user program is admitted into the system with a transition of state from dormant (submit)–
to–ready and then spawns processes which fnally enter the ready queue awaiting the processor
allocation controlled by the short-term scheduler. This issue will be discussed in detail in Chapter
4,“Processor Management”.
and medium-term scheduling is primarily for performance improvement related to the degree of
multiprogramming, whereas medium-term scheduling itself is an issue closely tied to memory management.
Chapter 5 discusses the intricacies of space management and describes policies for medium-term
scheduling.
FIGURE 2.10 Level-wise presentation of the three levels of schedulers, namely the short-term scheduler,
medium-term scheduler, and long-term scheduler.
User-related services make the computer system more convenient, easy to program, and also
friendly to the users and support them in executing their tasks using the environment consisting
of machine resources and people, like operators, programmers, and system administrators. While
the program is under execution, these services interact with the operating system to make use of
system-related services.
System-related services are mainly concerned with the control and management of system
resources for the effcient operation of the system itself. The user often makes use of these services
for support like resource allocation, access control, quick response, resource isolation with protec-
tion, proper scheduling, reliable operation, and many similar functions and aspects offered by dif-
ferent resource managements, as already discussed.
Operating system users generally use both these services and are commonly divided into two
broad classes: command language users and system call users. Command language users, infor-
mally, are those who invoke the services of the operating system by means of typing in commands
at the terminal or by embedding commands as control cards (like $FTN, $RUN, etc., as discussed in
the previous chapter) or embedding commands in a batch job. We will discuss this issue later in this
chapter. System-call users, on the other hand, obtain the services of the operating system by means
of invoking different system calls during runtime to exert fner control over system operations and
to gain more direct access to hardware facilities, especially input/output resources. System calls are
usually embedded in and executed during runtime of programs.
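For instance, on a UNIX-like system a system-call user can copy its input to its output purely through the read and write system calls, as in the short C fragment below, given only as an illustration of how such calls are embedded in a program and executed at run time.

    #include <unistd.h>

    /* Copy standard input to standard output using only system calls. */
    int main(void)
    {
        char buf[512];
        ssize_t n;

        while ((n = read(0, buf, sizeof buf)) > 0)   /* fd 0: standard input  */
            write(1, buf, (size_t)n);                /* fd 1: standard output */
        return 0;
    }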
After executing the required system-related operations while in supervisor mode, the control is
again returned to the user program for further resumption; hence, a change in state, from supervisor
state (mode) to user state (mode), is once again required. The operating system must execute a privi-
leged instruction for such switching of the state to return to the user state before entering the user
program. While the intent of system calls and interrupt processing is primarily the efficient use of
system resources, they are themselves expensive in terms of processor-time consumption, leading
to a slight degradation in overall system performance.
FIGURE 2.11 Mechanism and flow of control when a system call works in the operating system.
However, different operating systems provide various numbers and types of system calls that
span a wide range. The system calls are mostly used to create processes, manage memory, control
devices, read and write fles, and do different kinds of input/output, such as reading from the termi-
nal and writing on the printer/terminal.
A description regarding the working of a system call in relation to Figure 2.11 is given on the
Support Material at www.routledge.com/9781032467238.
• Process management system calls: fork, join, quit, exec, and so on.
• Memory management system calls: brk.
• File and directory system calls: creat, read, write, lseek, and so on.
• Input/output system calls: cfsetospeed, cfsetispeed, cfgetospeed, cfgetispeed, tcsetattr,
tcgetattr, and so on.
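As a small illustration of the process-management calls just listed, the fragment below uses the classical UNIX combination of fork, exec, and wait (wait playing the role of join, and the program being launched, /bin/ls, chosen arbitrarily).

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* The parent creates a child (fork); the child overlays itself with a new
       program (exec); the parent waits for the child to finish (join in UNIX). */
    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {                                   /* child process      */
            execl("/bin/ls", "ls", "-l", (char *)NULL);
            _exit(1);                                     /* only if exec fails */
        } else if (pid > 0) {                             /* parent process     */
            int status;
            wait(&status);
            printf("child finished with status %d\n", WEXITSTATUS(status));
        } else {
            perror("fork");
        }
        return 0;
    }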
Operating system design based on a system call interface has the interesting property that there is
not necessarily any OS process. Instead, a process executing in user mode gets changed to supervi-
sor mode when it intends to execute a core program of operating system services (kernel code) and
back to user mode when it returns from the system call. But, if the OS is designed as a set of separate
processes, it is naturally easier to design and implement it so that it can get control of the machine in
special situations. This is more conducive than the case if the core operating system (kernel) is sim-
ply built with only a collection of functions executed in supervisor mode by user processes. Process-
based operating system design has nicely exploited this concept. This is discussed in Chapter 3.
• Procedure calls are generally considered not appropriate to manage protection hardware
when a change in the privilege level occurs due to transfer of control from a caller to a cal-
lee who is located on the other side of the OS–user boundary.
• Procedure calls mostly require direct addressing that must specify the actual address of the
respective OS routine while invoking specifc OS services in the user program for the sake of
linking and binding to create the load module. It expects that the programmer at the time of
implementation has suffcient knowledge and expertise in this area, but this is actually rare.
However, the concept of generic procedure call mechanism was later extended in a different way
to realize a widely used technique known as a remote procedure call mechanism that is used to
call procedures located on other machines in a client–server environment. This call mechanism,
however, resides at the higher level of the operating system. In spite of having many complications
and hazards in the implementation of RPC, most of them have been successfully dealt with, and
this technique is popular and still in heavy use in client–server environments that underlie many
distributed operating systems.
A brief explanation relating to shortcomings of procedure calls used to obtain OS services is
given on the Support Material at www.routledge.com/9781032467238.
FIGURE 2.12 For the trap-instruction operation, the change in CPU mode from user mode to supervisor mode.
The mode bit can be used in defning the domain of memory that can be accessed when the
processor is in supervisor mode versus while it is in user mode. If the mode bit is set to supervisor
mode, the process executing on the processor can access either the supervisor partition (system
space) or user partition (user space) of the memory. But if user mode is set, the process can reference
only the user-reference space.
The mode bit also helps to implement protection and security mechanisms on software by way of
separating operating system software from application software. For example, a protection mecha-
nism is implemented to ensure the validity of processor registers or certain blocks of memory that
are used to store system information. To protect these registers and memory blocks, privileged load
and store instructions must be used to willfully manipulate their contents, and these instructions can
only be executed in supervisor mode, which can only be attained by setting the mode bit to supervi-
sor mode. In general, the mode bit facilitates the operating system’s protection rights.
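The effect of the mode bit can be modelled by the toy C fragment below; the structure and names are invented purely for illustration, since on a real machine these checks are wired into the processor itself.

    /* Toy model (not real hardware) of decisions governed by the mode bit. */
    enum mode { USER_MODE = 0, SUPERVISOR_MODE = 1 };

    struct cpu_state { enum mode mode_bit; };

    /* May the running process reference an address in the system space? */
    int may_access(const struct cpu_state *cpu, int address_in_system_space)
    {
        if (cpu->mode_bit == SUPERVISOR_MODE)
            return 1;                        /* system space and user space */
        return !address_in_system_space;     /* user mode: user space only  */
    }

    /* May a privileged instruction (e.g. a protected load or store) run? */
    int may_execute_privileged(const struct cpu_state *cpu)
    {
        return cpu->mode_bit == SUPERVISOR_MODE;  /* otherwise it must trap */
    }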
When a user process requests an OS service by sending it a message, the operating system checks the message, switches the processor to supervisor mode, and then delivers the message to
an appropriate OS process that implements the target function. Meanwhile, the user process waits
for the outcome of the service thus requested with a message receive operation. When the OS
process completes the operation, it passes a message Yk in regard to its operation back to the user
process by a send function that is accepted by the user process using its receive function, as shown
in Figure 2.13. These send and receive functions can easily be put into a library procedure, such as:
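A representative pair of such library procedures, shown here only in a generic form (the exact names and argument conventions differ from system to system), would be:

    send(destination, &message);      /* transmit a message to the named destination */
    receive(source, &message);        /* wait for and accept a message from source   */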
The former sends a message to a given destination, and the latter receives a message from a given
source (or from any source, if the receiver does not care).
Message systems also have to deal with the question as to how processes are to be named so
that the process (destination or source) specifed in a send or receive call is unambiguous. Often
a naming scheme is suggested and is used. If the number of concurrently active processes is very
large, sometimes they are also named by grouping similar processes into domains, and then process
addressing requires injection of the domain name into the process name to form a unique process
name. Of course, the domain names must also be unique. Authentication is also an issue of impor-
tance in message-passing systems.
Messages are also used in the area of interrupt management to synchronize interrupt processing
with hardware interrupts. For this, some services are provided that manipulate the interrupt levels,
such as enabling a level, say, by means of an ENABLE system call.
However, one of the major objectives of the design of a message passing scheme is to ultimately
enhance the level of system performance. For instance, copying messages from one process to
another is always a slower activity than doing it with a system call. So, to make this message-passing
approach effcient, one such suggestion out of many, for example, is to limit the message size to what
could ft in the machine’s registers and then carry out message passing using these registers to speed
up the execution.
The distinction between the system call approach and message passing approach has important
consequences regarding the relative independence of the OS behaviors from application process
behaviors and thereby the resulting performance. This distinction has signifcant bearing in the
design issues of the operating system and also an immense impact on the structural design. As a
rule of thumb, operating system design based on a system call interface can be made more effcient
than that requiring messages to be exchanged between distinct processes, although the system call
is implemented with a trap instruction that incurs a high overhead. This effciency is benchmarked
considering the whole cost of process multiplexing, message formation, and message copying ver-
sus the cost of servicing a trap instruction.
2.4.6 SIGNALS
Signals are truly OS abstractions of interrupts. A signal is often used by an OS process to notify an
application process that an asynchronous event has occurred. It often represents the occurrence of
a hardware event, such as a user pressing a delete key or the CPU detecting an attempt to divide by
zero. Hence, signal implementation is likely to use a trap instruction if the host hardware supports
it. Signals also may be generated to notify a process about the existence of a software condition.
For example, an OS daemon may notify a user process that its write operation on a pipe cannot suc-
ceed, since the reader of the pipe has been deleted. In addition, signals may also be generated by
external events (some other process writes to a fle or socket) or internal events (some period of time
has elapsed). A signal is similar to a hardware interrupt but does not include priorities. That is, all
signals are treated equally; signals that occur at the same time are presented to the respective process
one at a time, with no particular ordering. Processes can generally defne handler procedures that
are invoked when a signal is delivered. These signal handlers are analogous to ISRs in the OS.
In fact, signals are the software analog of hardware interrupts but do not provide priorities
and can be generated by a variety of causes in addition to timers expiring. Many traps detected by
the hardware, such as executing an illegal instruction or using an invalid address, are also converted
into signals to the guilty process. Signals are also used for interprocess synchronization as well as
for process-to-process communications or to interact with another process in a hurry.
Signals can also be used among application-level processes. Each signal has a type (called a
“name”) associated with it. Contemporary UNIX systems, including Linux, have many types of
built-in signals. A few common UNIX signals are SIGINT (interrupt from the keyboard), SIGSEGV
(invalid memory reference), SIGKILL (forced termination), SIGCHLD (termination of a child
process), and SIGALRM (expiry of a timer).
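As a brief illustration, a process can install its own handler procedure for one of these signals; the following C fragment catches SIGINT (normally generated from the keyboard) using the POSIX sigaction interface.

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    /* Handler procedure invoked asynchronously when SIGINT is delivered. */
    static void on_interrupt(int signo)
    {
        (void)signo;
        write(1, "caught SIGINT\n", 14);     /* write() is async-signal-safe */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_interrupt;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGINT, &sa, NULL);        /* install the handler          */
        for (;;)
            pause();                         /* wait until a signal arrives  */
    }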
2.4.7 LOCKS
A lock is simply an area in common virtual storage and is usually kept permanently resident in main
memory. Apart from their other uses, locks can often be employed to enforce mutual exclusion
between processes at the time of interprocess synchronization (discussed later in Chapter 4) for
access to shared system resources. A lock contains bits which can be set to indicate that the lock is
in use. Locks normally come in two classes, and are of many different types (see Chapters 7 and 9).
The classes are:
Spin: The processor, while executing a test-and-set–lock type of instruction (the TSL instruction
on IBM machines), repeatedly tests the lock in a tight loop, waiting for the lock to become
free so that it can set it and guard some of its activity. In fact, while waiting, the processor is
not doing any productive work but only testing the lock, thereby suffering critically from
busy-waiting. The lock used for this purpose is known as a spin lock.
Suspend: The task waiting for an event to occur is suspended, or a task in the ready state
is explicitly suspended for various reasons, thereby eliminating undesirable and useless
busy-waiting.
Spin locks are used to avoid a race condition at the time of interprocess synchronization for criti-
cal sections that run only for a short time. A suspend lock, on the other hand, is used for the same
purpose but is employed for considerably longer critical sections; a process that is denied the lock is
suspended, and another suitable process is then dispatched to utilize the costly CPU time for the sake
of performance enhancement. Local locks are usually always suspend locks, whereas global locks
can be both spin and suspend locks.
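The busy-waiting behaviour of a spin lock built on a test-and-set style operation can be sketched with C11 atomics as follows; this is a minimal illustration, not a production-quality lock.

    #include <stdatomic.h>

    /* A spin lock: spin_acquire() busy-waits, repeatedly testing and setting
       the flag until it finds the lock free.
       Initialize with:  spinlock_t lk = { ATOMIC_FLAG_INIT };               */
    typedef struct { atomic_flag flag; } spinlock_t;

    void spin_acquire(spinlock_t *lk)
    {
        while (atomic_flag_test_and_set(&lk->flag))
            ;                                /* busy-waiting (spinning)      */
    }

    void spin_release(spinlock_t *lk)
    {
        atomic_flag_clear(&lk->flag);
    }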
Locks can be arranged in a hierarchical level, and many large systems employ this strategy to
prevent the circular-wait condition of deadlocks (to be discussed later in Chapter 4). A hierarchical
arrangement implies that a processor may request only locks higher in the hierarchy than locks it
currently holds.
2.4.8 PIPES
A pipe is an excellent facility introduced by UNIX to connect two processes together in a unipro-
cessor computing system. When process A wishes to send data to process B, it simply writes on the
pipe as though it were an output fle. Process B can then get the data by reading the pipe as though it
were an input fle. Thus, communication between processes looks very similar to ordinary fle reads
and writes. A pipe is essentially a unidirectional channel that may be written at one end and read
at the other end for the sake of communication between two processes. In fact, a pipe is a virtual
communication channel to connect two processes wishing to exchange a stream of data. Two pro-
cesses communicating via a pipe can reside on a single machine or on different machines in network
environment. Apart from its many other uses, it is a powerful tool that was exploited by UNIX in
its earlier versions to primarily carry out inter-process communications; for example, in a producer–
consumer problem, the producer process writes data into one end of the pipe and the consumer
process retrieves it from the other end, as shown in Figure 2.14. In fact, when the pipe is in use, only
one process can access a pipe at any point in time, which implicitly enforces the mutual exclusion
with synchronization between the processes communicating via pipe. This form of communica-
tion is conceptually very similar to the message-passing facility, allowing asynchronous operations
of senders and receivers (producers and consumers), as well as many-to-many mapping between
FIGURE 2.14 Overview of the mechanisms when information flows through UNIX pipes.
senders and receivers. But the major differences are that the pipe facility does not require explicit
synchronization between communicating processes as the message system does and also does not
even require explicit management and formation of messages. Moreover, the pipe is handled at the
system-call level in exactly the same way as fles and device-independent I/O with the same basic set
of system calls. In fact, a pipe can be created or an already existing pipe can be accessed by means
of an OPEN system call. A writer-process writes data into the pipe by means of WRITE calls, and a
reader-process consumes data from the pipe by means of READ calls. When all data are transferred,
the pipe can be closed or destroyed depending on whether further use of it is anticipated.
A pipe is represented in the kernel by a file descriptor. When a process wants to create a pipe, it
issues a system call to the kernel that is of the form:
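In UNIX-like systems, this call is conventionally written in C as shown below (pipeID being the two-element descriptor array referred to in Figure 2.14):

    int pipeID[2];
    pipe(pipeID);      /* pipeID[0]: read end of the pipe, pipeID[1]: write end */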
The kernel creates the pipe with a fxed size in bytes as a kernel First–In–First–Out data structure
(queue) with two fle identifers. In Figure 2.14, pipeID [0] is a fle pointer (an index into the pro-
cess’s open fle table) to the read-end of the pipe, and pipeID [1] is the fle pointer to the write–end
of the pipe. The pipe’s read-end and write-end can be used in most system calls in the same way that
a fle descriptor is used. The automatic buffering of data and the control of data fow within a pipe
are performed as byte streams by the operating system.
UNIX pipes do not explicitly support messages, although two processes can establish their own
protocol to provide structured messages. There are also library routines that can be used with a pipe
for communication using messages.
The pipe can also be used at the command-language level with the execution of a pipeline statement,
such as a | b, in which the output generated by one program is used as the input to another
program (for example, ls | wc -l counts the entries in a directory). This is an additional form of
inter-program communication that is established without any special reprogramming effort and
without the use of any temporary files.
Pipes are differentiated by their types. There are two types of pipes: named and unnamed. Only
related processes can share unnamed pipes, whereas unrelated processes can share only named
pipes. In normal pipes (unnamed pipes), the pipe-ends are inherited as open fle descriptors by the
children. In named pipes, the process obtains a pipe-end by using a string that is analogous to a fle
name but which is associated with a pipe. This enables any set of processes to exchange informa-
tion using a “public pipe” whose end names are fle names. Moreover, when a process uses a named
pipe, the pipe is a system-wide resource, potentially accessible by any process. Named pipes must be
managed in a similar way, just as fles have to be managed so that they are not inadvertently shared
among many processes at one time.
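A minimal producer-consumer use of an unnamed pipe between related processes might look like the sketch below, in which the parent writes a short message (chosen arbitrarily) into the pipe and the child reads it out.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int  pipeID[2];
        char buf[64];

        pipe(pipeID);                              /* create the pipe          */
        if (fork() == 0) {                         /* child: the consumer      */
            close(pipeID[1]);                      /* will not write           */
            ssize_t n = read(pipeID[0], buf, sizeof buf - 1);
            if (n > 0) {
                buf[n] = '\0';
                printf("consumer read: %s\n", buf);
            }
        } else {                                   /* parent: the producer     */
            close(pipeID[0]);                      /* will not read            */
            const char *msg = "data through the pipe";
            write(pipeID[1], msg, strlen(msg));
            close(pipeID[1]);
        }
        return 0;
    }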
Many large complex operating systems have more elaborate, versatile forms of command facilities
and OS software support and services to assist the different types of users across a diverse spectrum.
System commands provide an interface to the operating system for the user so that the user can
directly issue these commands to perform a wide range of system-related functions from user mode.
This interface actually separates the user from operating system details and presents the operating
system simply as a collection of numerous services. The command language interpreter, an operat-
ing system program (called the shell in UNIX), accepts these user commands or job control state-
ments, interprets them, and creates and controls processes as needed.
• Performance:
Users implement security using different degrees of protection facilities offered by a modern OS
for different management, objects, users, or applications to safeguard all types of resources from
unauthorized accesses and threats.
Since all software in the computer is executed by the hardware with the support and services
offered by the OS, the correctness and reliability of the underlying OS are to be ensured at the time
of its design and also when injecting new features into its existing design for required enhancements.
As the complexity and the volume of the OS software are gradually increasing to accommodate
a constantly growing number of users from a diverse spectrum of disciplines working on numer-
ous hardware platforms, the design of the OS should be made easily maintainable. Maintainability
is also closely related to upgradeability, which states that the release of a new version of an exist-
ing OS should include downward compatibility to confrm the upgradeability of its older version.
Maintainability often means a compromise, even accepting relatively inferior performance, but to
what extent it will be tolerated is actually driven by tactical decisions being made by the designers
at the time of OS design.
The continuous increase in the number of computer users and developers, along with the constant
emergence of new avenues in computer applications supported by relentless developments in hardware
and software technologies, ultimately paved the way for different types of operating systems to
evolve in more advanced forms that drive the newer hardware while offering more convenient user
interfaces. Of course, downward compatibility of the evolving
OS must be maintained. Sometimes an innovation in the design of an operating system awaits the
implementation of a suitable technology. Market forces and large manufacturers, by their dom-
inance, also promote certain features that eventually steer, control, infuence, and subsequently
direct the choices of the users. However, one of the most radical commercial developments in the
implementation of new operating systems was the Open Software Foundation's OSF/1, which was based on
the Mach OS and was originally designed as a multiprocessor operating system. It was initially targeted
to implement different parts of the OS on different processors in the multiprocessor system and ulti-
mately became successful. In fact, today’s common trend in research and development of operating
systems is to implement some variant of a UNIX/Linux interface, especially toward microkernel
implementations for all types of computers as well as a ftting OS for the various types of emerg-
ing cluster architectures to handle a diverse spectrum of distributed application areas. An all-out
attempt in the development process of more innovative system software is also observed that could
negotiate the challenge of newer (upcoming) environments.
Due to the resurgence of electronic technology since the late 1970s, the renaissance in the archi-
tectural evolution of computers and the corresponding development of various types of operating
systems and other system software to drive these constantly emerging more intelligent machines
have progressed through various forms and ultimately culminated in the arrival of distributed hard-
ware equipped with multiple processors in the form of various types of multiprocessors and multi-
computers as well as cluster architectures driven by ftting modern distributed operating systems.
As the working environments constantly expanded with the continuous introduction of various
types of computers with dissimilar architectures and run by individually dedicated operating sys-
tems, there was, thus, no compatibility between them, which eventually caused havoc in the user
domain. To remedy this, there was a keen desire to have technology that would, by some means,
bring about an imposed compatibility between different environments. Consequently, this gave rise
to the birth of what is commonly known as open system technology, which fnally succeeded in
consolidating entire computing environments.
The ultimate objective of open system architecture is, however, to allow the end users to work
on any computer irrespective of its architecture and associated dedicated operating system. It also
enables the end users to carry out information-processing tasks from their own domain in an envi-
ronment of a network of heterogeneous computers with different architectures that are interconnected with one another.
SUMMARY
This chapter explains why operating systems are needed and gives an overview of the objectives
and functions which serve to defne the requirements that an operating system design is intended
to meet. Here, the concepts and the general characteristics of the overall organization of an operat-
ing system have been explained. The major issues that infuence the design of generic multitasking
operating systems, as well as expansion in the area of fundamental requirements to support hard-
ware abstraction, resource sharing, and needed protection have been discussed. Many other critical
issues that need to be resolved have been identifed. The concept of process is a central theme to all
the key requirements of the operating system, and its view from different angles has been featured.
A brief description of the different supports and services that are offered by operating systems in
the form of user-accessibility and that of system-related subjects has been presented. The chapter
ends with an introduction to the most important factors that have an immense impact in the design
of generic operating systems.
EXERCISES
1. “Certain specifc responsibilities are performed by an operating system while working as
a resource manager”—state these and explain.
2. “The conceptual design of an operating system can be derived in the form of a collection
of resource management”. What are the different management modules found in any generic
operating system? What basic responsibilities does each of them perform individually?
3. Which of the four basic OS modules might be required on a computer system? Why are
these modules not multi-programmed?
4. State and explain the concept of a process. Distinguish between a process and a program.
Is this difference important in serial (single-process) operating systems? Why or why not?
5. “Processes by nature can be classifed into two types”. State and explain these two types
and show their differences.
6. “A process can be viewed from different angles”. Justify the statement with adequate
explanation. What are the salient features that exist in these different visions?
7. The CPU should be in privileged mode while executing the OS code and in user mode (i.e.
non-privileged mode) while executing a user program. Why is a change in mode required?
Explain how this change in mode is carried out.
8. What is an interrupt? What are the different classes of interrupts that are commonly found
in any generic operating system?
9. Explain why an interrupt is considered an essential element in any operating system.
Explain an interrupt from a system point of view.
10. How is an interrupt viewed from the domain of a user program? Explain with an appropri-
ate diagram.
11. What is meant by interrupt processing? What are the different types of actions being taken
to process an interrupt? Explain the various processing steps that are involved in each of
these types.
12. What does interrupt servicing mean? State and explain the different courses of action
being taken by an operating system in order to service an interrupt.
13. “Interrupts are required to be interrupted in certain situations”. Explain at least one such
situation with an example. Discuss how this situation is normally handled.
14. The kernel of an OS masks interrupts during interrupt processing. Discuss its merits and
drawbacks.
15. State and explain vectored interrupts and non-vectored interrupts.
16. What is a trap? How can it be differentiated from an interrupt?
17. Sharing of resources is required for concurrent processing in order to attain performance
improvement. Explain space-multiplexed sharing and time-multiplexed sharing in this
context.
18. What is meant by protection of hardware resources? Why is protection required for such
resources? How is protection achieved for each individual resource?
19. Why is scheduling of system resources required for concurrent execution in a computer
system?
20. What are the different classes of resource scheduling used in a computer system? State
in brief the numerous responsibilities that are performed by these different classes of
schedulers.
21. Point out the responsibilities that are performed by a short-term scheduler. Explain how the
action of a medium-term scheduler affects the movement of a short-term scheduler.
22. “The operating system provides numerous services that can be broadly categorized into
two different classes”. What are these two classes? State and explain in brief the services
that they individually offer to common users normally.
23. Defne system calls. How can a user submit a system call? State and explain the working
principle of a system call. What is the difference between a system call and a subroutine
call?
24. Various types of service calls are usually incorporated that the kernel may provide. Suggest
a few of them that every kernel should have.
25. How can a device submit a service call? Does a service call always require a context
switch? Does it always require a process switch?
26. System call and interrupt are closely related to each other. Justify this relationship, if there
is any, with reasons.
27. Defne a procedure call. How does it differ from a system call?
28. “A procedure call is simple, usable, and has almost no complexity in implementation on
almost all computer systems; still it is not preferred in contemporary systems to use OS
services”. Would you agree? If so, give reasons in favor of your stance.
29. State some factors that might differentiate between the time to do a normal procedure call
from an application program to one of its own procedures compared to the time to perform
a system call to an OS procedure.
30. What is the difference between a command and an instruction? Discuss the various types
of system commands that are available in a generic operating system. How is the command
executed in a system?
31. A command-language interpreter is not considered a part of an operating system. Justify.
32. “The processor works in two modes”. What are these two modes? Why are these two
modes required? Explain how the switching of modes is carried out by the processor.
33. What is a software interrupt? In what way does it differ from a hardware interrupt? Discuss
its salient features and the distinct advantages of using it.
34. Discuss the working methodology of a message-passing technique. Message-passing tech-
niques are an important constituent of an operating system. Give reasons in favor of this
statement. What are the drawbacks of this technique? What is the difference between a
system call approach and a message-passing approach?
35. What is the role of a signal in the working of an operating system?
36. What are locks? What are the basic functions that a lock performs? What are the different
classes and different types of locks commonly used in an operating system? What are their
specifc usages?
37. Discuss in brief some of the factors that infuence the design of an operating system at the
time of its development. Explain their role in arriving at a specifc trade-off.
3 Operating Systems: Structures and Designs
Learning Objectives
• To illustrate the step-wise evolution of different design models of operating systems based
on numerous design objectives.
• To explain the structure, design, formation, and subsequent realization of primitive mono-
lithic systems, then improved hierarchical and extended machines, and subsequently dif-
ferent types of modular layered systems, mainly for large mainframe machines.
• To describe a revolutionary approach in the design and development of operating systems
for virtual machines.
• To describe the introduction of a new direction in the design and development of different
operating systems, leaving its traditional concept for the emerging client–server model
systems.
• To articulate the arrival of a new dimension in the design and development of operating
systems using a kernel-based approach.
• To demonstrate the novel concepts in the design and development of operating systems
based on monolithic kernels and microkernels and subsequently hybrid kernels.
• To briefy discuss the basic design issues and salient features of generic modern operating
systems, including distributed operating systems.
when realized, gives rise to what is called a monolithic system. Here, processing, data, and user
interfaces; all reside on the same system with OS code to be simply executed as a separate entity
in privileged mode. All the individual procedures or fles containing the procedures of the OS are
compiled and then bound together into a single object fle with a related linker (linkage editor). The
monolithic structure of the operating system, however, suggests the following organization for its
implementation.
1. A main program within the operating system that invokes the requested service procedure
(table search).
2. A set of service procedures that carry out the execution of requested system calls.
3. A set of utility procedures that do things such as fetching data from the user program
and similar other things. These are needed by several service procedures during their
execution.
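This three-part organization can be caricatured in C as a main dispatch routine that looks up the requested service in a table and calls the corresponding service procedure; the table contents and names below are invented purely to illustrate the structure and are not the code of any actual system.

    /* Caricature of a monolithic kernel's main program: a table search that
       maps a requested call number to its service procedure. */
    typedef int (*service_proc_t)(void *params);

    static int svc_create_process(void *p) { (void)p; return 0; }  /* placeholder */
    static int svc_read_file(void *p)      { (void)p; return 0; }  /* placeholder */
    static int svc_write_file(void *p)     { (void)p; return 0; }  /* placeholder */

    static service_proc_t service_table[] = {
        svc_create_process,   /* call number 0 */
        svc_read_file,        /* call number 1 */
        svc_write_file,       /* call number 2 */
    };

    /* Main program: utility procedures would first fetch the parameters from
       the user program; the requested service procedure is then invoked. */
    int kernel_main(int call_number, void *params)
    {
        return service_table[call_number](params);
    }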
This concept was exploited in developing the frst simple batch operating systems in the mid-1950s
and was implemented in the IBM 704 after refnement, and then by the early 1960s, even larger
mainframe computers used this approach with considerable success. It was then implemented as
the IBSYS operating system by IBM for its 7090/7094 computers. But when this design was imple-
mented on PC-DOS and later on MS-DOS and subsequently on earlier Windows-based OSs on PCs,
it worked poorly with multiple users. Even large operating systems, when designed as collections of
monolithic pieces of code, were observed to be unsuitable from an organizational point of view.
For more details about monolithic systems, see the Support Material at www.routledge.com/
9781032467238.
FIGURE 3.1 Design of the earlier operating systems in the form of Extended machine concept using hier-
archical levels.
forms of generalized design and development of more modern operating systems for the days to come.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
the power, and it is away from the user and closer to the core hardware and thus can then directly
interact with the hardware over a far shorter time scale. Other parts of the operating system belong
to higher levels which are away from the hardware and closer to the user. Some of these upper
parts of the operating system allow the user to communicate directly with the operating system via
commands that interact with the peripheral hardware over a longer duration of time. A represen-
tative modular hierarchical layer/level-structuring approach in designing an operating system as
discussed is depicted in Figure 3.2.
In a strictly hierarchical implementation, a given level can exploit facilities including the objects
and operations provided by all downward intermediate levels (up to bare hardware) by way of simply
FIGURE 3.2 Generalized modular hierarchical layer/level structuring approach in designing an operating
system used in large mainframe systems, implemented in the mid-1960s.
calling upon the services they offer but without the details of their operations. Each level, however,
already has its own existing set of primitive operations and related hardware support and cannot
make use of anything of its higher levels. This modular hierarchical structuring approach also sup-
ports the concept of information hiding, where the details of data structures, processing algorithms,
and mode of working of each and every module are totally confned within that module at its respec-
tive level. Externally, a module is observed to only perform a set of specifc functions on various
objects of different types. The internal details of its way of working are neither visible nor available
and truly not of concern to the users enjoying the services provided by the respective module.
Over time, these design principles, realized in the form of hierarchical layers/levels, ultimately
became a universal approach for being nicely matched with the working environment but greatly
varying in methods of implementation among contemporary operating systems. The model as pro-
posed by Denning and Brown (DENN 84) did not correspond to any specifc operating system, but
this representative structure (Figure 3.2) otherwise served to gain an overview of an operating sys-
tem when designed and developed in the form of a hierarchical model. This model is summarized
by layer/level with a few respective objects and their corresponding operations in Table 3.1.
TABLE 3.1
Leveled Operating System Design Hierarchy
(Columns: Name; Objects; Some Operations)

Level 5: Information Management
  Command language interpreter (shell); user programming environment (shell); statements in the command language
  User processes; user processes; quit, kill, suspend, resume
  Supervisor process module
  Job scheduler; jobs; scheduling of jobs
  Directories; directories; create, destroy, open, close, read, write
  File systems; files; create, destroy, open, close, read, write
Level 4: Device Management
  Keep track of the status of all I/O devices; external devices; create, destroy, open, close, read, write
  I/O scheduling; external devices
  Initiate I/O process; external devices
  Communication; pipes; create, destroy, open, close, read, write
Level 3: Processor Management, Upper Level
  Start process, stop process; processes; create, destroy, send and receive messages
Level 2: Memory Management
  Allocate memory; segments, pages; allocate, free, fetch, read, write
  Release (free) memory
  Local secondary store; blocks of data, device channels; read, write, allocate, free
Level 1: Processor Management, Lower Level
  P, V primitives; semaphores; interprocess communication
  Process scheduling; processes; suspend, resume, wait and signal
Level 0
  Procedures; procedure call, call stack; call, return, mark stack
  Interrupts; interrupt-handling programs; invoke, mask, retry
  Instruction sets; microprogram interpreter, scalar and array data, evaluation stack; load, store, branch, add, subtract
  Electronic circuits; fundamental components such as registers, gates, buses; activate, complement, clear, transfer
FIGURE 3.3 Virtual machines running on a single piece of hardware, each with its own (possibly dissimilar)
operating system, and the role of the virtual machine monitor (VMM) in keeping them operationally integrated.
The heart of the system, the virtual machine monitor, runs on the bare hardware (physical hard-
ware) and creates the required VM interface, providing multitasking concurrently to several VMs of
the next layer up. A VMM is essentially a special form of operating system that multiplexes only the
physical resources among users, offering each an exact copy of the bare hardware, including kernel/
user mode, I/O interrupts, and everything else the real machine has, but no other functional enhance-
ments are provided. These VMs should be considered neither extended machines (in which policy-
dependent and hardware-dependent parts have been completely separated) nor microkernels (to be
discussed later at the end of this chapter) with different modules of programs and other features.
A VMM is actually divided into roughly two equal major components: CP (control program)
and CMS (conversational monitor system) that greatly simplify the design of the VMM and its
implementation. The CP component is located close to the hardware and performs the functions
of processor, memory, and I/O device multiplexing to create the VMs. The CMS, placed above the
CP, is a simple operating system that performs the functions of command processing, information
management, and limited device management. In fact, the CP and CMS are typically used together,
but the CMS can be replaced by any other OS, such as OS/360, DOS, OS/VS1, or OS/VS2. Virtual
machines, however, have many uses and distinct advantages:
• Software development: Programs can be developed and debugged for machine confgura-
tions that are different from those of the host. It can even permit developers to write real
operating systems without interfering with other users.
• Test of network facilities: It can test network facilities, as will be discussed later, by simu-
lating machine–machine communication between several VMs under one VM monitor on
one physical machine.
• Evaluation of program behavior: The VMM must intercept certain instructions for inter-
pretive execution rather than allowing them to execute directly on the bare machine. These
intercepted instructions include I/O requests and most other supervisory calls.
• Reliability: The VMM typically does not require a large amount of code or a high degree
of logical complexity. This makes it feasible to carry out comprehensive check-out pro-
cedures and thus ensures high overall reliability as well as integrity with regard to any
special privacy and security features that may be present. Isolating software components
in different VMs enhances software reliability.
• Security and privacy: The high degree of isolation between independent VMs facilitates
privacy and security. In fact, privacy between users is ensured because an operating sys-
tem has no way of determining whether it is running on a VM or on a bare machine and
therefore no way of spying on or altering any other co-existing VMs. Thus, security can be
enhanced by isolating sensitive programs and data to one dedicated VM.
3.1.4.1 Drawbacks
The VM had a lot of distinct advantages, and it could also be confgured with multiple processors,
but it was unable to provide all of the controls needed to take full advantage of modern multiproces-
sor confgurations. However, it also suffered from several other serious drawbacks; one of these is
that the failure of a single piece of hardware, or any malfunction of the VMM-interface simultane-
ously supporting many VMs would ultimately cause a crash of the entire environment. Moreover,
this hardware was very expensive and also gigantic in size. In addition, the VMM-interface is still
a complex program, since simulating a number of virtual 370s or compatible machines is not that
simple to realize, even with compromises for moderate effciency and reasonable performance. Last
but not least, the inclusion of each additional layer offers a better level of abstraction for the sake of
multiplexing and to simplify code, but that starts to affect the performance to some extent, as many
interactions may then happen between such layers.
Virtual-machine operating systems are also quite complex, even though most instructions are executed
directly by the hardware to speed up execution. In fact, virtual-machine operating systems cannot be
implemented on computers where dangerous instructions (such as those that perform I/O, manipulate the
address-translation registers discussed in Chapter 5, or alter the processor state, including interrupt-return
and priority-setting instructions) are ignored or fail to trap in the non-privileged state, as was found on the
PDP-11/45.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
FIGURE 3.4 A generic operating system model running on a single system (uniprocessor) when used in a
client–server environment within a computer network.
FIGURE 3.5 A generic operating system model running on multiple systems (multiple processors) when used
in a client–server environment within a distributed system.
the forms, the ultimate end is primarily targeted, apart from many other reasons, to balance the load
of the entire arrangement in a ftting way to realize greater effciency and better performance in a
distributed environment.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
Processor/Process Management
• Process switching
• Process synchronization at the time of inter-process communication
• Process control block management
Memory Management
• Buffer management
• Allocation of devices to processes and allocation of I/O channels
File Management
Execution-Support Functions
• Interrupt handling
• Procedure handling
• Accounting
• Monitoring and supervision
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
FIGURE 3.6 Operating system design based on the kernel-shell concept, best exemplified by UNIX, using a
layered implementation but structured differently.
at this point that it is the library interface, not the system call interface, that has been brought to an
acceptable standard with the specifcation defned by POSIX 1003.1 to implement standardization
in different versions and forms of UNIX.
The kernel in UNIX is not really well structured internally, but two parts are more or less dis-
tinguishable. At the very bottom is the machine-dependent kernel that consists of a few modules
containing interrupt handler, the low-level I/O system device drivers, and part of the memory man-
agement software. Since this directly drives the hardware, it has to be rewritten almost from scratch
whenever UNIX is ported to (installed on) a new machine. In contrast, the machine-independent
kernel is the same on all machines, because it does not depend closely on the particular hardware
it is running on. This code includes system call handling, process management, scheduling, pipes,
signals, paging and swapping, the fle system, and the high-level part of the I/O system (disk strat-
egy, etc.).
In fact, files and I/O devices are treated in a uniform manner by the same set of applicable system calls. As a result, I/O redirection and stream-level I/O are fully supported at both the command-language (shell) and system-call levels. Fortunately, the machine-independent part is much larger than the machine-dependent part, which is why it is relatively straightforward to port UNIX to a wide range of different or new hardware. It is interesting to note at this point that portability was not one of the design objectives of UNIX. Rather, it came as a consequence of coding the system
in a comparatively high-level language. Having realized the importance of portability, the designers of UNIX then decided to confine hardware-dependent code to only a few modules in order to facilitate easy porting. In fact, today's UNIX systems in many respects tend to deviate from the original design philosophy so that they can better address increasing functionality requirements.
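A small sketch can show how this uniform, descriptor-based treatment makes I/O redirection almost trivial to implement. The program and file names below are purely illustrative, and the code is a simplified model of what a shell does for "command > out.txt", not an excerpt from any actual shell.

/* Redirection by re-binding descriptor 1: because files and devices are
 * reached through the same descriptors, the exec'ed program needs no change. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    dup2(fd, STDOUT_FILENO);              /* standard output now refers to out.txt */
    close(fd);
    execlp("ls", "ls", "-l", (char *)0);  /* inherits the redirected descriptor */
    return 1;                             /* reached only if exec fails */
}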
All versions of UNIX (as shown in Figure 3.6) provide an operating system (kernel), a standard system call library, and a large number of standard utility programs. Some of these utility programs maintain the standard as specified by POSIX 1003.1, while others differ between UNIX versions. Those invoked by the user include the command language interpreter (shell), compilers, editors, text-processing programs, and file-handling utilities. Out of the three interfaces to UNIX (as shown in Figure 3.6), namely the true system call interface, the library interface, and the interface formed by the set of standard utility programs along with the shell (user interface), this last one is not part of UNIX, although most casual users think so.
The huge number of utility programs can again be divided into six categories: file and directory manipulation commands, filters, compilers and program development tools, text processing, system administration, and miscellaneous. Moreover, the filters, which have almost nothing to do with the operating system and are also usually not found in other existing contemporary operating systems, can easily be replaced without changing the operating system itself at all. In fact, this and other flexibilities together contribute much to making UNIX popular and allow it to survive so well even in the flood of numerous changes continuously going on in the domain of underlying hardware technology over time. Some popular UNIX-like operating systems are:
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
data structures, including resource queues, process descriptors, semaphores, deadlock information, virtual memory tables, device descriptors, file descriptors, and the like, is very difficult. The reason is that each module depends on certain information that is encapsulated in other modules. Moreover, the performance of a partitioned OS might be too inefficient for hardware with limited computing power or information transfer bandwidth. It was thus decided to implement the OS kernel as a single monolith.
A monolithic kernel organization means that almost all of the kernel's main functions (process and resource management, memory management, and file management) are implemented in a single unit. A monolithic kernel, as shown in Figure 3.7, running in kernel space in supervisor mode consists of all software and data structures placed in one logical module (unit) with no explicit interfaces to any parts of the OS software. In common with other architectures (microkernel, hybrid kernel, discussed later), the kernel here also provides a high-level virtual interface to its upper level with a set of primitives or system calls that implement the operating system services, such as process management, concurrency, and memory management, using one or more components (parts) that then interact with the bare hardware.
Although every component servicing its respective operations is separate within the unit, the code integration between them is very tight and difficult to develop correctly, and, since all the modules run in the same address space, a bug in one module can bring down the entire system. Still, when the implementation is complete and trustworthy, the tight internal integration of components in turn allows the low-level features of the underlying system to be excellently utilized and summarily makes a good monolithic kernel highly effective and equally efficient.
Under this system, the user gets operating system services by issuing special trap instructions known as system calls (supervisor calls) or kernel calls, with appropriate parameters placed in well-defined locations, such as in registers or on the stack. When these instructions are executed, the machine is switched from user mode to kernel mode, also known as supervisor mode, and control is then transferred to the kernel (operating system), which examines the parameters associated with the call and then invokes the corresponding service procedure, which is then executed. After completion, control is returned to the user program so that it can resume its ongoing operation.
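The hedged sketch below illustrates this path on a Linux-style monolithic kernel: the same kernel service can be reached through the C library wrapper write() or through the raw trap issued with syscall(), which is Linux-specific. In both cases the processor switches to kernel mode, and the kernel dispatches the service routine selected by the call number and parameters.

/* Two routes to the same kernel service (Linux-specific example). */
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>   /* SYS_write */

int main(void)
{
    const char msg1[] = "via the library wrapper\n";
    const char msg2[] = "via the raw trap\n";

    write(STDOUT_FILENO, msg1, sizeof msg1 - 1);              /* library wrapper */
    syscall(SYS_write, STDOUT_FILENO, msg2, sizeof msg2 - 1); /* direct trap     */
    return 0;
}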
FIGURE 3.7 An overview of an operating system design based on Monolithic kernel concept.
• MS-DOS, Microsoft Windows 9x series (Windows 95, Windows 98, and Windows 98SE),
and Windows Me
• Traditional UNIX kernels, such as the kernels of the BSDs and Solaris
• Linux kernels
• Mac OS kernels, up until Mac OS 8.6
• OpenVMS
• XTS-400
• Some educational kernels, such as Agnix
• Syllable (operating system)
Monolithic kernels are further discussed in different sections at the end of this chapter and are
also referred to in Chapter 9, “Distributed Systems”.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
The BIOS is a collection of low-level device drivers that serves to isolate MS-DOS from the details of the hardware. BIOS procedures are loaded in a reserved area in the lowest portion of main memory and are called by trapping to them via interrupt vectors. The basic OS kernel was implemented wholly in the read-only memory's (ROM) resident BIOS routines and two executable files, IO.SYS and MS-DOS.SYS (Chappell, 1994). IO.SYS (called IBMBIO.COM by IBM) is a hidden file and is loaded into memory at the time of booting, just above the interrupt vectors. It provides a procedure call interface to the BIOS so the kernel can access BIOS services by making procedure calls to IO.SYS instead of traps to the ROM. This file holds those BIOS procedures not in the ROM, as well as a module called SYSINIT which is used to boot the system. The existence of IO.SYS further isolates the kernel from hardware details. MS-DOS.SYS (which IBM calls IBMDOS.COM) is another hidden file, loaded into memory just above IO.SYS, and contains the machine-independent part of the operating system. It handles process management, memory management, and the file system, as well as the interpretation of all system calls.
The third part of what most people think of as the operating system is the shell, COMMAND.COM, which is actually not part of the operating system and thus can be replaced by the user. In order to reduce memory requirements, the standard COMMAND.COM is split into two pieces: a resident portion that always resides in memory just above MS-DOS.SYS and a transient portion loaded at the high end of memory only when the shell is active. The transient portion can be overwritten by user programs if the space is needed, in which case MS-DOS later reloads COMMAND.COM afresh from disk.
As the device-dependent code is kept confined to one layer, porting MS-DOS is theoretically reduced to only writing or modifying the BIOS code afresh for the new hardware. Later releases of MS-DOS had UNIX-like features. At the command level (shell), MS-DOS provides a hierarchical file system, I/O redirection, pipes, and filters. User-written commands can be invoked in the same way as standard system commands, thereby providing the needed extension of the basic system functionality.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
The original UNIX philosophy was to keep kernel functionality as limited as possible, with a minimal OS implementing only the bare necessities while leaving other system software to implement as many of the additional OS functions as possible. In fact, the original UNIX kernel, as shown in Figure 3.8, was small, efficient, and monolithic, providing only basic machine resource management and a minimal low-level file system, along with kernel (OS) extension facilities for creating specific computational environments. The device management functions were implemented separately in device drivers inside the kernel that were added to the kernel at a later time. Thus, the kernel would often need to be extended by reconfiguring it with the code of new device drivers whenever new devices were added, without disturbing the existing main kernel at all. The file organization implemented byte-stream files using the stdio (standard I/O) library to format the byte stream.
Early monolithic UNIX kernels provided two significant interfaces, as shown in Figure 3.8. The first one was between the kernel and user-space programs, such as applications, libraries, and commands. The second one was within the kernel space, between the main part of the kernel and the device drivers, using the interrupt handler as the interface between the device and the kernel. However, the UNIX kernel, after repeated modifications and enhancements, ultimately emerged as a medium-sized monolithic monitor in which system calls are implemented as a set of co-routines. In general, once a kernel co-routine starts its execution, it continues with the processor until completion unless it is preempted by an interrupt. Some system processes, however, are also available to service device interrupts.
The UNIX kernel, after being expanded, ported, and re-implemented many times, gradually became large and complex and eventually deviated from the original design philosophy, emphasizing other expansions so that it could better address increasing functionality requirements, particularly in the area of network and graphic device support. Graphic devices are largely handled in user space, and networks are addressed explicitly by the expanded system call interface.
FIGURE 3.8 An overview of the traditional UNIX operating system implementation using a design based on the monolithic kernel concept.
Modern monolithic kernels, when used in distributed systems, are essentially today's centralized operating systems augmented with networking facilities and the integration of remote services. Most system calls are made by trapping to the kernel, which services and executes the calls and then returns the desired result to the user process. With this approach, most machines in a network have disks with a traditional kernel and manage their own local file systems. Many distributed systems that are extensions or imitations of UNIX use this approach to a large extent.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
The Linux kernel uses the same organizational strategy as found in other UNIX kernels but religiously sticks to a minimal OS (the original principle of UNIX) while also accommodating current technology in its detailed design with a monolithic kernel organization. Similar to UNIX, Linux also provides an interface within the kernel between the main part of the kernel and the device drivers. This facility accommodates the modification or incorporation of additional device drivers or file systems into the existing kernel, which are then statically configured into the kernel organization. But the problem with Linux in this regard is that the development of Linux is mostly global and carried out by a loosely associated group of independent developers. Linux resolves this situation simply by providing an extra mechanism for adding functionality, called a module, as shown in Figure 3.9. Whereas device drivers are statically configured into a kernel structure, modules can be dynamically added and deleted even while the OS is in action. Modules can also be used to implement dynamically loadable device drivers or other desired kernel functions.
Linux is designed and structured as a collection of relatively independent blocks or modules, a number of which can be loaded and unloaded on demand; these are commonly referred to as loadable modules. In essence, a module is an executable file that implements a specific function, such as a file system, a device driver, or some other feature of the kernel's upper layer. These modules can even be linked to and unlinked from the kernel at runtime. A module cannot be executed as its own process or thread; rather it is executed in kernel mode on behalf of the current process.
FIGURE 3.9 An overview of the traditional Linux operating system implementation based on the monolithic kernel concept, showing the placement of the Linux kernel, device drivers, and modules.
However, a module can create kernel threads for various purposes as and when needed. This modu-
larity of the kernel (as also is found in FreeBSD and Solaris) is at the binary (image) level, not at the
kernel architecture level. These two are completely different but are sometimes confused. Modular
monolithic kernels are not to be confused with the architectural level of modularity inherent in
microkernels or hybrid kernels. By virtue of having a modular structure, the Linux kernel, in spite
of being monolithic, has overcome some of its major inherent diffculties in developing and evolving
the modern kernel.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
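The classic "hello" module below is a minimal sketch of such a loadable module: it is compiled separately against the kernel headers, inserted with insmod while the monolithic kernel keeps running, and removed again with rmmod; its printk() output appears in the kernel log (dmesg). The module name and messages are, of course, only illustrative.

/* hello.c: a minimal dynamically loadable Linux kernel module. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

static int __init hello_init(void)
{
    printk(KERN_INFO "hello: module loaded\n");
    return 0;                     /* 0 indicates a successful load */
}

static void __exit hello_exit(void)
{
    printk(KERN_INFO "hello: module unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal example of a loadable module");

Such a module is built with the usual out-of-tree module makefile (obj-m += hello.o) against the running kernel's build tree, which is exactly the binary-level modularity described above.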
The first (No. 1) implements the extensible nucleus or microkernel that provides a low-level uniform VM interface with some form of process and memory management, usually with only the bare essentials of device management. In this way, it overcomes the problems concerning portability, extensibility (scalability), flexibility, and reliability. While this part does not provide complete functionality, it creates an environment in which the second part, that is, the policy-dependent part of the operating system, can be built to meet all the needs and requirements of the application domain. The second part essentially reflects the actual requirements and functions of the specific OS that are implemented as server processes (kernel processes); it resides and operates on top of the microkernel (at the upper level of the microkernel) along with user processes, performs interrupt handling, and provides communication between servers and user processes by means of message passing, as shown in Figure 3.11. Processes here need not be distinguished at all between kernel-level and user-level services, because all such services are provided by way of message passing.
FIGURE 3.10 A formal design concept of the structural organization of an operating system using (a) the monolithic kernel approach and (b) the microkernel (extensible nucleus) model separately.
The microkernel is viewed differently and thus implemented in different ways by different operating systems, but the common characteristic is that certain essential core services must be provided by the microkernel, while many other services that traditionally have been part of the operating system are now placed as external subsystems (the second item in the list) that interact with the kernel (microkernel) and also with each other. These subsystems usually include file systems, process services, windowing systems, protection systems, and similar others (Figure 3.11).
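The following user-space simulation is a purely conceptual sketch of this message-passing structure; it does not use the API of any real microkernel, and every name and number in it is hypothetical. The single global mailbox stands in for the kernel's validate-and-copy step, while the client and the file server interact only by exchanging messages.

/* A user-space simulation of obtaining a service purely via message passing;
 * all names, identifiers, and operation codes are hypothetical. */
#include <stdio.h>

typedef struct {
    int  sender;           /* id of the sending process  */
    int  service;          /* requested operation code   */
    char payload[64];      /* request data or reply data */
} message_t;

static message_t mailbox;  /* stand-in for kernel-managed message transfer */

static void msg_send(const message_t *m) { mailbox = *m; }   /* "kernel": validate and copy */
static void msg_receive(message_t *m)    { *m = mailbox; }   /* "kernel": deliver           */

enum { CLIENT = 1, FS_SERVER = 2, FS_READ = 10 };

static void file_server_step(void)       /* one iteration of the server's service loop */
{
    message_t req, reply = { .sender = FS_SERVER, .service = FS_READ };
    msg_receive(&req);
    if (req.service == FS_READ)
        snprintf(reply.payload, sizeof reply.payload, "data for client %d", req.sender);
    msg_send(&reply);
}

int main(void)
{
    message_t req = { .sender = CLIENT, .service = FS_READ }, reply;

    msg_send(&req);        /* client asks the file-server process for a read */
    file_server_step();    /* server handles exactly one request             */
    msg_receive(&reply);   /* client collects the answer                     */

    printf("client received: %s\n", reply.payload);
    return 0;
}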
But whatever design strategy is followed in the architecture of the microkernel (extensible nucleus), its emergence consequently gave rise to two fundamentally new directions and dimensions in operating system design: multiple OSs–single hardware and single OS–multiple hardware, as shown in Figure 3.12. The first approach, as shown in Figure 3.12(a), concerns the direction that allows policy-dependent parts of different OSs to run on a VM interface used to implement policy-specific extensions built on a single hardware platform; these complement the extensible kernel (nucleus) and form a complete operating system at each end. In fact, the IBM VM system designed in the 1970s mostly followed the same line, at least conceptually though implemented differently, and had the intent to inject the extensible-nucleus factor into the design of an operating system, which eventually culminated in the emergence of the microkernels of the 1990s.
The second approach appeared to be the reverse of the first and relates to the portability of an OS, as shown in Figure 3.12(b). Here, the nucleus (microkernel) is designed so that it can run on different types of hardware and provides a high degree of flexibility and modularity. Its size is relatively small, and it actually consists of the skeletal policy-independent functions (bare hardware functions) that are extended with specialized servers to implement a specific policy. The remaining policy-dependent part of a specific OS, which is much larger, runs on the corresponding nucleus and is portable. For example, when UNIX is installed on new hardware, the extensible nucleus (hardware-dependent/policy-independent) of UNIX is configured afresh on the spot to match the available hardware, and the policy-dependent portion of UNIX is then copied to make it a completely runnable operating system that drives the specific hardware in a broad family of various hardware products. This well-publicized approach was exploited in Windows NT, in which the nucleus kernel is surrounded by a number of compact subsystems so that the task of implementing NT on a variety of hardware platforms can be carried out easily.
FIGURE 3.11 A typical kernel architectural design of an operating system using a vertically layered kernel and a horizontally layered microkernel.
FIGURE 3.12 Basic design concept of operating systems using a microkernel (extensible nucleus) for the environments: (a) different multiple OSs when used on a single piece of hardware, and (b) a single portable OS when used on different hardware organizations.
The microkernel design ultimately tended to replace the traditional vertical layered concept of an operating system with a horizontal one (Figure 3.11) in which the operating-system components external to the microkernel reside at the same level and interact with each other on a peer-to-peer basis by means of messages passed through the microkernel, as shown in Figure 3.13. The microkernel here validates messages, passes them between components, and grants access to the hardware. This structure is, however, most conducive to a distributed processing environment in which the microkernel can pass messages either locally or remotely without necessitating changes in other operating-system components. Microkernels are further explored in some detail in Chapter 9, "Distributed Operating Systems".
FIGURE 3.13 A formal design approach of operating systems using horizontally structured microkernel for
the distributed processing environment.
FIGURE 3.14 A schematic block diagram of representative Windows NT operating system organization.
The NT Executive (Figure 3.14) is designed as a layer of abstraction of the NT Kernel and builds on the NT Kernel to implement a full set of policies, specific mechanisms for general objects, and various services that Windows NT offers, including process, memory, device, and file management. Since Windows NT builds on an object-oriented approach, it does not strictly adhere to the classic separation of OS functionalities observed in a layered architecture while creating its different modules. Instead, the NT Executive is designed and implemented as a modularized set of elements (Solomon, 1998): object manager, process and thread manager, virtual memory manager, and so on, at the source code level.
While the NT Kernel and Executive are designed and programmed as separate modules, they are actually combined into a single kernel executable when Windows NT is built. This combined module, along with the underlying HAL, provides the essential core elements of the OS and implements the full NT operating system, though this nucleus can be extended again by the subsystems that provide specific OS services.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
is the NT-based kernel inside Windows 2000, Windows XP, Windows Server 2003, Windows
Vista, and Windows Server 2008. The Windows NT-based operating system family architecture
essentially consists of two layers (user mode and kernel mode), with many different modules within
both of these layers.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
3.1.11 EXOKERNEL
The exokernel concept started to bloom around 1994, but as of 2005, exokernels were still a research effort and had not been used in any major commercial operating system; however, an experimental exokernel operating system was developed by the MIT Parallel and Distributed Operating Systems group. Traditionally, one of the targets of kernel design has been to keep individual hardware resources invisible to application programs by making the programs interact with the hardware via a conceptual model. These models generally include, for example, file systems for disk storage, virtual address spaces for memory, schedulers for task management, and sockets for network communication. Although these types of abstractions generally make the hardware easier to use in writing programs, they limit performance and stifle experimentation in adding new abstractions. The exokernel concept redefines this by letting the kernel only allocate the physical resources of the machine (e.g., disk blocks, memory pages, and processor time) to multiple executing programs and then letting each program link to an operating system library that implements familiar abstractions, or implement its own abstractions.
The notion behind this radically different approach is to force as few abstractions as possible on developers, enabling them to make as many decisions as possible about hardware abstractions. Applications may request specific memory addresses, disk blocks, and so on. The kernel only ensures that the requested resource is free and that the application is allowed to access it. Resource management need not be centralized; it can be performed by applications themselves in a distributed manner. This low-level hardware access allows the programmer to implement custom abstractions and omit unnecessary ones, most commonly to improve a program's performance. Consequently, an exokernel merely provides efficient multiplexing of hardware resources but does not provide any abstractions, leaving the programmers to choose what level of abstraction they want: high or low.
FIGURE 3.15 A typical schematic graphical overview of an exokernel used in the design of modern operating systems.
An application process now views a computer resource in its raw form, and this makes the primitive operations extremely fast: 10–100 times faster than when a monolithic UNIX kernel is used. For example, when data are read off an I/O device, they pass directly to the requesting process instead of going through the exokernel. Since traditional OS functionalities are usually implemented at the application level, an application can then select an OS function from a library of operating systems, as shown in Figure 3.15. The OS function can then be executed as a process in non-kernel mode, exploiting the features of the exokernel. This kernel is tiny in size, since its functionality is limited to only ensuring protection and multiplexing of resources, which is again much simpler than conventional microkernels' implementation of message passing and monolithic kernels' implementation of abstractions. Exokernels and nanokernels are sometimes considered more extreme versions of microkernels.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
introduction of many new aspects, including internet and Web technology, multimedia applications, the client/server model of computing, cloud computing on quasi-distributed systems, and true distributed computing, has also had an immense impact on the existing concept and design of operating systems. In addition, internet access for computers as well as LAN and WAN implementations to build clusters of computers have made systems totally exposed and consequently have invited increased potential security threats and more sophisticated attacks that eventually had an enormous impact on the design issues of emerging operating systems.
To address and accommodate all these issues, extensive modifications and enhancements in the existing structure of operating systems were found inadequate; the situation actually demanded fresh thoughts and new methods in the organization of operating systems. As a result, numerous forms of structures and designs of different operating systems with various design elements have been released for both scientific and commercial use. Most of the work that led to such developments can now be broadly categorized into the following areas:
Operating systems developed in recent years with the kernel–shell concept can be classified into two major distinct categories: monolithic kernel and microkernel. Each category has already been discussed in detail in previous sections.
Threads are a relatively recent development in the design of operating systems, an alternative form of a schedulable and dispatchable unit of computation in place of the traditional notion of a process. A thread is an entity that executes sequentially along a single path of execution using the program and other resources of its associated process, which provides the environment for the execution of threads. Threads incorporate some of the functionalities that are associated with their respective processes in general. They have been introduced mainly to minimize the system overhead associated with process switching, since switching back and forth among threads involves less processor overhead than a major switch between different processes. Threads have also been found to be useful for structuring kernel processes. As on a uniprocessor machine, threads can also be implemented in a multiprocessor environment, in a way similar to processes, in which individual threads can be allocated to separate processors to run simultaneously in parallel. The thread concept has been further enhanced and gave rise to a concept known as multithreading, a technique in which the threads obtained from dividing an executing process can run concurrently, thereby providing computation speedup. Multithreading is a useful means for structuring applications and kernel processes even on a uniprocessor machine. The emerging thread concept is distinguished from the existing process concept in many ways, the details of which are discussed in Chapter 4 ("Processor Management") as well as in subsequent chapters.
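A minimal POSIX-threads sketch of this idea is given below: two threads created inside one process share its address space and run concurrently, which is precisely the cheaper schedulable unit described above (compile with the pthread library, for example cc file.c -lpthread; the file name is illustrative).

/* Two threads sharing one process's address space. */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("thread %d running in the shared address space\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    int id1 = 1, id2 = 2;

    pthread_create(&t1, NULL, worker, &id1);   /* far cheaper than creating a process */
    pthread_create(&t2, NULL, worker, &id2);

    pthread_join(t1, NULL);                    /* wait for both threads to finish */
    pthread_join(t2, NULL);
    return 0;
}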
An object is conceived of as an autonomous entity (unit), a distinct software unit that consists of one or more procedures with a set of related data items to represent certain closely correlated operations, each of which is a sibling unit of computation. These procedures are called the services that the object provides, and the data associated with these services are called the attributes of the object. Normally, these data and procedures are not directly visible outside the object. Rather, various well-defined interfaces exist that permit software to gain access to these objects. Moreover, to define the nature of objects, the idea of a class is attached to an object to define its behavior, just as a program defines the behavior of its related process. Thus, a class behaves like an abstract data type (ADT) that maintains its own state in its private variables.
Thus, the unit of the "process model" is now used to exploit objects as an alternative schedulable unit of computation in the design of operating systems and other related system software. Objects react only to messages, and once an object is created, other objects start to send it messages. The newly created object responds by performing certain computations on its internal private data and by sending other messages back to the original sender or to other objects. Objects can interface with the outside world only by means of messages. Objects are discussed in more detail in Chapter 4.
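The tiny C sketch below models an object in exactly this sense: private attributes plus the services that operate on them, reachable only through a well-defined interface. The counter type and its operations are purely illustrative, and in a full design the structure definition would be hidden in a separate implementation file.

/* A conceptual "object": hidden state plus the services acting on it. */
#include <stdio.h>
#include <stdlib.h>

struct counter { int value; };                    /* private attribute */
typedef struct counter counter_t;

counter_t *counter_create(void)                   /* service: construction */
{
    counter_t *c = malloc(sizeof *c);
    if (c != NULL)
        c->value = 0;
    return c;
}
void counter_increment(counter_t *c)  { c->value++; }        /* service */
int  counter_read(const counter_t *c) { return c->value; }   /* service */
void counter_destroy(counter_t *c)    { free(c); }           /* service */

int main(void)
{
    counter_t *c = counter_create();   /* callers see only the interface */
    if (c == NULL)
        return 1;
    counter_increment(c);
    counter_increment(c);
    printf("counter = %d\n", counter_read(c));
    counter_destroy(c);
    return 0;
}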
The introduction of the kernel–shell concept in the design of the operating system, together with the inclusion of the thread model, established the multiuser environment with increasing user traffic load in virtually all single-user personal computers and workstations realized by a single general-purpose microprocessor, but ultimately failed to rise to the desired level of performance. At the same time, as the cost of microprocessors continued to drop due to the constant advent of more sophisticated electronic technology, it paved the way for vendors to introduce computer systems with multiple low-cost microprocessors, known as multiprocessors, that require only a little additional cost but provide a substantial increase in performance and accommodate multiple users.
The structures and designs of multiprocessors differ in numerous ways, mainly between shared memory (UMA) architecture and distributed shared memory (NUMA) architecture, but a multiprocessor can also be defined as an independent stand-alone computer system with many salient features and notable characteristics, and it possesses a number of potential advantages over the traditional uniprocessor system. Some of these, in particular, are performance, reliability, scalability, incremental growth, and ease of implementation.
The operating systems used in multiprocessors are thus totally different in concept and design from the traditional uniprocessor modern operating system, and also from each other due to differences in their hardware architectures. They hide the presence of the multiple processors from users and are totally transparent to them. They provide specific tools and functions to exploit parallelism as much as possible. They take care of scheduling and synchronization of processes or threads across all processors, manage the common and/or distributed memory modules shared by the processors, distribute and schedule the existing devices among the running processes, and provide a flexible file system to handle the many requests arriving almost simultaneously in the working environment.
In spite of having several potential merits, multiprocessors of different kinds with different forms and designs suffer from several shortcomings along with the related cost factors, which ultimately have been alleviated with the introduction of another form of architecture with multiple processors, known as multicomputers (Chakraborty, 2020). This architecture is essentially built up from a set of stand-alone, full-fledged computers, each with its own resources, connected by high-speed network interfaces (networks of computers) that eventually gave rise to computer networks and/or offer the appearance of a single system image, depending on how the arrangement is driven. This supports high processor counts with memory systems distributed among the processors, which yields cost-effective higher bandwidth and reduces latency in memory access, thereby resulting in an overall increase in processor performance. Each of the machines in a computer network may be locally driven by its own OS supported by a NOS for global access, or the machines may together be driven by a completely different type of operating system, known as a distributed operating system, that offers the illusion of a single-system image with a single main memory space and a single secondary memory space, plus other unified access facilities, such as a distributed file system. The distributed operating system dynamically and automatically allocates jobs to the various machines present in the arrangement in a transparent way for processing according to its own policy and strategy. These arrangements, along with the respective operating systems, are becoming increasingly popular, and there are many such products available in the market from different vendors.
The object concept has been gradually refined and continuously upgraded over time, ultimately culminating in the contemporary object-oriented technologies that have been exploited to introduce an innovative concept in the design of operating systems. One of the salient features of this object-oriented design is that it facilitates modular programming disciplines that enable existing applications to be extended by adding modules as and when required without affecting the existing logic and semantics. When used in relation to the design and development of operating systems, this approach can create modular extensions to an existing small kernel by means of adding the needed objects to it. Object-oriented design and structure enable developers to customize an operating system (and, in fact, any other product) with ease without disrupting system integrity. This approach has really enriched programming methodology and is considered an important vehicle in the design and development of both centralized and distributed operating systems.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
SUMMARY
We have already described the concept and step-wise evolution of the generic operating system from its very primitive form to the most sophisticated modern one, its basic characteristics, the common principles it follows, the numerous issues linked to it that explain the different functions it performs, and the common services it usually provides while driving the bare hardware on behalf of the user. With the constant introduction of more intelligent and sophisticated hardware platforms, many different operating systems have continuously emerged over the last fifty-odd years. This chapter illustrates the step-wise evolution of different design models based on numerous design objectives and their subsequent realization, along with the formation of the associated structures of different operating systems proposed over this period. The dynamism of computer systems was abstracted gradually by operating systems developed in terms of monolithic design, the extended machine concept, and then versatile layered/leveled concepts and designs. Later, the evolution of VMs and their subsequent impact facilitated redefining the concepts and designs of newer operating systems with an innovative form of the kernel–shell concept that eventually became able to manage networks of computers and computer network implementations. The kernel–shell concept was further reviewed and redefined from its monolithic kernel form to the more classic microkernel and later hybrid kernel forms to manage more sophisticated computing environments, including demands from a diverse spectrum of emerging application areas. A general discussion of the implementations of different models of operating systems is presented here with an illustration of popular representative operating systems as relevant case studies for each design model. This chapter concludes by giving an idea of modern operating systems used in both uniprocessor and multiple-processor environments (both multiprocessors and multicomputers), including the requirements they fulfill and the salient features they possess. In short, this chapter lays a foundation for the study of generic operating system structures and designs that are elaborately discussed in the rest of this book.
EXERCISES
1. What is meant by a monolithic structure of an operating system? What are the key features
of such a design? What are the main drawbacks of this design?
2. What are the factors that influenced the concept of an extended machine? How have these
ultimately been implemented in the design of the extended machine?
3. What paved the way to design operating systems with layered structures? How does this
design concept negotiate the complexities and requirements of a generalized operating
system?
4. State the strategy in the evolution of hierarchical-level design of an operating system. How
does this concept achieve the generalization that a versatile operating system exhibits?
5. State the levels and explain the functions that each level performs in the hierarchical-level
design of an operating system.
6. What are the factors that accelerate the evolution of a virtual machine? State the concepts
of a virtual machine. Why is it so called? Is a virtual machine a simple architectural
enhancement or a first step towards a renaissance in the design of an operating system?
Justify your answer with reasons.
7. What is the client–server model in the design of an operating system? What are the salient
features of a client–server model from a design point of view? What are the reasons that
make the client–server model popular in the user domain?
8. Discuss the environment under which the client–server model is effective. State and explain
the drawbacks of a client–server model.
9. “The kernel–shell model in the design of an operating system is a refnement of the exist-
ing extended machine concept”—give your answer in light of extended machine design.
10. What are the typical functions that are generally performed by a kernel in the kernel–shell
model of an operating system? What is meant by shell in this context? What are the func-
tions that are usually performed by a shell in the kernel–shell design of an operating system?
What are shell scripts? Why is the shell not considered a part of the operating system itself?
11. State and explain the features found in the design of a monolithic kernel.
12. “The virtual machine concept of the 1970s is the foundation for the emergence of the
microkernel of the 1990s”. Would you agree? Give reasons for your answer.
13. “The introduction of the microkernel concept in the course of design of a distributed oper-
ating system is a radical approach”. Justify.
14. Briefly explain the potential advantages observed in a microkernel design compared to
its counterpart, a monolithic design.
15. Give some examples of services and functions found in a typical monolithic kernel-based
operating system that may be external subsystems in a microkernel-based operating system.
16. Explain the main performance disadvantages found in a microkernel-based operating
system.
17. Explain the situation that forces the modern operating system to evolve. State the key fea-
tures that a modern operating system must include.
Learning Objectives
• To envisage the process model and its creation as a unit of execution, its description with
images, the different states it undergoes, and finally its use as a building block in the
design and development of generic operating systems.
• To define and describe threads, a smaller unit of execution, along with their specific
characteristics, the different states they undergo, and their different types.
• To introduce the concept of multithreading and its implementation in the design and devel-
opment of modern operating systems.
• To portray a comparison between the process and thread concepts.
• To define and describe objects, a bigger unit of execution, along with a description of
object-oriented concepts in the design and development of a few modern operating systems.
• To demonstrate numerous CPU scheduling criteria and the respective strategies to realize
various types of CPU-scheduling algorithms, including the merits and drawbacks of each
of these algorithms.
• To describe the different forms and issues of concurrent execution of processes and
describe the needs of interprocess synchronization.
• To articulate possible approaches using hardware and software solutions to realize inter-
process synchronization.
• To describe some useful synchronization tools, such as semaphores and monitors, includ-
ing their respective merits and drawbacks.
• To demonstrate a few well-known classical problems relating to interprocess synchroniza-
tion and also their solutions using popular synchronization tools.
• To describe the purpose and elements of interprocess communications.
• To demonstrate different models using message-passing and shared-memory approaches to
implement interprocess communications.
• To explain various schemes used to realize interprocess communications.
• To illustrate the approaches used to realize interprocess communication and synchroniza-
tion in Windows and UNIX operating systems.
• To explain deadlock and the related reasons behind its occurrence, along with different
approaches to detect and avoid such deadlock.
• To discuss starvation and the reasons behind it, along with the description of different
strategies to detect and avoid it.
4.1 INTRODUCTION
The operating system (OS) controls and monitors the entire computing system. It consists of a collection of interrelated computer programs, a part of which, when executed under its control by the processor, directs the CPU in the use of the other system resources and also determines the timing of the CPU's execution of other programs. While allowing the processor to do "useful" work for users, the OS relinquishes control of the processor and then takes control back at the right point in time to manage the system and prepare the processor to do the next piece of work. The execution of an individual program consisting of a sequence of instructions is sometimes referred to as a process or task. The part of the OS program that implements certain mechanisms to constantly control, manage, and supervise the activity of the processor in executing different system as well as user programs is
the subject of processor management. The terms processor management and process management are sometimes used interchangeably. In a well-designed, versatile operating system, a substantial amount of the execution time (sometimes more than 80%) is used only to execute the OS program.
1. Keep track of the resources (processors and the status of processes) using one of its modules
called the traffic scheduler.
2. In the presence of multiple processes (as in multitasking), it decides which process gets
the processor, when, and for how long. This is carried out by one of its modules known as the
processor scheduler (also viewed as a micro-scheduler).
It is to be noted that the job scheduler is a part of processor management, since the record-keeping
operations for job scheduling and process scheduling are very similar.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
FIGURE 4.1 A schematic view of the various process states and their interrelations used in the design of the
generic operating systems.
It is quite natural that, at any instant, a number of processes may be in the ready state, and as such, the system provides a ready queue in which each process, when admitted, is placed. Similarly, numerous events cause running processes to block, and such processes are then placed in a blocked queue, each one waiting until its individual event occurs. When an event occurs, all processes in the blocked queue that are waiting on that particular event are moved to the ready queue. Subsequently, when the OS attempts to choose another process to run, it selects one from the ready queue, and that process then enters the run state.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
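A hedged sketch of this bookkeeping is shown below; the structure and function names are hypothetical and not taken from any particular operating system, and the suspend states introduced later in this section are omitted for brevity. When an awaited event occurs, every PCB blocked on that event is unlinked from the blocked queue, marked ready, and placed on the ready queue from which the scheduler dispatches.

/* Simplified state bookkeeping: moving PCBs between queues on an event. */
#include <stddef.h>
#include <stdio.h>

typedef enum { NEW, READY, RUNNING, BLOCKED, EXIT_STATE } proc_state_t;

typedef struct pcb {
    int           pid;
    proc_state_t  state;
    int           waiting_event;   /* meaningful only while BLOCKED */
    struct pcb   *next;            /* link within its current queue */
} pcb_t;

static pcb_t *ready_queue   = NULL;
static pcb_t *blocked_queue = NULL;

static void enqueue(pcb_t **q, pcb_t *p) { p->next = *q; *q = p; }

/* Called when 'event' occurs: unblock every process waiting on it. */
static void event_occurred(int event)
{
    pcb_t **link = &blocked_queue;
    while (*link != NULL) {
        pcb_t *p = *link;
        if (p->waiting_event == event) {
            *link = p->next;            /* remove from the blocked queue */
            p->state = READY;
            enqueue(&ready_queue, p);   /* now eligible for dispatch     */
        } else {
            link = &p->next;
        }
    }
}

int main(void)
{
    static pcb_t a = { 1, BLOCKED, 7, NULL };
    static pcb_t b = { 2, BLOCKED, 9, NULL };
    enqueue(&blocked_queue, &a);
    enqueue(&blocked_queue, &b);
    event_occurred(7);                              /* wakes process 1 only */
    printf("head of ready queue: pid %d\n", ready_queue->pid);
    return 0;
}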
of main memory. Out of many acceptable solutions, the most convenient and workable solution to negotiate this situation is swapping, which involves moving part or all of a process from main memory to virtual memory on disk by the mid-term scheduler, thereby freeing up the occupied memory space for the use of the existing running processes or providing an option for newly created processes ready to be admitted. A queue can then be formed in virtual memory to keep track of all such temporarily swapped-out processes, called suspended processes.
When there are no more ready processes and the OS intends to choose another ready process to run, it can use one of two options to pick a process to bring into main memory: either it can bring in a previously suspended process already in the queue, or it can admit a newly created process in the ready state. Both approaches have their own merits and drawbacks. The problem is that admitting a new process may increase the total load on the system, but, alternatively, bringing a blocked process back into memory before the occurrence of the event for which it was blocked would not be useful, because it is still not ready for execution (the event has not yet occurred). One of the probable solutions is to specifically consider the states of the processes when they were swapped out and thus suspended. The suspended processes were actually in two different states: (i) a process was in the ready state but not running, waiting for its turn for the CPU, and (ii) a process was in the blocked state, waiting on an event to occur. So, there are two additional suspended states, ready-suspend and blocked-suspend, along with the existing primary process states already described. The state transition model is depicted in Figure 4.1.
Usually, when there is no ready process available in memory, the operating system generally
prefers to bring one ready-suspend process into main memory rather than admitting a newly created
process that may increase the total load of the system. But, in reality, there may also be other situ-
ations that might need completely different courses of action. In addition, a process in the blocked-
suspend state is usually moved to the ready-suspend state when the operating system comes to
know from the state information relating to suspended processes that the event for which the process
has been waiting has occurred.
In a more complex, versatile operating system, it may be desirable to define even more states. On the other hand, if an operating system were designed without the process concept in mind, it might be difficult for an observer to determine clearly the state of execution of a job at any point in time.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
FIGURE 4.2 Model structure of different control tables used in the design of the generic operating systems.
Process Table: This table is an array of structures, with one entry containing several fields holding all the information about each currently existing process, up to the maximum number of processes (entries) that the OS can support at any point in time. It is linked or cross-referenced in some fashion with the other tables relating to the management of other resources in order to refer to them directly or indirectly. An entry is made in this table when a process is created and is erased when the process dies. Although the process table is ultimately managed by the process manager, various other OS modules may change the individual fields of any entry in the process table. A fundamental part of the detailed design strategy for the process manager is reflected in the design of the process table data structure, which largely depends on the basic hardware environment and differs in design approach across different operating systems.
More details about this topic are given on the Support Material at www.routledge.com/9781032467238.
A case study showing the scheme of a UNIX process table entry is given on the Support
Material at www.routledge.com/9781032467238.
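Purely as an illustration (the field names are hypothetical, and real systems such as the UNIX proc structure or Linux's task_struct differ widely), one process-table entry might be sketched in C as follows; the cross-references to the memory and file tables appear simply as fields that point into those other tables.

/* A hypothetical process-table entry and the table that holds it. */
#include <stdint.h>
#include <stdio.h>

#define MAX_OPEN_FILES 16
#define MAX_PROCS      256

typedef enum { P_READY, P_RUNNING, P_BLOCKED, P_SUSPENDED, P_EXIT } pstate_t;

struct proc_entry {
    int       pid;                          /* process identifier               */
    int       ppid;                         /* parent process                   */
    pstate_t  state;                        /* current scheduling state         */
    int       priority;                     /* scheduling priority              */
    uintptr_t page_table_base;              /* cross-reference to memory tables */
    int       open_files[MAX_OPEN_FILES];   /* cross-reference to file tables   */
    uint64_t  cpu_time_used;                /* accounting information           */
};

/* The process table itself: one entry per existing process. */
static struct proc_entry process_table[MAX_PROCS];

int main(void)
{
    process_table[0].pid   = 1;             /* entry made when a process is created */
    process_table[0].state = P_RUNNING;
    printf("slot 0 holds pid %d\n", process_table[0].pid);
    return 0;
}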
FIGURE 4.3 A schematic design of process images created by the operating system stored in virtual memory.
process images in virtual memory using a continuous range of addresses, but in the actual implementation this may be different. It entirely depends on the memory management scheme being employed and the way the control structures are organized in the design of the operating system.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
A case study showing the scheme used in the creation of a UNIX process image is given on the
Support Material at www.routledge.com/9781032467238.
When the fork (label) command is executed, it results in the creation of a second process within the same address space as the original process, sharing the same program and all information, including the same variables. This new process then begins execution of the shared program at the statement with the specified label. The original process executing fork continues its execution as usual at the next instruction, and the new processes then coexist and proceed in parallel. Fork usually returns the identity of the child to the parent process, which uses it to refer to the child henceforth for all purposes. When a process terminates itself, it uses the quit() instruction; consequently, the process is destroyed, its process control block is erased, and its memory space is released. The join (count) instruction is used by a parent process to merge two or more processes into a single one for the sake of synchronization with its child (children). At any instant, only one process can execute the join statement (system call), and its execution cannot then be interrupted: no other process is allowed to get control of the CPU until the process finishes executing it. This is one major strategy that a code segment implements in order to enforce mutual exclusion (to be discussed later).
Modern systems use a spawn command (fork, CreateProcess, or other similar names) that creates a child process in a separate address space for execution, thereby enabling every child and sibling process to have its own private, isolated address space. This also helps the memory manager isolate one process's memory contents from the others. Moreover, the child process also ought to be able to execute a different program from the parent, and this is accomplished by using a mechanism (usually the new program and arguments are named in the system call) that enables the child to redefine the contents of its own address space. A new process is thus created, and that program then runs directly. One of the drawbacks of this model is that any parameters of the child process's operating environment that need to be changed must be included in the parameters to spawn, and spawn has its own standard way of handling them. There are, of course, other ways to handle the proliferation of parameters and solve this problem.
An important difference between the two systems is that while spawn creates a new address space derived from the program, the fork call must create a copy of the parent's address space. This can be wasteful if that address space will be deleted and rewritten after a few instructions. One solution to this problem is a second system call, vfork, that lets the child process use the parent's memory until an exec is made. We will discuss other mechanisms to mitigate the cost of fork when we talk about memory management. However, which model is "better" is an open issue. The tradeoffs here are, as usual, flexibility vs. overhead.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
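A minimal UNIX-flavoured sketch of these ideas is shown below (the program run by the child is arbitrary): fork() duplicates the parent's address space, the child immediately throws that copy away with an exec call, and the parent's waitpid() plays the join-like role of waiting for the child to finish. A spawn-style interface such as posix_spawn() combines the create-and-load steps into a single call.

/* fork() followed by exec, with the parent waiting for the child. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t child = fork();                /* child receives a copy of this address space */

    if (child == 0) {
        /* Child: discard the copied image and run a different program. */
        execlp("echo", "echo", "hello from the child", (char *)0);
        _exit(1);                        /* reached only if exec fails */
    } else if (child > 0) {
        int status;
        waitpid(child, &status, 0);      /* parent blocks until the child terminates */
        printf("child %d finished\n", (int)child);
    } else {
        perror("fork");
        return 1;
    }
    return 0;
}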
A case study showing the creation of cooperating processes by using fork, join, and quit is given on
the Support Material at www.routledge.com/9781032467238.
A case study showing the mechanisms in process creation in UNIX is given on the Support
Material at www.routledge.com/9781032467238.
Since the currently running process is to be moved to another state (ready, blocked, etc.), the
operating system must make substantial changes in the global environment to effect this state transi-
tion, mostly involving the execution of the following steps:
• Save the context of the processor, including the program counter and other registers into
the user stack.
• Modify and update the process control block of the currently running process, changing
the state of this process to one of the other states: ready, blocked, ready-suspend, or exit. Other
relevant fields must also be updated, including the reason for leaving the running state and
other accounting information (processor usage).
• Move the PCB of this process to the appropriate queue (ready, blocked on event i, ready-
suspend, etc.).
• Select another process for execution based on the policy implemented in the process sched-
uling algorithm, which is one of the major design objectives of the operating system.
• Update the process control block of this process thus scheduled, including changing the
state of this process to running.
• Update memory-management data structures. This may be required depending on how the
address translation mechanism is implemented. This topic is explained in Chapter 5.
• Restore the context of the processor to the state at the time the scheduled process was last
switched out of the running state by way of loading the previous values of the program
counter and other registers.
• Finally, the operating system initiates a mode switch to reactivate the user space and sur-
renders control to the newly scheduled process.
It is now evident that process switching, causing a change in state, is considerably complex and time-consuming and, above all, requires substantial effort to carry out. Minimizing this time consumption and thereby increasing the efficiency of process switching is thus considered one of the major design objectives of a process-based operating system; this, in turn, requires additional hardware and its supporting structure. To make this activity even faster, an innovative special process-structuring technique, historically called a thread, was introduced, which will be discussed later.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
The elements that constitute the context of the current program that must be preserved are only those that may be modified by the ISR and that, when restored, will bring the machine back to the same state it was in just prior to the interrupt. Usually, the context is assumed to mean the contents of all processor registers and status flags and perhaps some variables common to both the interrupted program and the ISR, if any. The mechanism involved in changing context from an executing program to an interrupt handler is called a context switch. Since the interrupted program is not at all aware of the occurrence of the interrupt, nor does it have any idea of the machine context that may be modified by the ISR during its execution, the ISR is entirely entrusted with the responsibility of saving and restoring the needed context of the preempted activity. Context switching is mostly hardware-centric; the save/restore operation is, no doubt, much faster than its counterpart, the software approach. But whichever approach is used, context switching is comparatively less costly and much faster than its counterpart, process switching, which is considerably more complex and also more expensive. Linux's well-tuned context switch code usually runs in about 5 microseconds on a high-end Pentium.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
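As a user-level analogue of saving and restoring a processor context (only an analogue: a real kernel switch also updates PCBs, queues, and memory-management state, and does the register work in architecture-specific code), the sketch below uses the POSIX ucontext facility, which is obsolescent in recent POSIX editions but still widely available on UNIX-like systems.

/* Saving one execution context and resuming another with swapcontext(). */
#include <ucontext.h>
#include <stdio.h>

static ucontext_t main_ctx, task_ctx;
static char task_stack[64 * 1024];       /* private stack for the second context */

static void task(void)
{
    printf("task: running after the switch\n");
    swapcontext(&task_ctx, &main_ctx);   /* save my context, resume the dispatcher */
}

int main(void)
{
    getcontext(&task_ctx);               /* initialize, then point it at task()    */
    task_ctx.uc_stack.ss_sp   = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link          = &main_ctx;
    makecontext(&task_ctx, task, 0);

    printf("dispatcher: saving my context and switching\n");
    swapcontext(&main_ctx, &task_ctx);   /* save current context, load the other   */
    printf("dispatcher: back again; the task's context was saved\n");
    return 0;
}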
FIGURE 4.4 Different operating system services (functions) execute as separate processes used in the
design of the generic operating systems.
FIGURE 4.5 A schematic view of the thread concept and its relationship with its parent process.
share the program and resources of that process, thereby causing a reduction in state. In thread-based systems, a process with exactly one thread is equivalent to a traditional process. Each thread belongs to exactly one process, and no thread can exist outside a process. Here, processes or tasks are static and correspond to passive resources, and only threads can be scheduled to carry out program execution on the processor. Since all resources other than the processor are managed by the parent process, switching between related threads is fast and quite efficient. But switching between threads that belong to different processes incurs the full process-switch overhead as usual. However, each thread is characterized by the following (as against the characteristics usually associated with each process):
• The hardware state: Each thread must have a minimum of its own allocated resources, including memory, files, and so on, so that its internal state is not confused with the internal state of other threads associated with the same process.
• The execution state: Similar to a portion of the traditional process’s status information
(e.g. running, ready, etc.).
• A saved processor context: When it is not running.
• A stack: To support its execution.
• Static storage: To store the local variables it uses.
• OS table entries: Required for its execution.
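The per-thread items listed above are typically collected in a small descriptor. The following struct is only an illustrative sketch of what such a thread descriptor might contain; all field names are assumptions, not taken from any real kernel.

#include <stdint.h>
#include <stddef.h>

struct pcb;   /* the parent process's control block, defined elsewhere */

typedef enum { T_READY, T_RUNNING, T_WAITING, T_DEAD } thread_state_t;

typedef struct tcb {
    int             tid;
    thread_state_t  state;        /* execution state                          */
    uint64_t        pc, sp;       /* saved processor context when not running */
    uint64_t        regs[16];
    void           *stack_base;   /* the stack that supports its execution    */
    size_t          stack_size;
    void           *local_store;  /* static storage for its local variables   */
    struct pcb     *owner;        /* all other resources live in the parent   */
    struct tcb     *next;         /* link used by the OS table entries        */
} tcb_t;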
When compared to the creation and termination of a new process, it takes far less time to create a new thread in an existing process and also less time to terminate a thread. Moreover, thread switching within a process is much faster than its counterpart, process switching. So, if an application can be developed as a set of related executable units, it is far more efficient to execute it as a collection of threads rather than as a collection of separate processes. In fact, the existence of multiple threads per process speeds up computation in both uniprocessor and multiple-processor systems, particularly in multiprocessor systems, and also in applications on network servers, such as a file server that operates in a computer network built on a multicomputer system. Threads also provide a suitable foundation for true parallel execution of applications on shared-memory multiprocessors. Other effective uses of threads are found in communication-processing applications and transaction-processing monitors, where the use of threads simply makes the design and coding of these applications, which service concurrent requests, much easier to realize. The thread paradigm is also an ideal approach for implementing and managing virtual terminal sessions in the context of a physical terminal, as shown in Figure 4.6; such sessions are now widely used in contemporary commercial windowing systems.
FIGURE 4.6 A broad view of implementing and managing virtual-terminal session windows on a physical terminal using the thread paradigm.
In fact, threads exhibit a compromise between two different philosophies of operating system implementation: (i) conventional heavy, state-laden process-based operating systems that offer adequate protection but can impair real-time performance, as observed in UNIX, and (ii) lean and fast real-time operating systems that sacrifice protection for the sake of time-critical performance. Threads provide, on one hand, the benefits of speed and sharing to the related threads that constitute a single application but, on the other hand, also offer full protection while threads in different applications communicate with one another.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
Thread states and thread state transitions are also analogous to process states and process state transitions, respectively. The different states that a thread may go through are:
• Ready state: The thread is waiting for its turn to gain access to the CPU for execution.
• Running state: A thread is said to be in the running state when the CPU is executing it. This means that the code attached to the thread is being executed.
• Waiting state: A thread is said to be in the waiting state if it was given a chance to execute but did not complete its execution for some reason. It may choose to go to sleep, thereby entering a sleeping state. Another possibility is that some other thread suspends the currently running thread, which then enters a suspended state. (Explicitly suspending another thread in this way is deprecated in newer thread APIs.)
• Dead state: The thread has finished its execution.
Similar to a process scheduler, a thread scheduler in the more traditional model switches the proces-
sor (CPU) among a set of competing threads, thereby making a thread undergo its different states.
In some systems, the thread scheduler is a user program, and in others, it is part of the OS. Since
threads have comparatively few states and less information needs to be saved while changing states,
the thread scheduler has a lower amount of work to do when actually switching from one thread to
another than what is required in process switching. Hence, one of the important motivations in favor
of using threads is the reduced context-switching time that enables the processor to quickly switch
from one unit of computation (a thread) to another with minimal overhead. In addition, the thread
manager also requires a descriptor to save the contents of each thread’s registers and associated stack.
FIGURE 4.7 A schematic approach of the mechanism of kernel-level threads scheduling used in the design
of thread–based operating systems.
With kernel-level threads (KLTs), the kernel creates and manages threads in ways similar to those used for processes. Synchronization and scheduling may be provided by the kernel.
To create a new KLT, a process issues a system call, create_thread, and the kernel then assigns
an id to this new thread and allocates a thread control block (TCB) which contains a pointer to the
PCB of the corresponding process. This thread is now ready for scheduling. This is depicted in
Figure 4.7.
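On systems where the POSIX thread library is layered one-to-one over kernel-level threads (as with Linux's NPTL), the create_thread call described above corresponds roughly to pthread_create. The short program below is only an illustration of that usage; it is not taken from the text.

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    printf("kernel-scheduled thread %ld is running\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t tid;                                 /* id assigned to the new thread        */
    if (pthread_create(&tid, NULL, worker, (void *)1L) != 0)
        return 1;                                  /* creation failed                      */
    pthread_join(tid, NULL);                       /* wait until the new thread terminates */
    return 0;
}

On most UNIX-like systems, such a program is compiled with the -pthread option.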
In the running state, when the execution of a thread is interrupted due to the occurrence of
an event, or if it exceeds the quantum, the kernel then saves the CPU state of the interrupted
thread in its TCB. After that, the scheduler considers the TCBs of all the ready threads and
chooses one of them to dispatch. It does not have to take into account which process the selected
thread belongs to, but it can if it wants to. The dispatcher then checks whether the chosen thread
belongs to a different process than the interrupted thread by examining the PCB pointer in the
TCB of the selected thread. If so, the process switch occurs, the dispatcher then saves all the
related information of the process to which the interrupted thread belongs and loads the con-
text of the process to which the chosen thread belongs. If the chosen thread and the interrupted
thread belong to the same process, the overhead of process switching is redundant and hence
can be avoided.
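In code form, the decision just described boils down to comparing the PCB pointers stored in the two TCBs; only when they differ is a full process (address-space) switch performed. The routine below is a hypothetical sketch with stubbed-out helpers, not actual kernel code.

struct pcb;                                   /* process control block, defined elsewhere */
struct tcb { struct pcb *owner; /* ... saved CPU state ... */ };

/* Stubs standing in for the real context-switching machinery. */
static void save_process_context(struct pcb *p)   { (void)p; }
static void load_process_context(struct pcb *p)   { (void)p; }
static void restore_thread_context(struct tcb *t) { (void)t; }

/* Dispatch a kernel-level thread: the costly process switch happens only
 * when the chosen thread belongs to a different process than the one
 * that was interrupted.                                                 */
void dispatch_klt(struct tcb *interrupted, struct tcb *chosen)
{
    if (chosen->owner != interrupted->owner) {
        save_process_context(interrupted->owner);
        load_process_context(chosen->owner);
    }
    restore_thread_context(chosen);           /* always restore the thread's CPU state */
}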
Many modern operating systems directly support both time-sliced and multiprocessor threading with a process scheduler. The OS kernel, however, also allows programmers to manipulate threads via the system call interface. In some implementations, these are called kernel threads, whereas a lightweight process (LWP) is a specific type of kernel thread that shares the same state and information. Absent such support, programs can still implement threading by using timers, signals, or other methods to interrupt their own execution and hence perform a sort of ad hoc time-slicing. These are sometimes called user-space threads.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
FIGURE 4.8 An overview of user-level threads and their scheduling approach used in the design of thread-based operating systems.
The thread library switches among the ready threads of a process so as to keep the process continuously in operation. If the thread library cannot find a ready thread in the process, it makes a system call to block itself. The kernel then intervenes and blocks the process. The process will be unblocked only when an event occurs that eventually activates one of its threads; the process will then resume execution in the thread library function, which will perform scheduling and switch to the execution of the newly activated thread. As the OS treats the running process like any other, there is no additional kernel overhead for ULTs. However, ULTs only run when the OS schedules their underlying process.
The thread library code is a part of each process that maps the TCBs of the threads into the PCB
of the corresponding process. The information in the TCBs is used by the thread library to schedule
a particular thread and subsequent arrangement for its execution. This is depicted in Figure 4.8. The
scheduling algorithm can be any of those described in process scheduling, but in practice, round-
robin and priority scheduling are most common. The only constraint is the absence of a clock to
interrupt a thread that has run too long and used up the process’s entire quantum. In that situation,
the kernel will select another process to run. However, while dispatching the selected thread, the
CPU state of the process should be the CPU state of the thread, and the process-stack pointer should
point to the thread’s stack. Since the thread library is a part of a process, the CPU executes in non-
privileged (user) mode; hence, the loading of the new information into the PCB (or PSW) required
at the time of dispatching a thread demands the execution of a privileged instruction (to change user
mode to kernel mode) by the thread library in order to change the PCB’s (or PSW’s) contents and
also to load the address of the thread’s stack into the stack address register. It then executes a branch
instruction to transfer control to the next instruction of the thread. The execution of the thread now
starts.
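A user-level thread library of the kind sketched above can be built entirely in user space. On POSIX systems, the <ucontext.h> primitives getcontext, makecontext, and swapcontext are one commonly available (if now obsolescent) way to do it; the toy program below creates one user-level thread with its own stack and switches to and from it without any kernel-level scheduling being involved. It is only an illustration, not the mechanism any particular thread library actually uses.

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, thread_ctx;

static void thread_func(void)
{
    printf("user-level thread: running, scheduled purely in user space\n");
    swapcontext(&thread_ctx, &main_ctx);        /* voluntarily yield back to "main" */
    printf("user-level thread: resumed, now finishing\n");
}                                               /* on return, uc_link resumes main  */

int main(void)
{
    getcontext(&thread_ctx);                    /* initialize the new context        */
    thread_ctx.uc_stack.ss_sp   = malloc(STACK_SIZE);   /* the thread's own stack    */
    thread_ctx.uc_stack.ss_size = STACK_SIZE;
    thread_ctx.uc_link          = &main_ctx;    /* where to continue when it returns */
    makecontext(&thread_ctx, thread_func, 0);

    swapcontext(&main_ctx, &thread_ctx);        /* "dispatch" the user-level thread  */
    printf("main: thread yielded\n");
    swapcontext(&main_ctx, &thread_ctx);        /* resume it until it completes      */
    printf("main: thread finished\n");
    free(thread_ctx.uc_stack.ss_sp);
    return 0;
}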
User-level threads replicate some kernel-level functionality in user space. Examples of user-level thread systems are Nachos and Java (on operating systems that do not support kernel threads). In the case of Java threads, the JVM is used, and the JVM is typically implemented on top of a host operating system. The JVM provides the Java thread library, which can be linked with the thread library of the host operating system by using APIs. The JVM for the Windows family of operating systems might use the Win32 API when creating Java threads, whereas Linux, Solaris, and Mac OS X systems might use the pthreads API provided in the IEEE POSIX standard.
The thread library creates ULTs in a process and associates a (user) thread control block
(UTCB) with each user-level thread. The kernel creates KLTs in a process and associates a
kernel thread control block (KTCB) with each KLT. This is depicted in Figure 4.9, which
shows different methods of associating ULTs with KLTs.
In the many-to-one association method [Figure 4.9(a)], all ULTs created in a process by the thread
library are associated with a single KLT which is created in each process by the kernel. This method of
association provides a similar effect as in mere ULTs; ULTs can be concurrent without being parallel
(since they are the smallest unit of computation), thread switching incurs low overhead, and blocking of
a user-level thread leads to blocking of all threads in the process. Solaris initially implemented the JVM
using the many-to-one model (the green thread library). Later releases, however, changed this approach.
In the one-to-one association method [Figure 4.9(b)], each user-level thread is permanently
mapped into a KLT. This method of association provides a similar effect as in mere KLTs. Here,
threads can operate in parallel on different CPUs in a multiple processor system; however, switching
between threads is performed at the kernel level and thus incurs high overhead. As usual, blocking
of a user-level thread does not block other ULTs because they are mapped to different KLTs. For
example, the Windows XP operating system uses the one-to-one model; therefore, each Java thread
for a JVM running on such a system maps to a kernel thread. Beginning with Solaris 9, Java threads were likewise mapped using the one-to-one model.
The many-to-many association method [Figure 4.9(c)] is possibly the most advantageous one.
This method produces an effect in which ULTs may be mapped into any KLT. Thus, it is possible
to achieve parallelism between ULTs by mapping them into different KLTs, but the system can
perform switching between ULTs mapped to the same KLT without incurring high overhead. Also,
blocking a user-level thread does not block other ULTs of the process that are mapped into different
KLTs. Of course, this method requires a complex mechanism that has been observed in the imple-
mentation of later versions of the Sun Solaris operating system and Tru64 UNIX.
FIGURE 4.9(a), (b), (c) Different methods of associating user-level threads with kernel-level threads in the
design of thread–based operating systems: (a) Many-to-one, (b) One-to-one, (c) Many-to-many.
• A thread can voluntarily release control by explicitly going to sleep. In such a case, all the
other threads are examined, and the highest-priority ready thread is allocated to the CPU
to run.
• A higher-priority thread can preempt a running thread. In this case, as soon as the higher-
priority thread wants to run, it does. This is called preemptive multitasking.
However, assignment of priority to threads at the right point in time has several uses and has an
immense impact on controlling the environment in which the thread runs.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
4.14.6 MULTITHREADING
In thread-based systems, a process with exactly one thread is equivalent to a classical process. Here, the relationship between threads and processes is one-to-one (1:1), and each thread of execution is a unique process with its own address space and resources. An example system of this type is UNIX System V. If there are multiple threads within a single process, the relationship is many-to-one (M:1), and the process defines an address space and dynamic resource ownership; multiple threads may be created and executed within that process. Representative operating systems of this type are OS/2, MACH, and MVS (IBM large systems). Other relationships, many-to-many (M:M) and one-to-many (1:M), also exist.
Breaking a single application into multiple threads enables one to impose great control over the modularity of that application and the timing of application-related events. From the job or application point of view, this multithreading concept resembles and is equivalent to a process on most other operating systems. In brief, it can be said that:
• The concept of multithreading facilitates developing efficient programs that result in optimum utilization of the CPU, minimizing CPU idle time.
• A multithreaded program contains two or more parts that can run concurrently.
• Each part of such a program is a separate thread that defines a separate path of execution.
• Multithreading enables a single program to perform two or more tasks simultaneously. For example, a text editor can format text while it is engaged in printing, as long as these two actions are performed by two separate threads.
Concurrency among processes can also be achieved with the aid of threads, because threads in different processes may execute concurrently. Moreover, multiple threads within the same process may be allocated to separate processors (in a multiprocessor system) and can then be executed concurrently, resulting in excellent performance improvement. A multithreaded process achieves concurrency without incurring the overhead of using multiple processes. As already mentioned, threads within the same process can exchange information through shared memory and have access to the shared resources of the underlying process. Windows NT supports multithreading, and multiple threads can likewise be executed in parallel on many other computer systems.
Multithreading is generally implemented by time-slicing, wherein a single processor switches between different threads; in this case the processing is not literally simultaneous, since a single processor can only do one thing at a time, but the switching can happen so fast that it creates an illusion of simultaneity to the end user. For instance, if a PC contains a processor with only a single core, multiple programs can still apparently run simultaneously, such as typing in a document with a text editor while listening to music in an audio playback program. Though the user experiences these things as simultaneous, in reality the processor quickly switches back and forth between these separate processes.
FIGURE 4.9(d) Interleaving of thread processing over time in uniprocessor operating systems.
FIGURE 4.9(e) Interleaving of thread processing over time, with parallel execution, on a multicore system.
Since threads expose multitasking to the user (cheaply), they are more powerful but also more complicated: thread programmers have to explicitly address multithreading and synchronization. An object-oriented multithreaded process is an efficient means of implementing a server application. For example, one server process can service a number of clients, with each client request triggering the creation of a new thread within the server.
In a single processor chip with multiple computing cores (multicore), each core appears as a separate processor to the operating system. With efficient use of these multiple cores, a multithreaded approach can be implemented more effectively, eventually yielding improved overall concurrency. For example, consider an application with four threads. On a system with a processor having a single computing core, concurrency merely means that the execution of these threads will be interleaved over time, as the processing core is capable of executing only one thread at a time. This is illustrated in Figure 4.9(d). On a system with a processor having multiple cores, concurrency means that the threads can run in parallel, as the system can assign a separate thread to each core, giving rise to efficient parallel execution. This situation is depicted in Figure 4.9(e). Programs can now be designed in a multithreaded pattern to take advantage of multicore systems and yield improved performance.
Threads belonging to the same process typically share the state information of that process and share memory and other resources directly but are able to execute independently. Context switching between threads in the same process is certainly faster than context switching between processes. Systems like Windows NT and OS/2 are said to have "cheap" threads and "expensive" processes. In other operating systems, not much of a difference can be observed. In Linux, there is essentially no distinction between the concepts of processes and threads; however, multiple threads in Linux can be grouped together in such a way that one effectively has a single process comprising multiple threads.
A multithreading approach allows multiple threads to exist in a single process, and this model provides a useful abstraction of concurrent execution, since the threads of the program lend themselves to realizing such operation. In fact, a multithreaded program operates faster on computer systems with multiple CPUs, on CPUs with multiple cores, or across a cluster of machines. In these situations, threads must be carefully handled to avoid race conditions and need to rendezvous (meet by appointment) in time in order to process data in the correct order. Threads may also require atomic operations (often implemented using semaphores) to prevent common data from being simultaneously modified, or read while in the process of being modified. Nevertheless, improper handling of threads may lead to critical situations with adverse effects. However, perhaps the most interesting aspect of this approach is that, when applied to a single process on a multiprocessor system, it yields parallel execution.
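As a concrete illustration of the atomicity point above, the small pthread program below protects a shared counter with a mutex so that concurrent increments are not lost; a semaphore could be used to the same effect. The names and numbers are arbitrary.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *incrementer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);      /* enter the critical section         */
        counter++;                      /* the shared data is modified safely */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, incrementer, NULL);
    pthread_create(&t2, NULL, incrementer, NULL);
    pthread_join(t1, NULL);             /* rendezvous: wait for both threads  */
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}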
Operating systems generally implement threads in one of two ways: preemptive multithreading
or cooperative multithreading. Preemptive multithreading is, however, generally considered supe-
rior, since it allows the operating system to determine when to make a context switch. Cooperative
multithreading, on the other hand, relies on the threads themselves to relinquish control once they
are at a stopping point. This can create problems if a thread is waiting for resource availability. The
disadvantage to preemptive multithreading is that the system may make a context switch at an inap-
propriate time, causing priority inversion or other ill effects which may also be avoided by use of
cooperative multithreading.
Traditional mainstream computing hardware did not have much support for multithreading. Processors in embedded systems supporting real-time behavior might provide multithreading by decreasing the thread-switch time, perhaps by allocating a dedicated register file for each thread instead of saving/restoring a common register file. In the late 1990s, the idea of simultaneous execution of instructions from multiple threads became known as simultaneous multithreading. This feature was introduced in Intel's Pentium 4 processor under the name hyper-threading.
User-level threads, lightweight processes (LWPs), and kernel-level threads may be related to one another in many different fashions. Within a process, there may be one or more threads (multithreading) connected in many different ways with one or more LWPs of the corresponding process. An LWP within a process is visible to the application, and the LWP data structure can be obtained from the respective process address space. Each LWP, in turn, is always bound to exactly one dispatchable kernel thread, which is a fundamental computational entity, and the data structure for that kernel thread is maintained within the kernel's address space. The kernel creates, runs, and destroys these kernel threads to execute specific system functions. The use of kernel threads instead of kernel processes to implement system functions eventually reduces the overhead of switching within the kernel (thread switching is much faster and less costly than process switching).
For more details about this topic, with figures, see the Support Material at www.routledge.com/9781032467238.
execute a non-executable piece of data). This property in the definition of the object is known as encapsulation and offers two distinct advantages:
• It not only protects the objects from any corruption but also safeguards against the types of problems that may arise from concurrent accesses, such as deadlocks.
• It hides the internal structure of the object so that interaction with the object is relatively simple and standardized. Moreover, since the object is modular in concept, if the internal structure or the procedures associated with an object are modified without changing its external functionality, other objects are unaffected. This makes any modification or enhancement in an object-oriented design straightforward.
If a process is represented by an object, then there will be one object for each process present in a system. Clearly, every such object needs its own set of variables. But if the methods (procedures) in the object are re-entrant procedures, then all similar objects, including every new object of a similar type, could share the same procedures, each with its own set of variables.
To avoid all such difficulties, the object concept is redefined to make a distinction between an object class and an object instance. An object class is defined as a template that specifies both the variables (attributes) and the procedures (services) attached to a particular type of object. An object instance is an actual object that includes the characteristics of the class that defines it; the instance contains values for the variables defined in the object class. The operating system can then create specific instances of an object class as needed. For example, there is a single process object class and one process object for every currently active process. This approach simplifies object creation and management. Objects can be defined separately in terms of processes and threads individually, giving rise to a process object and a thread object. The characteristics of these objects are, of course, different and can be suitably exploited when implemented in the design and development of operating systems using both processes and threads.
Object-oriented concepts are becoming increasingly important in the design and development of operating systems. The object-oriented structure assists in the development of a general-purpose process facility to provide support for a variety of operating system environments. Objects define another mechanism for specifying the behavior of a distributed system of computational units. This is done by specifying the behavior of the individual units of serial computation and the model by which they are coordinated when they execute. Specialized support for the object model enables traditional sequential programmers to develop a strong intuition about distributed computing. This helps programmers take advantage of contemporary multiprocessors and also clusters of interconnected computers.
• Keeping track of the status of the process (all processes are either running, ready, or blocked). The module that performs this function is called the traffic controller.
• Deciding which process gets a processor: when and for how long. This is performed by the processor scheduler.
• Allocation of the processor to a process. This requires resetting processor registers to correspond to the process's correct state and is performed by the traffic controller.
• Deallocation of the processor, such as when the running process exceeds its current quantum (time-slice) or must wait for I/O completion. This requires that all processor state registers be saved to allow future reallocation. This task is performed by the traffic controller.
Based on certain pre-defined sets of criteria, the short-term scheduler thus always attempts to maximize system performance by switching the state of deserving processes from ready to running. It is invoked whenever an event (internal or external) occurs that may eventually change the global state of the system. For any such change, the currently running process may be interrupted or preempted in favor of other existing processes, and the next deserving process is then scheduled to run. The events that force changes in global system states and thereby require rescheduling include, for example, clock (timer) interrupts, I/O interrupts and I/O completions, operating-system calls, signals, and the creation, blocking, or termination of processes.
In general, whenever one of these events occurs, the short-term scheduler is invoked by the operating system to take action, mostly to schedule another deserving process and allocate the CPU to it for execution. The responsibilities the short-term scheduler performs in coordination with the activities of the other two schedulers while providing process-management OS services have already been discussed in Chapter 2 and hence are not repeated here.
User-oriented criteria relate to the behavior of the system that directly affects the individual
user or process. One such example is response time in interactive systems.
System-oriented criteria emphasize effective and efficient use of the CPU. An example of this category is throughput, which is the rate at which processes are completed.
Performance-related criteria focus on quantitative yield by the system and generally can be
readily measured. Examples include response time and throughput.
Non-performance-related criteria are qualitative in nature and cannot readily be measured or
analyzed. An example of this category is predictability.
However, all these dimensions can be summarized in the following set of criteria that can be used as a guideline in designing a well-defined policy:
• Performance-Related Criteria
• System-Oriented: Throughput, processor utilization (efficiency)
• User-Oriented: Response time, turnaround time, deadlines
• Other Criteria
• System-Oriented: Fairness, waiting time, priority, resource utilization
• User-Oriented: Predictability
More details on this topic are given on the Support Material at www.routledge.com/9781032467238.
Improving performance for one class of processes usually means degrading performance for some other class. Moreover, giving one user more means giving other users less. At best, there can be a reasonable distribution of CPU time aimed at the attainment of a desired goal. There is certainly no other way out.
While a scheduling policy aims to fulfill several competing goals (criteria), some of these criteria oppose one another. For example, to minimize response time for interactive users, the scheduler should try to avoid running any batch jobs during prime daytime hours, even if there is a tremendous flow of incoming batch jobs. The batch users probably will not be happy with this algorithm; moreover, it violates the turnaround criterion. Another example is that while increased processor utilization is achieved by increasing the number of active processes, this, in turn, causes response time to deteriorate (increase). The design approach varies from one environment to another, and a careful balance of all these conflicting requirements and constraints is needed to attain the desired goal appropriate to the specific environment. For example, the design objectives of a batch system environment will focus more on providing an equitable share of the processor per unit time to each process (user) or on better throughput and increased resource utilization. Multi-user systems are usually designed with more emphasis on minimizing terminal response time, while real-time operating systems favor the ability to quickly and responsively handle bursts of external events to meet certain deadlines.
FIGURE 4.10 Different types of schedulers (preemptive and nonpreemptive, the latter subdivided into cooperative and run-to-completion) and their interrelationships.
process, such as the time already spent in execution, the time already spent in the system (including execution and waiting), the total service time required by the process, and similar other factors relating to the specified criteria.
A run-to-completion scheduler means that a job, once scheduled, will be run to completion; such a scheduler is known as a nonpreemptive scheduler. Another simple approach in this category is for the scheduler to assume that each process will periodically invoke the scheduler explicitly, voluntarily releasing the CPU and thereby allowing other processes to use it (a cooperative scheduler). This nonpreemptive approach to short-term (process) scheduling allows the running process to retain ownership of the processor, and sometimes even of other allocated resources, until it voluntarily surrenders control to the OS or does so as a result of its own action, say, waiting for an I/O completion. This drawback can be removed only if the operating system itself provides some arrangement that forces the running process to stop at an arbitrary instant, thereby imposing involuntary sharing of the CPU. This strategy, in which the OS forces temporary suspension of logically runnable processes, is popularly known as preemptive scheduling. Only then can another ready process be scheduled.
With preemptive scheduling, a running process may be interrupted at any instant and be moved to
the ready state by the operating system, allowing any other deserving process to replace it. Preemption
thus generally necessitates more frequent execution of the scheduler and may even lead to a critical
race condition (to be discussed later), which may be prevented only by using another method. In addi-
tion, preemption incurs more overhead than nonpreemptive methods since each process rescheduling
demands a complete costly process switch. In spite of accepting all this overhead, preemptive schedul-
ing, in essence, is generally more responsive and may still provide better service to the total population
of processes for general-purpose systems, because it prevents processes from monopolizing the proces-
sor for a very long time. Today most operating systems are such preemptive multitasking systems.
The maximum time a process can keep the CPU is called the system’s time quantum or time-
slice length. The choice of time quantum can have a profound impact on system performance. Small
time quanta give good interactive performance to short interactive jobs (which are likely to block for
I/O). Larger quanta are better for long-running CPU-bound jobs because they do not make as many
time-consuming process switches, thus having less overhead. If the time quantum is kept so small
that the system spends more time carrying out switching of processes than doing useful work, the
system is said to be thrashing, which is also found in other subsystems.
More details on this topic are given on the Support Material at www.routledge.com/9781032467238.
FIGURE 4.11 The role and actions of the process scheduler (enqueuer, dispatcher, and context switcher operating on the ready list of PCBs) at the time of process scheduling.
Since the scheduler is in charge of deciding the order of the processes ready for execution, it ultimately controls which ready process is to be allocated the CPU and when, so as to maximize CPU utilization along with the proper utilization of other resources. As a result, the perceived performance of the system is greatly influenced by the workings of an appropriate scheduler matched to the existing system environment.
The working of a scheduler also determines the frequency of switching from one process to another, where each such switch requires additional overhead to perform the process-switch operation. Too much switching favors small or interactive jobs, which get good service, but may invite the damaging effects of thrashing that degrade system performance as a whole. On the contrary, any attempt by the scheduler to decrease the frequency of process switching in order to reduce this extra overhead favors large jobs, which get more CPU time at a stretch, but it causes the other ready short jobs to be denied access to the CPU, thereby increasing the turnaround time of processes and also decreasing the throughput of the entire system.
Scheduler design, particularly in a multitasking system, is critical to the performance of each individual process and also to the overall behavior of the system. Numerous scheduling strategies have been studied and tried in order to make a particular scheduler fit a specific environment, but ultimately the overall performance depends on the choice of the strategy to be implemented.
Service time, t: The total amount of time a process needs to be in the running state before it is completed. In other words, the service time represents the amount of time the process will use the CPU to accomplish its useful work.
Wait time (response time), W: The time the process spends waiting in the ready state before its first transition to the running state to receive its first unit of service from the processor.
Turnaround time, T: The duration of time that a process p is present, i.e. (finish time – arrival time). The turnaround time T counts not only how long a process p needs but also how long it sits in the ready list while other processes are run. Once it starts, it might be preempted after some time, letting it continue further later. The entire time a process p is on the ready list (until it leaves our view to go to other lists) is charged to T. The process is not visible to the short-term scheduler while it is waiting for I/O or other resources, and therefore that wait time is not included in T.
Missed time (wait time), M: M = T – t. The missed time is the same thing, except we do not count the amount of time t during which the process p is actually running. M measures the amount of time during which p would like to run but is prevented.
Response ratio, R: R = t/T. The response ratio represents the fraction of the time that p is receiving service. If the response ratio R is 1, then p never sits in the ready list while some other process runs.
Penalty ratio, P: P = T/t, the inverse of R. If the response ratio R is 1/100, then P = 100, and the process seems to be taking 100 times as long as it should; the user may be annoyed. A response ratio greater than 1 does not make any sense; similarly, the penalty ratio P ranges from 1 (which is a perfect value) upward.
Kernel time: The time spent by the kernel in making policy decisions and carrying them out. This time includes context-switch and process-switch time. A well-tuned operating system tries to keep the kernel time between 10 and 30 percent.
Idle time: The time spent when the ready list is empty and no fruitful work can be accomplished.
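To make these measures concrete, here is a small worked computation with invented figures: a process that arrives at time 0, needs t = 4 units of service, and finishes at time 20.

#include <stdio.h>

int main(void)
{
    /* Invented figures for one process p. */
    double arrival = 0.0, finish = 20.0, t = 4.0;   /* t = service time */

    double T = finish - arrival;   /* turnaround time               */
    double M = T - t;              /* missed time (wait time)       */
    double R = t / T;              /* response ratio, 0 < R <= 1    */
    double P = T / t;              /* penalty ratio, P >= 1         */

    printf("T = %.1f  M = %.1f  R = %.2f  P = %.2f\n", T, M, R, P);
    /* Prints: T = 20.0  M = 16.0  R = 0.20  P = 5.00               */
    return 0;
}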
With FCFS (first-come-first-served) scheduling, the ready list can be organized as a simple FIFO data structure (where each entry points to a process descriptor). Processes that arrive are added to the tail of the queue by the enqueuer, and the dispatcher takes (removes) processes from the head of the queue (the oldest process in the ready queue) when the current running process stops executing. The preemptive version of this algorithm is commonly known as round-robin (RR) scheduling, discussed later.
FCFS never takes into account the state of the system and the resource requirements of the individual scheduled processes. It ignores service-time requests and all other criteria that may influence performance with respect to turnaround and waiting time. In the absence of any preemption, resource utilization and the system throughput rate may be quite low. Since FCFS does not discriminate among jobs on the basis of their required service time, short jobs may suffer considerable turnaround delays and waiting times when one or more long jobs are present in front of them in the system. Consequently, this scheduling may result in poor performance under a specific set of system requirements and has thus fallen from favor.
Although FCFS is not an attractive approach on its own for a single-processor system, it is often combined with a priority scheme to provide an effective scheduling mechanism. The scheduler in that situation may maintain a number of queues of processes, one queue for each priority level, and dispatch each process on a first-come-first-served basis within each queue. One example of such a system is known as feedback scheduling, which is discussed later in this subsection.
The characteristics of FCFS scheduling, with a figure, are given on the Support Material at www.routledge.com/9781032467238.
With SPN, although the overall performance in terms of response time is significantly improved, long processes may be penalized. Moreover, if the ready list is saturated, which is often the case, then long processes tend to remain in the ready list while short processes continuously receive service. In the extreme case, when the system has little idle time, long processes will never be served, and such starvation of long processes may ultimately be a serious liability of the scheduling algorithm. Moreover, especially for longer processes, the variability of response time is increased, and thus predictability is reduced.
Another difficulty with the SPN policy is that it requires a different ingredient for its working: explicit information about the service-time requirements, or at least an estimate of the required service time, of each process. But, due to many practical issues, it is really difficult for either the user or the scheduler to figure out which of the currently runnable processes is the shortest one. In fact, the success of SPN in practical applications depends mostly on the accuracy of prediction of job and process behavior, and that also imposes the additional overhead of correctly computing a predictor at runtime. That is why, in spite of its several merits, due to all these critical issues and the absence of preemption, this method is not usually favored as suitable for a time-sharing or transaction-processing environment.
A prediction calculation, analysis, and example figure for SPN are given on the Support Material at www.routledge.com/9781032467238.
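One predictor commonly used for SPN-style schedulers (though not necessarily the exact formulation given on the Support Material) is exponential averaging of the observed CPU bursts, S(n+1) = a * t(n) + (1 - a) * S(n), where t(n) is the length of the most recent burst, S(n) the previous estimate, and a a weighting factor between 0 and 1. A one-line sketch:

/* Exponential-average estimate of the next CPU burst.
 * alpha in (0,1]: larger values weight recent behaviour more heavily. */
double predict_next_burst(double prev_estimate, double last_burst, double alpha)
{
    return alpha * last_burst + (1.0 - alpha) * prev_estimate;
}

For example, with alpha = 0.5, a previous estimate of 10 and a measured burst of 6 give a new estimate of 8.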
For short processes, HPRN (highest penalty ratio next) does not do as well as SPN; for middle-length processes, HPRN has an intermediate penalty ratio; and for very long processes, SPN becomes worse than FCFS, but HPRN is still in the middle.
However, HPRN still has some distinct disadvantages. First of all, it is not preemptive, so it cannot beat RR or PSPN for short processes. A short process that unfortunately arrives just after a long process has started executing will still have to wait a very long time. Second, it is generally not as good as SPN (as indicated by the results of various simulations), which uses the same technique: knowledge of process length without preemption. Third, HPRN is more expensive to implement, since the penalty ratio must be calculated for every waiting process whenever a running process completes or blocks.
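In sketch form, an HPRN scheduler recomputes, whenever the CPU becomes free, the penalty ratio (time waited so far + required service time) / required service time for every waiting process and picks the largest. The structure and names below are illustrative assumptions only.

#include <stddef.h>

typedef struct {
    double wait;      /* time spent waiting in the ready list so far */
    double service;   /* (estimated) total service time required     */
} job_t;

/* Return the index of the job with the highest penalty ratio,
 * (wait + service) / service, or -1 if the list is empty.       */
int hprn_select(const job_t *jobs, size_t n)
{
    int best = -1;
    double best_ratio = 0.0;
    for (size_t i = 0; i < n; i++) {
        double ratio = (jobs[i].wait + jobs[i].service) / jobs[i].service;
        if (best == -1 || ratio > best_ratio) {
            best = (int)i;
            best_ratio = ratio;
        }
    }
    return best;
}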
Now, with regard to the placement of processes in the list, this RR approach has been further modified a little in order to implement "fairness", which is the basic philosophy of RR scheduling. A few other variants of RR scheduling are: inverse of remainder of quantum, limited round robin, and the multiple-level feedback variant of round robin.
RR scheduling is actually very sensitive to the length of the time-slice, or quantum, to be chosen. If the chosen quantum is very short, then short processes will move through the system relatively quickly. But if the chosen quantum is not large in comparison with the process-switching time, more time will be consumed in handling system overhead, which sometimes significantly lowers CPU efficiency. Thus, very short time quanta are legitimately avoided. On the contrary, if the time-slice is made much longer than the process-switch time, the extra overhead due to frequent preemption will be reduced, but the last-positioned user in a long queue will have to wait a long time to get its turn, and the response time will be appreciably increased. Moreover, in this situation most processes will complete within the specified time-slice and usually surrender control to the OS rather than being preempted by an interval timer; the ultimate advantage of preemptive scheduling will then be lost, and RR scheduling will eventually degenerate to simple FCFS scheduling. Therefore, the optimal value of the time-slice lies somewhere in between, but it mostly depends on the environment, which consists of both the computing system being used and the nature of the workload. This workload, in turn, is primarily determined by the type of programs being submitted and also the instants of their arrival.
Moreover, a qualitative definition of a too-short, short, or long time-slice is not really convincing, because a relatively short time interval on one kind of hardware system may be a comparatively long one on another in terms of the number of instructions executed by the processor (CPU speed) in that system. That is why an instructions-per-quantum measure is more realistic for comparing different systems, because a time-duration measure does not reflect the fact that processors with different speeds may generally accomplish different volumes of work within a specified time-slice.
In summary, round robin is particularly effective in general-purpose time-sharing or transaction-processing systems as well as in multiuser environments where terminal response time is a critical parameter. The choice of a suitable time-slice matched to the existing environment is an influencing factor in its performance. That is why the time-slice duration is kept user-tunable and can be modified by the user at system generation (OS installation).
An example with figures relating to the operation of RR scheduling, its Gantt chart, and its scheduling characteristics in tabular form is given on the Support Material at www.routledge.com/9781032467238.
Virtual round robin (VRR) makes use of an auxiliary queue: processes that are released from their I/O activities are moved to this auxiliary queue instead of being sent, as usual, to the ready queue. Now, when the CPU is available and a dispatching decision is to be made, the processes in this auxiliary queue get preference over those in the main ready queue. When a process is dispatched from this auxiliary queue, it runs only for a time duration equal to the basic time quantum minus the total time it already spent when it was last selected from the main ready queue. Performance evaluations carried out by the authors of this approach revealed that it is indeed superior to RR scheduling in terms of processor-time distribution, thereby implementing fairness.
An example with figures relating to the operation of VRR scheduling is given on the Support Material at www.routledge.com/9781032467238.
PSPN (preemptive shortest process next) gives the best achievable average penalty ratio because it keeps the ready list as short as possible. It manages this feat by directing resources toward the process that will finish soonest and will therefore shorten the ready list soonest. A short ready list means reduced contention and leads to a low penalty ratio.
An example with figures relating to the operation of PSPN scheduling, its Gantt chart, and its scheduling characteristics in tabular form is given on the Support Material at www.routledge.com/9781032467238.
If the newly arrived job has a priority lower than that of the currently running job, the action to be
taken will be identical to that of nonpreemptive priority scheduling.
If priority-based scheduling is implemented, there remains a high possibility that low-priority pro-
cesses may be effectively locked out by higher-priority ones. In general, with this scheduling scheme,
no such guarantee could be given with regard to the expected time of completion of a job after its
admission into the system. To get rid of this uncertainty, the usual remedy is to provide an aging priority, in which a check prevents high-priority processes from running indefinitely. The scheduler in this situation has the option to reduce the priority of the currently running process at each clock interrupt. When this action causes the priority of the currently running process to drop below that of the next highest-priority process, a process switch occurs and the system leaves the currently executing process. Eventually, over time, the priorities of the older low-priority processes become higher than those of the running high-priority processes, and they will ultimately get their turn within a reasonable period of time.
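A minimal sketch of the aging idea just described: at each clock interrupt the running process loses a little priority while every waiting process gains a little, so a long-waiting low-priority process eventually overtakes the running one. The numeric conventions (higher value = higher priority, decay and boost of 1 per tick) are arbitrary assumptions.

#include <stddef.h>

typedef struct {
    int priority;   /* higher number = higher priority (illustrative convention) */
    int running;    /* nonzero if this process currently holds the CPU           */
} proc_t;

/* Called at every clock interrupt: age the priorities and report (return 1)
 * whether some ready process now outranks the running one, i.e. whether a
 * process switch should occur.                                               */
int age_priorities(proc_t *procs, size_t n)
{
    int running_idx = -1;
    for (size_t i = 0; i < n; i++) {
        if (procs[i].running) { procs[i].priority -= 1; running_idx = (int)i; }
        else                  { procs[i].priority += 1; }
    }
    if (running_idx < 0)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (!procs[i].running && procs[i].priority > procs[running_idx].priority)
            return 1;
    return 0;
}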
Event-driven (ED) scheduling is another variant of priority-based scheduling used in real-time operating systems to schedule real-time events (processes). In such systems, all the processes are time-critical and must be executed within specific deadlines. The entire workload of the system may consist of a collection of periodic processes, executed cyclically within specified periods (deadlines), and aperiodic processes whose times of arrival are not predictable. This means certain processes arrive with different, already-assigned fixed priorities and other processes arrive with dynamically varying priorities. The scheduler always takes the highest-priority ready process whenever a significant event (arrival of an important process) occurs. Arrival of a higher-priority important process will immediately preempt the currently running process, process switching will be carried out within a very short period (a special type of hardware is used to speed up the process-switching mechanism), and the new higher-priority process will be executed. Different types of schedulers under this category have been devised to negotiate all such situations, as explained in the section "Deadline Scheduling".
An example with figures relating to the operation of preemptive priority-based scheduling, its Gantt chart, and its scheduling characteristics in tabular form is given on the Support Material at www.routledge.com/9781032467238.
• Operating-system jobs
• Interactive jobs
• Batch jobs
FIGURE 4.12 An illustration of the actions of the scheduler when scheduling processes arranged in a multiple-level queue: a high-priority queue of system processes (preemptive priority-based/ED or FCFS scheduling), a medium-priority queue of interactive processes (RR scheduling), and a low-priority queue of batch-like processes (FCFS scheduling), with a separate between-queues discipline (priority or biased time-slicing) allocating the CPU among the queues.
This would eventually result in the conventional single ready queue being partitioned into a few ready queues. As shown in Figure 4.12, three such separate ready queues are formed. A ready process may then be assigned to one of these queues on the basis of its attributes, which may be provided either by the user or by the system. Multiple-level queues are thus an extension of priority-based scheduling (multiple priority-level queues) in which all processes with the same characteristics (priority) are placed in a single queue. Within each queue, jobs can be scheduled using an algorithm that is best suited for that queue considering its workload. For example, queues containing interactive jobs can be scheduled round robin, while queues containing batch jobs can be scheduled using FCFS or SPN.
Between queues, a scheduling discipline should be devised to allocate the CPU to a particular queue. Typical approaches in this regard are to use absolute priority or some form of modified time-slicing that injects a bias reflecting the relative priority of the processes within particular queues. In the case of absolute priority scheduling between queues, the highest-priority queue will be handled first, and all the processes from this highest-priority queue (usually consisting of OS processes) are serviced in some order until that queue is empty. This ordering of processes within the queue can be implemented using some other scheduling discipline that may be event-driven, or FCFS can be chosen, since the queue consists of processes of a similar nature and the overhead of FCFS is low. When the highest-priority queue is empty, the next highest-priority queue may be serviced using its own best-matched scheduling discipline (e.g. a queue formed of interactive jobs normally uses RR scheduling). When both higher-priority queues are empty, the next high-priority queue (e.g. consisting of batch jobs) may then be serviced using its own best-matched scheduling discipline. In this way, all queues will be handled one after another in order.
In general, during the execution of any process in any queue, if a new process arrives that is char-
acterized as a member of a higher-priority queue, the currently running job will be preempted and
the newly arrived job will start executing. This strategy ensures responsiveness to external events
and interrupts, of course with an extra cost of frequent preemptions and their associated overhead.
A variant of this strategy for distributing CPU utilization across queues may be to assign a certain percentage of processor time to each queue, commensurate with its priority. The highest-priority queue will then be given a larger portion of the CPU time, and the lowest-priority queue will be allocated a smaller portion of the CPU time.
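The absolute-priority rule between queues amounts to scanning the queues from highest to lowest priority and dispatching from the first non-empty one (the percentage variant would instead apportion CPU time across queues). The sketch below uses invented types; the within-queue policy is left to each queue.

#include <stddef.h>

struct process;                       /* details are irrelevant here */

typedef struct {
    struct process *head;             /* first ready process in this queue, NULL if empty */
    /* per-queue scheduling data (RR time-slice, etc.) would live here */
} ready_queue_t;

/* Absolute priority between queues: dispatch from the highest-priority
 * non-empty queue; queues[0] is the highest priority. NULL means idle. */
struct process *mlq_select(ready_queue_t *queues, size_t nqueues)
{
    for (size_t i = 0; i < nqueues; i++)
        if (queues[i].head != NULL)
            return queues[i].head;    /* the within-queue discipline set this order */
    return NULL;
}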
As is expected, multiple-queue scheduling, by nature, is a very general discipline that exploits all the features and advantages of different "pure" scheduling disciplines by combining them into one single form of scheduling. Consequently, each of these more sophisticated constituent scheduling algorithms also contributes overhead that ultimately increases the overhead of this discipline as a whole. However, distinct advantages of MLQ were observed and recognized early on.
FIGURE 4.13 An illustration of the actions of the scheduler in multi-level feedback scheduling for jobs with a single CPU burst: jobs from the outside world enter the highest-priority queue Q1 (time-slice 10 ms); a job whose timer expires drops to the medium-priority queue Q2 (time-slice 20 ms) and then to Q3, while a job that finishes its burst leaves the system as completed.
If a job does not complete its CPU burst within the time-slice assigned to its queue, the job is placed at the tail of the next lower queue. If the end of the CPU burst is reached before the expiry of the time-slice, and this end of the CPU burst is due to an I/O request being issued, the job leaves the ready queue and joins the waiting pool of jobs, remaining there until its I/O completes. The job at this moment is outside the vision of the scheduler. But if the end of the CPU burst marks the end of the job, the job leaves the system as completed.
To explain the operation of multi-level feedback queues, let us first consider jobs with only one CPU burst. These jobs enter the highest-priority queue, Q1, from the outside world, as shown in Figure 4.13. The job at the head of this queue is assigned to the CPU. If it completes its CPU burst within the time-slice assigned to that queue, the job leaves the system as a completed job. If it cannot complete its CPU burst within the time-slice assigned to Q1, the job is then placed at the tail of the next lower-level queue, Q2 (refer to Figure 4.13).
Jobs in the second queue, Q2, will only be taken up for execution when all the jobs in Q1 are finished. If a new job arrives in Q1 while a job in Q2 (the job which was at the head of Q2) is under execution, that running job in Q2 will be preempted, and the newly arrived job in Q1 will get the CPU to start its execution. The newly arrived job will either leave the system from the first queue itself or enter the second queue, Q2, at its tail. At this time, if no other job is available in Q1, the job at the head of queue Q2 will be started, or the preempted process at the head of queue Q2 will resume execution. Similarly, jobs in Q3 will only be taken up for execution if and only if all the jobs in Q1 and Q2 are finished. If a job arrives in Q1 while a job in Q3 (the one that was at the head of Q3) is under execution, the job in Q3 will be preempted.
The entire idea is to give preferential treatment to short processes, and a job with a large CPU
burst will ultimately sink down to the lowest queue. If the job at the head of the lowest queue does
not complete its remaining CPU burst within the time-slice assigned to that queue, it will be placed
at the tail of the same queue, as shown in Figure 4.13.
Let us now consider a more practical situation with jobs in general. Usually, each job consists of several CPU and I/O bursts. If the total CPU burst time of a job is more than its total I/O burst time, the job is called a CPU-bound job; the converse is called an I/O-bound job.
FIGURE 4.14 An illustration of the actions of a modified scheduler in multi-level feedback scheduling for jobs in general: each queue Q1, Q2, and Q3 (time-slices 10 ms, 20 ms, and 30 ms, respectively) has its own waiting pool; a job may complete, move down to the next lower queue, or leave for the waiting pool of its own queue on an I/O request and later re-enter that same queue.
As usual, jobs that enter the system from the outside world are placed in the highest-priority queue in FCFS order. The job at the head of this queue gets the CPU first. If it issues an I/O request before the expiry of its time-slice, the job then leaves the ready queue to join the waiting pool. Otherwise, the job is placed at the tail of the next lower queue if it is not completed within the specified time-slice.
Jobs which leave the ready queue due to an I/O request will eventually become ready-to-run after completion of their I/O. A policy now has to be settled regarding the placement of these ready-to-run jobs at the time of their re-entry after I/O completion. The question obviously arises as to which queue these jobs should be placed in. One strategy may be to mark each job, as it leaves the ready queue (due to an I/O request), with the identity of the queue from which it left. When this job once again becomes ready-to-run, it re-enters the same queue from which it left and is placed at its tail. Figure 4.14 depicts a conceptual picture of this strategy.
Placing a job once again in the same queue from which it left due to an I/O request is, however, not a judicious decision. This is also borne out by the fact (as shown in Figure 4.14) that a job whose CPU burst happens to be long is allowed to use only one time-slice before it is preempted and finally pushed down to a lower-priority queue; it will never be promoted to a higher-priority queue. Thus, a better strategy would perhaps be one that adapts to the changing trends with regard to CPU and I/O bursts. It is reasonable to assume that after one long CPU burst, the remaining workload will be less, and hence subsequent CPU bursts may be of shorter duration. Such a self-adjusting strategy is expected to be superior to a non-adapting strategy like the one we just discussed.
Following this line, one such self-adjusting strategy can be derived that will place a ready-to-run
job in a queue one level above the queue from which it left due to an I/O request, because it can be
logically assumed that after returning, the amount of work remaining for that job to complete its
execution will be less. Figure 4.15 illustrates such a strategy. Under this strategy, jobs start to move
up and down between queues.
One serious problem with this scheme is that the turnaround time of longer processes can stretch
out alarmingly, leading to a situation of starvation if new jobs happen to continuously enter the
[Figure 4.15 shows the same three priority queues and their waiting pools (Q1: time-slice = 10 ms, Q2: 20 ms, Q3: 30 ms), now arranged so that a job returning from its waiting pool after I/O completion re-enters one level above the queue from which it left.]
FIGURE 4.15 An illustration of actions of another type of modified scheduler used in Multi-level feedback scheduling for jobs in general.
system. To negotiate this situation, one approach could be to vary the time-slices assigned to the queues to compensate for this drawback. For example, a process scheduled from Q1 will be allowed to execute for 1 time unit and then be preempted; a process scheduled from Q2 will be allowed to execute for 2 time units, and so on. In general, a process scheduled from Qi will be allowed to execute for 2^(i-1) time units (the allotment doubling at each lower level) before preemption. Observations with this scheme, taken over different types of processes (queues) with varying execution times chosen at random, reveal that it works quite nicely.
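As a rough illustration of this quantum-doubling rule, the small C fragment below computes the quantum granted to a process taken from queue Qi and shows where an unfinished process would be demoted. The function names and the four-queue set-up are purely illustrative assumptions, not part of the text.

#include <stdio.h>

/* Quantum (in basic time units) granted to a process scheduled from queue Qi,
   i = 1 .. num_queues; the allotment doubles at every lower level: 1, 2, 4, ... */
static unsigned quantum_for_queue(unsigned i)
{
    return 1u << (i - 1);                 /* 2^(i-1) time units */
}

/* A process that exhausts its quantum sinks one level, but never below
   the lowest queue.                                                      */
static unsigned next_queue(unsigned i, unsigned num_queues, int exhausted)
{
    return (exhausted && i < num_queues) ? i + 1 : i;
}

int main(void)
{
    for (unsigned i = 1; i <= 4; i++)
        printf("Q%u: quantum = %u time unit(s)\n", i, quantum_for_queue(i));
    printf("a process from Q2 that used up its quantum moves to Q%u\n",
           next_queue(2, 4, 1));
    return 0;
}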
However, the beauty of this scheduling discipline is that it favors short processes, but at the same time, it also forces the resource-consuming processes to slowly "sink down" into lower-level queues, thereby working as filters in order to keep processor utilization high. This way of thinking is also supported by observations on program behavior which suggest that completion rate has a natural tendency to decrease with increasing service. This means that the more service a process receives, the less likely it is to be completed, even if it is given a little more service. That is why the feedback mechanism in MLQs tends to rank processes dynamically according to the actual amount of time already used, favoring those that have received less. This is actually reflected in the fact that when a process surrenders control to the OS before its time-slice expires (due to an I/O request), it is rightly rewarded by being moved up in the hierarchy of queues.
In fact, MLQ with feedback scheduling is the most general discipline, incorporating many of the simple scheduling algorithms as appropriate for each individual queue. Use of a feedback mechanism makes this scheduling more adaptive and responsive to the actual runtime behavior of processes, which seems more sensible. However, one major drawback of this class of scheduling is that it suffers from comparatively high overhead, due both to manipulation of the global queues and to the presence of many constituent scheduling algorithms used by the individual queues for their own internal scheduling, each contributing its own overhead and thereby increasing the overhead of the discipline as a whole.
Intrinsic properties that distinguish one process from another usually include service-time requirements, storage needs, resources held, and the amount of I/O required. This sort of information may be obtained before the process starts or determined while it is running; it may even change during execution and is typically recorded in the process control block.
Extrinsic properties are characteristics that have to do with the user who owns the process.
Extrinsic properties mainly include the urgency of the process and how much the user is
willing to pay to purchase special treatment.
Dynamic properties indicate the load that other processes are placing on resources. These
properties include the size of the ready list and the amount of main storage available. Out
of the commonly used policies, round-robin scheduling and all its variants are preemptive
and non-intrinsic, while PSPN is preemptive and intrinsic. Nonpreemptive FCFS is non-
intrinsic, and nonpreemptive SPN and HPRN use intrinsic information.
For more details on this topic with a figure, see the Support Material at www.routledge.com/
9781032467238.
utilization, the average is normalized by dividing by the weight of that group. The greater the weight
assigned to the group, the less its utilization will affect its priority. Usually, the higher the numerical value of the priority, the lower the actual priority.
It is to be noted that the actual share of CPU time received by a group of processes may some-
times differ from the fair share of the group due to lack of activity in its processes or in processes of
other groups. Different operating systems, particularly systems using the lottery scheduling policy
and the scheduling policy used in the UNIX operating system, differ in the way they handle this
situation. A lot of work has been done in this area by many OS researchers and designers. Interested
readers can consult Kay and Lauder (1988) and Woodside (1986).
For more details on this topic with computation, see the Support Material at www.routledge.
com/9781032467238.
1. Use multiple-level feedback up to a fixed number z of time-slices, then use FCFS for the last queue. This method reduces the number of process switches for very long processes.
2. Use round robin up to some number of time-slices. A process that needs more time is to be put in a second-run queue, which can be treated with SRR scheduling. Very long processes are eventually placed in a third queue that could use FCFS. RR could have absolute precedence over SRR, which, in turn, has precedence over FCFS, or each could be given a fixed percentage of total time.
1. Use RR. However, instead of keeping the quantum constant, adjust it periodically, perhaps after every process switch, so that the quantum becomes q/n, where n is the size of the ready list. If there are very few ready processes, each gets a long quantum, which avoids process switches. But if there are many, the algorithm becomes fairer to all, at the expense of more process switching. Processes that need only a small amount of time get a quantum, and even a small one may let them finish soon. The quantum should not be allowed to drop below a given minimal value so that process switching does not start to consume undue amounts of time.
2. Offer the current process an extra quantum whenever a new process arrives. The effect of
this gift is to reduce process switching in proportion to the level of saturation.
3. Some versions of UNIX use the following scheduling algorithm. Every second, an internal priority is calculated for each process. This priority depends on the external priority (set by the user) and the amount of recent time consumed. This latter figure rises linearly as the process runs and decreases exponentially as the process waits (whether because of short-term scheduling or other reasons). The exponential decay again depends on the current load (that is, the size of the ready list); if the load is higher, the CPU usage figure of a process decays more slowly. Processes with higher recent CPU usage get lower priorities than those with lower recent CPU usage. The scheduler runs the process with the highest priority in the ready list. If several processes have the same priority, they are scheduled in RR fashion. (A rough sketch of such a recalculation appears just after this list.)
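The C sketch below mimics the flavour of such a per-second recalculation; the decay factor, the weighting of the two terms, and the field names are illustrative assumptions, not the actual UNIX formula.

#include <stdio.h>

struct proc {
    int    ext_priority;   /* set by the user; a larger value means a worse priority */
    double recent_cpu;     /* recent CPU time consumed, decayed every second          */
};

/* Called once per second for every process: decay the past usage (more slowly
   when the ready list is longer) and derive the internal priority from the
   external priority plus the decayed usage.                                   */
static int recompute_priority(struct proc *p, int ready_list_size)
{
    double decay = (double)ready_list_size / (ready_list_size + 1);
    p->recent_cpu = p->recent_cpu * decay;
    return p->ext_priority + (int)(p->recent_cpu / 2);
}

int main(void)
{
    struct proc p = { .ext_priority = 20, .recent_cpu = 40.0 };
    for (int sec = 1; sec <= 3; sec++)
        printf("after %d s: internal priority = %d\n",
               sec, recompute_priority(&p, 4));
    return 0;
}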
1. Use RR, but let the quantum depend on the external priority of the process. That is, allow
larger quanta for processes run for a user willing to pay a premium for this service.
2. The worst service next (WSN) method is a generalization of many others. After each time-
slice, compute for each process how much it has suffered so far. Suffering is an arbitrarily
complex figure arrived at by crediting the process for how much it has had to wait, how many times it has been preempted, how much its user is paying in premiums, and how urgent it is. The process is also debited for such items as the amount of time it has actually used and the other resources it has consumed (resources like space and access to secondary storage). The process with the greatest suffering is given the next quantum (a possible shape of such a suffering function is sketched just after this list).
3. The user buys a guaranteed response ratio. At the time of scheduling, a suffering function
is used that takes into account only the difference between the guaranteed response ratio
and the actual response ratio at the moment.
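A possible shape of the suffering function mentioned in item 2 is sketched below in C; the individual weights and field names are invented solely for illustration and carry no authority.

#include <stdio.h>

struct proc_stats {
    double wait_time;        /* total time the process has had to wait          */
    int    preemptions;      /* number of times it has been preempted           */
    double premium_paid;     /* what its user is paying for special treatment   */
    double urgency;          /* externally declared urgency                     */
    double cpu_used;         /* CPU time actually consumed                      */
    double other_resources;  /* charge for space, disk accesses, and the like   */
};

/* Credit the process for what it has endured, debit it for what it has
   consumed; the process with the largest result receives the next quantum.    */
static double suffering(const struct proc_stats *p)
{
    return  1.0 * p->wait_time
          + 0.5 * p->preemptions
          + 2.0 * p->premium_paid
          + 1.5 * p->urgency
          - 1.0 * p->cpu_used
          - 0.5 * p->other_resources;
}

int main(void)
{
    struct proc_stats a = { 12.0, 3, 0.0, 1.0, 30.0, 4.0 };
    struct proc_stats b = {  2.0, 1, 5.0, 2.0,  1.0, 0.5 };
    printf("suffering: a = %.2f, b = %.2f\n", suffering(&a), suffering(&b));
    return 0;
}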
results have been derived, the next step is to validate the model by comparing its predictions against
reality. This step is easy if the model describes an actual situation, but it is harder if it describes a
hypothetical situation. Nevertheless, it is a useful tool that may be used to rationally compare and
study the overall behavior of the various scheduling disciplines for performance analysis.
For an illustration of this topic with a figure, see the Support Material at www.routledge.com/
9781032467238.
Simulation: When analysis is inadequate or fails for being too complex, or when queuing networks are not able to effectively describe the situation, simulation may be used. Simulations are programs, often quite complex, that mimic the dynamic behavior of the actual system being modeled. In fact, simulation usually involves tracking a large number of processes through a model (such as a queuing network) and collecting statistics. Whenever a probabilistic choice is to be made,
(such as a queuing network) and collecting statistics. Whenever a probabilistic choice is to be made,
such as when the next arrival should occur or which branch a process will take, a pseudo-random
number is generated with the correct distribution. It is also possible to drive the simulation with
traces of real systems to better match the reality. Depending on the nature of the input to be fed to
the modeled system, a simulation may be trace-driven or self-driven. A trace-driven simulation uses
an input that is a trace of actual events collected and recorded on a real system. A self-driven simu-
lation uses an artificial workload that is synthetically generated to closely resemble the expected
conditions in the target systems.
Simulations, just like analytic models, must be validated to ensure that they are adequate to ratio-
nalize the situation that is being modeled. They are often run several times in order to determine how
much the particular pseudo-random numbers being chosen affect the results. Simulations tend to
produce enormous amounts of data that must be carefully filtered before they are used. Simulations
often use extremely detailed models and therefore consume enormous amounts of computer time.
Moreover, accurate simulations of comparatively complex systems are also critical in terms of design
and coding. For these reasons, simulations are usually appropriate only if analysis fails.
For an illustration of this topic with figures, see the Support Material at www.routledge.com/
9781032467238.
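To give a flavour of a self-driven simulation, the short C program below generates exponentially distributed interarrival and service times and pushes synthetic jobs through a single FCFS server; the distributions, their means, and the job count are arbitrary assumptions made only for this illustration.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Draw an exponentially distributed sample with the given mean
   (a pseudo-random number with the correct distribution).        */
static double exp_sample(double mean)
{
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* uniform in (0,1) */
    return -mean * log(u);
}

int main(void)
{
    const int    jobs = 10000;
    const double mean_interarrival = 10.0, mean_service = 8.0;
    double clock = 0.0, server_free = 0.0, total_turnaround = 0.0;

    srand(42);                                   /* fixed seed: repeatable run  */
    for (int i = 0; i < jobs; i++) {
        clock += exp_sample(mean_interarrival);  /* arrival instant of the job  */
        double start  = clock > server_free ? clock : server_free;
        double finish = start + exp_sample(mean_service);
        server_free   = finish;
        total_turnaround += finish - clock;      /* collect the statistic       */
    }
    printf("mean turnaround time = %.2f\n", total_turnaround / jobs);
    return 0;
}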
Experimentation: If simulation appears difficult and does not work, usually due to the complexity of the model, experimentation is the last resort and often requires a significant investment in equipment, both to build the system to be tested and to instrument it to acquire the required statistics. For example, it can be far cheaper to simulate the SPN scheduling method than to implement it properly, debug it, execute it on a user community for a while, and measure the results. Likewise, it is cheaper to simulate the effect of installing a new disk than to acquire it on rent, connect it, see how well it works, and then disconnect it after being convinced that the improvement is not cost-effective. However, experimentation almost always ensures accurate results, since by definition, it uses a truthful model.
Long-term scheduling also blends into medium-term scheduling somewhat. The decision not to
allow a process to start may be based on explicitly declared resource needs that are not currently avail-
able, in which case it can be said that the process has really started (as far as the long-term scheduler is
concerned) but is still waiting for resources (as far as the medium-term scheduler is concerned).
In fact, the long-term scheduler acts as a first-level throttle in keeping resource utilization at the desired level, while influencing and regulating in many ways the decisive action to be taken by the short-term scheduler at the time of scheduling.
For more details on this topic, see the Support Material at www.routledge.com/9781032467238.
intensive phase might still be considered interactive, and the scheduler might favor it instead of
treating it as the equal of a background computation.
usage of shared information and to synchronize the operation of the constituent processes. It is
worth mentioning that OSs typically provide only the minimum mechanism to address concurrency,
since there are so many ways to implement concurrency (e.g. from the point of view of program-
ming languages, the designer’s ability to set a collection of concurrent processes without interfer-
ence, various methods of communication between concurrent processes without interference, etc.)
but none of them have been found to dominate the area. However, as an aid to users, the OS provides
a set of primitives so that interprocess synchronization can be achieved in a well-structured way
without using interrupts.
Cooperating processes require the simultaneous existence of two different communicating processes that typically share some resources in addition to interacting with each other (one produces an output which is the input to the other); this, in turn, demands synchronization to preserve precedence relationships and prevent concurrency-related timing problems.
While all sorts of measures have been taken and numerous techniques have been devised for
smooth execution of concurrent processes with proper synchronization for the sake of performance
improvement and increased productivity, correct implementation, however, may invite other prob-
lems of different types, including a possibility of deadlock among concurrent processes. Deadlock
will be discussed in a subsequent section later in this chapter.
The details of this topic are given on the Support Material at www.routledge.com/9781032467238.
1. No two processes may be in their critical section at the same time in order to ensure mutual
exclusion.
2. When no process is in a critical section, any process that requests entry to its critical sec-
tion must be permitted to enter without delay.
3. No assumptions may be made about relative process speeds and priorities, the number of
contending processes, or the availability of the number of CPUs.
4. No process running outside its critical section may block other processes.
5. No process should have to wait forever to enter its critical section; that is, entrance to the critical section must be granted to one of the contending processes within a finite duration, thereby preventing deadlock and starvation.
A willing process attempting to enter a critical section first negotiates with all other concurrent processes to make sure that no other conflicting activity is in progress, and then all concerned processes
are informed about the temporary unavailability of the resource. Once consensus is reached, the
winning process can safely enter the critical section and start executing its tasks. After completion,
the concerned process informs the other contenders about the availability of the resource, and that
may, in turn, activate the next round of negotiations.
We will now turn to developing mechanisms that can be used to provide mutual exclusion and
thereby synchronization. We will start with simple but conservative techniques and move toward
more complex, liberal ones. In each case, we will show how certain primitives can be built to ensure
mutual exclusion.
All mechanisms developed to realize mutual exclusion and synchronization ultimately depend
on the synchronous nature of hardware. Some rely on the fact that processors can be made uninter-
ruptible. These processor-synchronous methods work only for individual processors. Others rely
on the fact that main storage can service only one access request at a time, even in a multiprocessor.
These store-synchronous methods have a wider range of applicability.
in a multiprocessor. This synchronization atomicity of main storage can be used to define a new main-storage variable called a switch that can then be shared by two concurrent activities to implement mutual exclusion between them. This variable allows only one or the other to enter its critical region. To mitigate the conflict that may arise due to simultaneous access to this common shared switch by concurrent independent processes, a user-made software approach can be employed to serialize the multiple accesses to the shared switch variable. When this approach is implemented, no special hardware facility, no additional OS assistance, and, above all, no particular supporting attribute at the programming-language level is assumed. Moreover, these approaches do not consider any ordering of access to be granted to the contending processes while arbitrating the conflicting situation.
The solution as offered in this algorithm is able to preserve mutual exclusion, considering all the critical situations that may arise. But while Peterson's approach is simple enough, it is not free from busy waiting, which may severely degrade the performance of the entire system. However, this drawback does not diminish its theoretical significance but only tends to limit its direct applicability in practice. Moreover, this solution can also be generalized to any number of processes that compete over the same shared resources. This generalization can be achieved by use of a spin switch (Hofri, 1990).
The details of this approach with the algorithm are given on the Support Material at www.routledge.
com/9781032467238.
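For reference, a minimal C rendering of Peterson's two-process algorithm is shown below; it uses C11 sequentially consistent atomics so that the classical reasoning remains valid on modern hardware, and the thread set-up around it is only an illustrative test harness, not part of the text.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool flag[2];           /* flag[i]: process i wants to enter    */
static atomic_int  turn;
static long counter = 0;              /* the shared resource being protected  */

static void enter_region(int self)    /* self is 0 or 1 */
{
    int other = 1 - self;
    atomic_store(&flag[self], true);  /* announce interest                    */
    atomic_store(&turn, self);        /* the last one to write turn must wait */
    while (atomic_load(&flag[other]) && atomic_load(&turn) == self)
        ;                             /* busy wait                            */
}

static void leave_region(int self)
{
    atomic_store(&flag[self], false); /* no longer interested                 */
}

static void *worker(void *arg)
{
    int self = (int)(long)arg;
    for (int i = 0; i < 100000; i++) {
        enter_region(self);
        counter++;                    /* critical section                     */
        leave_region(self);
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}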
Interrupt disabling, however, can cause severe problems; it actually disables clock interrupts and in
turn disables the scheduler temporarily, resulting in no preemptions, thereby affecting rescheduling.
In that situation, a lower-priority process could prevent a higher-priority process from continuing.
It also outlaws concurrency altogether by disabling all other innocent disjoint processes not related
to the blocking process. Moreover, the system becomes unable to respond efficiently to events related to real-time processing; devices that need immediate service cannot be entertained until the ongoing activity in the
critical region completes. In fact, attempts to disable the interrupt force the entire system virtually
into a state of suspension. In addition, if the user process is given the power to handle interrupts for
synchronization purposes, it may be dangerous and even totally unreliable if inappropriate moves
are taken by the user that may lead to total collapse of the entire system, or the system may be
trapped in a deadlock. The disabling-interrupts approach is not at all suitable for the multiprocessor
systems (multiple CPUs) with shared memory, because it works only on the concerned CPU and
is not applicable to other CPUs, which may cause a race condition among the processes running
on them. Still, it is often convenient to entrust the kernel itself to disable interrupts for those few
instructions causing a race condition. That is why it is sometimes a useful technique to disable
interrupts within the kernel (implementation of semaphore WAIT and SIGNAL as system calls
using DI and EI instructions at the system level only, to be discussed later in this chapter), but is not
considered appropriate as a general mutual exclusion mechanism for user processes.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
The test-and-set lock instruction negotiates the conflicts among the contending processes by
allowing only one process to receive a permit to enter its critical section. The basic idea is simple.
A physical entity called lock byte is used as a global variable which controls the entry to the critical
section (to access the shared resource). This global variable is set to free when the guarded shared
resource is available. Each process intending to access the shared resource must obtain a permit to
do so by executing the TSL instruction with the related control variable as an operand. When several
concurrent contending processes compete, the TSL instruction guarantees that only one of them
will be allowed to use the shared resource.
In fact, when this TSL instruction is executed, the value of the lock byte (memory word) will be
read and tested, and if it is 0 (free), it is then replaced by 1 (busy) or any non-zero value and returns
true, and the process may now enter its critical section. This entire sequence of operations is guaran-
teed to be indivisible; that is, it is to be carried out atomically; no other process/processor can access
the lock (word) until the TSL instruction is completed.
The shared control variable used as a lock itself becomes a new critical section related to its
testing and setting. If a process is interrupted after testing the shared control variable but before it
sets it (after the if statement but before the assignment statement in the following declared function),
the solution fails. This new, smaller critical section that manipulates the shared control variable can
be handled by certain forms of needed system calls embedded in the TSL instruction. In this case,
interrupts are disabled in OS code only while the variable is being manipulated, thereafter allowing
them to be enabled while the main critical section is being executed. Hence, the amount of time
that the interrupts are disabled is very short. When the execution of the TSL instruction is over, the
shared resource becomes protected for the exclusive use of the process that calls the TSL instruc-
tion. The test and set can be defined as follows:
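One common rendering, consistent with the description above (the if test followed by the assignment), is the C sketch below; it only conveys the semantics, since the entire body is carried out atomically by the hardware TSL instruction, and a plain software function like this would not be atomic on its own.

/* Semantics of TSL: executed atomically in hardware.  (Written as ordinary C
   only to convey the idea; this software version is NOT atomic by itself.)   */
int test_and_set(int *lock)
{
    if (*lock == 0) {      /* lock byte is 0: the guarded resource is free   */
        *lock = 1;         /* mark it busy ...                               */
        return 1;          /* ... and report that the permit was obtained    */
    }
    return 0;              /* already busy: no permit this time              */
}

/* Typical entry and exit code built on the atomic instruction:              */
int lock_byte = 0;                               /* 0 = free, 1 = busy       */

void enter_critical(void) { while (!test_and_set(&lock_byte)) ; /* busy wait */ }
void leave_critical(void) { lock_byte = 0; }     /* release the lock byte    */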
The beauty of the TSL instruction lies in the guarantee of its atomic (indivisible) action with regard
to testing of the global variable and its subsequent setting at the time of entering the critical section
(occupying the corresponding shared resource). The IBM System/360 family, with a number of models, was the first computer line to include a TSL instruction in hardware. Since then, the architectural design of almost all commercial computers has included explicit hardware provisions, either in a form similar to a TSL instruction or its functional equivalent, for implementation of mutual exclusion. In fact, other types of instructions, such as INCREMENT MEMORY and SWAP MEMORY AND REGISTER, are also available; each one, however, uses an approach similar to the TSL instruction and carries out its respective indivisible operation at the time of execution to implement mutual exclusion.
• Exchange Instruction
A different type of instruction that uses a similar approach to the TSL instruction to implement mutual exclusion is the Exchange (XCHG) instruction, which causes the contents of a variable (lock byte) to be tested and subsequently set as an indivisible operation. This XCHG instruction may also be used successfully in a multiprocessor environment by providing an additional arrangement of a special LOCK prefix over the system bus that enables this instruction to enforce its atomicity (a read-modify-write cycle on the bus) in a multiprocessor system, thereby ensuring mutual exclusion. This is found in the implementation of the Intel iAPX–86 (80 x 86) family of processors with the use of a WAIT operation. In fact, when this exchange (XCHG) instruction is executed, it exchanges the contents of a register with the contents of the lock byte (specified memory word). During execution of this instruction, access to the lock byte is blocked for any other instruction referencing that particular lock byte.
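A corresponding sketch of an exchange-based lock is given below in C; the GCC/Clang builtin __atomic_exchange_n merely stands in for the hardware XCHG (with its LOCK semantics), and its use here is an assumption about the toolchain rather than anything prescribed by the text.

static int lock_byte = 0;                    /* 0 = free, 1 = busy            */

void xchg_enter(void)
{
    /* Atomically swap a 1 into the lock byte and look at the old value;
       keep trying until the old value seen was 0, i.e. the lock was free.    */
    while (__atomic_exchange_n(&lock_byte, 1, __ATOMIC_ACQUIRE) != 0)
        ;                                    /* busy wait                     */
}

void xchg_leave(void)
{
    __atomic_store_n(&lock_byte, 0, __ATOMIC_RELEASE);   /* mark it free      */
}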
While both TSL and XCHG instructions are easy to implement, do not affect interrupts, and relieve the system from any state of suspension, the performance advantages offered by them are often offset by some of their serious drawbacks. The busy waiting caused by them leads to serious degradation in the effective utilization of the system, and the busy waiting of many
processes lying in a queue with arbitrary ordering may cause the possibility of acute starvation for some processes. In addition, the process or processes in a state of busy waiting may hold some resource(s) that may be required by the process executing in its critical section. But access to the resource(s) will simply be denied because of the principle guiding the mutual exclusion mechanism. Consequently, the process in the critical section cannot complete its execution, and all other processes in the busy-waiting state will then remain in their existing state forever. The entire system may then simply be in deadlock. All these drawbacks found in hardware-based solutions cause severe ill effects that greatly outweigh their many distinct advantages, and that is why a search for other effective mechanisms has been undertaken.
For more details about these topics, see the Support Material at www.routledge.com/
9781032467238.
4.17.5.8 Semaphores
Dijkstra proposed a reliable, efficient, and specialized mechanism to implement solutions to synchronization problems, especially for mutual exclusion among an arbitrary number of cooperating processes, using a synchronization tool called a semaphore. His innovative approach was the first
one to use a software-oriented OS primitive (semaphore) to accomplish process synchronization
and is still considered a viable one for managing communities of competing/cooperating pro-
cesses. Hence, it found its way into a number of experimental and commercial operating systems.
Competing/cooperating processes can safely progress when they hold a permit, and a semaphore can
be roughly considered a permit provider. A process requests a permit from a semaphore, waits until a
permit is granted, proceeds further after obtaining one, and returns the permit to the semaphore when
it is no longer needed. If the semaphore does not have a permit, the requesting process is blocked until a
permit is available. The semaphore immediately receives a permit when a process returns one. Hence, a
permit request is a blocking operation, and the permit return is not. In fact, the semaphore manager only
keeps a count of the number of permits available and manipulates the number accordingly.
A semaphore s is an OS abstract data type, a special variable used to synchronize the execution
of concurrent processes. It has two member components. The first component, count, is an integer variable which can take values from a range of integers that indicates the number of permits the semaphore (counting semaphore) has. The second component, wait-queue, is a queue of blocked processes waiting to receive permits from the semaphore. The initial value of count is created with a fixed num-
ber of permits, and the initial value of wait-queue is NULL, indicating no blocked processes.
The two standard primitive atomic operations that can be invoked to access a semaphore structure are wait (or down) and signal (or up). Each primitive takes one argument, the semaphore variable s, for permit request and permit release actions. A process takes a permit out of a semaphore (the semaphore transmits a signal) by invoking the operation wait (or down) on the semaphore and inserts a permit into (or releases a permit to) a semaphore (the semaphore receives a signal) by invoking the signal (or up) operation on the semaphore. In Dijkstra's original paper, the wait (or down) operation was termed P (from the Dutch word proberen, meaning "to test") and the signal (or up) operation was called V (from the Dutch word verhogen, meaning "to increment"). Operating systems
often distinguish between counting and binary semaphores. The value of a counting semaphore can
take values from a range of integers. The value of a binary semaphore can range only between 0
and 1. On some systems, binary semaphores are known as mutex locks, as they are essentially locks
that provide mutual exclusion.
Both of these operations include modification to the integer value of the semaphore that, once started, is completed without any interruptions; that is, each of these two operations is indivisible (atomic action). In fact, semaphore variables can be manipulated or inspected by only three available operations, as depicted in Figure 4.16 and defined as follows:
Wait and signal are primitives of the traffic controller component of processor management that are embedded in the scheduler instead of being built directly on the hardware.
A wait(s) places the process in the queue of blocked processes, if needed, and then sets the process's PCB to the blocked state. The processor is now available, so another process is then selected by the process scheduler to run.
The signal operation executed on a semaphore, as shown, first checks whether there are any blocked processes waiting in the wait-queue. If there are, one of them is awakened and offered a permit, using a scheduling discipline such as FCFS, the fairest policy to avoid indefinite delay of a process in a
semaphore that may otherwise cause starvation if other processes are given preference. The process
selected by the scheduler is now ready to run again. Otherwise, the semaphore member variable
count is simply incremented by one.
There is some controversy over whether the scheduler should switch immediately to the waiting process that is to be activated in the domain of signal (Figure 4.16). An immediate switch guarantees that whatever condition is awaited by that activity still holds, since the signal operation is in progress and no other activity has had a chance to run. The disadvantage of an immediate switch within the signal domain is that it tends to increase the total number of process switches. Moreover, the process that called signal is likely to call wait for a new region soon, which may ultimately cause that process itself to block in any case. The hysteresis principle suggests that the running process should be allowed to continue its remaining processing.
When semaphores are supported, semaphore operations, as well as the declaration of semaphore variables, are commonly provided in the form of system calls in the operating system or as built-in functions and types in system implementation languages.
wait ( s ) :
s.count := s.count – 1 ;
if s.count < 0 then
begin
place the process in s.queue ;
block ( )     /* block the process */
end ;
signal ( s ) :
s.count := s.count + 1 ;
if s.count ≤ 0 then
begin
remove a process P from s.queue ;
wakeup ( P ) ;
place process P in the ready list
end ;
FIGURE 4.17 An algorithm illustrating the definition of general (counting) semaphore primitives (wait and signal).
Program/segment mutual_exclusion;
...
const n = . . . ;   (* number of processes *)
var s : semaphore ( := 1 ) ;
process P ( i : integer ) ;
begin
while true do
begin
wait ( s ) ;
< critical section > ;
signal ( s ) ;
< remaining P( i ) processing >
end [ while ]
end ; [ P( i ) ]
[ main process ]
begin [mutual_exclusion ]
s := 1 ;   (* free *)
initiate P(1), P(2), . . ., P(n)
end [mutual_exclusion ]
FIGURE 4.18 An algorithm that implements mutual exclusion of competing processes using semaphores.
In this implementation (Figure 4.17), semaphore values may be negative, although semaphore values
are never negative under the classical definition of semaphores. In fact, if a semaphore value is negative,
its magnitude actually indicates the number of processes waiting on that semaphore. This fact results
from switching the order of the decrement and then the test in the implementation of the wait() operation.
As an illustration of the use of semaphores, let us consider that n different processes identifed in
the array P(i) share a common resource being accessed within their own critical sections, as shown
in Figure 4.18. Each process ensures the integrity of its critical section by opening its critical sec-
tion with a wait() operation and closing the critical section with a signal() operation on the related
semaphore; the wait() and signal() operations themselves are executed atomically. This means that in each process, a wait(s) is executed just before the critical section. If the resulting value of s is negative, the process is suspended.
If the value of s is 1, then it is decremented to 0, and the process immediately enters its critical
section. Because s is now no longer positive, any other process that now attempts to execute wait()
will make s negative and hence will not be allowed to enter its critical section. The process will
be blocked and will be placed in the queue. When the process that already entered its critical sec-
tion ultimately leaves the region, it closes its critical section with a signal on the same semaphore.
This will increment s by 1, and one of the blocked processes (if any) is removed from the queue of
blocked processes associated with the semaphore and put in a ready state. When it is next scheduled
by the operating-system scheduler, it can then enter its critical section.
All these together firmly guarantee that no two processes can execute wait() and signal() opera-
tions on the same semaphore at the same time. This is realized (in a single-processor environment)
by simply inhibiting interrupts during the time the wait() and signal() operations are executing. Once
interrupts are inhibited, instructions from different processes cannot be interleaved, and only the cur-
rently running process executes until interrupts are re-enabled and the scheduler can regain control.
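The same pattern, expressed with POSIX counting semaphores and threads, looks roughly as follows; it is only a minimal sketch (error checking omitted), with the semaphore s initialized to 1 in the spirit of Figure 4.18.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t s;                 /* plays the role of the semaphore s ( := 1 )  */
static long shared_counter = 0; /* the resource guarded by the semaphore       */

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        sem_wait(&s);           /* wait(s): open the critical section          */
        shared_counter++;       /* critical section                            */
        sem_post(&s);           /* signal(s): close the critical section       */
    }
    return arg;
}

int main(void)
{
    pthread_t t[4];
    sem_init(&s, 0, 1);         /* initial value 1 = free                      */
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", shared_counter);   /* expect 400000              */
    sem_destroy(&s);
    return 0;
}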
It is to be noted that, since the wait() and signal() operation executions by the different processes on the same semaphore must exclude one another, this situation is itself a mutual exclusion problem; hence, busy waiting with this definition of the wait() and signal() operations is not really completely eliminated. In fact, we have moved busy waiting from the entry section to the critical sections of the application programs. However, the critical section [containing the wait() and signal() implementations] is usually very small and almost never occupied; hence, it involves only limited busy waiting, for a short duration, and that also occurs rarely. But if the critical section in an application program is relatively long and is almost always occupied, busy waiting in that situation really cannot be completely avoided.
adversely affect the degrees of parallelism among the contending processes in systems. Apart from
that, it creates other bad situations, such as the starvation of processes and deadlock in the system
that, in turn, also require additional mechanisms to resolve. Hence, it is necessary to willfully con-
trol all these ill effects of serialization, and that can be accomplished by varying the granularity of
individual semaphores.
The finest granularity of semaphores at one end is realized by dedicating a separate semaphore to guard each specific shared resource from simultaneous use by contending processes. As a result, a huge number of semaphores are required to guard all these different shared resources available in a system for the sake of synchronization. The storage requirement overhead is then also appreciable, and the total time required by these semaphores to operate contributes a huge runtime overhead due to processing of numerous waits and signals.
The coarse granularity of semaphores, on the other hand, can be made by assigning each semaphore to guard a collection of shared resources, possibly of similar types. This approach reduces the storage requirement and runtime overhead but adds extra cost required to negotiate an increased number of conflicts as well as enforcing rigorous serialization of processes, which may also have no other resources in common. In fact, coarse-grained semaphores, apart from creating priority inversion, may severely affect parallelism to such an extent that it often outweighs the benefits already accrued. Thus, the trade-off between coarse-grained and fine-grained semaphores must be carefully analyzed, and a satisfactory balance must then be realized on the basis of the application being handled after willful manipulation and compromise.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
4.17.5.14.1.1 Producers and Consumers with an Unbounded Buffer Any number of producers and consumers can operate without overlap on the buffer of unbounded capacity using their respective service rates within the specified time-slice. After the initialization of the system, a producer must obviously be the first process to run in order to provide the first item for the consumer. Each time the producer generates an item, an index (in) into the buffer is incremented. From that point on, a consumer process may run whenever there is an item in the buffer produced but not yet consumed. The consumer proceeds in a similar fashion, incrementing the index (out) but ensuring that it does not attempt to consume from an empty buffer. Hence, the consumer must make sure that it works only when the producer has advanced beyond it (in > out). Alternatively, if the consumer is considered the first process, then it begins by waiting for the first item to be produced by the producer.
A single general semaphore (counting semaphore) is used here, with a "produced" variable initialized to 0 as a counter to keep track of the number of items produced but not yet consumed. Since the buffer is assumed to be unbounded, the producer may run at any time to produce as many items as it can. When the producer generates an item, it is placed in the buffer, and this fact is signaled by means of the general semaphore PRODUCED; hence, no extra counter or check over the counter is required here. According to the assumption and the nature of the problem, this implies that the consumer can never get ahead of the producer. However, this approach, in general, cannot guarantee system integrity, since it has several limitations under certain situations.
The entire implementation could even be realized by a different algorithm employing a binary
semaphore (in place of a general semaphore) by means of using the two primitives WAIT and
SIGNAL attached to the semaphore in each critical section and calling them at the right point
for mutual exclusion. In that situation, an additional variable counter is required which is to be
incremented and decremented and to be checked at the right point in the procedure, PRODUCER
and CONSUMER, to keep track of whether the buffer is empty, and if so, provisions for appropri-
ate actions (wait) are to be made accordingly. But this implementation also suffers from certain
limitations.
Initially, both solutions to the problem, counting semaphores and binary semaphores, are found to have shortcomings under certain situations. After detecting flaws, a refined, corrected approach to overcome the limitations of the solutions was formulated by taking appropriate actions at the right point within the existing algorithms. Although this example is not a realistic one, it can be concluded that it is a fairly representative one that demonstrates both the power and the pitfalls of the semaphore while it is in action.
The details of these two approaches to separately solve the problem with algorithms, their limitations, and finally the correct solution with refined algorithms are described on the Support Material at www.routledge.com/9781032467238.
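A minimal pthread rendering of the unbounded-buffer idea is sketched below for a single producer and a single consumer; an array of generous size stands in for the unbounded buffer, and the item counts and names are illustrative assumptions rather than the algorithm given on the Support Material.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 1000                      /* "large enough" to act as unbounded here     */

static int   buffer[N];
static int   in = 0, out = 0;       /* producer and consumer indices               */
static sem_t produced;              /* items produced but not yet consumed         */

static void *producer(void *arg)
{
    for (int i = 1; i <= 10; i++) {
        buffer[in++] = i;           /* produce an item and place it in the buffer  */
        sem_post(&produced);        /* signal(PRODUCED)                            */
    }
    return arg;
}

static void *consumer(void *arg)
{
    for (int i = 1; i <= 10; i++) {
        sem_wait(&produced);        /* block until at least one item exists        */
        printf("consumed %d\n", buffer[out++]);
    }
    return arg;
}

int main(void)
{
    pthread_t p, c;
    sem_init(&produced, 0, 0);      /* nothing produced yet                        */
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}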
4.17.5.14.1.2 Producers and Consumers with a Bounded Buffer The producer/consumer prob-
lem, initially introduced with an unbounded buffer, demonstrates the primary issues and its related solu-
tions associated with concurrent processing with virtually no restriction over the execution of producers.
However, the unbounded buffer assumption is not a practical approach and may not be directly applicable in real-life situations, where computer systems with memory (buffer) of finite capacity are used. This section will thus deal with the same producer/consumer problem, now with a bounded buffer, and its solution, so that it may be applicable in a realistic situation where the shared buffer has a finite capacity. Here, the finite buffer consists of n slots, each capable of holding one item. It is implemented in a circular fashion by "wrapping around" from the last (highest index) to the first (lowest index) position. Two pointer variables, in and out, are associated with the buffer, the former for the producer and the latter for the consumer; they indicate the current slots (or the next place) inside the buffer at which the producers are to produce an item and from which the consumers are to consume the next item, respectively. This is depicted in Figure 4.20. These pointer variables, in and out, are initialized to 0, incremented according to the execution of the producer or consumer, and must be expressed modulo the size of the buffer. Now, the producer and consumer functions can be expressed as follows:
producer :
begin
produce pitem ;
while ( ( in + 1 ) mod buffersize = out ) do [ nothing ] ;
buffer [ in ] := pitem ;
in := ( in + 1 ) mod buffersize
end [ producer ]
consumer :
begin
while ( in = out ) do [ nothing ] ;
citem := buffer [ out ] ;
out := ( out + 1 ) mod buffersize ;
consume citem
end [ consumer ]
As usual, producers may produce items only when the shared global buffer is empty or partially
filled, that is, only when there are empty spaces available in the buffer to accept items. Otherwise,
new items produced might overwrite the already existing items produced earlier but not yet con-
sumed, which may damage the processing, making it unreliable. All the producers must be kept
waiting when the buffer is full. Similarly, consumers, when executing, may absorb only produced
items, making the buffer empty, thereby enabling the producers to run. The consumers must wait
when no items are available in the buffer; hence, they can never get ahead of producers.
At any point in time, the buffer may be empty, partially filled, or full of produced items. Let produce
and consume represent the total number of items produced and consumed respectively at any instant,
[Figure 4.20 shows the buffer slots bb(1) .. bb(n) used in circular fashion: in part (a) the out pointer trails the in pointer, while in part (b) the in pointer has wrapped around and trails the out pointer.]
FIGURE 4.20 An algorithm illustrating the logic to solve the classical producer–consumer problem with bounded (finite) buffer used in circular fashion.
and let item-count be the number of items produced but not yet consumed at that instant, that is, [item-
count = produce – consume]. Let canproduce and canconsume be two conditions indicating the current
status of the buffer that can be used to control the execution of the producers and consumers, respec-
tively, with the conditions [canproduce: item-count < buffersize], since the producers are allowed to run
only when there are empty spaces available in the buffer, and [canconsume:item-count > 0], since the
consumers can continue their execution if there exists at least one item produced but not yet consumed.
Figure 4.21 shows a solution to the producer/consumer problem with a bounded buffer in which
two types of semaphores have been used. The general semaphores canproduce and canconsume
Program/segment bb–producer–consumer
...
const
buffersize = . . .
type
item = . . .
var
buffer : array [ 1 . . . buffersize ] of item ;
canproduce, canconsume : semaphore ; [ general ]
pmutex , cmutex : semaphore ; [ binary ]
in, out : ( 1 . . . buffersize ) ;
procedure producers ;
var pitem : item ;
begin
while true do
begin
wait(canproduce) ;
pitem := produce ;
wait(pmutex) ;
buffer [ in ] := pitem ;
in := ( in mod buffersize ) + 1 ;
signal(pmutex) ;
signal(canconsume) ;
other–producer–processing
end [ while ]
end [ producers ]
procedure consumers ;
var citem : item ;
begin
while true do
begin
wait(canconsume) ;
wait(cmutex) ;
citem := buffer [ out ] ;
out := ( out mod buffersize ) + 1 ;
signal(cmutex) ;
signal(canproduce) ;
consume (citem) ;
other–consumer–processing
end [ while ]
end [ consumers ]
[ main program ]
begin [ bb–producer–consumer ]
in := 1 ;
out := 1 ;
signal(pmutex) ;
signal(cmutex) ;
[ canconsume := 0 ; ]
for i := 1 to buffersize do signal(canproduce) ;
initiate producers , consumers
end [ bb–producer–consumer ]
FIGURE 4.21 An algorithm describing the modified solution of the classical producer–consumer problem with bounded (finite) buffer using semaphores (both counting and binary).
represent the two conditions to control the execution of producer and consumer processes, respec-
tively, as already explained. Two binary semaphores pmutex and cmutex are used to protect the
buffer (atomic action) while producers and consumers are active in their respective turns manipulat-
ing their index (in or out). Consequently, this solution supports multiple concurrent producers and
consumers.
As shown in Figure 4.21, the producer processes, producers, can run only when there is any
empty space in the buffer, indicated by the semaphore canproduce. This semaphore may initially
be set to the value corresponding to the buffer size, thus allowing producers to get up to buffersize
items ahead of consumers. Alternatively, it can also be set to 0, showing the buffer is empty at
the beginning, and the producers can then proceed accordingly. However, when a consumer com-
pletes its turn, it empties a slot by consuming the item, removing it from the buffer, and signals
the fact through the canproduce semaphore to wake up a waiting (sleeping) producer, if there is
any. Similarly, the canconsume semaphore indicates the availability of items already produced and
behaves almost in the same manner as the unbounded-buffer version of the consumers. Each time
a producer process runs, it increments the value of canconsume [signal (canconsume)], and a con-
sumer process decrements it by executing a wait operation.
Actions being taken as shown in Figure 4.21 in relation to buffer manipulations by both produc-
ers and consumers are treated here as critical sections which are kept protected with the use of the
binary semaphores pmutex and cmutex to make this solution more versatile while using the global
buffer. The modulo operator is used to implement the buffer in circular fashion. Two indices, in and
out, are used by producers and consumers, respectively, to increase the degree of concurrency in
the system. Usually, two sets of processes operate at different ends of the buffer, and they compete
with their peers within the group (intra-group), but not between the groups (inter-group). This situa-
tion is handled by a single semaphore mutex in the solution of the unbounded–buffer case presented
earlier. However, a single semaphore may unnecessarily block the producers whenever a consumer
is in its critical section and vice-versa. As the indices are disjoint, two semaphores offer more con-
currency, of course with a little bit of additional overhead.
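A compact C/pthread transcription of the scheme of Figure 4.21 is sketched below; the names mirror the figure, but the buffer size, item counts, and 0-based indexing are assumptions of this illustration, and error handling is omitted.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define BUFFERSIZE 8

static int buffer[BUFFERSIZE];
static int in = 0, out = 0;            /* 0-based indices, used modulo BUFFERSIZE */
static sem_t canproduce, canconsume;   /* general (counting) semaphores           */
static sem_t pmutex, cmutex;           /* binary semaphores guarding in and out   */

static void *producers(void *arg)
{
    for (int pitem = 1; pitem <= 20; pitem++) {
        sem_wait(&canproduce);               /* wait for an empty slot             */
        sem_wait(&pmutex);
        buffer[in] = pitem;
        in = (in + 1) % BUFFERSIZE;
        sem_post(&pmutex);
        sem_post(&canconsume);               /* one more item available            */
    }
    return arg;
}

static void *consumers(void *arg)
{
    for (int i = 0; i < 20; i++) {
        sem_wait(&canconsume);               /* wait for a produced item           */
        sem_wait(&cmutex);
        int citem = buffer[out];
        out = (out + 1) % BUFFERSIZE;
        sem_post(&cmutex);
        sem_post(&canproduce);               /* one more empty slot                */
        printf("consumed %d\n", citem);
    }
    return arg;
}

int main(void)
{
    pthread_t p, c;
    sem_init(&canproduce, 0, BUFFERSIZE);    /* buffersize empty slots at start    */
    sem_init(&canconsume, 0, 0);
    sem_init(&pmutex, 0, 1);
    sem_init(&cmutex, 0, 1);
    pthread_create(&p, NULL, producers, NULL);
    pthread_create(&c, NULL, consumers, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}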
kept waiting unless a writer has already received the access right to use the shared resource. That
is, no reader should wait for other readers to complete because a writer is waiting. In other words,
unlike the mutual exclusion problem, many readers are allowed to concurrently operate on a shared
resource (critical section), as they do not change the content of the shared resource. Based on this
discussion, a typical appropriate solution of this problem can be obtained, the details of which are,
however, outside the purview of our present discussion.
• Semaphores are not really structured: They require strict adherence to the rules and regulations developed for each specific problem to be solved, and failing in this may lead to corrupting or blocking the entire system. They also impose strict serialization of processes during runtime and thus require a specific form of service discipline among the waiting processes.
• Semaphores do not support data abstraction: They only ensure protected access to critical sections and never control the type of operations to be performed in the critical section by the contending processes. Moreover, they perform interprocess communication via global variables, which are thereby exposed, and that in turn invites severe threats.
• Semaphores do not have any programming-language construct: The user program has no choice but to religiously and carefully follow the synchronization protocol laid down for the semaphore. Since the compiler is not aware of the specific resource to be shared, no compilation check can be carried out to detect errors in its use.
Besides other pertinent issues, all these drawbacks encouraged designers to devise alternative, more suitable mechanisms that could, by and large, alleviate the notable disadvantages of using semaphores.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
The rationale for these semantics is that a signal should represent the situation that an event has
just occurred, not that it occurred sometime in the past. If another process detects this occurrence
at an arbitrary time later (as is the case with the passive semaphore operations), the causal relation-
ships among calls to the wait and signal functions are lost. These semantics, however, have a lasting bearing and far-reaching influence on the design and development of monitors, another synchroniza-
tion tool described in later sections.
An example detailing this topic is given on the Support Material at www.routledge.com/
9781032467238.
region v do S ;
where the sentence or sequence of sentences following do is executed as a critical section. At the
time of generating code for a region, the compiler essentially injects a pair of wait and signal opera-
tions or their equivalents (calls to semaphore) around the critical section. Thus, mutual exclusion,
as enforced by the use of the semaphore, has been properly realized by the use of region. Although
the use of region may invite deadlock and also starvation of the processes waiting in the queue for
the same shared variable, those can be alleviated by the compiler and with the use of appropriate
scheduling algorithms on the waiting queue, respectively. Like a semaphore, the critical region, when used as the only facility to synchronize competing processes, also requires busy waiting in a wait operation until some condition occurs, in addition to its other drawbacks. To solve the synchronization problem without a busy wait, Brinch Hansen proposed a slight modification to the existing critical region, another construct called the conditional critical region. This construct is in all respects syntactically similar to the critical region, with the addition of only an extra instruction, introduced by a new keyword called await and attached to an arbitrary Boolean condition, to be used inside a critical region. This construct, when implemented, permits a process waiting on a condition within a critical region to be suspended and places it in a special queue until the related condition is fulfilled. Unlike a semaphore, a conditional critical region can then admit another process into the critical section in this situation. When the condition is eventually satisfied, the suspended process is awakened and is once again brought into the ready queue. Thus, the conditional critical region satisfies both mutual exclusion and synchronization without a busy wait.
Conditional critical regions are very easy to use for complex mutual exclusion and synchronization problems, but they become confusing and even cumbersome when keeping track of the dynamic changes of the several possible individual conditions that the await statement requires. The common implementation of the conditional critical region normally assumes that each completed process may have modified the system state in such a way that some of the waited-on conditions have become fulfilled. This incurs the additional cost involved in frequent checking of conditions. Moreover, the code that modifies shared data may be scattered throughout a program, making it difficult to keep track of in a systematic manner. Due to such undesirably complex implementation, conditional critical regions are rarely supported directly in commercial systems.
For more details with algorithms and also an example on this topic, see the Support Material
at www.routledge.com/9781032467238.
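To suggest what a compiler might generate for an await inside a conditional critical region, the following C sketch guards a shared counter with a pthread mutex and condition variable; the construct itself does not exist in C, so this is only an illustrative translation under that assumption.

#include <pthread.h>

static pthread_mutex_t region  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  changed = PTHREAD_COND_INITIALIZER;
static int items = 0;                  /* the shared variable v              */

/* Roughly: region v do begin await ( items > 0 ) ; items := items - 1 end  */
void take_one(void)
{
    pthread_mutex_lock(&region);       /* enter the critical region          */
    while (items <= 0)                 /* await ( items > 0 )                */
        pthread_cond_wait(&changed, &region);   /* sleep, releasing region   */
    items--;                           /* body of the region                 */
    pthread_mutex_unlock(&region);     /* leave the critical region          */
}

/* Roughly: region v do begin items := items + 1 end                        */
void put_one(void)
{
    pthread_mutex_lock(&region);
    items++;
    pthread_cond_broadcast(&changed);  /* let waiters re-test their conditions */
    pthread_mutex_unlock(&region);
}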
4.17.5.18 Monitors
The limitations experienced in semaphore operations and the complexities involved in introducing the conditional critical region approach to negotiate the interprocess synchronization problem ultimately gave rise to a new concept, proposed first by Hoare (1974) in the form of a higher-level synchronization primitive called a monitor, which was later further refined by Brinch Hansen (1975) in a slightly different way.
The monitor is a programming-language construct that provides equivalent functionality to that
of a semaphore and has been implemented in a number of programming languages, including Ada,
Concurrent Pascal, Pascal-plus, Modula-2, Modula-3, Mesa, Concurrent Euclid, and also in a few
other languages. But they received a big boost when Arm architecture lately implemented it. Java
synchronized classes are essentially monitors, but they are not as pure as monitor advocates would
like, yet there is a full language-supported implementation for users to realize a synchronization
mechanism for Java threads. More recently, they have also been implemented as a program library
that, especially, enables the user to put a monitor lock on any object (e.g. to lock all linked lists).
Monitors are formed primarily to provide structural data abstraction in addition to concurrency control. This means that a monitor not only controls the timing but also determines the nature of the operations to be performed on global shared data in order to prevent harmful or meaningless updates. This idea is realized by way of fixing a set of well-defined and trusted data-manipulation procedures
to be executed on global data in order to limit the types of allowable updates. Abstract data types
actually hide all the implementation in data manipulation. Depending on whether the users are just
encouraged or actually forced to access data by means of the supplied procedures, data abstraction
may be regarded as weak or strong, respectively. The weak form of data abstraction may be found
in an environment supported by semaphores, since it never enforces using the supplied procedures
for global data manipulation, and it therefore sometimes makes the entire system truly vulnerable
to its users.
4.17.5.18.1 Definition
A monitor is a software module consisting of a collection of procedures, an initialization sequence,
local data variables, and data structures that are all grouped together in a special kind of package
for manipulating the information in the global shared storage. The main characteristics of a monitor
are:
• The local data variables with a data structure embedded in a monitor are only acces-
sible by the monitor’s supplied procedures and not directly by any other declared external
procedure.
• With the use of the public interface, a process can enter the monitor by invoking one of its public procedures.
• At any instant, only one process may be allowed to execute in the monitor, while the other
processes that have invoked the monitor by this time are suspended and are kept waiting
until the monitor next becomes available.
The first two characteristics encourage modularization of data structures so that the implementation of the data structure is private, with a well-defined public interface. This has a close resemblance to the definition of an object as found in object-oriented software. In fact, an object-oriented operating
system or a programming language can readily implement a monitor as an object with the required
special characteristics.
The third characteristic emphasizes that the monitor is able to achieve mutual exclusion, since
the data variables in the monitor can be accessed by only one process at a time. Thus, the shared
data structure can be easily protected by placing it in a monitor. If the data in a monitor represent
some resource, then the monitor ensures a mutual-exclusion facility at the time of accessing the
resource.
Since monitors are a programming-language construct, the compiler knows that they are special and can therefore handle calls to monitor procedures differently from other procedure calls, arranging for mutual exclusion. Typically, when a process calls a monitor procedure, the first few instructions of the called procedure check whether any other process is currently active within the monitor. If so, the calling process is kept suspended until the other releases the monitor. If no other process is active in the monitor, the calling process may safely enter. In any event, the user need not be aware of how the compiler arranges for mutual exclusion. It is enough to know that by turning all the critical sections into monitor procedures, no two processes will ever be allowed to execute their critical sections simultaneously.
monitor-name : monitor
begin
   declaration of private data ;               /* local variables used by the monitor */
   procedure pub-name ( formal parameters )    /* public procedures */
   begin
      procedure body ;
      .................. ;
   end ;
   procedure priv-name ;                       /* private procedures */
   .................. ;
   initialization of monitor data ;
end ( monitor-name ) ;
FIGURE 4.22 An algorithm showing the typical format to declare a monitor.
monitor sharedBuffer {
   int balance ;
   public :
      produce ( int item ) { balance = balance + item ; }
      consume ( int item ) { balance = balance - item ; }
}
FIGURE 4.23 An algorithm explaining the use of monitor while handling a shared variable.
A monitor handles concurrent requests internally by means of its own code and local variables, with a specific data structure that is completely hidden from users. In this way, interprocess synchronization and communication are handled by the monitor.
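As an illustration only, the sharedBuffer monitor of Figure 4.23 could be approximated in C with POSIX threads, using a mutex to stand in for the compiler-enforced one-process-at-a-time rule; C has no monitor construct, and the names shared_buffer, produce, and consume below are merely illustrative.

#include <pthread.h>

/* A hand-built "monitor" approximating Figure 4.23: the lock plays the
   role of the mutual exclusion a real monitor would enforce itself.   */
typedef struct {
    pthread_mutex_t lock;   /* guards every entry into the monitor */
    int balance;            /* the protected shared data           */
} shared_buffer;

void shared_buffer_init(shared_buffer *m)
{
    pthread_mutex_init(&m->lock, NULL);
    m->balance = 0;
}

/* "Public procedures": each acquires the lock on entry and releases it
   on exit, so only one caller is ever active inside at a time.        */
void produce(shared_buffer *m, int item)
{
    pthread_mutex_lock(&m->lock);
    m->balance = m->balance + item;
    pthread_mutex_unlock(&m->lock);
}

void consume(shared_buffer *m, int item)
{
    pthread_mutex_lock(&m->lock);
    m->balance = m->balance - item;
    pthread_mutex_unlock(&m->lock);
}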
However, in many cases it is observed that a process executing inside the monitor discovers that it cannot proceed until some other process takes a particular action on the information protected by the monitor. So a way is needed for an executing process to block when it cannot proceed; otherwise the process will perform an undesirable busy wait. In this situation, processes should be allowed to wait within the monitor on a particular condition without significantly affecting other monitor users. Consequently, another process may then enter the monitor for its own purpose. This idea of internal signaling operations was borrowed by monitors from semaphores.
In the case of the producer/consumer problem, when the producer inside the monitor finds that the buffer is full, it cannot proceed until the consumer process consumes an item. So, in this situation, a mechanism is needed by which the producer process is not only suspended but also temporarily relinquishes the monitor so that some other process may enter and use it. Later, when the condition is satisfied and the monitor is again available, the blocked process needs to be resumed and allowed to reenter the monitor at the point where it left off. To accommodate this situation, monitors incorporate condition variables to realize a solution with synchronization. This particular aspect of monitors is similar to conditional critical regions.
cwait(c): Suspends execution of the invoking process on condition c until another process
performs a csignal on the same condition. After execution of cwait, the monitor is then
available for use by another process.
csignal(c): Resumes execution of exactly one other process suspended after a cwait on the
same condition. If there exist several such processes, one of them is chosen; if no process
is waiting, then do nothing; the signal is not saved (and will have no effect).
queue: Returns a value of TRUE if there exists at least one suspended process on the condition variable and FALSE otherwise.
Condition variables are not counters. They do not accumulate signals for later use as semaphores do.
In fact, monitor wait/signal operations behave differently from those for semaphores. If a process in a monitor signals, that is, executes csignal(x), the corresponding condition queue is inspected. If some activity is waiting in that queue, the signaler is suspended and one waiter taken from the corresponding condition queue is allowed to become active (ready) in the monitor. If no activity is waiting on the condition variable in that queue, the signaler proceeds as usual, and the signal is simply ignored and lost without causing any damage. Since a monitor condition is essentially a header of the related queue of waiting processes, one consequence of this is that signaling on an empty condition queue in the monitor has no effect. The monitor, however, allows only one process to enter at any point in time; other processes that intend to enter join a queue of suspended processes while waiting for the monitor to become available. When a process inside the monitor suspends itself on a certain condition x by issuing cwait(x), it is placed in a queue of processes waiting on that condition and reenters the monitor later, when the condition is met. In addition to the monitor's entry queue (and the urgent queue for suspended signalers), a separate condition queue is thus formed for each condition, and processes blocked on a certain condition are placed in the respective queue.
When an executing process in the monitor detects a change in the condition variable x, it issues
csignal(x), which alerts the corresponding condition queue that the condition has changed. Here lies
the difference in the behavior of the signal operation that distinguishes Hoare’s version of monitor
semantics from Brinch Hansen’s approach.
With Hoare's approach, if a process P1 is waiting on a condition queue (for a signal) at the time when P0 issues that signal from within the monitor, P0 must either be suspended (blocked) on the monitor or immediately exit the monitor, while P1 at once begins execution within the monitor. When P1 completes its execution in the monitor, P0 will once again resume its execution in the monitor. In general, this definition of monitors says that if there is at least one process in a condition queue, a process from that queue runs immediately when another process issues a corresponding signal for that condition. The process issuing the signal must either be suspended (blocked) on the monitor or immediately exit the monitor. The rationale for Hoare's approach is that a condition is true at the particular instant when the signal occurs, but it may not remain true later, when P0, for example, finishes its execution within the monitor. In his original paper, Hoare uses these semantics to simplify proofs of the correct behavior of the monitor.
Brinch Hansen's monitor semantics incorporate the passive approach. (These semantics are also loosely known as Mesa monitor semantics because of their implementation in the Xerox Mesa programming language. Mesa semantics are, however, not identical: the Mesa approach is very similar to Brinch Hansen's with regard to the situation that arises from the behavior of the csignal operation, but its proposed solution is different. Mesa semantics will be discussed in detail separately later in this section.) Hansen's approach states that when P0 executes a signal (as already described) appropriate to a non-empty condition queue, the signal for that particular condition is saved, and P0 is not suspended; rather it is allowed to continue. When P0 later leaves the monitor, a process at the head of the respective condition queue, say P1, attempts to resume its execution in the monitor after rechecking the condition before it starts. Hansen argues for rechecking because, even though the signal indicates that an event has occurred, the situation may have changed by the time P0 performs the signal and P1 is allocated the CPU. This approach also incurs fewer process context switches than Hoare's, which ultimately tends to enhance the overall performance of the system.
With Hoare's semantics, a situation that leads to a wait operation may look like:
...
if (resource-Not-Available) resource-Condition.wait
...
/* now available; continue ... */
...
When another process executes a resource-Condition.signal, a process switch occurs in which one of the blocked processes gains control of the monitor and continues executing at the statement following the if statement. The process that performed the signal is then blocked and delayed until the waiting process finishes with the monitor.
With Brinch Hansen's semantics, the same situation could appear as:
...
while (resource-Not-Available) resource-Condition.wait
...
/* now available; continue ... */
...
This code fragment ensures that the condition (in this case resource-Not-Available) is rechecked before the process executing resource-Condition.wait proceeds. No process switch occurs until the process that performed the signal voluntarily relinquishes the monitor.
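As a concrete sketch of this distinction, POSIX condition variables follow a Mesa-like discipline, so portable C code must recheck its condition in a while loop, exactly as in the Brinch Hansen-style fragment above; the resource flag and function names here are hypothetical.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t monitor_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  resource_cond = PTHREAD_COND_INITIALIZER;
static bool resource_available = false;    /* protected by monitor_lock */

void acquire_resource(void)
{
    pthread_mutex_lock(&monitor_lock);
    /* "while", not "if": the condition is rechecked after every wake-up,
       because another thread may have taken the resource in between.    */
    while (!resource_available)
        pthread_cond_wait(&resource_cond, &monitor_lock);
    resource_available = false;            /* now available; continue    */
    pthread_mutex_unlock(&monitor_lock);
}

void release_resource(void)
{
    pthread_mutex_lock(&monitor_lock);
    resource_available = true;
    pthread_cond_signal(&resource_cond);   /* the signaler keeps running */
    pthread_mutex_unlock(&monitor_lock);
}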
Mesa semantics and monitors with notify and broadcast are discussed on the Support Material
at www.routledge.com/9781032467238.
Users of a monitor need not know the internal details of the procedures embedded in it, but they must have full knowledge of the interface specifications required for invoking these monitor procedures at a specific point whenever needed. The rest is up to the compiler, which checks that the syntax and semantics of user processes conform to the monitor's specifications at compile time and reports any errors it identifies.
The details of a monitor-based solution to the producer/consumer bounded-buffer problem, with an algorithm, and also of a user process that uses the monitor to solve the same problem, are given on the Support Material at www.routledge.com/9781032467238.
• While the process issuing the csignal must either immediately exit the monitor or be
blocked (suspended) on the monitor, it carries the cost of two additional context switches:
one to suspend the process and another to resume it again when the monitor later becomes
available.
• When a csignal is issued, a waiting process from the corresponding condition queue must
be activated immediately, and the process scheduler must ensure that no other process
enters the monitor before activation. Otherwise, the condition under which the process is
going to be activated may change.
While these issues raised in the proposed model with respect to the behavior of the csignal operation are very similar to those raised by Brinch Hansen in his model, the solution proposed in this model to cope with the situation is quite different. This proposed model was implemented in the programming language Mesa (Lampson 80) and thus is sometimes also referred to as Mesa semantics. In Mesa, a new primitive, cnotify, is introduced to solve these issues by replacing the existing csignal primitive, with the following interpretation: when a process active in a monitor executes cnotify(x), the x condition queue is notified, but the signaling process continues to execute rather than being blocked or exiting the monitor. The result of this notification is that the process at the head of the condition queue will be resumed later, at some time when the monitor is next available. However, since there is no guarantee that some other process will not enter the monitor before the waiting process starts, the waiting process must recheck the condition before resuming. So, at the cost of one extra recheck of the condition variable, we save some processing time by avoiding extra process context switches and, above all, drop the constraint that the waiting process must run immediately after a cnotify.
Given the advantage of a cnotify primitive that notifies a waiting process according to a prescribed rule rather than forcibly reactivating it, it is also possible to add a cbroadcast primitive with specific rules to the repertoire. The broadcast causes all processes waiting on a condition to be placed in the ready state, thereby relieving the process using cbroadcast from the burden of knowing exactly how many other processes should be reactivated.
A broadcast can also be used in a situation in which a process would have difficulty figuring out precisely which other process to reactivate. A good example is a memory manager. The memory manager has k bytes free; a process terminates, releasing an additional m bytes, but the memory manager does not know which waiting process can proceed with a total of k + m bytes; hence, it uses broadcast, and all processes check for themselves whether the total matches their requirements.
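A minimal sketch of this memory-manager scenario, using pthread_cond_broadcast as a stand-in for cbroadcast (all names and the bookkeeping are illustrative):

#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t mem_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  mem_freed = PTHREAD_COND_INITIALIZER;
static size_t free_bytes = 0;              /* the k bytes currently free */

/* Called when a terminating process returns m bytes to the manager. */
void release_memory(size_t m)
{
    pthread_mutex_lock(&mem_lock);
    free_bytes += m;
    /* The manager does not know which waiter can now proceed with
       k + m bytes, so it wakes them all and lets each one recheck.  */
    pthread_cond_broadcast(&mem_freed);
    pthread_mutex_unlock(&mem_lock);
}

/* Each waiting process checks its own requirement after every wake-up. */
void acquire_memory(size_t need)
{
    pthread_mutex_lock(&mem_lock);
    while (free_bytes < need)
        pthread_cond_wait(&mem_freed, &mem_lock);
    free_bytes -= need;
    pthread_mutex_unlock(&mem_lock);
}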
Besides all these advantages, this model also supports several other useful extensions.
Brief details of this topic with algorithms and its advantages are given on the Support
Material at www.routledge.com/9781032467238.
• Monitor code itself is more modular in its design structure, with all parts of the synchronization protocol under one module. This facilitates easy code rectification for any local change, and even a major change in the monitor code will not affect users' code at all as long as the interfacing rules remain unchanged. This is in contrast to semaphores, where synchronization actions may be part of each user process and may span a number of processes, so that any change in their structure and manipulation rules requires a thorough modification of all related user processes.
• Its ability to hide all the details of implementation from the users makes it quite transparent
and more secure in a way similar to the ISR.
• The use of monitors supports modular programming at the time of program development
to solve any related problem that, in turn, facilitates easier debugging and faster mainte-
nance of monitor-based programs.
• Monitor code is usually more regimented, with complementary synchronizing actions found in neighboring procedures (signal-receiving code). When a semaphore is used, the paired wait and signal operations may be spread over different processes and/or even different modules.
Although monitors represent a significant advancement over the devices used earlier, some of their major strengths are directly related to their weaknesses.
• While the presence of a number of monitors within an operating system may facilitate increased concurrency and provide flexibility in modular system design with ease of maintenance, the management of several system resources entrusted to such separate monitors may invite deadlocks. This may especially happen when a monitor procedure calls another monitor procedure (nested calls).
• As the definition of monitors virtually eliminates the possibility of any access to monitor variables by external agents, it leaves very little scope for system implementers to combat any problems that may arise inside the monitor or outside due to the execution of the monitor.
• Users are bound to use only those methods provided by the public monitor procedures for global data manipulation, and these are often found not to meet users' requirements when accessing a given shared resource. For example, if a certain file structure is imposed by a file monitor that alone does all reads and writes, then application programmers are effectively denied the freedom of interpreting the files in any other way. In many situations, this may not be acceptable to some categories of users, especially system programmers.
• Monitors never provide any control over the ordering of the waiting queues. The standard policy of treating them in FIFO order is not always appropriate. Some people therefore prefer a more general mechanism for inspecting and reordering the various queues.
• The artificial use of condition variables, which introduces much of the complexity in monitors, is also found inconvenient by programmers for regular use.
• While a monitor requires that exclusion not be in force for very long, this hinders applications that might require shared data for a very long time; exactly this happens, for example, in the well-known readers-writers problem.
4.17.5.23 Conclusions
Monitors have not been widely supported by commercial operating systems, including UNIX (though some versions of UNIX support mechanisms patterned after monitors), but they are still considered a powerful high-level language construct and as such are incorporated into many programming languages, including Ada, and have been useful for solving many difficult problems. A monitor hides all the details of its implementation from users, which makes it transparent, makes it more secure, and enables easy code modification whenever required. Monitors act as external functional extensions of user processes, but they differ from traditional external procedures in that they provide additional facilities for concurrency control and signaling that make parallel programming much less error-prone than with their counterpart, the semaphore.
It is interesting to note that the structuring and implementation logic of monitors conceptually look
very similar to the kernel of an operating system in all respects, and its different attributes are also very
similar to those of kernels, as already described in the previous section. But the essential difference
between a monitor and a kernel is that in a monitor-based operating system, there coexist a collection of
monitors in charge of different resources where each monitor controls a particular resource or a small
group of related resources. In contrast, the kernel of an operating system (monolithic implementation) is, in essence, a comparatively large single monitor consisting of a huge number of complex programs with numerous interactions that may sometimes be difficult to debug and enhance and, above all, tedious to maintain. In addition to being less reliable, monolithic operating systems often restrict concurrency by allowing at most one of their routines to be active at a time. On the contrary, each monitor itself implements mutual exclusion by enforcing serial execution of its procedures, and the presence of a large number of monitors simply permits unrestricted concurrency between processes that use separate monitors. In fact, monitors were originally introduced as an essential tool in the structuring of OSs.
Finally, it is observed that the increasing trend of implementing concurrent applications running within one address space using threads has appreciably changed the nature and overall behavior of the general problem. While synchronization of threads running across different address spaces is similar in nature to the usual interprocess synchronization already described, many of the characteristics of a monitor can naturally be implemented for programmer-scheduled threads within a single address space. The solutions thus targeted are much easier to derive than semaphore-based synchronization and easier to implement than full monitors. The reason is that the threads share a common address space, and only one thread executes in that space at a time. Basically, the approach lets the program control the address space while scheduling threads for execution so that a thread runs only when it does not violate a critical section. Whereas a generic solution cannot make any assumptions about the presence or absence of critical sections, this thread scheduling is performed by the program developer while building the application.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
calls in the form of generalized wait and signal operations (primitives) in which several operations can be done simultaneously, and the increment and decrement operations may change the semaphore values by amounts greater than 1. User-specified keys are used as semaphore
names. A key is associated with an array of semaphores. Individual semaphores in the array can
be accessed with the help of subscripts. A process intending to gain access to a semaphore makes
a semget system call with a key as a parameter. If a semaphore array matched with the key already
exists, the kernel makes it accessible to the process that makes the semget system call; otherwise
the kernel creates a new semaphore array, assigns the key to this array, and makes it accessible to
the process.
The kernel provides a single system call semop for wait and signal operations. The call uses
two parameters: a key and a list (subscript, op) of specifications, where subscript identifies a particular semaphore in the semaphore array and op is the operation to be performed. The entire
set of allowable operations is prescribed in the form of a list, where each operation is defned on
one of the semaphores in the semaphore array and is performed atomically as a set. This means that either all the operations defined in the list are performed and the process is then free to continue its execution, or none of them is performed and the process
is then blocked. Associated with each semaphore are queues of such processes blocked on that
semaphore. A blocked process is activated only when all operations, as indicated in semop, can
succeed.
Execution of semop itself is also atomic in nature; that is, only one process can access the sema-
phore at any instant, and no other process is allowed to access the same semaphore until all opera-
tions are completed or the process is blocked. It is interesting to note that the semantics of semop
itself facilitate avoiding deadlocks. A single semop either allocates all the resources that a process requires or allocates none of them. This attribute of semop resembles the all-requests-together approach, which is one way to prevent (avoid) deadlocks.
For more details about this topic, see the Support Material at www.routledge.com/9781032467238.
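The calls described above correspond to the System V semaphore facility found in UNIX. The following minimal sketch, with an arbitrary key value and most error handling omitted, shows a wait and a signal performed through semget/semctl/semop on one semaphore of a two-element array:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun { int val; };                 /* caller-defined, as semctl requires */

int main(void)
{
    key_t key = 0x5678;                   /* illustrative key value */
    int semid = semget(key, 2, IPC_CREAT | 0600);

    union semun arg = { .val = 1 };
    semctl(semid, 0, SETVAL, arg);        /* initialize semaphore 0 to 1 */

    struct sembuf wait_op   = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    struct sembuf signal_op = { .sem_num = 0, .sem_op = +1, .sem_flg = 0 };

    semop(semid, &wait_op, 1);            /* wait (P): decrement atomically */
    /* ... use the protected resource here ... */
    semop(semid, &signal_op, 1);          /* signal (V): increment          */
    return 0;
}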
4.18.1 MESSAGES
To negotiate all these issues, one approach may be to use a single mechanism, popularly known as
message passing. Messages are a relatively simple mechanism able to implement both interprocess
communication and synchronization and are often used in contemporary commercial centralized
systems as well as in distributed system environments. Computer networks also normally use this
attractive message passing mechanism to manage both interprocess communication and synchroni-
zation between concurrent processes that run on different nodes (machines) interconnected with one
another in networks. Distributed operating systems running on loosely coupled systems (multicomputers) as well as on tightly coupled systems (multiprocessors) usually also exploit messages for this purpose. Messages can also be used simply as a communication mechanism explicitly intended to
copy information (even to transfer major portions of the operating systems and/or application pro-
grams) without using shared memory from one address space into another process’s address space
of even other nodes (machines) located remotely. Sometimes the operating system itself is involved
in this communication activity, since concurrency is to be implemented across address spaces, but
those are inherently guarded by the protection mechanisms of memory management in order to
strictly prevent any form of malfunctioning. The OS in that situation must inject an additional
mechanism by which the information can be copied from one application’s address space to that of
another. Applications can then use the OS as an intermediary to share information by way of copy-
ing it in these systems. Messages have even been found in use in the implementation of 32-bit system buses, such as Multibus II (Intel) and Futurebus (IEEE), designed for microcomputer systems, which provide specialized hardware facilities for interprocessor communication with low overhead.
More details on this topic with a figure are given on the Support Material at www.routledge.
com/9781032467238.
FIGURE 4.24 A typical message format of message-passing mechanism used in interprocess communica-
tion and synchronization.
There may also be additional fields containing control information, such as a pointer field so that a linked list of messages can be created, a sequence number that keeps track of the number and order of messages passed between sender and receiver, and sometimes a priority field. The optional message body normally contains the actual message, and the length of the message may vary from one message to another (variable-length messages), even within a single operating system. However, designers of operating systems normally prefer short, fixed-length messages to minimize processing load and reduce storage overhead.
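As a sketch, a fixed-length message of the kind just described might be laid out as follows in C; the field names and sizes are purely illustrative and do not correspond to any particular operating system:

#include <stdint.h>

#define MSG_BODY_SIZE 64              /* illustrative fixed body length */

/* An illustrative fixed-length message layout: a small header carrying
   control information followed by an optional body.                    */
struct message {
    uint32_t sender_id;               /* source process                   */
    uint32_t receiver_id;             /* destination process or mailbox   */
    uint32_t type;                    /* message type                     */
    uint32_t length;                  /* number of valid bytes in body    */
    uint32_t seq_no;                  /* ordering between sender/receiver */
    uint8_t  priority;                /* optional priority field          */
    struct message *next;             /* pointer field to link messages   */
    char     body[MSG_BODY_SIZE];     /* optional message body            */
};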
…….
…….
process A ;
………
send ( B, message ) ;
………
………
process B;
receive (A, message) ;
………
Here, message represents the contents that are to be transmitted as the actual message, and B
and A in the parameters identify the destination (receiver) and the source of the message (sender),
respectively. Direct naming, by nature, is a symmetric communication in the sense that it is a one-
to-one mapping, and each sender must know the name of its receiver and vice versa. Although this provides a safe and secure means of message communication, it may impose a severe constraint when a service routine is implemented by the send/receive mechanism for public use by a community of users, since the name of each customer in this case must be known to the service routine beforehand so that it can expect the request.
…….
…….
process A ;
…….
send ( mailbox-1, message ) ;
…….
…….
process B ;
receive ( mailbox-1, message) ;
…….
The send operation places the generated message into the named mailbox, mailbox-1, and the receive operation removes a message from the named mailbox, mailbox-1, and provides it to the receiving process through the private variable message.
Ports also are often statically associated, in place of a mailbox, with a process for message com-
munication; that is, the port is created and assigned to the process permanently. In particular, a port
is typically owned by and created by the receiving process, and when the process is destroyed, the
port is also destroyed automatically.
The placement of the mailbox with regard to its location and ownership is a design issue for the OS developer to decide. That is, should the mailbox be put in an unused part of the receiver's (process B's) address space, or should it be kept in the operating system's space until it is needed by the receiver (process B)? Figure 4.25 shows the mailbox for process B, which is located in the user's
space. If the mailbox is kept in this manner in the user space, then the receive call can be simply a
library routine, since the information is being copied from one part of B’s address space to another.
However, in that situation, the translation system (compiler and loader) will have to take care to
allocate space in each process for the mailbox. Under this arrangement, there remains a possibil-
ity that the receiving process may sometimes overwrite parts of the mailbox inadvertently, thereby
destroying the links and losing messages.
The alternative to this approach is to keep B’s mailbox in the operating system’s space and
defer the copy operation until B issues the receive call. This situation is illustrated in Figure 4.26.
This option shifts the responsibility of mailbox space arrangement from user process onto the OS.
Consequently, it prevents any occurrence of inadvertent damage of messages or headers, since the
mailbox is not directly accessible to any application process. While this option requires the OS to
allocate memory space for mailboxes for all processes within its own domain, it at the same time
puts a system-wide limit on the number of messages awaiting delivery at any given instant. The
user processes, however, can access the mailbox only with the help of respective system calls that
the operating system should provide. In addition, the operating system should have extra support
for the maintenance of mailboxes, such as create_mailbox and delete_mailbox. Such mailboxes can be viewed as being owned by the creating process, in which case they terminate with the process, or they can be viewed as being owned by the operating system, in which case an explicit command such as delete_mailbox is required to destroy the mailbox.
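POSIX message queues behave much like mailboxes kept in the operating system's space, with explicit create and delete operations; the following sketch is only illustrative (the queue name and sizes are arbitrary, and error handling is omitted):

#include <mqueue.h>
#include <fcntl.h>
#include <string.h>

int main(void)
{
    /* "create_mailbox": the kernel owns the queue until it is unlinked. */
    struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = 128 };
    mqd_t mbox = mq_open("/mailbox-1", O_CREAT | O_RDWR, 0600, &attr);

    /* send ( mailbox-1, message ) */
    const char *msg = "hello";
    mq_send(mbox, msg, strlen(msg) + 1, 0);

    /* receive ( mailbox-1, message ): the buffer must hold mq_msgsize bytes. */
    char buf[128];
    unsigned prio;
    mq_receive(mbox, buf, sizeof buf, &prio);

    /* "delete_mailbox": close our handle and remove the queue itself. */
    mq_close(mbox);
    mq_unlink("/mailbox-1");
    return 0;
}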
FIGURE 4.26 A schematic block diagram of the message-passing mechanism using mailboxes placed in operating system space.
The distinct advantage of using indirect naming (addressing) in message communication is that it makes the horizon totally open by decoupling the sender and receiver, allowing greater flexibility
in the use of messages. It provides a relationship that can be one-to-one, one-to-many, or many-to-
one, as well as many-to-many mappings between sending and receiving processes.
A one-to-one mapping is typically defined statically and permanently by creating a dedicated specific mailbox for the exclusive use of only two processes. This essentially establishes a private communication channel between them that insulates their interactions from erroneous interference by others. One-to-many mapping, on the other hand, is provided by a mailbox dedicated to a
single sender but used by multiple receivers. It is useful for applications in which a message or infor-
mation is to be broadcast to a set of processes. Many-to-one mapping is particularly important for
server processes, and it may be implemented by providing a public mailbox with numerous senders
(user processes) and a single receiver (server process). If the sender’s identity for any reason is felt
to be important, that could be provided within the body of the message itself. In the case of many
senders, the association of a sender with a mailbox may occur dynamically. Primitives such as connect and disconnect may be used for this purpose. Modern systems thus favor the implementation of mailboxes in the domain of the operating system, since it is a more versatile approach.
4.18.3.2 Copying
The exchange of messages between two processes simply means transferring the contents of the message from the sender's address space to the receiver's address space. This can be accomplished in several ways, either by copying the entire message directly into the receiver's address space or by simply passing a pointer to the message between the two related processes. In essence, message passing can be carried out by value or by reference. In distributed systems with no common memory, copying cannot be avoided. In a centralized system, however, the trade-off is between safety and efficiency.
Copying an entire message from the sender's space to the receiver's space in message transmission has several advantages. This approach keeps the two processes decoupled from each other, yet an imprint of the sender's data is made available to the receiver. The original data with the sender remain unaffected irrespective of any action taken by the receiver on its own copy. Similarly, the receiver's data also remain totally protected from any sort of direct access by the sender. Consequently, any malfunction of either process is fully localized in the sense that it can corrupt only the local copy and not the other one. The availability of such multiple private copies of data is always beneficial from a certain standpoint and is a possible alternative to its counterpart, a single copy of carefully protected global data, as found in a typical monitor approach.
The message-copying approach, however, also suffers from several drawbacks. It consumes additional processor and memory cycles, which add up to extra system time. Asynchronous message communication is a useful feature, but due to existing memory-protection schemes, it may also require that each message first be copied from the sender's space to the operating system's space (to a buffer) and from there to the receiving process's space (as already mentioned for the mailbox); that is a double copying effort, and it also requires the OS to maintain an extra dynamic memory pool just to deliver a single message.
Use of Pointers: To get rid of the complexities arising out of copying, an alternative approach was devised in which a pointer to the message is passed between the sender and receiver processes. Although this provides a faster solution by avoiding the copying of the message, it gives the receiver entry into the sender's address space, which may pose a threat to security. Moreover, as a single copy of the message is accessed by both the sender and the receiver, an additional mechanism is then required to synchronize their access to the message and to signal the end of any receiver operation so that the sender may reclaim the message for modification, if required.
Copy-on-write: Besides these two approaches with their merits and drawbacks, that is, multiple copies of messages with copying and a single copy of the message without copying, a viable hybrid alternative along the lines of the UNIX copy-on-write facility eventually emerged and was adopted by the Mach operating system. With copy-on-write, the sender and receiver copies of
the exchanged message are logically distinct. This allows each process to operate on the contents
of the message freely with no concern for interference. However, the OS attempts to optimize per-
formance by initially sharing a single physical copy of the message that is mapped into the address
spaces of both sender and receiver. As long as both processes only read the message, a single
physical copy is enough to serve the purpose. However, when either process attempts to modify the
physically shared message space, the operating system intervenes and creates a separate physical
copy of the message. The address-map tables are accordingly re-mapped, and each process contin-
ues with its own separate physical copy of the message. Thus, the copy-on-write scheme supports
logical decoupling and at the same time eliminates the copying overhead in systems where the ratio
of reads to writes is high.
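The copy-on-write idea itself can be observed in ordinary UNIX systems through fork(): parent and child logically own separate copies of their data, yet the kernel shares the physical pages until one of them writes. The sketch below demonstrates only these semantics; it does not reproduce Mach's use of the technique for message passing.

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int shared_value = 42;      /* logically copied into the child at fork() */

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: this write makes the kernel create a private physical
           copy of the page; the parent's copy is unaffected.            */
        shared_value = 99;
        printf("child sees %d\n", shared_value);   /* prints 99 */
        _exit(0);
    }
    wait(NULL);
    printf("parent sees %d\n", shared_value);      /* still 42  */
    return 0;
}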
The position of a message in the queue may be decided based on the priority given along with the message in the priority field, which is part of the control information. Another alternative is to allow the receiver to inspect the message queue and select which message to receive next.
When the message exchange is synchronous, both the sender and the receiver must arrive together
to complete the transfer. In synchronous systems, the synchronous send operation incorporates a
built-in synchronization strategy which blocks (suspends) the sending process until the message is
successfully received by the receiving process. In fact, when a sender wants to send a message for
which no outstanding receive is issued, the sender must be blocked until a willing receiver accepts
the message. In other words, the send call synchronizes its own operation with the receipt of the
message.
The synchronous send-receive mechanism has many advantages. First of all, it has comparatively low overhead and is easier to implement. Second, the sender knows that its message has actually been received, and there is no possibility of any damage once the send operation has completed. Last but not least, if the sender attempts to transmit a message to a nonexistent process, an error is returned to the sender so that it can synchronize with the occurrence of the error condition and take appropriate action. However, one of the serious drawbacks of this approach is that it forces synchronous operation of senders and receivers, which may not be desirable in many situations, as exemplified by public server processes, in which the receiver and the sender processes usually run at different times.
With asynchronous message exchange, the asynchronous send operation delivers the message
to the receiver’s mailbox with the help of the operating system and buffers outstanding messages,
then allows the sending process to continue operation without waiting, regardless of any activity of
the receiver with the message (the mailbox may, however, be located within the domain of receiver’s
process or in the address space of operating system, as already discussed). The sending process here
need not be suspended, and the send operation is not at all concerned with when the receiver actu-
ally receives the message. In fact, the sender will not even know whether the receiver retrieves the
message from its mailbox at all.
The asynchronous send operation behaves like a "set and forget" mode of operation, and its distinct advantage is that it substantially increases the degree of concurrency in the system. All the messages being sent to a particular receiver are queued by the system without affecting the sender, which also allows other senders to create new messages, if required.
Although the asynchronous send operation is a useful feature, its drawbacks may cause several adverse situations. For example, if a sender transmits a message to a nonexistent process, it is then not possible for the OS to identify the specific mailbox in which to buffer the message. But the sender is completely unaware of this situation and continues after "transmitting" the message and, as usual, does not expect a return value. The fate of this message is then simply unpredictable. Since there is no blocking to discipline the process, these types of messages keep consuming system resources, including processor time and buffer space, to the detriment of other processes
and also the operating system. Moreover, as there is no mechanism that causes an alert, such as
UNIX signals, there is no way for the OS to tell the sending process about the status of its opera-
tion. Therefore, an additional mechanism would be required to block an asynchronous send operation in this situation until the message is actually placed in the receiver's mailbox. But there exists no such implied synchronization between the sending and receiving processes, since that would fundamentally oppose the philosophy behind the asynchronous send operation: the sender process is not to be suspended, and the receiver may retrieve the message from the mailbox at any arbitrary time after it has been delivered. Therefore, the non-blocking send places the burden entirely on the programmer to ascertain that a message has actually been received. Hence, the processes must employ "reply messages" to acknowledge the receipt of an actual message. We will next discuss an appropriate mechanism by which this can be properly accomplished.
Another situation may occur that sometimes becomes critical due to the inherent drawback of the
asynchronous send operation. When a sending process starts producing messages uncontrollably
that quickly exhaust the system’s buffering capacity, it then creates blockage of all further message
communication between other processes. One way to solve this problem may be to impose a certain limit on the extent of buffering for each sender-receiver pair or on a per-mailbox basis. In either case, however, this buffering of outstanding messages causes additional system overhead to be incurred.
Another common problem related to both of these implementations is starvation (indefinite postponement). This usually happens when a message is sent to a definite destination but never received. Among many possible reasons, this may be due to a crash of the receiver or a fault in the communication line, or it may be that a receiver is waiting for a message which is never created. Whatever the cause, this failure to complete a transaction within a finite time is not at all desirable, especially in an unbuffered (synchronous) message system, because it may automatically block the unmatched party. To address this problem, two common forms of the receive primitive are used: a non-blocking (waitless) version and a blocking (timed-wait) implementation.
The blocking form of the receive primitive is a blocking receive operation that inspects the des-
ignated mailbox. When a process (receiver) calls receive, if there is no message in the mailbox, the
process is suspended until a message is placed in the mailbox. Thus, when the mailbox is empty, the
blocking receive operation synchronizes the receiver’s operation with that of the sending process.
But if the mailbox contains one or more messages, the calling process is not suspended, and the
receive operation immediately returns control to the calling process with a message. Note that the
blocking receive operation is exactly analogous to a resource request in the sense that it causes the
calling process to suspend until the resource, that is, an incoming message, is available.
The non-blocking form of the receive primitive is a non-blocking receive operation that inspects
the designated mailbox and then returns control to the calling process immediately (with no waiting
and without suspending) either with a message, if there is one in the mailbox, exactly in the same
manner as in the case of blocking receive, or with an indicator that no message is available. As the
blocking and non-blocking functions of receive are sometimes complementary, both these versions
of receive are often supported in some systems.
In short, sender and receiver together give rise to four different combinations, three of which are common, although any particular system may implement only one or two of them.
• Synchronous (blocking) send, blocking receive: Both the sender and receiver are blocked
until the message is delivered; this is sometimes referred to as rendezvous (a meeting
by appointment). This combination is particularly useful when tight synchronization is
required between processes.
• Asynchronous (non-blocking) send, blocking receive: Although the sender here may be
allowed to continue on, the receiver is blocked until the requested message arrives. This
is probably the most useful combination. It allows a process to send one or more messages
quickly to several destinations as and when required. A process that must receive a mes-
sage before it can proceed to do other useful work needs to be blocked until such a message
arrives. A common example in this case is a server process that exists to provide a service or a resource to other client processes, which remain blocked until their request for the service or the resource is granted by the server. Meanwhile, the server continues with its own work, serving other processes.
• Asynchronous (non-blocking) send, non-blocking receive: Neither the sender nor the
receiver is required to wait.
where time-limit is the maximum allowable time, expressed in clock ticks or any standard unit of
time, that the receiver can wait for the message. If none arrives, the OS would then return control
to the receiver and provide it with an indicator, perhaps via a special system message, that the time
limit has elapsed. The sender process can also be modified in this scheme to use an interlock mechanism of the form:
Sender:
………
send ( mailbox1, message )
receive ( ack, time-limit )
………
Receiver:
………
receive ( mailbox1, message, time-limit )
if message-received-in-time then
send (ack)
The sender sends a message, and the receiver after receiving the message will send back a special
acknowledgement message, ack, for which the sender waits. If the receiver for any reason does not
receive the original message, the time limit eventually will expire, and the sender then regains con-
trol (from its timed-out receive operation), at which point it can take appropriate remedial action.
The receiver process also cannot be held captive by a late message; it is signaled and informed about
the fault as soon as its own receive times out.
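A timed-wait receive of this kind is available, for example, as mq_timedreceive() on POSIX message queues. The fragment below is a sketch of the receiver side only; the helper name and the conversion of the caller's time limit into an absolute deadline are merely illustrative.

#include <mqueue.h>
#include <time.h>
#include <errno.h>
#include <sys/types.h>

/* Wait at most `seconds` for a message on `mbox`; returns the number of
   bytes received, or -1 with errno == ETIMEDOUT if the limit expires.   */
ssize_t timed_receive(mqd_t mbox, char *buf, size_t len, int seconds)
{
    struct timespec limit;
    clock_gettime(CLOCK_REALTIME, &limit);  /* absolute deadline required */
    limit.tv_sec += seconds;
    return mq_timedreceive(mbox, buf, len, NULL, &limit);
}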
the same name. This problem of conflicting names can be considerably reduced by simply grouping machines into domains, after which processes can be addressed as process@machine.domain. Under this scheme, the domain names must be unique.
Authenticity is another issue in the design of message-passing systems. It is addressed by techniques that verify the identity of the interacting processes involved in the communication. This is surprisingly difficult, particularly in the face of threats mounted by malicious, active intruders, and hence requires complex mechanisms, usually based on cryptography, by which a message can be encrypted with a key known only to authorized users.
In the case of a centralized system in which the sender and receiver exist on the same machine, the design considerations may be altogether different. The fundamental question is whether it is judicious to use a message-passing system, which employs the relatively slow operation of copying messages from one process to another, rather than its comparatively fast counterparts, the semaphore or the monitor, when better performance is sought. As a result, much work has been carried out to improve message-passing systems effectively, and many alternatives have been proposed. Among them, Cheriton (1984), for example, suggested limiting the size of a message so that it fits in the machine's registers and then performing message passing using these registers to make it much faster.
FIGURE 4.27 An algorithm illustrating the mutual exclusion of competing processes by using messages.
This approach is conceptually similar to a semaphore implementation. It also shows that the empty message used here only for signaling is enough to fulfill the desired purpose of synchronization. Here, a single message is passed from process to process through the system as a token permitting access to the shared resource. The receive operation (as shown in process user in Figure 4.27) should be an atomic (indivisible) action in the sense that, when invoked concurrently by several users, it delivers the message, if there is any, to only one caller (blocked process). The process receiving the message (as shown in process user in Figure 4.27) can then proceed accordingly. The other waiting processes, if any, remain blocked as usual and get their turns one at a time as the message is returned by processes (the send operation in process user in Figure 4.27) on their way out of the critical section. However, if the mailbox (mutex) is found empty at any instant, all processes will then be automatically blocked. These assumptions hold true for virtually all message-passing facilities. The real strength of message-passing systems is, however, fully realized when a message contains actual data and transfers them at the same time to the desired destination, thereby accomplishing both interprocess synchronization and communication within a single action.
This discussion has convincingly established that messages should not be considered a weaker mechanism than semaphores. It appears from Figure 4.27 that a message mechanism may be used to realize a binary semaphore. General semaphores can similarly be implemented with messages by increasing the number of token messages in the system to match the initial value of an equivalent general semaphore. For example, if there are eight identical tape drives available in a system that are to be allocated by means of messages, then eight tokens need to be created initially by sending null messages to the appropriate mailbox.
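Continuing that example as a sketch, a counting semaphore guarding eight tape drives could be simulated by preloading eight token messages into a mailbox; a POSIX message queue stands in for the mailbox here, and all names are illustrative.

#include <mqueue.h>
#include <fcntl.h>

static mqd_t tokens;                 /* mailbox holding one token per unit */

void init_tape_tokens(void)
{
    struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = 1 };
    tokens = mq_open("/tape-tokens", O_CREAT | O_RDWR, 0600, &attr);
    for (int i = 0; i < 8; i++)      /* one null message per tape drive */
        mq_send(tokens, "", 1, 0);
}

void acquire_drive(void)             /* "down": blocks if no token is left */
{
    char t[1];
    mq_receive(tokens, t, sizeof t, NULL);
}

void release_drive(void)             /* "up": return the token */
{
    mq_send(tokens, "", 1, 0);
}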
Message passing, unlike other strong contenders, is thus a robust and versatile mechanism for the enforcement of mutual exclusion that also provides an effective means of interprocess communication. That is why messages in different forms, with numerous implementation provisions and add-on facilities, are often found in use in both types of distributed operating systems, especially network operating systems.
important, we restrict ourselves from entering the domain of those primitives due to the shortage of
available space and for being beyond the scope of this text.
continues. If there is none, the sender comes out, doing an up as usual to allow the other deserving
process to start.
When a send fails to complete because the mailbox is full, the sender first queues itself on the destination mailbox, then does an up on mutex and a down on its own semaphore. Later, when a receiver removes a message from the full mailbox, an empty slot is created, and the receiver will notice that someone is queued attempting to send to that mailbox; one of the senders in the waiting queue will then be activated (woken up).
Similarly, when a receive is attempted on an empty mailbox, the process trying to receive a mes-
sage fails to complete and hence queues itself in the receive queue of the relevant mailbox, then does
an up on mutex and a down on its own semaphore. Later, when a sender, after sending a message,
observes that someone is queued attempting to receive from that mailbox, one of the receivers in the
waiting queue will be activated. The awakened receiver (process) will then immediately do a down
on mutex and then continue with its own work.
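The construction just outlined can be approximated in C with POSIX counting semaphores: a bounded mailbox guarded by a mutex semaphore, a counter of empty slots that blocks senders when the mailbox is full, and a counter of queued messages that blocks receivers when it is empty. This is a simplified sketch of the idea, not the exact per-process queueing algorithm described above.

#include <semaphore.h>
#include <string.h>

#define SLOTS   8
#define MSG_LEN 64

static char  mailbox[SLOTS][MSG_LEN];
static int   in, out;                 /* circular-buffer indices            */
static sem_t mutex;                   /* protects the mailbox structure     */
static sem_t empty_slots;             /* senders block when this reaches 0  */
static sem_t queued_msgs;             /* receivers block when this is 0     */

void mailbox_init(void)
{
    in = out = 0;
    sem_init(&mutex, 0, 1);
    sem_init(&empty_slots, 0, SLOTS);
    sem_init(&queued_msgs, 0, 0);
}

void send_msg(const char *msg)
{
    sem_wait(&empty_slots);           /* "down": wait for a free slot       */
    sem_wait(&mutex);
    strncpy(mailbox[in], msg, MSG_LEN - 1);
    mailbox[in][MSG_LEN - 1] = '\0';
    in = (in + 1) % SLOTS;
    sem_post(&mutex);                 /* "up" on mutex                      */
    sem_post(&queued_msgs);           /* wake a waiting receiver, if any    */
}

void receive_msg(char *msg)
{
    sem_wait(&queued_msgs);           /* block on an empty mailbox          */
    sem_wait(&mutex);
    strncpy(msg, mailbox[out], MSG_LEN);
    out = (out + 1) % SLOTS;
    sem_post(&mutex);
    sem_post(&empty_slots);           /* wake a waiting sender, if any      */
}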
This representative example shows that a user can build one synchronization primitive on top of another that the system provides. It can similarly be shown that semaphores and messages can equally be implemented using monitors and, likewise, that semaphores and monitors can be implemented using messages. We will not pursue these two issues further due to space limitations. However, interested readers can consult the book written by Tanenbaum on this subject.
name of the port is then announced within the system. A process that intends to communicate
with a server sends a connection request to the port and becomes its client. While sending a
message, a client can indicate whether it expects a reply. When the server receives the request,
it returns a port handle to the client. In this way, the server can communicate with many clients
over the same port.
For small messages, the message queue in the port object contains the text of the message. The
length of each such message can be up to 304 bytes. As already mentioned (in "Copying"), such messages get copied twice during message passing to keep the system flexible and at the same time reliable. When a process sends a message, it is copied into the message queue of the port. From
there, it is copied into the address space of the receiver. The length of the message is, however, kept
limited to only 304 bytes in order to mainly control the overhead of message passing within an
affordable limit, both in terms of space and time.
The second method of message passing is used for large messages. In order to avoid the overhead
of copying a message twice, a message is not copied in the message queue of the port. Instead, the
message is directly put into a section object. The section object is mapped in both the address spaces
of the client and the server processes. When the client intends to send a message, it puts the text of
the message in the section object and sends a message to the port (to signal the port) indicating that
it has put a message in the section object. The server itself then views the message in the section
object. In this way, the use of the section object helps to avoid the copying of the message into the
server’s address space.
The third type of message passing using LPC is comparatively faster and hence is called
quick LPC. Here again, the actual message is passed in a section object that is mapped in both
the address spaces of the client and the server processes. Quick LPC uses two interesting fea-
tures that are not present in the other types of LPC. Here, the server creates a thread for every
client. Each thread is totally dedicated to requests made by the respective client. The second
feature is the use of event-pair objects to synchronize the client and server threads. Each event-
pair object consists of two event objects: the server thread always waits on one event object,
and the client thread waits on the other one. The message-passing mechanism proceeds as
follows: The client thread submits a message in the section object; it then itself waits on its
own respective event object and signals the corresponding event object of the pair on which
the server thread is waiting. The server thread similarly waits on one event object and signals
the corresponding event object. To facilitate error-free message passing, the kernel provides a
function that ensures atomicity while signaling on one event object of the pair and waiting on
the other event object of the same pair.
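The event-pair handshake can be mimicked in portable C using two semaphores as stand-ins for the two event objects; this is only an illustration of the synchronization pattern, not of the actual Windows kernel objects, and request_buf below merely plays the role of the shared section object.

#include <semaphore.h>
#include <string.h>

static sem_t server_event;            /* the event the server thread waits on */
static sem_t client_event;            /* the event the client thread waits on */
static char  request_buf[304];        /* stand-in for the shared section      */

void event_pair_init(void)
{
    sem_init(&server_event, 0, 0);
    sem_init(&client_event, 0, 0);
}

/* Client side: deposit the request, signal the server's event,
   then wait on its own event for the reply.                     */
void client_call(const char *request)
{
    strncpy(request_buf, request, sizeof request_buf - 1);
    sem_post(&server_event);
    sem_wait(&client_event);
}

/* Dedicated server thread for this client: wait for a request,
   service it in place, then signal the client's event.          */
void server_loop(void)
{
    for (;;) {
        sem_wait(&server_event);
        /* ... process request_buf, writing the reply in place ... */
        sem_post(&client_event);
    }
}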
• Semaphores
• Signals
• Messages
• Pipes
• Sockets
• Shared memory
While semaphores and signals are used only to realize synchronization between processes, the oth-
ers, such as messages, pipes, sockets, and shared memory, provide an effective means of commu-
nicating data across processes (interprocess communication) in conjunction with synchronization.
For more details about this topic, with a description of each of the tools and supporting fgures,
see the Support Material at www.routledge.com/9781032467238.
FIGURE 4.28 A schematic view of a situation of three deadlocked processes with three available resources.
Process 1 acquires resource 1 and is requesting resource 2; Process 2 is holding resource 2 and
is requesting resource 3; Process 3 acquires resource 3 and is requesting resource 1. None of the
processes can proceed because all are waiting for release of a resource held by another process. As
a result, the three processes are deadlocked; none of the processes can complete its execution and release the resource that it owns, nor can they be awakened, even though other unaffected processes in the system might continue. The number of processes and the number and kind of resources possessed and requested in a deadlocked situation are not important.
This example illustrates the general characteristics of deadlock. If these three processes can run
serially (batch-wise) in any arbitrary order, they would then merrily complete their run without any
deadlock. Deadlock thus results primarily due to concurrent execution of processes with uncontrolled
granting of system resources (physical devices) to requesting processes. However, deadlocks can also occur as a result of competition over any kind of shared software resource, such as files, global data, and buffer pools. In a database system, for example, a program may have to lock several records it is using to avoid a race condition. If process X locks record R1 and process Y locks record R2, and then each process tries to lock the other one's record in order to gain access to it, deadlock is inevitable. Similarly, deadlocks can also result from the execution of nested monitor calls. All this implies that deadlocks can occur on hardware resources as well as on software resources.
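The record-locking example can be reproduced in miniature with two pthread mutexes acquired in opposite orders: if the two threads interleave badly, each ends up holding one lock while waiting forever for the other. The names below are illustrative.

#include <pthread.h>

static pthread_mutex_t record_R1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t record_R2 = PTHREAD_MUTEX_INITIALIZER;

/* Process X: locks R1, then tries to lock R2. */
void *process_x(void *arg)
{
    pthread_mutex_lock(&record_R1);
    /* ... work on record R1 ... */
    pthread_mutex_lock(&record_R2);   /* may block forever if Y holds R2 */
    pthread_mutex_unlock(&record_R2);
    pthread_mutex_unlock(&record_R1);
    return arg;
}

/* Process Y: locks R2, then tries to lock R1 (the opposite order). */
void *process_y(void *arg)
{
    pthread_mutex_lock(&record_R2);
    /* ... work on record R2 ... */
    pthread_mutex_lock(&record_R1);   /* may block forever if X holds R1 */
    pthread_mutex_unlock(&record_R1);
    pthread_mutex_unlock(&record_R2);
    return arg;
}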
Even a single process can sometimes enter a deadlock situation. Consider a situation when a pro-
cess issues an I/O command and is suspended awaiting its completion (result) and then is swapped
out for some reason prior to the beginning of the I/O operation. The process is blocked waiting on
the I/O event, and the I/O operation is blocked waiting for the process to be swapped in. The process
thus goes into a deadlock. One possible way to avoid this deadlock is that the user memory involved
in the I/O operation must be locked in main memory immediately before the I/O request is issued,
even though the I/O operation may not be executed at that very moment but is placed in a queue for
some time until the requested device is available.
Deadlock is actually a global condition rather than a local one. If a program whose process is involved in a deadlock is analyzed, no discernible error as such can be noticed. The problem thus lies not in any single process but in the collective action of the group of processes. An individual program generally cannot detect a deadlock, since it becomes blocked and unable to use the processor to do any work. Deadlock detection must thus be handled by the operating system.
never release them. Conversely, a process can release units of consumable resources without ever
acquiring them. More specifcally, a consumable resource is characterized as one that can be
created (produced) and destroyed (consumed) by active processes. Common examples of consum-
able resources are messages, signals, interrupts, and the contents of I/O buffers. In fact, there is no fixed limit on the number of such resources of a particular type in a system; the number can even vary with time. For example, an unblocked producing process may create any number of such
resources, as happens in the producer/consumer problem with an unbounded buffer. When such a
resource is acquired by an active process, the resource ceases to exist. Since such resources may
have an unbounded number of units, and moreover, the allocated units are not released, the model
for analyzing consumable resources signifcantly differs from that of serially reusable resources.
These consumable resources, however, can sometimes also cause deadlocks to occur.
For more details and fgures about this topic, see the Support Material at www.routledge.com/
9781032467238.
The fourth condition, however, is fundamentally different from the other three. In truth, it describes a situation that
might occur depending on the sequencing of requests and releases of resources by the involved
processes.
1. Mutual exclusion condition: Each resource is either currently assigned to exactly one pro-
cess only or is available.
2. Hold-and-wait condition: A process currently holding resources granted earlier can request new resources and wait for them to be assigned.
3. No preemption condition: Resources previously granted cannot be forcibly taken away from the process holding them; they must be released voluntarily by that process.
4. Circular wait condition: There must exist a closed chain of two or more processes, such that each process holds at least one resource requested by the next process in the chain.
All four conditions must be simultaneously present for a deadlock to occur. If any one of them is absent, no deadlock is possible. Thus, one way to negotiate deadlocks is to ensure, by design, that at every point in time at least one of the four conditions responsible for the occurrence of deadlocks is absent.
The first three conditions are merely policy decisions made for the error-free execution of concurrent processes and for enhanced performance, and hence cannot be compromised; at the same time, they are the primary conditions that invite deadlock. Although these three are necessary conditions for deadlock to occur, deadlock may not exist with only these three. The fourth condition is a potential consequence of the first three: depending on the sequencing of requests and releases of resources by the involved processes, it may or may not arise. It is actually the sufficient condition for a deadlock to exist and hence is often taken as the definition of deadlock. That is, given that the first three conditions exist, a sequence of events may occur that leads to an unresolvable circular-wait condition, resulting ultimately in a deadlock.
FIGURE 4.29 Resource allocation graphs: (a) holding a resource; (b) requesting a resource; (c) deadlock occurs.
the sake of correctness and also several unusual restrictions on processes for the purpose of conve-
nience. Thus, there exists an unpleasant trade-off between correctness and convenience. It is obvi-
ously a matter of concern which is more important and to whom. Under such conditions, it is difficult to arrive at any widely acceptable, judicious general solution.
4.19.6.1 Recovery
Second phase: Once the deadlock detection algorithm has succeeded and detected a deadlock, a strat-
egy is needed for recovery from deadlock to restore the system to normalcy. The frst step in deadlock
recovery is thus to identify the deadlocked processes. The next step is to break the deadlock using one of the following possible approaches; incidentally, none of them is particularly promising.
1. Killing All Deadlocked Processes
One of the simplest methods to recover the system from deadlock is to kill all the deadlocked processes to free the system. This may be one of the most common solutions, though it is not the most acceptable for many reasons, yet it is still taken into consideration in operating system design.
2. Rollback
Since the occurrence of deadlocks is always a possibility, system designers often make provision for maintaining separate, periodically taken checkpoints for all the processes residing in the
system. Checkpointing a process means that everything with regard to its runtime state, including
its memory image, state of the resources being currently held by the process, and similar other
important aspects, is recorded in a special fle so that it can be restarted later using this information.
As the execution of a process progresses, an entire collection of checkpoint fles, each generated at
different times, is sequentially accumulated.
When a deadlock is detected, the frst step is to identify which resources are responsible for
the deadlock. To recover the system, a process that holds one such resource is rolled back to an earlier moment when it did not yet hold the resource; the process is thereby preempted, so that the resource it held can be withdrawn and finally assigned to one of the deadlocked processes that needs it. Later, the rolled-back process can once again be restarted from the specified point at a convenient time. The risk in this approach is that the original deadlock may recur: if the restarted process tries to acquire the resource once again, it will have to wait until the resource becomes available. This strategy thus requires that
rollback and restart mechanisms be built into the system to facilitate high reliability and/or avail-
ability. In general, both rolling back and subsequent restarting may be diffcult, if not impossible,
for processes that cannot be safely repeated. This is true for processes that have made irreversible
changes to resources acquired prior to deadlock. Common applications include reading and writ-
ing messages to the network, updating fles while journalizing in transaction-processing systems
(e.g. examination processing, fnancial accounting, and reservation systems), and checkpointing in
real-time systems.
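As a rough idea of what such a checkpoint file might contain, the fragment below is a deliberately simplified, hypothetical sketch: a real implementation would also save the complete memory image, open-file state, and other kernel bookkeeping, but the principle of appending a restartable snapshot to a per-process file is the same.

#include <stdio.h>
#include <stdint.h>

#define MAX_RES 16

struct checkpoint {                 /* hypothetical, heavily simplified record */
    int32_t  pid;                   /* identity of the checkpointed process    */
    uint64_t prog_counter;          /* where execution is to resume            */
    uint64_t regs[16];              /* saved general-purpose registers         */
    int32_t  held[MAX_RES];         /* units of each resource class held       */
};

/* Append one snapshot to the process's checkpoint file; a sequence of such
   records accumulates as execution progresses, as described in the text.    */
int save_checkpoint(const char *path, const struct checkpoint *cp)
{
    FILE *f = fopen(path, "ab");
    if (f == NULL)
        return -1;
    size_t ok = fwrite(cp, sizeof(*cp), 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}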
3. Preemption
At a certain point in time, it may be possible to select a process to suspend for the purpose of
temporarily taking a needed resource from this suspended process and giving it to another process
in order to break the existing deadlock. The selection criterion should be cost-effective, and re-
invocation of the detection algorithm is then required after each preemption to inspect whether the
deadlock breaks. Under certain environments, particularly in batch processing operating systems
running on mainframes, this practice is very common, not to break any deadlocks but to expedite
the execution of a particular process by way of offering it unrestricted resource access.
Taking a resource away from a process in this manner, giving it to another one for use, and then returning it to the former process may be tricky at best and is highly dependent on the nature of the process and on whether the type of resource can be withdrawn and then easily given back.
4. Killing Processes
The most drastic and possibly simplest approach to break a deadlock is to kill one or more pro-
cesses successively until the deadlock no longer exists. The order in which processes are selected
for killing should be on the basis of certain predefned criteria involving minimum loss. After each
killing operation, the detection algorithm must be re-invoked to inspect whether the deadlock still
exists. Different selection criteria can be formulated in this regard while choosing a process to kill.
The process to be selected as victim should have:
Some of these quantities can be directly measured; others cannot. It is also difficult to estimate the time remaining. In spite of these limitations, such parameters are still taken into account, at least as guiding factors, while selecting a process as a victim in order to break the existing deadlock.
A brief description of this approach is given on the Support Material at www.routledge.com/
9781032467238.
1. Mutual Exclusion Condition
The first of the four conditions (Section 4.19.4), mutual exclusion, as already mentioned, is usually difficult to dispense with and cannot be disallowed. Access to a particular resource by many processes is generally provided for the sake of performance improvement, and such sharing requires mutual exclusion; hence mutual exclusion must be incorporated and supported by the operating system, even though it may invite deadlock. Some resources, such as shared or distributed databases, may at any instant allow multiple accesses for reads but only one exclusive access for writes. In this case, deadlock can occur if more than one process seeks write permission.
However, some types of device, such as printers, cannot be given simultaneously to many pro-
cesses without implementing mutual exclusion. By spooling printer output (thereby avoiding imple-
mentation of mutual exclusion), several processes can be allowed to generate output at the same
time. In this approach, the only process that actually requests the physical printer is the printer
daemon. Since this daemon never seeks any other resources, deadlock, at least for the printer, can be
eliminated. But competition for the disk space used for spooling may, in turn, invite a deadlock that cannot be prevented in this way.
Unfortunately, not all devices have the provision for being spooled. That is why the strategy
being religiously followed in this regard is: avoid assigning any resource when it is not absolutely
necessary, and try to ensure that as few processes as possible may claim the resource, which can be
easily enforced initially at the job scheduler level at the time of selecting a job for execution.
2. Hold-and-Wait Condition
The second of the four conditions (Section 4.19.4), hold-and-wait, appears slightly more
attractive. This condition can be eliminated by enforcing a rule that a process is to release all
resources held by it whenever it requests a resource that is not currently available; the process is then forced to wait. In other words, deadlocks are prevented because waiting processes do not hold any resources. There are two possible approaches to implement this strategy.
a. Require all processes to request all their needed resources prior to starting execution.
If everything is available, the process will then be allocated whatever it requires and
can then run to completion. If one or more resources are not available (busy), nothing
will then be allocated, and the process will be left waiting. This approach is, however,
ineffcient, since a process may be held up for a considerable period, waiting for all its
resource requests to be fulflled, when, in fact, it could have started its execution with
only some of its requested resources.
Apart from that, a real difficulty with this approach is that many processes cannot predict how many resources they will need until their execution has started. With some additional effort, the resource requirements of processes can sometimes be estimated, but such estimates made for preclaiming resources tend to be conservative and are always inclined toward overestimation. In
fact, preclaiming necessarily includes all those resources that could potentially be needed by a pro-
cess at runtime, as opposed to those actually used. This is particularly observed in so-called data-
dominant programs in which the actual requirements of resources are only determined dynamically
at runtime. In general, whenever resource requirements are to be declared in advance, the overesti-
mation problem cannot be avoided.
This estimation task appears somewhat easier for batch jobs running on mainframe systems, where the user has to submit, along with each job, a list (in JCL) of all the resources that the job needs. The system then tries to acquire all the resources immediately and, if they are available, allocates them to the job until the job completes its execution; otherwise, the job is placed in the waiting queue until all the requested resources are available. While this method adds a slight burden on
the programmer and causes considerable wastage of costly resources, it, in turn, does prevent
deadlocks.
Another problem with this approach is that the resources will not be used optimally because
some of those resources may actually be used only during a portion of the execution of the related
process, and not all the resources will be in use all the time during the tenure of the execution. As a result, some of the resources requested in advance will be tied up unnecessarily with the process and remain idle for a long time until the process completes, solely for the sake of deadlock prevention; meanwhile they cannot be allocated to other requesting processes. This will eventually lead to poor
resource utilization and correspondingly reduce the level of possible concurrency available in the
system.
b. The process requests resources incrementally during the course of execution but must release all the resources it already holds whenever a request is denied because some requested resource is unavailable. The major drawback of this approach is that some resources, by their nature, cannot be withdrawn safely and given back easily at a later time. For example, when a file is being updated, the operation cannot simply be stopped, because the system may be corrupted if the update is not carried to completion. In fact, withdrawal of a resource and its later resumption are workable only if this does not damage the integrity of the system and, moreover, if the overhead of the associated context/process switches remains within an affordable limit.
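Approach (b) can be sketched with POSIX mutexes standing in for the two resources (a hypothetical fragment, not a prescribed implementation): the process takes the first resource, tries the second without blocking, and, if that attempt is denied, releases everything it holds before retrying, so it never waits while holding a resource.

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t res_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t res_b = PTHREAD_MUTEX_INITIALIZER;

/* Acquire both resources without ever waiting while holding one of them. */
static void acquire_both(void)
{
    for (;;) {
        pthread_mutex_lock(&res_a);                 /* take the first resource   */
        if (pthread_mutex_trylock(&res_b) == 0)     /* second one available too? */
            return;                                 /* yes: hold both, proceed   */
        pthread_mutex_unlock(&res_a);               /* no: release what we hold  */
        usleep(1000);                               /* back off, then try again  */
    }
}

static void release_both(void)
{
    pthread_mutex_unlock(&res_b);
    pthread_mutex_unlock(&res_a);
}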
3. No-Preemption Condition
The third condition (Section 4.19.4), the no-preemption condition, can obviously be prevented by
simply allowing preemption. This means that the system is to be given the power to revoke at any
point in time the ownership of certain resources which are tied up with blocked processes. The no-
preemption condition can be prevented in several ways. One such way is that if a process holding resources is denied a further request, the process must relinquish all its original resources and, if necessary, request them again later, together with the additional new resources.
However, preemption of resources, as already discussed, is sometimes even more diffcult than
the usual approach of voluntary release and resumption of resources. Moreover, preemption is pos-
sible only for certain types of resources, such as CPU and main memory, since the CPU can be
regularly preempted by way of routinely saving its states and status with the help of process/context
switch operations, and memory can be preempted by swapping its pages to secondary storage.
On the other hand, some types of resources, such as partially updated databases, cannot be pre-
empted without damaging the system. Forcibly taking these resources away may be tricky at best
and impossible at worst. Therefore, preemption is possible only for certain types of resources, and then only if deadlock prevention is felt to be more important than the cost incurred for the process switch operations associated with preemption. Since this approach is not applicable to all types of resources in general, it is considered less promising and has largely fallen out of favor.
4. Circular Wait
Attacking the last one, the fourth condition (Section 4.19.4), the circular wait, can be eliminated in several ways. One simple way is to impose a rule saying that a process is allowed to hold only a single resource at any instant; if it needs a second one, it must release the first. This restriction, however, can be a serious hindrance, and that is why it is not always acceptable.
Another way to prevent a circular wait is by linear ordering of different types of system resources
by way of numbering them globally:
1. Printer
2. Plotter
3. Tape drive
4. CD-ROM drive
5. Disk drive, and so on
It is observed that the system resources are divided into different classes Rk, where k = 1, 2, . . ., n.
Deadlock can now be prevented by imposing a rule which says: processes can request and acquire
their resources whenever they want to, but all requests must be made in strictly increasing order
of the specifed system resource classes. For example, a process may request frst a printer and
then a disk drive, but it cannot request frst a tape drive and then a plotter. Moreover, acquiring
all resources within a given class must be made with a single request, and not incrementally. This
means that once a process acquires a resource belonging to class Rx, it can only request resources of class x + 1 or higher.
With the imposition of this rule, the resource allocation graph can never have cycles. Let us examine why this is true, taking two processes, A and B, where A holds a resource of class x and B holds a resource of class y. A deadlock can occur only if A requests the class-y resource and B requests the class-x resource. If x > y, then A is not allowed to request y; if x < y, then B is not allowed to request x. Either way, deadlock is not possible, no matter how the two processes are interleaved.
The same logic also holds for multiple processes. At any instant, one of the assigned resources
will be of the highest class. The process holding that class of resource will never ask for a resource of the same class as one already assigned (as discussed). It will either finish or, at worst, request resources of
higher classes. Eventually, it will fnish and release all its acquired resources. At this point, some
other process will acquire the resource of the highest class and can also complete its execution.
In this way, all processes will fnish their execution one after another, and hence there will be no
deadlock.
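Translated into code, the rule simply means that every process issues its requests in ascending class number, whatever order it happens to list them in. The sketch below is a hypothetical illustration in which the resource classes listed above are represented by an array of mutexes indexed by class number (a process is assumed to name each class at most once).

#include <pthread.h>

enum { PRINTER = 1, PLOTTER, TAPE, CDROM, DISK, NCLASSES };

static pthread_mutex_t res_lock[NCLASSES] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* Acquire the requested resource classes strictly in increasing class order,
   regardless of the order in which the caller listed them.                   */
static void acquire_ordered(const int *classes, int n)
{
    for (int c = 1; c < NCLASSES; c++)          /* walk classes in ascending order */
        for (int i = 0; i < n; i++)
            if (classes[i] == c)
                pthread_mutex_lock(&res_lock[c]);
}

static void release_ordered(const int *classes, int n)
{
    for (int c = NCLASSES - 1; c >= 1; c--)     /* release in the reverse order */
        for (int i = 0; i < n; i++)
            if (classes[i] == c)
                pthread_mutex_unlock(&res_lock[c]);
}

/* Example: int need[] = { DISK, PRINTER };
   acquire_ordered(need, 2);  ...use the devices...  release_ordered(need, 2);  */

Because every process, whatever its internal logic, ends up locking the classes in the same global order, no cycle of waiting processes can form.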
A slight variation of this algorithm can be made by dropping the requirement that resources
are to be acquired in a strictly increasing order of the specifed system resource classes and also
requiring that no process request a resource with a class lower than that of the resource it is already
holding. Now if a process initially requests resources of classes Rx and Rx+1 and then releases both of them during the tenure of its execution, it is effectively starting afresh, so there is no reason to
restrict it from requesting any resource belonging to any class less than x.
One serious drawback of this approach is that the resources must be acquired strictly in the prescribed order, as opposed to being requested only when they are actually needed. This may mean that some resources have to be acquired well in advance of their actual use in order to obey this rule. This
adversely affects resource utilization and lowers the degree of concurrency, since unused resources
already acquired are unavailable for needed allocation to other requesting processes already under
execution.
Another practical difficulty with this approach is to develop a particular numerical ordering of all the resources so that the specific rule can be imposed to eliminate the deadlock problem. In reality, the resources, including abstract ones such as disk spooler space, process table slots, and locked database records, together with the usual physical resources and their many different uses, may be so numerous that no specific ordering can effectively be worked out for actual implementation.
The avoidance approach, however, tends to underutilize resources by refusing to allocate them whenever there is any possibility of a deadlock. Consequently, it is rarely favored for use in modern operating systems.
Deadlock avoidance can be accomplished with the following two approaches:
1. Process Initiation Refusal: A process should not be started if its claims might lead to a
deadlock. Deadlock avoidance requires all processes to declare (pre-claim) their maxi-
mum resource requirements prior to execution. In fact, when a process is created, it must
explicitly state its maximum claim: the maximum number of units the process will ever
request for every resource type. The resource manager can honor the request if the stated
resource requirements do not go beyond the total capacity of the system or do not exceed
the total amount of resources that are available at that time, and it then takes the appropri-
ate actions accordingly.
2. Resource Allocation Refusal: The main algorithms to ensure deadlock avoidance are
based on the policies to be adopted while allocating resources in an incremental way to
requesting processes. Once the execution of a process begins, it then starts requesting its
resources in an incremental manner as and when needed, up to the maximum declared
limit. The resource manager keeps track of the number of allocated and the number of
available resources of each type, in addition to keeping a record of the remaining number
of resources already declared but not yet requested by each process. If a process requests
a resource which is temporarily unavailable, the process is then placed in waiting (sus-
pended). But if the requested resource is available for allocation, the resource manager examines whether granting the request could lead to a deadlock by checking whether each of the already-active processes could still complete even if all of them exercised all of their remaining options, acquiring every other resource they are entitled to by virtue of their remaining claims. If so, the resource is allocated to the requesting process, thereby ensuring that the system will not become deadlocked.
At any instant, a process may have zero or more resources allocated to it. The state of the system is
simply defned as the current allocation of resources to all already-active processes. Thus, the state
of a system consists of two vectors: the total number of resources, and the total number of available
resources, as well as two matrices, claim and current allocation, as already defned. A state is said
to be safe if it is not deadlocked and there exists at least one way to satisfy all currently pending
requests issued by the already-active processes following some specifc order without resulting in
a deadlock. An unsafe state is, of course, a state that is considered not to be safe. As long as the
processes tend to use less than their maximum claim, the system is likely (but not guaranteed) to be
in a safe state. However, if a large number of processes happen to make relatively large resource demands (at or very near their maximum claims) at almost the same time, the system resources will then be heavily in use, and the probability of the system state being unsafe becomes correspondingly higher.
Safety Evaluation: Remaining in a safe state can be guaranteed if the following strategy is used: when a resource is requested by a process, decide whether to grant the request on the basis of whether the difference between the maximum requirement of the process and its current allocation can be met with the currently available resources. If so, grant the request, and the system will possibly remain in a safe state. Otherwise, refuse the request issued by this process and suspend (block) it until it is safe to grant the request, because such a request, if granted, may lead to an unsafe state culminating ultimately in a deadlock; then exit to the operating system.
When a resource is released, update the availability data structure to reflect the current state and reconsider pending requests, if any, for that type of resource.
It is important to note that if a state is unsafe, it does not mean that the system is in a deadlock
or even indicate that a deadlock is imminent. It simply means that the situation is out of the hands
of the resource manager, and the fate will be determined solely by future courses of action of the
processes. Thus, the main difference between a safe state and an unsafe state is that as long as the
state is safe, it is guaranteed that all processes will be completed, avoiding deadlock, whereas in
an unsafe state, no such guarantee can be given. That is why a lack of safety does not always imply deadlock, but a deadlock always implies a lack of safety.
Brief details on this topic with fgures and examples are given on the Support Material at www.
routledge.com/9781032467238.
Deadlock avoidance, however, carries a number of restrictions and drawbacks:
• There must be a fixed number of resources to allocate and a fixed number of active processes in the system.
• The maximum resource requirement for each process must be declared in advance.
• Claiming resources in advance indirectly influences and adversely affects the degree of concurrency in the system.
• The resource manager tends to perceive a greater number of system states as unsafe, which eventually keeps many processes waiting even when the requested resources are available and could be allocated.
• Additional runtime storage space is required to detect the safety of system states, and there is an associated execution-time overhead for such detection.
• The active processes must be independent; that is, the order in which they execute must be free from the encumbrance of any synchronization requirement.
Consider, as an example, a system with five active processes (P1, P2, . . ., P5) and four types of resources (R1, R2, R3, R4). The nature of the current system state Sk is determined by the pattern of resources already allocated to processes, as shown in Figure 4.30(b). At any instant, the system state
can be deduced by enumerating the number of units of resource type held by each process. Let a
vector E be the total number of resources in existence in the system for each type of resource Rj, as
shown in Figure 4.30(a). Let Alloc be a table, as shown in Figure 4.30(b), in which row i represents
process Pi, column j represents Rj, and Alloc (i, j) is the number of units of resource Rj held by the
process Pi. Let another table Maxc be the maximum claim on resource Rj by process Pi, as shown in
Figure 4.30(c). Let Need be a table, as shown in Figure 4.30(d), in which row i represents process Pi, column j represents Rj, and Need(i, j) is the number of units of resource Rj still needed by process Pi. This table can be generated by computing the element-wise difference of the two tables Maxc(i, j) and Alloc(i, j), as follows:
Need(i, j) = Maxc(i, j) − Alloc(i, j), for all 0 < i ≤ n and 0 < j ≤ m
The available resource vector Avail for each type of resource Rj is simply the difference between the
total number of each resource that the system has and what is currently allocated. This is shown in
Figure 4.30(e) as an example.
This algorithm determines whether the current allocation of resources is safe by considering
each process Pi and asking: if this process suddenly requests all resources up to its maximum claim,
are there suffcient resources to satisfy the request? If there are, then this process could not be dead-
locked in this state, so there is a sequence whereby this process eventually finishes and returns all its resources to the operating system. In this way, if we can determine that every Pi, for all i, 0 < i ≤ n, can execute to completion, we declare the state safe.
(a) E: total number of each type of resource in the system
      R1  R2  R3  R4
E      7   5   3   2

(b) Alloc: resources currently allocated
      R1  R2  R3  R4
P1     3   0   1   0
P2     0   1   0   0
P3     1   1   1   0
P4     1   1   0   1
P5     0   1   0   1

(c) Maxc: maximum claim
      R1  R2  R3  R4
P1     4   1   1   0
P2     0   2   1   2
P3     4   2   1   0
P4     1   1   1   1
P5     2   1   1   1

(d) Need: resources still needed
      R1  R2  R3  R4
P1     1   1   0   0
P2     0   1   1   2
P3     3   1   0   0
P4     0   0   1   0
P5     2   0   1   0

(e) Avail: number of available resources of each type
      R1  R2  R3  R4
       2   1   1   0
FIGURE 4.30 An analytical example detailing Banker’s algorithm in deadlock avoidance with multiple
resources.
Assume that all the vectors, such as E and Avail, and all the tables, such as Alloc, Maxc, and Need, are present. Then:
1. Find a process Pi, 0 < i ≤ n, such that Need(i, j) ≤ Avail(j) for all j, 0 < j ≤ m. If no such Pi exists, then the state is unsafe, since no remaining process is guaranteed to run to completion; halt the algorithm and exit to the operating system.
2. Assume that the process of the row thus chosen requests all the resources it needs up to its maximum claim (which it is entitled to do) and then finishes. Mark the process as terminated, reclaim all the resources it acquired, and add them to the Avail vector.
3. Repeat steps 1 and 2 until either all processes are marked terminated, in which case the initial state was safe, or no eligible process remains, in which case the state was unsafe.
In step 1, if several processes are found to be eligible, it does not matter which one is selected: in any case, at best the pool of available resources gets larger as resources are released (which may indicate that those resources were underutilized), at the same time enabling many other waiting processes to start and finish and thereby avoiding deadlock; at worst, the pool of available resources remains the same.
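The safety test described in these steps can be written down compactly. The sketch below (an illustrative re-implementation of the procedure, not production code) encodes the Alloc, Need, and Avail data of Figure 4.30 and repeatedly looks for a process whose remaining need fits within the currently available resources; running it on this data reports that the state is safe.

#include <stdio.h>
#include <stdbool.h>

#define NPROC 5
#define NRES  4

/* State taken from Figure 4.30: Alloc, Need, and Avail. */
static const int alloc[NPROC][NRES] = {
    {3,0,1,0}, {0,1,0,0}, {1,1,1,0}, {1,1,0,1}, {0,1,0,1}
};
static const int need[NPROC][NRES] = {
    {1,1,0,0}, {0,1,1,2}, {3,1,0,0}, {0,0,1,0}, {2,0,1,0}
};
static const int avail[NRES] = {2,1,1,0};

static bool is_safe(void)
{
    int  work[NRES];
    bool done[NPROC] = {false};

    for (int j = 0; j < NRES; j++)
        work[j] = avail[j];                        /* working copy of Avail      */

    for (int finished = 0; finished < NPROC; ) {
        bool found = false;
        for (int i = 0; i < NPROC && !found; i++) {
            if (done[i])
                continue;
            bool fits = true;                      /* step 1: Need(i,*) <= work? */
            for (int j = 0; j < NRES; j++)
                if (need[i][j] > work[j]) { fits = false; break; }
            if (fits) {                            /* step 2: let Pi finish      */
                for (int j = 0; j < NRES; j++)
                    work[j] += alloc[i][j];        /* reclaim its resources      */
                done[i] = true;
                finished++;
                found = true;
            }
        }
        if (!found)                                /* no candidate: state unsafe */
            return false;
    }
    return true;                                   /* all marked terminated      */
}

int main(void)
{
    printf("state is %s\n", is_safe() ? "safe" : "unsafe");
    return 0;
}

The per-request decision of the avoidance strategy would simply allocate the requested units tentatively, run this check, and roll the allocation back (blocking the requester) if the resulting state turned out to be unsafe.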
The banker's algorithm is an excellent approach, at least from a theoretical point of view. In spite of its immense importance even today, especially in the academic domain, in practice the banker's algorithm has never been accepted as workable to implement. One of the main reasons is that the basic assumptions from which the algorithm is derived are not sound enough to trust, so it has the same drawbacks that all other avoidance strategies in general have. The most important one is that processes are rarely aware in advance of their maximum resource needs. Moreover, the number of processes is not fixed but varies dynamically as new users frequently enter and exit. In addition, resources that were supposed to be available can suddenly go out of reach (e.g. disk drives malfunctioning), thereby reducing the total number of available resources. Furthermore, avoidance is overly conservative, and there are many more factors that prevent the algorithm from working in practice; hence, in reality, few operating systems, if any, use the banker's algorithm for deadlock avoidance.
The banker's algorithm for a single type of resource is discussed on the Support Material at www.routledge.com/9781032467238.
It is also possible to combine several of these strategies into an integrated approach to deadlock handling:
• Group all system resources into a number of different disjoint resource classes.
• Apply a resource (linear) ordering strategy, as described earlier in relation to deadlock
prevention, while attacking circular wait, which can possibly prevent deadlocks between
resource classes.
• Within a resource class, use an algorithm that is most suitable for that particular class.
Grouping of resources can be accomplished based on the principle being followed in the design hier-
archy of a given operating system. Alternatively, this grouping can also be made in accordance with
the dominant characteristics of certain types of resources, such as permitting preemption or allow-
ing accurate predictions, and similar others. To describe this technique with an example, consider a typical system with the classes of resources listed below; the ordering of the list follows the order in which these classes of resources are assigned to a job or process during its lifetime, that is, from entry into the system until exit. These resources are:
• Swapping space: An area of secondary storage (disk) designated for backing up blocks of
main memory needed for use in swapping processes.
• Process ( job ) resources: Assignable devices, such as printers and fxed disks with remov-
able media, like disk drives, tapes, CDs, and cartridge disks.
• Main memory: Assignable on a block basis, such as in pages or segments to processes.
• Internal resources: Such as I/O channels and slots of the pool of the dynamic memory.
The ultimate objective of this strategy is first to prevent deadlock between these four classes of resources simply by linear ordering of requests in the order presented. Next, to prevent deadlock
within each class, a suitable local deadlock-handling strategy is chosen according to the specifc
characteristics of its resources. Within each individual resource class, the following strategies, for
example, may be applied:
In UNIX, the overall approach in this regard is simply based on deadlock prevention through resource ranking (as
already described in the previous section). The related approach is to lock and unlock (release) the
data structures in the kernel in a standard manner. However, there are some exceptions to this rule
inherent to UNIX. Because not all kernel functions can lock the data structures in the standard order, deadlocks cannot be totally prevented. We present simplified views of two arrangements that
are used to avoid deadlocks.
The UNIX kernel uses a disk cache (buffer cache) to speed up accesses to frequently used disk
blocks. This disk cache consists of a pool of buffers in primary memory and a buffer list with a
hashed data structure to inspect whether a specifc disk block exists in a buffer. The buffer list is
maintained using the least recently used (LRU) replacement technique in order to facilitate reuse
of buffers. The normal order of accessing a disk block is to use the buffer list with the hashed data
structure to locate a disk block; if found, put a lock on the buffer and also put a lock on the respective
entry in the buffer list thus obtained in order to update the LRU status of the buffer. If the requested
disk block is not found in the buffer list, it is obvious that the process would then merely want to
obtain a buffer for loading this requested new block. To achieve this, the process frst puts a lock on
the buffer list. It would then directly access the buffer list and inspect whether the lock on the frst
buffer in the list has already been set by some other process. If not, it then sets the lock and uses the
respective buffer; otherwise it repeats the same course of action on the next entry in the buffer list.
Deadlocks are possible because this order of locking the buffer list and the buffer differs from the standard order of setting these locks.
UNIX uses an innovative approach to avoid deadlocks. The process looking for a free buffer
uses a technique that enables it to avoid getting blocked on its lock. The technique is to use a special
operation that attempts to set a lock but returns with a failure condition code if the lock is already
set. If this happens, the operation is repeated with an attempt to set the lock on the next buffer, and
so on until it fnds a buffer that it can use. In this way, this approach avoids deadlocks by avoiding
circular waits.
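The heart of that technique is a non-blocking lock attempt. The sketch below uses pthread_mutex_trylock merely as a stand-in for the kernel's internal lock primitive (the UNIX kernel does not, of course, use the pthreads API, and the lock protecting the buffer list itself is omitted for brevity): the search walks the buffer pool and simply skips any buffer whose lock is already set.

#include <pthread.h>
#include <stddef.h>

#define NBUF 64

struct buffer {
    pthread_mutex_t lock;
    /* ... block number, data, LRU links ... */
};

static struct buffer buf_pool[NBUF];

static void buf_init(void)               /* initialize all buffer locks once */
{
    for (int i = 0; i < NBUF; i++)
        pthread_mutex_init(&buf_pool[i].lock, NULL);
}

/* Find a buffer that can be used for a new disk block without ever blocking
   on an individual buffer lock, thereby avoiding a circular wait.           */
static struct buffer *get_free_buffer(void)
{
    for (int i = 0; i < NBUF; i++) {
        if (pthread_mutex_trylock(&buf_pool[i].lock) == 0)
            return &buf_pool[i];         /* lock acquired: this buffer is ours */
        /* lock already set by another process: move on to the next buffer   */
    }
    return NULL;                         /* none free right now; caller retries */
}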
Deadlock is possible in another situation, when locks cannot be set in a standard order in the
fle system function that establishes a link. A link command provides pathnames for a fle and a
directory which is to contain the link to the fle. This command can be implemented by locking the
directories containing the fle and the link. However, a standard order cannot be defned for locking
these directories. Consequently, two processes trying simultaneously to lock the same directories
may get deadlocked. To avoid the occurrence of such deadlocks, the fle system function does not try
to acquire both locks at the same time. It first locks one directory, gets its work done in the desired manner, and then releases the lock. It then locks the other directory and does what it needs to do.
Thus, it requires and acquires only one lock at any time. In this way, this approach prevents dead-
locks because the hold-and-wait condition is not satisfed by these processes.
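The same pattern can be sketched as two disjoint critical sections (again only a hypothetical illustration of the idea, not the actual UNIX source): at no point does the function hold both directory locks, so the hold-and-wait condition never arises.

#include <pthread.h>

struct directory {
    pthread_mutex_t lock;       /* assumed to be initialized elsewhere */
    /* ... directory contents ... */
};

/* Create a link without ever holding both directory locks at once. */
static void do_link(struct directory *file_dir, struct directory *link_dir)
{
    pthread_mutex_lock(&file_dir->lock);
    /* ... update the directory containing the file (e.g. bump the link count) ... */
    pthread_mutex_unlock(&file_dir->lock);

    pthread_mutex_lock(&link_dir->lock);
    /* ... insert the new link entry into the target directory ... */
    pthread_mutex_unlock(&link_dir->lock);
}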
Yet another approach is to attach a timeout parameter to each resource request: if the resource manager fails or is not willing to honor the request within the time interval specified by this parameter, it can unblock the calling process and give it a failure indication. The process could then try again, try a different request, back out of whatever action it is trying to perform, or, as a last resort, terminate if it finds no other way out.
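With POSIX threads, essentially the same idea can be expressed with a timed lock request; in the hedged sketch below (the two-second limit and the function name are arbitrary choices for illustration), the caller receives an error code instead of waiting indefinitely and can then retry, back out, or give up.

#include <pthread.h>
#include <time.h>
#include <errno.h>

static pthread_mutex_t resource = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0 if the resource was obtained, or ETIMEDOUT if the request was not
   honored within 'seconds'; the caller may retry, back out, or terminate.     */
static int request_with_timeout(int seconds)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);   /* timedlock takes an absolute time */
    deadline.tv_sec += seconds;
    return pthread_mutex_timedlock(&resource, &deadline);
}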
There exist differences between deadlock prevention and avoidance. Prevention includes any
method that negates one of the four conditions already explained. Avoidance, on the other hand,
includes methods like the banker’s algorithm that takes advantage of prior knowledge (such as max-
imum claims). This advance knowledge alerts the system beforehand while navigating the progress
diagram and still avoids dangerous regions.
The conservative-liberal metaphor helps in rating the various policies that we have seen. Liberal
policies have a greater potential for parallelism and throughput because they allow more fexible
navigation of the progress diagram, but they have an equal potential for deadlock and starvation.
The advance-claim algorithm provides a reasonably liberal method that is still conservative enough
to prevent deadlock. To prevent starvation as well, other conservative steps must be taken, such
as banning new arrivals or concentrating resources on particular processes. Overly conservative
methods like one-shot allocation remove all worries about deadlock and starvation but at the cost of
severe reduction of concurrency, leading to potential degradation in system performance.
Although we have devoted a considerable amount of attention to the advance-claim algorithm, it is mostly of theoretical and academic interest and is not practicable. It should be pointed out that many applications are unable to compute reasonable claims before they start working; they can only discover what they will need once they have made some progress.
Furthermore, resources can suddenly become unavailable if the hardware malfunctions.
Therefore, most operating systems do not implement the advance-claim algorithm. It is much more
common to grant resources to processes following the most liberal policy imaginable and accept-
able. If this policy leads to deadlock, the deadlock is detected (either by the resource manager or
by disgruntled users) and broken by injuring some process(es). Hierarchical allocation has also
been used successfully, especially within the kernel, to make sure that its modules never become
deadlocked.
Our ultimate goal is to achieve the most liberal resource-allocation policy without encounter-
ing any deadlock. Serialization avoids deadlock but is very conservative. Figure 4.31 also shows
the resource-allocation policies that we have already explained. The one-shot, hierarchical, and
advance-claim algorithms put increasingly less onerous restrictions on the processes that wish to
request resources. The advance-claim algorithm also requires a certain amount of prior knowledge
about the resources required by each process. The more information of this nature we know, the
FIGURE 4.31 A pictorial representation of the liberal-conservative spectrum in rating various policies relating to the deadlock problem and its solution approaches (Finkel).
closer we can get to our goal. However, the algorithms that make use of such knowledge are increas-
ingly expensive to execute.
4.19.13 STARVATION
Deadlock arises from dynamic resource sharing. If resources are not shared, there can never be
deadlock. Moreover, deadlock results due to overly liberal policies for allocation of resources.
Starvation is a problem that is closely related to deadlock. In a dynamic system, it is required
that some policy be formulated to decide which process will get what resource when and for how
long. While some of these policies are quite reasonable and attractive from the standpoint of resource allocation for negotiating deadlock, they may sometimes give rise to a peculiar situation in which certain processes are indefinitely postponed and denied the desired service. Even when the needed resources are available, such a process remains blocked and might never get a chance to run again; we call this danger starvation, and the affected process is then said to be starving, even though it is not deadlocked.
Starvation arises from consistently awarding resources to the competitors of a blocked process.
As long as one competitor has the resources, the blocked process cannot continue. Starvation, in gen-
eral, may result due to overly liberal policies for reassignment of resources once they are returned.
When resources are released, they can be granted to any process waiting for them in the resource
wait list. (The short-term scheduler most likely should not switch to the waiting processes but should normally continue with the releasing processes, in accord with the hysteresis principle.) However, as experience from different real-life situations shows, not all policies used for granting those resources prevent starvation. Still, increased conservatism may be one way to prevent this problem.
A refinement of this policy is to grant released resources, whenever they become available, to those waiting processes with low requirements, thereby allowing them to complete, rather than allocating the resources to the first blocked process. The hope is that the processes favored in this way will eventually release enough resources to satisfy the requirement of the first blocked process. More precisely, we might sort the resource wait list by the order in which processes block. When resources are freed, we scan this list and grant resources only to processes whose current request can be fully satisfied. Unfortunately, this first-fit policy may force the first blocked process into starvation if a continuous stream of new jobs with smaller requirements keeps arriving.
One way to modify this strategy is to allow partial allocation. As many units of the resource can
be granted to the frst process in the list as may be done safely; the rest may be allocated (if safe)
to other, later processes in the list. Even this policy can fail, because there may be situations in which no resources can safely be granted to the first blocked process while other processes nevertheless continue to run.
A further modification could be to order the resource wait list according to some safe sequence (if there are several safe sequences, any one will do). Resources are granted, either partially or fully, starting at the head of this list. However, there is no way to guarantee that any particular process will ever reach the beginning of this list. A process may remain near the end of the safe sequence and hence be left starving, never favored with the allocation of resources.
Starvation detection: The approach to detect starvation can be modeled exactly along the lines
of deadlock detection. By the time deadlock is detected, it may be too late, but it is never too late to
fx starvation. Starvation might be signaled by a process remaining on the resource wait list for too
long, measured in either units of time or units of process completion.
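One simple way to realize such a signal is to stamp each entry in the resource wait list with the time at which it was enqueued and to scan the list periodically; in the hypothetical sketch below, any process that has been waiting longer than a chosen threshold is flagged as starving, and the response (for example, banning new arrivals) is left to the resource manager's policy.

#include <time.h>
#include <stddef.h>

#define STARVE_LIMIT 30              /* seconds on the wait list before we react */

struct wait_entry {
    int    pid;
    time_t enqueued;                 /* when the process joined the wait list */
    struct wait_entry *next;
};

/* Periodic check: returns the first starving process found, or NULL.
   The policy response (e.g. banning new arrivals) is up to the caller. */
static struct wait_entry *find_starving(struct wait_entry *head)
{
    time_t now = time(NULL);
    for (struct wait_entry *e = head; e != NULL; e = e->next)
        if (now - e->enqueued > STARVE_LIMIT)
            return e;
    return NULL;
}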
Once starvation is detected, new processes may be denied resources until starving processes
have completed. Of course, processes that already have resources must be allowed to get more, since
they may appear earlier in the safe sequence. This approach certainly works, but it
must be tuned carefully. If starvation detection is too sensitive, new processes are banned too often,
with the result that there is an overall decrease in throughput. In other words, the policy becomes
too conservative. If detection is not sensitive enough, processes may starve for a long time before they are finally allowed to run.
Starvation can be avoided by using a first-come-first-served resource allocation policy. With this approach, the process that has been waiting the longest is served next. In due course, any given process eventually becomes the oldest and, in turn, gets a chance at the needed resources.
Starvation control, however, has not been extensively dealt with. Holt (1972) suggested maintain-
ing counters for each blocked process that are periodically incremented. When a counter exceeds
a critical value, the scheduler has to fnd a safe sequence and fnish jobs in that order. He also sug-
gested partial allocations, with the requirement that resources not be granted to jobs with zero holdings. The approach of banning new jobs was also introduced and considered. Another approach to handling
starvation, which allocates resources according to a safe-sequence order, is found in the Boss 2
operating system for the RC4000 computer.
Fortunately, starvation is seldom a critical problem in the working of an actual operating system,
because nonpreemptible, serially reusable resources, such as printers and tape drives, are mostly
underused, and the existence of such idle periods tends to prevent starvation.
SUMMARY
This chapter introduces the details of processes, the characteristics of processes in terms of their
different states, and their other features in order to realize already-defned functions (described in
Chapter 2) that are to be performed by the processor (process) management module. The process
concept, however, has been further refined, giving rise to a new construct: a relatively small executable unit known as the thread. The reverse trend has also been observed: the concept of a relatively large executable unit, called an object, consisting of processes and threads, has begun to evolve, and the concept of object-oriented design opened a new horizon that ultimately led to the
development of some contemporary powerful operating systems. We presented here some cases as
examples in relation to the actual implementation of these different concepts in the development of
modern operating systems.
Controlling multiple processes in a multitasking environment requires a specifc discipline real-
ized by a mechanism known as scheduling that allocates a processor or processors to each different
process at the right point in time for substantial performance improvement with optimal resource
utilization. Out of four types of scheduling discussed in detail in Chapter 2, one of these, I/O sched-
uling, will be described in Chapter 7. Of the remaining three types of scheduling, long-term and medium-term scheduling are concerned primarily with performance issues related to the degree of multiprogramming. Hence, the only scheduling that remains is short-term scheduling, also called
process (processor) scheduling and discussed here in this chapter on a single-processor system.
Since the presence of multiple processors in a system adds additional complexity while this type of
scheduling is carried out, it is convenient to frst focus on the single-processor (uniprocessor) case to
describe the fundamental approaches so that more elaborate forms of handling multiple processors
can be easily explained. This section focuses on the various criteria based on which different sched-
uling policies can be framed and then addresses different strategies for the design and implementation of the respective policies to derive corresponding scheduling algorithms.
Operating systems, while supporting multitasking, multiprocessing, and distributed processing,
often use concurrent execution among a related group of processes that allows a single application
to take advantage of parallelism between the CPU and I/O devices on a uniprocessor and among
CPUs on a multiprocessor for the sake of performance improvement. However, multiple processes
that compete for or cooperatively share resources (including information) introduce the potential
for new problems that raise a lot of issues in software implementation. Hence, the design criteria of
the OS for accommodating concurrency must address a host of these issues, including sharing and
competing for resources, synchronization of the activities of multiple processes, communication
among processes, and allocation of processor time to processes.
EXERCISES
1. Describe how processes are developed and executed in the system following its different
states.
2. What is meant by process control block? Give the approximate structure of a process con-
trol block with the different categories of information that it contains.
3. What are the events that are commonly responsible for a process switch occurring?
Enumerate the steps usually followed when a process switch occurs.
4. What are the main drawbacks that have been faced in the design of an operating system
based on the process concept? How have these drawbacks been addressed by making changes in the underlying concepts?
5. What is meant by “thread”? What are the advantages that can be obtained by using the
thread as a unit of computation?
6. What are the types of system calls that a thread should avoid using if threads are imple-
mented at the user level?
7. An OS supports both kernel-level threads and user-level threads. Justify the following
statements:
a. If a candidate for a thread is a CPU-bound computation, use a kernel-level thread if the
system contains multiple processors; otherwise, make it a user-level thread.
b. If a candidate for a thread is an I/O-bound computation, use a user-level thread if the
process containing it does not contain a kernel-level thread; otherwise, make it a kernel-
level thread.
8. A process creates eight child processes. It is required to organize the child processes into
two groups of four processes each such that processes in a group can send signals to other
processes in the group but not to any process outside the group. Implement this require-
ment using the features in UNIX.
• Process scheduling
9. State and explain the different criteria that are involved in the design strategy and design
objectives of a process scheduling mechanism and its associated algorithm.
10. Assume that the following jobs execute on a single-processor system, with the jobs arriving in the order listed here:
Process   Service Time
P1        60
P2        20
P3        10
P4        20
P5        50
a. If the system uses FCFS scheduling, create a Gantt chart to illustrate the execution of
these processes.
b. What is the turnaround time for process P3?
c. What is the average wait time for the processes?
11. Use the process load as given in the previous example, and assume a system that uses SJN
scheduling:
a. Create a Gantt chart to illustrate the execution of these processes.
b. What is the turnaround time for process P4?
c. What is the average wait time for the processes?
12. Prove that, among nonpreemptive scheduling algorithms, SJN/SPN provides the minimum
average waiting time for a set of requests that arrive at the same time instant.
13. Assume that the following jobs execute on a single-processor system that uses priority scheduling, with the jobs arriving in the order listed here, where a smaller integer means a higher priority:
Calculate T, M, and P for each process under the following policies: FCFS, SPN, PSPN,
HPRN, RR with t = 1, RR with t = 5, and SRR with b/a = 0.5 and t = 1. Assume that if
events are scheduled to happen at exactly the same time, new arrivals precede termina-
tions, which precede quantum expirations.
18. Some operating systems are used in environments where processes must get guaranteed
service. Deadline scheduling means that each process specifes how much service time
it needs and by what real time it must be fnished. Design a preemptive algorithm that
services such processes. There will be occasions when deadlines cannot be met; try to dis-
cover these situations as early as possible (before starting a process, if it cannot be fnished
in time).
19. A multiprogramming time-sharing operating system uses priority-based scheduling for
time-critical processes and round-robin scheduling for interactive user processes. At cer-
tain times, the hardware is upgraded by replacing the CPU with a functionally equivalent
model that is twice as fast. Discuss the changes that different classes of users will experi-
ence. Do some parameters of the operating system need to be changed? If so, which ones
and how? Explain the expected change in the system behavior as a consequence of such
changes.
20. A group of processes Gk in a system is using fair-share scheduling. When a process P1
from Gk is selected for scheduling, it is said that “P1 is a selection from Gk”. Show that if
processes do not perform I/O operations, two consecutive selections from Gk cannot be for the same process.
21. State the distinct advantages that can be obtained from a multilevel feedback queuing
scheduler. Which type of process is generally favored by this scheduler: a CPU-bound
process or an I/O-bound process? Explain briefy why.
• Interprocess Synchronization
22. The processes P0 and P1 share variable V2, processes P1 and P2 share variable V0, and
processes P2 and P3 share variable V1. Show how processes can use enable interrupt and
disable interrupt to coordinate access to the variables V0, V1, and V2 so that the critical
section problem does not arise.
23. How is mutual exclusion implemented using general semaphores? Explain the drawbacks
and limitations of semaphores in general. State the properties and characteristics that a
semaphore in general exhibits.
24. “A general semaphore is superfuous since it can be implemented with a binary semaphore
or semaphores”—explain.
25. It is sometimes found that a computer has both a TSL instruction and another synchroniza-
tion primitive, such as semaphores and monitors, in use. These two types play a different
role and do not compete with each other. Explain this with reasons.
26. Two processes p1 and p2 have been designed so that p2 writes a stream of bytes produced by
p1. Write a skeleton of procedures executed by p1 and p2 to illustrate how they synchronize
with one another using P and V. (Hint: consult the producer/consumer problem)
27. Semaphores can be realized in a programming-language construct called critical region.
Discuss the mechanism by which it can be realized. State the limitations that you may
encounter.
28. An inventory manager issues the following instructions to the store manager in regard to
a particular item: “Do not purchase the item if the number of items existing in the store
exceeds n, and hold any requisition until the number of items existing in the store is large
enough to permit the issue of the item”. Using a particular item, implement these instruc-
tions with the help of a monitor.
29. You have an operating system that provides semaphores. Implement a message system.
Write the procedures for sending and receiving messages.
30. Show that, using messages, an interrupt signaling mechanism can be achieved.
31. Show that monitors and semaphores have equivalent functionality. Hence, show that a
monitor and message can be implemented using semaphores. This will demonstrate that a
monitor can be used anyplace a semaphore can be used.
32. Solve the producer/consumer problem using monitors instead of semaphores.
33. Suppose we have a message-passing mechanism using mailboxes. When sending to a full
mailbox or trying to receive from an empty one, a process does not block. Instead it is
provided an error code. The process in question responds to the error code by just trying
again, over and over, until it succeeds. Does this scheme lead to race conditions?
34. What sequence of SEND and (blocking) RECEIVE operations should be executed by a
process that wants to receive a message from either mailbox M1 or mailbox M2? Provide
a solution for each of the following cases:
a. The receiving process must not be blocked at an empty mailbox if there is at least one
message in the other mailbox. The solution can use only the two mailboxes and a single
receiving process.
b. The receiving process can be suspended (blocked) only when there are no messages in
either mailbox, and no form of busy waiting is allowed.
35. Discuss the relative time and space complexities of the individual implementations of the
message facility. Propose an approach that you consider to be the best trade-off in terms of
versatility versus performance.
36. “Deadlock is a global condition rather than a local one”. Give your comments. What are
the main resources that are held responsible for the occurrence of a deadlock?
37. State and explain the conditions that must be present for a deadlock to occur.
38. “A deadlock can occur even with a single process”. Is it possible? If so, justify the statement
with an appropriate example.
39. What are the merits and drawbacks of the recovery approach when deadlock has already
occurred and been detected?
40. Discuss the merits and shortcomings of the deadlock avoidance strategy.
41. An OS uses a simple strategy to deal with deadlock situations. When it finds that a set of
processes is deadlocked, it aborts all of them and restarts them immediately. What are the
conditions under which the deadlock will not recur?
42. Compare and contrast the following resource allocation policies:
a. All resource requests together.
b. Allocation using resource ranking.
c. Allocation using the banker’s algorithm in the light of (i) resource idling and (ii) over-
head of the resource allocation algorithm.
43. Three processes share four resources that can be reserved and released only one at a time.
Each process needs a maximum of two units. Show that a deadlock cannot occur.
44. When resource ranking is used as a deadlock prevention policy, a process is permitted
to request a unit of resource class Rx only if rank(Rx) > rank(Ry) for every resource class Ry
whose resources are allocated to it. Explain whether deadlocks can arise if the condition is
changed to rank(Rx) ≥ rank(Ry).
45. A system is composed of four processes [P1, P2, P3, P4] and three types of serially reusable
resources [R1, R2, R3]. The total number of existing resource units is C = (3, 2, 2).
a. Process P1 holds 1 unit of R1 and requests 1 unit of R2.
b. Process P2 holds 2 units of R2 and requests 1 unit each of R1 and R3.
c. Process P3 holds 1 unit of R1 and requests 1 unit of R2.
d. Process P4 holds 2 units of R3 and requests 1 unit of R1.
Determine whether this system is deadlocked.
46. Can a system be in a state that is neither deadlocked nor safe? If so, give an example. If not,
prove that all states are either deadlocked or safe.
47. Consider a situation in which there are several resource classes, and for each class there is
a safe sequence, yet the overall situation is still unsafe. Justify that such a situation can arise.
48. Can a process be allowed to request multiple resources simultaneously in a system where
deadlocks are avoided? Justify why or why not.
49. What are the effects of starvation that may affect the overall performance of a system?
Discuss the mechanism by which starvation can be detected and then avoided. Explain
how starvation is avoided in the UNIX and Windows operating systems.
More questions for this chapter are given on the Support Material at www.routledge.com/
9781032467238
5 Memory Management
Learning Objectives
• To define the key characteristics of memory systems and basic requirements of primary
memory.
• To explain the significance of the memory hierarchy in reducing access time.
• To describe the basic requirements of memory management considering the sharing and
separation of memory along with needed protection.
• To explain the required address translation for both static and dynamic relocation.
• To discuss the implementation and impact of memory swapping.
• To define the functions and responsibilities of memory management.
• To mention the different memory management schemes being used along with their com-
parison parameters.
• To describe the contiguous memory allocation schemes including different methods of
memory partition (both static and dynamic), and various techniques used in their manage-
ment with each of their respective merits and drawbacks.
• To implement noncontiguous memory allocation schemes using paged memory manage-
ment along with its related issues and respective merits and drawbacks.
• To describe the various segmented memory management schemes and their related issues,
including the support needed from the underlying hardware.
• To illustrate different kernel memory allocation schemes along with real-life implementa-
tions as carried out in UNIX and Solaris.
• To demonstrate the implementation of virtual memory with paging and its different aspects
with related issues, along with its actual implementation in VAX (DEC), SUN SPARC, and
Motorola systems.
• To illustrate the significance of the translation lookaside buffer (TLB) and its different
aspects in the performance improvement of paged memory management.
• To demonstrate segmentation and segmentation with paging in virtual memory and also its
actual implementation in real-life in the Intel Pentium.
• To examine the various design issues that appear in the management of virtual memory
and subsequently their impacts on its overall performance.
• To describe in brief the real-life implementations of memory management carried out in
UNIX, Linux, Windows, and Solaris separately.
• To explain the objectives, principles, and various design issues related to cache memory.
5.1 INTRODUCTION
The one single development that put computers on their own feet was the invention of a reliable form
of memory: the core memory. A journey through the evolution of computers convincingly estab-
lishes the fact that, due to important breakthroughs in technological advancement in the electronic
industry, the size and the speed of memory have constantly increased and greatly paved the
way for succeeding generations of computers to emerge. Still, at no point in time in the past and even
today, with sophisticated technology, is there ever enough main memory available to satisfy cur-
rent needs. In addition, as technology constantly advances, the speed of CPUs increases at a much
faster rate than that of memory, causing a continuous increase in the speed disparity between CPU
and memory, thereby adversely affecting the performance of the computer system as a whole. To
negotiate such situations of space scarcity and speed disparity, computer memory has come to offer
perhaps the widest range of type, technology, organization, performance, and cost so as to
keep the cost/performance of a computer system within an affordable limit. A typical computer
system is thus equipped with a diverse spectrum of memory subsystems maintaining a well-defined
hierarchy; some are internal to the system to hold information that is directly accessible and refer-
enced by the CPU, called primary memory or physical memory, and some are external to the system
to store information, called secondary memory, accessed by the CPU via an I/O module. When data
and programs are referenced by the CPU for execution, they are loaded in primary memory from
this secondary memory; otherwise they remain saved in secondary memory (on a storage device).
Discussions of secondary memory are outside the domain of this chapter and will be provided in
Chapter 6, “Device Management”.
While primary memory has faster access times than secondary memory, it is volatile, and sec-
ondary memory, on the other hand, is comparatively slower in operation but is a long-term persis-
tent one that is held in storage devices, such as disk drives, tape drives, and CD-ROM. However,
all processes (programs) must be resident in a certain portion of primary (main) memory before
being activated. In other words, anything present in primary memory is considered active. The pri-
mary memory is therefore part of the executable memory (and is also sometimes called executable
memory), since the CPU can fetch information only from this memory. Information can be loaded
into CPU registers from primary memory or stored from these registers into the primary memory.
As the size of main memory in any computer system, whether large or small, is limited, a chal-
lenge is always faced at the time of application program development with respect to the size of the
programs to be developed, so that these programs, or parts of them along with all associated infor-
mation, can be accommodated in the available primary memory when they are used by the
CPU; this information is then written back to secondary memory soon after it has been used
or updated. If this challenge could be met, the execution time of a process could be reduced
substantially. From now on, we will always refer to primary memory or main memory simply by the
term “memory” (or “core memory”); otherwise we will specifically use the term “secondary memory”.
Memory in a uniprogramming system is basically divided into two parts: one part permanently
holds the resident portion (kernel) of the operating system, while the other part is used by the cur-
rently active programs. In a multiprogramming/multitasking environment, this division of memory
is even more complex; the user part (non-OS part) of memory is once again subdivided here to
accommodate multiple processes (programs). The task of this subdivision and many other responsi-
bilities, such as allocation of fnite sizes of memory to requesting processes, assistance in managing
the sharing of memory, minimizing the memory access time, and others, are carried out statically/
dynamically by the operating system, and this activity is known as memory management. In the
hierarchical design of a layered operating system, as described in Chapter 3, memory management
belongs to Level 2. The supporting operations of secondary memory lie in the basic I/O system
(devices) located at Level 4 in the hierarchical design of the OS, which shuttles portions of address spaces
between primary and secondary memory while responding to requests issued by the memory man-
ager; those operations are examined in detail in the chapter on device management.
Design of effective memory management is therefore an important issue, since the overall
resource utilization and the other performance criteria of a computer system are greatly influenced
and critically affected by the performance of the memory management module, not only in terms of
its effectiveness in handling merely memory but also as a consequence of its impact on and interac-
tion with the other resource managers.
In this chapter, we will present the principles of managing main memory and then investigate
different forms of memory management schemes, ranging from very simple to highly sophisticated.
The ultimate objective is, however, to provide needed memory spaces to each individual program
for its execution. We will start with the simplest possible memory management schemes and then
gradually proceed to more and more elaborate and advanced
ones. An attempt is made to explore the salient features of these different types of memory manage-
ment schemes: their merits, their drawbacks, their limitations, and above all their implementations
in contemporary representative computer systems whenever possible.
More details on this topic are given on the Support Material at www.routledge.com/9781032467238.
evolution of newer electronic technology, but the speed and size ratios still tend to remain relatively
unchanged.
More details on this topic are given on the Support Material at www.routledge.com/9781032467238.
fulfill these crucial requirements, at least to an acceptable extent for the sake of interprocess coop-
eration as well as to minimize the waste in memory space by avoiding redundancy.
form a single module (load module) that can then straightaway be loaded into memory and
directly executed. While the single load module is created as output, each location-sensi-
tive address within each object module must then be changed (relocated) from its symbolic
addresses to a reference to the starting address location of the created load module. This
form of relocation refers to static relocation performed by a relocating linker, when dif-
ferent object modules are combined by the linker to create a load module. At loading time,
the created load module (executable program) is now loaded by an absolute loader into
memory starting strictly at the location pre-defined and specified in the header of the load
module with no address translation (relocation) at the time of loading. If the memory area
specified in the header of the load module is not free, then the program has to wait for the
specifc memory slot to become free, although there are other slots available in memory.
A relocating loader, however, removes this drawback by loading an executable module
to any available slot and then translates all location-sensitive information within the mod-
ule correctly to bind to the actual physical location in which this module will be loaded.
Since software relocation involves considerable space and time overhead, static
relocation is commonly not favored in practice; systems usually settle for only static binding
of modules.
5.5.3.3 Advantages
The notable feature of this scheme is that the relocation process is free from any additional memory
management strategies and is simply a hardware-only implicit base-addressing mechanism applied only
at runtime (the final stage). Moreover, here the logical (virtual) address space in which the processor-
generated addresses prior to relocation are logical (virtual) addresses is clearly separated from the
physical address space containing the corresponding physical addresses generated by the mapping
hardware used to reference the physical memory. This attribute nowadays facilitates designing
a flexible form of an on-chip (earlier an optional off-chip) advanced memory-management unit
(MMU) within the microprocessor CPU for the purpose of generating physical addresses along
with other supports. Another distinct advantage is that it gives the OS absolute freedom by permit-
ting it to freely move a partially executed program from one area of memory into another, even in
the middle of execution (during runtime) with no problem accessing information correctly in the
newly allotted space. This feature is also very useful, particularly to support swapping programs in/
out of memory at any point in time. On the other hand, this approach demands extra hardware to pro-
vide one or more base registers, and some added overhead is also involved in the additional computa-
tions required in the relocation process. Contemporary architectures supporting implicit base addressing,
however, address these issues: they provide a dedicated adder to allow address calculation to
proceed in parallel with other processor operations, and with this overlapping, they keep the
impact on the effective memory bandwidth within an affordable limit.
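The base–limit translation just described can be summarized in a short sketch. The following C fragment is only a minimal model, under assumed names (base, limit, translate), of what the mapping hardware does for every reference; it is not the interface of any real MMU, and the trap is simulated by terminating the program.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-process relocation registers loaded by the OS at
   dispatch time; a real MMU performs the same check and addition in
   hardware for every memory reference. */
typedef struct {
    uint32_t base;   /* physical start address of the allocated area */
    uint32_t limit;  /* size of the allocated area in bytes          */
} relocation_regs;

/* Translate a CPU-generated logical address into a physical address. */
static uint32_t translate(const relocation_regs *r, uint32_t logical)
{
    if (logical >= r->limit) {                 /* protection check         */
        fprintf(stderr, "addressing error: %u exceeds limit %u\n",
                (unsigned)logical, (unsigned)r->limit);
        exit(EXIT_FAILURE);                    /* stands in for a trap     */
    }
    return r->base + logical;                  /* implicit base addressing */
}

int main(void)
{
    relocation_regs r = { 0x40000, 0x10000 };  /* values chosen arbitrarily */
    printf("logical 0x0100 -> physical 0x%X\n",
           (unsigned)translate(&r, 0x0100));
    return 0;
}

Moving a partially executed program then amounts to copying its image and reloading base, with no change whatsoever to the program's logical addresses.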
While selecting a victim for swapping out, the swapper takes many relevant factors into account,
such as the current status, memory residence time, and size of each resident process, consideration of
age to avoid thrashing, and similar other vital factors that could influence the performance of the
system as a whole. The selection of a process to be swapped back in is usually based on the
fulfillment of certain criteria: the priority of the process; the amount of time it has spent in second-
ary storage (aging), which, if ignored, may invite the problem of starvation; the availability of the
resource for which it was swapped out; and finally the fulfillment of a minimum disk-resident time
requirement measured from the instant it was swapped out. This last criterion is specifically required
in order to exert control over forthcoming thrashing.
• Swap file: When a partially executed process is swapped out, its runtime process image,
along with the runtime state consisting of the contents of the active processor registers
as well as its other data and stack locations, is temporarily saved in a swap file. One of
two types of swap files is commonly used for this purpose and must be created, reserved,
and allocated statically/dynamically before or at the time of process creation. These are:
• A system-wide single swap file on a special disk (usually the system disk) for all active
processes, created at the time of system generation.
• A dedicated swap file for each active process, created either statically at the
time of program preparation or dynamically at the time of process creation.
However, each of these two approaches has its own merits and also certain drawbacks.
• Policy decisions: Swapping of a process is sometimes urgently needed and is thus favored,
even after accepting that the cost of swapping is appreciable. That is why the operating
system often enforces certain rules to designate a process as not swappable if it belongs to a
given class of privileged processes and users. All other processes, however, may be treated
as swappable by default. Sometimes, memory management (the operating system) decides
the policy and implements it on its own when it finds that a relatively large process in mem-
ory remains blocked for a considerable duration of time, so that other waiting processes
that deserve to run can be placed there for the sake of performance improvement.
• Relocation: In systems that support swapping, placement of a swapped-out process back in
main memory, when it is once again reloaded as a result of swap-in, is carried out using
dynamic allocation (already explained in the last section) wherever memory of a suitable
size is found. Of course, this approach has comparatively high overhead and also
requires some sort of additional hardware facilities.
• Determining allocation policy for memory, that is, deciding to which process it should go,
how much, when, and where. If primary memory is to be shared by one or more processes
concurrently, then memory management must determine which process’s request needs to
be served.
• The allocation policy adopted by memory management along with the memory organi-
zation scheme being employed has a direct impact on the overall utilization of system
resources. The overall system performance is also greatly affected by the way in which
the memory-management policy and the job-scheduling policy infuence each other while
allocating memory to different processes. The ultimate objective of memory management
and job scheduling is to minimize the amount of memory wastage and maximize the num-
ber of processes that can be accommodated in the limited available memory space.
• Allocation technique—once it is decided to allocate memory, the specifc locations must
be selected and allocation information updated.
• Deallocation technique and policy—handling the deallocation (reclamation) of memory. A
process may explicitly release previously allocated memory, or memory management may
unilaterally reclaim the memory based on a deallocation policy. After deallocation, status
information must be updated.
• Handling the virtual memory mechanism—keeping track of virtual memory allocation
and also interacting with mass storage (device) handlers; to manage swapping on pre-
defined policies between main memory and disk when main memory is not large enough
to hold all the processes.
Apart from all these, memory management provides mechanisms that allow information to
migrate up and down the memory hierarchy with the necessary binding of addresses, which is an
essential requirement for information movement. It also employs numerous strategies to distribute
a limited size of memory to many processes to load only once, thereby enhancing memory utiliza-
tion. While distributing memory, it also ensures the protection (integrity) of each active process
when many such processes coexist in memory. To address all these issues, memory management
must fulfll certain fundamental requirements in order to satisfy the basic demands of the operating
environment.
Each of these strategies, when implemented, exhibits some merits as well as certain drawbacks.
Therefore, a comparison of these schemes involves analysis of each scheme informally with respect
to certain essential parameters, such as:
• Wasted memory
• Overhead in memory access
• Time complexity
A brief discussion of this topic with a figure is given on the Support Material at www.routledge.
com/9781032467238.
operating systems. This approach basically implies dividing the available physical memory into
several partitions, which may be of the same or different sizes, each of which may be allocated to
different processes. While allocating a partition of the memory to a process, many different strate-
gies could be taken that depend on when and how partitions are to be created and modifed. These
strategies, in general, have different targets to achieve, with their own merits and drawbacks. This
approach is again differentiated mainly in two ways: one divides primary memory into a num-
ber of fixed partitions at the time the operating system is configured before use, and the other
keeps the entire memory as it is and dynamically partitions it into variable-sized blocks according
to the demand of the programs during their execution. In this section, we will discuss fixed parti-
tioning with static allocation and variable partitioning with dynamic allocation.
• Allocation strategy: A fixed-partition system requires that a process address space size
(known from the process image) be matched with a partition of adequate size. Out of the
available partitions, selection of a particular partition for allocation to a requesting process
can be made in several ways; two common approaches are first fit and best fit. The
first-fit approach selects the first free partition large enough to fit the requesting process.
The best-fit approach, on the other hand, selects the smallest free partition
that meets the requirements of the requesting process. Both algorithms
need to search the PDT to find a free partition of adequate size. However, while first
fit terminates upon finding the first such partition, best fit continues through the
entire table, examining all qualifying PDT entries to find the most appropriate (tightest) one
(a sketch of these two searches over a PDT is given after this list).
As a result, first fit attempts to speed up the allocation, accepting possibly costly memory
wastage within the partition, whereas best fit aims to optimize memory utilization, sacrificing
execution speed. However, the best-fit algorithm can be made profitable if the free
partitions in the PDT are kept sorted by size; the search can then quickly find a suit-
able partition fitting the requesting process, making the allocation
much faster. It is to be noted that the best-fit algorithm discriminates against small jobs as
being unworthy of occupying a whole partition in order to attain better memory utilization,
whereas usually it is desirable to give the smallest jobs (assumed to be interactive jobs) the
best service, not the worst.
• Allocation method: The job scheduler chooses one job from the job queue for execution
in response to the request issued either from the user end or due to the availability of one or
more free partitions reported by the memory manager. In some situations, there may be a
few free partitions, but none is found to accommodate the incoming job; the job in question
will then have to wait until such a partition is available. Another job fitting the available
partition will then be taken from the job queue for execution in order to keep the memory
utilization high, even disobeying the ordering of process activations intended by the sched-
uling algorithm that, in turn, may affect the performance of the system as a whole. Another
situation may happen when a high-priority job is selected for execution but no matching
partition is available: the memory manager then decides to swap out one suitable process
from memory to make room for this incoming job, even accepting the additional overhead,
but its justification should be carefully decided beforehand. It is interesting to observe that
although memory management and processor scheduling reside in separate domains of the
operating system with different types of responsibilities and targets, operation of the one
may often affect and influence the normal operation of the other when static partitioning of
memory is employed. However, the actions of memory management should be coordinated
with the operation of processor scheduling in such a way as to extract the highest through-
put while handling an environment consisting of conflicting situations.
• Allocation Schemes: Although a number of allocation schemes are available for this kind
of system, two main approaches are common:
1. Fixed memory partition with a separate input job queue for each partition.
2. Fixed memory partition with a single input job queue for all the partitions.
Each of these approaches has precise merits in some situations and also specifc drawbacks in other
situations.
• Protection: Different partitions containing user programs and the operating system should
be protected from one another to prevent any kind of damage that may be caused by acci-
dental overwrites or intentional malicious encroachment. Adequate protection mechanisms
are thus required that can be realized in many different ways, described in the last section.
• Swapping: Swapping is carried out in this system mainly to negotiate an emergency, such
as a high-priority job that must be immediately executed, or in situations when a job is
waiting and idle for resources needed for its execution, thereby preventing other intended
jobs from entering. Swapping-in is also done for an already swapped-out job in order to
increase the ratio of ready to resident processes and thereby improve the overall perfor-
mance of the system. There are also other situations when swapping is urgently needed.
The mechanisms required to implement swapping were discussed in a previous section.
• Fragmentation: In a fixed-partition system, allocation of a process in a partition of ade-
quate size causes an amount of memory space to remain unused internally within the
partition when the process is loaded. This phenomenon is called internal fragmentation,
or sometimes internal waste. The extent to which internal fragmentation causes memory
wastage in a given system varies depending on several factors, such as the number of
partitions, the size of each individual partition, frequency of execution of processes of a
specific size, and average process size and variance. This waste also tends to increase due
to the provision of one or two large partitions, usually required for large processes, but they
mostly arrive infrequently, thereby causing these partitions to be mostly underutilized or
poorly utilized. However, the sum of all such internal fragmentations that occur in each
partition sometimes even exceeds the size of a specific partition. Since internal fragmenta-
tion in a fixed-partition system is inevitable and cannot be avoided, an efficient memory
management strategy would thus always attempt to keep this internal fragmentation to a
minimum affordable limit, of course, with no compromise in any way with the overall
performance of the system.
• Conclusion: Fixed-partition memory management is one of the simplest possible ways to
realize multiprogramming with modest hardware support and is suitable for static environ-
ments where the workload can be ascertained beforehand. But the negative impact of inter-
nal fragmentation is one of its major drawbacks. It is equally disadvantageous in systems in
which the memory requirement of the job is not known ahead of time. Moreover, the size of
the executable program itself faces severe restrictions imposed by partition size. In addition,
programs that grow dynamically during runtime may sometimes fnd this system unsuit-
able due to nonavailability of needed space in the partition thus allocated, and no operating
system support is available at that time to negotiate this situation. Another pitfall of this sys-
tem is that fixing the number of partitions limits the degree of multiprogramming,
which may have an adverse effect on the effective working of short-term (process) schedul-
ing and may create a negative impact on the overall performance of the system. With the use
of swapping mechanisms, this situation may be overcome by increasing the ratio of ready to
resident processes, but that can only be achieved at the cost of additional I/O overhead. Due
to all these issues and others, timesharing systems as a whole required operating sys-
tem design to move away from fixed-partition strategies and towards the handling of
dynamic environments that could use memory space in a better way.
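As indicated in the allocation-strategy discussion above, first fit and best fit differ only in how they scan the PDT. The following C sketch assumes a hypothetical PDT layout (base, size, free); it is illustrative only and does not reflect the table format of any particular operating system.

#include <stddef.h>

/* Assumed layout of a partition description table (PDT) entry; the
   field names are illustrative, not the format of any particular OS. */
struct pdt_entry {
    size_t base;     /* start address of the partition               */
    size_t size;     /* fixed size of the partition                  */
    int    free;     /* nonzero if the partition is currently free   */
};

/* First fit: return the index of the first free partition large
   enough for the request, or -1 if none qualifies.                  */
int first_fit(const struct pdt_entry *pdt, int n, size_t request)
{
    for (int i = 0; i < n; i++)
        if (pdt[i].free && pdt[i].size >= request)
            return i;                      /* stop at the first match */
    return -1;
}

/* Best fit: examine every entry and return the tightest free
   partition that still satisfies the request, or -1 if none.        */
int best_fit(const struct pdt_entry *pdt, int n, size_t request)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (pdt[i].free && pdt[i].size >= request &&
            (best < 0 || pdt[i].size < pdt[best].size))
            best = i;
    return best;
}

Keeping the free entries sorted by size would let best_fit stop at the first adequate entry, which is exactly the optimization mentioned in the allocation-strategy discussion.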
Still, fixed-partition memory management was widely used in batch multiprogramming systems
in the early days, particularly in OS/MFT (multiprogramming with a fixed number of tasks), a
variant of OS/360, the versatile operating system used on large IBM mainframes for many years;
its companion variant OS/MVT supported a variable number of tasks using dynamically sized
regions. (MULTICS, a separate development of the same era, later strongly influenced the design
of today's UNIX.)
More details on this topic with fgures are given on the Support Material at www.routledge.com/
9781032467238.
5.8.1.2.2 Overlays
An overlay mechanism used in software development is a technique by which a program larger than
the size of the available small user area (partition) can be run with almost no restrictions in relation
to the size of the offered area, without getting much assistance from the operating system in this
regard. In the overlay technique, a user program can be subdivided by the developer into a number
of modules, blocks, or components. Each such component is an entity in a program that consists
of a group of logically related items, and each could ft in the available memory. Out of these com-
ponents, there is a main component (root segment) and one or more fixed-size components known
as overlays. The root segment is always kept resident in the main memory for the entire duration
of program execution. The overlays are kept stored on a secondary storage device (with extensions
either .ovl or .ovr) and are loaded into memory as and when needed. Overlay 0 would start running
first; when it was done, it would call another overlay. Some overlay systems were highly complex,
allowing multiple overlays to reside in memory at any point in time.
In most automatic overlay systems, the developer must explicitly state the overlay structure in
advance. There are many binders available that are capable of processing and allocating overlay
structure. An appropriate module loader is required to load the various components (procedures)
of the overlay structure as they are needed. The portion of the loader that actually intercepts the
calls and loads the necessary procedure is called the overlay supervisor or simply the flipper. The
root component essentially acts as the overlay supervisor, intercepting the calls of the executing resident
overlay components during runtime. Whenever an interprocedure reference is made, control is
transferred to the overlay supervisor, which loads the target procedure, if necessary.
The hardware does not support overlays. Checking every reference would be unacceptably slow
in software. Therefore, only procedure calls are allowed to invoke new overlays. Procedure invoca-
tion and return are, however, more expensive than they usually would be, because not only must
the status of the destination overlay be examined, but it may also have to be brought in from secondary
storage. However, software such as translators, like compilers and assemblers, can be of great
assistance in this regard.
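A rough idea of what the overlay supervisor (flipper) does on an inter-overlay call is sketched below in C. The overlay table, file names, and loader stand-in are hypothetical assumptions made for illustration; a real binder and module loader would generate and drive equivalent structures automatically.

#include <stdio.h>

/* Hypothetical overlay table; file names, entry points, and the loader
   stand-in below are assumptions for this sketch, not a binder format. */
struct overlay {
    const char *file;          /* image kept on secondary storage       */
    void      (*entry)(void);  /* entry point once loaded               */
    int         resident;      /* currently occupying the overlay area? */
};

static void pass1(void) { puts("running overlay 0 (pass1)"); }
static void pass2(void) { puts("running overlay 1 (pass2)"); }

static struct overlay ovl_table[] = {
    { "pass1.ovl", pass1, 0 },
    { "pass2.ovl", pass2, 0 },
};
static int current = -1;       /* which overlay now occupies the area   */

/* The overlay supervisor ("flipper"): every inter-overlay procedure
   call is routed through this routine instead of being a direct call.  */
static void overlay_call(int target)
{
    if (!ovl_table[target].resident) {
        if (current >= 0)
            ovl_table[current].resident = 0;   /* evict the occupant    */
        printf("loading %s into the overlay area\n", ovl_table[target].file);
        ovl_table[target].resident = 1;        /* stands in for disk I/O */
        current = target;
    }
    ovl_table[target].entry();                 /* transfer control      */
}

int main(void)
{
    overlay_call(0);   /* the root segment invokes overlay 0            */
    overlay_call(1);   /* ... which in turn calls overlay 1             */
    return 0;
}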
The overlay concept itself opened a wide spectrum of possibilities. To run the program, the first
overlay was brought in and ran for a while. When it finished, or even during its execution, it could
read in the next overlay by calling it with the help of the overlay supervisor, and so on. The super-
visor itself undertakes the task of necessary input–output to remove the overlay or overlays that
occupy the place the desired one needs to be loaded and then bring the required overlay into that
position. To implement the overlay mechanism, the programmer had a lot of responsibilities, such
as breaking the program into overlays, deciding where in the secondary memory each overlay was
to be kept, arranging for the transport of overlays between main memory and secondary memory,
and in general managing the entire overlay process without any assistance from the hardware or
operating system. If, by some means, the entire burden of the overlay mechanism and its related
management responsibilities could be shifted onto (entrusted to) the operating system, relieving
the programmer of these hazardous bookkeeping activities, we would arrive almost at the doorstep
of an innovative concept now called paging. Paging is discussed in detail in a later
section in this chapter.
An overlay mechanism is essentially a more refined form of swapping that swaps
only portions of a job's address space; it is called overlay management. Overlays work well only
with applications where the execution of the program goes through well-defined phases, each of
which requires different program units. Thrashing can result from the inappropriate use of overlays.
Overlays, normally used in conjunction with single contiguous, partitioned, or relocatable parti-
tioned memory management, provide essentially an approximation of segmentation but without
the segment address mapping hardware. Segmentation is discussed in a later section in this chapter.
In spite of being widely used for many years with several merits and distinct advantages, the over-
lay technique is critically constrained due to the involvement of much work in connection with over-
lay management. To get around this, a group of researchers in Manchester, England, in 1961 proposed a
method for performing the overlay process automatically with no intimation even to the programmer
that it was happening (Fotheringham, 1961). This method is now called virtual memory, in which all
management responsibility is entrusted to the operating system, releasing the programmer from a lot
of annoying bookkeeping tasks. It was first used during the 1960s, and by the last part of the 1960s
and early 1970s, virtual memory had become available on most computers, including those for com-
mercial use. Nowadays, even microprocessor-based small computer systems have highly sophisti-
cated virtual memory systems. Virtual memory is discussed in detail in later sections in this chapter.
More details on this topic with a figure are given on the Support Material at www.routledge.
com/9781032467238.
variable, the loader is called and the segment containing the external references will only then be
loaded and linked to the program at the point where it is first called. This type of function is usually
called dynamic linking, dynamic loading, or load-on-call (LOCAL).
Dynamic linking and subsequent loading are powerful tools that provide a wide range of possibilities
concerning use, sharing, and updating of library modules. Modules of a program that are not invoked
during the execution of the program need not be loaded and linked to it, thereby offering substantial sav-
ings of both time and memory space. Moreover, if a module referenced by a program has already been
linked to another executing program, the same copy can be linked to this program as well. This means
that dynamic linking often allows several executing programs to share one copy of a subroutine or library
procedure, resulting in considerable savings both in terms of time and memory space. Runtime support
routines for the high-level language C could be stored in a dynamic link library (files with extension .dll).
A single copy of the routines in this library could be loaded into memory. All C programs currently in
execution could then be linked to this one copy instead of linking a separate copy in each object program.
Dynamic linking also provides an interesting advantage: when a library of modules is updated, any
program that invokes a new module starts using the new version of the module automatically. It provides
another means to conserve memory by overwriting a module existing in memory with a new module.
This idea has been subsequently exploited in virtual memory, discussed in the following section.
In an object-oriented system, dynamic linking is often used to refer to a software object together with its allied
methods. Moreover, the implementation of the object can be changed at any time without affecting
the program that makes use of the object. Dynamic linking also allows one object to be shared by
several executing programs in the way already explained.
Dynamic linking is accomplished by a dynamic loader (a constituent of the OS services) which
loads and links a called routine and then transfers the control to the called routine for its execution.
After completion (or termination) of the execution, the control is once again sent back to the OS for
subsequent necessary actions. The called routine, however, may still be in memory if storage
space permits, so a second call to it may not require another load operation. Control may now
simply be passed from the dynamic loader to the called routine. When dynamic linking is used, the
association of an actual address with the symbolic name of the called routine is not made until the
call statement is executed. Another way of describing this is to say that the binding of the name to
an actual address is delayed from load time until execution time. This delayed binding offers greater
flexibility as well as a substantial reduction in storage space usage, but it requires more overhead since
the operating system must intervene in the calling process. It can be inferred that this delayed bind-
ing gives more capabilities at the expense of a relatively small additional cost.
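On POSIX systems, this kind of delayed, load-on-call binding can be exercised explicitly through the dlopen interface, as the following minimal C sketch shows; the choice of library (libm) and symbol (cos) is only an example, not a prescription.

#include <dlfcn.h>
#include <stdio.h>

/* Minimal sketch of explicit dynamic linking on a POSIX system: the
   library is located, loaded, and bound only when this code runs,
   not at program load time (compile with -ldl where required).      */
int main(void)
{
    void *handle = dlopen("libm.so.6", RTLD_LAZY);     /* load on demand */
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* The symbolic name "cos" is bound to an actual address only now. */
    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    if (cosine)
        printf("cos(0.0) = %f\n", cosine(0.0));

    dlclose(handle);                        /* release the reference    */
    return 0;
}

The address of cos is bound only when dlsym executes, which is exactly the delayed binding described above.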
More details on this topic with figures are given on the Support Material at www.routledge.
com/9781032467238.
memory from secondary storage, and it is then linked to the calling program at the point where the
call occurs. When another calling program needs to use the same shared program, the kernel first
checks whether a copy of the shared program is already available in memory. If so, the kernel then
links the existing shared program to this new calling program. Thus, there exists only one copy
of the shared program in main memory, even when it is shared by more than one program. While
dynamic sharing conserves memory space, it is at the cost of complexity in its implementation.
Here, the kernel has to always keep track of shared program(s) existing in memory and perform the
needed dynamic linking. The program being shared also has to be coded in a different way: it is written
as a reentrant program to avoid mutual interference among the programs that share it.
Reentrant programs: When a shared program is dynamically shared by other sharing programs,
the data created by one executing sharing program and embedded in the shared program should
be kept protected by the shared program from any interference that may be caused by any of the
other sharing programs while in execution. This is accomplished by allocating a separate data
area dynamically for each executing sharing program and holding its address in a CPU register.
The contents of these registers are saved as part of the CPU state when a process switch occurs
and once again reloaded into the CPU registers when the program is brought back into main
memory. This arrangement and related actions ensure that different invocations of the shared
program by individual sharing programs will not interfere with one another’s data.
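The idea of reentrancy can be illustrated with a small C sketch; the two routines below are hypothetical examples written for this purpose, not taken from any real shared library.

#include <stddef.h>

/* Non-reentrant: the static buffer is a single shared data area, so
   two programs sharing this routine could interfere with each other. */
char *to_upper_shared(const char *s)
{
    static char buf[64];                     /* one copy for all callers */
    size_t i;
    for (i = 0; s[i] != '\0' && i < sizeof buf - 1; i++)
        buf[i] = (char)(s[i] >= 'a' && s[i] <= 'z' ? s[i] - 32 : s[i]);
    buf[i] = '\0';
    return buf;
}

/* Reentrant: each invocation works on a data area supplied by (and
   private to) its caller, mirroring the per-program data areas that
   a shared, reentrant system program maintains.                      */
void to_upper_reentrant(const char *s, char *out, size_t outsz)
{
    size_t i;
    if (outsz == 0)
        return;                              /* nothing can be written  */
    for (i = 0; s[i] != '\0' && i + 1 < outsz; i++)
        out[i] = (char)(s[i] >= 'a' && s[i] <= 'z' ? s[i] - 32 : s[i]);
    out[i] = '\0';
}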
More details on this topic with figures are given on the Support Material at www.routledge.com/
9781032467238.
• Operation Methodology: Initially the entire user area in memory is free and treated
as a single big hole for allocation to incoming processes. Whenever a request arrives to
load a process, the memory management then attempts to locate a contiguous free area of
memory to create a partition whose size is equal to or a little bit larger than the request-
ing process’s size declared at the time of its submission or otherwise available (from the
process image file). If such a free area suitable for the requesting process is found, the
memory management then carves out a contiguous block of memory from it to create
an exact-ft partition for the process in question, and the remaining free memory, if any,
is then returned to the pool of free memory for further allocation later on. The block of
memory or the partition thus created would then be loaded with the requesting process
and is declared allocated by registering the base address, size, and the status (allocated)
in the system PDT or its equivalent, as we will see next. As usual, a link to or a copy
of this information is recorded in the corresponding process control block. The process
is now ready for execution. If no such suitable free area is available for allocation, an
appropriate error message is shown to the user, or other appropriate actions are taken by
the operating system. However, after successful allocation of a partition to a deserving
process, if another request arrives for a second process to load, the memory management
should then start attempting to locate a suitable free area immediately following the area
already allocated to the first process. This allocation, once made, is then
recorded in the modified PDT (an extension of the previously defined PDT). In this way,
successive memory allocation to requesting processes is to be continued until the entire
physical memory is exhausted, or any restriction is imposed by the system on further
admission of any other process. All this partition-related information, includ-
ing the record of free memory areas, remains essentially unchanged as long as the process(es)
resides in memory, but it needs to be updated each time a new partition is
created or an existing partition is deleted.
When adequate room is not available for a new process, it can be kept waiting until a suitable
space is found, or a choice can be made either to swap out a process or to shuffle (compaction,
to be discussed later) to accommodate the new process. Generally, shuffling takes less time than
swapping, but no other activity can proceed in the meantime. Occasionally, swapping out a single
small process will allow two adjacent medium-sized free pieces to coalesce into a free piece large
enough to satisfy the new process. A policy to decide whether to shuffle or swap could be based
on the percentage of time the memory manager spends in shuffling. If that time is less than some
fixed percentage, which is a tunable parameter of the policy, then the decision would be to shuffle.
Otherwise, a segment would be swapped out.
When a resident process is completed, terminated, or even swapped out, the operating system
demolishes the partition by returning the partition's space (as defined in the PDT) to the pool of
free memory areas and declaring the status of the corresponding PDT entry “FREE” or simply
invalidating the corresponding PDT entry. For a swapped-out process, in particular, the operating
system, in addition, also invalidates the PCB field where the information of the allocated partition
is usually recorded.
As releasing memory due to completion of a process and subsequent allocation of available
memory for the newly arrived process continues, and since the sizes of these two processes are not
the same, it happens that after some time a good number of tiny holes, not large enough to be
allocatable, form between adjacent partitions and are spread all over the memory. This
phenomenon is called external fragmentation or checkerboarding, referring to the fact that all the
holes in the memory area are external to all partitions and become increasingly fragmented, thereby
causing pure waste. This is in contrast to internal fragmentation, which refers to the waste
of memory area within the partition, as already discussed earlier. Concatenation or coalescing,
and compaction, to be discussed later, are essentially two means that can be exploited to effectively
overcome this problem.
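The bookkeeping described above, creating an exact-fit partition by carving it out of a free area and later returning the space to the free pool, can be summarized in a short C sketch. The structure and function names below are assumptions made for illustration only, not the data structures of any particular operating system.

#include <stddef.h>

/* Illustrative carving of an exact-fit partition out of a free area
   (hole) under variable partitioning; all names are assumptions.     */
struct hole  { size_t base; size_t size; };
struct vpart { size_t base; size_t size; int allocated; };

/* Returns 1 and fills *p if the hole can satisfy the request; the
   unused remainder stays in the hole for later allocation.           */
int carve_partition(struct hole *h, size_t request, struct vpart *p)
{
    if (h->size < request)
        return 0;                   /* hole too small for this process */
    p->base      = h->base;         /* exact-fit partition             */
    p->size      = request;
    p->allocated = 1;
    h->base += request;             /* remainder returned to the pool  */
    h->size -= request;             /*   of free memory                */
    return 1;
}

/* Demolishing the partition on process exit or swap-out simply returns
   its space to the free pool; coalescing of adjacent holes is handled
   separately, as discussed with the linked-list organization later.   */
void release_partition(struct vpart *p)
{
    p->allocated = 0;
}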
• Memory Allocation Model: Under this scheme, when a process is created or swapped in,
the memory of a particular size that is allocated may not be sufficient when the process is
under execution, since its data segments can grow dynamically and may even go beyond
the allocated domain. In fact, the allocated memory used by an executing process essen-
tially consists of the program code, its static data, the stack, and program-controlled dynamic (PCD) data.
The executable program code and its associated static data components are more or less constant
in size, and this size information can be obtained from the directory entry of the program fle. The
stack contains program data and also other data consisting of the parameters of the procedures,
functions, or blocks in a program that have been called but have not yet been exited from, and also
return addresses that are to be used while exiting from them. These data are allocated dynamically
when a function, procedure, or block is entered and are deallocated when it exits, making the stack
grow and shrink accordingly. The other kind of data that can grow dynamically are created by a
program using features of programming languages, such as the new statements of Pascal, C++, and
Java or the malloc and calloc statements of C. Such data are called program-controlled dynamic (PCD)
data. Normally, PCD data are allocated memory using a data structure called a heap. As execution
proceeds, the size of the stack, the PCD data, and their actual memory requirements cannot be
predicted, as they constantly grow and shrink. That is why a little extra memory is allocated for
these two dynamic components whenever a process is newly created, swapped in, or moved (to be
discussed in the next section), as shown in Figure 5.1(a) using two such processes.
To realize flexibility in memory allocation under this approach, an alternative form of memory
allocation model, as depicted in Figure 5.1(b), is developed to accommodate these two dynamic
components. The program code and its allied static data components in the program are allocated
adequate memory per their sizes. The stack and the PCD data share a single large area at the top
of its allocated memory but grow in opposite directions when memory is allocated to new enti-
ties. A portion of memory between these two components is thus kept free for either of them. In
this model, the stack and PCD data components do not have any individual size
restrictions as such. Still, if the single large area offered to both the stack and the PCD data
proves inadequate and runs out, either the process will have to be moved to a hole with enough space
(discussed next under dynamic allocation) or swapped out of memory until a befitting hole can be cre-
ated, or it ultimately may be killed.
When an executing program calls the procedures it needs from the runtime library offered
by its programming language, these library routines themselves perform allocation/deallocation
activities within the PCD area offered to them. Memory management is not at all involved in these
activities; in fact, it just allocates the required areas for the stack and PCD data, and nothing else.
FIGURE 5.1 In variable-partitioned memory management, memory space allocation for (a) a growing data
segment and (b) one free area shared by both a growing stack and a growing data segment.
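A minimal C sketch of the allocation model of Figure 5.1(b) is given below; the structure and routine names (region, free_gap, grow_heap) are assumptions made for illustration, and a real system would of course manage these limits inside the kernel and the language runtime.

#include <stdint.h>

/* Sketch of the model of Figure 5.1(b): code and static data are fixed
   in size, while the stack and the PCD (heap) data grow toward each
   other inside one shared free area. All names are assumptions.        */
struct region {
    uintptr_t heap_brk;    /* current upper edge of PCD data (grows upward)    */
    uintptr_t stack_top;   /* current lower edge of the stack (grows downward) */
};

/* How much of the shared area is still free for either component?      */
uintptr_t free_gap(const struct region *r)
{
    return r->stack_top > r->heap_brk ? r->stack_top - r->heap_brk : 0;
}

/* Grow the PCD area by 'bytes'; failure here is the point at which the
   process must be moved to a larger hole, swapped out, or killed.      */
int grow_heap(struct region *r, uintptr_t bytes)
{
    if (free_gap(r) < bytes)
        return -1;                  /* shared area exhausted            */
    r->heap_brk += bytes;
    return 0;
}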
• Dynamic Allocation: Still, the executing process often runs out of already-allocated space
and then requests more memory for its ongoing computation. One simple way to negotiate this
situation may be that the memory manager could then block the process until more adjacent
space becomes available. But this strategy is not favored for many interactive users, since it
might involve very long waits for service. Alternatively, the memory management could find
a larger hole that matches the requirement and then move the process to the new hole, thereby
releasing the currently used space. However (similar to compaction, to be discussed later),
the system would then require some means of adjusting the program’s addresses accordingly
when it moves to the new address space, and the additional overhead linked with it should
be taken into account when such an attempt is made for the sake of tuning the performance.
More details on this topic with figures are given on the Support Material at www.routledge.
com/9781032467238.
corresponding free space of respective size. Hence, it is not necessary to join the pieces
to their neighbors explicitly.
• One of the important design issues related to this scheme is the size of the allocation
unit to be chosen. The smaller the allocation unit, the closer the allocated space can match
the job's required space and hence the less the memory wastage in internal fragmen-
tation, but the size of the bit map in that case will be larger. If the allocation unit is
chosen to be large, the size of the bit map will be small, but there remains a possibil-
ity that an appreciable amount of memory may be wasted in internal fragmentation,
particularly in the last allocated unit of each process, if the process size is not an
exact multiple of the allocation unit. Hence, the tradeoff must be in the selection of
the proper size of the allocation unit that should lie somewhere in between.
• Implementation of the bit map approach is comparatively easy with almost no addi-
tional overhead in memory usage. The bit map provides a simple way to keep track
of memory space in a fxed-sized memory because the size of the bit map depends
only on the size of memory and on the size of the allocation unit. However, the main
drawback of this approach is that when a job is required to be brought into memory,
the memory manager must search the bit map to find a run of consecutive 0 (or 1)
bits of the required length in the map, which is
always a slow operation. So, in practice, at least for this type of memory allocation
strategy, bit maps are not often used and have consequently fallen out of favor.
• More details on this topic with a figure are given on the Support Material at www.
routledge.com/9781032467238.
3. Linked Lists: Another common approach to keep track of memory usage is by main-
taining a linked list of allocated and free memory segments, where a segment represents
either a process or a hole (free area). Each entry in the linked list specifes a process (P)
or a hole (H), the address at which it starts, the length (the number of allocation units),
and a pointer to the next entry. The entries in the list can be arranged either in increasing
order by size or in increasing order by address. Keeping the linked list sorted in increas-
ing order by address has the advantage that when a process is completed, terminated,
or even swapped out, updating the existing list requires only replacing a P with an H.
This process normally has two neighbors (except when it is at the very top or bottom of
memory), each of which may be either a process or a hole. If two adjacent entries are
holes, they can then be concatenated (coalesced) into one to make a larger hole that may
then be allocated to a new process, and the list also becomes shorter by one entry.
It is sometimes more convenient to use a doubly linked list to keep track of allocated and free
memory segments rather than a singly linked list, as already described. This doubly linked struc-
ture makes it much easier, when a process departs, to locate the previous entry and see if a merge
is possible. Moreover, a doubly linked list of only free areas (free list) is otherwise advantageous
too, since this organization facilitates addition/deletion of memory areas to/from the list. In
addition, if the entries in this list are arranged in increasing order by size, then the “best pos-
sible” area to satisfy a specific memory request can be readily identified. (A minimal sketch of
such free-list bookkeeping is given after this list.)
More details on this topic with figures are given on the Support Material at www.routledge.
com/9781032467238.
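The free-list bookkeeping just described, marking a departing process's segment as a hole and coalescing it with adjacent holes, might look roughly like the following C sketch; the node layout is an assumption made for illustration, and the nodes are presumed to have been allocated with malloc.

#include <stdlib.h>

/* One entry of the linked list of memory segments, kept in increasing
   order by address; the node layout is an assumption for this sketch. */
struct seg {
    int         is_hole;   /* 1 = free area (H), 0 = process (P)        */
    size_t      start;     /* starting address of the segment           */
    size_t      len;       /* length in allocation units                */
    struct seg *next;
};

/* When a process departs, its entry simply turns into a hole, and any
   adjacent holes are then coalesced into one larger hole, shortening
   the list.                                                            */
void free_segment(struct seg *head, struct seg *victim)
{
    victim->is_hole = 1;
    for (struct seg *s = head; s != NULL && s->next != NULL; ) {
        struct seg *n = s->next;
        if (s->is_hole && n->is_hole && s->start + s->len == n->start) {
            s->len += n->len;        /* merge the neighbour into s       */
            s->next = n->next;
            free(n);                 /* the list becomes shorter by one  */
        } else {
            s = n;
        }
    }
}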
• Allocation Strategy: At the time of memory allocation, two fundamental aspects in the
design are: efficient use of memory and the effective speed of memory allocation. Efficient
use of memory is again primarily associated with two issues:
1. Overhead in memory usage: The memory manager itself uses some memory to accomplish
its administrative operations, mainly to keep track of allocated and free memory segments; a
requesting process likewise uses some memory only to keep track of its own allocated memory.
2. Effective use of released memory: When a memory segment is released by a process, the
same can be reused by fresh re-allocation.
Efficient use of memory is an important factor because the size of a fresh request for memory
seldom matches the size of any released memory areas. Hence, there always remains a possibility
that some memory area may be wasted (external fragmentation) when a particular memory area is
re-allocated. Adequate care should be taken to reduce this wastage in order to improve memory
utilization and to avoid additional costly operations like compaction (to be discussed later). Common
algorithms used by the memory manager for selection of a free area for creation of a partition to
accomplish a fresh allocation for a newly created or swapped-in process are:
The simplest algorithm is first fit, where the memory manager scans along the free list of segments
(holes) until it finds one big enough to service the request. The hole is then selected and is broken up
into two pieces: one to allocate to the requesting process, while the remaining part of the hole is put back
into the free list, except in the unlikely case of an exact fit. First fit is a fast algorithm because it searches
as little as possible. The main drawback of this approach is that a hole may be split up several times,
leading to successively smaller holes not large enough to hold any single job, thereby resulting in waste.
Moreover, a sufficiently large hole may be used up by this technique, which may deprive a large process
that is already in the job queue, but arrived later, of the chance to enter memory.
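As a rough illustration of first fit, the scan-and-split step could look like the sketch below; it builds on the segment list shown earlier, and make_hole_node() is a hypothetical helper, not a real API:

    /* First fit: take the first hole large enough, splitting off the remainder. */
    static struct segment *first_fit(struct segment *head, size_t request) {
        for (struct segment *s = head; s != NULL; s = s->next) {
            if (s->kind != SEG_HOLE || s->length < request)
                continue;                              /* keep scanning          */
            if (s->length > request) {                 /* split the selected hole */
                struct segment *rest = make_hole_node();   /* hypothetical helper */
                rest->kind   = SEG_HOLE;
                rest->start  = s->start + request;
                rest->length = s->length - request;
                rest->prev   = s;
                rest->next   = s->next;
                if (s->next) s->next->prev = rest;
                s->next   = rest;
                s->length = request;
            }
            s->kind = SEG_PROCESS;                     /* exact fit or split done */
            return s;
        }
        return NULL;                                   /* no hole big enough      */
    }

Best fit and worst fit differ only in that the whole list is scanned and the smallest (or largest) adequate hole is remembered before splitting, while next fit merely resumes the scan from where the previous allocation stopped.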
Best fit is another well-known algorithm in which the entire list is searched to select the
smallest hole that is adequate in size for the requesting process. Rather than breaking up a big hole
which might be needed later, best fit tries to optimize memory utilization by selecting one close to the
actual needed size, but it then constantly generates so many tiny holes due to such allocation that these
holes eventually are of no further use. Best fit is usually slower than first fit because it searches the
entire list at every allocation. Somewhat surprisingly, it also results in more wasted memory than first
fit and its variant, next fit (to be discussed next), because it tends to constantly fill up memory with
many tiny, useless holes. First fit, in this regard, usually generates comparatively large holes on average.
Next fit is basically a minor variation of first fit. It works in the same way as first fit does, except
that it always keeps track of the position in the linked list where it finds a suitable hole. The next
time it is called, it starts searching the list for a suitable hole from the position where it
left off, instead of always starting at the beginning, as first fit does. Simulations carried out by Bays
(1977) reveal that next fit usually gives slightly worse performance than first fit. However, the next fit
technique can also be viewed as a compromise between first fit and best fit. While attempting a new
allocation, next fit searches the list starting from the next entry after its last allocation and performs
the allocation in the same way as first fit does. In this way, it avoids splitting the same area repeatedly,
as happens with the first fit technique, and at the same time it does not suffer from the allocation overhead
found in the best fit technique, which always starts searching from the beginning of the list.
Worst fit is the opposite of best fit. It always takes the largest available hole that exceeds the size of
the requesting process. Best fit, in contrast, always takes the smallest available hole that matches
the size of the requesting process. The philosophy behind the worst fit technique is obviously to
reduce the rate of production of the tiny, useless holes that best fit constantly generates. However, studies
based on simulation reveal that worst-fit allocation is not very effective in reducing wasted memory,
particularly when a series of requests is processed over a considerable duration of time.
Quick fit is yet another allocation algorithm, which normally maintains separate lists for some
of the most common sizes of segments usually requested. With this arrangement, searching for a hole
of the required size using quick fit is, no doubt, extremely fast, but the scheme suffers from the same
drawback as most of the other schemes that sort by hole size: when a process
completes, terminates, or is swapped out, finding its neighbors to see whether a merge is possible is
extremely tedious. If merging cannot be carried out, it is quite natural that the memory will quickly
fragment into a large number of tiny, useless holes.
While working with experimental data, Knuth concluded that first fit is usually superior to best
fit in practice. Both first fit and next fit are observed to perform better than their counterpart, best
fit. However, next fit has a tendency to split all the areas if the system is in operation for quite a long
time, whereas first fit may not split all of the last few free areas, which often helps it allocate these
large memory areas to deserving processes.
To implement any of these methods, the memory manager needs to keep track of which pieces of
physical memory are in use and which are free using a data structure known as a boundary tag (discussed
later in "Merging Free Areas"). In this structure, each free area has physical pointers that link all free
areas in a doubly linked list. They do not have to appear in address order in that list. Also, each such area,
whether free or in use, has its first and last words reserved to indicate its status (free or busy) and its
length. When a free block is needed, the doubly linked list is searched from the beginning or from
the last stopping point until an adequate area is found. If found, it is then split up, if necessary, to
form a new busy piece and a new free piece. If the fit is exact, the entire piece is removed from the
doubly linked free list. When a piece is returned to free space, it is joined to the pieces before it and
after it (if they are free) and then put on the free list.
All four algorithms as presented can be further sped up by maintaining distinctly separate lists for
processes and holes, so that only the hole list needs to be inspected at the time of allocation. The hole list can again
be kept sorted in ascending order by size, which, in turn, enables the best fit technique to work faster, ensuring
that the hole thus found is the smallest possible one that satisfies the request. With this arrangement, first fit and
best fit are equally fast, and next fit loses its advantage. On the other hand, this arrangement invites additional
complexity and a subsequent slowdown when deallocating memory, since a now-free segment has to be
removed from the process list and inserted appropriately into the hole list, with concatenation if required.
Various other algorithms are also found, particularly to counter situations when the size of the
requesting processes, or even the probability distribution of their need and process lifetimes, are not
known to the system in advance. Some of the possibilities in this regard are described in the work
of Beck (1982), Stephenson (1983), and Oldehoeft and Allan (1985).
in many other different ways; one such useful derivation of it is the unused memory rule,
whose explanation with mathematical computation is outside the scope of this discussion.
The details of this topic with mathematical computation are given on the Support
Material at www.routledge.com/9781032467238.
• Merging Free Areas: External fragmentation results in the production of many useless tiny
holes that may give rise to substantial wastage in memory, which eventually limits the system
in satisfying all the legitimate demands that could otherwise be possibly met. Consequently, it
affects the expected distributions of available memory among the requesting processes. Merging
of free areas including many useless tiny holes is thus carried out to generate relatively large
free areas that can now hold processes, thereby providing better memory utilization to improve
system performance as well as neutralizing the evil effects caused by external fragmentation.
Merging can be carried out whenever an area is released, and it is accomplished by checking the
free list to see if there is any area adjacent to this new area. If so, this area can be merged with the
new area, and the resulting new area can then be added to the list, deleting the old free area. This
method is easy to implement but is expensive, since it involves a search over the entire free list every
time a new area is added to it. Numerous types of techniques already exist to accomplish merging
in various ways. Two generic techniques to perform merging that work most efficiently, boundary
tags and memory compaction, will now be discussed.
• Boundary tags: Using boundary tags, both the allocated and free areas of memory can be
tracked more conveniently. A tag is basically a status descriptor for a memory segment con-
sisting of an ordered pair (allocation status, size). Two identical tags containing similar infor-
mation are stored at the beginning and the end of each segment, that is, in the first and last
few bytes of the segment. Thus, every allocated and free area of memory contains tags near
its boundaries. If an area is free, additional information, the free list pointer, follows the tag at
the starting boundary. Figure 5.2 shows a sample of this arrangement. When an area becomes
free, the boundary tags of its neighboring areas are checked. These tags are easy to locate
because they immediately precede and follow the boundaries of the newly freed area. If any
of the neighbors is found free, it is immediately merged with the newly freed area. Three pos-
sibilities (when the new free area has its left neighbor free, when the new free area has its right
neighbor free, and when the new free area has both neighbors free) may happen at the time of
merging, which is depicted on the Support Material at www.routledge.com/9781032467238.
The details of this topic with a figure and explanation are given on the Support Material at www.
routledge.com/9781032467238.
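A possible C layout of such tags is sketched below; the field widths, the names, and the assumption that an area's recorded size includes both of its tags are illustrative only:

    #include <stdbool.h>
    #include <stddef.h>

    /* Identical (status, size) tags sit in the first and last few bytes of
       every area, so a newly freed area can inspect its neighbours directly. */
    struct boundary_tag {
        bool   free;    /* allocation status of the area                      */
        size_t size;    /* total length of the area, including both tags      */
    };

    /* End tag of the area that immediately precedes 'area'.                  */
    static bool left_neighbour_free(const char *area) {
        const struct boundary_tag *t =
            (const struct boundary_tag *)(area - sizeof(struct boundary_tag));
        return t->free;
    }

    /* Start tag of the area that immediately follows 'area' of length 'size'. */
    static bool right_neighbour_free(const char *area, size_t size) {
        const struct boundary_tag *t =
            (const struct boundary_tag *)(area + size);
        return t->free;
    }

Both checks assume the freed area is not at the very top or bottom of the managed memory; a real allocator would guard those cases explicitly.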
FIGURE 5.2 In variable-partitioned memory management, use of boundary tags and a free-area pointer to coalesce free spaces and minimize external fragmentation.
• Memory Compaction: In the variable-partitioned memory management scheme, the
active segments (processes) are interspersed with the holes throughout the entire memory
in general. To get out of this situation, one way may be to relocate some or all processes towards one
end of memory as far as possible by changing memory bindings such that all the holes can
come together and be merged to form a single big segment. This technique is known
as memory compaction. All location-sensitive information (addresses) of all involved pro-
cesses has to be relocated accordingly. If the computer system provides a relocation register
(base register), relocation can be achieved by simply changing the contents of the relocation
register (relocation activity, already discussed separately in the previous section). During
compaction, the processes involved (those processes to be shifted) must be suspended and
actually copied from one area of memory into another. It is, therefore, important to care-
fully decide when and how this compaction activity is to be performed. Figure 5.3 is self-
explanatory and illustrates the use of compaction while free areas are to be merged.
Now the question arises of how often the compaction is to be done. In some systems, compaction
is done periodically with a period decided at system generation time. In other systems, compaction
is done whenever possible or only when it is needed, for example, when none of the jobs in the job
queue finds a suitable hole, even though the combined size of the available free areas (holes) in a
scattered condition exceeds the needs of the request at hand. Compaction here is helpful and may
be carried out to fulfill the pending requirement. An alternative, found in some systems, is to execute
memory compaction whenever a free area is created by a departing process, thereby collecting most
of the free memory into a single large area.
When compaction is to be carried out, it is equally important to examine all the possible options for
moving processes from one location to another in terms of minimizing the overhead to be incurred while
selecting the optimal strategy. A common approach to minimize the overhead is always to attempt to
relocate all partitions to one end of memory, as already described. During compaction, the affected pro-
cess is temporarily suspended and all other system activities are halted for the time being. The compac-
tion process is completed by updating the free list (in the linked-list approach) and the affected PDT entries.
As the compaction activity, in general, is associated with excessive operational overhead, dynamic
partitioning of memory is hardly ever implemented in systems that do not have dynamic relocation hardware.
Details on this topic with a figure and explanation are given on the Support Material at www.
routledge.com/9781032467238.
FIGURE 5.3 In variable-partitioned memory management, compaction of memory is used to reduce the effect of external fragmentation. An example of memory compaction is shown with different configurations of memory usage.
• Space Management with the Buddy System: Memory management with linked lists
sorted by hole size makes the allocation mechanism very fast but handles merging after
deallocation of a process rather poorly. The buddy system (Knuth, 1973), in fact,
is a space (memory) management algorithm that takes advantage of the fact that computers
use binary numbers for addressing, and it speeds up the merging of adjacent holes
when a process is deallocated. This scheme performs allocation of memory in blocks of
a few standard sizes, essentially powers of 2. This approach reduces the effort involved
in allocation/deallocation and merging of blocks, but it is extremely inefficient in terms of
memory utilization, since all requests must be rounded up to the nearest power of 2, which
leads to a substantial amount of internal fragmentation on average, unless the
requirement of the requesting process is close to a power of two.
The entire memory is here split and recombined into blocks in a pre-determined manner during
allocation and deallocation, respectively. Blocks created by splitting a particular block into two
equal sizes are called buddy blocks. Free buddy blocks can later be merged (coalesced) to recon-
struct the block which was split earlier to create them. Under this system, adjacent free blocks that
are not buddies are not coalesced. Thus, each block x has only a single buddy block that either pre-
cedes or follows x in memory. The size of different blocks is 2^n for different values of n ≥ t, where t
is some threshold value. This restriction ensures that memory blocks are not uselessly small in size.
To control the buddy system, memory management associates a 1-bit tag with each block to indicate
the status of the block, that is, whether allocated or free. As usual, the tag of a block may be
located within the block itself, or it may be kept separately. The memory manager maintains many
lists of free blocks; each free list consists of free blocks of identical size, that is, all blocks of size 2^k
for some k ≥ t, and is maintained as a doubly linked list.
Operation methodology: Allocation of memory begins with several free lists of block
sizes 2^c for different values of c ≥ t. When a process requests a memory block of size m, the system
determines the smallest i such that 2^i ≥ m. If the list of blocks of size 2^i is not empty,
it allocates the first block from that list to the process and changes the tag of the block from free to
allocated. If the list is found empty, it checks the list of blocks of size 2^(i+1). It takes one block off
this list and splits it into halves of size 2^i. These blocks become buddies. It then puts one of these
blocks into the free list for blocks of size 2^i and uses the other block to satisfy the request. If a block
of size 2^(i+1) is not available, it looks in the list of blocks of size 2^(i+2). It then takes one block off this
list and, after splitting it into two halves of size 2^(i+1), puts one of these halves into
the free list for blocks of size 2^(i+1), while the other half is split further in the same manner
for allocation as already described. If a block of size 2^(i+2) is not available, it starts inspecting the
list of blocks of size 2^(i+3), and so on. Thus, several splits may have to be carried out before a request
can ultimately be met.
After deallocation of a process that releases a memory block of size 2^i, the buddy system then
changes the tag of the block to free and checks the tag of its buddy block to see whether the buddy
block is also free. If so, it merges these two blocks into a single block of size 2^(i+1). It then repeats the
check for further coalescing transitively; that is, it checks whether the buddy of this new block of
size 2^(i+1) is free, and so on. It enters a block in the respective free list only when it finds that its buddy
block is not free.
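Because every block's size is a power of two and splits always occur at aligned boundaries, the location of a block's buddy can be computed with a single exclusive-OR, which is exactly what makes the merge check above cheap. A minimal sketch (offsets measured from the start of the managed area; not a complete allocator):

    #include <stdint.h>
    #include <stdio.h>

    /* A block of size 2^i that starts at offset 'addr' has its buddy at
       addr XOR 2^i: the two halves of the block they were split from.  */
    static uintptr_t buddy_of(uintptr_t addr, unsigned i) {
        return addr ^ ((uintptr_t)1u << i);
    }

    int main(void) {
        printf("%lu\n", (unsigned long)buddy_of(0, 4));    /* prints 16 */
        printf("%lu\n", (unsigned long)buddy_of(16, 4));   /* prints 0  */
        printf("%lu\n", (unsigned long)buddy_of(48, 4));   /* prints 32 */
        return 0;
    }

The allocator therefore only has to look up this one computed address in the free list of blocks of size 2^i, rather than scanning every free area.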
Buddy systems have a distinct advantage over algorithms that sort blocks (multiples of the block
size) by size but not necessarily by address. The advantage is that when a block of size 2^x bytes is
freed, the memory manager has to search for its buddy only in the free list of 2^x blocks to see
if a merge (coalesce) is possible. With other algorithms that allow memory blocks to be split in
arbitrary ways, all the free lists (or a single list of all holes) must be searched to find a possible merge,
which is more time consuming; the buddy system in this regard exhibits a clear edge over the others.
The serious drawback of this approach is that it is extremely inefficient in terms of memory
utilization. The reason is that all requests must be rounded up to a power of 2 at the time of
allocation, and the sizes of most requesting processes happen not to be very close to
any power of 2. This results in a substantial amount of internal fragmentation on average, which is
simply a waste of memory, unless the size of a requesting process comes close to
a power of 2.
Various authors (notably Kaufman, 1984) have modified the buddy system in different ways to
attempt to get rid of some of its problems. The UNIX System V Release 4 (SVR4) kernel uses this basic approach
for management of memory that is needed for its own use and, of course, adds some modifications
to the underlying strategy (a new one is the lazy buddy allocator). A brief discussion of this
modified approach is provided in a later section.
The details of this topic with a figure and explanation are given on the Support Material at www.
routledge.com/9781032467238.
Space Management with Powers-of-Two Allocation: This approach resembles the buddy system
in construction but differs in operation. Similar to the buddy system, memory blocks are also
always maintained as powers of 2, and separate free lists are maintained for blocks of different sizes.
But here an additional component, a header element, is attached to each block and is
used for two purposes. This header element consists of two fields, as shown in Figure 5.4. It contains
a status flag which indicates whether the block is currently free or allocated. If a block is free, the other
field in the header then contains the size of the block. If a block is allocated, the other field in the header
then contains the address of the free list to which it should be added when it becomes free.
When a request for m bytes arrives, the memory manager looks for the smallest
free block whose size is a power of 2 and large enough to hold m bytes, that is, 2^i ≥ m. It first checks
the free list containing blocks whose size is 2^x for the smallest x such that 2^x ≥ m. If that free list is
empty, it then checks the list containing blocks that are the next higher power of 2 in size, and so
on. An entire block is always allocated to a request; that is, no splitting of blocks is carried out at
the time of allocation. Thus, when a process releases a block, no effort is needed to merge (coalesce)
adjoining blocks to reconstruct larger blocks: the block is simply returned to its free list.
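A compact sketch of the idea, under stated assumptions (a fixed range of block sizes, free lists indexed by the exponent, and an index stored in the header instead of the free-list address the text describes), is given below:

    #include <stddef.h>

    #define MIN_POW 4            /* smallest block: 2^4  bytes (assumed) */
    #define MAX_POW 12           /* largest block:  2^12 bytes (assumed) */

    struct p2_block {
        struct p2_block *next;   /* link within this size's free list    */
        unsigned pow;            /* header: which free list it came from */
    };

    static struct p2_block *free_list[MAX_POW + 1];

    /* Allocate a whole block from the smallest non-empty list whose size
       2^x satisfies 2^x >= request; blocks are never split.             */
    static struct p2_block *p2_alloc(size_t request) {
        for (unsigned x = MIN_POW; x <= MAX_POW; x++) {
            if (((size_t)1 << x) >= request && free_list[x] != NULL) {
                struct p2_block *b = free_list[x];
                free_list[x] = b->next;
                b->pow = x;
                return b;
            }
        }
        return NULL;             /* a new block would be created here    */
    }

    /* Release: push the block back on its own list; no coalescing ever. */
    static void p2_free(struct p2_block *b) {
        b->next = free_list[b->pow];
        free_list[b->pow] = b;
    }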
Operation of the system starts by creating blocks of desired sizes and entering them in the corresponding
free lists. New blocks can be created dynamically whenever the system runs out of blocks
of a requested size or when no available block can be allocated to fulfill a specific request. The UNIX
4.4 BSD kernel uses this approach for management of memory that is needed for its own use and,
of course, adds some modifications to the underlying strategy (a new one is the McKusick-Karels
allocator). A brief discussion of this modified approach is provided in a later section.
The effectiveness of these allocation schemes can be compared in terms of a memory utilization factor, defined as:

    Memory utilization factor = (memory in use) / (total memory committed)
where memory in use is the amount of memory in actual use by the requesting processes, and
total memory committed includes allocated memory, free memory available with the memory
manager, and memory used by the memory manager to store its own data structures to manage
the entire memory system. The larger the value of the utilization factor, the better the performance
of a system, because most of the memory will then be in productive use; the smaller the value,
the worse the performance.
In terms of this utilization factor, both the buddy system and the powers-of-two allocation scheme
fare poorly because of appreciable internal fragmentation in general. These
schemes also require additional memory to store the list headers and tags that control their operations
and the keys used for protection.
and keys for protection. The powers-of-two allocation scheme, in addition, suffers from another
peculiar problem. Since it does not merge free blocks to reconstruct large blocks, it often fails to
satisfy a request to hold a job of sufficiently large size even when free contiguous blocks of smaller
sizes are available that could satisfy the request if they could be merged. In a buddy system, this
situation rarely occurs, and it could happen only if adjoining free blocks are not buddies. In fact,
Knuth (1973) reports that in simulation studies, a buddy system is usually able to achieve 95 percent
memory utilization before failing to fulfill a specific requirement.
On the other hand, the first-fit, next-fit, and best-fit techniques provide better memory utilization,
since the space in an allocated block is mostly used up; the first-fit and next-fit techniques, however,
may sometimes give rise to appreciable internal fragmentation, which is inherent in the nature of
these two techniques. All these techniques also suffer from external fragmentation, since
free blocks of tiny sizes are continuously generated that are too small to satisfy any standard request.
FIGURE 5.5 An example of non-contiguous memory allocation (panels (a)-(c)): a process P of 55,000 bytes is placed into three available free areas of memory as components P(1), P(2), and P(3).

FIGURE 5.6 A schematic representation of the address translation used in a non-contiguous memory allocation scheme: the memory management unit (MMU) uses a table containing the memory information of processes (for example, of process P) to convert the address of the instruction currently being executed into the actual physical address.
If, for example, the next instruction to be executed lies at an offset of 25 Kbytes within component
P(2) of process P, and P(2) has been loaded at address 400 Kbytes, the physical address of the referenced
instruction will be (400K + 25K) = 425 Kbytes, as shown in Figure 5.5(c).
• Logical addresses, physical addresses, and address translation: In the case of non-
contiguous allocation, there is a clean separation between user’s logical address space and
actual physical address space. The logical address space is the logical view of the process,
and the real view of the active process in the memory is called the physical view of the
process. During execution, the CPU issues logical addresses, and each must be translated into
the corresponding physical address. The OS stores information about the memory
areas allocated to different components of a process (here the process P) in a table created,
managed, and controlled by the memory management unit. The table includes the memory
start addresses and the sizes of each of the components of the process (here it is made of
three components P(1), P(2), and P(3) of the process P). This is depicted in Figure 5.6.
The CPU always sends the logical address of each constituent (instruction or data) of the
process to the MMU, and the MMU, in turn, uses the table containing memory allocation
information to compute the corresponding actual physical address, which is also called the
effective memory address. The computing procedure using memory mapping hardware
to derive the corresponding effective memory address from a logical address at execution
time is called address translation.
In general, a logical address issued by the CPU mainly consists of two parts: the id of the process
component containing the issued address and the specific byte within the component (offset)
indicating the particular location. Each referenced address will then be represented by a pair of
the form:

(comp_k, offset_k)

where offset_k is the displacement of the target location from the beginning of the process
component comp_k. The MMU computes the effective memory address of the target location (comp_k,
offset_k) using the formula:

Effective memory address = start address of comp_k + offset_k
The OS (memory management) records the required memory information concerning a process
(here, process P) in the table as soon as it schedules the process and provides the table to the MMU
in order to carry out the required address translation whenever it is needed.
Logical and physical organization in relation to this topic, with a figure, are given on the Support
Material at www.routledge.com/9781032467238.
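Under the stated assumptions (the table is modelled as a simple array of start addresses and sizes; the names are hypothetical), the MMU's computation amounts to no more than a lookup, a bounds check, and an addition:

    #include <stddef.h>
    #include <stdint.h>

    struct component_info {
        uintptr_t start;    /* physical start address of the component */
        size_t    size;     /* size of the component in bytes          */
    };

    /* Effective address of logical address (comp_k, offset_k); a real MMU
       would raise an addressing exception instead of returning a flag.   */
    static uintptr_t translate(const struct component_info *table,
                               size_t comp_k, size_t offset_k) {
        const struct component_info *c = &table[comp_k];
        if (offset_k >= c->size)
            return (uintptr_t)-1;          /* invalid reference          */
        return c->start + offset_k;        /* start address + offset     */
    }

For the example above, a table entry with start address 400K (and a size of at least 25K) for P(2) would map offset 25K to the physical address 425K.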
• Paging
• Segmentation
be necessarily contiguous, and the policy for page frame allocation has practically no impact on
memory utilization, like first-fit, best-fit, and so on, since all frames fit all pages and any such fit in
any order is as good as any other.
Thus, we observe that simple paging, as described here, is very similar to a fixed-partition mechanism.
The differences are that, with paging, the partitions are rather small and obviously of a fixed
size, a program may occupy more than one partition, and these partitions also need not be contiguous.
The consequences of making each page frame the same in size as a page are twofold. First, a page
can be directly and exactly mapped into its corresponding page frame. Second, the offset of any
address within a page will be identical to the offset in the corresponding page frame; no additional
translation is required to convert the offset within a page in the logical address to obtain the offset
of the corresponding physical address; only one translation is required, and that is to convert a page
in the logical address into the corresponding page frame to derive the required physical address.
Using a page size that is a power of 2 has two important effects. First, the logical addressing scheme is
much easier and more convenient for the programmer to use and is also transparent to the developer of
all system utilities like the assembler, the linker, and the database management system. Second, it
is relatively easy to implement a mechanism in hardware that can quickly carry out the required
address translation during runtime.
An example of a simple paging system with illustrations is given on the Support Material at www.
routledge.com/9781032467238.
• Address translation: The mechanism to translate a logical address into its corresponding
physical address uses the following elements:
s: Size of a page.
l_l: Length of a logical address.
l_p: Length of a physical address.
n_p: Maximum number of bits required to express the page number in a logical address.
n_o: Maximum number of bits required to express the offset within a page.
n_f: Maximum number of bits required to express the page frame number in a physical address.
Since the size of a page s is a power of 2, s = 2^n_o. Hence, the value of n_o depends
on the size of the page, and the least significant n_o bits in a logical address give the offset b_i. The
remaining n_p = l_l - n_o bits in a logical address form the corresponding page number p_i. The values of b_i
and p_i can be obtained by grouping the bits of a logical address as follows:

|<---- n_p bits ---->|<---- n_o bits ---->|
|        p_i         |        b_i         |
The effective memory address (physical address) is then generated using the following steps that are
needed for address translation.
• Extract the page number p_i as the leftmost n_p bits of the logical address.
• Use the page number as an index into the process page table to find the corresponding page
frame number. Let it be q_i, and let the number of bits in it be n_f.
• The remaining rightmost n_o bits of the logical address, which are the offset within the page,
are then simply appended (concatenated) to the right side of the frame number q_i just obtained
to construct the needed physical address.
Since each page frame is identical in size to a page, n_o bits are also needed as the offset b_i to address
a location within a page frame. The number of bits needed to address a page frame in an effective
memory address is n_f = l_p - n_o. Hence, an effective memory address (physical address) of a targeted
byte b_i within a page frame q_i can be represented as:

|<---- n_f bits ---->|<---- n_o bits ---->|
|        q_i         |        b_i         |
The processor hardware, after accessing the page table of the currently executing process, carries
out this logical-to-physical address translation. The MMU can then derive the effective memory
address by simply concatenating q_i and b_i to generate the physical address of l_p bits.
Conclusively, it can be stated that with simple paging, main memory is divided into many
small frames of equal size. Each process is then divided into frame-size pages. When a process is
loaded into memory for execution, all of its pages, if possible, are brought into available frames,
and accordingly a page table is set up to facilitate the needed address translation. This strategy, in
effect, summarily solves many of the critical issues causing serious problems that are inherent in a
partitioned-memory management approach.
An example of address translation with illustration is given on the Support Material at www.
routledge.com/9781032467238.
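Since the page size is a power of 2, the split into p_i and b_i and the recombination with q_i reduce to shifts and masks. A minimal sketch (the page size of 2^12 bytes and the toy page table are assumptions for illustration only):

    #include <stdint.h>
    #include <stdio.h>

    #define N_O 12u                               /* offset bits: 4 KB pages */
    #define PAGE_MASK ((1u << N_O) - 1u)

    /* page_table[p_i] holds the page frame number q_i for page p_i. */
    static uint32_t translate(uint32_t logical, const uint32_t *page_table) {
        uint32_t p_i = logical >> N_O;            /* leftmost n_p bits       */
        uint32_t b_i = logical & PAGE_MASK;       /* rightmost n_o bits      */
        uint32_t q_i = page_table[p_i];           /* frame from page table   */
        return (q_i << N_O) | b_i;                /* concatenate q_i and b_i */
    }

    int main(void) {
        uint32_t pt[4] = {7, 2, 9, 0};            /* toy page table          */
        /* Logical 0x1ABC = page 1, offset 0xABC; page 1 maps to frame 2,
           so the physical address is 0x2ABC.                                */
        printf("0x%X\n", translate(0x1ABC, pt));
        return 0;
    }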
• Protection: Under the paging scheme, the dedicated page table of the running process
is used for the purpose of address translation, which prevents any memory reference of the
process from crossing the boundary of its own address space. Moreover, use of a page table
limit register, which contains the highest logical page number defined in the page table of the
running process, together with the page table base register, which points to the start address
of the corresponding page table of the same running process, can then detect, arrest, and
prevent any unauthorized access to memory beyond the prescribed boundaries of the running
process. Moreover, protection keys can also be employed in the usual way with each
memory block (page frame) of a specific size. In addition, by associating access-right bits
with protection keys, access to a page may be restricted whenever necessary, which is
particularly beneficial in situations where pages are shared by several processes.
A brief description of this topic is given on the Support Material at www.routledge.
com/9781032467238.
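The limit check itself is inexpensive; a sketch under these assumptions (the two registers modelled as plain variables, the page table as an array of frame numbers):

    #include <stdbool.h>
    #include <stdint.h>

    struct pt_registers {
        const uint32_t *ptbr;   /* page table base register            */
        uint32_t        ptlr;   /* limit: highest valid page number    */
    };

    /* Yields the frame number only if the page number is within bounds;
       otherwise the reference would be trapped as a protection fault. */
    static bool lookup_frame(const struct pt_registers *r,
                             uint32_t page, uint32_t *frame) {
        if (page > r->ptlr)
            return false;
        *frame = r->ptbr[page];
        return true;
    }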
• Sharing: Shared Pages: Sharing of pages in systems with paged memory management is
quite straightforward and falls into two main categories. One is the static sharing that results
from static binding carried out by a linker or loader before the execution of a program begins.
With static binding, if two processes X and Y statically share a program Z, then Z is included
in the body of both X and Y. As a result, the sizes of both X and Y will be increased and will
consume more memory space if they exist in main memory at the same time.
The other one is the dynamic binding that removes one of the major drawbacks of static binding, par-
ticularly in relation to redundant consumption of memory space. This approach can be used to conserve
memory by binding (not injecting) a single copy of an entity (program or data) to several executing pro-
cesses. The shared program or data still retains its own identity, and it would be dynamically bound to
several running processes by a dynamic linker (or loader) invoked by the memory management. In this
way, a single copy of a shared page can be easily mapped to as many distinct address spaces as desired.
Since each such mapping is performed with a dedicated entry in the page table of the related sharing
process, each different process may have different access rights to the shared page. Moreover, it must
be ensured by the operating system that the logical offset of each item within a shared page should be
identical in all participating address spaces since paging is transparent to the users.
A brief description of this topic with a figure is given on the Support Material at www.routledge.com/9781032467238.
5.8.2.1.1.1 Conclusion: Merits and Drawbacks A paging system is entirely controlled and
managed by the operating system, which outright discards the mandatory requirement of contiguous
allocation of physical memory; pages can be placed into any available page frames in memory.
It practically eliminates external fragmentation and thereby relieves the operating system of the
tedious task of periodically executing memory compaction. Wasted space in memory due to internal
fragmentation for each process is almost nil and will consist of only a fraction of the last page of a
process. Since this system is quite simple in allocation and deallocation of memory, the overhead
related to management of memory is appreciably lower in comparison to other schemes. With a
small page size, main memory utilization may be quite high when compared to other
schemes, and this, in conjunction with process scheduling, may optimize the usage of memory
even more.
The paging system, however, also suffers from certain drawbacks. It increases the space overhead
for the purpose of memory management, and more time is spent accessing the referenced
entities. These overheads are:
• Creation of a memory-map table to keep track of the entire memory usage available with
the computer system.
• Creation of a page-map table per process. The storage overhead of the page-map table,
also known as table fragmentation, may be quite large in systems with small page size and
large main memory.
• Although it is not a significant amount, it still wastes memory due to internal fragmentation
that may happen in the last page of a process. If a good number of processes exist in
the system at any instant, this may result in appreciable wastage in the main memory area.
• The address translation process is rather costly and severely affects the effective memory
bandwidth. Of course, this can be overcome by incurring extra cost with the use of addi-
tional dedicated address translation hardware.
In addition, it is generally more restrictive to implement sharing in a paging system when compared
with other systems, in particular with segmentation. It is also difficult to realize adequate protection
within the boundaries of a single address space.
different segments of a process may be loaded in separate, noncontiguous areas of physical memory
(i.e., the base addresses of the segments are different), but entities belonging to a single segment must be
placed in contiguous areas of physical memory. Thus, segmentation can be described as a hybrid
(dual) mechanism that possesses some properties of both contiguous (with regard to individual
segments) and noncontiguous (with respect to address space of a process) approaches to memory
management. Moreover, the segments thus generated are, in general, individually relocatable.
From the operating system’s point of view, segmentation is essentially a multiple-base-limit ver-
sion of dynamically partitioned memory. Memory is allocated in the form of variable partitions;
the main difference is that one such partition is allocated to each individual segment. It is normally
required that all the segments of a process must be loaded into memory at the time of execution
(except in the presence of overlay schemes and virtual memory, which will be described later). As
compared to dynamic partitioning, a program with segmentation may occupy more than one parti-
tion, and these partitions also need not be contiguous.
• Principles of Operation: The programmer declares the segments while coding the appli-
cation. The translator (compiler/assembler) generates logical addresses of each segment
during translation to begin at its own logical address 0. Any individual item within a specific
segment is then identified by its offset (displacement) relative to the beginning of the
segment. Thus, addresses in segmented systems have two components:
• Segment name (number)
• Offset (displacement) within the segment
Hence the logical address of any item in a segmented process has the form (s_k, b_k), where s_k
is the id of the segment (or segment number) to which the item belongs, and b_k is the offset of the
item's specific location within that segment. For example, an instruction corresponding to a statement
call hra.cal, where hra.cal is a procedure in segment hraproc, may use the operand address
(hraproc, hra.cal), or it may use a numeric representation of s_k and b_k for that specific statement.
Figure 5.7(a) illustrates the logical view of a load-module sample of a process P with all its segments.
Figure 5.7(b) depicts a sample placement of these already-defined segments into physical
memory with the resulting SDT [Figure 5.7(a)] already formed by the operating system in order to
facilitate the needed address translation. Each segment descriptor (entry) in this table shows the
physical base address of the memory area allocated to each defined segment, and the size field of
the same segment descriptor is used to check whether the supplied offset in the logical address is
within the legal bounds of its enclosing segment. Here, the name of the segment is also included
in the respective segment descriptor, but only for the purpose of better understanding.

FIGURE 5.7 A schematic representation of an example showing different memory segments and the related address translation mechanism used in segmented memory management: (a) the segment table of process P, listing its segments (CODE, STACK, UPDATE, and DATA) with their sizes and load addresses; (b) the translation of a logical (segment number, offset) pair into a physical address.

The MMU uses this SDT to perform address translation and uses the segment number provided in the
logical address to index the segment descriptor table to obtain the physical base (starting) address of
the related segment. The effective memory address is then generated by adding the offset b_k of the
desired item to the base (start) address of its enclosing segment s_k. Thus, the address
translation mechanism in the segmentation approach is definitely slower in operation and hence
more expensive than paging.
Apart from the comparatively high overhead while carrying out address translation, segmentation
of the address space incurs extra costs due to the additional storage of quite large segment descriptor
tables and the subsequent accesses to them. Here, each logical address reference requires two physical
memory references: one to the segment descriptor table to obtain the base address of the target
segment and one to the target location itself.
In other words, segmentation may reduce the effective memory bandwidth by half in comparison to
the actual physical memory access time.
A brief description of this topic with an example is given on the Support Material at www.
routledge.com/9781032467238.
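The mapping described above (a bounds check against the segment's size, then base plus offset) can be sketched as follows; the SDT is modelled here as a plain array of (base, size) descriptors with assumed names:

    #include <stddef.h>
    #include <stdint.h>

    struct seg_descriptor {
        uintptr_t base;     /* physical start address of the segment */
        size_t    size;     /* segment length in bytes               */
    };

    /* Translate logical address (s_k, b_k); a real MMU would raise an
       addressing exception on a bounds violation instead of a flag.   */
    static uintptr_t seg_translate(const struct seg_descriptor *sdt,
                                   size_t s_k, size_t b_k) {
        const struct seg_descriptor *d = &sdt[s_k];
        if (b_k >= d->size)
            return (uintptr_t)-1;     /* offset outside the segment   */
        return d->base + b_k;         /* effective memory address     */
    }

Each logical reference thus costs one access to the SDT entry (or to a segment descriptor register, as discussed below) in addition to the access to the target location itself.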
• Hardware Requirements: Since the size of the SDT itself is related to the size of the logical
(virtual) address space of a process, the length of an SDT may vary from a few entries to
several thousand. That is why SDTs may themselves often be assigned a separate special type
of segment (partition) of their own. The accessing mechanism of the SDT is often facilitated by
means of a dedicated hardware register known as the segment descriptor table base register
(SDTBR), which is set to point to the base (starting) address of the SDT of the currently running
process. Another dedicated register, called the segment descriptor table limit register (SDTLR),
is set to mark the end of the SDT indicated by the SDTBR. This register effectively limits the
number of entries in the SDT to the actual number of segments defined in
the given process. As a result, any attempt to access a nonexistent segment (i.e., a segment
number beyond this limit) can be detected, and access will then be denied with an exception.
When a process is initiated by the operating system, the base and limit of its SDT are nor-
mally kept in the PCB of the process. Upon each process switch, the SDTBR and SDTLR are
loaded with the base and length, respectively, of the corresponding SDT of the new running process.
Moreover, when a process is swapped out, all the SDT entries in relation to the affected segments
are invalidated. When a process is again swapped back in, the SDT itself is swapped back with
required updating of the base fields of all its related segment descriptors so as to reflect the new load
addresses. Since this action is an expensive one, the swapped-out SDT itself is generally not used.
Instead, the SDT of the swapped-out process may be discarded, and a new up-to-date SDT is created
from the load-module map of the swapped-in process when it is once again loaded back in memory.
In the case of memory compaction, if supported, when it is carried out, the segments of a process are
relocated. This requires updating of the related SDT entries for each segment that is moved. In such
systems, some additional or modifed data structures and add-on hardware support may be needed
so as to identify the related SDT entry that points to the segment scheduled to be moved.
Segment Descriptor Caching: To reduce the duration of time required for the slow address
translation process in segmented system, some designers suggest keeping only a few of the most
frequently used segment descriptors in registers (segment descriptor register, SDR) in order to avoid
time-consuming memory access of SDT entries for the sake of mapping. The rest of the SDT may
then be kept in memory in the usual way for the purpose of mapping. Investigations of the types of
segments referenced by the executing process reveal that there may be functionally three different
categories of segment: instructions (code), data, and stack. Three dedicated registers (for the purpose
of mapping) may thus be used, and each register will contain the base (beginning) address of one
such respective segment in main memory and its size (length). Since, in most machines that use
segmentation, the CPU emits a few status bits to indicate the type of each memory reference, the
MMU can then use this information to select one of these registers for appropriate mapping.
Segment descriptor registers are initially loaded from the SDT. In a running process, whenever an
intersegment reference is made, the corresponding segment descriptor is loaded into the respective reg-
ister from the SDT. Since in a running process, different segments appear at different times, SDRs are
normally included in the process state. When switching of process occurs, the contents of the SDRs of
the departing process are stored with the rest of its context. Before dispatching the new running pro-
cess, the operating system loads the SDRs with their images, recorded in the related PCB. Use of such
hardware-assisted SDRs has been found to accelerate the translation process satisfactorily; hence, they
are employed in many segmented architectures, including Intel’s iAPX-86 family of machines.
• Protection: Separation of the address spaces of distinctly different processes, placing them
in different segments in disjoint areas of memory, primarily realizes protection between
the processes. In addition, the MMU, at the time of translating a logical address (s_k, b_k),
compares b_k with the size of the segment s_k, available in a field of each and every SDT entry,
thereby restricting any attempt to break the protection. However, protection between the
segments within the address space is carried out by using the type of each segment defined
at the time of segment declaration, depending on the nature of the information (code, data, or
stack) stored in it. Access rights to each segment can even be included using the respective
access-rights bits in the SDT entry. Since logically related items are grouped in segmentation,
this is one of the rare memory-management schemes that permit a finely grained
representation of access rights. The address mapping hardware, at the time of address translation,
checks the intended type of reference against the access rights for the segment in
question given in the SDT. Any mismatch will then stop the translation process, and an
interrupt to the OS is issued.
Protection can be refined further by organizing segments and processes into a hierarchy of protection rings with different levels of privilege; two basic rules then apply:
1. A program may access only data that reside on the same ring or on a ring with lower
privilege.
2. A program may call services residing on the same or a more privileged ring.
• Sharing: Shared Segments: Ease of sharing is the appealing beauty of the segmentation
approach. Shared entities are usually placed in separate dedicated segments to make sharing
flexible. A shared segment may then be mapped by the appropriate segment descriptor
tables into the logical address spaces of all processes that are authorized to use it. The intended
use of based addressing together with offsets facilitates sharing, since the logical offset of a
given shared item is identical in all processes that share it. Each process that wants to use a
shared object (segment) will have a corresponding SDT entry in its own table pointing to
the intended shared object and containing information including the base address, size, and
the access-right bits. This information is used at the time of address translation. Different
processes may have different access rights to the same shared segment. In this way, segmented
systems conserve memory by providing only a single copy of the objects shared by many
authorized users rather than having multiple copies. The participating processes that share
a specific object keep track of their own execution within the shared object by means of
their own program counters, which are saved and restored at the time of each process switch.
Sharing of segments, however, often invites some problems in systems that also support swapping.
When shared objects, or any of the participating processes authorized to reference them, are
swapped out and later swapped back in, they are placed in the currently available
locations, which may be different from their previously occupied locations. The OS should at least
keep track of the construction of the SDTs before and after swapping of both shared segments and the
processes that use them. It must ensure proper mapping of all logical address spaces of the participating
processes to the shared segments in main memory.
A brief description of this topic with a figure is given on the Support Material at www.routledge.
com/9781032467238.
5.8.2.1.2.1 Conclusion: Merits and Drawbacks Segmentation is one of the basic tools of memory
management and permits the logical address space of a single process to be broken into separate,
logically related entities (segments) that may be individually loaded into noncontiguous areas of
physical memory. The sizes of the segments are usually different, and memory is allocated to
the segments according to their sizes, thereby mostly eliminating the negative effect of internal
fragmentation. As the average segment sizes are normally smaller than the average process sizes,
segmentation can reduce the amount of external fragmentation and its bad impact as happens
in dynamically partitioned memory management. Other advantages of segmentation have also
been observed, including dynamic relocation, adequate protection both between segments and
between the address spaces of different processes, easy sharing, and sufficient flexibility towards
dynamic linking and loading.
However, one of the shortcomings of the segmentation approach is its address translation
mechanism. Logical-to-physical address translation in such systems is comparatively complex,
and, even with the dedicated hardware support it requires, it ultimately results in a substantial
reduction in effective memory bandwidth. In the absence of overlays and virtual memory,
another drawback of segmentation is that it cannot remove the limitation of the size
of a process's logical address space to the size of the available physical memory. However,
this issue has been resolved, and the best use of segmentation was realized as soon as virtual
memory was introduced in the architecture of computer systems, which will be discussed in
the following sections. Contemporary segmented architectures, implemented on the platform
of modern processors, such as Intel X-86 or the Motorola 68000 series, support segment sizes
on the order of 4 MB or even higher.
pages, and an integral number of pages is then allocated to each segment so that only those pages
that are actually needed have to be around. Paging and segmentation thus can be combined to gain
the advantages of both, resulting in simplifying memory allocation and speeding it up, as well as
removing external fragmentation. A page table is then constructed for each segment of the pro-
cess, and a field containing the address pointing to the respective page table is kept in the segment
descriptor in the SDT. Address translation for a logical address of the form (s_k, b_k) is now carried
out in two stages. In the first stage, the entry s_k in the SDT is searched, and then the address of its
page table is obtained. The byte number b_k, which is the offset within the segment, is now split into
a pair (ps_k, bp_k), where ps_k is the page number within the segment s_k, and bp_k is the byte number
(offset) within the respective page ps_k. The page table is now used in the usual way to determine the
required physical address. The effective address calculation is now carried out in the same manner
as in paging: the page frame number of ps_k is obtained from the respective page table, and bp_k is
then concatenated with it to determine the actual physical address.
Figure 5.8 shows process P of Figure 5.7 in a system that uses segmentation with paging. Each
logical address is thus divided here into three fields. The upper field is the segment number, the
middle one is the page number, and the lower one is the offset within the page. The memory map,
as shown in Figure 5.8, consists of a segment table and a set of page tables, one for each segment.
Each segment is paged independently, and the corresponding page frames again need not be contiguous.
Internal fragmentation, if it exists, occurs only in the last page of each segment. Each
segment descriptor in the SDT now contains the address that points to the respective page table
of the segment. The size field in the segment descriptor, as usual, prevents any invalid reference,
FIGURE 5.8 A schematic representation of the segmentation with paging approach used in segmented-paged memory management: each logical address consists of a segment number s_k and an offset b_k, and every segment has its own page table.
which facilitates memory protection. Different tradeoffs exist with respect to the sizes of the segment
field, the page field, and the offset field. The size of the segment field puts a limit on
the number of segments that can be declared by users within this permissible range; the page field,
which defines the number of pages within each segment, determines the segment size; and the size of
the offset field determines the page size. All three parameters are entirely determined by the design
of the operating system, which depends mostly on the architecture of the computer system being
employed.
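The two-stage computation can be sketched as below, under the assumptions that the byte-in-segment offset b_k is split by a page size of 2^12 bytes and that each segment descriptor carries a pointer to that segment's page table (the field names are illustrative only):

    #include <stdint.h>

    #define N_O 12u                              /* offset bits (assumed)   */
    #define PAGE_MASK ((1u << N_O) - 1u)

    struct seg_desc {
        uint32_t        limit;   /* segment size in bytes                   */
        const uint32_t *pt;      /* page table of this segment (frame nos.) */
    };

    /* Stage 1: use s_k to find the segment's page table and check b_k
       against the size field; stage 2: ordinary paging translation.        */
    static uint32_t sp_translate(const struct seg_desc *sdt,
                                 uint32_t s_k, uint32_t b_k) {
        const struct seg_desc *d = &sdt[s_k];
        if (b_k >= d->limit)
            return (uint32_t)-1;                 /* invalid reference       */
        uint32_t ps_k = b_k >> N_O;              /* page within the segment */
        uint32_t bp_k = b_k &  PAGE_MASK;        /* offset within that page */
        return (d->pt[ps_k] << N_O) | bp_k;      /* frame, then offset      */
    }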
The kernel itself frequently needs to allocate memory dynamically, usually in small amounts, for its
own data structures; the general-purpose allocation schemes discussed so far would be ineffective and
not appropriate for such dynamic kernel memory allocation. That is why
a separate kernel memory allocator is used, and three such kernel memory allocators are worth
mentioning:
• McKusick–Karels allocator
• Lazy buddy allocator
• Slab allocator
The details of each of these three allocation strategies, with computations, are separately given
on the Support Material at www.routledge.com/9781032467238.
• Only a part of the process, and not the entire process, gets loaded to begin its execution, which
ultimately enables main memory to accommodate many other processes, thereby eventually
increasing the degree of multiprogramming that, in turn, effectively increases processor
utilization.
• The mandatory requirement that the size of a program be restricted by the size
of the available memory has now been waived without the use of any sort of overlay strategy.
The user now perceives a potentially larger memory allocated in a specific area on the disk,
and the size of such memory is related to disk storage. The operating system, in association
with related hardware support, automatically loads parts of a process during runtime from
this area into main memory for execution as and when required, and that too without any
participation of the user or any notification to the user.
The whole responsibility is now shouldered by the operating system in order to create an illusion
to the user while providing such an extremely large memory store. Since this large memory is
merely an illusion, it is historically called virtual memory, in contrast with the main memory,
known as real memory (physical memory), which is the only place the actual execution of a
process can take place. The details of virtual memory management are generally transparent to
the user, and the virtual memory manager creates the illusion of having a much larger memory
than may actually be available. It appears as if the physical memory is stretched far beyond
the actual physical memory available on the machine. Virtual memory can thus be defined as
a memory hierarchy consisting of the main memory of the computer system and a specified area
on disk that enables the execution of a process with only some portions of its address space in
main memory. The virtual memory model loads the process components freely into any avail-
able areas of memory, likely to be non-adjacent, for execution. The user’s logical address can
then be referred to as a virtual address. The virtual address of every memory reference used by
a process would be translated by the virtual memory manager (the part of the OS responsible
for memory management) using special memory mapping hardware (a part of the MMU) into
an actual address of the real memory area where the referenced entity physically resides. This is
done on behalf of the user in a way transparent to the user. This model almost totally solves the
memory fragmentation problem since a free area of memory can be reused even if it is not large
enough to hold an entire process.
While the other memory management techniques attempted to approximate 100 percent
memory utilization, the implementation of the virtual memory concept attains a utilization logi-
cally greater than 100 percent. That is, the sum of all the address spaces of the jobs being mul-
tiprogrammed may exceed the size of physical memory. This feat is accomplished by removing
the requirement that a job’s entire address space be in main memory at one time; instead, only
portions of it can be loaded, and the image of the entire virtual address space of a process rests
on the disk. In traditional memory management schemes, the user’s logical address space starts
from 0 to a maximum of N, the size of the actual physical memory available to the user. Under
virtual memory management, the user’s logical address space starts from 0 and can extend up to
the entire virtual memory space, the size of which is decided by the machine’s addressing capa-
bility or the space available on the backing store (on the disk, to be discussed). For a 32-bit system,
the total size of the virtual memory can be 2^32 bytes, or approximately 4 gigabytes. For the newer
64-bit chips and operating systems that use 48- or 64-bit addresses, this can be even higher.
Virtual memory makes the job of the application programmer much simpler. No matter how
much memory the application needs, it can act as if it has access to a main memory of that size and
can place its data anywhere in that virtual space that it likes. Moreover, a program may run without
modifcation or even recompilation on systems with signifcantly different sizes of installed memory.
In addition, the programmer can also completely ignore the need to manage moving information
back and forth between different kinds of memory. On the other hand, the use of virtual memory can
only degrade the execution speed and not the function of a given application. This is mainly due to
extended delays caused in the address translation process and also while fetching missing portions
of a program's address space at runtime.
Brief details about this topic are given on the Support Material at www.routledge.com/
9781032467238.
FIGURE 5.9 A schematic representation of the basic operations of a virtual memory management system, showing the presently required portions of currently executing processes (A, B, and C) loaded into main memory (real memory), while the complete images of the processes reside in virtual memory (the backing store), guided by memory allocation information.
FIGURE 5.10 In a virtual memory management system, a new component is swapped into main memory from virtual memory, whenever required, by swapping out an existing component of main memory that is not in use, to make room.
and the new component brought from virtual memory can be immediately loaded in its place in
memory. In this case, the overhead involved in writing back the component on the virtual memory
is simply avoided, thereby saving considerable time, which may result in an appreciable
improvement in overall system performance.
• Paging
• Segmentation
Paging and segmentation differ both in approach (strategy) as well as in implementation, particularly
in the manner in which the boundaries and size of process components are derived. Under the paging
scheme, each process component is called a page, and all pages are identical in size. Page size is deter-
mined by the architecture of the computer system. Page demarcation in a process is implicitly also car-
ried out by the architecture. Paging is therefore invisible to the programmer. In segmentation, each process
component is called a segment. Segments are a user-oriented concept declared by the programmer to
provide a means of convenience for organization and logical structuring of programs and data for the
purpose of virtual memory implementation. Thus, identification of process components is performed by
the programmer, and, as segments can have different sizes, there exists no simple relationship between
virtual addresses and the corresponding physical addresses, whereas in paging, such a straightforward
relationship does exist. Paging and segmentation in virtual memory systems, therefore, have different
implementations for memory management and different implications for effective memory utilization.
Paging and segmentation schemes when implemented in virtual memory system give rise to two
different forms of memory management, paged virtual memory management and segmented virtual
memory management, respectively. Some operating systems exploit a mechanism that combines the segmentation and paging approaches, called segmented paged memory management, to extract the advantages of both approaches at the same time. Obviously, the address translation mechanisms
of these schemes also differ from one another and are carried out by means of either page-map
tables, segment descriptor tables, or both.
5.9.4.1 Paging
Paging is an obvious approach associated with systems providing virtual memory, although virtual
memory using segmentation is also equally popular and will be discussed later. Paging is considered
the simplest and most widely used method for implementing virtual memory; conversely, virtual memory is most often implemented using a paging approach. In simple paging (without virtual memory), as already discussed, when all the pages of a process are loaded into main memory, the respective page table for that process is created and then loaded into main memory. Each page table entry contains the page frame number of the corresponding page in main memory. The same approach is also employed for a virtual memory scheme based on paging, but with an important exception: in virtual memory systems, only some portions of the address space of the running process are present in main memory at any time, and the rest need not be in main memory. This makes the page table entries more complex than those of simple paging.
FIGURE 5.11 Format of a page table entry in a paged virtual memory system: present (P), referenced (R), and modified (M) bits, protection information, other information, and the page frame number.
As shown in Figure 5.11, the page table entry, in addition, contains a present (P) or valid bit,
indicating whether the corresponding page is present in main memory. If the bit indicates that the
page is in main memory, then the page table entry also includes the page frame number of that page.
The referenced bit (R) indicates whether the page already present in memory is referenced. The
bit R is set whenever the page is referenced (read or written). The page table entry also contains a
modify or dirty bit (M) to indicate whether the contents of the corresponding page have been altered
or modified after loading in main memory. If the contents of the page remain unchanged, it is not
necessary to write back the page frame of this page on the backing store when its turn comes to
replace the page in the frame that it currently holds. The associated overhead can then be avoided.
The protection information (Prot info) bit for the page in the page table indicates whether the page
can be read from or written into by processes. Other information (Other info) bits are kept for the
page in the page table for storing other useful information concerning the page, such as its position
in the swap spaces. In addition, other control bits may also be required in the page table for various
other purposes if those are managed at the page level.
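As a concrete, purely illustrative picture of such an entry, the C bit-field sketch below packs the bits just described into a single 32-bit word for a machine with 4 KB pages; the exact widths and field names are assumptions, since real page table entry formats vary by architecture.

/* Hypothetical page table entry for a 32-bit machine with 4 KB pages:
   a 20-bit frame number plus the control bits described in the text.
   Widths and names are illustrative only. */
struct pte {
    unsigned int present    : 1;   /* P: page is resident in main memory         */
    unsigned int referenced : 1;   /* R: set by hardware on any read or write    */
    unsigned int modified   : 1;   /* M: dirty bit, set when the page is written */
    unsigned int prot       : 2;   /* protection info, e.g. read/write access    */
    unsigned int other      : 7;   /* other info, e.g. an index into swap space  */
    unsigned int frame      : 20;  /* page frame number, valid when present = 1  */
};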
FIGURE 5.12 A schematic block diagram of the address translation mechanism used in the management of virtual memory with a paging system: the page number of the virtual address indexes the page table (located through the page table base address register in the MMU), and the page frame number obtained is combined with the offset to form the physical address in main memory.
Since the offsets of the virtual addresses being issued by the CPU are not mapped, the high-order
bits of the physical address are obtained after translation; that is, the page frame number needs to be
stored in a PMT. All other PMT entries are similarly filled with page frame numbers of the locations
where the corresponding pages are actually loaded.
The address translation mechanism in paged systems is illustrated in Figure 5.12. When a par-
ticular process is running, a register holds the starting (base) address of the page table for that cur-
rently running process. The page number (pk) of a virtual address is used to index that page table
to obtain the corresponding frame number. This is then combined (concatenated) with the offset portion (bk) of the virtual address to produce the desired real (physical) address. Note that the field containing the page number in the virtual address is generally longer than the field containing the frame number.
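The translation itself can be mirrored in software. The sketch below assumes, for illustration only, a 32-bit virtual address with 4 KB pages (a 20-bit page number and a 12-bit offset), a single-level page table of the struct pte entries sketched earlier, and no TLB; a real MMU performs these steps in hardware.

#include <stdint.h>

#define PAGE_SHIFT 12                         /* 4 KB pages           */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)   /* low 12 bits = offset */

/* Returns 0 and fills *phys on success; returns -1 on a page fault
   (present bit clear), which the OS would then have to service.      */
int translate(const struct pte *page_table, uint32_t vaddr, uint32_t *phys)
{
    uint32_t page   = vaddr >> PAGE_SHIFT;    /* index into the page table */
    uint32_t offset = vaddr & PAGE_MASK;      /* passed through unchanged  */

    if (!page_table[page].present)
        return -1;                            /* page fault                */

    *phys = ((uint32_t)page_table[page].frame << PAGE_SHIFT) | offset;
    return 0;
}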
In general, each process, even of average size, can occupy huge amounts of virtual memory,
and there is only one page table for each process. If the size of the pages is considered moderate,
a good number of page table entries are still required for each process. Consequently, the amount
of main memory devoted to page tables alone could be substantially high and may severely affect
and limit the space requirements for the execution of users’ applications. In order to overcome
this problem, most virtual memory management schemes hold page tables in virtual memory
instead of storing them in main memory. When a process is under execution, only a part of its page table, containing a few page table entries including that of the currently executing page, is made available in main memory.
now gets control and finds that the interrupt is related to a page fault. It then invokes the virtual memory handler and passes it the page number pk that caused the page fault. The virtual memory handler then consults the page table to get the other-info field (as shown in Figure 5.11) of the page table entry of page pk, which contains the disk block address of page pk. After getting this address, the virtual memory handler looks up the free-frames list to find a currently free page frame. If no such free page
frame is available, some other actions (to be discussed later) are taken so that a free page frame can
be obtained. However, it now allocates the free page frame to page pk and starts an I/O operation to
load pk in the free page frame. Note that page I/O is distinct from I/O operations performed by pro-
cesses, which are called program I/O. When the I/O operation is completed, the system updates page
pk’s entry in the page table by setting the valid bit to 1, putting the free page frame number in the page
frame # feld of page pk, and also marking the frame as being in a normal state. This ends the related
procedures to be followed when a page fault occurs that ultimately bring the page pk into memory.
The faulting instruction is now brought back to the state when the page fault occurred. All other
actions that are required after interrupt servicing are then accordingly carried out to resume the
execution of the faulting instruction, assuming that nothing has happened.
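The page-fault servicing sequence just described can be condensed into the following sketch. Every helper named here (find_free_frame, evict_some_page, swap_location_of, read_page_from_disk) is a hypothetical stand-in for whatever the real virtual memory handler provides, and the struct pte layout is the one sketched earlier.

#include <stdint.h>

/* Hypothetical helpers assumed to be provided by the rest of the OS:          */
int      find_free_frame(void);                   /* -1 if no frame is free    */
int      evict_some_page(void);                   /* runs the replacement policy */
uint32_t swap_location_of(uint32_t pk);           /* from the "other info" field */
void     read_page_from_disk(uint32_t disk_block, int frame);   /* page I/O    */

/* Sketch of servicing a page fault for page pk of the faulting process. */
void handle_page_fault(struct pte *page_table, uint32_t pk)
{
    uint32_t disk_block = swap_location_of(pk);   /* where the page rests on disk */

    int frame = find_free_frame();
    if (frame < 0)
        frame = evict_some_page();    /* make room; may write back a dirty page  */

    read_page_from_disk(disk_block, frame);       /* page I/O, not program I/O   */

    page_table[pk].frame    = frame;  /* record where the page now resides       */
    page_table[pk].present  = 1;      /* valid bit: the page is in main memory   */
    page_table[pk].modified = 0;      /* the fresh copy matches the disk image   */
    /* the interrupted instruction is then restarted on return from the fault    */
}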
A brief description of this topic is given on the Support Material at www.routledge.com/
9781032467238.
• Two-Level Paging: Under this scheme, there is a page directory (like that shown in Figure 5.19) in which each entry points to a page table. Typically, the maximum size of a page table is kept restricted to be equal to one page. This strategy is used by the Pentium processor, for example. Here, an approach with a two-level paging scheme is considered, which typically employs 32-bit byte-level addressing, with the page size assumed to be 4 Kbytes (2^12) and the virtual address space taken as 4 Gbytes (2^32). The number of pages now required to address this virtual memory is 2^20 (2^32 ÷ 2^12 = 2^20) pages. If each page table entry is taken as 4 bytes (2^2) in length, then the page table consisting of 2^20 page entries requires 4 Mbytes (2^22). The root directory is kept in one page (4 Kbytes = 2^12) with each entry 4 bytes (2^2) in length, so it consists of 2^10 entries, and each such entry points to one user page table, which again consists of 2^10 page table entries. In this way, the total of 2^20 virtual pages is mapped by the root page table with only 2^10 entries. It is to be noted that the root page table, consisting of one page, is always kept resident in main memory.
Brief details about this topic are given on the Support Material at www.routledge.com/
9781032467238.
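The arithmetic of this two-level layout (a 10-bit directory index, a 10-bit page table index, and a 12-bit offset within a 32-bit address) translates directly into a two-step table walk. The sketch below follows those numbers; the in-memory representation of the directory and page tables is an assumption for illustration.

#include <stdint.h>
#include <stddef.h>

/* 32-bit virtual address = 10-bit directory index | 10-bit table index | 12-bit offset */
#define DIR_INDEX(v) (((v) >> 22) & 0x3FFu)
#define TBL_INDEX(v) (((v) >> 12) & 0x3FFu)
#define OFFSET(v)    ((v) & 0xFFFu)

struct pte2 { uint8_t present; uint32_t frame; };   /* simplified page table entry */

/* dir[i] points to the i-th user page table (1024 entries each), or is NULL
   when that page table is itself paged out of main memory.                   */
int translate2(struct pte2 *const dir[1024], uint32_t vaddr, uint32_t *phys)
{
    struct pte2 *table = dir[DIR_INDEX(vaddr)];
    if (table == NULL)
        return -1;                 /* the page table page must be brought in first */

    struct pte2 e = table[TBL_INDEX(vaddr)];
    if (!e.present)
        return -1;                 /* ordinary page fault on the target page       */

    *phys = (e.frame << 12) | OFFSET(vaddr);
    return 0;
}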
5.9.4.1.3.1 Case Study: Two-Level Paging: VAX (DEC Systems) The 32-bit virtual addresses used in the VAX, providing a virtual space of size 2^32 = 4 gigabytes, are split into three fields, in which the high-order 2 bits (value = 00, 01, 10, 11) signify the nature of the use (user space, OS space, etc.) of the respective virtual space. With the use of the leftmost 2 bits in the virtual address, the entire virtual space is actually partitioned into (2^2 =) four sections, and the size of each section is 2^32 ÷ 4 = 2^30 = 1 GB. These sections start at 0, 1, 2, and 3 GB, with the values of the 2 high-order bits in each section being 00, 01, 10, and 11, respectively. The page size is taken as 512 (= 2^9) bytes, and hence the number of bits used to express the offset within a page is 9. The number of bits used in each entry (32 bits = 4 bytes in length) of the page table to express the number of virtual pages is, therefore, 32 − (2 + 9) = 21 bits. Hence, the number of pages present in the system is 2^21 (= 2 million) pages, and as each page entry in the page table is 32 bits (= 2^2 = 4 bytes) in length, the size of the page table ultimately equals 2^21 × 4, or 2^23 bytes (= 8 Mbytes), which is quite large. Rather than keep this huge page table in memory, designers eventually opted for a two-level page table scheme that allows user page tables to be themselves paged out when they are not currently needed.
The paging structure and address translation mechanism of the VAX are quite complicated but possess several distinct advantages. However, the scheme imposes the need for two memory references to the page tables on each user memory reference, the first to the system page table and the second to the user page table, a serious drawback owing to the repeated, time-consuming memory visits. This shortcoming has been overcome with the use of special hardware support (associative memory, to be discussed later) that enables bypassing this path most of the time, making the scheme far more attractive and also practicable.
The details of this topic with a figure are given on the Support Material at www.routledge.com/
9781032467238.
5.9.4.1.3.2 Case Study: Three-Level Paging: Sun SPARC The architecture of SPARC, a RISC
processor introduced by SUN Microsystems, uses a three-level page table to realize a three-level
paging mechanism. Under this scheme, when a process is loaded into memory, the operating system
assigns it a unique context number, similar to a process-id, which is kept reserved and fixed for the
process during its entire lifetime. A context table is built up with all the context numbers assigned to
the processes available in the system and is permanently resident in hardware. In this way, it helps to
avoid reloading the tables when switching from one process to another. The MMU chips in almost all models usually support 4096 such contexts.
When a memory reference is issued, the context number and virtual address are presented to the
MMU, which uses the context number as an index into its context table to find the top-level page table
number for that context (which is the context of currently executing process). It then uses Index1 to
select an entry from the top-level page table. The obtained entry then points to the next level page table,
and so on until the target page is found. If, during this translation process, any entry in the respective
page table is not found, the mapping cannot be completed and a page fault occurs. The running process
must then be suspended until the missing page-table page is brought in, and the affected instruction
once again can be restarted. Too many memory references to access the respective page tables (three
levels) on each user memory reference, however, make the system very slow due to repeated visits to
slower memory. Hence, to speed up the lookup, special hardware with associative memory is provided.
A brief description of this topic with a figure is given on the Support Material at www.routledge.
com/9781032467238.
5.9.4.1.3.3 Case Study: Four-Level Paging: Motorola 68030 The four-level paging scheme used by Motorola on its 68030 chip is highly flexible and sophisticated. The beauty of this scheme is that the number of levels of page tables is programmable, from 0 to 4, controlled by the operating system. Moreover, the number of bits in the virtual address to be used at each level is also programmable. The chip determines these field widths from values written to a global translation control register (TCR). In addition, since many programs use far less than 2^32 bytes of memory, it is possible to instruct the MMU to ignore the uppermost n insignificant bits. It should be noted that the operating system need not use all four levels if the job can be executed with fewer. Many other attractive features are found in this memory management scheme, but, at present, we refrain from going into any further details of its paging and associated address translation mechanism. Our sole intention is only to show that four-level paging is possible and that it is implemented in practice for commercial use. Interested readers can go through the respective manuals to get a clear understanding of its implementation and related operations.
FIGURE 5.13 A formal design approach to the inverted page table structure and the related address translation mechanism, in which a hash function applied to the page number of the virtual address (page, offset) indexes a hash table that points into the inverted page table, used in the management of virtual memory with a paging system.
Page number: This is the page number portion of the issued virtual address.
Process-id: The process that occupies this page. The combination of page number and process-id uniquely identifies a page within the virtual address space of a particular process.
Frame number: This is the page frame in memory which is owned by the particular process indicated by the page number and process-id.
Control bits: This field includes several flags, such as valid, referenced, and modified bits, as well as protection and locking information.
Chain pointer: This field is null (often indicated by a separate bit) if there are no chained entries for this entry. Otherwise, this field contains the index value of the next entry in the chain.
In Figure 5.13, each entry of the IPT contains the process-id (P) and page number (p); the pair (P, p) is then used to carry out the required address translation. P is obtained when the scheduler selects a process for execution: it copies the id of the process (P) from its PCB into a register of the MMU. The page number portion (p) is taken from the virtual address as issued. This page p is then searched for in the IPT using a hashing mechanism that generates a hash value v from the supplied page number p. This hash value is then used as a pointer into the IPT. If the IPT entry indicated by the pointer contains the page p, then this page exists, and the corresponding page frame number f present in the indicated IPT entry is copied for use in address translation. With this type of hashing, a collision may occur when more than one virtual address maps into the same hash table entry. A chaining technique (coalesced chaining) is thus used here for managing this overflow. The hashing technique used, however, normally results in chains of table entries that are typically short, rarely more than one or two entries. These table entries are individually visited by following the chain when a particular page is searched for in order to obtain the corresponding page frame number, before finally declaring a page fault. Address translation in this way is then completed by combining the frame number f thus obtained with the offset b present in the virtual address.
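A minimal software rendering of this (P, p) look-up with chained overflow handling might look like the following; the table size, the hash function, and the use of a separate hash-anchor array (as suggested by Figure 5.13) are assumptions for illustration, and real implementations differ in detail.

#include <stdint.h>

#define IPT_SIZE 4096          /* one entry per physical page frame (assumed) */
#define NO_CHAIN (-1)

struct ipt_entry {
    uint32_t pid;              /* process-id P owning this frame              */
    uint32_t page;             /* virtual page number p mapped to the frame   */
    int32_t  chain;            /* index of next entry with the same hash, or -1 */
    uint8_t  valid;
};

static uint32_t hash(uint32_t pid, uint32_t page)
{
    return (pid ^ (page * 2654435761u)) % IPT_SIZE;   /* illustrative hash */
}

/* Returns the frame number (the IPT index itself) or -1, meaning the page
   is not resident and a page fault must be raised.                         */
int ipt_lookup(const struct ipt_entry ipt[IPT_SIZE],
               const int32_t hash_anchor[IPT_SIZE],
               uint32_t pid, uint32_t page)
{
    int32_t i = hash_anchor[hash(pid, page)];    /* head of the chain */
    while (i != NO_CHAIN) {
        if (ipt[i].valid && ipt[i].pid == pid && ipt[i].page == page)
            return i;                            /* frame number == IPT index */
        i = ipt[i].chain;                        /* follow the (short) chain  */
    }
    return -1;                                   /* page fault */
}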
When a page fault occurs, the needed page is brought using a conventional page table that may
be stored on disk instead of in main memory. This overhead is possibly unavoidable given the large
size of the page table needed to handle the large amount of information in the system. An IPT is often organized with the use of associative memory to speed up the look-up operation. On a hit, the IPT is not needed. Only when a miss occurs is the page table then consulted to find a match for the virtual page as required by the issued virtual address. The hash table look-up as described can be done either in hardware or by the operating system. Since a software look-up is comparatively slow, if the look-up is done in software, care should be taken that it does not happen very often.
very often. Many systems, however, use IPTs, including the versatile IBM RS 6000 and AS 400
(now called P-series) systems. The Mach operating system on the RT-PC also uses this technique.
Variations in the approaches as well as implementations of IPTs have also been observed on the
PowerPC, UltraSPARC, and IA-64 (Intel) architectures.
A brief description of this topic is given on the Support Material at www.routledge.com/
9781032467238.
FIGURE 5.14 A schematic block diagram showing the use of a translation lookaside buffer (TLB) together with the page table, main memory, and secondary memory in the management of virtual memory with a paging system: a TLB hit yields the frame directly, a TLB miss consults the page table, and a page fault causes the needed page to be loaded from secondary memory.
a buffer and functions in the same way as a memory cache, dedicated solely to the address translation mechanism; it is called a translation lookaside buffer (TLB). Each entry in the TLB
must include the page number as well as the complete information of a page table entry. The search
is carried out to inspect a number of TLB entries simultaneously to determine whether there is a
match on page number. This technique is often referred to as associative mapping.
Given a virtual address, the processor will first inspect the TLB for the desired page table entry,
as shown in Figure 5.14. If it is present (TLB hit), then the frame number from the entry is retrieved,
and the physical address is formed in the usual manner. But, if the desired page table entry is not
found (TLB miss), then the process will use the page number portion of the virtual address to exam-
ine the corresponding page table entry. If the “present or valid bit” is set, then the page is in main
memory, and the processor can retrieve the respective frame number from the corresponding page
table entry to form the physical address. If the present bit is not set in the page table, then a page
fault will occur, the necessary actions will be taken to load the needed page in memory to resolve
the page fault, and page table updating will be carried out in the usual manner, as already discussed.
However, once the physical address is generated, if the system supports a memory cache (distinct from the TLB), the cache is then consulted to see whether the cache block containing that word is present. If so, the content of the referenced address is returned to the CPU for subsequent processing. If the cache does not contain that word (cache miss), the word is retrieved from main memory as usual.
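Putting the steps of Figure 5.14 together, the order of consultation is TLB first, then the page table, and only then the page-fault path. The small, fully associative software model below follows that order; the TLB size, the refill policy, and the reuse of the struct pte sketched earlier are all assumptions.

#include <stdint.h>

#define TLB_ENTRIES 64      /* a small, fully associative TLB (assumed size) */

struct tlb_entry { uint32_t page; struct pte e; uint8_t valid; };

/* Returns 0 and fills *phys, or -1 when the reference ends in a page fault
   that the OS must service before the instruction can be restarted.        */
int mem_reference(struct tlb_entry tlb[TLB_ENTRIES], const struct pte *page_table,
                  uint32_t vaddr, uint32_t *phys)
{
    uint32_t page   = vaddr >> 12;
    uint32_t offset = vaddr & 0xFFFu;

    for (int i = 0; i < TLB_ENTRIES; i++)           /* 1. TLB hit?            */
        if (tlb[i].valid && tlb[i].page == page) {
            *phys = ((uint32_t)tlb[i].e.frame << 12) | offset;
            return 0;
        }

    if (!page_table[page].present)                  /* 2. TLB miss: consult   */
        return -1;                                  /*    the page table;     */
                                                    /* 3. page fault if absent */
    /* refill one TLB slot (trivial policy: slot chosen by page number)       */
    struct tlb_entry *victim = &tlb[page % TLB_ENTRIES];
    victim->page  = page;
    victim->e     = page_table[page];
    victim->valid = 1;

    *phys = ((uint32_t)page_table[page].frame << 12) | offset;
    return 0;
}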
Associative memory is expensive; hence, its size is relatively small to accommodate this addi-
tional cost. It can contain only a few entries of the recently used pages referenced by a process as
well as the ones most likely to be needed in the near future. Whenever the search for a page in the
TLB fails, the hardware obtains the required entry from the page table, as already described, and stores it in the TLB. This may sometimes require displacing an existing entry from the TLB to make room for the
new one.
The presence of a TLB can accelerate (speed up) address translation, as can be shown by appropriate computations. But this average performance improvement may not be achieved in practice due to several hindrances, including a decrease in the locality of references within a process caused by many different practical factors. Moreover, when an effective multithreading approach is used in applications to yield better performance, it may also result in abrupt changes in the instruction stream, causing the application's references to spread over almost the entire address space, which eventually diminishes the locality of reference and its profitable use. As a result, with the increasing memory requirements of executing processes, and as locality of reference decreases, it is natural that, with a TLB of limited size, the hit ratio of TLB accesses tends to decline. Eventually, the presence of the TLB can itself create a potential bottleneck in performance. While the use of a larger TLB with more entries can improve TLB performance, the TLB size cannot be increased as readily as memory size, since TLB size is closely related to other aspects of hardware design, namely cache memory and main memory, as well as the number of memory accesses per instruction cycle, and increasing it may create additional issues that need to be resolved.
An alternative approach may be to use a larger page size (superpage) so that each page table entry
in the TLB can address a larger block of memory. But the use of a larger page size itself can lead to
performance degradation for many reasons (to be discussed later).
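One such "appropriate computation" is the effective memory access time. Assuming, for illustration, a one-level page table held in main memory, a TLB look-up time t_tlb, a memory access time t_mem, and a TLB hit ratio h, a hit costs t_tlb + t_mem while a miss costs t_tlb + 2*t_mem (one extra memory access to fetch the page table entry); page-fault time is ignored here.

#include <stdio.h>

/* Effective access time under the assumptions stated above:
   EAT = h*(t_tlb + t_mem) + (1 - h)*(t_tlb + 2*t_mem).        */
static double eat(double h, double t_tlb, double t_mem)
{
    return h * (t_tlb + t_mem) + (1.0 - h) * (t_tlb + 2.0 * t_mem);
}

int main(void)
{
    printf("h=0.98: %.1f ns\n", eat(0.98, 2.0, 100.0));   /* about 104 ns */
    printf("h=0.80: %.1f ns\n", eat(0.80, 2.0, 100.0));   /* about 122 ns */
    return 0;
}

With a 2 ns TLB and 100 ns memory, dropping the hit ratio from 98 percent to 80 percent raises the effective access time from about 104 ns to about 122 ns, which illustrates why a declining hit ratio hurts overall performance.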
Among many other alternatives, the use of multiple page sizes on the whole provides a reasonable level of flexibility for the effective use of a TLB. For example, program instructions in a process that
occupies a large contiguous region in the address space may be mapped using a small number of large
pages rather than a large number of small pages, while the threads, stacks, and similar other smaller
entities could then be mapped using the small page size. Still, most contemporary commercial operat-
ing systems prefer and support only one page size irrespective of the availability of hardware support.
One of the main reasons behind this is that many issues of the operating system are closely interrelated
to the underlying page size, and thus a change to multiple page sizes may be a complex proposition.
However, the TLB is still an important and most expensive component in the address translation process and is managed using probabilistic algorithms, in contrast to the deterministic mapping of the register-assisted type, as already described. The typical strategies followed for TLB management in paging systems, such as fetch, allocation, and replacement policies, are all OS issues and hence must be carefully designed, and the entries must be properly organized so as to make the best use of the limited number of mapping entries that the costly, small TLB can hold. In fact, the solutions to all these issues must be incorporated in the TLB hardware; although that makes the TLB quite machine specific, it is then managed in much the same way that any form of hardware cache usually is.
A brief description of this topic is given on the Support Material at www.routledge.com/
9781032467238.
5.9.4.1.6 Superpages
Continuous innovation in electronics technology since the 1990s has dramatically modified and enhanced the traditional design and architecture of computer systems as well as their resources, mainly increasing capability, reducing size and cost, and increasing speed. Moreover, as user density has constantly increased, introducing a diverse spectrum of application areas, the number and size of processes executed by computer systems have rapidly grown. The increase in the sizes of memory and processes has created critical problems in the profitable use of TLBs, since the size of TLBs cannot be increased equally and proportionately to the increase in memory and cache size, mainly due to cost. As a result, TLB reach, which is the product of page size and the number of entries in a TLB, has increased only marginally, but the ratio of TLB size to memory size has gone down by a factor of over 1000, which consequently lowers TLB hit ratios, thereby causing
severe degradation in the overall performance of the system. In addition, processor caches have also
become larger than TLB reach. This badly affects cache performance because access to instruc-
tions or data in a cache may be slowed due to frequent TLB misses and subsequent look-ups through
page tables. However, to mitigate all these issues, one possible way may be to use a larger page size
(superpage) so that TLB reach becomes larger. But, this approach, in turn, may invite some prob-
lems, namely larger internal fragmentation and more page I/O, apart from issues of additional cost.
A superpage is similar to a page of a process, except that its size is a power-of-two multiple of the size of an ordinary page, and its start address in both the logical and physical address spaces is aligned on a multiple of its size. In spite of having some drawbacks (already discussed in the last section), this feature increases TLB reach, which, in turn, offers a higher TLB hit ratio without expanding the size of the costly TLB.
The sizes and number of superpages to be allocated in a process are generally adapted according
to the execution characteristics of a process. The memory manager in some situations may combine
pages of a process into a superpage of appropriate size if the pages are accessed very often and sat-
isfy the contiguity requirement as well as the address alignment in the logical address space. This
action is called a promotion. For example, program instructions in a process that are often accessed and occupy a large contiguous region in the address space may be mapped using a small number of superpages rather than a large number of small pages. A promotion thus increases TLB reach and releases some of the TLB entries that were assigned to individual pages of this new superpage. On the contrary, if the memory manager ever observes that some pages in a superpage are not used regularly, it may decide to disband the superpage into its individual pages. This action is called a demotion; it enables the memory manager to release some memory space, which can then be used to load other useful pages, which may eventually reduce the page fault frequency and thereby increase the desirable hit ratio. Several processor architectures, including the Pentium, IA-64, Alpha, UltraSPARC, and MIPS 4000, support a few superpage sizes and allow a TLB entry to map either a page or a superpage.
5.9.4.1.7 Protection
Most of the issues in regard to protection in paged virtual systems have already been discussed previously in Section 5.8.2.1.1. Protection in paged virtual memory, however, is analogous to that of the logical and physical address spaces, as discussed earlier in the section on simple paging.
the page fault will be resolved in the usual way with necessary modification to the relevant entry of the respective page table.
Management of pages involved in sharing could be implemented in a better way by maintaining
the information of all shared pages in a separate shared page table and collecting all page reference
information for all shared pages to store as page entries in this table. This arrangement will facili-
tate better management of this table separately for mapping shared pages. A related version of this
technique is used in the Windows operating system.
A brief description of this topic with a figure is given on the Support Material at www.routledge.
com/9781032467238.
FIGURE 5.15 Address translation in segmentation: the segment number of the virtual address (segment #, offset) indexes the segment table, located through the segment table pointer register; each entry holds present (P) and modified (M) bits, control bits, the segment base address, its length, and other information, and the base address is added to the offset to form the main memory address, while a missing segment triggers the needed actions.
other useful information concerning the segment, such as its position in the swap spaces. In addi-
tion, other control bits may also be required in the segment table for various other purposes, such as
protection or sharing, if those are mapped and managed at the segment level.
Under segmentation, the virtual (logical) address space is inherently two-dimensional in nature. A logical address is specified as a pair (sk, bk), where sk is the segment number (name) and bk is the offset within the segment. If n bits are used to represent sk, then a process can contain a maximum of 2^n segments. If m bits are used to represent bk, then the maximum size of a segment is 2^m bytes or words. Once again, it is to be noted that the size of individual segments is not fixed; they are usually unequal but bounded by this maximum size.
When a specific process is executed, the address translation starts using a register that holds the starting address of the segment table of that particular process. The segment number (sk) present in the issued virtual address is used to index this table and look up the corresponding beginning memory address of the segment. This address is then added to the offset portion (bk) of the virtual address to produce the desired physical address, as shown in Figure 5.15. When the required segment, as indicated by the segment number (sk) in the virtual address, is not present in memory, a "missing segment" fault is raised that activates certain actions to load the required segment into memory. If sufficient free memory is available, then the needed segment is loaded. Otherwise one or more segment-out operations may have to be carried out in order to make room for the new segment before its actual loading begins. Replacement of segments is a policy decision which is carried out by memory management at the time of each replacement.
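The segment-table look-up just described, including the length (bounds) check that supports protection, can be sketched as follows; the entry layout is a simplified assumption.

#include <stdint.h>

struct seg_entry {
    uint8_t  present;   /* P: segment resident in main memory           */
    uint8_t  modified;  /* M: segment altered since it was loaded       */
    uint32_t base;      /* starting address of the segment in memory    */
    uint32_t length;    /* segment length, used for the bounds check    */
};

/* Virtual address (sk, bk): returns 0 and fills *phys, -1 on a missing-
   segment fault, -2 on an addressing (protection) violation.           */
int seg_translate(const struct seg_entry *seg_table,
                  uint32_t sk, uint32_t bk, uint32_t *phys)
{
    const struct seg_entry *s = &seg_table[sk];

    if (!s->present)     return -1;     /* "missing segment" fault       */
    if (bk >= s->length) return -2;     /* offset beyond the segment end */

    *phys = s->base + bk;               /* base address + offset         */
    return 0;
}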
Since segments do not have a fixed size, removing one segment from memory may not be sufficient for loading another segment (as opposed to paged virtual memory systems). So, several segments may need to be removed to make room for a new segment to load. Different segment sizes may be critical, and this also invites external fragmentation, which can, however, be negotiated either by means of memory compaction or by first fit/best fit strategies.
Segmented virtual memory, by virtue of its inherent two-dimensional nature, exhibits a nice fea-
ture. It permits a segment to grow or shrink dynamically in size. A segment can simply be allowed
to grow in its present location if the adjoining memory area is free. Otherwise dynamic growth of
a segment can be tackled by shifting it to a larger memory area with required relocation, thereby
releasing the memory area already occupied by it.
(virtual memory) by using a paging scheme. Under this scheme, only the needed portion of execut-
ing programs is maintained in memory in terms of pages rather than segments. The page faults are
serviced as usual as in paged virtual memory systems.
In this combined scheme, a user's address space (virtual memory) is divided into a number of segments per the choice of the programmer. Each segment, in turn, is divided into a number of fixed-size pages, each page equal in length to the page frames in main memory. In the case of a segment shorter in length than a page, the segment occupies just one page. It is important that, from the programmer's point of view, this scheme is nothing but true segmentation, and the logical (virtual) address is still of the form (sk, bk), where sk is the segment number and bk is the segment offset, which is the byte number within the segment. From the system's point of view, the segment offset bk is viewed as being split into a pair (pk, bk), where pk is the page number within the segment sk and bk is the byte number (offset) within the respective page pk. So, the generalized logical address in combined segmentation with paging systems has the form (sk, pk, bk), where the symbols have their usual significance, as already described.
Address translation in this system is carried out in two stages, as shown in Figure 5.16. In the first stage, when a particular process is in execution, a register holds the starting address of its segment table. From the virtual address as presented, the processor uses the segment number portion sk to index the process segment table to obtain the particular segment table entry that indicates the start of the respective page table (out of a number of existing page tables, one per process segment) for that segment. If the present bit in the segment table entry thus obtained is not set, the target segment is absent from real memory, and the mapping hardware generates a segment-fault exception, which is processed as usual. Otherwise, if the present bit is set, then in the second stage, the page number portion pk of the virtual address is used to index the corresponding page table (as obtained from the segment table entry) to find a specific page table entry. If the present bit in the page table entry thus obtained is set, the corresponding frame number is looked up. This frame number is then combined with the offset number bk of the virtual address to generate the real target address. If the present bit in the page table is not set, the target page is absent from real memory, and the mapping hardware generates a page-fault exception, which is processed as usual. At both stages of mapping, the length fields in the respective tables are used to confirm that the memory references of the running process are not violating the boundaries of its address spaces.
FIGURE 5.16 A representative block diagram of the address translation mechanism used in the management of virtual memory with a segmented paging system: the segment number indexes the segment table (via the segment table pointer), the page number indexes the corresponding page table (via the page table pointer), and the resulting page frame number is combined with the offset to form the physical address.
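The two-stage mapping of an address (sk, pk, bk), together with the length checks just mentioned, can be condensed into the sketch below; the table layouts and the 4 KB page size are assumptions for illustration.

#include <stdint.h>
#include <stddef.h>

struct sp_pte { uint8_t present; uint32_t frame; };
struct sp_seg { uint8_t present; uint32_t n_pages; struct sp_pte *page_table; };

/* Address (sk, pk, bk) with 4 KB pages: -1 = segment fault, -2 = page fault,
   -3 = boundary violation, 0 = success.                                      */
int sp_translate(const struct sp_seg *seg_table, uint32_t n_segs,
                 uint32_t sk, uint32_t pk, uint32_t bk, uint32_t *phys)
{
    if (sk >= n_segs)                   return -3;
    const struct sp_seg *s = &seg_table[sk];
    if (!s->present)                    return -1;   /* segment-fault exception */
    if (pk >= s->n_pages || bk >= 4096) return -3;   /* length/boundary check   */

    const struct sp_pte *p = &s->page_table[pk];
    if (!p->present)                    return -2;   /* page-fault exception    */

    *phys = (p->frame << 12) | bk;
    return 0;
}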
With both the segment table entry and the page table entry having the same usual formats, many
variations of this powerful scheme are possible depending on the types of information that are to
be included in the respective tables. While the combination of segmentation and paging is certainly
appealing, the address translation here involves two levels of indirection to complete the mapping
of each virtual address: one through the segment table and the other through the page table of the
referred segment. Hence, it requires two memory references if both tables are held in main memory,
which may eventually reduce the effective memory bandwidth by two-thirds. This may be too much
to tolerate even with all the added benefts. To help the operating system to speed up the translation
process, address translation buffers (similar to TLB, as already explained) may be employed for
both the segment and page table references. Alternatively, a single set of address translation buffers
may be exploited, with each entry containing a pair (sk, pk) and the corresponding page frame number.
Memory sharing can be achieved in the usual way as performed in a segmented virtual mem-
ory scheme. Memory protection is carried out at the level of segments by including the protection
information with each entry of the segment table. Address translation buffers may contain access
validation information as copied from the segment table entries. If the access validation could be
performed at the level of the segment, access validation at the page level is then not needed.
Further brief details on this topic are given on the Support Material at www.routledge.com/
9781032467238.
The output of the paging unit is a 32-bit real address, while the output of the segmentation unit is a 32-bit word called a linear address. If both the segmentation and paging mechanisms are used, every memory address generated by a program goes through a two-stage translation process: the segmentation unit first converts the virtual address into a linear address, and the paging unit then converts that linear address into a real (physical) address.
By using the pipeline approach, overlapping the processing in the formation of virtual, linear,
and real addresses as well as overlapping the memory addressing and fetching operation, the total
amount of delay can be once again reduced to a large extent so that the next real address is ready by
the time the current memory cycle is completed. In fact, the on-chip MMU has a segmentation unit
that processes the virtual address AV, translating it to produce a linear address N as output, which is
then fed to a paging unit that processes the linear address N to produce a real address AR.
A brief description of this topic with a fgure is given on the Support Material at www.
routledge.com/9781032467238.
TABLE 5.1
Operating System Policies on Aspects for Virtual Memory Implementation
• Page Size
• Fetch Policy: Demand paging; Prepaging; Anticipatory paging
• Placement Policy
• Replacement Policy
• Replacement algorithms: Optimal; Not-Recently-Used (NRU); First-In-First-Out (FIFO); Clock; Least-Recently-Used (LRU); Least-Frequently-Used (LFU); Page Buffering
• Working set theory
• Working set Management: Working set size (Fixed, Variable); Replacement Scope (Local, Global); Page-fault-frequency
• Cleaning Policy: Precleaning; Demand
• Load Control: Degree of multiprogramming
run. Consequently, this results in better memory utilization, thereby increasing degrees of multi-
programming, demonstrating better CPU utilization, and ultimately producing higher throughput.
Another method in this category, called clustering, brings in a few additional adjacent pages at the same time as the required one, with the expectation that the pages nearby will have a higher
probability of being accessed soon. If adjacent pages in the virtual store are kept in adjacent loca-
tions on the backing store, the cost of bringing in a group of pages might be only slightly higher than
bringing in just one. The source of this efficiency is discussed in Chapter 6, "Device Management".
However, if these additional pages are not needed, they will simply be thrown out soon, since they
have not been touched.
Prepaging: Here, pages are loaded before letting processes run. This strategy makes use of
the characteristics of most of the secondary memory devices, such as disks, in which the pages of
a process are stored contiguously. Under this strategy, these contiguous pages can be brought in
at one time rather than one at a time, thereby reducing the repeated disk access time (seek time + latency time) and making this procedure more efficient as a whole. This strategy, of course, may fail or be ineffective if most of the extra pages that are brought in are of no use. Observations on different forms of prepaging, however, do not yet confirm whether it pays off in a normal working environment.
Anticipatory fetching: This attempts to determine in advance which pages will be referenced
within a short period. Those pages will then be brought into memory before they are actually
referenced.
Advised paging: This is another method in which the process may inform the memory manager
through a service call that a particular page is about to be accessed and thus should be brought in
or that a particular page will not be used again for a long time and might as well be paged out. Such
a call would have this form: Page Advice (starting address, ending address, direction), which tells
the memory manager that the process is about to use the region in virtual space between the two
addresses, so it might swap in the relevant pages (if direction is ‘in’) or that this region will not be
accessed for a considerable period (if direction is ‘out’).
Advised paging, however, takes advantage of the programmer’s or compiler’s knowledge of
how the program works, knowledge that can be more accurate than that of the memory manager.
However, the programmer has no knowledge at all of the other processes that are competing in the
ready list, but the memory manager does know. It would be foolhardy for the memory manager to
put too much credence in advice. At best, the memory manager might turn off the reference bit for a page that the process claims will not be used for a while, making it a candidate for page-out. For
bringing in, the advice should most likely be ignored.
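As a rough real-world counterpart of the Page Advice call described above (not the book's interface), POSIX-like systems let a process pass such hints through posix_madvise(); as the text cautions, the memory manager is free to ignore the advice.

#define _POSIX_C_SOURCE 200112L
#include <stddef.h>
#include <sys/mman.h>   /* posix_madvise and its POSIX_MADV_* constants */

/* "direction = in": the region [addr, addr+len) is about to be accessed.  */
static int advise_in(void *addr, size_t len)
{
    return posix_madvise(addr, len, POSIX_MADV_WILLNEED);
}

/* "direction = out": the region will not be used for a considerable time. */
static int advise_out(void *addr, size_t len)
{
    return posix_madvise(addr, len, POSIX_MADV_DONTNEED);
}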
status of the page frame in this regard is required to be maintained, and this is done by the hardware using a modified bit (dirty bit) in the page table. When a page is loaded into memory, this bit is cleared (bit = 0) by the mapping hardware, and it is set by the mapping hardware whenever the page is written to (modified). Whenever a frame is selected for eviction by the replacement strategy, this bit provided by the hardware is consulted in order to take fitting action; otherwise all evicted pages would be unnecessarily copied to disk regardless of whether they have actually been modified, thereby having an adverse effect on the management of virtual memory and the performance of the entire system.
While it would be possible to select a page at random to replace at each page fault, better system performance can be attained if certain criteria are considered at the time of making the decision to replace a specific page. A page replacement decision is sometimes found difficult, since it involves several interrelated aspects that need to be addressed. Those are:
• How many page frames are allowed to be allocated to each active process.
• Whether the replacement should remain confned within the set of pages belonging to the
requesting process that caused the page fault or involve all the page frames in main memory,
which may sometimes increase the number of page frames already allocated to the process.
• What criteria should be used to select a particular page for replacement when a new item
is to be brought in and there is no free frame.
Out of these three aspects, the first two lie in the domain of working set management, which will
be discussed in the following subsection. The third one is, of course, concerned with replacement
policy, which is the subject of this subsection.
Replacement of a page is a policy decision which is often influenced by and based on page-reference strings (memory reference information) derived from the actual memory references made by an executing program, which can be kept track of with the assistance of the page table. The behavior of the various replacement policies can be suitably exhibited by means of page-reference strings. Most of the policies that attempt to select a page for removal should choose the page likely to have only a remote chance of being referenced in the near future. The principle of locality can be used as a guideline, revealing that there is a relationship between recently referenced history and near-future behavior. Thus, the design methodology of most of the policies relies on an objective in which future program behavior can be predicted on the basis of its past pattern. The related page-in and page-out activity in this regard is reflected in the respective page table.
Locking Page Frames: All the replacement policies have some limitations and must abide by certain restrictions at the time of selecting a page for replacement. Some of the page frames (e.g. the kernel of the OS and key control structures, I/O buffers in memory engaged in I/O operations, etc.) need to be locked in memory, and the corresponding page stored in such a frame cannot be replaced. A few other frames with time-critical aspects may also be locked into memory. All frames of these types are to be kept outside the domain of replacement activity. Locking is usually achieved by associating a bit to be set as locked (lock bit) with each such frame, and this bit can be included with each page table entry in the current page table.
A brief description of locking page frames is given on the Support Material at www.routledge.
com/9781032467238.
Shared Page Frames: In a timesharing system, it is very common that several users are running the same program (e.g. a compiler, a database, a text editor) using shared pages to avoid having multiple copies of the same page in memory at the same time. These shared pages are typically read-only pages that contain program text and not pages that contain data. A problem arises when the pages of a specific program are removed (paged out) along with the pages it is sharing, which may, in turn, ultimately cause a large number of page faults to occur when other active programs in memory require those shared pages. Likewise, when a program is completed or terminated, it is essential to inspect all the page tables to find those shared pages that are still in use by other active programs so that their disk space (virtual memory) will not be inadvertently freed. Such checking over all the page tables is usually too expensive. Hence, special data structures or additional hardware support are needed to keep track of all such shared pages.
the software under the control of the operating system (memory management). However, the
number of references made to a page or the order in which these references were made will not
be known.
The R and M bits combined can be used to build a simple but effective page replacement algorithm. When a process is started, both of these bits are initially set to 0 by the operating system for all its pages. Periodically (e.g. on each clock interrupt), the R bit is cleared (set to 0) to distinguish pages that have not been referenced recently from those that have been. At each page fault, the operating system scans all the pages and divides them into four distinct classes, numbered 0 through 3, based on the current values of the R and M bits: class 0 (R = 0, M = 0), class 1 (R = 0, M = 1), class 2 (R = 1, M = 0), and class 3 (R = 1, M = 1).
At first glance, class 1 pages seem impossible, but they occur when a class 3 page has its R bit cleared by a clock interrupt. Clock interrupts, however, do not clear the M bit, because this information is still needed to decide whether the page has to be rewritten to disk at the time of its removal, thereby saving time.
The not-recently-used (NRU) algorithm removes a page at random from the lowest-numbered non-empty class. This algorithm implicitly assumes that it is better to remove a modified page that has not been referenced in at least one clock tick (typically about 20 msec), that is, a class 1 page (R = 0, M = 1), than a clean page that is in heavy use (R = 1, M = 0).
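A compact rendering of NRU over the four (R, M) classes is sketched below; it picks the first page of the lowest-numbered non-empty class rather than a random one, and the frame-table representation is a simplifying assumption.

#include <stdint.h>

struct frame { uint8_t in_use, referenced, modified; };

/* Not-recently-used: choose a page from the lowest-numbered non-empty class.
   Returns the index of the chosen frame, or -1 if no frame is evictable.     */
int nru_select_victim(const struct frame *frames, int nframes)
{
    int victim = -1, best_cls = 4;

    for (int i = 0; i < nframes; i++) {
        if (!frames[i].in_use)
            continue;
        int cls = (frames[i].referenced << 1) | frames[i].modified;
        if (cls < best_cls) {            /* class 0 is the most attractive victim */
            best_cls = cls;
            victim = i;
            if (cls == 0)
                break;                   /* cannot do better than class 0         */
        }
    }
    return victim;
}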
The distinct advantage and possibly the main attraction of NRU is that it is easy to understand, efficient to implement, and offers a performance that, while certainly not optimal, is often adequate. Variations of this scheme are in use in different versions of UNIX.
1. If a page is frequently and continually used, it will eventually become the oldest and will
be removed even though it will be needed again immediately.
2. Some strange side effects can occur contrary to normal expectations.
3. Other algorithms have been found to be more effective.
The most noted side effect, called the FIFO anomaly or Belady's anomaly, is that under certain circumstances, adding more physical memory can result in poorer performance, when one would
expect a larger memory to result in a better performance. The actual page traces that result in this
anomaly are, of course, very rare. Nevertheless, this perverse phenomenon coupled with the other
objections noted has ultimately caused FIFO in its pure form to drop from favor.
Brief details about this section with examples and figures are given on the Support Material at www.routledge.com/9781032467238.
Brief details of this section with examples and figures are given on the Support Material at www.routledge.com/9781032467238.
Here, at each clock interrupt, the memory management scans all the frames in memory. For each page, the R bit, which is either 0 or 1, is added to the content of this additional field (counter). In effect, the counter aims to keep track of how often each page has been referenced. When it comes time to replace a page, the page with the lowest value in this field will be chosen for replacement.
One of the serious drawbacks found with both LFU and NFU in certain situations arises because they keep track of everything and never forget anything. For example, when a multi-pass compiler runs, pages that were heavily used during pass 1 may still have a high count well into later passes. In fact, pass 1 usually has the largest program and the longest execution time of all the passes, so the pages containing code for subsequent passes will always have lower counts than the pass 1 pages when they are executed. Consequently, memory management, during the execution of pass 2 or subsequent passes, will remove useful current pages instead of the pages (of pass 1) that are now no longer in use. However, NFU can be modified in several ways. A small modification to NFU gives rise to a modified algorithm known as aging, which is able to simulate LRU quite well.
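The aging idea can be made concrete as follows: at every clock tick each page's counter is shifted right by one bit and the current R bit is inserted at the left, so recent references outweigh old ones, and the victim is the page with the smallest counter. The 8-bit counter width and the data layout below are assumptions for illustration.

#include <stdint.h>

struct aged_page { uint8_t in_use, referenced; uint8_t age; /* 8-bit counter */ };

/* Called at every clock interrupt: shift the counter right and insert the
   R bit at the left, then clear R for the next interval.                   */
void aging_tick(struct aged_page *pages, int n)
{
    for (int i = 0; i < n; i++) {
        pages[i].age = (uint8_t)((pages[i].age >> 1) |
                                 (pages[i].referenced ? 0x80 : 0x00));
        pages[i].referenced = 0;
    }
}

/* On replacement, the page with the smallest age value approximates the
   least recently used page.                                               */
int aging_select_victim(const struct aged_page *pages, int n)
{
    int victim = -1;
    for (int i = 0; i < n; i++)
        if (pages[i].in_use && (victim < 0 || pages[i].age < pages[victim].age))
            victim = i;
    return victim;
}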
pages is then set to 0, indicating that these frames are now available for use at the time of page
replacement. This significantly minimizes the number of I/O operations and thereby reduces the
total amount of time required for disk handling. The powerful VAX VMS (DEC system) is a repre-
sentative operating system that uses this approach, and a few other operating systems, including the
Mach operating system (Rash 88) use this approach with a few variations.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
5.9.5.4.9 Comparison
The results of an experiment reported in Baer (1980) compare the four algorithms FIFO, Clock, LRU, and OPT and are depicted in Figure 5.17. While conducting the experiments, it was assumed that the number of pages assigned to a process is fixed. The experiments were carried out by running a FORTRAN program over 0.25 × 10^6 references, using a page size of 256 words. The experiment was run with different frame allocations of 6, 8, 10, 12, and 14 frames, and the page faults caused by these four algorithms were counted separately for each of these allocations, one at a time. Interesting situations were observed. With small allocations, the differences among the four policies were impressive, with FIFO incurring almost double the page faults of OPT. The curves show that, to realize an efficient run, we would prefer to be to the right of the knee of the curve (where the page fault rate is small) but at the same time keep the frame allocation small (as far to the left of the knee as possible). These two conflicting constraints, combined as a compromise, suggest that the most desirable, realistic mode of operation would be around the knee of the curve.
Finkel also conducted a similar experiment in order to find the performance of various replacement algorithms, using 100 pages with 10,000 synthesized page references. He employed an exponential distribution for the probability of referencing a specific page so as to approximate the principle of locality. This experiment, reported in Finkel (1988), revealed results similar to those above; its outcome confirmed that the maximum spread of page faults here was about a factor of 2. Apart from that, many other interesting and important conclusions were also derived from his experiment.
FIGURE 5.17 Graphical representation of a comparison of four popular and commonly used replacement algorithms (FIFO, CLOCK, LRU, and OPT) with a fixed number of locally allocated page frames: page faults per 1000 references plotted against the number of frames allocated (6 to 14).
during its execution inspired the working set theory to evolve, the basis of which is mainly the
following:
• An executing process prefers only a subset of its pages at any interval in time.
• The typical patterns of the memory reference of an executing process exhibit a strong rela-
tionship between the recent-past and immediate-future memory references.
• On average, the frequency with which a particular page is referenced mostly changes
slowly with time.
The notion of a working set, however, assists the virtual memory handler in deciding how many real page frames are to be allocated to a given process and which pages of the process should be resident in memory at any given time to realize satisfactory performance by the process. In fact, when a program begins its execution and starts referencing more and more new pages, it actually builds up a working set gradually. Eventually, the process, due to the principle of locality, should finally stabilize on a certain set of pages. Subsequent transient periods indicate a shift of the program from its existing locality to a new locality during its execution. It is interesting to note that during this transient phase, when the new locality has already started, some of the pages of the old locality still remain within the new evolving working set, causing a sudden increase in the size of the working set as new pages are continually referenced. Thereafter, within a short duration of time, when the pages of the old locality start to be replaced by the referenced pages of the new locality, the working set size again begins to decline and finally stabilizes when it contains only those pages from the new locality. Since the future references of the pages are not known, the working set of a program can be defined as the set of pages already referenced by the executing program during a predefined recent-past interval of time. This set is then probably able to predict the domain of future references.
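Operationally, and following the usual formulation of this model, the working set W(t, Δ) is simply the set of distinct pages among the most recent Δ references up to time t. A toy computation over a recorded page-reference string, with a hypothetical representation, is sketched below.

#include <stddef.h>
#include <stdint.h>

/* Marks in_ws[p] = 1 for every page p appearing among the most recent delta
   references refs[t - delta] .. refs[t - 1]; n_pages bounds the page numbers. */
void working_set(const uint32_t *refs, size_t t, size_t delta,
                 uint8_t *in_ws, uint32_t n_pages)
{
    for (uint32_t p = 0; p < n_pages; p++)
        in_ws[p] = 0;

    size_t start = (t > delta) ? t - delta : 0;
    for (size_t i = start; i < t; i++)      /* the last delta references */
        in_ws[refs[i]] = 1;
}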
The working-set principle is basically a guideline for two important aspects of paging systems:
allocation and replacement. It states that:
The working set of each process is thus always monitored to ensure that it should be in memory
so that the process can run without causing many faults until it moves to another execution phase
(another locality) or completes. If the available memory is not large enough to accommodate the
entire working set, the process will then naturally cause many page faults to occur and conse-
quently is said to be thrashing, ultimately requiring the system to spend most of its time shuttling
pages between main and secondary memory. However, the two requirements, as mentioned, cannot
always be religiously met due to various reasons, particularly when the degrees of multiprogram-
ming are relatively high for the sake of performance improvement. Still, many paging systems try
to keep track of each process’s working set and attempt to hold it in memory as far as possible dur-
ing the execution of the process. This approach, called the working set model (Denning, 1970), is
mainly designed to greatly reduce the page-fault rate.
Strict implementation of the working set model is a costly affair, and that is why a close approximation of a realizable definition of the working set is attempted instead. For example, when a program is under execution, after a predefined interval of time, the referenced bits of its resident pages can be recorded and then cleared. The status (settings) of those bits prior to clearing can be saved in counters or bit arrays, which may be provided with the individual page entries in the page table. Using these counters or bit arrays, a working-set approximation can be made as the list of pages that have been referenced during the recent past.
Using this working-set approach, the clock algorithm can be effectively modified to improve
its performance. In the ordinary clock algorithm, when the hand points to a page whose R bit is zero,
the page is evicted. The improvement is to check additionally whether that page is part of the
working set of the currently active process. If it is, the page is spared and kept out of eviction. This
modified algorithm is called WSClock.
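A minimal sketch of this idea is given below, assuming hypothetical frame and working-set
bookkeeping; a full WSClock implementation also consults each page's time of last use and its
modified bit, and needs a fallback victim after a complete sweep, both of which are omitted here
for brevity.

/* Sketch of a WSClock-style scan: like the clock algorithm, but a page whose
 * R bit is clear is still spared if it belongs to the working set of its
 * owning process. All structures and helpers are hypothetical. */
#include <stdbool.h>
#include <stddef.h>

struct frame {
    int  page;        /* virtual page held in this frame */
    int  owner_pid;   /* process owning the page */
    bool referenced;  /* R bit, set by hardware on access */
};

/* Placeholder membership test: a real kernel would consult per-process
 * working-set bookkeeping here; the stub keeps the sketch self-contained. */
static bool in_working_set(int pid, int page)
{
    (void)pid; (void)page;
    return false;
}

/* Select a victim frame, advancing the clock hand as usual but sparing
 * pages that are still members of a working set. */
static size_t wsclock_select_victim(struct frame *frames, size_t nframes, size_t *hand)
{
    for (;;) {
        struct frame *f = &frames[*hand];
        size_t current = *hand;
        *hand = (*hand + 1) % nframes;

        if (f->referenced)
            f->referenced = false;                    /* give a second chance, as in clock */
        else if (!in_working_set(f->owner_pid, f->page))
            return current;                           /* not referenced, not in a WS: evict */
        /* otherwise the page is in a working set: spare it and continue the sweep */
    }
}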
The working-set principle thus raises two interrelated questions:
• how many real page frames should be allocated to a given process, that is, what will be the
size of the working set, and
• which pages of the process should be resident in memory, in other words, what will be
the scope of replacement.
• Allocation Policy: Working Set Size: With the limited size of memory in the system,
the operating system must decide how much memory is to be allocated, that is, how
many pages of a particular process are to be brought into memory for smooth execution.
Several interrelated factors that influence the performance of the system are to be
considered:
• The smaller the number of page frames allocated to each process, the greater the degree of
multiprogramming, which can eventually improve the throughput of the system.
• If a relatively small number of page frames is allocated to a process, then despite the
principle of locality, a high rate of page faults may result, leading to severe degradation
in the performance of the system.
• The more page frames allocated to a process, intuitively, the fewer page faults the program
will experience, which may result in better performance. But observations reveal that,
beyond a certain number of page frames allocated to a process, any additional allocation
of frames has no noticeable effect on the page-fault rate for that process,
mainly because of the principle of locality.
Considering these and other factors and making a compromise among several conflicting
requirements, two different types of allocation policies have evolved that control the working
set size. A fixed-allocation policy offers a fixed number of frames to a process during its entire
tenure of execution. This number must be decided ahead of time; it is often decided at initial load
time and may be determined based on the size, type, and other attributes of the process or
even on advice issued from the user end. With this policy, whenever a page fault occurs and
replacement of a page is required during the execution of a process, one of the pages of that particular
faulting process must be selected for replacement to make room for the incoming page. This
policy behaves quite well with processes that exhibit a strong degree of locality, showing
an exceptionally low page-fault rate and thereby requiring only a small number of page
frames.
The other allocation policy is known as the variable-allocation policy, in which the number of
page frames allocated to a process varies during the entire tenure of its execution. This
policy is suitable for those processes that exhibit a weak form of locality during execution
and consequently tend to generate high levels of page faults, thereby continually expanding their
working set size. The variable-allocation policy appears to be the more powerful one but, on the
other hand, it often leads to thrashing by snatching page frames from other active processes,
which subsequently may suffer unnecessary extra page faults through no fault of their
own. For this reason, the operating system must repeatedly assess the behavior of the
active processes, which in turn demands adequate software intelligence to be embedded
in the operating system as extra overhead, along with the support of additional hardware
mechanisms.
However, both allocation policies as described are closely related to replacement scope.
• Replacement Scope: When a process incurs a page fault and there is no free page frame in
memory, then if the policy chooses the victim only from the resident pages of the process
that generated the page fault, it is called a local replacement policy. In this case, the algorithm
strictly obeys the fixed-allocation policy. All the replacement algorithms already
discussed in the preceding subsection use this policy. On the other hand, if the policy considers
all unlocked pages in main memory as candidates for replacement, irrespective of
which process owns the particular page to be evicted, it is known as a global replacement
policy. This algorithm implies a variable-allocation policy. The clock algorithm is commonly
implemented as a global replacement policy, since it considers all resident pages
in a single list while making its selection. The problem with this global replacement policy
(variable allocation) is the selection of a frame from all the unlocked frames in memory
using any of the replacement policies already discussed. The page selected for
replacement can belong to any of the resident processes; there is no definite guideline
to determine which process should lose a page from its working set. Consequently, the process
that loses its page may suffer a reduction in working set size that may affect it badly.
One way to combat this potential problem is to use page buffering (as already discussed).
In that case, the choice of which page to replace becomes less important, because the page
can normally be reclaimed at any instant if it is referenced before the next time a block of
pages is overwritten.
A local replacement with variable-allocation policy can alleviate the problems faced by the
global replacement strategy. The improvement that can be made over the existing
local replacement with fixed-allocation policy is to reevaluate, at specific intervals of time, the
number of page frames already allocated to a process and then change it accordingly to improve the
overall performance. However, such a periodic decision on whether to increase or decrease the working
set size of active processes requires a thorough assessment of their likely future page demands.
Such an evaluation activity is not only time-consuming, but it also makes this approach more complex
than a relatively simple global replacement policy. Still, it may yield overall better performance.
The key to the success of this strategy is to determine the working set size at any instant, which
changes dynamically, and also the timing of reevaluation to effect such changes. One particular strategy
that has received much attention in this area is the working set strategy. A true working set
strategy would be difficult to improvise and equally hard to implement for all practical purposes for
each process, but it can serve as a yardstick for comparison.
Local replacement policies tend to localize the effects of the allocation policy to each particular
process. They are easier to analyze and simple to implement with minimal overhead. But their
major drawbacks are: when the working set grows, thrashing is inevitable, even if there are plenty of
free page frames, and when the working set shrinks, local algorithms simply waste memory. If the
allocation (number of page frames) tends to be too small, there will be a high probability of an
increased page-fault rate. If the allocation tends to be unnecessarily large, the degree of
multiprogramming will be considerably reduced, thereby adversely affecting the performance of
the entire system.
Global replacement policies, on the other hand, increase the correlation and the degree of coupling
between the replacement and allocation policies. In particular, pages allocated to one process by
the allocation algorithm may be snatched away by a global replacement algorithm. Such an algorithm
is more concerned with the overall state of the system and much less interested in the behavior of
each individual process. By offering more page frames to some processes and thereby varying the
number of frames in use by a given process, global replacement may disturb the very locality behavior
on which the logic and the attributes of the replacement algorithms are based. Moreover, research
results reveal that the relationship between the frequency of page faults and the amount of real memory
(number of page frames) allocated to a program is not linear. Despite all these facts, global replacement,
not unnaturally, is still considered close to optimal.
From this discussion, the important conclusion is that each program has a certain threshold
regarding the proportion of real page frames to virtual pages. An allocation below this
threshold causes page faults to increase very quickly. At the high end, there is a certain
limit on the number of real pages above which any additional allocation of real pages results in very
little or almost no noticeable performance improvement. It is thus suggested to allocate memory in
such a way that each active program gets an amount that lies between these two extremes. In
fact, these upper and lower bounds should probably not be fixed; they are mostly program-specific
and should thus be derived dynamically on the basis of the faulting behavior of the program at the time of
its execution. It is therefore more judicious to keep track of the behavior of the active program
rather than to blindly chase an increasing degree of multiprogramming for the sake of performance
improvement. Hence, a good allocation algorithm that is stable and at the same time not inclined
toward thrashing can be designed by monitoring the page-fault rate rather than keeping track of the
working set size directly.
Thus, the proposed algorithm is based on the principle that a program that frequently experiences
a large number of page faults, that is, has a page-fault rate above some maximum threshold,
should be allocated more frames if this is possible without degrading the system, or otherwise should be
suspended to allow other active processes to run smoothly. Similarly, from a program that exhibits a
page-fault rate below some minimum threshold, a few pages may be taken away without causing any
appreciable effect. Moreover, the number of frames to be allocated may be determined by the amount of
available free memory, the priority of the process, and other similar influencing factors. However, when
designing an algorithm to implement this approach, one practical difficulty is that it requires prior
knowledge about the size of the working set of the process, in particular which specific pages the
process would really need at any instant during the course of its execution. This problem has been
addressed by a strategy known as the page-fault frequency (PFF) algorithm, which implements this
approach.
Brief details on this section with a fgure are given on the Support Material at www.routledge.
com/9781032467238.
F = 1/T
where T is the critical interpage fault time, that is, the time between two consecutive page faults of a
particular process. The corresponding critical frequency F is therefore measured in page faults per
unit time (for example, per millisecond). The operating system defines such a system-wide (or
sometimes per-process) critical page-fault frequency F.
To implement this algorithm, a reference bit (R) is associated with each page in
memory. When a page is accessed, its reference bit is set to 1. When a page fault occurs, the operating
system notes the virtual (process) time since the last page fault for that process; this can be
accomplished by maintaining a counter of page references. The operating system stores the time of
the most recent page fault in the related process control block. Now, if the amount of time since the
last page fault is less than T (= 1/F), the process is operating above the PFF threshold
F, and a new page frame is taken from the pool of free pages and added to the resident set of the
process to hold the needed page. This triggers a growth in the resident-set size. Otherwise, the process
is operating below the PFF threshold F, and a page frame occupied by a page whose referenced bit and
modified bit are not set is freed to make room for the new page. At the same time, the operating system
sweeps and resets the referenced bits of all resident pages. Pages that have not been referenced,
modified, or shared since the last sweep are then released, and these freed page frames are returned
to the pool of free pages for future use.
PFF may be implemented as a global or local policy, although in its original proposal, PFF
was described as a global policy. For the sake of completeness, this strategy must be supplemented
by some additional policies that maintain the size of the pool of free frames within specified limits,
and modified pages should be written back to disk in an appropriate manner. Moreover, if the PFF
algorithm is supplemented with page buffering, the resulting performance is
quite appreciable.
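The per-fault decision of the PFF policy can be summarized by the following self-contained C sketch;
the constant T_CRITICAL, the tick-based time unit, and the action names are illustrative assumptions
rather than values prescribed by the algorithm's original proposal.

/* Sketch of the decision taken by the page-fault frequency (PFF) policy at
 * each page fault, expressed as a pure function for clarity. */
#include <stdio.h>

#define T_CRITICAL 2000UL   /* critical interfault time T = 1/F, in process-time ticks (assumed) */

enum pff_action {
    PFF_GROW_RESIDENT_SET,   /* fault came too soon: add a frame from the free pool */
    PFF_REPLACE_AND_SWEEP    /* fault came late: reuse an idle frame, reset R bits   */
};

enum pff_action pff_decide(unsigned long now, unsigned long last_fault_time)
{
    unsigned long interfault = now - last_fault_time;
    return (interfault < T_CRITICAL) ? PFF_GROW_RESIDENT_SET
                                     : PFF_REPLACE_AND_SWEEP;
}

int main(void)
{
    /* Two example faults: one 500 ticks after the previous fault (above the
     * critical frequency), one 5000 ticks after it (below the frequency). */
    printf("%d\n", pff_decide(10500, 10000));   /* -> PFF_GROW_RESIDENT_SET */
    printf("%d\n", pff_decide(20000, 15000));   /* -> PFF_REPLACE_AND_SWEEP */
    return 0;
}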
Implementation of PFF in this manner faces several problems. First, resetting the reference bits
at the end of an interval interferes with the page replacement decisions. If a page fault occurs
in a process soon after the reference bits have been reset and the working set determined, most
pages of the process in memory will have their reference bits off, so memory management cannot
differentiate between these pages for the purpose of page replacement. The consequences are even
more severe if processes either remain blocked or do not get an opportunity to execute for an entire
interval; their allocation would then shrink unnecessarily. As a result, it becomes difficult to decide on
the appropriate size of the working set window. To alleviate this problem, an alternative is to use a
separate working set window for each process individually. But this would rather invite additional
complications in the operation of memory management and further increase its overhead; moreover,
it would not address the issue of interference with page replacement decisions. Apart from these,
there are also other problems, such as the PFF approach possibly not performing properly during the
transient periods (as already stated) when there is a shift from an existing locality to a new locality.
This problem, however, has been resolved by modifying the existing PFF approach accordingly.
More about this section is given on the Support Material at www.routledge.com/9781032467238.
drawback. Precleaning allows the writing of pages in batches, but it makes little sense to write out
hundreds or thousands of pages when the majority of them will have been modified again before they
are actually replaced and finally evicted. The precleaning action in this situation is thus redundant
and simply wastes the limited transfer capacity of the busy secondary memory with unnecessary
cleaning operations.
With demand cleaning, on the other hand, the modified page is first written back to secondary
storage before the reading in of a new page begins. Although this approach avoids the redundant
page writes of precleaning, it incurs two page transfers whenever a page fault occurs, which
keeps the faulting process blocked for a considerable duration and may therefore result
in a noticeable decrease in processor utilization.
The use of the page-buffering technique can offer a better approach to page cleaning. The
strategy to be followed is: clean (write out) only those pages that need to be replaced, but
separate the cleaning and replacement operations and treat them individually.
As usual with page buffering, the replaced pages can be placed on two lists: modified
and unmodified. The pages on the modified list can periodically be written out in batches
and moved to the unmodified list. A page on the unmodified list is either reclaimed if it is
referenced or thrown out of the list when its frame is assigned to another page. In this way,
a good number of redundant page writes can be avoided, which is a compromise between the two
extremes.
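The following C sketch illustrates this compromise with two simple lists; the list size and the
write_to_disk() stub are illustrative assumptions, and a real implementation would of course operate
on kernel frame descriptors rather than plain integers.

/* Sketch of cleaning with page buffering: replaced frames are parked on a
 * modified or an unmodified list; the modified list is flushed to disk in
 * batches, and its frames then join the unmodified (reclaimable) list. */
#include <stdio.h>

#define POOL 8

static int modified[POOL],   n_mod = 0;    /* dirty frames awaiting batch write-out */
static int unmodified[POOL], n_unmod = 0;  /* clean frames, reclaimable or reusable  */

static void write_to_disk(int frame) { printf("writing frame %d\n", frame); }

/* Called when a dirty page is replaced. */
void park_modified(int frame) { modified[n_mod++] = frame; }

/* Periodic batch clean: write all dirty frames, then move them to the
 * unmodified list, from which they can still be reclaimed if referenced. */
void flush_modified_batch(void)
{
    for (int i = 0; i < n_mod; i++) {
        write_to_disk(modified[i]);
        unmodified[n_unmod++] = modified[i];
    }
    n_mod = 0;
}

int main(void)
{
    park_modified(3);
    park_modified(5);
    flush_modified_batch();
    printf("clean, reclaimable frames: %d\n", n_unmod);
    return 0;
}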
processes to be in the system at any point in time, in many situations it essentially reduces processor
utilization by limiting process scheduling: most of the resident processes may then be blocked,
and much time will be spent as overhead to resolve the situation (possibly by swapping). On the
other hand, if the policy accepts too many processes in memory at any instant, then, on average,
the size of the resident set of each process may not be supported by the available memory, which
eventually causes frequent page faults to occur, leading to severe degradation in system performance.
It is intuitive that, as the degree of multiprogramming increases from a small value, processor
utilization is likely to increase and actually rise sharply because there are many active processes
present in the system to use the processor. But the danger of this gradual increase in the degree of
multiprogramming is that after a certain point, any attempt to further increase the degree of mul-
tiprogramming (the number of resident processes) may cause an adverse effect, mainly due to non-
availability of adequate memory space to hold the average resident set of all the active processes.
From this point onward, the number of page faults rises abruptly, and consequently processor utili-
zation then begins to fall drastically.
However, there are a number of approaches that resolve the problem of load control by determin-
ing how many active programs are to be resident in the existing system at any instant to yield the
best performance. In fact, the working set theory or the page-fault frequency algorithm (especially
the principle on which the algorithm is based) implicitly determines load control. It simply implies that
a process is allowed to continue its execution only if its resident set is sufficiently large, and that the
memory needed for the resident set of each such process should be provided for it to remain active. So
this policy automatically controls and dynamically determines the number of active processes in
the system at any point in time.
Denning and his colleagues proposed another approach known as the L = S criterion, which
adjusts the degree of multiprogramming so that the mean time between page faults (L) equals
the mean time required to process a page fault (S). Numerous performance experiments based on this
criterion reveal that this is the point at which processor utilization reaches a
maximum.
Another approach devised by Carr (1984) describes a technique adapting the clock
page replacement algorithm (Figure 5.73 on the Support Material at www.routledge.com/
9781032467238) using a global scope. This approach is based on monitoring the rate at which
the hand traverses the circular buffer of frames. If the rate of traversal is below a given lower
threshold, the degree of multiprogramming can be safely increased. On the other hand, if the
rate of traversal exceeds a given upper threshold, it indicates either a high page-fault rate or
a limited number of pages available for replacement, which implies that the existing degree of
multiprogramming is too high and should not be increased any more.
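A minimal sketch of this rate-based load control is shown below; the two threshold values and the
advice names are arbitrary illustrative assumptions, not figures taken from Carr's work.

/* Sketch of clock-hand-rate load control in the spirit of Carr's scheme: the
 * rate at which the clock hand sweeps frames is compared against two
 * thresholds to advise on the multiprogramming level (MPL). */
#include <stdio.h>

#define LOW_RATE   10.0   /* frames scanned per second: plenty of idle frames (assumed)   */
#define HIGH_RATE 200.0   /* frames scanned per second: memory is overcommitted (assumed) */

enum mpl_advice { MPL_MAY_INCREASE, MPL_HOLD, MPL_DO_NOT_INCREASE };

enum mpl_advice load_control(double hand_traversal_rate)
{
    if (hand_traversal_rate < LOW_RATE)
        return MPL_MAY_INCREASE;     /* hand moves slowly: safe to admit another process      */
    if (hand_traversal_rate > HIGH_RATE)
        return MPL_DO_NOT_INCREASE;  /* hand races: MPL is too high and may need to be reduced */
    return MPL_HOLD;
}

int main(void)
{
    printf("%d %d %d\n", load_control(4.0), load_control(50.0), load_control(400.0));
    return 0;
}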
To effect load control, the degree of multiprogramming is sometimes required to be increased
or decreased. It can be increased by simply bringing a few more processes into the existing
system and activating them. On the other hand, it is decreased using a common approach of
merely suspending one or more currently resident processes and then swapping them out of
main memory. The main categories of processes that can be suspended are: the largest process, a
frequently faulting process, the lowest-priority process, the last process activated, and the process
with the largest remaining execution time.
Like many other areas of operating system design that are mostly policy-dependent, the design
issues related to the management of virtual memory are quite complex, because most of the issues
in this area are closely interrelated to one another, and in many situations they conflict.
The policy adopted here to achieve one's goal must be carefully selected, because it requires
adequate support from both hardware and software and has a bearing on many other design factors
related to other areas of the OS. A policy employed in this area may sometimes appear attractive, but it
may have a negative impact on the operation of other areas of the OS. In addition, the design is often
influenced by the characteristics of the programs to be run on the system.
• Hierarchical (multilevel) mapping tables: UNIX uses multilevel tables that facilitate
the vital strategy of allocating noncontiguous regions for the different
segments of a particular process in virtual address space and implement the important
copy-on-write technique more efficiently. For example, a parent and a child process, while sharing a
region, need only maintain private first-level tables that contain pointers to a single shared copy of
the second-level page table in which the actual page table entries are stored.
• Page Frame Data Table: Similar to a page table, this table describes each frame of real
memory and is indexed by frame number (similar to page numbers in page tables) and
ultimately facilitates the effective operation of the page replacement algorithm. Each entry
of this page frame data table contains information such as page state, logical device, block
number, reference count, and pointer.
• Page Replacement: Different versions of UNIX use variations of the NRU page-replacement
algorithm (as discussed before) based on a global replacement policy that works better
with the maintenance of longer page-usage histories. UNIX maintains a list of free pages to
reduce the page replacement overhead and offers the first page in the free list when it needs
to load a new page. To add new pages to this free list, it exploits the clock algorithm with
one or two hands (used in UNIX SVR4) to identify the inactive eligible (not locked) pages
to be taken out of memory using their reference (R) bits. The mechanisms used to implement
the one-handed and two-handed clock algorithms in this regard are separately
described on the Support Material at www.routledge.com/9781032467238.
• Paging Daemon: This is a vital module attached to the cleaning policy adopted by the
virtual memory handler of the UNIX system. It is a background process that sleeps most
of the time but starts working when it finds that the number of frames in the free list has
fallen below a low threshold and goes back to sleep when it detects that this number exceeds
a high threshold. The monitoring that UNIX carries out on daemon operation, and the way
the paging daemon itself operates to support the page replacement activities, are quite
elaborate. The details of the working procedures that the daemon follows are
given on the Support Material at www.routledge.com/9781032467238.
• Swapping: In UNIX, swapping is delegated to a separate kernel process, a scheduler known
as the swapper, which is always process 0 in the system. Swapping from memory
to disk is generally initiated when the kernel runs out of free memory on account of various
events or on meeting specific conditions. When it is found that swap-out is necessary, the
page-out daemon activates the swapper, which, according to its own defined policy, swaps out
selected processes to produce a sufficient number of free page frames. The swapper also
periodically checks to see whether sufficient free memory is available to swap in processes
selected per the defined policy. Free storage in memory and on the swap device is always
tracked using an appropriate data structure.
• Swap-Use Table: There is one swap-use table for each swap device. Each entry in this
table corresponds to a page on the device and contains information such as the page/storage
unit number and a reference count.
• Disk Block Descriptor: This table describes the disk copy of a virtual page. There is one
entry for each page associated with a process, containing information such as the swap
device number, the device block number, and the type of storage.
• Kernel Memory Allocator: This portion was already described earlier in detail.
More details about this section with fgures are given on the Support Material at www.routledge.
com/9781032467238.
memory management with respect to two major areas, process virtual memory and kernel memory
allocation. Here, the ultimate objective is to provide a clear view of the practical issues in virtual
memory implementation rather than to study in detail the virtual memory handler of a specific
Linux version.
Virtual Memory
• Virtual Address Mapping: The design of Linux was actually intended to drive the 64-bit
Alpha processor, which provided the needed hardware support for three levels of paging. It
uses a hierarchical three-level page table structure that is platform-independent. This page
table structure consists of the following types of tables. Each individual table has a size of
one page. The three levels are:
• Page global directory: Each active process has a single page global directory, and this
directory must be resident in one page in main memory for an active process. Each
entry in this directory points to one page of the page middle directory.
• Page middle directory: Each entry in the page middle directory points to one page in
the page table. This directory may span multiple pages.
• Page table: As usual, each entry in the page table points to one virtual page of the pro-
cess. This page table may also span multiple pages.
The virtual address in Linux is thus viewed as consisting of four fields to address this three-level
page table structure; three of these are for the three levels, and the fourth one is the offset (byte number)
within a page. The leftmost field is used as an index into the page global directory. The next
field is used as an index into the page middle directory. The third field serves as an index into the page
table, and the fourth field gives the offset within the selected page frame in main memory. Sixty-four-bit
Linux also accommodates two-level hardware (e.g., the 32-bit Intel x86/Pentium processors)
by defining the size of the page middle directory as one. Note that all references to the extra level of
indirection in addressing are resolved at compilation time rather than at runtime. This not only makes
the design platform-independent, but it also means that, while running on a platform with two-level
paging hardware, the generic three-level Linux scheme does not incur any additional performance overhead.
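The following C sketch shows how such a virtual address might be decomposed into its four fields;
the bit widths chosen (three 10-bit indices over 4-Kbyte pages) are illustrative assumptions and do
not correspond to any particular Linux port.

/* Sketch of splitting a virtual address into the four Linux fields: the
 * page-global-directory, page-middle-directory, and page-table indices,
 * plus the byte offset. Bit widths are assumptions for illustration. */
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 12   /* 4-Kbyte pages */
#define PTE_BITS    10
#define PMD_BITS    10
#define PGD_BITS    10

#define MASK(bits) ((1UL << (bits)) - 1)

int main(void)
{
    uint64_t vaddr = 0x12345678ABCUL;   /* arbitrary example address */

    uint64_t offset = vaddr & MASK(OFFSET_BITS);
    uint64_t pte    = (vaddr >> OFFSET_BITS) & MASK(PTE_BITS);
    uint64_t pmd    = (vaddr >> (OFFSET_BITS + PTE_BITS)) & MASK(PMD_BITS);
    uint64_t pgd    = (vaddr >> (OFFSET_BITS + PTE_BITS + PMD_BITS)) & MASK(PGD_BITS);

    /* On two-level hardware the page middle directory collapses to a single
     * entry, so the pmd index is always 0 and the walk degenerates to two lookups. */
    printf("pgd=%llu pmd=%llu pte=%llu offset=%llu\n",
           (unsigned long long)pgd, (unsigned long long)pmd,
           (unsigned long long)pte, (unsigned long long)offset);
    return 0;
}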
Similar to UNIX, the virtual (logical) address space of a process can consist of several regions,
but each such region can have different characteristics of its own and is handled separately using
separate policies for loading and replacement of pages. A page in a zero-filled memory region is
filled with zeroes at its first use. A file-backed region assists in mapping files into memory with ease;
the page table entries of its pages point at the disk buffers used by the file system. In this way, any
update in a page of such a region is immediately reflected in the file, which allows concurrent
users to use it without delay. A private memory region is handled in a different fashion. When a
fork system call creates a new child process, this new process is given a copy of the parent's page
table. At this time, the copy-on-write policy is enforced on the pages of a private memory region:
only when a process modifies such a page is a private copy of the page made for it.
• Page Allocation: Linux uses a page size of 4 Kbytes. It uses a buddy system allocator
for speedy allocation/deallocation of contiguous page frames (for mapping of contiguous
blocks of pages) in groups of fixed sizes of 1, 2, 4, 8, 16, or 32 page frames.
The use of the buddy system allocator is also advantageous for traditional I/O operations
involving DMA that require contiguous allocation of main memory.
• Page Replacement: Linux essentially uses the clock algorithm described earlier (see Figure 5.73
on the Support Material at www.routledge.com/9781032467238), with a slight change: the
reference bit associated with each page frame in memory is replaced by an 8-bit age variable.
Each time a page is accessed, its age variable is incremented. At the same time, in the
background, Linux periodically sweeps through the global page pool and decrements the age
variable of each page while traversing all the pages in memory. As a result, the lower
the value of the age variable of a page, the higher its probability of being removed at the time of
replacement. Conversely, a larger value of the age variable of a page implies that it is
less eligible for removal when replacement is required. Thus, the Linux system implements a
form of the least frequently used (LFU) policy, already described earlier.
A Linux system always tries to maintain a sufficient number of free page frames at all times so that
page faults can be quickly serviced using one of these free page frames. For this purpose, it uses two
lists, called the active list and the inactive list, and takes certain measures to maintain the size
of the active list at about two-thirds of the size of the inactive list. When the number of free page frames
falls below a lower threshold, it executes a series of actions until a few page frames are freed. As
usual, a page frame is moved from the inactive list to the active list if it is referenced.
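A toy C sketch of this aging scheme is given below; the increment and decay amounts, the
saturation, and the victim-selection rule are simplified assumptions, since the kernel's actual
bookkeeping is considerably more involved.

/* Sketch of the age-variable scheme described above: an access raises a
 * page's 8-bit age, a periodic background sweep decays it, and pages with
 * the lowest age are the preferred replacement victims. */
#include <stdio.h>
#include <stdint.h>

#define NFRAMES 4
static uint8_t age[NFRAMES];   /* one 8-bit age variable per page frame */

void on_access(int frame)      /* called when the page is referenced */
{
    if (age[frame] < 255) age[frame]++;    /* saturate at 8 bits */
}

void periodic_sweep(void)      /* background decay over the page pool */
{
    for (int f = 0; f < NFRAMES; f++)
        if (age[f] > 0) age[f]--;
}

int pick_victim(void)          /* lowest age: least frequently used */
{
    int victim = 0;
    for (int f = 1; f < NFRAMES; f++)
        if (age[f] < age[victim]) victim = f;
    return victim;
}

int main(void)
{
    on_access(0); on_access(0); on_access(2);
    periodic_sweep();
    printf("victim frame: %d\n", pick_victim());   /* -> 1 (first frame with age 0) */
    return 0;
}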
FIGURE 5.18 A block diagram of the Windows default virtual address space used in virtual memory management
of Windows (32-bit addressing). (Among other regions, the figure shows two 64-Kbyte inaccessible areas
reserved to catch NULL-pointer and bad-pointer assignments.)
FIGURE 5.19 A representative diagram, showing two-level page table organization of Windows (32-bit
addressing) used in the management of its virtual memory.
Paging:
Windows allows a process to occupy the entire user space of 2 Gbytes (minus 128 Kbytes)
when it is created. This space is divided into fixed-size pages. But the sizes of pages may be
different, from 4 to 64 Kbytes depending on the processor architecture. For example, 4 Kbytes
is used on Intel, PowerPC, and MIPS platforms, while in DEC Alpha systems, pages are 8
Kbytes in size.
Address Translation: Windows provides various types of page table organization and
uses different page table formats for different system architectures. It uses two-level,
three-level, and even four-level page tables, and consequently the virtual addresses used
for addressing are also of different formats for using these differently organized page
tables.
FIGURE 5.20 A representative virtual (logical) address format of Windows (32-bit addressing) used in the
management of its virtual memory.
However, on the Intel x86 architecture, Windows uses a two-level page table organization, as
shown in Figure 5.19. The higher-level page table is called a page directory (PD), with a size of 1
page (4 Kbytes) that contains 1024 entries of 4 bytes each. This requires 10 bits (2^10 = 1024) in the
virtual address to identify a particular entry in the page directory. Each such entry in the PD points to
a page table (PT). The size of each page table is also 1 page, containing 1024 page table entries of
4 bytes each. This likewise requires 10 bits (2^10 = 1024) in the virtual address to identify a particular
entry in a page table. Each such entry in the PT points to a page frame in main memory. The size of each
page frame is 4 Kbytes (2^12 = 4 K); hence 12 bits are required in the virtual address to identify
information within a page frame. Each 32-bit virtual (logical) address is thus split into three components,
as shown in Figure 5.20.
At the time of translating such a 32-bit logical address, the PD index field is used to locate
the corresponding page table. The PT index field is then used to select the corresponding entry
within the page table. This entry points to a particular page frame in main memory. The byte
index is then concatenated with the address of the page frame to obtain the desired physical
address.
Each page table entry in question is 32 bits (4 bytes). Out of these 32 bits, only 20 bits are used
to identify the page frame containing a page. The remaining 12 bits are used for the following purposes:
5 bits contain the protection field, 4 bits indicate the paging file (i.e. the disk file to which
pages are written when they are removed from memory) that contains a copy of the page, and 3
bits specify the state of the page.
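The 10/10/12 split and the page-table-entry layout just described can be illustrated with the
following C sketch; the positions assumed for the protection, paging-file, and state bits within the
low 12 bits are hypothetical, since the text specifies only their widths.

/* Sketch of the 32-bit Windows/x86 translation path described above: a
 * 10-bit page-directory index, a 10-bit page-table index, a 12-bit byte
 * index, and the 20-bit frame number kept in a valid page table entry. */
#include <stdio.h>
#include <stdint.h>

#define PD_INDEX(va)    (((va) >> 22) & 0x3FF)     /* top 10 bits            */
#define PT_INDEX(va)    (((va) >> 12) & 0x3FF)     /* next 10 bits           */
#define BYTE_INDEX(va)  ((va) & 0xFFF)             /* low 12 bits            */

#define PTE_FRAME(pte)       ((pte) >> 12)          /* 20-bit page frame no.   */
#define PTE_PROTECTION(pte)  (((pte) >> 7) & 0x1F)  /* 5 bits (assumed place)  */
#define PTE_PAGING_FILE(pte) (((pte) >> 3) & 0xF)   /* 4 bits (assumed place)  */
#define PTE_STATE(pte)       ((pte) & 0x7)          /* 3 bits (assumed place)  */

int main(void)
{
    uint32_t va = 0x4A1B3C7D;                       /* arbitrary example address */
    printf("PD index=%u PT index=%u byte index=%u\n",
           (unsigned)PD_INDEX(va), (unsigned)PT_INDEX(va), (unsigned)BYTE_INDEX(va));

    uint32_t pte = (0x2F6A1u << 12) | 0x5A3;        /* example page table entry */
    uint32_t phys = (PTE_FRAME(pte) << 12) | BYTE_INDEX(va);
    printf("frame=0x%X state=%u physical=0x%08X\n",
           (unsigned)PTE_FRAME(pte), (unsigned)PTE_STATE(pte), (unsigned)phys);
    return 0;
}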
If the page is not in memory, the 20 bits of its page table entry specify the offset into the paging
file to identify the page. This address can be used to load the page into memory. If this page is a text
(code) page, a copy of it already exists in a code file. Hence, this type of page need not be included
in a paging file before it is loaded for the first time. In this case, 1 bit indicates the page protection,
and 28 bits point to a system data structure that indicates the position of the page in a file containing
the code.
Use of the page state in the page table simplifies the accounting in page handling. A page can be
in one of three states:
Standby: The page has been removed from the working set of the process to which it was
allocated, but it can be brought back (reassigned) to the process if it is referenced again.
Modified: The page is dirty, that is, it has yet to be written out.
Bad: The page cannot be accessed due to a hardware failure.
• Page Sharing: At the time of handling the sharing of pages, the pages to be shared are
represented as section objects held in a section of memory. Processes that share the section
object have their own individual view of this object. A view controls the part of the object
that the process wants to view. A process maps a view of a section into its own address
space by issuing a system (kernel) call with parameters indicating the part of the section
object that is to be mapped (in fact, an offset), the number of bytes to be mapped, and
the logical address in the address space of the process where the object is to be mapped.
When a view is accessed for the first time, the kernel allocates memory to that view unless
memory is already allocated to it. If the memory section to be shared has the attribute
based, the shared memory has the same virtual address in the logical address spaces of all
sharing processes.
Sharing of pages is supported by the copy-on-write feature. As usual, a single copy of the shared
page is used by all sharing processes until any sharing process attempts to modify it. If a process
wants to modify the shared page, then a private copy of the page is created for it. Copy-on-write
is implemented by setting the protection field of the page in the page table entry to read-only. A
protection exception is raised when a process attempts to modify it, which is resolved by the virtual
memory manager by making a private copy of the page for use by the process.
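A simplified C sketch of this copy-on-write resolution is given below; the pte structure, the share
count, and the alloc_frame() helper are hypothetical simplifications of what a real virtual memory
manager maintains.

/* Sketch of copy-on-write handling as described above: shared pages are
 * mapped read-only, and the protection fault raised on a write is resolved
 * by giving the writing process its own private copy of the page. */
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct pte {
    char *frame;        /* pointer standing in for the mapped page frame        */
    bool  writable;
    bool  copy_on_write;
    int  *share_count;  /* how many processes still map the shared frame        */
};

static char *alloc_frame(void) { return malloc(PAGE_SIZE); }   /* stand-in allocator */

/* Invoked by the protection-exception handler on a write to a COW page. */
void handle_cow_fault(struct pte *entry)
{
    if (entry->copy_on_write && entry->share_count && *entry->share_count > 1) {
        char *copy = alloc_frame();
        memcpy(copy, entry->frame, PAGE_SIZE);   /* private copy for the writer */
        (*entry->share_count)--;
        entry->frame = copy;
        entry->share_count = NULL;               /* the private copy is no longer shared */
    }
    /* Either way the page now belongs solely to this process: make it
     * writable so the faulting instruction can be retried. */
    entry->copy_on_write = false;
    entry->writable = true;
}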
Sharing of pages also uses a distinct feature which relates to a “level of indirection” to sim-
plify its implementation. Usually, all sharing processes individually include the shared pages in
their own page tables, and all these entries of shared pages in all these page tables would have
to be modified when a page is loaded or removed from memory. To avoid this complication and
to reduce the time-consuming overhead, a level of indirection is included while accessing page
table entries for shared pages. A provision of an indirection bit is kept, which is set in the page
table entries of the shared pages in each sharing process’s page table. The page table entry of the
shared page points to a prototype page entry, which points to the actual page frame containing
the page. When a shared page is loaded or removed from memory, only the prototype page entry
needs to be modified.
• Replacement Scope: Windows uses the variable allocation, local scope scheme (see
replacement scope, described earlier) to manage its resident set. As usual, when a process
is first activated, it is allocated a certain number of page frames as its working set. When
a process references a page not in main memory, the virtual memory manager resolves
this page-fault situation by adjusting the working set of the process using the following
standard procedures:
• When a sufficient amount of main memory is available, the virtual memory manager sim-
ply offers an additional page frame to bring in the new page as referenced without swap-
ping out any existing page of the faulting process. This eventually results in an increase in
the size of the resident set of the process.
• When there is a dearth of available memory space, the virtual memory manager swaps
less recently used pages out of the working set of the process to make room for the new
page to be brought into memory. This ultimately reduces the size of the resident set of
the process.
activities do not have any impact on the working of the underlying operating system and hence
are not included in the regular functioning of the operating system. Therefore, cache memory, as a
whole, is kept outside the purview of operating systems in general.
Brief details on this section, with all related issues supported by fgures, are given on the Support
Material at www.routledge.com/9781032467238.
SUMMARY
This chapter presents the principles of managing the main memory, then investigates different
forms of memory management schemes, ranging from very simple contiguous allocation of mem-
ory, including static (fixed) partition, dynamic (variable) partition, and overlay, to highly sophisti-
cated noncontiguous allocation of memory, including paging, segmentation, and segmentation with
paging. In allocating space for processes using each of these schemes, various issues relating to
management of memory were discussed. Static partition of memory, its allocation/deallocation,
and space management are relatively simple and straightforward, although they suffer from critical
internal fragmentation. Dynamic creation of partitions according to specific needs and also to elim-
inate the problem of internal fragmentation requires more complex algorithms; particularly deal-
location of partitions and coalescing of free memory are needed to combat external fragmentation.
The need for occasional compaction of memory is also a major issue in the increased time and space
complexity of dynamic partitioning. However, both static and dynamic partitioning require almost
identical hardware support to fulfill their aims. Sharing, in fact, is quite restrictive in both systems.
Since physical memory in paging as well as in segmented systems retains its linear-array orga-
nization, an address translation mechanism is needed to convert a two-dimensional virtual-page
address or virtual-segment address into its unidimensional physical equivalent. Both paging and
segmentation in dynamically partitioned memory reduce the impact of external fragmentation, and
the other advantages they mainly offer include dynamic relocation with needed dynamic linking and
binding, fine-grained protection both within and between address spaces, and ease of sharing.
The introduction of virtual memory alleviated to a large extent the main memory space scar-
city problem and also removed the restrictions on the size of address spaces of individual pro-
cesses due to limited capacity of the available physical memory. With virtual memory, paging
simplifies allocation and deallocation of physical memory. Translation of virtual to physical
addresses during runtime, usually assisted by the hardware, is used to bridge the gap between
contiguous virtual addresses consisting of pages and discontinuous physical addresses compris-
ing page frames. Virtual memory is commonly implemented using segmentation with paging
in which demand paging is used, which allows a process to run only with pages on demand.
Since the capacity of main memory is far less than that of virtual memory, it is often necessary
to replace pages from memory to make frames free for new pages. Numerous page-replace-
ment algorithms are used, out of which LRU replacement is an approximation of optimal page
replacement but is difficult to implement. However, the second-chance algorithm along with a
few others are close approximations of LRU replacement. Besides, the allocation of frames to a
process may either be fixed, typically coupled with local page replacement, or dynamic, typically
coupled with global page replacement. In addition, the working-set model approach
is introduced as a guideline that shows the minimum number of page frames required for an
executing process to continue with fewer page faults.
Kernel processes typically require memory to be allocated using pages that are physically con-
tiguous. The McKusick–Karels allocator, lazy buddy allocator, and slab allocator are only a few
methods described here that are used for this purpose in which memory wastage due to fragmenta-
tion is negligible and memory requests can also be satisfied speedily.
Last, the salient features of memory management implemented in reality by popular, commer-
cially successful operating systems such as UNIX, Linux, Windows, and Solaris are illustrated
here as case studies to explain the realization of different aspects of memory management that were
theoretically discussed.
EXERCISES
1. What are the functions performed by the memory manager? What is meant by memory
allocation? What are the physical address and logical address?
2. What is meant by address translation? What are the different situations when address
translation takes place? Explain how address translation takes place at: a. program genera-
tion time, b. loading time, c. execution time
3. Name the different memory management schemes that are commonly mentioned. What
are the parameters used while making a comparison between them?
4. State the policy-decisions that are made to guide swapping actions. “Swapping, in turn,
invites relocation”: give your comments.
5. Explain with a diagram the multiple-user fixed partition scheme of memory management
with a partition description table. Give its merits and drawbacks.
6. What are the different approaches taken by the operating system in dynamic partitioning
memory management while allocating memory to processes? Compare and contrast the
merits and drawbacks of these approaches in light of keeping track of allocated partitions
as well as free areas.
7. What is meant by fragmentation of memory? State and explain the different types of frag-
mentation observed in contiguous allocation of memory.
8. What is meant by compaction? When does it take place? What is the effect of compaction?
What is the overhead associated with compaction? Selective compaction of memory in
certain situations performs better than brute-force straightforward compaction: Give your
comments.
9. What is a bit map? How can it be used for keeping track of allocated and free space in main
memory in a dynamic partition approach? Discuss its merits and drawbacks.
10. A dynamic memory partitioning scheme is being used, and the following is the memory
configuration at any given instant:
The Ps are the processes already allocated with blocks; the Hs are the holes and are free
blocks. The next three memory requests are for 35K, 20K, and 10K. Show the starting
address for each of the three blocks using the following placement algorithms:
a. First-fit
b. Next-fit (assume the most recently added block is at the beginning of memory)
c. Best-fit
d. Worst-fit
11. What is meant by overlay? Explain with a diagram the implementation of an overlay mech-
anism. What are the responsibilities performed by an overlay supervisor? “The concept of
overlay opens a new horizon in the emergence of modern approaches in memory manage-
ment”. What are those innovative approaches? How have they been conceptually derived
from the implementation of an overlay mechanism?
12. Discuss memory fragmentation and its impact on the buddy system. Explain why free lists
in a buddy system are made to be doubly linked.
13. What are the distinct advantages that can be obtained in a power-of-two allocation system
over its counterpart buddy system?
14. Discuss the basic principle involved in noncontiguous memory allocation strategy. State
the distinct advantages that can be accrued from this strategy.
15. State and explain the salient features of a simple paging system. How is address translation
carried out in a simple paging system?
16. Suppose you have a processor that supports 64-bit address space and 4 KB frame size.
How many levels of paging do you need in that system if each page table/directory is
restricted to a single frame? You may assume each page table entry (page descriptor) is 8
bytes.
17. State and explain segmentation in light of its principles of operation. What hardware sup-
port is required to implement segmentation? What is a segment descriptor table? When is
it created? Where is it stored?
18. Why is it said that the segment numbers are visible to processes (in a segmentation scheme),
but page numbers are invisible/transparent to processes (in a paging scheme)?
19. Consider a simple segmentation system that has the following segment table:
Segment    Base    Length
0           220     500
1          2300      50
2           750     100
3          1229     570
4          1870     110
For each of these logical addresses: a. 0, 390; b. 2, 90; c. 1, 19; d. 0, 590; e. 4, 22; f. 3, 590; g. 3,
253, determine the corresponding physical address. Also check whether there is an invalid
address specification for a segment fault to occur.
20. What is demand segmentation? Why is it difficult to implement compared to demand
paging?
21. What is the difference between simple paging and virtual memory paging? What elements
are typically found in a page table entry? Briefly define each element. Define briefly the
alternative page fetch policies.
22. What is the purpose of a translation lookaside buffer (TLB)? Page tables are stored in
physical memory, which has an access time of 100 ns. The TLB can hold eight page table
entries and has an access time of 10 nanosec. During execution of a process, it is found that
85 percent of memory references are found in the TLB and only 2 percent of the references
lead to page faults. The average time for page replacement is 2 ms. Compute the average
memory access time.
23. How is a TLB different from hardware cache memory used to hold instructions/data?
24. A machine has 48-bit virtual addresses and 32-bit physical addresses. Page sizes are
8K. How many entries are needed for a conventional page table? For an inverted page
table?
25. Assume that a 32-bit address uses a two-level page table. Virtual addresses are split into a
high-order 9-bit top-level page table field, an 11-bit second-level page table field, and an
offset. How large are the pages and how many are there in the virtual address space?
26. If an instruction takes 1 microsec and a page fault takes an additional n microsec, develop
a formula for the effective instruction time if page faults occur every k instructions.
27. What different page replacement policies are commonly used? Write two page replacement
policies for virtual memory.
28. What is the relationship between FIFO and the clock page replacement algorithm? Explain
why the two-handed clock algorithm for page replacement is superior to the one-handed
clock algorithm.
29. Consider the following page reference string. 1, 2, 3, 4, 2, 1, 5, 6, 2, 1, 2, 3, 7, 6, 3, 2 Using
four frames that are all initially empty, how many page faults will occur under a. FIFO, b.
Second-chance, c. LRU, d. NRU, e. Optimal?
30. To implement the LRU approach, what are the modifications required in the page table?
Show that the LRU page replacement policy possesses the stack property. Discuss an alter-
native approach that can implement the LRU approach.
31. What is meant by thrashing? How can it affect the performance of a system? Describe a
strategy by which the thrashing can be minimized.
32. What is the difference between a working set and a resident set? Explain with reasons and
preferably with the help of examples why the working set size of a process may increase or
decrease during the course of its execution.
33. What is the advantage of a page fault frequency algorithm over the estimation of the work-
ing set using the window size? What is its disadvantage?
34. State the different types of allocation policy and different types of replacement scope. How
is the former related to the latter?
35. Is it possible to page out page tables from the kernel space to the swap device? Justify your
answer.
36. “In a virtual memory system using a segmentation-with-paging scheme, the involvement
of segmentation is limited to sharing. It does not really have any role in memory manage-
ment”. Justify and comment on this.
More questions and problems are given at the end of the chapter on the Support Material at www.
routledge.com/9781032467238.
Learning Objectives
6.1 INTRODUCTION
All types of computers include a set of I/O modules as one of their fundamental resources fol-
lowing the Von Neumann design concept. Computer devices of different classes often come
with numerous brands and models from different vendors using diverse technologies. All such
details are, however, kept hidden from the internal working of the system, and that is why it
is said that I/O is transparent with respect to brand, model, or physical device type. Various
types of such I/O (peripheral) devices in the form of I/O modules are interfaced to the main
system for migration of data from the outside world into the computer or from the computer to
the outside world. Each such I/O module is made up of its own logic that on the one hand works
in tune with the main system (CPU and main memory) and on the other hand establishes a physical
connection with the outside electro-mechanical peripheral devices to execute the physical I/O
operation.
With the continuous introduction of more advanced technology, processor speed has enormously
increased, and memory access time has decreased, mostly by the intelligent use of one, two, or
even more levels of internal cache to manage growing processor speed. But the speed of the I/O
modules has not improved to the extent (due to being mostly electro-mechanical) as to cope with
this faster processor-memory bandwidth. As a result, a severe bottleneck arises when I/O interacts
with the rest of the system, creating a strong challenge to the overall performance of the machine,
particularly in the case of the most important I/O module, disk storage. Even today, the speeds of I/O
devices themselves, and I/O speed in general, are comparatively far lower than the overall speed of
the processor–memory cluster. That is why modern computer systems provide adequate hardware
assistance in the I/O area with proper support of event management mechanisms (such as interrupts)
so that I/O operations can be overlapped with those of the CPU by allowing the CPU to continue
with its own work while I/O operations are in progress in parallel in order to reduce the impact of
slower I/O on the overall performance of the system. Along these lines, constant development in the
design of more intelligent I/O modules and their allied interfaces continue, ultimately culminating
in the introduction of a separate I/O processor to handle I/O modules on its own, thereby totally
decoupling I/O activities from the clutch of the main system. But if there is an I/O processor, many
standard issues in relation to the central processing unit, such as scheduling and interprocess syn-
chronization, are equally applicable to the I/O processor for its proper function. A large computer
system, nowadays, is thus equipped with such powerful I/O modules that an I/O module itself can
be treated as if it were a completely standalone, full-fledged computing system.
In fact, the operation and working of all these numerous types of relatively slow I/O modules, in
particular generic devices such as disks, tapes, printers, and terminals, are monitored and controlled
by the operating system with respect to managing the allocation, isolation, sharing, and dealloca-
tion of these devices according to the prescribed policies as designed. The portion of the operating
system performing all these responsibilities is known as the I/O system, device management, or
sometimes I/O management. An amazingly large proportion of the instructions in the operating
system, often 50 percent, are devoted only to handling devices. Although device management has
gradually gained considerable importance due to the continuous increase in user density (the user
interacts with the system only through I/O devices), it is still a relatively simple part of the overall
OS design, because it is essentially defined mostly by the hardware design.
Brief details on this section are given on the Support Material at www.routledge.com/
9781032467238.
FIGURE 6.1 An illustration of a representative model for connecting device I/O controllers and I/O devices with
the main system.
controller to the bus so that a device can be attached to a computer and then interoperate with other
facilities in the main system. Nearly all small computers use the single bus model, as shown in
Figure 6.1, for communication between the main system and controllers. Large systems often use
multiple buses and specialized I/O processors (I/O channels) to relieve the CPU to a great extent
from the burden of its required involvement in I/O activities.
The operation of the device controller (at one end with the device and at the other end with the main
system) is manipulated by software. The high-level software interface to a device controller is a
middle layer, the generic device controller, which defines the interaction between the software and the
controller. It states how software manipulates the hardware controller to cause the device to perform
I/O operations. This software interface generally provides a uniform abstraction of I/O devices to
software engineers. In particular, it makes every device appear to be a set of dedicated registers.
These registers are accessible either directly as part of a physical store or indirectly via I/O instruc-
tions provided by the hardware. A set of such dedicated registers is usually called an I/O port.
Controllers often incorporate a small amount of memory (hardware) called a buffer to temporarily
hold data after it is read from the device or sent from the main system for subsequent needed actions.
With its various forms, it plays a vital role (to be discussed later) in I/O operation and is mainly used
to overlap the operation of the device and that of the main system to smooth out peaks in I/O demand.
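As an illustration of this register-level view, the following C sketch models a hypothetical
memory-mapped I/O port and a programmed-I/O write to it; the register layout, the bit meanings,
and the command code are assumptions, since a real controller's data sheet defines the actual map.

/* Sketch of the "set of dedicated registers" view of a device controller
 * (an I/O port), assuming a memory-mapped controller. */
#include <stdint.h>

struct io_port {
    volatile uint32_t data;      /* data-in / data-out register              */
    volatile uint32_t status;    /* e.g. bit 0 = busy, bit 1 = error (assumed) */
    volatile uint32_t command;   /* written by the driver to start an I/O op */
};

#define DEV_BUSY 0x1u

/* Busy-wait (programmed I/O) write of one word to the device. */
static void port_write_word(struct io_port *port, uint32_t word)
{
    while (port->status & DEV_BUSY)
        ;                        /* wait until the controller is idle         */
    port->data = word;
    port->command = 0x1;         /* assumed "start output" command code       */
}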
More about this with a fgure is given on the Support Material at www.routledge.com/
9781032467238.
FIGURE 6.2 A generic representative model of an I/O system including I/O interface and external device.
as input, and the corresponding control signal is obtained as output to monitor and control the
process.
Brief details on this section with a fgure are given on the Support Material at www.routledge.com/
9781032467238.
More details about this section are given on the Support Material at www.routledge.com/
9781032467238.
More detail about this section are given on the Support Material at www.routledge.com/
9781032467238.
devices with numerous patterns of characteristics present in the system to make it convenient while
performing device-level I/O. This requires generality while handling different types of I/O devices
and is achieved by defining certain methods that treat all these devices, such as disks, tapes, printers,
and terminals, in the same uniform manner for the sake of simplicity and also freedom of handling
various errors arising from all these devices. In fact, the overall generality can be mostly realized if
the physical I/O function can be structured logically in a modular fashion. This concept is considered one
of the keys in the design of device management software, and is explained in detail later (Section 6.10).
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
1. When initiating an I/O operation, the status of the device is to be checked. The status of all devices is kept track of with a special mechanism using a database known as a unit control block (UCB) associated with each device. The module that keeps track of the status of devices is called the I/O traffic controller.
2. In a multiprogramming/multi-user environment, it decides the policy to determine which process from the waiting queue pending on a physical device gets a device, when, and for how long. A wide range of techniques is available for implementing these policies, all based on the objective of improved system performance. In general, three basic techniques are followed:
a. Dedicated: a technique whereby a device is assigned to a single process.
b. Shared: a technique whereby a device is shared by many processes.
c. Virtual: a technique whereby one physical device is simulated on another physical
device.
3. Allocating: physically assigning a device to a process.
4. Actions with regard to completion of I/O (interrupt servicing), error recovery, if there
is any, and subsequently deallocating a device. Deallocation (policy-wise and technique-
wise) may be done on either a process or a job level. On the job level, a device is assigned
for as long as the job exists in the system. On a process level, a device may be assigned for
as long as the process needs it.
In order to realize these basic functions, device management uses certain data structures that
constitute the physical input–output control system (IOCS). Those are:
The physical device table (PDT) is a system-wide data structure that contains information on all physical devices present in the system. Each row of this table contains the information on one physical device and consists of several fields; some notable ones are device address, device type, and IOQ pointer. The logical device table (LDT) is a per-process data structure that describes assignments to the logical devices used by the process. One copy of the LDT exists for every process in the system, and this copy is accessible from the PCB of the process. Both the PDT and LDT are fixed tables and are accessed by device numbers as keys. The unit (device) control block (UCB) is a data structure that represents a unit (device) in the operating system. This data structure contains all information that
describes the characteristics of the device pertaining to the generic set of I/O operations that are supported by the I/O system. When a device is configured into the system, a UCB is created and inserted into the device list. A UCB is allocated only on demand when a process initiates an I/O operation and is destroyed when the I/O operation completes or terminates. The I/O queue (IOQ) is a waiting list of all processes pending for an I/O operation on a physical device when this device is busy with some other process. This wait queue contains a pointer to the corresponding UCB that, in turn, points to the corresponding queue of process control blocks of processes waiting for device access.
The PDT, LDT, and IOQ data structures reside within the kernel. The process creates a UCB on demand in its own address space, initializes its fields, and uses some of its fields (device switch pointer, etc.) as parameters in the physical call to the device-specific I/O routines (device drivers) lying in the kernel space specified by the I/O request. The presence of the UCB in the address space of a user's process avoids many complexities, such as checking the status of an I/O operation, like "Get status info", without having to invoke the implicit system call required to enter the kernel space.
More details about all these tables with respective figures are given on the Support Material at www.routledge.com/9781032467238.
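As an illustration of these data structures, a minimal C rendering of the PDT, LDT, UCB, and IOQ might look like the sketch below. All field names and types are assumptions chosen for clarity, not the layout of any particular operating system.

#include <stdint.h>

struct pcb;                       /* process control block (defined elsewhere)        */
struct dev_ops;                   /* driver entry points (defined elsewhere)          */
struct ucb;                       /* unit (device) control block, created on demand   */

struct ioq_entry {                /* one process waiting for the device               */
    struct pcb       *process;    /* PCB of the waiting process                       */
    struct ioq_entry *next;
};

struct ioq {                      /* I/O queue: processes pending on a busy device    */
    struct ucb       *unit;       /* points to the UCB of the device                  */
    struct ioq_entry *head, *tail;
};

struct pdt_entry {                /* one row of the system-wide physical device table */
    uint16_t    device_address;
    uint8_t     device_type;      /* e.g., disk, tape, printer, terminal              */
    struct ioq *ioq;              /* IOQ pointer                                      */
};

struct ldt_entry {                /* per-process logical device table entry           */
    int      logical_device;      /* logical device number used by the process        */
    uint16_t physical_device;     /* index into the PDT                               */
};

struct ucb {                      /* describes one device for the generic I/O set     */
    uint16_t              physical_device;  /* key into the PDT                       */
    const struct dev_ops *device_switch;    /* pointer to the driver entry points     */
    uint32_t              status;           /* free / busy / error                    */
};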
to computers is one of the grand challenges of OS designers and developers. Each brand of disk, channel, tape, or communication device has its own control protocol, and new devices with diverse characteristics are frequently introduced in the market. It is thus essential to reduce this variety to some sort of order; otherwise, whenever a new device is to be attached to the computer, the operating system (device management) would have to be completely reworked to accommodate it. To get rid of this problem, it is required to organize the physical I/O function into a regular structure, at least at some level inside the kernel, that can work well across a wide range of different classes of devices. The device-specific parts of device control can then be sequestered into well-defined modules. This idea eventually gave rise to the concept of organizing the physical I/O function into an appropriate logical structure.
The concept of hierarchical-level structuring as implemented in the design of an operating system (Chapter 3) can also be similarly mapped onto the design of device management (I/O software) to realize the desired I/O functions. The ultimate objective of implementing this approach in the design of device management is to organize the software by decomposing it into a series of layers in which each layer in the hierarchy is entrusted with performing a small, manageable subfunction of the whole. Each such layer has its own form of abstractions and relies on the next lower layer, which performs more primitive functions, but it hides the details of these functions and the peculiarities of the hardware associated with the next lower layer. At the same time, it offers services to the next higher layer, and this upper one is more concerned with presenting a relatively clean, regular interface to the users. In conclusion, this layered concept in the design and subsequent realization of device management still nicely matches the environment and fulfills its primary requirement of interacting at one end directly with the computer hardware while communicating with the user processes through its other end. However, the number of layers involved in the organization used by most device management systems (not all) may vary from one device to another depending on the class of device and the application. Three such representative classes of devices present in the system are local peripheral devices (console, printer, etc.), communication devices (involved in network architecture using the ISO-OSI model or TCP/IP), and file-based secondary devices (storage devices supporting the file system). The layering in the design of the device management of each of these classes is obviously different.
Brief details on different types of layering in the organization of these three different classes of devices with respective figures are given separately on the Support Material at www.routledge.com/9781032467238.
FIGURE 6.3 Pictorial representation of layered structure of the generic Device Management system (I/O sys-
tem) and the main function of each layer.
software and even runs outside the kernel. For example, when a user program attempts to read a block from a file, device management (the operating system) is invoked to carry out the operation. The device-independent software first searches the cache, for example, to find whether the block is available. If the desired block is not present in the cache, it calls the device driver, which, in turn, issues the appropriate request to the hardware. The process is then blocked (sleeps) until the related disk operation is completed.
When the related disk operation is over, the device hardware generates an interrupt. The interrupt handler takes control and runs the respective interrupt service routine to resolve the interrupt: it extracts the status from the device, checks whether the related disk operation completed successfully, and if so, finally wakes up the sleeping process to finish the I/O request and let the user process continue again. We now discuss each individual layer separately, as shown in Figure 6.3, one after another starting from the bottom, and show how the various layers of device management fit together.
• to determine the I/O command or create the channel program in order to perform the
desired I/O operation,
• to initiate I/O to the particular device through the respective I/O modules,
• to handle the interrupts arriving from the device, and
• to optimize its performance.
In summary, from the user's end, it appears that the device driver performs the entire physical I/O.
The details of I/O operations executed by the device driver with a figure are given on the Support Material at www.routledge.com/9781032467238.
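The blocking read path outlined above (device-independent layer, device driver, interrupt handler) can be summarized in the following sketch. Every type and helper used here (cache_lookup, driver_start_read, sleep_on, wakeup, and so on) is a hypothetical placeholder rather than the API of any real kernel.

/* Hypothetical types and helpers; the names are placeholders only. */
struct device;
struct block { unsigned char *data; int status; void *wait_queue; };
enum { IO_OK = 0 };

struct block *cache_lookup(struct device *dev, unsigned long block_no);
struct block *cache_alloc(struct device *dev, unsigned long block_no);
void driver_start_read(struct device *dev, unsigned long block_no, unsigned char *buf);
void sleep_on(void *wait_queue);
void wakeup(void *wait_queue);
int  read_device_status(struct device *dev);
struct block *current_request_of(struct device *dev);
void retry_or_report_error(struct device *dev, struct block *b);

/* Device-independent layer: satisfy the request from the cache if possible;
 * otherwise ask the driver to start the transfer and sleep until it is done. */
struct block *read_block(struct device *dev, unsigned long block_no)
{
    struct block *b = cache_lookup(dev, block_no);
    if (b != 0)
        return b;                              /* hit: no physical I/O needed    */

    b = cache_alloc(dev, block_no);            /* miss: get an empty cache slot  */
    driver_start_read(dev, block_no, b->data); /* driver programs the controller */
    sleep_on(b->wait_queue);                   /* block until the transfer ends  */
    return b;
}

/* Interrupt side: runs when the controller signals completion of the transfer. */
void disk_interrupt_handler(struct device *dev)
{
    struct block *b = current_request_of(dev);

    b->status = read_device_status(dev);       /* extract status from the device */
    if (b->status == IO_OK)
        wakeup(b->wait_queue);                 /* let the sleeping process go on */
    else
        retry_or_report_error(dev, b);         /* error recovery path            */
}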
The initialization code is run when the system is bootstrapped. It scans and tests for the physical presence of the devices and then initializes them. The upper part of the device driver (API) implements functions for a subset of entry points (as shown in Table 6.1). This part of the code also provides information to the kernel (the lower part shown in Figure 6.10 on the Support Material at www.routledge.com/9781032467238) as to which functions are implemented. The device interrupt handler is called by the system interrupt handler that corresponds to the physical device causing the interrupt.
The devices and the respective drivers are usually installed by system administrators following certain steps. The information required at the time of installing a driver can be incorporated into a configuration file and then processed by the configuration builder tool, /etc/config, to build a makefile capable of building the kernel.
Table 6.1 containing BSD UNIX driver entry points is given on the Support Material at www.
routledge.com/9781032467238.
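To illustrate the entry-point idea (the actual BSD UNIX entry points are listed in Table 6.1 on the Support Material), a driver can be represented as a table of function pointers through which the kernel calls it. The structure and names below are a simplified assumption modeled loosely on the classic character-device switch, not the exact BSD definition.

/* Simplified illustration of a character-device switch (entry-point table). */
struct cdev_entry_points {
    int (*d_open) (int minor, int flags);
    int (*d_close)(int minor, int flags);
    int (*d_read) (int minor, char *buf, unsigned len);
    int (*d_write)(int minor, const char *buf, unsigned len);
    int (*d_ioctl)(int minor, unsigned long cmd, void *arg);
};

/* A hypothetical line-printer driver fills in only the entry points it supports. */
static int lp_open (int minor, int flags)                     { return 0; }
static int lp_close(int minor, int flags)                     { return 0; }
static int lp_write(int minor, const char *buf, unsigned len) { return (int)len; }

static const struct cdev_entry_points lp_entry_points = {
    .d_open  = lp_open,
    .d_close = lp_close,
    .d_read  = 0,            /* not implemented: the kernel returns an error */
    .d_write = lp_write,
    .d_ioctl = 0,
};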
Uniform naming of devices helps to identify the particular physical device and locate the respective driver. Symbolic device names may be used as parameters when invoking the respective drivers as well as the specific operation (read or write) to be carried out on the device.
Brief details on the different functions of this software are given on the Support Material at www.
routledge.com/9781032467238.
• Input buffering: Pre-fetching an input record by having the input device read information
into an I/O buffer before the process requests it, or
• Output buffering: Post-writing an output record from an I/O buffer to the I/O device while
the process continues its own execution.
Input buffering can be started to read the next record into an I/O buffer sometime before the record is needed by the process. This may be carried out while the CPU (or IOP) is processing the previous
record. By overlapping a part of I/O operation time with CPU (or IOP) processing time, the actual
time needed for completion of an I/O operation would be comparatively less and lead to less wast-
age. In the case of output buffering, the record to be written is simply copied into an I/O buffer when
the process issues a write operation. Actual output can be conveniently performed from this I/O
buffer to the destined location sometime later. It can also be overlapped with processing or a part of
processing for the next record.
The use of two system buffers, called double buffering, in place of a single buffer in the I/O operation can make the buffering mechanism even more efficient and effective, although there exists a distinct difference between the operation of a double buffer and that of a single buffer. But sometimes it is found that even double buffering is not enough to reasonably reduce the problem of I/O waiting, especially in situations when the process enters rapid I/O bursts. That situation requires more than two buffers to smooth out I/O operation, and that is why multiple buffers are employed to handle this environment.
While the buffering technique intuitively always attempts to smooth out peaks in I/O demand, the effect of buffering on performance depends mostly on the process's characteristics. In the case of an I/O-bound process, use of buffers often yields substantial performance improvement, whereas for a CPU (compute)-bound process, the situation is exactly the reverse. However, in either case, no amount of buffering is adequate to allow an I/O device to keep pace with a process indefinitely, particularly in situations when the average demand of the process becomes greater than the I/O device can support. Nevertheless, in a multiprogramming environment, it has generally been observed that, within the system, there is a nearly homogeneous mixture of various I/O-bound and CPU-bound processes to service. The buffering technique thus emerges as a powerful tool that can ultimately increase the performance of each individual process and thereby enhance the efficiency of the operating system as a whole.
Different forms of I/O buffering and their operations with figures are given on the Support Material at www.routledge.com/9781032467238.
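A minimal sketch of double buffering on the output side is shown below: while the device drains one buffer, the process keeps filling the other, and the roles are swapped when the buffer being filled becomes full. The device helpers (device_write_async, device_wait) are assumptions used only for illustration.

#include <stddef.h>
#include <string.h>

#define BUF_SIZE 4096

/* Hypothetical device helpers: start an asynchronous write, wait for it to finish. */
void device_write_async(const char *buf, size_t len);
void device_wait(void);

static char buffer[2][BUF_SIZE];   /* the two system buffers           */
static int  filling = 0;           /* index of the buffer being filled */

/* Called by the process for each output record. */
void buffered_write(const char *record, size_t len)
{
    static size_t used = 0;

    if (used + len > BUF_SIZE) {                    /* current buffer is full            */
        device_wait();                              /* previous transfer must be over    */
        device_write_async(buffer[filling], used);  /* start draining this buffer        */
        filling = 1 - filling;                      /* switch to the other buffer        */
        used = 0;                                   /* process keeps computing meanwhile */
    }
    memcpy(buffer[filling] + used, record, len);
    used += len;
}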
6.13 CLOCK
In computer systems, the timing order in which events are to happen is critical, and the hardware in
computers uses clocks that are different from usual clocks to realize synchronization. Clocks, also
called timers, mainly prevent one process from monopolizing the CPU, among other things. The
clock software generally takes the form of a device driver, even though a clock is neither a block device, like a disk, nor a character device, like a terminal or printer. A clock in this context is a circuit that emits a series of pulses with a precise pulse width and a specified interval between consecutive pulses. The interval between corresponding edges of two consecutive pulses is called the clock cycle time. Pulse frequencies are commonly between 1 and 100 MHz, corresponding to clock cycles of 1000 nsec down to 10 nsec. To achieve high accuracy, the clock frequency is usually controlled by a crystal oscillator. The existing clock cycle can also be divided into subcycles. A common way of providing finer resolution than the basic clock is to tap the primary clock line and insert a circuit with a known delay in it, thereby generating a secondary clock signal that is phase-shifted from the primary (basic clock) one.
The advantage of a programmable clock is that its interrupt frequency (the time interval between two consecutive interrupts) can be controlled by software. If a 1-MHz crystal is used, then the counter will be pulsed every microsecond. With a 16-bit register, interrupts can be programmed to occur at intervals ranging from 1 to 65,535 microseconds. Programmable clock chips, however, usually contain two or three independently programmable clocks and have many other options as well.
To implement a time-of-day clock, the software asks the user for the current time, which is then
translated into the number of clock ticks. At every clock tick, the real time is incremented by one
count. To prevent the current time from being lost when the computer’s power is turned off, some
computers store the real time in a special register powered by a battery (battery backup).
The hardware design of a clock with a figure is given on the Support Material at www.routledge.com/9781032467238.
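The arithmetic of programming such a counter, together with the tick handling of a software time-of-day clock, can be sketched as follows. The crystal frequency and the 16-bit register width follow the figures quoted above, while the register-loading interface is a hypothetical assumption.

#include <stdint.h>

#define CRYSTAL_HZ 1000000u      /* 1-MHz crystal: counter pulsed every microsecond */

/* Hypothetical hardware interface: load the 16-bit down-counter of the clock chip. */
void clock_load_counter(uint16_t count);

/* Program the chip for a periodic interrupt every 'interval_us' microseconds
 * (1 to 65,535 with a 16-bit register, as discussed in the text).            */
void clock_set_interval(uint32_t interval_us)
{
    uint32_t count = interval_us * (CRYSTAL_HZ / 1000000u);
    if (count == 0 || count > 0xFFFFu)
        count = 0xFFFFu;                  /* clamp to what the register can hold */
    clock_load_counter((uint16_t)count);
}

/* Software time-of-day clock: the real time advances by one count per tick. */
static volatile uint64_t ticks_since_boot;
static uint64_t boot_time_in_ticks;       /* set once from the time supplied at boot */

void clock_interrupt_handler(void)
{
    ticks_since_boot++;
}

uint64_t time_of_day_in_ticks(void)
{
    return boot_time_in_ticks + ticks_since_boot;
}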
Discussions on each mentioned point with figures are given on the Support Material at www.routledge.com/9781032467238.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
FIGURE 6.4 A representative scheme showing data layout arrangement in magnetic disk (hard disk).
Higher disk capacities are obtained by mounting many platters vertically, spaced at a definite distance from each other on the same spindle, to form a disk pack. The disk pack is provided with a single movable arm with many fixed branches. The number of branches present in such an arm depends on the number of platters available in the disk pack. Each branch of the arm holds two read/write heads, one for the lower circular surface of a platter and the other for the upper circular surface of the next platter of the disk pack. Because of engineering economics, the read/write heads attached to the different branches of the single movable arm are all positioned over the identically numbered track on the different platter surfaces (except possibly the topmost and the bottommost ones) while
the media rotates under/over the heads. All these identically positioned tracks (both top and bottom
surface of each platter) on different platters’ surfaces together constitute what is called a cylinder
of the disk. This notion of a cylinder is an important property in the I/O operation of a disk pack. If
the records can be organized to be placed in the same cylinder (identical tracks on different platters),
the I/O operations on such records can then immediately be carried out without further movement
of the heads, thereby saving a substantial amount of time needed for the mechanical movement of
the heads for the correct placement. The disk pack can now be looked upon as consisting of a set of
concentric cylinders between the innermost cylinder, which consists of the innermost tracks of all
surfaces, and the outermost cylinder, which consists of the outermost tracks of all surfaces.
A physical disk record is normally stored in the disk pack on one track in one sector on one surface of the disk or may be extended onto other sectors that may be on the same surface, on the other side of the platter, or even on different platters, depending on the nature of formatting, which does sector numbering (or interleaving). A record's address is usually specified as (cylinder number, surface number, record number), and the commands supported by a disk device are read/write [record-address, addr (I/O-area)] and seek (cylinder number, surface number).
More about this topic with related figures is given on the Support Material at www.routledge.com/9781032467238.
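For illustration, a record address of the form (cylinder number, surface number, record number) can be flattened into a single sequential block number once the geometry of the pack is known; the geometry constants below are arbitrary assumptions.

/* Assumed geometry of a hypothetical disk pack. */
#define CYLINDERS          200   /* concentric cylinders (tracks per surface) */
#define SURFACES            10   /* recording surfaces (heads)                */
#define RECORDS_PER_TRACK   32   /* sectors (records) per track               */

struct disk_address {
    int cylinder;                /* which cylinder (0 .. CYLINDERS-1)         */
    int surface;                 /* which surface within the cylinder         */
    int record;                  /* which sector on that track                */
};

/* Map (cylinder, surface, record) to a sequential block number, numbering all
 * records of a cylinder before moving to the next cylinder, so that records
 * placed in the same cylinder stay contiguous and need no further seek.      */
long to_block_number(struct disk_address a)
{
    return ((long)a.cylinder * SURFACES + a.surface) * RECORDS_PER_TRACK + a.record;
}

struct disk_address to_disk_address(long block)
{
    struct disk_address a;
    a.record   = (int)(block % RECORDS_PER_TRACK);
    a.surface  = (int)((block / RECORDS_PER_TRACK) % SURFACES);
    a.cylinder = (int)(block / ((long)RECORDS_PER_TRACK * SURFACES));
    return a;
}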
Each step will consume a certain amount of time (wait time), except for the first one if the disk is readily available. The total duration of time consumed by steps 2 to 5 is considered device busy time. However, the actual time consumed to bring the disk to a position to begin the data transfer operation in any disk I/O is the sum of the times consumed by steps 3 and 4, and this is mandatory for any disk I/O operation to begin. The total time consumed by any disk I/O operation is essentially the sum of the seek time, the rotational delay, and the transfer time, described below:
The wait time in step 3 is the time required by the movable-head disk system to position the head at
the right track to perform the needed data transfer. This is known as seek time. When step 3 is over,
that is, the track is selected, and the head is positioned on the right track, step 4 starts. The wait time
in step 4 is the time consumed by the disk controller until the target sector (which may be on the
wrong part of the moving track) arrives just under the head so that data transfer can start. The aver-
age time needed for this movement to occur is called the rotational delay or rotational latency.
After these two mandatory times (steps 3 and 4), together known as the access time, have elapsed, physical data transfer can begin from the current position of the disk and the head. The data transfer now starts (step 5) as the sector moves serially under the head. The time required for the transfer of all the needed information is the transfer time. This transfer time varies depending on the amount of information to be transferred (say, 40 bytes or 15 words) as well as on the data transfer rate that the specific disk and its controller can provide. However, the transfer time can never be reduced by adopting any kind of policy/strategy unless the electronic technology used in the disk is modified. That is why, apart from the size of the disk, the access time as well as the transfer time are considered vital parameters at the time of selection.
Apart from the delays caused by the access time and the transfer time, there may be several other delays caused by queuing normally associated with a disk I/O operation: when a process issues an I/O request, most of the time it first has to wait in a queue for the device to become available (step 1). Another form of delay may occur, particularly with some mainframe systems that use a technique known as rotational positional sensing (RPS).
Seek Time (tS): This is the time required by the disk to place its movable arm on the required track. It mostly depends on two key factors: the initial startup time of the arm required for its movement and the time taken by the arm to pass over the tracks radially to reach the target track once the access arm has come up to speed. The average seek time to place the head on the right track is tS.
Rotational Delay: This delay mostly depends on the speed of rotation of the disk. Let the disk, i.e., each track, rotate at a speed of r revolutions per second. The start of the block to be transferred may be just under the head or at any position on the track. Hence, on average, half a revolution is required to bring the beginning of the desired sector under the head. This time is the latency time, which is equal to (2r)^-1 seconds.
Transfer Time: The transfer time to and from the disk mostly depends on the data transfer rate, in which the rotation speed of the disk is a dominating factor. To make a rough estimate of this transfer time, let us assume that each track has a capacity of N words and that the disk rotates at a speed of r revolutions per second. The data transfer rate of the disk is then r.N words per second. If the size of the block to be transferred is n words, the time required to transfer n words is n × (r.N)^-1 seconds. Hence, once the read-write head is positioned on the right track at the start of the desired block, the physical time taken just to transfer the desired block is n × (r.N)^-1 seconds.
Hence, the total time required to transfer a block of n words in a disk I/O operation can be expressed as
Total time = tS + (2r)^-1 + n × (r.N)^-1 seconds
where the symbols have their usual significances, as already mentioned.
More about this topic is given on the Support Material at www.routledge.com/9781032467238.
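A small worked computation of this expression, using assumed values for the seek time, rotational speed, track capacity, and block size, is shown below.

#include <stdio.h>

int main(void)
{
    /* Assumed parameters for illustration only. */
    double tS = 0.004;      /* average seek time: 4 ms                      */
    double r  = 125.0;      /* rotational speed: 125 revolutions per second */
    double N  = 10000.0;    /* track capacity: 10,000 words                 */
    double n  = 500.0;      /* block size to transfer: 500 words            */

    double latency  = 1.0 / (2.0 * r);      /* (2r)^-1: half a revolution   */
    double transfer = n / (r * N);          /* n x (r.N)^-1                 */
    double total    = tS + latency + transfer;

    printf("rotational delay = %.4f s\n", latency);   /* 0.0040 s */
    printf("transfer time    = %.4f s\n", transfer);  /* 0.0004 s */
    printf("total I/O time   = %.4f s\n", total);     /* 0.0084 s */
    return 0;
}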
more sectors may pass under the head by the time the transfer is completed. Hence, an attempt
to read just the next consecutive sector cannot succeed in the same revolution of the disk. In the
case of a write operation, the effect is the same: data transfer from the disk buffer takes place
before writing of the data on the disk is initiated, and consequently data cannot be written into
just the next consecutive sector in the same revolution. In both cases, the throughput of the disk
device is severely affected.
To avoid this problem caused by data transfer time, it may then be necessary to read one block and then skip one, two, or even more blocks. Skipping blocks to give the disk drive enough time to transfer data to and from memory is called interleaving. This technique separates consecutively numbered sectors on a track by putting other sectors between them. The expectation is that the data transfer involved in reading/writing a sector can be completed before the next consecutively numbered sector passes under the head. Interleaving is realized when the disk is formatted, taking into account the interleave factor (inf), which is the number of sectors that separate consecutively numbered sectors on the same track of the disk. Numbering sectors in this way, by choosing a suitable interleave factor, at least enables the disk drive to read consecutively numbered sectors with ease and still achieve the maximum speed, of course within the limits of the underlying hardware capability.
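The effect of the interleave factor on the physical ordering of logically consecutive sectors can be seen from the short program below, which lays out one track for a few values of inf; the track size of eight sectors is an arbitrary assumption.

#include <stdio.h>

#define SECTORS_PER_TRACK 8

/* Place logically numbered sectors around the track so that consecutive logical
 * sectors are separated by 'inf' physical positions (the interleave factor).   */
void layout_track(int inf, int physical[SECTORS_PER_TRACK])
{
    int pos = 0;
    for (int s = 0; s < SECTORS_PER_TRACK; s++) {
        while (physical[pos] != -1)                 /* skip slots already used */
            pos = (pos + 1) % SECTORS_PER_TRACK;
        physical[pos] = s;                          /* logical sector s here   */
        pos = (pos + 1 + inf) % SECTORS_PER_TRACK;  /* leave 'inf' gap slots   */
    }
}

int main(void)
{
    for (int inf = 0; inf <= 2; inf++) {
        int track[SECTORS_PER_TRACK];
        for (int i = 0; i < SECTORS_PER_TRACK; i++)
            track[i] = -1;
        layout_track(inf, track);
        printf("interleave factor %d:", inf);
        for (int i = 0; i < SECTORS_PER_TRACK; i++)
            printf(" %d", track[i]);
        printf("\n");      /* e.g., inf = 1 gives 0 4 1 5 2 6 3 7 */
    }
    return 0;
}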
Another problem is observed when the first sector of a track on one platter is to be read immediately after reading the last sector of the corresponding track on the previous platter in the same cylinder. The read operation in this case must be switched between heads positioned over different platters. As a result, a delay is caused, known as head switching time. This is sometimes significant, particularly in the processing of a sequential file, when the last sector on a platter has just been read before the next read operation starts. One possible solution, known as head skewing, staggers the start of tracks on different platters in the same cylinder so that the times when the last sector of a track and the first sector of the next track in the same cylinder pass under their respective heads are separated by at least the head switch time. Still one more problem is found in situations in which the head must be moved to the next cylinder, which naturally consumes the needed seek time. Cylinder skewing is analogous to head skewing and similarly staggers the start of tracks on consecutive cylinders to allow for the seek time.
However, interleaving is not as important in modern disks, which are equipped with sophisticated controllers that can transfer data to and from memory at much higher rates; modern disks are more concerned with head and cylinder skewing. Still, sector interleaving remains conceptually important, since it offers insight into how data staggering can be used to optimize disk throughput as far as possible.
More about this topic with a figure is given on the Support Material at www.routledge.com/9781032467238.
The seek time for a specific disk block to service an I/O request depends on the position of the block relative to the current position of the disk heads. In a multitasking environment, there will normally be a number of I/O requests arriving from various processes to a single disk that are maintained in a queue and serviced successively. This often requires access to random locations (tracks) on the disk, even if the items are selected (scheduled) from the queue at random. Consequently, in a multitasking environment, the total seek time involved in servicing all I/O requests for a particular device increases abruptly. This indicates that the order in which these items are selected to perform their individual I/O operations eventually determines the total seek time needed to service all the I/O requests after visiting all the required tracks. Moreover, when the same disk pack is used both for secondary store (files) and backing store (swap space), the swap space should be kept on the tracks halfway between the center and the edge, because items in this location can then be accessed with the lowest average seek time. However, the main objective is ultimately to always bring down the total seek time as much as possible so as to increase the throughput (the number of I/O operations performed per second) of a disk, which, in turn, depends on the order in which the I/O requests are serviced.
Therefore, it is necessary to introduce suitable ordering in disk I/O operations, which ultimately gives rise to a specific disk (track) scheduling policy to be obeyed by the physical IOCS and the device drivers. Since scheduling policies always attempt to minimize the time wasted in executing lengthy seeks, they inevitably improve the throughput as well as the overall system performance, but individual requests sometimes may be affected adversely. In general, disk-head scheduling is defined based on the following assumptions:
• There is only one disk drive. If there are several, each is to be scheduled independently.
• All requests are for single, equal-sized blocks.
• Requested blocks are randomly distributed on the disk pack.
• The disk drive has only one movable arm, and all read/write heads are attached to that one
arm.
• Seek latency is linear in the number of tracks crossed (this assumption fails if the disk
controller has mapped tracks at the end of the disk to replace ones that have bad sectors).
• The disk controller does not introduce any appreciable delays.
• Read and write requests take identical time to service.
Disk scheduling requires a careful examination of pending requests to determine the most efficient way to service all the requests. A disk scheduler examines the positional relationship among waiting requests. The queue of requesting processes is then reordered so that the requests can be serviced with minimum head movement. We now examine some of the common disk scheduling policies that aim to optimize the total seek time for a number of I/O requests available in the queue at any instant. We assume a disk of 100 tracks or cylinders and that the disk request queue has random requests in it. Assume that the queue of requested tracks, in the order received by the disk scheduler, is 20, 15, 30, 10, 45, 55, 35, 5, 95, 85. The various scheduling algorithms will now be explained based on this request queue, and each algorithm will use this queue to show its respective performance.
6.17.2 FIRST-IN-FIRST-OUT/FIRST-COME-FIRST-SERVE
The simplest form of disk scheduling is FIFO or FCFS scheduling, which processes items from the
queue starting from the current item and proceeding in sequential order. As the requests arrive, they
are placed at the end of the queue one after another according to their time of arrival. This policy
is easy to implement and reasonably fair, because it ensures that every request will eventually be
honored, and all the requests are serviced in the order they are received. However, every request
here is likely to suffer from a seek operation.
Figure 6.5 shows the working of FIFO considering the queue already mentioned. Initially, the read/write head is on cylinder 20. In FIFO scheduling, the head will now move from cylinder (track) 20 to cylinder 15, then from 15 to 30, then to 10, and so on, as shown in Figure 6.5. If t is the time taken to move the head from one track to an adjacent track, the total seek time in processing all the requests in the queue is:
(5 + 15 + 20 + 35 + 10 + 20 + 30 + 90 + 10)t = 235t
If there are only a few I/O requests in the queue, and if many of the requests are clustered closely in
a disk region, then FIFO is expected to exhibit a moderately good performance. On the other hand,
if there are many requests in the queue that are scattered, then the performance of FIFO will often
become an approximation of random scheduling, and this performance may not always be acceptable.
Thus, it is desirable to look at a more sophisticated scheduling policy that could yield a reasonable
result. The principle and working of a number of such algorithms will now be considered and discussed.
Although FCFS is not an attractive approach on its own, its refinements once again illustrate the law of diminishing returns. However, there are many alternatives to FCFS, and they are all much better.
For example, the Pickup method keeps the requests in FCFS order to prevent starvation, but on its way
to the track where the next request lies, it “picks up” any requests that can be serviced along the way.
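The total head movement for FCFS, and for the shortest-seek-time-first (SSTF) policy discussed a little later, can be verified with a short program. For the example queue of the text and an initial head position of track 20, it reports 235 tracks for FCFS and 105 tracks for SSTF, consistent with the figures used in this section.

#include <stdio.h>
#include <stdlib.h>

#define NREQ 9
/* The example request queue of the text (the head already sits on track 20). */
static const int requests[NREQ] = {15, 30, 10, 45, 55, 35, 5, 95, 85};

static int fcfs_movement(int head)
{
    int total = 0;
    for (int i = 0; i < NREQ; i++) {
        total += abs(requests[i] - head);   /* service strictly in arrival order */
        head = requests[i];
    }
    return total;                           /* 235 for this queue */
}

static int sstf_movement(int head)
{
    int done[NREQ] = {0}, total = 0;
    for (int served = 0; served < NREQ; served++) {
        int best = -1;
        for (int i = 0; i < NREQ; i++)      /* pick the closest pending request  */
            if (!done[i] && (best < 0 ||
                 abs(requests[i] - head) < abs(requests[best] - head)))
                best = i;
        total += abs(requests[best] - head);
        head = requests[best];
        done[best] = 1;
    }
    return total;                           /* 105 for this queue */
}

int main(void)
{
    printf("FCFS: %d tracks, SSTF: %d tracks\n", fcfs_movement(20), sstf_movement(20));
    return 0;
}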
6.17.3 PRIORITY
A system designed on the priority principle does not consider the disk scheduling policy to lie under the domain of the device management software. The aim of this approach is mostly
FIGURE 6.5 A representative sample example showing the implementation of disk–arm scheduling using
FCFS (First Come First Serve) scheduling algorithm in magnetic disk.
to satisfy other objectives rather than to optimize disk throughput (performance). As in process scheduling, here also short batch jobs and interactive jobs are often given higher priority than longer jobs that require more time to complete their computations. In this way, this policy allows a lot of short jobs to leave the system quickly after finishing their tasks and thus may exhibit good interactive response time. Consequently, this approach requires longer jobs to wait long, and even excessively long, times if there is a continuous flow of short batch jobs and interactive jobs, which may eventually lead to starvation. In addition, this policy could sometimes lead to countermeasures on the part of users, who may split their original jobs into smaller pieces to beat the system. Furthermore, when this policy is employed for database systems, it tends to offer poor performance.
6.17.4 LAST-IN-FIRST-OUT
This policy always immediately schedules the most recent request, irrespective of all the requests already lying in the queue. Surprisingly, it is observed to have some merit, particularly in transaction processing systems, in which offering the device to the most recent user can result in little or almost no arm movement while moving through a sequential file. Exploiting the advantage of this locality, it improves disk throughput and reduces queue lengths as long as a job actively uses the file system. On the other hand, the distinct disadvantage of this approach is that if the disk remains busy because of a large workload, there is a high possibility of starvation. Once a job has entered an I/O request and been placed in the queue, it will fall back from the head of the line due to scheduling of the most recent request, and the job can never regain the head of the line unless the queue in front of it empties.
FIFO, priority, and last-in-first-out (LIFO) scheduling policies are all guided by the desire of the requestor. In fact, the scheduler has no liberty to make any decision to schedule an item that may be appropriate to the existing position of the arm to optimize the throughput. However, if the track position is known to the scheduler, and the scheduler is free to select an item based on the current position of the arm, the following strategies can then be employed to carry out scheduling based on the items present in the queue, as requested.
6.17.5 SHORTEST SEEK TIME FIRST (SSTF)
In shortest seek time first (SSTF) scheduling, the request that requires the least movement of the disk arm from its current position, that is, the one with the minimum seek time, is always selected next. Note that when the scheduling algorithm changes from FCFS to SSTF, there is a considerable reduction in seek time, from 235t to 105t. In fact, this algorithm, on average, cuts the total arm motion almost in half compared to FCFS.
FIGURE 6.6 A representative sample example showing the implementation of disk–arm scheduling using
Shortest Seek Time First (SSTF) scheduling algorithm in magnetic disk.
The SSTF policy is analogous to the shortest job first (SJF) policy of process scheduling. Hence, it achieves good disk throughput, but, similar to the SJF algorithm, SSTF can cause starvation of some I/O requests. If a request is consistently bypassed, however, the disk must in any case be incapable of keeping up with the stream of disk requests.
In a real situation, more requests usually keep coming in while the requests shown in Figure 6.6
are being processed. Consider the situation if, after moving to track 30 and while processing this
request, a new request for track 25 arrives. So, after completing the request for track 30, it will take
up the newly arrived request for track 25 (due to less arm movement), keeping the scheduled existing
request for track 10 in the queue waiting. If another new request for track 15 then arrives while the
request for track 25 is under process, then the arm will next go to track 15 instead of track 10. With a
heavily loaded disk, it is observed that the arm will tend to stay in the middle of the disk most of the
time, so requests at either extreme will have to wait indefinitely until a statistical fluctuation in the load causes there to be no requests near the middle. Thus, all requests that are far from the middle may suffer from starvation and receive poor service (not being attended to). So the goals of minimal response time and fairness are in conflict, even though the overall throughput of the disk increases. This kind of starvation is not as bad as the underlying problem, in which it is very likely that the main memory allocation policy itself will cause thrashing.
However, SSTF always has the advantage of minimizing the total seek latency and saturating
at the highest load of any policy. On the negative side, it tends to have a larger wait-time variance.
Since this policy allows the arm to move in two directions at the time of selecting an item to service,
a random tie-breaking algorithm may then be used to resolve cases of equal distances. In addition,
SSTF and various other scan policies can be implemented more effectively if the request queues are
maintained in sorted order by disk track number.
6.17.6 SCAN
In SSTF scheduling, the freedom of the arm to move in either direction when selecting the next item to service is one of the reasons for the starvation problem and can also result in increased seek time. SCAN scheduling has been introduced to alleviate these problems of SSTF. In this method, the disk
heads are required to move in one direction only, starting from one end of the disk (platter) and
moving towards the other end, servicing I/O operations en route on each track or cylinder before
moving on to the next one. This is called a scan. When the disk heads reach the last track in that
direction (other end of the platter), the service direction of the head movement is reversed and the
scan proceeds in the opposite direction (reverse scan) to process all existing requests in order,
FIGURE 6.7 A representative sample example showing the implementation of disk–arm scheduling using
SCAN scheduling algorithm in magnetic disk.
including new ones that have arrived. In this way, the head continuously scans back and forth across
the full width of the disk.
Let us now explain the SCAN scheduling technique with our example queue of the previous two
sections. Before applying SCAN, we should know the direction of the head movement as well as the
last position of the read/write head. Let us assume that the head will be moving from left to right and
the initial position of the head is on track 20. After servicing a request for track 20, it will service a
request for track 30, then track 35, and so on, until it services the request for track 95. After that, it
continues to move forward in the same direction until it reaches the last track in that direction (track
number 99), even though there is no request to service. After reaching 99, the head movement will
then be reversed, and on its way, it will service requests for track 15, then track 10, and then track 5.
Figure 6.7 shows the sequence of operation using the SCAN algorithm. The total seek time here is:
(10 + 5 + 10 + 10 + 30 + 10 + 4 + 84 + 5 + 5)t = 173t
Time 4t indicates the time taken to move the head from track 95 (last request) to the last track (track
number 99). As the head moves from left to right, servicing the requests on its path, new requests may
arrive for tracks both ahead and behind the head. The newly arrived requests that are ahead of the
current position of the head will be processed as usual as the head reaches the track of new requests
during its forward journey. The newly arrived requests that are behind the head will be processed
during the reverse journey of the head along with the pending older requests en route in order lying
in the queue. These newly arrived requests may be processed before existing older ones, depending
on the track position that the new items have requested. The older requests in this situation may have
to wait longer than the newly arrived ones, leading to an increase in their response times.
It is to be noted that the SCAN policy is biased against the area most recently traversed. Thus, it does not exploit locality as well as SSTF or even LIFO does. It is also interesting to observe that the SCAN policy favors those jobs whose requests are for tracks nearest the innermost and outermost tracks, and it favors the latest-arriving jobs. The first issue can be avoided by the circular SCAN, or C-SCAN, policy, while the second issue can be addressed by the N-step-SCAN policy.
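For the same request queue, the head movement under SCAN, and under the LOOK refinement introduced in the next section, can be computed as follows. With the head at track 20 moving towards higher-numbered tracks on a 100-track disk, the program reports 173 tracks for SCAN and 165 tracks for LOOK, consistent with the totals quoted above; it assumes, as the text does, that all requests are present when scheduling starts.

#include <stdio.h>

#define NREQ    9
#define MAXTRK 99                       /* last track of the 100-track disk */
static const int requests[NREQ] = {15, 30, 10, 45, 55, 35, 5, 95, 85};

/* Head starts at 'head' moving towards higher tracks, services every pending
 * request on the way, then reverses.  If go_to_end is nonzero the head first
 * travels to the last track (SCAN); otherwise it reverses at the last request
 * in that direction (LOOK, the elevator refinement discussed next).          */
static int scan_movement(int head, int go_to_end)
{
    int up_max = head, down_min = head, total = 0, pos = head;

    for (int i = 0; i < NREQ; i++) {
        if (requests[i] > up_max)   up_max   = requests[i];
        if (requests[i] < down_min) down_min = requests[i];
    }
    /* outward sweep towards higher track numbers */
    total += (go_to_end ? MAXTRK : up_max) - pos;
    pos    = (go_to_end ? MAXTRK : up_max);
    /* reverse sweep down to the lowest pending request */
    total += pos - down_min;
    return total;
}

int main(void)
{
    printf("SCAN: %d tracks, LOOK: %d tracks\n",
           scan_movement(20, 1), scan_movement(20, 0));   /* 173 and 165 */
    return 0;
}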
FIGURE 6.8 A representative sample example showing the implementation of disk–arm scheduling using
elevator scheduling algorithm in magnetic disk.
6.17.7 LOOK (ELEVATOR)
A widely used refinement of SCAN moves the arm only as far as the final pending request in each direction (instead of going all the way to the last track of the disk in that direction, as with SCAN) and then starts traversing in the reverse direction until the last request in that direction is met. This algorithm is known both in the disk world and the elevator world as the elevator algorithm. Often this algorithm is also called the LOOK
algorithm, since it looks for a request before continuing to move in a given direction. The working
of this algorithm requires the associated software to maintain 1 bit: the current direction bit, UP
or DOWN. When a request completes, the disk or elevator driver checks the bit. If it is UP, the arm or cabin is moved to the next higher pending request, if any. If no request is pending at a higher position, the direction bit is reversed and set to DOWN, and the head or cabin starts moving in the reverse direction towards the next lower requested position, if any.
Figure 6.8 shows the sequence of operation of the elevator algorithm with the same example queue of the previous two sections. The total seek time here is:
(10 + 5 + 10 + 10 + 30 + 10 + 80 + 5 + 5)t = 165t
6.17.8 C-SCAN
In circular SCAN (C-SCAN), the head services requests in one direction only; when it reaches the last track in that direction, it returns to the opposite end of the disk without servicing any requests on the way back and then begins a fresh scan. C-SCAN reduces the maximum delay experienced by new requests. With SCAN, if the expected time for a scan from the inner track to the outer track is T, then the expected service interval for tracks at the periphery is 2T. With C-SCAN, this interval is on the order of T + Smax, where Smax is the maximum seek time.
6.17.9 C-LOOK
Similar to LOOK, which is a variant of the SCAN algorithm, the C-LOOK algorithm is a variant of C-SCAN in the same fashion. Here, the disk head also keeps moving in the same direction until there are no more outstanding requests in the current direction. Once it reaches the last request in one direction, it
FIGURE 6.9 A representative sample example showing the implementation of disk–arm scheduling using
Circular Scan or C-SCAN scheduling algorithm in magnetic disk.
immediately switches its direction and returns to the first-positioned request at the opposite end without servicing any request on the return trip, and the scan once again begins moving towards the other end.
In practical situations, it is observed that one or a few processes sometimes have high access rates to one specific track, thereby almost monopolizing the entire device with repeated requests to that track. High-density multi-platter disks, in particular, are more likely to exhibit this type of behavior than lower-density disks and/or disks with only one or two surfaces. To alleviate the problem of the head remaining almost fixed on a specific track while responding to successive requests, the disk request queue can be segmented such that only one segment is permitted to be processed entirely at a time. The N-step-SCAN and FSCAN policies use this approach.
6.17.10 N-STEP-SCAN
The N-step-SCAN policy usually segments the disk request-queue into sub-queues, each of length
N. Sub-queues are processed one at a time using the SCAN algorithm. While a queue is being
processed, new requests may arrive, and those must be added to some other queue. If fewer than
N requests are available at the end of a scan, then all of them are processed with the next scan. For
large values of N, the performance of N-step-SCAN tends to approach that of SCAN.
6.17.11 FSCAN
This policy is similar to N-step-SCAN but uses only two sub-queues. All of the requests that have already arrived are in one of the queues; the other queue remains empty. The processing of the filled queue begins using the SCAN algorithm, and during the processing of this queue, all new requests that arrive are put into the other queue. This queue will be serviced only after the completion of the queue under process. In this way, service of the new requests is deferred until all of the old requests are processed, thereby attempting to produce uniform wait times for all the requests on average.
efficiently implemented in advanced versions of Linux (Linux 2.6). Its details can be seen in "Device Management in Linux".
6.18 RAID
Over the past couple of decades, disk storage devices have exhibited by far the smallest improvement in speed when compared to processors and main memory, which have benefited from continuous advances in electronic technology. Yet there has been a spectacular improvement in the storage capacity of these devices. However, magnetic disks inherently have several drawbacks, the most important of which concern access speed, data-transfer rate, and reliability.
Computer users thus constantly clamor for disks that can provide larger capacity, faster access to data, high data-transfer rates, and, of course, higher reliability. But such disks are generally expensive.
Alternatively, it is possible to arrange a number of relatively low-cost devices in an array in which each one can act independently and also work in parallel with others to realize considerably higher performance at a reasonable cost. With such an arrangement of multiple disks, separate I/O requests can be handled in parallel, as long as the data are located on separate disks. On the other hand, a single I/O request can be executed in parallel if the block of data to be accessed is distributed across multiple disks. Based on this idea, a storage system was proposed using multiple small, relatively inexpensive disks (as an alternative to a single large one), called a redundant array of inexpensive disks (RAID), which offered a significant improvement in disk performance and also in reliability. Today, RAID technology is more appropriately called a redundant array of independent disks, the industry having adopted the term independent in place of inexpensive.
The presence of multiple disks in a RAID arrangement opens a wide variety of ways to organize data, and redundancy can also be included to improve reliability. However, it raises difficulties in developing schemes for multiple-disk database design that can be equally usable on a number of different hardware platforms and operating systems. Fortunately, the industry has agreed on a standardized scheme to overcome this difficulty using RAID, which comprises seven universally accepted levels, from zero through six, besides a few additional levels that have been proposed by some researchers. These levels do not indicate any hierarchical relationship; they rather designate different architectural designs that exhibit the following common characteristics:
• RAID consists of a set of physical disk drives, but to the operating system, it is recognized
as a single logical disk drive.
• Data are distributed in different fashions across multiple physical disk drives present in
the array.
• Redundant disk capacity is used to store parity information, which enables recovery of
data at the time of any transient or catastrophic failure of a disk.
Different RAID levels differ essentially in the details of the second and third characteristics mentioned above. The third characteristic, however, is not supported by RAID 0 and RAID 1.
The RAID arrangement distributes the data involved in an I/O operation across several disks and
performs the needed I/O operations on these disks in parallel. This feature can consequently provide
fast access or a higher data transfer rate, but it depends on the arrangement of disks employed. The
performance of any of the RAID levels critically depends on the request patterns of the host system
and the layout of the data. High reliability is achieved by recording redundant information; how-
ever, the redundancy employed in a RAID arrangement is different by nature from that employed
in conventional disks. A traditional disk uses a cyclic redundancy checksum (CRC) written at the
end of each record for the sake of providing reliability, whereas redundancy techniques in a RAID
employ extra disks to store redundant information so the original data can be recovered even if some
disks fail. Recording of and access to redundant information does not consume any extra I/O time
because both data and redundant information are recorded/accessed in parallel.
Disk striping: RAID technology uses a special technique known as disk striping that provides a way to achieve high data transfer rates during an I/O operation. A disk strip is a unit of data on a disk, which can be a disk block, a sector, or even an entire disk track. Identically positioned disk strips on different disks form a disk stripe. A file may be allocated an integral number of disk stripes. The data located in the strips of the same stripe can be read/written simultaneously because they reside on different disks. If the array contains n disks, the data transfer rate is expected, at least theoretically, to be n times that of a single disk. However, the real data transfer rate obtained by this arrangement may not be exactly n times higher, owing to several factors that prevent the parallelism of I/O operations from occurring smoothly. In fact, the implementations of disk striping arrangements and the redundancy techniques employed differ significantly from one level to another in the various proposed RAID organizations. Two main metrics determine the performance differences among these levels:
There are altogether seven different RAID schemes, RAID 0 through RAID 6, each with its own disk arrangement and specific data organization along with allied redundancy techniques. However, RAID level 0 + 1 and RAID level 1 + 0 are actually hybrid organizations based on the RAID 0 and RAID 1 levels, and RAID level 5 is the most popular RAID organization.
A comparison of the seven different levels of RAID organization on vital metrics in tabular form and also more details of this topic with a figure are given on the Support Material at www.routledge.com/9781032467238.
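A sketch of how striping maps data in the simplest case (RAID 0, with no redundancy): logical strips are assigned round-robin across the member disks, so the strips of one stripe reside on different disks and can be accessed in parallel. The number of disks is an assumption made for the example.

#include <stdio.h>

#define NDISKS 4               /* assumed number of disks in the array */

struct strip_location {
    int  disk;                 /* which member disk holds the strip      */
    long strip_on_disk;        /* position of the strip within that disk */
};

/* RAID 0 layout: logical strip s of stripe (s / NDISKS) goes to disk (s % NDISKS). */
struct strip_location map_strip(long logical_strip)
{
    struct strip_location loc;
    loc.disk          = (int)(logical_strip % NDISKS);
    loc.strip_on_disk = logical_strip / NDISKS;
    return loc;
}

int main(void)
{
    for (long s = 0; s < 8; s++) {
        struct strip_location loc = map_strip(s);
        printf("logical strip %ld -> disk %d, strip %ld (stripe %ld)\n",
               s, loc.disk, loc.strip_on_disk, s / NDISKS);
    }
    return 0;
}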
When an I/O request has a hit in the disk cache, the block of data in the disk cache (within main memory) is either transferred to the target memory location assigned by the user process or handed over using a shared memory approach in which a pointer to the appropriate slot in the disk cache is passed, thereby avoiding a memory-to-memory transfer and also allowing other processes to share access to the entire disk cache.
The next design issue is the replacement strategy, similar to the memory cache, which is
required when a new block is brought into the disk cache to satisfy an I/O request but there is
no room to accommodate it; one of the existing blocks must be replaced. Out of many popular
algorithms, the most commonly used algorithm is least recently used (LRU). In fact, the cache
can be logically thought of as consisting of a stack of blocks, with the most recently referenced
block on the top of the stack. When a block in the cache is referenced, it is taken from its existing
position in the stack and put on the top of the stack. When a new block is fetched from secondary
memory, the block at the bottom of the stack is removed to make room for the new one, and the
incoming new block is pushed on the top of the stack. In fact, it is not necessary to actually move
these blocks around in main memory; only a stack of pointers could be associated with the disk
cache to perform these operations. The replacement strategy, however, can employ another useful
algorithm known as least frequently used (LFU), which selects for replacement the least frequently referenced block in the set.
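A minimal sketch of the LRU organization just described, with the "stack" realized as a doubly linked list of block headers so that no blocks are actually moved around in main memory. The structure and helper names are illustrative assumptions only.

#include <stddef.h>

struct cache_block {
    long block_no;                     /* which disk block this slot holds       */
    struct cache_block *prev, *next;   /* position in the LRU list               */
    /* char data[BLOCK_SIZE]; */       /* payload omitted in this sketch         */
};

static struct cache_block *mru;        /* top of the "stack": most recently used */
static struct cache_block *lru;        /* bottom: candidate for replacement      */

/* Unlink a block and push it on top of the stack (most recently used position). */
void touch(struct cache_block *b)
{
    if (b == mru) return;
    if (b->prev) b->prev->next = b->next;
    if (b->next) b->next->prev = b->prev;
    if (b == lru) lru = b->prev;
    b->prev = NULL;
    b->next = mru;
    if (mru) mru->prev = b;
    mru = b;
    if (!lru) lru = b;
}

/* On a miss, reuse the block at the bottom of the stack for the new disk block.
 * The sketch assumes the cache has already been populated, so lru is non-NULL.  */
struct cache_block *replace_lru(long new_block_no)
{
    struct cache_block *victim = lru;  /* least recently used block               */
    /* a real cache would write the victim back here if it had been modified      */
    victim->block_no = new_block_no;
    touch(victim);                     /* the new block becomes most recently used */
    return victim;
}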
Disk caching, in principle, is ultimately involved in both reads and writes. The actual physical replacement of an entry from the disk cache can take place either on demand or preplanned. In the former case, a block is replaced and written to disk only when the slot (room) is needed; if the selected block has only been read and not modified, it is simply replaced, and it is not necessary to write it back to the disk. In the preplanned case, a disk block may be continually modified by the running processes and remain in the disk cache (if the slot is not required), and writing it to disk for immediate update is temporarily postponed to be performed later at a convenient time. This mode is related to write-back blocks, similar to the memory cache, in which a number of slots are released at a time. This mode of writing is known as delayed write in UNIX. The main drawback of this approach stems from the volatility of main memory, which may lead to potential corruption of the file system, since unwritten buffers in memory may be lost due to a sudden power failure or a system crash for any other reason.
Disk caches, unlike memory caches, are often implemented entirely in software, possibly with indirect assistance from memory management hardware when disk blocks are intelligently mapped in memory. This concept and basic principle of operation of the disk cache, more specifically a unified disk cache (as explained later) to handle I/O operations, is implemented in UNIX as a buffer cache, which is discussed in detail later ("Device Management in UNIX").
More details on this topic are given on the Support Material at www.routledge.com/9781032467238.
be judiciously decided to avoid memory wastage. To alleviate all these problems, the concept of a unified data cache has been introduced.
The disk caching technique exhibits a lot of advantages as well as serious drawbacks. Some of
the potential advantages are:
The main disadvantage of disk caching is its use of a delayed write policy, which may result in potential corruption of the file system if the power fails or the system crashes for other reasons. Moreover, it is sometimes critical to decide on the amount of memory to be committed to disk caches to realize satisfactory performance and to set the other parameters needed to implement it favorably. Failing this, systems using disk caches may, in certain situations, exhibit poorer performance than non-cached systems.
Considering all these aspects, the use of the disk caching technique is still relatively advantageous, and it is also used by Berkeley implementers in variants of UNIX. It has been claimed that elimination of 85% of disk accesses can be achieved when the disk caching technique is used, compared with the same systems with no buffer cache.
FIGURE 6.10 A formal design concept in logical structuring of I/O facility in UNIX.
supports I/O to take place directly between the I/O module and the memory area assigned to the
I/O process.
Buffered I/O
The buffering technique is used at the operating-system level in I/O operations on both block- and character-oriented devices. Although different forms of buffering schemes exist, UNIX uses the I/O buffering technique in its own way, using two types of buffers, system buffer caches and character queues, in order to improve the performance of each individual process and thereby enhance the efficiency of the operating system as a whole.
• Buffer Cache: UNIX implements a disk cache approach, essentially a unified disk cache to handle disk I/O operations, called here a buffer cache and managed by the kernel. It is a dedicated part of main memory used mainly to store recently used disk blocks. The size of the entire buffer cache is typically determined during system configuration, but the size of each buffer in the buffer cache is the same as that of a disk block (sector). A buffer allocated to a disk block can be shared by all other processes, thereby enabling different processes to avoid repeated disk accesses for the same data. Similarly, a single process that accesses some data more than once need not access the disk repeatedly. Since both the buffer cache and the user process I/O area are in main memory, the transfer of data between the buffer cache and the user process is simply a memory-to-memory copy; moreover, it is always carried out using DMA, which uses only bus cycles, leaving the processor free to do other useful work. Each buffer in the buffer cache is either locked for
use by other process(es) or free and available for allocation. Buffer cache management maintains three lists, a free list, a driver I/O queue, and a device list, to manage these buffers.
When a large volume of data is processed, to minimize the disk-access delay, pre-fetching of data blocks on a per-process basis is carried out by initiating an I/O for the next disk block of the file. Since records are usually ordered in blocks, an entire disk block I/O operation (read/write) is relatively fast even when only a specific byte in it is required.
• Character Queue: Block-oriented devices, such as tape and disk, are conducive to the buffer cache approach. Character-oriented devices, such as printers and terminals, can also be made suitable with a different form of buffering. A character queue is either written by the process and read by the device or written by the I/O device and read by the process. In both cases, the producer-consumer model explained in Chapter 4 is used. Since a character queue may only be read once, each character in the queue is effectively destroyed as it is read and used. This is in contrast to the buffer cache, which can be read multiple times.
Unbuffered I/O
Unbuffered I/O is simply DMA between the device (I/O module) and the memory area assigned to a process. This is probably the fastest mode of I/O operation. However, a process executing unbuffered I/O is locked in memory and cannot be swapped out. This commitment may sometimes adversely affect overall system performance, since part of main memory remains tied up, limiting the benefits of swapping.
Conclusion
Disk drives are considered the most important devices for all contemporary operating systems, including UNIX, due to their potential to provide reasonably high throughput. I/O involving these devices may be unbuffered or buffered via the buffer cache. Tape drives are operationally different but functionally similar to disks; hence, they use similar I/O schemes. In large UNIX installations, more than 100 disk blocks may be buffered.
A terminal involved in I/O is, by its very nature, relatively slow when exchanging characters; the character-queue approach is thus the most suitable for this device. Communication lines similarly require serial processing of bytes of data for input or output and hence can best be managed with character queues. Likewise, I/O involving printers generally depends on the speed of the printer. Slow printers generally use a character queue, while a fast printer might employ unbuffered I/O; since the bytes of data sent to a printer are never reused, using a buffer cache for this device would mean useless overhead.
More details on this topic, including the general organization of the buffer cache and its opera-
tion, with a fgure, are given on the Support Material at www.routledge.com/9781032467238.
Linux shares many device-management features with UNIX; we mention some of them here before looking at several I/O features that contemporary Linux usually provides.
• Linux kernel modules are essentially dynamically loadable, so a device driver has to be registered with the kernel when installed and de-registered at the time of its removal (a minimal sketch follows this list).
• For devices, the vnode (similar to the inode in UNIX) of the virtual file system (VFS) contains pointers to device-specific functions for the file operations, such as open, read, write, and close.
• Similar to UNIX, each buffer in the disk cache also has a buffer header, but it is allocated in a slab offered by the slab allocator (already discussed).
• Modified buffers in the disk cache are written out when the cache is full, when a buffer has been in the cache for a long time, or when the file system is directed to write out its buffers in the interest of synchronization, to yield better system performance.
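As a minimal sketch of the first point above, the skeleton below registers a character device with the kernel when the module is loaded and de-registers it on removal; the device name demo_dev and the empty operations table are placeholders rather than any real driver.

    #include <linux/module.h>
    #include <linux/fs.h>

    static int major;

    static struct file_operations demo_fops = {
        .owner = THIS_MODULE,            /* open/read/write handlers would go here */
    };

    static int __init demo_init(void)
    {
        /* 0 asks the kernel to allocate a major number dynamically */
        major = register_chrdev(0, "demo_dev", &demo_fops);
        return (major < 0) ? major : 0;  /* negative value means registration failed */
    }

    static void __exit demo_exit(void)
    {
        unregister_chrdev(major, "demo_dev");   /* de-register on module removal */
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");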
Disk Scheduling
In Linux 2.4, the default disk scheduler, known as the Linux Elevator, is basically a variant of the
LOOK algorithm discussed in Section 6.17.7. Linux 2.6, however, used innovative approaches in
this area, replacing the one in its older version, 2.4, to further enhance I/O scheduling performance.
Thus, the Elevator algorithm used in Linux 2.4 was augmented by two additional advanced algo-
rithms, deadline I/O scheduling and the anticipatory I/O scheduler.
1. The Elevator Scheduler (Linux 2.4): The scheduler maintains a single queue, sorted on block number, of all pending I/O requests (both read and write). As the requests are serviced, the disk arm keeps moving in a single direction, servicing each request encountered on its way until there are no more outstanding requests in the current direction. To improve disk throughput, this general strategy is refined: new requests that arrive are merged with the existing queue of pending requests whenever possible, using four types of specific actions (to be taken in order, as explained on the Support Material at www.routledge.com/9781032467238) while attempting to carry out this merging. A sketch of one elevator pass appears below.
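The following is a toy, user-space sketch of a single LOOK-style pass over a sorted request queue; it illustrates only the ordering idea and none of the merging rules, queue plugging, or kernel data structures of the real Linux elevator.

    #include <stdio.h>
    #include <stdlib.h>

    /* Requests are kept sorted by block number and serviced in the current
     * direction of head movement; when no request remains in that direction,
     * the arm reverses and services the rest.                               */

    static int cmp(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    static void one_pass(int head, int req[], int n)
    {
        qsort(req, n, sizeof req[0], cmp);       /* the sorted "elevator" queue */
        for (int i = 0; i < n; i++)              /* sweep upward from the head  */
            if (req[i] >= head) printf("service block %d\n", req[i]);
        for (int i = n - 1; i >= 0; i--)         /* then reverse direction      */
            if (req[i] < head) printf("service block %d\n", req[i]);
    }

    int main(void)
    {
        int pending[] = { 98, 183, 37, 122, 14, 124, 65, 67 };
        one_pass(53, pending, sizeof pending / sizeof pending[0]);
        return 0;
    }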
In fact, Linux 2.6 provides four I/O schedulers, and the system administrator can select the one that best matches the nature of the workload in a specific installation. The No-Op scheduler is simply an FCFS scheduler; the other three schedulers are the following.
2. Deadline Scheduler (Linux 2.6): The elevator scheduler faces two critical problems. First, requests for distant blocks may experience continuous delay, and hence starvation, when a continuous stream of new requests keeps arriving in the close vicinity of the request currently being serviced. The second problem has the potential to be even more acute and is related to the operational difference between read and write requests. A read operation normally takes less time, but if the data do not exist in the buffer cache, a disk access is required, and consequently the issuing process is blocked until the read operation is completed. The process issuing a write request, in contrast, does not get blocked, since a write request merely copies data into a buffer (a memory-to-memory copy), and the actual write operation takes place sometime later as time permits. If a stream of write requests (e.g. to place a large file on the disk) arrives, it can then hold up a read request, and thereby a process, for a considerable time. Therefore, to provide better response times to processes, the IOCS adopts a design strategy that gives read operations higher priority than write operations.
The deadline scheduler also uses elevator (LOOK) scheduling as its basis and incorporates additional features to avoid large delays (the first problem) and to alleviate the read-versus-write problem. It employs three queues. The first one is, as before, the elevator queue containing the incoming requests sorted by track number, and the scheduler normally selects a request based on the current position of the disk heads. In addition, the same request is placed either at the tail of a read FIFO queue (second queue) for a read request or a write FIFO queue (third queue) for a write request. Thus, the read and write queues separately maintain lists of requests in the sequence in which those requests arrived. Elevator scheduling has an inherent problem: when a process performs a large number of write operations in one part of the disk, I/O operations in other parts of the disk are constantly delayed, and if a delayed operation is a read, it causes further substantial delays in the requesting process. To prevent such exorbitant delays, the scheduler assigns a deadline (expiration time) with a default value of 0.5 seconds to a read request (higher priority) and a deadline of 5 seconds to a write request. The deadline is attached to each request in the respective FIFO queue.
Normally, the scheduler dispatches requests from the sorted queue. When the task in relation to
a request is completed, it is removed from the head of the sorted queue and also from the respec-
tive FIFO queue. However, when the item at the head of one of the FIFO queues becomes older and
its deadline expires, then the scheduler next dispatches this request from that FIFO queue, plus a
couple of the next few requests from this queue, out of sequence before resuming normal schedul-
ing. As each request is dispatched, it is also removed from the sorted queue. In this way, the deadline
scheduler scheme resolves both the starvation problem and the read versus write problem.
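The decision logic can be pictured with the toy fragment below, which only chooses which of the three queues to serve next; the 0.5-second and 5-second values match the defaults quoted above, while the structure layout and function names are illustrative assumptions.

    #include <stdio.h>

    #define READ_DEADLINE  0.5   /* seconds (default quoted above) */
    #define WRITE_DEADLINE 5.0

    struct request {
        int    block;
        int    is_read;
        double issue_time;       /* when the request entered its FIFO queue */
    };

    /* Decide which queue the next dispatched request should come from. */
    static const char *pick_queue(const struct request *read_head,
                                  const struct request *write_head,
                                  double now)
    {
        if (read_head && now - read_head->issue_time > READ_DEADLINE)
            return "read FIFO (deadline expired)";
        if (write_head && now - write_head->issue_time > WRITE_DEADLINE)
            return "write FIFO (deadline expired)";
        return "sorted elevator queue (normal LOOK order)";
    }

    int main(void)
    {
        struct request r = { 120, 1, 0.0 };   /* a read issued at t = 0  */
        struct request w = { 900, 0, 0.0 };   /* a write issued at t = 0 */
        printf("t=0.2s -> %s\n", pick_queue(&r, &w, 0.2));
        printf("t=0.8s -> %s\n", pick_queue(&r, &w, 0.8));
        return 0;
    }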
3. Completely Fair Queuing Scheduler (Linux 2.6): This scheduler maintains a separate
queue of I/O requests for each process, and performs round robin between these queues.
The ultimate objective of this approach is to offer a fair share that consequently avoids
large delays for processes (see “UNIX Fair Share Scheduling”, Chapter 4).
4. Anticipatory Scheduler (Linux 2.6): Both the original elevator scheduler and the deadline
scheduler, like the other disk scheduling algorithms already discussed, keep dispatching a
new request that appears close to the currently executing request as soon as the currently
executing request is completed so as to obtain a better disk performance. Typically, a pro-
cess that performs synchronous read requests gets blocked (sleep) until the read operation
is completed and the data are available. Then it wakes up. This process usually issues the
next I/O operation immediately after waking up. When elevator or deadline scheduling is
used, the small delay that happens between receiving the data for the last read and issuing
the next read would cause the disk heads to probably pass over the track that contains the
data involved in the next read operation. This may cause the scheduler to turn elsewhere
for a pending request and dispatch that request. By virtue of the principle of locality, it is probable that successive reads from the same process will be to disk blocks that are close to one another. As a result, the next read operation of the process could be serviced only in the next scan of the disk, causing more delay to the process and more movement of the disk heads. This problem can, however, be avoided if the scheduler delays for a short period of time after satisfying a read request to see whether a new nearby read request is made; that next read request could then be serviced immediately, enhancing the overall performance of the system. This is exactly the strategy followed by anticipatory schedulers and implemented in Linux 2.6.
The anticipatory scheduler uses deadline scheduling as its backbone but adds a feature to handle synchronous I/O. When a read request is dispatched, the anticipatory scheduler causes the scheduling system to delay for up to 6 milliseconds, depending on the configuration. During this small delay, there is a high chance that the application that issued the last read request will issue another read request to almost the same region of the disk. This next read operation can then be serviced immediately in the same scan of the disk. If no such read request occurs, the scheduler resumes its normal operation using the prescribed deadline scheduling algorithm.
Experimental observations on reading numerous types of large files while carrying out a long streaming write in the background, and vice versa, reveal that the anticipatory scheduler exhibits a dramatic performance improvement over the others.
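A toy rendering of the anticipation decision just described is given below; the 6-millisecond window matches the figure quoted above, while the notion of "nearby" (64 blocks here) and all names are assumptions made purely for illustration.

    #include <stdio.h>

    /* After a synchronous read completes, the disk is kept idle for a few
       milliseconds in the hope that the same process issues another read
       close to the last one.                                              */

    #define ANTIC_WINDOW_MS 6     /* matches the 6-ms figure quoted above    */
    #define NEARBY_BLOCKS   64    /* "close enough" threshold: an assumption */

    /* Returns 1 if a newly arrived request is the anticipated nearby read
       and should be dispatched at once; 0 means the scheduler gives up and
       falls back to normal deadline scheduling.                            */
    static int anticipation_hit(int last_pid, int last_block,
                                int new_pid, int new_block, int elapsed_ms)
    {
        if (elapsed_ms > ANTIC_WINDOW_MS) return 0;   /* waiting window over */
        if (new_pid != last_pid)          return 0;   /* a different process */
        int dist = new_block - last_block;
        if (dist < 0) dist = -dist;
        return dist <= NEARBY_BLOCKS;                 /* nearby: serve it now */
    }

    int main(void)
    {
        /* the process that just read block 1000 asks for block 1008 after 2 ms */
        printf("%d\n", anticipation_hit(42, 1000, 42, 1008, 2));   /* prints 1 */
        /* an unrelated request arrives instead                                */
        printf("%d\n", anticipation_hit(42, 1000, 77, 5000, 2));   /* prints 0 */
        return 0;
    }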
• Cache manager: Windows uses a centralized cache manager to provide a caching service in main memory to all file systems of the I/O manager, the virtual memory manager, and network components. The part of a file held in a cache block (256 Kbytes) is called a view, and a file is considered a sequence of such views. Each view is described by a virtual address control block (VACB), and an array of such VACBs is set up by the cache manager when the file is opened. The VACB array makes it possible to determine quickly whether a needed part of a file is in the cache at any instant, and it also readily tells whether a view is currently in use.
When an I/O request is issued, the I/O manager passes it to the cache manager, which consults the VACB array for the file to ascertain whether the requested data are already part of some view in the cache. If so, it readily serves the request and takes appropriate actions on the respective VACB. Otherwise, it allocates a cache block and maps the view into that cache block. If the request is a read, it copies the required data into the caller's address space, possibly with the help of the VM manager. In addition, if a page fault occurs during the copy operation, the VM manager invokes the disk driver through NTFS to read the required page into memory, and that is performed in a non-cached manner.
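A toy sketch of the view lookup described above is shown below; the 256-Kbyte view size follows the text, while the per-file array bound, structure fields, and function names are assumptions made only for illustration.

    #include <stdio.h>

    /* A file is treated as a sequence of 256-Kbyte views; an array with one
       slot per view records which views are currently mapped in the cache.  */

    #define VIEW_SIZE (256 * 1024)
    #define MAX_VIEWS 16                   /* per-file array size: an assumption */

    struct vacb {
        long long file_offset;             /* offset of the view within the file */
        int       in_use;                  /* is some I/O currently using it?    */
    };

    static struct vacb *view_map[MAX_VIEWS];   /* NULL slot => view not cached   */

    /* Is the byte at 'offset' already present in some cached view? */
    static struct vacb *lookup_view(long long offset)
    {
        long long idx = offset / VIEW_SIZE;
        if (idx >= MAX_VIEWS) return NULL;
        return view_map[idx];              /* non-NULL: request served from cache */
    }

    int main(void)
    {
        static struct vacb v = { 0, 0 };
        view_map[0] = &v;                  /* first 256 KB of the file is cached  */
        printf("offset 100000 cached: %s\n", lookup_view(100000) ? "yes" : "no");
        printf("offset 600000 cached: %s\n", lookup_view(600000) ? "yes" : "no");
        return 0;
    }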
The buffering of a file is carried out by the cache manager, which analyzes the requests (read/write); if it observes that the previous few read requests indicate sequential access to a file, it starts prefetching subsequent data blocks. In the case of file updates (writes), the data to be written into a file are reflected in the view of the file held by the cache manager, which exploits two typical services to improve overall performance:
• Lazy write: The system performs updates in the cache only, not on the disk. Later, when
demand on the processor is low, the cache manager periodically nudges the VM manager
to write out the changed data to the disk fle. If a particular cache block is once again ref-
erenced and updated in the meantime, there is a substantial savings, since disk access is
not required.
• Lazy commit: This is for transaction processing and is very similar to the lazy write.
Instead of immediately marking a transaction as successfully completed, the system caches the committed information and later writes it to the file system log, at its convenience, via a background process.
• File system drivers: The I/O manager treats a file system driver as just another device driver and routes messages for certain volumes to the appropriate software driver (such as an intermediate or device driver) for the particular device adapter (controller) that implements the target file system.
• Network drivers or filter drivers: Windows uses network drivers to support its integrated networking capabilities and enable distributed applications to run. A filter driver can be inserted between a device driver and an intermediate driver, between an intermediate driver and a file system driver, or between the file system driver and the I/O manager API to perform any kind of function that might be desired. For example, a network redirector filter driver can intercept file commands intended for remote files and redirect them to remote file servers.
• Hardware device drivers: These drivers access the hardware registers (controller’s reg-
isters) of the peripheral devices through entry points provided in Windows Executive
dynamic link libraries (.dll). A set of these library routines exist for every platform that
Windows supports; because the routine names are the same for all platforms, the source
code of Windows device drivers is portable over different types of processor.
Windows supports both synchronous and asynchronous I/O operation. With synchronous I/O,
the application is blocked (sleep) until the I/O operation completes. In the case of asynchronous
I/O, an application initiates an I/O operation and then can continue with other processing while the
I/O operation is in progress in parallel.
Windows supports two types of RAID configurations:
• Hardware RAID, in which separate physical disks are combined into one or more logical
disks by the disk controller or disk storage cabinet hardware. In hardware RAID, redun-
dant information is created and regenerated entirely by the controller interface.
• Software RAID, in which noncontiguous disk space is combined into one or more logical
partitions by the fault-tolerant software disk driver, FTDISK. The software RAID facility
is available on Windows servers, which implement RAID functionality as part of the oper-
ating system, and can be used with any set of multiple disks. It implements RAID 1 and
RAID 5. In the case of RAID 1 (disk mirroring), the two disks containing the original and
mirror partitions may be on the same disk controller or on different disk controllers. When
they are on different disk controllers, it is often referred to as disk duplexing.
SUMMARY
Device management, comprising the I/O subsystems, controls all I/O devices and implements a generic device interface that allows the rest of the operating system to access the I/O devices with relative ease. It allocates devices to processes, schedules I/O requests on the devices, and deallocates the devices
when they are not needed. It also maintains data caches and page caches to hold data from block
as well as character devices. It, along with numerous device drivers, occupies a major part of the
operating system. I/O devices are connected to the host system through I/O controllers, each of
which is interfaced with the operating system by a respective device driver. This driver converts
the complex hardware interface to a relatively simple software interface for the I/O subsystem to
easily use. A representative UNIX device driver is presented as a case study. Different types of I/O operation, programmed I/O, interrupt-driven I/O, DMA, I/O channels, and finally I/O processors, are described. Clocks, also called timers, are discussed along with clock software, which generally takes the form of a device driver, even though a clock is neither a block device, like a disk, nor a character device, like a terminal or printer. This chapter also presents in brief the magnetic disk, with its physical characteristics and the organization of its different components, along with the major parameters that are involved in disk operation. Disk management in relation to formatting and subsequent data organization is described. I/O requests are handled by a disk scheduler, and different types of disk-arm scheduling policies, with illustrative examples and their merits and shortcomings, are described. RAID technology is discussed, which uses many identical disks in parallel as an array that is viewed as a single "logical disk" controlled by a single "logical controller"; RAID enhances disk performance and reliability to a large extent. The disk cache, along with its design considerations, as well as the page cache, and finally the combination of the two to form a unified cache, were described with their merits and drawbacks. Finally, device management as implemented in UNIX, Linux, and Windows is described in brief, with figures, to illustrate how this management is actually realized in practice in representative commercial systems.
EXERCISES
1. The transfer rate between a CPU and its attached memory is higher by an order of magni-
tude than mechanical I/O transfer. How can this imbalance cause ineffciencies?
2. Explain the architectures of device controllers with the help of a block diagram. What are
the specifc functions that a device controller performs?
3. What is meant by device-level I/O? Discuss the steps it follows while carrying out an I/O
operation physically.
4. Why is device management becoming more important as a constituent in the design of a
contemporary operating system? State and explain the main functions that device manage-
ment must perform, mentioning its common targets.
5. With an approximate structure of a device (unit) control block, explain how it is used by
the operating system while managing devices. Discuss the functions that are performed by
its main constituents.
6. What are interrupts? When a device interrupts, is there always a context switch?
7. What are device drivers? Explain with a diagram the general functions usually carried out
by a device driver. What is meant by reconfgurable device drivers? Explain why modern
operating systems use such device drivers.
8. What is meant by “device independence”? What is the role played by device-independent
software? What is its usefulness? What are the main functions that are typically common
in most device-independent software?
9. What is meant by buffering technique as used by an operating system? Should magnetic
disk controllers include hardware buffers? Explain your answer.
10. Why would you expect a performance gain using a double buffer instead of a single buffer
for I/O?
11. Using double buffering exploited by the operating system, explain the impact of buffering
on the runtime of the process if the process is I/O-bound and requests characters at a much
higher rate than the device can provide. What is the effect if the process is compute-bound
and rarely requests characters from the device?
12. In a certain computer system, the clock interrupt handler requires 2 msec. (including pro-
cess switching overhead) per clock tick. The clock runs at 60 Hz. What fraction of CPU is
devoted to the clock?
13. State the main physical characteristics of a magnetic disk. How are data normally orga-
nized in such a disk?
14. Write down the procedures and steps followed in an I/O operation involving a magnetic
disk when transferring information to and from the disk after the initiation of an I/O
operation.
15. What are the parameters that infuence the physical disk I/O operation? How are they
involved in such an operation? Compute the total time required to transfer the information
of a block of n words in a disk I/O operation when the disk has a capacity of N words per
track and each track is rotating at the speed of r revolutions per second. Write down the
assumptions made, if any.
16. What is meant by interleaving? When and how is the interleaving on a disk carried out?
What is the role of interleaving in the organization of data on a disk? How does interleav-
ing facilitate data access on a disk?
17. If a disk controller writes the bytes it receives from the disk to memory as fast as it
receives them, with no internal buffering, is interleaving conceivably useful? Discuss with
reasons.
18. Explain the notion of interleaving: how it works and how it affects the logical-to-physical translation of disk addresses. Should knowledge of interleaving be confined only to the related disk driver? Explain with reasons.
19. A disk is double interleaved. It has eight sectors of 512 bytes per track and a rotation rate
of 300 rpm. How long does it take to read all the sectors of a track in order, assuming that
the arm is currently positioned correctly and ½ rotation is needed to get sector 0 under the
head? Compute the data rate. Now carry out the problem with a non-interleaved disk with
the same characteristics. Compare and contrast interleaving and non-interleaving in this
regard.
20. Calculate how much disk space (in sectors, tracks, and surfaces) will be required to store
logical records each of 120 bytes in length blocked 10/physical record if the disk is fxed
sector with 512 bytes/sector, 96 sectors per track, 110 tracks per surface, and 8 usable sur-
faces. The fle contains 4440 records in total. Do not consider any fle header record(s) or
track indexes, and assume that records cannot span two sectors.
21. Discuss the delay elements that are involved in a disk read or write operation. Out of those,
which element would you consider the most dominant and why?
22. What does disk scheduling mean? Why is it considered an important approach that requires
the operating system to include it?
23. What scheduling algorithm is used by contemporary Linux operating system? Discuss the
strategy and principles of its operation.
24. The head of a moving-head disk system with 200 tracks numbered 0 to 199 is currently
serving a request at track 15 and has just fnished a request at track 10. The queue of pend-
ing requests is kept in the order 8, 24, 16, 75, 195, 37, 55, 75, 153, 3. What is the total head
movement required to satisfy these requests for the following disk scheduling algorithms?
a. FCFS, b. SSTF, c. SCAN, d. LOOK, e. C-SCAN, f. C-LOOK.
25. Disk requests arrive to the disk driver for cylinders 10, 22, 20, 2, 45, 5, and 35, in that order.
A seek takes 6 msec. per cylinder movement. How much seek time is needed for: a. FCFS,
b. SSTF, c. SCAN, d. LOOK, e. C-SCAN, f. C-LOOK?
26. Discuss the impact of disk scheduling algorithms on the effectiveness of I/O buffering.
27. An input fle is processed using multiple buffers. Comment on the validity of the following
statements:
a. “Of all the disk scheduling algorithms, the FCFS strategy is likely to provide the best
elapsed time performance for the program”.
b. “Sector interleaving is useful only while reading the frst few records in the fle; it is not
so useful while reading other records”.
28. Defne RAID. How does RAID handle several inherent drawbacks of a disk accessed for
data transfer? What are the common characteristics that are observed in all RAID levels?
Compare and contrast the performance of the seven RAID levels with respect to their data
transfer capacity.
29. What are the implications of using a buffer cache? Describe the implications of the UNIX buffer cache for file system reliability. UNIX supports a system call, flush, to require the kernel to write buffered output onto the disk. Would you suggest using flush to improve the reliability of your files?
30. State and explain the different types of RAID confguration used in Windows.
Learning Objectives
7.1 INTRODUCTION
All computer systems ultimately process information, and storing this information in main memory within a specific process's address space is only a temporary measure at best for the duration the process remains active; the information is automatically lost when the process terminates or the computer is turned off or crashes. Moreover, main memory does not usually provide adequate space for this purpose. In addition, information confined within a specific process's address space cannot be accessed by other processes intending to share it (or a part of it) simultaneously. All these facts, along with many other issues, dictate that stored information must be made independent, sharable, and long-lived. In other words, the essential requirement is that the storage space must be large enough to store the bulk of the information permanently, must be independent of any specific process, and must survive for a long time except in the event of catastrophe. All these requirements can be satisfied only if the information is stored on secondary devices, such as disk, tape, or other external media, in specific units called files. Processes can then operate independently on these files at will. Files produced in this way are persistent: they have a life outside of any individual application that uses them for input/output, independent of the state of the processes that use them. A file will then exist permanently unless and until it is explicitly removed by its owner.
The concept of a file is a central theme in the vast majority of computer applications, except for real-time applications and some other specialized applications that seldom use files. In general, applications use one or more files to input information, process data, and finally produce output, typically in the form of files that are permanently saved for future use. The ultimate objective of organizing files in this manner is to enable the owner of a file to access it, modify it, and save it, and also to authorize the owner to protect the file's integrity. To implement these objectives, a part of the operating system is solely entrusted with managing and controlling files and their contents located on secondary storage. This part, known as the file management system, is primarily concerned with structuring, naming, accessing, using, and protecting files, along with other major topics related to files.
Files are usually stored on various types of physical devices, such as disk drives, magnetic tapes, similar peripheral devices, or semiconductor memory. These storage devices vastly differ in the nature of their physical organization as well as their operation. To relieve users of all the peculiarities of the underlying storage devices, the operating system hides these details and provides the user a uniform logical view of the stored information for the sake of convenience. Irrespective of the specific storage volume that contains it, a given file is designated as online or offline. When the combined size of all files in use in the system exceeds the online capacity of the available storage devices, volumes (disk or tape media) may be dismounted and new volumes added to allow ongoing file operations to continue. This chapter, however, emphasizes mostly the management of online files, referred to simply as files.
The file management system is supposed to hide all device-specific aspects of where a file resides and offer a clean abstraction of a simple, uniform space of named files. Some systems extend this view with a further abstraction of even the input/output system, in which all I/O devices appear to the user as a set of files. The primary concern, however, is the various services that the file management system offers, both for ordinary files and for files related to I/O device management. There are many other issues related to the file management system that are implemented differently by different operating systems. This chapter discusses the topics related to files and file management systems (FMS) that are common to any generic operating system. At the end, the FMS of contemporary representative popular operating systems, such as UNIX, Linux, and Windows, are separately discussed in brief to give an overview of the different types of actual FMS that are implemented in practice.
More details on this topic are given on the Support Material at www.routledge.com/9781032467238.
7.2 FILES
A file is essentially a container for a collection of logically related pieces of information, normally stored physically on some type of peripheral device for long-term existence but presented by most operating systems as device-independent. A file often appears to users as a linear array of characters or record structures, with a collection of similar records stored somewhere in the system. To users and applications, a file is treated as a single entity and may be referenced by name. Files can be created, read, written, and even removed using various system calls. Files thus created can also carry access permissions (at the file level) that allow controlled sharing by many concurrent processes. In some sophisticated systems, such access controls can be enforced at the record level or even lower, at the field level. A file can be used for read/write only after it is opened, and similarly, after its use, it should be closed using system calls.
The file system itself does not generally interpret the contents of a file; to it, a file is simply a container of information. However, a file system maintains a set of attributes associated with the file, which include the owner, creation time, time last modified, access privileges, and similar information.
The file system viewed from the user's end mainly consists of:
The file system viewed from the designer's end consists mainly of:
The exact rules for file naming vary somewhat from system to system, but almost all operating systems allow strings of letters, combinations of letters and digits, or even special characters as valid file names. Some file systems distinguish between uppercase and lowercase letters, whereas others do not. UNIX falls into the first category.
Many operating systems support two-part file names, with the two parts separated by a period, as in ABC.TXT. The part following the period is called the file extension and usually indicates something about the file. Here, .TXT simply reminds the owner that it is a text file rather than conveying any specific information to the computer. On the other hand, a COBOL compiler will require that a file submitted for compilation have the extension .COB; otherwise it will not be accepted for compilation. Different operating systems follow different conventions in file naming, both with respect to the length of the file name and the length and number of extensions to be used. For example, in MS-DOS, file names are 1 to 8 characters plus an optional extension of 1 to 3 characters. In UNIX, the size of the extension, if any, is up to the user, and a file may even have two or more extensions, as in prog.c.z, where c indicates that it is a C program file and z is commonly used to indicate that the file (prog.c) has been compressed using the Ziv–Lempel compression algorithm.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
Most operating systems support several types of files, most notably regular files and directories. UNIX also uses character-special and block-special files. Windows also has metadata files. We will define metadata in the section "File Attributes".
• Regular files are generally ASCII (text) files or binary files containing user information. The distinct advantage of ASCII files is that they can be displayed and printed as they are, and they can be edited with an ordinary text editor. Furthermore, if a large number of programs use ASCII files for input and output, it is easy to connect the output of one program to the input of another (as found in UNIX pipes).
• Binary files are simply files that are not ASCII files. Listing them on a printer or displaying them on a terminal gives an incomprehensible listing, apparently full of random junk. Usually they have some internal structure. Although a file is nothing but a sequence of bytes, a binary file, if it is an executable file, must have a definite format, which is defined differently by different operating systems. In general, it has five sections: the first section is a header, which consists of a magic number identifying the file as an executable file as well as other necessary related information, and the other sections are text, data, relocation bits, and a symbol table, as shown in Figure 7.1a.
• Another example of a binary file is an archive, found in many contemporary operating systems. A representative format of such a file is depicted in Figure 7.1b. It consists of a collection of library procedures (modules) that have been compiled but not linked. Each one is prefaced by a header telling its name, creation date, owner, protection code, and size. As in an executable file, the module headers are full of binary numbers.
FIGURE 7.1 A pictorial representation of the formats of (a) an executable file and (b) an archive file created by a generic modern operating system.
All operating systems must recognize at least one file type: their own executable file, but some recognize more. Some systems even go to the extent of inspecting the creation time of an executable file, locating its corresponding source file, and checking whether the source has been modified since the binary file was made; if it has, the source is automatically recompiled. In UNIX, the make program has been built into the shell. Here, file extensions are mandatory so as to enable the operating system to decide which binary program was derived from which source.
• A directory is a system file associated with the file management system for maintaining the structure of the file system; it consists of a collection of files and directories. Directories and their various types of structures will be discussed later.
• Character-special files are related to input/output and are used to model serial I/O devices, such as terminals, printers, and networks. Block-special files are used to model disk-like media, tapes, and so on.
External name, owner, user, current size, maximum size, record length, key position, key length, time of creation, time of last change, time of last access, read-only flag, hidden flag, ASCII/binary flag, system flag, archive flag, temporary flag, random access flag, lock flags, current state, sharable, protection settings, password, reference count, storage device details, and so on.
Most minicomputer and microcomputer (desktop) file systems do not require all of these items to handle their files.
Brief details on fle descriptors with these attributes are given on the Support Material at www.
routledge.com/9781032467238.
Create/Creat: A new file is defined and created with no data and positioned within the structure of files. The purpose of the call is to announce that the file is being created and to set some of its attributes.
Delete: A file is removed from the file structure and destroyed when it is no longer required. In addition, some operating systems automatically delete any file that has not been used in n days.
Delete_One: This deletes an existing record from the current position in the file, sometimes for the purpose of preserving the existing file structure.
Open: This function opens the file before it is used and allows the system to fetch the attributes and to record in the file descriptor that the file is in use. It locates the referenced file by searching through the directories, lists the disk addresses of the file in main memory, allocates buffers in main memory, and establishes a runtime connection between the calling program and the called file for subsequent operations on the file. A possible format of the OPEN system call is open(file_name, access_mode).
While invoking the OPEN service, the user specifies the file name and the access mode. The file system verifies the user's authority to access the file in the intended mode before opening the file.
Close: The file should be closed with respect to a process to free up the internal table space used to keep its attributes and disk addresses, if permissible. The process may no longer perform operations on the file until it opens the file once again. Many systems encourage this by imposing a maximum number of open files per process.
Read: This system call (operation) copies all or a portion of the data (a byte or a block of bytes), beginning at the current file position, from a file into the buffer associated with that file. The caller must specify the amount of data needed. If the file position is L bytes from the end when read is invoked and the number of bytes to be read is greater than L, an end-of-file condition is returned. A possible format of a READ system call is read(file_id, buffer, byte_count).
Write: Data are written to the file at the current position. If the current position is the end of the file, the file's size increases. If the current position is anywhere within the file, existing data are overwritten and lost forever. One possible format of a WRITE system call is write(file_id, buffer, byte_count).
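For illustration only, the short POSIX-style program below strings these calls together to create, write, close, reopen, and read back a file; the exact call formats and flag names differ from system to system, and the file name demo.txt is arbitrary.

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        const char msg[] = "hello, file system\n";
        char buf[128];

        /* create (or truncate) the file and write at the current position */
        int fd = open("demo.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, msg, strlen(msg)) < 0) perror("write");
        close(fd);                      /* free the per-process table entry */

        /* reopen for reading; read() returns the number of bytes obtained,
           and 0 signals the end-of-file condition                          */
        fd = open("demo.txt", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        ssize_t n = read(fd, buf, sizeof buf - 1);
        if (n >= 0) { buf[n] = '\0'; fputs(buf, stdout); }
        close(fd);
        return 0;
    }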
Many other useful operations are provided by almost all file systems. However, the nature of the operations to be performed on a file depends heavily on, and in turn influences, the organization of the file.
Brief details on many other common useful operations for files are given on the Support Material at www.routledge.com/9781032467238.
structured in the fle. The “record access pattern” thus depends on the fle organization that
describes the order in which records in a fle are accessed by a process. The application pro-
grammer defnes the format for logical records and also the access routines for reading and
writing records. The fle system invokes the programmer-supplied access routines when read-
ing from and writing to the fle. Different types of organization of fles and their related access
mechanism exist, and each one is equally popular, but the selection of a particular one mostly
depends on the environment and the type of the application that uses the fle. Since fle organiza-
tion and fle access are directly linked to one another, they will be discussed separately in detail
in the following subsection.
• File-related services: create fles, open fles, read fles, write into fles, copy fles, rename
fles, link fles, delete fles, seek records, and so on.
• Directory-related services: make directory, list directory, change directory, rename
directory, and so on.
• Volume-related services: initialize the volume, mount the volume, dismount the volume,
verify the volume, backup the volume, compact the volume, and so on.
A fle server is essentially just a user process, or sometimes a kernel process, running on a machine
that implements the fle services. A system may have one fle server or several servers, each one
offering a different fle service, but the user should not be aware of how many fle servers there are
or what the location or function of each one is. All they know is that when they call the procedures
specifed in the fle service, the required work is performed somehow. A system, for example, may
have two services that offer UNIX fle service and Windows fle service, respectively, with each
user process using the one appropriate for it. It is hence possible to have a terminal with multiple
windows, with UNIX programs running in one window and Windows programs running in another
window, with no conficts. The type and number of fle services available may even change as the
system gradually improves.
• To provide a mechanism in storing data in such a way so that the users can perform differ-
ent types of operations (as mentioned before) on their fles.
• To ensure the validity of data stored in the fles.
• To offer the users the needed I/O support while using various types of different storage
devices.
• To provide a standardized set of I/O interface routines while actual physical devices are
to be used.
• To provide needed I/O support to multiple users for the purpose of sharing and distribution
of fles in the case of multi-user systems.
• To safeguard the data as far as possible from any sort of accidental loss or damage.
• To attain optimum I/O performance both from the standpoint of the system with respect
to overall throughput, as well as from the user’s point of view in terms of response time.
FIGURE 7.2 A block diagram of a representative scheme of fle management system with its required
elements used in a generic modern operating system.
output and unblocked after input. To support block-I/O of fles, several functions are needed, and
those are carried out at the physical I/O level (physical IOCS). The physical IOCS layer, which is
a part of the kernel in most operating systems, implements device-level I/O (discussed in Chapter 6).
We assume that the physical IOCS is invoked through system calls, and it also invokes other
functionalities of the kernel through system calls. The part of physical IOCS which belongs to the
operating system implements different policies and strategies by using several modules, such as disk
scheduling and fle allocation, to achieve many different targets, including high device-throughput.
These modules, in turn, invoke physical IOCS mechanisms to perform actual I/O operation. In fact,
process-level mechanisms that implement fle-level I/O use physical IOCS policy-modules to realize
efficient device-level I/O. Finally, although blocking and I/O buffering are shown as separate layers in the interest of clarity, they sometimes exist within the access-method layer and are then available only to access-method policy modules, not directly to the file system layer.
This hierarchical, level-wise division at least suggests what is the concern of the file management system and what belongs in the domain of the operating system kernel. In fact, this division proposes that the file management system be developed as a separate system utility that uses kernel support to realize the needed file-I/O operations.
FIGURE 7.3 A schematic block diagram of a representative fle system software architecture of a generic
modern operating system.
At the lowest level is the device driver, which is a physical IOCS that communicates directly
with hardware peripheral devices or the associated controllers or channels to implement certain
mechanisms at the device level, such as I/O initiation on a device, providing I/O operation status,
processing the completion of an I/O request, and recovery of error, if any. The policy employed by
the device driver is to optimize the performance of the related I/O device. Device drivers are com-
monly considered part of the operating system.
The next higher level is referred to as basic fle system, which is also a part of physical IOCS.
This level is the frst interface that deals with blocks of data defned outside the computer system
environment that are to be exchanged with secondary devices, mostly disk or tape systems. It is thus
involved only in the placement of those blocks on secondary storage devices and the buffering of
those blocks in main memory. It is not at all concerned about the content or meaning of the data, nor
the structure of the fles involved. The basic fle system (a part of physical IOCS) is also considered
part of the operating system.
The basic I/O supervisor is responsible for implementing certain mechanisms at the fle level,
such as fle-I/O initiation and termination. It selects the device on which the prescribed fle resides
to perform fle-I/O and is also concerned with scheduling device access (such as disk or tapes) to
optimize fle access performance. To achieve all this, specifc structures are maintained that control
fle-I/O by way of dealing with device-I/O and other related aspects, and fle status. Assignment
of I/O buffers and allocation of secondary memory space are also carried out at this level, but the
actual placement is done at the next level. The basic I/O supervisor is also considered part of the
operating system.
Logical I/O is entirely related to users and applications. This module deals with everything with
the records of a fle with respect to their organization and access. In contrast to the basic fle system
that deals with blocks of data, logical I/O is concerned with the fle records that are defned and
described by users and applications. The structuring of a directory to enable a user to organize the
data into logical groups of fles as well as the organization of each individual fle to maintain basic
data about fles are provided at this level. In fact, logical I/O is engaged to the extent of providing
general-purpose record I/O capability.
Access method is perhaps the frst level of the fle management system that directly interacts
with the user. This level provides the primary standardized interface between the application and
the fle system and consequently with the devices that physically hold the data. Different types of
access methods exist that describe different ways of accessing and processing the data residing in
a particular fle. Each type of access method, in turn, tells something about the corresponding fle
structure that can be the best ft for effective storing of data. Some of the most commonly used
access methods are shown in Figure 7.3, and those are explained separately in a later subsection.
of only fve fundamental organization schemes. Most real-life systems use structures that fall into
one of these categories or some other structures that may be realized with a combination of these
organization types. The fve fundamental organization methods for fles are:
• The pile
• The sequential fle
• The indexed-sequential fle
• The indexed fle
• The direct or hashed fle
FIGURE 7.4 Representative block diagrams of different types of commonly used fle organization employed
in generic modern operating systems.
normally skipped around or processed out of order. Sequential fles can be built on most I/O
devices; hence these fles are not critically dependent on device characteristics. Consequently, a
sequential fle can be migrated easily to a different kind of device. Since sequential fles can be
rewound, they can be read as often as needed. Therefore, sequential fles with large volume are
mostly convenient when the storage medium is magnetic tape, but they can be stored easily on
any type of disk media.
Sequential fles exhibit poor performance for interactive applications in which queries and
updates of individual records when carried out require a sequential search of the fle, record by
record, for a key match to arrive at the desired record, if present. Consequently, it causes a substan-
tial delay in accessing the target record if the fle is a relatively large one. Additions/deletions to the
fle also pose similar problems. However, the procedure normally followed in this case is to create
a separate transaction fle with the new records to be added/deleted, sorted in the same order on the
key feld as the existing master fle. Periodically, a batch update program is executed that merges
the new transaction fle with the existing master fle based on the keys to produce a new up-to-date
master fle, with the existing key sequence remaining unchanged.
However, addition of records to this fle is carried out in a different way. Each record in the main
fle contains a link feld offered by the fle system, not visible to the application, used as a pointer
to the overfow fle. When a new record is to be inserted into the fle, it is added to the overfow
fle. The record in the main fle that immediately precedes the new record in logical sequence is
accordingly updated to contain a pointer that will indicate the new record in the overfow fle. If
the immediately preceding record is itself in the overfow fle, then the pointer associated with that
record will be updated accordingly. Similar to the sequential fle, the indexed sequential fle also is
occasionally merged with the overfow fle in batch mode.
While preserving all its merits over sequential organization, indexed-sequential organiza-
tion also has a sequential nature without sacrificing anything and permits sequential process-
ing, particularly when the application requires processing almost every record in the file that
is indexed sequentially. To process the entire file sequentially, the records of the main file are
processed in sequence as usual until a pointer to the overflow file is found. In that situation,
accessing and subsequent processing then continue in the overflow file until a null pointer is
encountered, at which time accessing of the main file is resumed once again from the point
where it left off.
When the main file is huge, the index file becomes large, and searching for a key over such a large index file may be expensive. In order to realize greater efficiency in access for such a large main file, multiple levels of indexing can be employed. Here, the lowest level of the index file is operated as a sequential file, and a higher-level index file is created for that file. Searching for a record in the main file then starts from the highest-level index and moves down through the levels until the target record in the main file is located, if present. In this way, a substantial reduction in average search length can be attained. But searching several index files one after another at different levels is itself time-consuming and adds proportionately to the total access time. This aspect should be taken into consideration when designing a multi-level indexed sequential file and deciding how many levels of index file to create; this is one of the trade-offs in the design.
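As a rough illustration (the numbers are assumed, not taken from any particular system), suppose each index block holds 100 entries and the main file holds 10,000 records. A single-level index then needs 100 index blocks, and locating a key may require scanning a large part of them. Adding one more level, a single top-level block whose 100 entries each point to one lowest-level index block, lets any record be reached with one access to the top-level block, one to the appropriate lowest-level block, and one to the data itself, that is, three accesses in total. Each further level multiplies the number of records that can be covered by 100 but also adds one more access to every lookup, which is precisely the trade-off mentioned above.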
fields, thereby saving storage space. However, the table storage requirements are increased. The overhead to manage the indexes also increases, as the addition of a new record to the existing main file requires that all of the respective index files be updated. Similarly, record deletions can be expensive, because dangling pointers must be removed from the respective indexes. Above all, access times can be appreciably reduced by using this file organization.
Indexed files are used mostly in applications that are query-based, interactive in nature, and time-critical. Here, exhaustive processing of records rarely happens. Representative examples of such application areas are railway reservation systems, banking operations, and inventory control systems.
Another drawback observed in the use of direct files is excessive device dependence. Characteristics of an I/O device are explicitly assumed and used by the file system when the address calculation is carried out. Rewriting the file on another device with different characteristics, such as a different track capacity, implies modification of the address calculation method and its related formulas. In addition, sequential processing of records in a direct file compares poorly with similar processing of records carried out on a sequential file or even on an indexed-sequential file. A sketch of such an address calculation appears below.
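A toy sketch of such an address calculation, with made-up constants, is shown below; it is meant only to show how device characteristics creep into the computation, not to reflect any particular file system.

    #include <stdio.h>

    /* The record key is hashed straight to a bucket (block) number, which is
       then translated to a device location using the device's geometry.     */

    #define NUM_BUCKETS       1000   /* data blocks reserved for the file (assumed) */
    #define BLOCKS_PER_TRACK    96   /* device characteristics baked into the code  */
    #define TRACKS_PER_CYL       8

    static void locate(unsigned key)
    {
        unsigned block    = key % NUM_BUCKETS;            /* hash: key -> block */
        unsigned track    = block / BLOCKS_PER_TRACK;
        unsigned cylinder = track / TRACKS_PER_CYL;
        printf("key %u -> block %u (cylinder %u, track %u, sector %u)\n",
               key, block, cylinder, track % TRACKS_PER_CYL,
               block % BLOCKS_PER_TRACK);
        /* moving the file to a device with a different track capacity would
           change the cylinder/track/sector mapping computed here             */
    }

    int main(void)
    {
        locate(123456);
        locate(2024);
        return 0;
    }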
7.10.1 STRUCTURE
A file system normally contains several directories. The format of the directory and the types of information to be kept in each directory entry (an example is shown in Table 7.2 on the Support Material at www.routledge.com/9781032467238) differ widely among various operating systems. Some of this information may often be included in a header record associated with the file, thereby reducing the amount of storage required for the directory and making it convenient to keep the directory in main memory, either partly or entirely, for the sake of performance improvement. However, the directory structure of the file system broadly falls into two categories. The simplest form of structure for a directory is the single-level or flat directory. The other, based on a more powerful and flexible approach, is almost universally adopted: the hierarchical or tree-structured directory.
FIGURE 7.5 A pictorial representation of a typical directory entry used in a generic modern operating system, with fields for preliminary information, location information, access control information, flags, and usage information.
FIGURE 7.6 Schematic block diagrams of representative structuring of one-level (flat) directories used in generic modern operating systems: (a) attributes in the directory entry and (b) attributes lying elsewhere outside the directory.
A flat directory is a single master directory containing file names and other attributes, with one entry stored per file. Two possible structures may exist. In the first approach, shown in Figure 7.6a, each entry contains the file name and all other attributes, including the disk addresses where the data are stored. The other possibility is shown in Figure 7.6b: here, the directory entry holds the file name and a pointer to another data structure where the attributes and the disk addresses are found. Both schemes are commonly used; a sketch of the two entry layouts follows.
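The two layouts can be pictured with the C structures below; the field names and sizes are arbitrary assumptions chosen only to make the contrast concrete.

    #include <stdio.h>

    #define NAME_LEN 14
    #define NBLOCKS  10

    struct dir_entry_a {                 /* scheme (a): attributes in the entry   */
        char     name[NAME_LEN];
        unsigned size;
        unsigned owner;
        unsigned blocks[NBLOCKS];        /* disk addresses of the file's data     */
    };

    struct dir_entry_b {                 /* scheme (b): attributes kept elsewhere */
        char     name[NAME_LEN];
        unsigned attr_index;             /* index of a separate attribute record  */
    };

    int main(void)
    {
        printf("entry size, scheme (a): %zu bytes\n", sizeof(struct dir_entry_a));
        printf("entry size, scheme (b): %zu bytes\n", sizeof(struct dir_entry_b));
        return 0;
    }

One practical consequence visible even in this sketch is that scheme (b) keeps directory entries small, so large directories are cheaper to scan and to cache in main memory.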
When a file is opened, the file management system searches its directory until it finds the name of the file to be opened. It then extracts the attributes and disk addresses, either directly from the directory entry or from the data structure pointed to, and puts them in a table in main memory. All subsequent references to the file then use this table to get the required information about the file.
While flat directories are very conducive to single-user systems, they have major drawbacks even there, particularly when the total number of files is large enough that unique naming of files becomes troublesome and searching the directory for a particular file takes significant time. Moreover, there is no provision for organizing these files in a suitable way, such as by type, application, or user, for multiple users or in a shared system. Last but not least, since a flat directory has no inherent structure, it is difficult to conceal portions of the overall directory from users, even when this is critically required. Thus, the flat directory as a whole is inconvenient and inadequate when multiple users share a system, or even for a single user with many files of different types.
• Two-Level Directories
The problems faced by flat directories have been alleviated by the use of a two-level directory structure in which there is a master file directory (MFD), and every account is given a private directory known as a user file directory (UFD). The master directory has an entry for each user directory, providing its address and access control information. Each user directory is a simple list of the files of that user, in which each file is described by one entry. A user who wants to separate various types of files can use several accounts with different account names. Although this structure certainly offers some distinct advantages compared to the straightforward flat directory structure, it still provides users with no help in structuring a collection of a large number of files in general.
Brief details on this topic with a figure are given on the Support Material at www.routledge.com/9781032467238.
Most modern file systems therefore use a more general approach, the hierarchical or tree structure, which is basically an uprooted tree. Here, the root of the tree is a master directory that may have two types of nodes: directories and ordinary files. A directory can have children (sub-nodes); a non-directory cannot. Directories may be empty and are usually user directories. Each of these user directories, in turn, may have subdirectories and files. Files are not restricted to a particular level in the hierarchy; they can exist at any level. However, the top few levels of the directory structure usually tend to contain mostly directories, though ordinary files can reside there, too. The obvious question then arises as to how this hierarchical structure is organized.
• The frst aspect concerns the number of levels when the fle system may provide a fxed
multi-level directory structure in which each user has a user directory (UD) containing
a few directories, and some of these directories may contain other directories, too. A user
can then group the fles based on some functionally related meaningful criterion, such as
the name of activities to which they pertain. However, this approach is not adequate and
relatively lacks fexibility, since it provides a fxed number of levels in the hierarchy and a
fxed number of directories at each such level.
• The second approach provides more generalization with better flexibility, in which a directory is treated as a file, and it can be created in a similar way to how a file is created. Here, the directory may appear as an entry in another directory with its flag field (Figure 7.5) set to the value D to indicate that it is a directory file. The root directory as usual contains information about the user directories of all users. A user creates directory files and ordinary files to structurally organize their information as needed. Figure 7.7 shows the directory tree for a user (say, Y), and in this way, the directory trees of all such users together constitute the directory tree of the file system.
• The next aspect is naming, by which files can be unambiguously accessed by following a path from the root (master) directory down various branches until the desired file is reached. The path that starts from the root directory and traverses in sequence through the directory hierarchy of the file system up to the targeted file constitutes a pathname made up of the directory names traversed along the way; this is known as the absolute pathname of the file.
FIGURE 7.7 A representative example showing the implementation of a tree-structured directory for files used in a generic modern operating system.
Several files with the same file name belonging to different directories are permitted, since they have unique, differing absolute pathnames. While an absolute pathname uniquely locates the desired file, it is often inconvenient to spell out the entire pathname every time the file is referenced. Alternatively, the pathname for a file can start from the user's current directory and then traverse down various branches until the desired file is reached; this is called a relative pathname, which is often short and convenient to use. However, a relative pathname can sometimes be confusing, because a particular file may have a different relative pathname from different directories. Moreover, a user may, during execution, change the current directory to some other working directory by navigating up or down the tree using a change-directory command. To facilitate this, each directory stores information about its parent directory in the directory structure.
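To make the distinction between absolute and relative pathname resolution concrete, the following minimal C sketch walks an in-memory directory tree one component at a time, starting either from the root (absolute) or from a current directory (relative). The node layout, tree contents, and helper names are purely illustrative and are not taken from any particular file system.

/* Minimal sketch of pathname resolution in a tree-structured directory.
   The node structure and directory contents are illustrative only. */
#include <stdio.h>
#include <string.h>

#define MAXCH 8

struct node {
    const char  *name;
    struct node *parent;
    struct node *child[MAXCH];   /* sub-directories or files */
    int          nchild;
};

static struct node *lookup(struct node *dir, const char *name)
{
    for (int i = 0; i < dir->nchild; i++)
        if (strcmp(dir->child[i]->name, name) == 0)
            return dir->child[i];
    return NULL;
}

/* Resolve 'path' starting from 'start' (the root for an absolute pathname,
   the current directory for a relative one). ".." moves to the parent. */
static struct node *resolve(struct node *start, const char *path)
{
    char buf[256];
    strncpy(buf, path, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    struct node *cur = start;
    for (char *comp = strtok(buf, "/"); comp && cur; comp = strtok(NULL, "/"))
        cur = (strcmp(comp, "..") == 0) ? cur->parent : lookup(cur, comp);
    return cur;
}

static void add(struct node *dir, struct node *entry)
{
    entry->parent = dir;
    dir->child[dir->nchild++] = entry;
}

int main(void)
{
    struct node root = { "/", NULL, {0}, 0 };
    struct node home = { "home", NULL, {0}, 0 };
    struct node y    = { "y", NULL, {0}, 0 };
    struct node rpt  = { "report.txt", NULL, {0}, 0 };

    add(&root, &home);
    add(&home, &y);
    add(&y, &rpt);

    /* Absolute pathname: traversal starts at the root directory. */
    struct node *a = resolve(&root, "home/y/report.txt");
    /* Relative pathname: traversal starts at the current directory /home. */
    struct node *b = resolve(&home, "y/report.txt");

    printf("absolute lookup: %s\n", a ? a->name : "not found");
    printf("relative lookup: %s\n", b ? b->name : "not found");
    return 0;
}

Both lookups reach the same node; only the starting directory and the length of the path differ, which is exactly the trade-off between absolute and relative pathnames described above.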
Sharing of files among users raises several issues that the directory organization must address:
• How shared files can be accessed more conveniently by users in the existing tree-type directory structure.
• How the access rights to the file to be shared are to be offered to the other users.
• The management of simultaneous access to shared files by different users.
Use of the tree structure leads to a fundamental asymmetry when different users access a shared file. The file would always exist in some directory belonging to one of the users, who can access it with a shorter pathname than other users. A user wishing to access a shared file of another user can do so by visiting a longer path through two or more directories. This problem can be resolved by organizing the directories in an acyclic graph structure in which a file can have many parent directories; hence, a shared file can be pointed to by the directories of all those users who have rights to access it. To implement this, links can be used, which gives rise to the construction of an acyclic graph structure.
• Links: A link (sometimes called a hard link) is a directed connection between two existing files (a directory is also a file) in the directory structure. The link is created using the appropriate command in which a link name is given that links the target shared file (or directory) to be accessed. Once the link is established, the target file can be accessed as if it were a file with that name (the link name) in the current directory. The link name entry in the current directory is identified by the value L in its flag field.
When the link mechanism is used, the file system normally maintains a count indicating the number of links pointing to a file. The UNIX file system uses inodes that contain a field to maintain a reference count with each file, indicating the number of links pointing to it. This reference count is incremented or decremented when a link is added or deleted, and the directories concerned and the shared file must also be manipulated accordingly. A file can be deleted only if its reference count is 0. Thus, if a file has many links to it, it cannot be deleted even if its owner intends to do so by executing a delete operation.
A major limitation in using hard links for file sharing is that directories and inodes are data structures of a single file system (partition) and hence cannot point to an inode on another file system. Moreover, a file to be shared can have only one owner and one set of permissions, and thus all the responsibilities relating to the file, including the disk space held by it, are entrusted only to its owner.
An alternative way to share files is to create a symbolic link, which is itself a file and contains a pathname of another file. Thus, if a file X is created as a symbolic link to a shared file Y, then file X will contain the pathname of Y provided in the link command. The directory entry of file X is marked as a symbolic link, and in this way the file system knows how to interpret its contents. Symbolic links can create dangling references when files are deleted. One interesting feature of this kind of link is that it can work across mounted file systems. In fact, if a means is provided for pathnames to include network addresses, such a link can then refer to a file that resides on a different computer. UNIX and UNIX-like systems call it a symbolic link, whereas in Windows it is known as a shortcut, and in Apple's Mac OS it is called an alias. However, one of the disadvantages of symbolic links is that when the target file is deleted, or even just renamed, the link then becomes an orphan. (A small POSIX demonstration of both kinds of link appears after this list.)
• Access Rights: The file system should provide an adequate and sufficiently flexible mechanism for extensive sharing of files among many different users, along with suitable protection mechanisms to secure and control the usage of these shared files. Typically, the file system provides a wide range of access rights to the users of shared files. Different types of access rights are provided by different operating systems (file systems); however, the rights offered normally constitute a hierarchy in the access control list (to be discussed later), meaning that each such right implies those that precede it. Thus, for example, if a particular user is offered the right to append to a specific file, it is implied that the same user automatically enjoys the rights that precede it, such as acquaintance, execution, and reading.
• Concurrent Access: When different users with the same access rights on a particular file attempt to execute one of the permitted operations on the file simultaneously, the file management system must impose certain rules in order to keep them mutually exclusive for the sake of the needed interprocess synchronization (already discussed in Chapter 4). Several useful methods are available to negotiate such situations; one simple approach is to lock the entire file while it is in use, thereby preventing other users from accessing it simultaneously. Another approach provides relatively fine-grained control in which only the record(s) in use are locked rather than the entire file. Manipulation of files with the same command being issued simultaneously by different users may also lead to deadlock, but such situations can be handled by the file management system using additional simple methods that ensure prevention of deadlocks (already discussed in Chapter 4).
Brief details on links with a figure, and also different access rights, are given on the Support Material at www.routledge.com/9781032467238.
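The following short C program is a minimal POSIX illustration of the two kinds of link discussed above; the file names are illustrative, and the program should be run in a scratch directory. It uses the standard calls link(), symlink(), stat(), readlink(), and unlink().

/* Minimal POSIX sketch contrasting a hard link and a symbolic link. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("target.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    write(fd, "shared data\n", 12);
    close(fd);

    /* Hard link: a second directory entry for the same inode.
       The inode's reference (link) count rises to 2. */
    link("target.txt", "hard.txt");

    /* Symbolic link: a separate small file that merely stores the
       pathname "target.txt"; it may dangle if the target is removed. */
    symlink("target.txt", "sym.txt");

    struct stat st;
    stat("target.txt", &st);
    printf("link count of target.txt: %ld\n", (long)st.st_nlink);  /* 2 */

    char buf[64];
    ssize_t n = readlink("sym.txt", buf, sizeof buf - 1);
    if (n >= 0) { buf[n] = '\0'; printf("sym.txt points to: %s\n", buf); }

    /* Removing target.txt leaves hard.txt usable (count drops to 1)
       but turns sym.txt into an orphan. */
    unlink("target.txt");
    stat("hard.txt", &st);
    printf("link count after unlink: %ld\n", (long)st.st_nlink);   /* 1 */
    return 0;
}

The output makes visible both the reference count maintained for hard links and the orphaning of a symbolic link once its target is removed.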
processing a file is substantially reduced, thereby improving not only file processing efficiency but also utilization of space on secondary storage and thereby the throughput of a device. On the other hand, the actions for extracting a particular logical record from a block thus accessed for processing are collectively called deblocking. However, many issues now need to be addressed, apart from deciding what size of block is suitable for overall use.
First, the block may be of fixed or variable length. Fixed-length blocks offer several advantages: data transfer to or from a secondary device is easy, buffer allocation in main memory and memory commitment for file buffers are straightforward, and the organization of blocks on secondary storage is simple and requires no additional overhead.
The second consideration is the size of the block to be chosen, which, in turn, is related to the blocking factor, that is, the block size compared to the average size of a record. Intuitively, the larger the block, the more sequential records can be handled in one I/O operation, resulting in a reasonable reduction in the total number of I/O operations needed to process a file. However, if the records are being accessed randomly with no particular locality of reference, then larger blocks result in the unnecessary transfer of useless records. Moreover, larger blocks require larger I/O buffers, and management of larger buffers itself creates other difficulties. Another serious drawback of using a larger block, irrespective of the type of access, is that if such a block fails in an I/O operation, all the records contained in the block, now quite large in number, must be created afresh, requiring additional overhead to make the file workable again. In general, for a given size of block, there are three methods of blocking (fixed blocking, variable-length spanned blocking, and variable-length unspanned blocking).
(t_io)_b = t_a + m × t_x . . . (a)
(t_io)_r = t_a / m + t_x . . . (b)
where t_a and t_x are the access time per block and the transfer time per record, respectively, and m is the blocking factor (the number of records per block). If t_a = 8 msec, the data transfer rate is 1000 Kbytes/sec, and the record length s_r = 200 bytes, then the transfer time per (logical) record is t_x = 200/1000 msec = 0.2 msec. The values of (t_io)_b and (t_io)_r can be computed using Equations (a) and (b) for a given value of m.
If the CPU time spent in processing a record, t_p, is 2.5 msec and m = 4, then (t_io)_r < t_p. This shows that the next record is available to the CPU before the processing of the current record is completed, thereby totally eliminating CPU idle time spent waiting for the next record to arrive. In fact, proper blocking of records and related buffering, when combined, ensure that the process does not suffer any I/O waits after the initial start-up phase.
Brief details on this topic, with Table 7.3, are given on the Support Material at www.routledge.
com/9781032467238.
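The short C program below simply re-evaluates Equations (a) and (b) with the figures quoted above (t_a = 8 msec, a 1000 Kbytes/sec transfer rate, 200-byte records, m = 4, t_p = 2.5 msec); it adds nothing beyond the arithmetic already worked out in the text.

/* Worked evaluation of equations (a) and (b) for blocked records. */
#include <stdio.h>

int main(void)
{
    double ta   = 8.0;          /* access time per block, msec              */
    double rate = 1000.0;       /* data transfer rate, Kbytes/sec           */
    double sr   = 200.0;        /* record length, bytes                     */
    double tx   = sr / rate;    /* transfer time per record: 200/1000 msec  */
    double tp   = 2.5;          /* CPU time to process one record, msec     */
    int    m    = 4;            /* blocking factor (records per block)      */

    double tio_b = ta + m * tx;     /* (a): I/O time for a whole block      */
    double tio_r = ta / m + tx;     /* (b): effective I/O time per record   */

    printf("(t_io)_b = %.2f msec\n", tio_b);   /* 8.8 msec */
    printf("(t_io)_r = %.2f msec\n", tio_r);   /* 2.2 msec */
    printf("(t_io)_r %s t_p (%.1f msec), so the CPU never waits\n",
           tio_r < tp ? "<" : ">=", tp);
    return 0;
}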
• What amount of space is to be allocated to a file when a new file is first created? Should the maximum space the file may ever require be allocated as a single lot at the time of its creation?
• The other approach suggests allocating space to store an entire file as a collection of pieces, each of which is a contiguous area; the pieces themselves may not be contiguous. Now, the obvious question is: what will be the size of these pieces used as a unit of file allocation? The size of a piece can range from a single disk block to even the entire file.
Whatever method of allocation is considered, the key issue in storing files is how to keep track of the space allocated to a file and what sort of data structure or table is to be used for this purpose. Various methods are, however, used in different systems. Examples of such data structures are the inode used in UNIX and the file allocation table (FAT) used in Windows and some other systems.
• Having a large piece provides contiguity of space, which eventually increases performance when writing a file in a single blow as well as for read/modify operations in transaction-oriented processing systems.
• Using a large number of small pieces provides allocation flexibility and better space utilization but at the same time increases the size of the tables employed to manage the allocation information.
Now, the consideration is whether fixed-size or variable-sized pieces of the disk area are to be used in each of these two approaches; for example, a small number of moderately large pieces or a large number of small pieces. Considering the size of the pieces together with the number of pieces gives rise to several options, each of which is equally applicable to both pre-allocation and dynamic allocation, and each of which, in turn, has its own merits and drawbacks. While moderately large, variable-sized contiguous pieces of disk area normally provide better performance with less wastage of space (fragmentation), managing these pieces along with the existing free space and minimizing fragmentation raise some issues. Those can be resolved by adapting the noncontiguous memory allocation model (see Chapter 5) to disk space allocation. The most viable alternative strategies in this regard are first fit, best fit, and nearest fit. Each of these strategies has its own strengths and drawbacks; hence, it is very difficult to decide which one is superior, because numerous factors interact with one another in a critical fashion. The factors that influence the choice of a particular strategy are the types of files, the structure of files, the access mechanism to be employed, disk buffering, disk caching, disk scheduling, and many other performance metrics in the system that are closely associated with I/O operation.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
The contiguous allocation scheme has its own merits and drawbacks, already discussed in detail in a previous section. External fragmentation is an obvious consequence of the contiguous allocation scheme, making it difficult for the file system to find a sufficient number of contiguous blocks to satisfy new requests. It then becomes necessary to perform a compaction (defragmentation) algorithm to coalesce the scattered free space on the disk into contiguous space for further use.
Brief details on this topic with figures are given on the Support Material at www.routledge.com/9781032467238.
FIGURE 7.8 A schematic block diagram illustrating the mechanism of the indexed allocation method used in file management of a generic modern operating system.
indirection in which one disk access per level of indexing is required to reach the target data block, and thus they face a marginal degradation in access performance. However, successive accesses to file blocks within the addressing range of the current index blocks need not face this overhead if recently used index blocks are kept in memory. Naturally, sequential access is more likely to benefit from buffering of index blocks. However, the nature of noncontiguous allocation is such that it is unlikely to find adjacent logical blocks in consecutive physical disk blocks (or sectors). In order to reduce latency in disk access, systems using noncontiguous allocation occasionally make their files contiguous by performing defragmentation or consolidation or by another similar useful approach.
Conclusion: File organization using indexed allocation is less efficient for sequential file processing than the linked allocation scheme, since the FMT of a file has to be accessed first and then a series of accesses to different parts of the disk are required, which eventually increases the disk I/O. Random access, however, is comparatively more efficient, since a particular record can be accessed directly by obtaining the specific address of the target block from the FMT. Indexed allocation is also less vulnerable to damage than linked allocation, because corruption of an entry in an FMT or DSM leads only to limited damage.
Indexed allocation with variable-length blocks and multiple-level index allocation, with figures, are given on the Support Material at www.routledge.com/9781032467238.
FIGURE 7.9 An example of disk free-space management using a disk status map (DSM), or bit table, employed by a generic modern operating system: each bit in the map indicates whether the corresponding disk block is free or allocated.
data structures that summarize the contents of the bit tables by logically dividing them into almost equal-sized subranges; a summary can be made for each such subrange that mainly includes the number of free blocks and the maximum-sized contiguous run of free blocks. When the file management needs a particular number of contiguous blocks, it can go through only the summary table to find an appropriate subrange and then search that particular subrange to find the desired one.
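The following C sketch shows one way such a bit table with a per-subrange summary might be organized; the sizes, the convention that a set bit means "allocated," and the function names are all illustrative assumptions, not a description of any particular implementation.

/* Sketch of a disk status map (bit table) with a small summary table:
   one bit per disk block, plus a per-subrange count of free blocks so
   that a search can skip subranges that are completely full. */
#include <stdio.h>
#include <string.h>

#define NBLOCKS   1024
#define SUBRANGE  128                      /* blocks summarized per entry */
#define NSUB      (NBLOCKS / SUBRANGE)

static unsigned char dsm[NBLOCKS / 8];     /* the bit table itself            */
static int free_in_sub[NSUB];              /* summary: free blocks / subrange */

static int  test(int b) { return (dsm[b / 8] >> (b % 8)) & 1; }
static void set(int b)  { dsm[b / 8] |= (1u << (b % 8)); }

/* Allocate one free block, consulting the summary table first. */
static int alloc_block(void)
{
    for (int s = 0; s < NSUB; s++) {
        if (free_in_sub[s] == 0)
            continue;                      /* whole subrange is full: skip it */
        for (int b = s * SUBRANGE; b < (s + 1) * SUBRANGE; b++) {
            if (!test(b)) {
                set(b);
                free_in_sub[s]--;
                return b;
            }
        }
    }
    return -1;                             /* disk full */
}

int main(void)
{
    memset(dsm, 0, sizeof dsm);
    for (int s = 0; s < NSUB; s++)
        free_in_sub[s] = SUBRANGE;

    /* Pretend the first subrange is already fully allocated. */
    for (int b = 0; b < SUBRANGE; b++) set(b);
    free_in_sub[0] = 0;

    int b = alloc_block();
    printf("allocated block %d (first free block after subrange 0)\n", b);
    return 0;
}

The point of the summary table is visible in alloc_block(): the fully allocated first subrange is skipped with a single comparison instead of 128 bit tests.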
7.16.3 INDEXING
Indexing of free space may be cumbersome. This approach treats the free space as a file and similarly uses an index table as already described in the file allocation method. In a freshly initialized
volume with a very large number of free blocks, this approach requires multiple levels of indexing when many or all of them are to be accessed, thereby consuming an unacceptable amount of processing time. As new files are gradually created, the number of free blocks automatically decreases, but allocations and deallocations of blocks may still incur high overhead due to the multiple levels of indexing. That is why some designs propose keeping at least one index block of the free list in memory, which can speed up the allocation of up to n free blocks, where n is the number of indices in an index block. However, the outcome obtained is still not acceptable. For the sake of efficiency, the index should be maintained on the basis of variable-sized pieces rather than on a block basis. This requires only one entry in the index table for every free piece on the disk. This approach appears to offer adequate support for all types of file allocation methods.
In an individual block-basis approach, each free block is identified by assigning it a serial number (or by its address), and the numbers (or addresses) of all free blocks can then be recorded in a separate list. With a disk of a common size today, the number (or address) used to identify each free block in the free list will require on the order of 32 bits (2^32 ≈ 4 G block numbers). Such a huge free list cannot be maintained in main memory and hence must be stored on disk. Now, the problem is that every time a free block is needed, a correspondingly slower disk access will be required, and this will eventually affect overall system performance adversely. To avoid this problem, some designers suggest two effective techniques that store a small part of the list in main memory so that block requests can be responded to quickly.
One such technique is to fetch a part of the list into main memory and treat it as a push-down stack. Whenever a new block is to be allocated, it is popped off the top of the stack, and similarly, when a block is deallocated, it is pushed onto the top of the stack. In this way, all requests can be responded to with no delay except in the situation when the stack (the part of the list held in memory) is either full or exhausted. At that time, only one appropriate transfer of a part of the list between disk and memory is required to resume stack operations. The other technique is similarly to fetch a part of the list into main memory, but now it is treated as a FIFO queue and operated in a similar way to the push-down stack while obeying the traditional queue operations. To make each of these approaches more attractive and effective, a background process (or thread) can be run slowly whenever possible to sort the list in memory by serial number (or by address), which allows each of these approaches to be applicable for contiguous allocation of blocks as far as possible.
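A minimal sketch of the push-down-stack variant is shown below. The cache size, the pretended block numbers, and the refill/spill helpers are illustrative assumptions; in a real system the refill and spill steps would transfer part of the on-disk free list.

/* Sketch of the push-down-stack technique for free-block numbers:
   a bounded part of the free list is cached in memory; allocation pops
   a block number and deallocation pushes one.  Transfers to and from
   the on-disk portion of the list are only stubbed out here. */
#include <stdio.h>

#define CACHE 4                     /* kept tiny to force refill/spill */

static unsigned stack[CACHE];
static int top;                     /* number of cached free block numbers */

static void refill_from_disk(void)  /* stub: would read part of the list */
{
    static unsigned next = 100;     /* pretend blocks 100, 101, ... are free */
    while (top < CACHE)
        stack[top++] = next++;
    printf("  (refilled cache from disk)\n");
}

static void spill_to_disk(void)     /* stub: would write part of the list */
{
    top = CACHE / 2;                /* keep half; pretend the rest went to disk */
    printf("  (spilled cache to disk)\n");
}

static unsigned alloc_block(void)
{
    if (top == 0)
        refill_from_disk();         /* cache exhausted */
    return stack[--top];            /* pop */
}

static void free_block(unsigned b)
{
    if (top == CACHE)
        spill_to_disk();            /* cache full */
    stack[top++] = b;               /* push */
}

int main(void)
{
    unsigned a = alloc_block(), b = alloc_block();
    printf("allocated blocks %u and %u\n", a, b);
    free_block(a);
    free_block(b);
    printf("freed them again; %d block numbers cached\n", top);
    return 0;
}

A disk transfer is needed only when the cached part of the list runs dry or overflows, which is the whole point of the technique described above.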
The second approach is allocation of free space from a collection of different clusters of contiguous free blocks, and it requires keeping track of such clusters and also implementation of a policy for allocation and deallocation of blocks. Addresses and sizes of free disk areas can be maintained as a separate list or can be recorded in unused directory entries. For example, when a file is deleted and subsequently deallocated, its entry in the basic file directory can be marked as unused, but its address and size in terms of blocks can be left intact. At the time of creation of a new file, the operating system can inspect unused directory entries to locate a free area of suitable size to match the current request. The first-fit and best-fit algorithms could be used for this purpose. Depending on the portion of the directory that is kept in main memory, the trade-off between first-fit and best-fit may go either way. While first-fit may sometimes give rise to a substantial amount of internal fragmentation, it requires fewer directory entries to be looked up and may be preferable when very few directory entries are available in main memory and most of the directory entries are
available in main memory, the best-fit algorithm provides better performance, since it tends to reduce internal fragmentation by carrying out a closer match of the requested size to the size of the allocated disk area.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
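The contrast between the two policies can be seen in the short C sketch below; the free-area list and request size are made-up illustrative data, not drawn from any real system.

/* Sketch contrasting first-fit and best-fit selection of a free disk
   area for a request of a given size. */
#include <stdio.h>

struct area { int start, size; };

static int first_fit(struct area *a, int n, int req)
{
    for (int i = 0; i < n; i++)
        if (a[i].size >= req)
            return i;                       /* first area big enough */
    return -1;
}

static int best_fit(struct area *a, int n, int req)
{
    int best = -1;
    for (int i = 0; i < n; i++)             /* smallest area that still fits */
        if (a[i].size >= req && (best < 0 || a[i].size < a[best].size))
            best = i;
    return best;
}

int main(void)
{
    struct area free_list[] = { {10, 50}, {200, 12}, {400, 30} };
    int n = 3, req = 12;

    int f = first_fit(free_list, n, req);
    int b = best_fit(free_list, n, req);
    printf("request of %d blocks: first-fit picks the area at %d (size %d), "
           "best-fit picks the area at %d (size %d)\n",
           req, free_list[f].start, free_list[f].size,
           free_list[b].start, free_list[b].size);
    return 0;
}

Here first-fit grabs the 50-block area and wastes most of it (internal fragmentation), while best-fit finds the exact 12-block area at the cost of scanning the whole list, mirroring the trade-off described above.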
• Disk blocks have numbers, and complex structures can be placed on the disk by having data in one block refer to another block by number.
• Each file is described by a file descriptor, which tells how the file is physically arranged on the disk.
• Each physical disk is described by a disk descriptor, which tells how the disk is arranged into areas and which parts are currently unused. The disk descriptor is stored at a well-known location on the disk.
• Information may be stored redundantly on the disk to allow programs to try to restructure the disk if it gets confused. Confusion is the typical result of unscheduled operating-system failures, because the structure may be undergoing modification at the time of the failure. Even worse, the disk may be in the middle of writing a block when the failure occurs. Restructuring a garbage disk is called salvaging.
• The basic unit of allocation is the single disk block, although entire tracks or cylinders may be allocated at a time to keep large regions of a file contiguous on the disk. This attempt to keep files in local regions on the disk is called clustering. It is based on the cache principle, since it is faster to read or write on the disk at cylinders close to the current position, and the most likely request to come after one file request is another request on the same file. Clustering may also be attempted to keep files that are in the same directory positioned close together on the disk. Another form of clustering, called skewing, spaces consecutive blocks of a file a few sectors apart. As a result, a typical process reading the entire file will find the next file block under the disk read/write head at the time it needs it. Some disk controllers interleave sectors to place consecutively numbered ones some distance apart from each other on the same track. In this case, the file manager most likely should not attempt skewing.
• Searching file structures and allocating free blocks would be too time consuming if the information were stored only on the disk. In accordance with the cache principle, some structure and allocation information may be duplicated in the main store. But, as is typically the case with caches, the cached (main-store) data and the actual (disk) data will possibly be out of step. Operating system failures (crashes) then become even more serious than they seem to be, because they may lose recent changes. To mitigate the danger, all main-store caches are occasionally (perhaps every minute) archived to the disk. Perhaps the worst time for a catastrophic failure is during archiving.
The facilities provided by the file service mostly determine the structures that must be used. For example, direct access of arbitrary positions in a file requires different structures than sequential access. Hierarchical directories and flat directories require different structures. Different methods of access control also need different structures.
result in an irrevocable loss for some reason, then restoring all the information will be not only time-consuming and painstaking but also genuinely difficult, in many situations practically impossible, and perhaps even catastrophic. However, there are some commonly used methods, such as bad-block management and backups, that help safeguard the file system. That is why reliability of the file system is so important. A file system is said to be reliable if it can guarantee that the functions of the file system will work correctly despite the different types of faults that may occur in the system. The reliability of the file system concerns two main aspects, discussed below.
Reliability is closely associated with the terms fault and failure. A fault is commonly a defect in some part of the system causing an error in the system state. When an error causes unexpected behavior or an unusual situation in the system, it is termed a failure. In other words, a fault is the cause of a failure. For example, the crashing of I/O devices due to a power outage or the corruption of a disk block is a fault, whereas the inability of the file system to read such a block is a failure. Faults are of various types, and any one of them may affect the entire computer system or hardware components such as the processor, memory, I/O devices, and communication links. But we will concentrate here only on those issues of fault, and the respective approaches to prevent them, that are related to the FMS.
The most common reliability problems in file systems are system crashes due to power interruptions and data corruption by viruses, leading to loss of data in files or loss of file system control data (the various data structures needed by the file system, to be described in the following discussion) stored on disk. If the control data are lost or become inconsistent, the injury is fatal, and the file system may not be able to work at all. On the other hand, damage caused by loss of only the data in a file due to data corruption is relatively less serious, since it is limited to a single file.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
Recovery: This is a classic approach implemented using different techniques when a failure occurs. It attempts to restore the damaged data in files and inconsistent control data of the file system to some near-past consistent state so that the file system can once again
resume its normal operation from the new state. While this action rescues files from a damaged state, deviations from the expected behavior may still be observed; appropriate corrective measures in system operation can then be taken once such deviations are noticed.
Fault tolerance: This is another effective approach that uses some commonly used techniques in order to guard the file system against any loss of integrity. Its objective is to keep the operation of the file system consistent, uninterrupted, and perfectly correct at all times, even in the event of failure.
• safeguard against loss of data due to malfunctioning of devices caused by physical damage to the devices or media or by other types of faults arising for numerous reasons.
• ensure file system consistency by preventing inconsistency of control data resulting from system crashes.
These techniques are commonly known as fault-tolerance techniques; they primarily attempt to implement precautionary measures that can safeguard the file system from any unforeseen damage. Two such techniques, stable storage and atomic actions, can be mentioned in this regard.
• Stable storage: This simple technique, named disk mirroring by Lampson, uses redundancy by creating two copies of a record, called its primary and secondary copies, that are maintained on disk to survive a single failure. This disk mirroring is quite different from the disk mirroring technique used in RAID, as described in Chapter 6. A file write operation updates both copies: the primary copy first, followed by the secondary copy. A file read operation first accesses the primary copy of the disk block(s). If it is all right, there is no problem; if it is not readable, the secondary copy is accessed. While this technique guarantees survival of a single failure and could thus be applied to all files for general use in the file system, it is very expensive; still, it can be made profitable when processes use it selectively to protect their data. However, one serious drawback of this approach is that it fails to indicate whether a value is the old or the new version (both copies contain the old version when the failure occurs before the primary copy is updated, and both contain the new version when the failure occurs after both copies have been updated). So, while restoring the system to normalcy, the user cannot definitely ascertain whether the operation that was ongoing at the time of failure should be re-executed. The atomic action described next addresses this issue and accordingly overcomes this problem.
• Atomic action: An action may consist of several sub-actions, each of which may involve an operation that manipulates either the control data (data structures) of the file system or the data of files. Any failure during the course of such an operation interrupts its execution and may eventually make the file system inconsistent and cause errors. For example, consider an inventory control system that involves transferring spares from one account to another. If a failure occurs during the transfer operation, it interrupts its execution; spares may have been debited from one account but not credited to the other, or vice versa. The inventory control file system would then be in an inconsistent and erroneous state. The objective of an atomic action (very similar in nature to the atomic action used in interprocess synchronization, as described in Chapter 4) is to avoid all such ill effects of system failure. In fact, any action Xi consisting of a set of sub-actions {xik} is said to be an atomic action if, for every execution of Xi, either
1. executions of all sub-actions in {xik} are completed, or
2. executions of none of the sub-actions in {xik} are completed.
An atomic action succeeds when it executes all its sub-actions without any interruption or interference. In that situation, it is said to commit. An atomic action fails if a failure of any type occurs or an abort command is executed before all its sub-actions are completed. If it fails, the state of the file system, the state of each file, and each variable involved in the atomic action should remain as they were prior to the beginning of the atomic action. If an atomic action commits, it is guaranteed that all the actions already taken by it will survive even if a failure occurs.
Brief details on the implementation of atomic actions with an algorithm are given on the Support
Material at www.routledge.com/9781032467238.
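The C sketch below models only the stable-storage (two-copy) discipline described above: a write updates the primary copy and then the secondary, and a read falls back to the secondary when the primary is unreadable. Two in-memory arrays and a "bad" flag stand in for the two disk copies and a media failure; a real implementation would, of course, detect unreadable blocks from the device and could layer an atomic-action protocol on top.

/* Sketch of the stable-storage (two-copy) read/write discipline. */
#include <stdio.h>
#include <string.h>

#define RECLEN 32

static char primary[RECLEN], secondary[RECLEN];
static int  primary_bad;                  /* simulated media failure */

static void stable_write(const char *data)
{
    strncpy(primary, data, RECLEN - 1);   /* step 1: update the primary copy   */
    strncpy(secondary, data, RECLEN - 1); /* step 2: update the secondary copy */
}

static const char *stable_read(void)
{
    if (!primary_bad)                     /* primary readable: use it */
        return primary;
    return secondary;                     /* otherwise fall back to the copy */
}

int main(void)
{
    stable_write("balance=100");
    printf("read: %s\n", stable_read());

    primary_bad = 1;                      /* pretend the primary copy fails */
    printf("read after primary failure: %s\n", stable_read());
    return 0;
}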
FIGURE 7.10 A generalized block diagram illustrating a schematic representation of the virtual file system used in a generic modern operating system.
(similar to the virtual machine layer that resides between the operating systems and the bare hardware to support many different operating systems running simultaneously on the same piece of hardware), as shown in Figure 7.10. The VFS layer has two interfaces: the upper interface that interacts with the processes above and the lower interface for the target file systems lying below. Any file system that conforms to the specification of the VFS file system interface can be installed to run under VFS. This feature helps to add a new file system easily to the existing environment. The VFS process interface (upper interface) provides functionalities to perform generic open, close, read, write, and other common operations on files, and mount and unmount operations on file systems. These functionalities are invoked through the respective system calls. The VFS file system interface (lower interface) determines to which file system a particular file actually belongs and invokes the respective functionalities of the corresponding file system as needed. This interface also invokes functions of the specific file system to implement mount and unmount operations.
As shown in Figure 7.10, many different file systems initiated by different processes can be run simultaneously using the VFS interface. In addition, the VFS can also be used to compose a heterogeneous file system. For example, a user can mount a file system of type A in a directory of a file system of type B. This feature is particularly useful with removable media like CDs; it permits a user to mount the file system that resides on a CD in his current directory and access its files without any concern for the fact that the file data are recorded in a different format. This feature is also important when used in a distributed environment for mounting a remote file system into a file system of a different type. For example, the Sun Network File System (NFS) uses a VFS layer to permit mounting of different file systems and provides sharing of these different file systems in nodes operating under the Sun OS operating system, which is a version of UNIX.
The VFS, in essence, does not contain any file data; rather, it contains merely the data structures that constitute the VFS metadata. Each file system that runs under it contains its own metadata and file data. The VFS layer implements a complete system-wide unique designator for each file by creating a key data structure for the file that is used by the VFS. This data structure is known as a virtual node, popularly called a vnode. The vnode is essentially the representation of a file in the kernel and can be looked upon as a file object comprising three parts.
Eliminating the VFS layer from this approach would mean that all processes run under one file system, and in that case, all nodes would be treated as physical nodes. Many modern operating systems have provided VFSs since the 1990s. Some of the notable ones among them are UNIX SVR4, UNIX 4.2 BSD, Linux, and Sun OS.
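One common way to realize the VFS lower interface is a table of function pointers attached to each virtual node, so that a generic request issued above the VFS is dispatched to the code of whichever file system the node belongs to. The C sketch below illustrates only that dispatch idea; the structure layout, the two stand-in "file systems," and all names are illustrative assumptions rather than the data structures of any real kernel.

/* Sketch of VFS-style dispatch through per-vnode operation tables. */
#include <stdio.h>

struct vnode;                              /* forward declaration */

struct vnode_ops {                         /* lower-interface operations */
    int (*read)(struct vnode *v, char *buf, int len);
};

struct vnode {
    const char             *name;
    const struct vnode_ops *ops;           /* file-system-specific table */
    void                   *fs_private;    /* file-system-specific data  */
};

static int ext2_like_read(struct vnode *v, char *buf, int len)
{
    return snprintf(buf, len, "<%s read via ext2-like code>", v->name);
}

static int cdrom_like_read(struct vnode *v, char *buf, int len)
{
    return snprintf(buf, len, "<%s read via CD-ROM file system code>", v->name);
}

static const struct vnode_ops ext2_ops  = { ext2_like_read  };
static const struct vnode_ops cdrom_ops = { cdrom_like_read };

/* The generic (upper-interface) read: it has no knowledge of which
   file system the vnode belongs to. */
static int vfs_read(struct vnode *v, char *buf, int len)
{
    return v->ops->read(v, buf, len);
}

int main(void)
{
    struct vnode a = { "notes.txt", &ext2_ops,  NULL };
    struct vnode b = { "track01",   &cdrom_ops, NULL };
    char buf[64];

    vfs_read(&a, buf, sizeof buf); printf("%s\n", buf);
    vfs_read(&b, buf, sizeof buf); printf("%s\n", buf);
    return 0;
}

The same vfs_read() call reaches two different implementations, which is what lets processes above the VFS work with heterogeneous file systems uniformly.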
7.20 PIPES
A pipe is a sort of pseudo-file that can be used to connect two processes together. When process P1 wants to send data to process P2, it writes at one end of the pipe as though it were an output file. Process P2 can read the data by reading from the other end of the pipe as though it were an input file. Thus, communication between processes looks very much like ordinary file reads and writes. The pipe acts here as a virtual communication channel connecting two processes that wish to exchange a stream of data. That is why some systems implement an interprocess communication mechanism via the pipe, which is similar to messages but can be programmed using the standard set of file and I/O services. It can also be used to link external devices or files to processes. The two processes communicating via a pipe can reside on a single machine or on different machines in a network environment.
The operating system usually provides automatic buffering for the data within a pipe and implicit synchronization between the processes communicating via the pipe. A process intending to write to a full pipe, for example, may be delayed, and similarly a process wishing to read from an empty pipe may be suspended until some data arrive. Pipes can be handled at the system-call level in exactly the same way as files and device-independent I/O and, in particular, can use the same basic set of system calls that is also used for handling devices and files. This generality enables pipes to be used even at the command-language level, establishing an additional form of inter-program communication. That is why pipes are often used to direct the output of one program or device to the input of another program, or directly to a device, without any reprogramming or the use of temporary files. By allowing this form of redirection, applications that are not specifically developed to work together, such as a spellchecker and a text formatter, can be combined to perform a new complex function without any reprogramming. In this way, several independent utilities can be cascaded, which enables users to construct powerful functions out of simple basic utilities, functions that can even go beyond the limit of their designers' vision. The pipe is, therefore, a powerful tool that is exploited in many operating systems, including UNIX. More about pipes can be found in Chapter 2 and in Chapter 4, where they are explained in detail.
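The classic POSIX illustration of these ideas is the pipe()/fork() pair shown below: the parent writes at one end of the pipe and the child reads at the other, using the very same read and write calls that serve files and devices. The message text is, of course, only an example.

/* Classic POSIX illustration of a pipe as a pseudo-file. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];                       /* fd[0]: read end, fd[1]: write end */
    if (pipe(fd) < 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                  /* child: the reader */
        char buf[64];
        close(fd[1]);                /* not writing */
        ssize_t n = read(fd[0], buf, sizeof buf - 1);
        if (n > 0) { buf[n] = '\0'; printf("child received: %s\n", buf); }
        close(fd[0]);
        return 0;
    }

    /* parent: the writer */
    const char *msg = "data through the pipe";
    close(fd[0]);                    /* not reading */
    write(fd[1], msg, strlen(msg));
    close(fd[1]);                    /* closing the write end signals EOF */
    wait(NULL);
    return 0;
}

The shell-level redirection described above (cascading utilities with "|") is built on exactly this mechanism, with the shell creating the pipe and connecting it to the standard input and output of the two programs.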
TABLE 7.1
Factors Influencing File System Performance

Factor                                    Techniques Used to Address the Factor
• Accessing the file map table or FAT     File map table cache in main memory
• Directory access                        Hash tables, B+ trees
• Accessing a disk block                  Disk block cache in main memory
• Accessing data                          Disk scheduling; blocking and buffering of data
• Writing data                            Cylinder groups and extents; disk block cache in the device; different approaches to when the computed data is to be written to disk
first time. When a directory is used to resolve a pathname, it is retained in a memory cache known as the directory names cache to speed up future references to files located in it. Disk-arm movement is an important issue when a disk block is accessed and, in turn, becomes a dominant factor in the performance of the file system. To minimize this movement, several useful techniques are used with respect to accessing free blocks and subsequently allocating them. Sometimes a free list is also arranged in the form of block clustering by grouping consecutive blocks, which considerably improves disk I/O performance. Some modern systems, especially Linux, use this approach. When allocating blocks, the system attempts to place consecutive blocks of a file in the same cylinder.
Another performance bottleneck in systems that use inodes, or anything equivalent to inodes, is that reading even a short file requires two disk accesses: one for the inode and one for the data block. Usually, inodes are placed near the beginning of the disk, so the average distance between an inode and its blocks will be about half the number of cylinders, thereby requiring appreciably long seeks. To improve performance, inodes can be placed in the middle of the disk instead of at the beginning, thereby reducing the average seek between the inode and its first data block by a factor of two. Another approach is to divide the disk into cylinder groups, each with its own inodes, blocks, and free list (McKusick, et al., 1984). When creating a new file, any inode can be chosen, but an attempt is made to locate a block in the same cylinder group as the inode. If no such block is available, then a block close to the cylinder group is used.
Still, an important issue remains: at what time should the data be written back from the cache (or memory) to the disk? The notion of delayed writes (write-back) tends to improve the effective speed of writing and response time, eliminate redundant disk writes, and also reduce the network and server load in distributed systems, but, of course, sometimes at the cost of gross data loss in the event of a system failure. In this regard, UNIX uses a system call, sync, that forces all the modified blocks out onto the disk immediately. In fact, when the UNIX system is started up, a program, usually called update, is started in the background to sit in an endless loop issuing sync calls, sleeping for 30 seconds between two successive calls. As a result, no more than 30 seconds of work is lost if a system crash occurs. This sounds quite comfortable for many users. The Windows approach is normally to use write-through caches that write all modified blocks back to the disk immediately. When a large file is handled, the disk I/O time taken by Windows is thus appreciable.
As technology advances, operating systems are also gradually modernized, and these advanced systems provide device-independent I/O, where files and logical I/O are treated by a single, unified set of system services at both the command-language and system-call levels. In such systems, user processes can be interchangeably connected to pipes, files, or I/O devices. This facility is often coupled with runtime binding of processes to devices, which makes compiled programs insensitive to configuration changes and provides considerable flexibility in managing resources, thereby speeding up computer systems. Moreover, the recent trend to further enhance file system
performance is to implement in hardware all the speed-up techniques that were previously realized in software. In addition, modern I/O device technology also incorporates some of the techniques mentioned in Table 7.1. Thus, SCSI disks provide disk scheduling in the device itself. RAID units, as already discussed in a previous section, today contain a disk block buffer that can be used to both buffer and cache disk blocks.
More details on this topic are given on the Support Material at www.routledge.com/9781032467238.
FIGURE 7.11 A representative scheme of the file-update mechanism used in a log-structured file system employed by a generic modern operating system.
file are in the log file, and the data blocks in the log file are numbered as shown in Figure 7.11(a). In the actual implementation, the index block contains the pointers that point to the respective file data blocks. The directory entry of a file points to the respective index block in the log file; it is assumed here that the index block contains the FMT (FAT) of the file. When the file data contained in block 1 are modified (updated), the new values are written into a new disk block, say, block 4. This is depicted in Figure 7.11(b). Similarly, when the data in block 3 are updated, some of its file data are written into disk block 5. The file system now writes a new index block that contains the updated FMT of the file and sets the FMT pointer in the directory of the file to point to the new index block. The new FMT now contains pointers to the two new data blocks and to data block 2, which has not been changed, as shown in Figure 7.11(b). The old index block and disk blocks 1 and 3 are now released.
Since the log is written as a sequential-access file, the finite disk will quickly be occupied by the log file, leaving almost no space for new segments to be written, although there may still be many blocks in it that are no longer needed. As shown in Figure 7.11(b), if a file is updated, a new index block is written, but the old one, though of no further use, still occupies space in previously written areas.
To deal with this problem, LFS exploits the service of a cleaner thread that scans the log circularly to compact it, similar to memory compaction (see Chapter 5). It starts out by reading the summary of the first segment in the log to see which index blocks and file blocks are still current and in use. Information that is no longer in use is discarded, and the space occupied by it is released. The index blocks and file blocks that are still in use go to memory to be written out in the next available segment. The original segment is then marked as free, and the log can use it to store new data. In this way, the cleaner moves along the log, removing old segments from the back and putting any live data into memory for rewriting in the next segment. Consequently, a large free area on the disk becomes available for the log file. The entire disk can now be viewed as a big circular buffer, with the writer thread adding new segments to the front and the cleaner thread removing old ones from the back. This operation involves considerable disk head movement, which affects disk usage; however, the cleaner and writer threads perform their operations (i.e. compaction) as a background activity
without affecting the actual file processing activities. Performance results reveal that all this complexity is worthwhile. Measurements reported in the paper mentioned at the beginning of this section show that LFS clearly outperforms UNIX by an order of magnitude on small writes while exhibiting performance as good as or even better than UNIX for reads and large writes.
FIGURE 7.12 A schematic block diagram consisting of the relevant data structures used in the UNIX file management system.
the file descriptors of the parent process are copied into it. Thus, many file descriptors may share the same file structure. Processes owning these descriptors share the file offset.
• Disk Space Allocation: Each file has a FAT analogous to an FMT, and this information is obtained from the contents of the inode. File allocation is carried out dynamically on a block basis; hence, the allocated blocks of a file on disk are not necessarily contiguous. An indexed allocation method is used to keep track of each file, with part of the index stored in the inode of the file. The inode includes 39 bytes of allocation address information, organized as thirteen 3-byte addresses or pointers. The first 10 addresses point to the first 10 data blocks of the file. If the file is longer than 10 blocks, one or more levels of indirection are used.
The total number of data blocks in a file depends on the size of the fixed-size blocks in the system. In UNIX System V, the length of a disk block is 1 Kbyte (2^10 bytes), and thus each such block can hold a total of 256 (2^8) block addresses, each block address being 4 (= 2^2) bytes. Hence, the maximum number of disk blocks that can be addressed using three levels of indirection is 256 × 256 × 256 = 2^24 disk blocks. Each disk block is 1 Kbyte = 2^10 bytes. Hence, the maximum size of a file with this scheme is 2^24 × 2^10 bytes = 2^34 bytes = 16 Gbytes. Similarly, two levels of indirection cover 256 × 256 = 2^16 disk blocks, that is, 2^16 × 2^10 bytes = 64 Mbytes; with a single level of indirection, the maximum size of the file would be 256 = 2^8 disk blocks, that is, 2^8 × 2^10 bytes = 256 Kbytes; and the direct pointers (i.e. zero levels of indirection) cover simply 10 × 1 Kbyte = 10 Kbytes.
For file sizes smaller than 10 Kbytes, this arrangement is as efficient as the flat allocation discussed in a previous section. Such files also have a small allocation table that can fit into the inode itself with no indirection. Somewhat bigger files using one level of indirection can be accessed with only a little extra overhead, which on the whole keeps processing and disk access time low. Two or more levels of indirection permit files to grow to very large sizes, virtually satisfying all applications, although accessing them involves extra time spent traversing the different levels of indirection in the FAT.
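The small C program below simply recomputes the limits worked out above (1-Kbyte blocks, 4-byte block addresses, 10 direct pointers, then single, double, and triple indirection); it introduces nothing beyond that arithmetic.

/* Recomputation of the file-size limits of the traditional UNIX
   System V inode scheme described above. */
#include <stdio.h>

int main(void)
{
    const long long BLOCK = 1024;          /* bytes per disk block        */
    const long long PTRS  = BLOCK / 4;     /* addresses per index block   */

    long long direct = 10;                 /* data blocks via direct ptrs */
    long long single = PTRS;               /* via single indirection      */
    long long dbl    = PTRS * PTRS;        /* via double indirection      */
    long long triple = PTRS * PTRS * PTRS; /* via triple indirection      */

    printf("direct:          %10lld blocks = %lld Kbytes\n",
           direct, direct * BLOCK / 1024);
    printf("single indirect: %10lld blocks = %lld Kbytes\n",
           single, single * BLOCK / 1024);
    printf("double indirect: %10lld blocks = %lld Mbytes\n",
           dbl, dbl * BLOCK / (1024 * 1024));
    printf("triple indirect: %10lld blocks = %lld Gbytes\n",
           triple, triple * BLOCK / (1024LL * 1024 * 1024));

    long long total = (direct + single + dbl + triple) * BLOCK;
    printf("maximum file size: about %lld Gbytes\n",
           total / (1024LL * 1024 * 1024));
    return 0;
}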
• Free Space Management: In its simplest form, the UNIX file system maintains a list of free disk blocks in a way similar to linked allocation, in which each block points to the next block in the list. To avoid the high overhead inherent in this approach, UNIX employs an indexed allocation scheme, but implemented differently. Here, free space is managed by means of a chained list of indices to unused blocks. In particular, approximately 50 pointers to free blocks are collected in one index block. The index blocks in the free list are chained together so that each points to the id of the next one in line in the free list. The first index block is normally kept in main memory. As a result, the system has immediate access to the addresses of up to 50 free blocks and to a pointer to an index block on the disk that contains 50 more pointers to free blocks. With this arrangement, the overhead of adding disk blocks to the free list when a file is deleted is greatly minimized. Only marginal processing is needed for files smaller than 10 Kbytes (or multiples of 10 Kbytes, depending on the size of the data blocks). However, while disk blocks are added to and deleted from the free list, race conditions may occur, and that is why a lock variable is used with the free list to avoid such situations.
• Sharing of Files: UNIX provides file sharing with the use of a single file image. As illustrated in Figure 7.12, every process that opens a file points to the copy of its inode using its file descriptor and file structure. Thus, all processes that share a file use the same copy of the file; changes made by one process are at once visible to other processes sharing the file. As usual, race conditions may exist while accessing an inode; hence, to ensure mutual exclusion, a lock variable called an advisory lock is provided in the memory copy of an
inode that is supposed to be heeded by processes; however, the file system does not enforce its use. A process attempting to access an inode must go to sleep if the lock is set by another process. Processes that concurrently use a file must do their own planning to avoid race conditions on the data contained in the file.
• Directories: Directories are tree-typed and hierarchically organized. A directory is simply a file that contains a list of file names and/or other directories plus pointers (inode numbers) to the associated inodes. When a file or directory is accessed, the file system must take its inode number from the related directory and use it as an index into the inode table to locate its disk blocks.
• Volume Structure: A UNIX file system resides on a single logical disk or disk partition, and all such disks that contain UNIX file systems have the layout depicted in Figure 7.13, with the following elements:
• Boot Block: Block 0 is not used by UNIX and often contains code to boot the computer.
• Superblock: Block 1 is the superblock that contains critical information about the layout of the file system, the number of inodes, the number of disk blocks, and the start of the list of free disk blocks (typically a few hundred entries). Damage to or destruction of the superblock will render the file system unreadable.
• Inode Tables: The collection of inodes, one for each file. They are numbered from 1 to some maximum.
• Data Blocks: All data files and directories/subdirectories are stored here.
• Multiple File Systems: Many file systems can exist in a UNIX system. When a physical disk is partitioned into many logical disks, a file system can be constructed on each of them, and each file system can exist only on a single logical disk device. In other words, a logical disk contains exactly one file system. Hence, files also cannot span different logical disks. Partitioning a disk in this way provides some protection and also prevents a file system from occupying too much disk space. Each file system consists of a superblock, an inode list, and data blocks. The superblock itself contains the size of the file system, the free list, and the size of the inode list. The superblock, which is the root of every file system, is maintained by UNIX in main memory for the sake of efficiency. The superblock is copied onto the disk periodically. Some part of the file system may be lost if the system crashes after the superblock is modified but before it is copied to the disk. Some of the lost state information, such as the free list, can be reconstructed by the file system simply by analyzing the disk status. This is, of course, carried out as part of the system booting procedure.
A file system can be mounted in any directory of a logical disk device by using the file system program mount with the parameters of a device-special file name (for the file system) and the pathname of the directory in which it is to be mounted. Once the file system is mounted, the root of the file system has the name given by the pathname, and the superblock of the mounted file system is then loaded in main memory. Disk block allocation for a file in the mounted file system must now be
FIGURE 7.13 A schematic block diagram showing a representative layout of the volume structure of a disk used in a traditional UNIX system.
performed within the logical disk device on which the file system exists. All the files in a mounted file system are then accessed in the usual way.
• Other Features: UNIX provides extensive buffering of disk blocks in order to reduce physical I/O and effective disk access time. However, buffers are not allocated at the level of a file or a process. This arrangement facilitates the implementation of concurrent file sharing with the use of only a single file image and also reduces the disk access overhead when a file is processed simultaneously by two or more processes. UNIX supports two kinds of links, the hard link and the symbolic link, both already described in Section 7.11.
More details on this topic are given on the Support Material at www.routledge.com/9781032467238.
FIGURE 7.14 A schematic representation of the Linux virtual file system used in today's Linux operating systems: user processes issue system calls to the VFS in the Linux kernel, which works through the page cache and device drivers to send I/O requests to the disk and tape controllers and their attached hardware devices.
Since the Linux file management scheme is mostly derived from the concepts used in UNIX file systems, the Linux file system, similarly to UNIX, also employs a tree-typed hierarchical organization of directories, in which each directory may contain files and/or other directories. A path from the root through the tree consists of a sequence of directory entries, ending in either a directory entry (dentry) or a file name. Thus, file operations can be equally performed on either files or directories.
• The Superblock Object: This object stores information describing a specific file system. Typically, the superblock corresponds to the file-system superblock or file-system control block, which is stored in a specific sector on a disk. The superblock object consists of a number of key data items, including a list of superblock operations that refer to an operation object which defines the object methods (functions) that the kernel can invoke against the superblock object. Some of the notable methods defined for the superblock object are read_inode, write_inode, remount_fs, write_super, and clear_inode.
• The Inode Object: As in UNIX, an inode is associated with each file. The inode object holds all the information about a named file except its name and the actual data contents of the file. Items contained in an inode object include the owner, group, permissions, access times for a file, size of data, and number of links. However, the inode object also includes an inode-operations object that describes the file system's implemented functions that the
VFS can invoke on an inode. A few of the methods (functions) defined for the inode object include the following:
• create: Create a new inode for a regular file associated with a dentry object in some directory.
• lookup: Search a directory for an inode corresponding to a filename.
• mkdir: Create a new inode for a directory associated with a dentry object in some directory.
• The Dentry Object: A dentry (directory entry) is simply a specific component in a path. The component may be either a directory entry or a file name. When a file is opened, the VFS transforms its directory entry into a dentry object. Dentry objects facilitate access to files and directories and are cached (in a dentry cache) so that the overhead of building them from the directory entry can be avoided if the file is opened repeatedly during a computing session.
• The File Object: The file object is used to represent a file opened by a process. The object is created in response to the open() system call and destroyed in response to the close() system call. The file object consists of a number of items, including the following:
• the dentry object associated with the file
• the file system containing the file
• a file object usage counter
• the user's user-ID
• the user's group-ID
• the file pointer, which is the current position in the file from which the next operation on the file will take place
The file object also includes a file-operations object that describes the file system's implemented functions that the VFS can invoke on a file object. The methods (functions) defined for the file object include read, write, open, release, and lock.
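As a rough illustration of how a concrete file system plugs its methods into the VFS, the sketch below registers the file-object methods named above for a hypothetical file system "myfs"; the structure layout, names, and signatures are simplified assumptions rather than the real kernel API.

```c
/* Illustrative sketch only: a file system supplies a table of function
 * pointers for the file-object methods (read, write, open, release, lock),
 * and the VFS dispatches every file operation through that table. */
#include <stddef.h>
#include <sys/types.h>

struct file;                          /* the VFS file object (opaque here) */

struct file_operations {
    ssize_t (*read)(struct file *f, char *buf, size_t len, off_t *pos);
    ssize_t (*write)(struct file *f, const char *buf, size_t len, off_t *pos);
    int     (*open)(struct file *f);
    int     (*release)(struct file *f);
    int     (*lock)(struct file *f, int cmd);
};

/* A hypothetical file system "myfs" registering trivial implementations. */
static ssize_t myfs_read(struct file *f, char *buf, size_t len, off_t *pos)        { return 0; }
static ssize_t myfs_write(struct file *f, const char *buf, size_t len, off_t *pos) { return 0; }
static int     myfs_open(struct file *f)          { return 0; }
static int     myfs_release(struct file *f)       { return 0; }
static int     myfs_lock(struct file *f, int cmd) { return 0; }

static const struct file_operations myfs_file_ops = {
    .read = myfs_read, .write = myfs_write,
    .open = myfs_open, .release = myfs_release, .lock = myfs_lock,
};
```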
• Locks: The standard file system of Linux is ext2, which was influenced by the design of UNIX BSD. ext2 provides a variety of file locks for process synchronization. Advisory locks are those that are supposed to be heeded by processes to ensure mutual exclusion; however, the file system does not enforce their use. UNIX file locks belong to this category. Mandatory locks are those that are checked by the file system; if a process attempts to access data that is protected by a mandatory lock, the process is blocked until the lock is reset by its holder. A lease is a special kind of file lock which is valid for a specific amount of time, after which another process that tries to access the data protected by it can obtain access. It is implemented in the following way: if a process attempts to access data protected by a lease, the holder of the lease is alerted by the file system. The holder then has a stipulated interval of time to finish accessing the file and free the lease. If it fails to do so, its lease is broken, and access to the data protected by the lease is awarded to the process that was attempting to access it.
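From user space, the advisory category corresponds to the classic UNIX record locks obtainable through fcntl(); a minimal sketch follows (the file name is illustrative), showing why such locks bind only cooperating processes.

```c
/* Minimal sketch of an advisory write lock on a whole file using the POSIX
 * fcntl() interface.  Cooperating processes that also use fcntl() will wait
 * (F_SETLKW) until the lock is released; a process that ignores the protocol
 * can still write -- which is exactly what makes the lock "advisory". */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("shared.dat", O_RDWR);   /* illustrative file name */
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl = {0};
    fl.l_type   = F_WRLCK;                 /* exclusive (write) lock    */
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;                       /* 0 means "to end of file"  */

    if (fcntl(fd, F_SETLKW, &fl) == -1) {  /* block until lock granted  */
        perror("fcntl");
        return 1;
    }

    /* ... critical section: update the file ... */

    fl.l_type = F_UNLCK;                   /* release the lock          */
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}
```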
• Disk Space Allocation: ext2, like UNIX BSD's fast file system, employs the notion of a block group consisting of a set of consecutive disk blocks to reduce the movement of disk heads when file data are accessed. It uses a bitmap to keep track of free disk blocks in a block group. When a file is created, ext2 tries to allocate disk space for the inode of the file within the same block group that contains its parent directory and also to place the file data within the same block group. Every time a file is extended by the addition of new data, it searches the bitmap of the block group to find a free disk block that is close to a target disk block. If such a disk block is found, it checks whether a few adjoining disk blocks are also free and preallocates a few of these to the file in anticipation of its forthcoming requirements. If such a free disk block is not found, it preallocates a few contiguous disk blocks located elsewhere in the block group. In this way, large sections of data can be read without much movement of the disk head. When the file is closed, preallocated but unused disk blocks are released. This strategy of disk space allocation, which uses (almost) contiguous disk blocks for contiguous sections of file data, provides notably increased performance in file access, even when files are created and deleted at a high rate.
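The near-goal search and preallocation idea can be sketched as follows; this is only an illustration of the strategy described above, with made-up constants, not ext2's actual allocator.

```c
/* Sketch of ext2-style allocation within one block group: look for a free
 * block near a "goal" block in the group's free-block bitmap, then
 * opportunistically preallocate a few blocks that follow it.  Real ext2 is
 * considerably more involved; only the idea is shown here. */
#include <stdint.h>

#define BLOCKS_PER_GROUP 8192            /* illustrative group size */

static int bit_is_free(const uint8_t *bitmap, int b)
{
    return !(bitmap[b / 8] & (1u << (b % 8)));
}

static void mark_used(uint8_t *bitmap, int b)
{
    bitmap[b / 8] |= (1u << (b % 8));
}

/* Returns the allocated block number (or -1) and preallocates up to
 * 'prealloc' additional adjoining free blocks, reporting how many in *got. */
int alloc_near_goal(uint8_t *bitmap, int goal, int prealloc, int *got)
{
    *got = 0;
    for (int d = 0; d < BLOCKS_PER_GROUP; d++) {       /* widen search around goal   */
        int b = (goal + d) % BLOCKS_PER_GROUP;
        if (!bit_is_free(bitmap, b))
            continue;
        mark_used(bitmap, b);
        for (int k = 1; k <= prealloc && b + k < BLOCKS_PER_GROUP
                        && bit_is_free(bitmap, b + k); k++) {
            mark_used(bitmap, b + k);                   /* preallocate adjoining blocks */
            (*got)++;
        }
        return b;
    }
    return -1;                                          /* group has no free block */
}
```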
More details on this topic are given on the Support Material at www.routledge.com/9781032467238.
• Salient Features of NTFS: The design objective of NTFS is to offer adequate flexibility; although it is built on a simple file system model, it is itself a powerful file system. Some of the notable features of NTFS include the following:
• Large disk devices and large files: It supports very large disk devices and very large files.
• Security: It uses security descriptors containing many security attributes and provides features such as a sophisticated protection system, encryption, and data compression.
• Recoverability: NTFS can restore damaged data in files, as well as metadata, by reconstructing disk volumes in the event of any type of system crash. To provide full recoverability, including user data, it has to incorporate much more elaborate and resource-consuming recovery facilities. Moreover, it employs redundant storage (backup) for critical file-system data that describes the structure and status of the file system. In addition, it may use a RAID architecture to avoid any loss of user files.
• Multiple data streams: In NTFS, a file is a collection of attributes, and each attribute is considered an independent byte stream. The data in a file are also treated as a stream of bytes and are themselves considered an attribute. In NTFS, it is possible to define multiple data streams for a single file. The use of such multiple data streams provides enormous flexibility; for instance, a large graphic image may have a smaller thumbnail associated with it. Such a stream can contain as much as 2^48 bytes at one extreme and as little as a few hundred bytes at the other.
• Indexing of attributes: The descriptions of the attributes associated with each file are organized by NTFS as a relational database so that they can be indexed by any attribute.
• NTFS Volume: NTFS views disk storage in the following way:
• Sector: The smallest possible storage unit on the disk. Its data size in bytes is a power of 2 and is almost always 512 bytes.
• Cluster: A collection of one or more contiguous (one after another on the same track) sectors; its size in terms of sectors is always a power of 2.
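Since both sizes are powers of 2, mapping a logical cluster number to a byte offset on the volume is simple arithmetic; the tiny sketch below uses the common 512-byte sector and an assumed 8 sectors per cluster.

```c
/* Illustrative arithmetic only: with 512-byte sectors and 8 sectors per
 * cluster, a cluster holds 4096 bytes, so cluster number n starts at byte
 * offset n * 4096 on the volume. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t sector_size = 512;        /* bytes, almost always 512 */
    const uint64_t sectors_per_cluster = 8;  /* always a power of 2      */
    const uint64_t cluster_size = sector_size * sectors_per_cluster;

    uint64_t lcn = 1000;                     /* a logical cluster number */
    printf("cluster %llu starts at byte offset %llu\n",
           (unsigned long long)lcn,
           (unsigned long long)(lcn * cluster_size));
    return 0;
}
```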
FIGURE 7.15 A representative block diagram of the volume layout used in the Windows NT file system: a partition boot sector, followed by the master file table, the system and user files, and the remaining file area.
Hence, while recovering from a failure, NTFS checks whether a transaction was in progress at the time of the failure. If so, it completes the transaction, as described earlier, before resuming operation. However, to avoid losing file-system data in a crash and to cover the risk of such loss, the log is not discarded immediately after an update is completed. Instead, NTFS usually takes a checkpoint every 5 seconds and only then discards the corresponding log records. In the case of a crash, the file system can then be restored by reloading the checkpoint and processing the records, if any, remaining in the log file.
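The recovery step can be pictured as a redo pass followed by an undo pass over the log written since the last checkpoint; the sketch below is a generic illustration of that idea under assumed record formats, not NTFS's actual log structure.

```c
/* Generic checkpoint-based recovery sketch: replay log records written
 * after the last checkpoint, then roll back any transaction that had not
 * committed when the crash occurred.  Record layout and the redo/undo
 * placeholders are assumptions made for illustration. */
#include <stddef.h>

enum rec_kind { REC_UPDATE, REC_COMMIT };

struct log_record {
    enum rec_kind kind;
    int           txn_id;
    /* redo/undo information for the update would live here */
};

static void redo(const struct log_record *r) { (void)r; }  /* apply the logged update  */
static void undo(const struct log_record *r) { (void)r; }  /* revert the logged update */

static int committed(int txn, const struct log_record *log, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (log[i].kind == REC_COMMIT && log[i].txn_id == txn)
            return 1;
    return 0;
}

void recover(const struct log_record *log, size_t n)
{
    for (size_t i = 0; i < n; i++)                    /* redo pass */
        if (log[i].kind == REC_UPDATE)
            redo(&log[i]);

    for (size_t i = n; i-- > 0; )                     /* undo pass, newest first */
        if (log[i].kind == REC_UPDATE && !committed(log[i].txn_id, log, n))
            undo(&log[i]);
}
```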
More details about this topic with a figure and tables are given on the Support Material at www.routledge.com/9781032467238.
SUMMARY
The FMS described in this chapter primarily carries out the responsibility of creating, manipulating, and maintaining data in the form of files kept on a persistent storage medium for practical use. A file is merely an organized collection of related records that are encapsulated in the file system, which is the most visible interface for users and can be conceived as a tree of two basic abstractions: directories and files (leaves), where each directory may contain subdirectories and files. In essence, a file system is a logical view of data stored on a device such as a disk, CD, flash drive, or tape.
The FMS organizes files, directories, and control information (metadata); provides convenient and secure access to files and directories; and implements many operations, such as create, delete, open, read, write, and close, that users can apply to files and directories to manage their contents. This chapter also explains how the space on block devices (disks) is organized to hold file systems and describes some space management techniques. The space allocated to a file may be contiguous or noncontiguous, arranged in the form of a linked list, an index-sequential structure, and so on. Different free space management techniques, including bitmap schemes, are also discussed. Many runtime data structures that reside in the kernel space, and how they are interlinked, are also described. In addition, this chapter shows the importance of reliability, protection, concurrency control, and journaling and discusses different fault-tolerant techniques and recoverability. A presentation of a modern log-structured file system is included. This chapter also briefly discusses VFSs, which allow many real file systems to coexist under the umbrella of a single VFS. Finally, a brief overview of the actual and practical implementation of the UNIX, Linux, and Windows file systems is presented as a case study.
EXERCISES
1. Describe the user's view as well as the designer's view while developing a file system.
2. Define file system. Describe the user's view of the file system. Describe the view that the designer uses while developing a file system.
3. What is meant by file types? What are the various methods employed to specify the file type? What are the advantages obtained by specifying the type of a file?
4. What is meant by file attribute? State its significance. State and explain those file attributes that are mainly used by most operating systems.
5. "Different file operations are realized by using respective system calls". Give your views, with suitable examples. Explain with an example the operation of at least one such system call to realize the respective file operation.
6. "An operating system is categorized by the different file services it offers". Justify this, giving the different classifications of services that a file system usually offers to its system call users. What relationship exists between file services and a file server?
7. Explain the purpose and the benefits derived from a file control block with an approximate functional specification of its structure. What is the significance of the current file pointer field in the file control block?
8. When a file is opened concurrently by several processes, should each process construct a separate file control block of its own to connect to the shared file, or should the involved processes share a single FCB? Discuss the relative merits of each approach and propose a strategy for managing the sharing of files.
9. State the minimum requirements that file management systems must meet to perform their responsibilities.
10. Describe with a diagram the design principles involved in a generic file system.
11. Why is the average search time to find a record in a file less for an indexed sequential file than for a sequential file?
12. An index sequential file contains 5000 records. Its index contains 100 entries. Each index entry describes an area of the file containing 50 records. If all records in the file have the same probability of being accessed, calculate the average number of operations involved in accessing a record. Compare this number with the number of disk operations required if the same records were stored in a sequential file.
13. "An indexed (inverted) file exhibits certain advantages and beats indexed-sequential files in some situations of file processing". Give your comments.
14. What does file structuring mean? Some operating systems design a file system as tree-structured but limit the depth of the tree to a small number of levels. What effect does this limit have on users? What are the advantages of this type of structuring? How does this simplify file system design, if it does at all?
15. What are the typical operations performed on a directory? Discuss the relative merits and demerits of a system that provides a two-level directory in comparison to a single-level (flat) directory system.
16. The use of hard links (or simply links) poses some inconveniences at the time of sharing files. What are the major limitations that are faced in this arrangement? Symbolic links are supposed to alleviate these limitations: discuss.
17. What is a graph directory structure? How does this structure overcome the problems faced by a tree-structured directory when a file is shared? State the specific access rights provided under this structuring by most OSs at the time of file sharing.
18. State the different methods of blocking that are generally used. Given: B = block size, R = record size, P = size of block pointer, and F = blocking factor, that is, the expected number of records within a block, derive a formula for F for all the methods of blocking you have described.
19. State and explain the different disk space allocation techniques that are used in the noncontiguous allocation model. Which technique do you find suitable, and in what situation?
20. The logical disk block is of size 512 bytes. Under contiguous allocation, a file is stored starting at logical disk block 40. To access the byte at file address 2000, which logical disk block is to be read into memory? To read 100 bytes from the file at file address 2000, how many disk accesses are required?
21. State and explain the relative merits and drawbacks of the linked (or chained) allocation scheme on secondary storage space.
22. For what type of file processing is indexed allocation found suitable? Explain. What are the merits and drawbacks of an indexed allocation scheme?
23. Calculate the number of disk accesses needed to read 20 consecutive logical blocks of a file in a system with: a. contiguous allocation, b. chained allocation, and c. indexed allocation of space. Discuss your findings, using an appropriate figure for illustrative purposes, if necessary. Explain the timing difference in this regard between logical block accessing and physical block accessing.
24. Free disk space can be tracked by using a free list or a bitmap. Disk addresses require D bits. In a disk with B blocks, F of which are free, state the condition under which the free list uses less space than the bitmap. For D with a value of 32 bits, express your answer as a percentage of the disk that must be free.
25. Consider a hierarchical file system in which free disk space is maintained in a free list.
a. Suppose the pointer to free space is lost. Can the system reconstruct the free space list?
b. Suggest a scheme to ensure that the pointer is never lost as a result of a single memory failure.
26. How many device operations are required to add a released node to a free list when the disk (block) status map approach is used to implement the free list?
27. A file system implements multi-level indexed disk space allocation. The size of each disk block is 4 Kbytes, and each disk block address is 4 bytes in length. The size of the FMT is one disk block. It contains 12 pointers to data blocks. All other pointers point to index blocks. What is the maximum file size supported by this system?
28. A sequential file ABC contains 5000 records, each of size 4 Kbytes. The file accessing parameters are:
Average time to read a disk block = 2 msec.
Average time to process a record = 4 msec.
Calculate the time required by a process that reads and processes all records in the file under the following conditions:
a. The file system keeps the FMT in memory but does not keep any index blocks in memory while processing the file.
b. The file system keeps the FMT and one index block of the file in memory.
29. State the major aspects that are to be taken into account when the reliability of the file system is considered of prime importance.
30. "File system integrity is an important issue to both users and computer systems". Give your views.
31. What are the different methods of backup used to recover a file system? What are the various overheads associated with them? Which backup method seems preferable, and why?
32. Discuss how the stable storage technique can be used to prevent loss of file system integrity. What are the drawbacks of the stable storage technique?
33. Discuss how the atomic-action mechanism can be used to prevent loss of file system integrity. In what way does it remove the drawbacks of the stable storage technique?
34. Define virtual file system. Explain with a diagram how it abstracts the generic file model.
35. State and describe the log-structured file system. Discuss its salient features and the areas in which it has offered great benefits.
McKusick, M. K., Joy, W. N., et al. "A Fast File System for UNIX", ACM Transactions on Computer Systems, vol. 2, no. 3, pp. 181–197, 1984.
Rosenblum, M., Ousterhout, J. K. "The Design and Implementation of a Log-Structured File System", Proceedings of the 13th ACM Symposium on Operating Systems Principles, New York, ACM, pp. 1–15, 1991.
Rubini, A. "The Virtual File System in Linux", Linux Journal, no. 37, 1997.
Seltzer, M. I., Smith, K. A., et al. "File System Logging Versus Clustering: A Performance Comparison", USENIX Winter, pp. 249–264, 1995.
Wiederhold, G. File Organization for Database Design. McGraw-Hill, New York, 1987.
Yeong, W., Howes, T., et al. "Lightweight Directory Access Protocol", Network Working Group, Request for Comments, 1995.
8 Security and Protection
Learning Objectives
8.1 INTRODUCTION
To safeguard the valuable, sensitive, and confidential information of the user/system, as well as the precious assets in the computing environment, from unauthorized access, revelation, or destruction of data/programs, adequate protection mechanisms are inevitably required. Protection is concerned with threats that are internal to a computer system, whereas security, in general, deals with threats to information that are external to it. As user density continuously increases, sharing of programs and data, remote access, and connectivity become unavoidable, thereby exposing the system to the outside world and ultimately creating major security weaknesses and likely points of penetration in improperly designed systems. In fact, the area of computer protection and security is a broad one and encompasses physical and administrative controls as well as controls that are automated with the use of tools that implement protection and security.
While designers of computer systems and related software developers enforce extensive security measures to safeguard the computing environment as much as possible, those measures, in turn, can increase the cost and complicate the computing environment to such an extent that they may eventually restrict its usefulness and user-friendliness and, above all, badly affect the overall performance of the entire computer system. Thus, a good balance is required to make the computing environment sufficiently efficient and effective, but with no compromise on security. Therefore, the computer, and especially the operating system, must be sufficiently equipped to provide an adequately flexible and functionally complete set of protection mechanisms so that the ultimate objectives of enforcing security policies can be effectively attained. This chapter is devoted to a variety of issues and approaches related to the security and protection of standalone systems that are equally applicable to large mainframe systems as well as to timesharing systems comprising a set of small computers connected to shared servers over communication networks.
More about this topic is given on the Support Material at www.routledge.com/9781032467238.
The security system is actually built using the protection mechanisms that the operating system (file management) provides. The two key methods used by the operating system to implement protection, and thereby enable users to secure their resources, are authentication and authorization.
Authentication is commonly the method of verifying the identity of a person intending to use the system. Since physical verification of identity is not practicable in contemporary operating environments, computer-based authentication is provided using a set of specific assumptions. One such assumption is that a person is the user they claim to be if they know something that only the permissible user is supposed to know. This is called authentication by knowledge (password-based). The other method is to assume a person is the claimed user if they have something that only the allowed user is expected to possess. A related approach is biometric authentication, that is, authentication based on a unique and inalterable biological feature, such as fingerprints, the retina, or the iris.
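A minimal sketch of authentication by knowledge follows, assuming the usual scheme in which only a salted hash of the password is stored; it uses the POSIX crypt() routine (on many systems this needs <crypt.h> and linking with -lcrypt), and the function name is illustrative.

```c
/* Authentication by knowledge, sketched: the system never stores the
 * password itself, only a salted hash, and at log-in it hashes the
 * submitted password with the same salt and compares the results. */
#define _XOPEN_SOURCE          /* expose crypt() on POSIX systems        */
#include <string.h>
#include <unistd.h>            /* some systems need <crypt.h> instead    */

/* Returns 1 if 'attempt' hashes to the stored value, 0 otherwise. */
int authenticate(const char *attempt, const char *stored_hash)
{
    /* The stored hash carries its own salt prefix, so it can be reused
     * directly as the salt argument. */
    char *h = crypt(attempt, stored_hash);
    return h != NULL && strcmp(h, stored_hash) == 0;
}
```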
Authorization, on the other hand, is the act of verifying a user's right to access a resource in a specific manner. In other words, it is the act of determining the privileges of a user. These privileges are ultimately used to implement protection.
While the security setup consists of the authentication service and the authentication database, the protection setup consists of the authorization service, the authorization database, and the service and resource manager. The authentication service generates an authentication token after it has verified the identity of a user and passes this token to the authorization service, whose authorization database holds a pair of the form (authentication token, privileges) for every registered user of the system. The authorization service consults the database to find the privileges already granted to the user and passes this information to the service and resource manager. Whenever the user or a user process issues a request for a specific service or resource, the kernel attaches the user's authentication token to it. The service and resource manager then checks whether the user has been authorized to use the service or resource and grants the request if it is consistent with the user's privileges. Figure 8.1 illustrates this approach.
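The flow just described can be sketched as follows; the token layout, the privilege encoding, and the stand-in authorization database are all illustrative assumptions rather than a real OS interface.

```c
/* Schematic sketch of the set-up in Figure 8.1: an authentication token is
 * mapped to the privileges recorded for that user, and the service and
 * resource manager grants a request only if it falls within them. */
#include <stdint.h>

typedef uint32_t priv_set;                 /* one bit per privilege */
#define PRIV_READ_FILES   (1u << 0)
#define PRIV_WRITE_FILES  (1u << 1)
#define PRIV_USE_PRINTER  (1u << 2)

struct auth_token { int user_id; };        /* issued by the authentication service */

/* Stand-in for the authorization database consulted by the
 * authorization service. */
static priv_set authorize(struct auth_token t)
{
    return (t.user_id == 0) ? (PRIV_READ_FILES | PRIV_USE_PRINTER)
                            : PRIV_READ_FILES;
}

/* Service and resource manager: grant only requests consistent with the
 * user's privileges. */
int grant_request(struct auth_token t, priv_set requested)
{
    priv_set granted = authorize(t);
    return (requested & ~granted) == 0;    /* nothing requested beyond what is granted */
}
```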
FIGURE 8.1 A schematic block-wise illustration of a representative generic security and protection set-up
used in a generic operating system.
The distinction between protection and security provides a neat separation of concerns for the operating system. In a conventional operating system, the security aspect is limited to ensuring that only registered users can use the OS. When a user logs in, a security check is performed to determine whether the user is a registered user of the OS and, if so, to obtain their user-id. Following this check, all threats to resources in the system are a protection concern; the OS uses the user-id of a person to decide whether they can access a specific resource under the OS. In a distributed system, security aspects are relatively complex due to the presence of a set of computers connected by networking components. Therefore, we confine our discussion of this aspect in this section to conventional uniprocessor operating systems.
• Policies and mechanisms: Security policies state whether a person should be allowed to use a system. Protection policies, on the other hand, specify whether a user should be allowed to access a specific resource (file). Both of these policies are set outside the domain of the OS. A system administrator decides whether a person should be allowed to become a user of a system. Likewise, while creating a new file, the file's creator specifies the set of users who are permitted to access it. These policies are implemented by using certain security and protection mechanisms which perform a set of specific checks during the operation of the system. The security policy itself can be defined in user space, and many operating systems do layer the functionality in this way: a very small part of the OS implements the mechanisms, while the other parts of the OS, system software and utilities, or user software determine the policy. As a result, protection and security depend first on the OS protection mechanisms and then on the security policies chosen by designers and administrators. The separation of policy and mechanism has been discussed by Sandhu (1993). Our objective, however, will be to emphasize mechanisms more and policies less.
Security mechanisms have provisions to add new users or verify whether a person is an authorized user of the system. The latter mechanism is called authentication, and it is invoked whenever a person attempts to log in to the OS. Protection mechanisms set protection information for a file or check whether a user can be allowed to access a file. This mechanism is called authorization; it is invoked whenever a person attempts to access a file or when the owner of a file wishes to alter the list of users who are allowed to access it.
• Confidentiality or Secrecy: This requires that the information in a computer system be accessible only by authorized parties. Disclosure of information to unauthorized parties may lead to catastrophic losses, depending on the nature of the information in question. Secrecy is definitely a security concern, because it is threatened by entities or parties outside an operating system. An OS negotiates it using an authentication service.
• Privacy: This means that the information should be used only for the purposes for which it is intended and shared. Privacy is a protection concern that guards individuals from misuse of information. An OS negotiates privacy through the authorization service that determines the privileges of a user, and the service and resource manager disallows all requests that fall outside a user's privileges. It is up to the users to ensure the privacy of their information using this setup, and they can then allow other users to share the information by setting the authorization for the information accordingly. It can also be called controlled sharing of information. It is based on the need-to-know principle.
• Authenticity: This requires that the computer system be able to verify the identity of the source or sender of information and also be able to verify that the information is preserved in the form in which it was created or sent.
• Integrity: This requires and ensures that the computer system assets can be modified only by authorized parties. Modification usually includes writing content, changing content, changing status, creating and deleting, and so on. This way, unauthorized modification by means of unlawful penetration to destroy or corrupt the information can be prevented.
• Availability: This requires that the computer system assets always be available to authorized parties and that no one be able to disturb the system so as to make it unusable. Such denial of service (DoS) attacks are becoming increasingly common. For example, if a computer is used as an internet server, sending a flood of requests to it may cripple it by eating up all of its CPU time simply for examining and discarding incoming requests. If it takes, say, 100 msec to process an incoming request, the server can handle only about 10 requests per second, so anyone who manages to send 10,000 requests/sec can straightaway wipe it out.
Reasonable models and technology for dealing with attacks on confidentiality, authenticity, and integrity are available, but handling denial-of-service attacks is much harder. In fact, confidentiality (secrecy), authenticity, and integrity are both protection and security concerns. Elaborate arrangements are thus needed to handle these concerns. However, the security aspect is actually more of an issue in distributed OSs and comparatively less so in uniprocessor-based traditional operating systems. Moreover, all these concerns are relatively easy to handle as protection concerns on any type of operating system, because the identity of the user has already been verified before the authorization and validation of a request are carried out, which are considered part of the protection setup already shown in Figure 8.1. Last but not least, security threats, as a whole, are more severe and can appear more easily in a distributed OS, since it is mostly exposed to the outside world. For example, when an interprocess message travels over open communication links, including public links, it is quite possible for external entities to enter this domain and tamper with messages.
More about this topic is given on the Support Material at www.routledge.com/9781032467238.
More about this topic with a figure is given on the Support Material at www.routledge.com/9781032467238.
• Hardware: The main threat to the system hardware is mostly in the area of availability and includes deliberate damage and accidental mishap, as well as theft. A logged-on terminal left unattended, when accessed by an intruder, eventually results in their gaining full access to all the system resources available to the legitimate user whose identity is assumed. The introduction of networking facilities, together with a rapid increase in user density, further increases the potential for such damage in this area. As this domain is least conducive to automated controls, appropriate physical and administrative measures are required to negotiate all these attacks.
• Software: A key threat to both system and application software is an attack on its integrity/authenticity by way of unauthorized modifications that cause a working program either to fail during execution or to keep functioning but behave erratically. Computer viruses, worms, Trojan horses, and other related attacks belong to this category and are discussed later ("Malicious Programs") in this chapter. Another type of attack is launched by using trap doors, which are secret points of entry in the software that allow access to it without going through the usual security access procedures. Trap doors are deliberately left by software designers themselves for many reasons, presumably to allow them to access and possibly modify their programs after installation for production use. Trap doors can be abused by anyone who is already aware of them or acquires knowledge of their existence and the related entry procedure. Another difficult problem commonly faced is software availability. Software is often deliberately deleted or is altered or damaged to make it useless. To counter this problem, careful software management often includes the common practice of always keeping backups of the software's most recent version.
• Data: Data integrity/authenticity is a major concern for all users, since data in files or in any other form is a soft target for any type of security attack. Malicious attempts to modify data files can have consequences ranging from inconvenience to catastrophic losses. Attacks on the availability of data are concerned with the destruction of data files and can happen unintentionally, accidentally, or maliciously.
• Communication Links and Networks: Since a communication link is a vulnerable component of a computing environment, many attacks are launched in this area. Although they are mainly found in distributed operating systems, where they can result in severe consequences, they have also been observed to be equally damaging to non-distributed conventional operating systems. Network security attacks can generally be classified most effectively in terms of passive attacks and active attacks. A passive attack only attempts to obtain or make use of information from the system and does not alter or affect system resources. An active attack, on the other hand, attempts to directly damage or alter the system resources or can even affect their normal operation.
Passive attacks are hard to detect because they do not cause any alteration of data, and their presence cannot even be guessed beforehand. They give no indication that a third party is prying to capture messages or at least to observe a traffic pattern. However, it is not impossible to prevent the success of such attempts, generally by means of encryption. Thus, the major aim in dealing with passive attacks is to ensure prevention rather than detection.
Brief details on this topic with a figure are given on the Support Material at www.routledge.com/9781032467238.
A DoS attack may be launched with a specific target in view. For example, it can corrupt a particular program that offers a specific service. It can damage or destroy some configuration information that resides within a kernel; for example, access to an I/O device can be denied by simply changing its entry in the physical device table of the kernel. Another class of DoS attacks is launched by overloading a resource with phantom requests to such an extent that genuine users of the resource are denied its use. A network DoS attack may be launched by disrupting an entire network, either by disabling the network or by overloading (flooding) it with messages so that network bandwidth is denied to genuine messages, leading to an inability to respond to any messages and eventually causing severe degradation in system performance. A distributed DoS attack is one that is launched by a few intruders located in different hosts in the network; it is perhaps the most difficult to detect and equally hard to prevent.

FIGURE 8.2 A generalized representation of different types of active attacks launched on the security system of a generic operating system.
Active attacks exhibit exactly the opposite characteristics of passive attacks. While passive attacks are difficult to detect, there are certain means and measures that can prevent their success. Active attacks, on the other hand, cannot be absolutely prevented, because that would require strict physical protection mechanisms and vigilance on all communication facilities and paths at all times. Therefore, the goal while negotiating such active attacks should simply be to detect them and subsequently recover, by any means, from any disruption or delays caused by them. Although the detection procedure may hinder normal operation, it may have a certain deterrent effect and thus also contribute to prevention.
• Policies
Security policies must address both external and internal threats. Since most threats are organized by insiders, the policies primarily encompass procedures and processes that specify:
Additional aspects can be included to expand the security domain to limit other possibilities of danger. However, security policies are often based on some well-accepted, time-proven basic principles:
Least privilege: Every subject is to be allowed to access just the information required to complete the tasks that the subject is authorized to perform. For example, the accountants in a factory need not have access to the production data, and similarly, factory supervisors should not be allowed to access the accounting data.
Rotation in responsibilities: Sensitive and confidential operations should not be permanently entrusted to the same personnel or the same group of personnel. Some rotation can often be used to prevent foul play or wrongdoing.
Isolation in duties: In the case of critical operations that can put an organization at risk, two or more people with conflicting interests should be involved in carrying them out. In other words, two people with different involvement should be given charge of two different keys to open the vault.
The selection of an adequate security policy for a given installation and for specific data therein is commonly a trade-off between the perceived risk of exposure, the potential loss due to the damage or leakage of information, and the cost to be incurred to implement the required level of security.
The selection process will analyze the risk assessment and the related assessment of cost, which includes the cost of equipment, personnel, and performance degradation due to the implementation of security measures. Once the analysis is over, suitable and appropriate security policies can then be chalked out. Most computer-related security policies belong to one of two basic categories:
• Discretionary Access Control (DAC): Under this category, policies are usually defined by the creator or owner of the data, who may permit and specify access rights for other users. This form of access control is quite common in file systems. It is, however, vulnerable to the Trojan horse attack (to be discussed in later subsections), where intruders pose as authorized and legitimate users.
• Mandatory Access Control (MAC): In this scheme, users are classified according to the level of authority or permissions to be awarded. Data are also categorized into security classes according to level of sensitivity or confidentiality, and stringent rules are then specified, which must be strictly followed, regarding which level of user clearance is required for accessing the data of a specific security class. Mandatory access restrictions are thus not subject to user discretion and hence limit the damage that a Trojan horse can cause. For example, military documents are usually categorized as top secret, secret, confidential, and unclassified. The user must have clearance equal to or above that of the document in order to gain access to it. MAC also appears in other systems, perhaps in less obvious forms. For example, a university authority cannot pass the right to modify grade records to students.
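A minimal sketch of the mandatory check implied by the military example follows, assuming a simple ordered set of levels and showing only the "clearance must dominate classification" rule.

```c
/* Mandatory access control, sketched: both users and documents carry a
 * fixed sensitivity level, and read access is allowed only when the user's
 * clearance is at least the document's class.  This is the simple
 * "no read up" check, not a complete MAC model. */
enum level { UNCLASSIFIED = 0, CONFIDENTIAL = 1, SECRET = 2, TOP_SECRET = 3 };

int mac_may_read(enum level user_clearance, enum level doc_class)
{
    /* Not subject to user discretion: the rule is fixed by the system. */
    return user_clearance >= doc_class;
}
```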
Security measures, in general, must address both external and internal security. External or physical security includes the standard age-old techniques of fencing, surveillance, authentication, attendance control, and monitoring. Physical security also demands replication of critical data for recovery from disasters (such as accidental system crashes, fire, or flood), access restrictions to computer systems, and safe storage areas for maintaining backups.
Major efforts are, however, exerted to realize internal security mechanisms, which encompass issues primarily related to the design of the OS that lays the basic foundation of the mechanisms used to implement security policies. Saltzer and Schroeder (1975) identified several general principles that can be used as guidelines for designing secure systems. A brief summary of their ideas (based on experience with MULTICS) is given here:
Least privilege: Give each process the least privilege that enables it to complete its task. For example, if an editor has clearance to access only the file to be edited (specified when the editor is invoked), then an editor infected with a Trojan horse cannot do much damage. This principle effectively advocates support for small protection domains, which implies a fine-grained protection scheme. It also requires switching of domains when the access needs change.
Separation of privilege: Whenever possible, privilege with respect to access to objects should be granted in such a way that more than one condition has to be satisfied (in other words, two keys to open the vault).
Least common mechanism: Minimize the amount of mechanism that is common to and depended upon by multiple users. The designed mechanism should incorporate techniques for separating users, such as logical separation via virtual machines and physical separation across the different machines present in distributed systems.
Complete mediation: Every access should be checked for authorization. The checking mechanism should be effective and efficient, since it has an immense impact on the performance of the system.
Squirrel checking: Every access should be checked for current authority. The system should not check for permission once, determine that access is permitted, and then squirrel away this information for subsequent use. Many systems check for permission when a file is opened and not afterward. This means that a user who opens a file and keeps it open for weeks will continue to have access, even if the owner has long since changed the file protection.
Fail-safe default: Access rights should be acquired by explicit permission only, and the default should be no access. Errors in which legitimate access is refused will be reported much faster than errors that result from unauthorized access.
Open design: The design of a security mechanism should not be secret; rather, it should be public. It should not depend on the ignorance of attackers. Assuming that the intruder will not know how the system works serves only to delude the designers.
Economy of mechanisms: The design should be kept as simple and uniform as possible to facilitate verification and correct implementation. It should be built into the lowest layers of the system. Trying to retrofit security onto an existing insecure system is nearly impossible. Security, like correctness, is not an add-on feature.
User acceptability: The scheme chosen must be psychologically acceptable. The mechanism should provide ease of use so that it is applied correctly and not circumvented by users. If users feel that protecting their files involves too much work, they just will not do it. Moreover, they may complain if something goes wrong. Replies of the form "It is your own fault" will generally not be well received.
8.7 PROTECTION
The original motivation for protection mechanisms started to evolve with the introduction of multitasking systems, in which resources such as memory, I/O devices, programs, and data are shared among users. The operating system was designed to prevent others from trespassing in a user's domain and thereby protect the users' interests as a whole. In some systems, protection is enforced by a program called a reference monitor that checks the legality of each access to a potential resource by consulting its own policy tables and then makes a decision that enables the system to proceed correctly. We will discuss later in this section some of the environments in which a reference monitor is expected to be involved. Pfleeger (1997) identifies the following spectrum of approaches along which an operating system may provide appropriate protection to a user.
• No protection: This approach is workable if sensitive procedures can be run at separate times.
• Isolation: This implies that each process is a standalone one that operates separately from other processes, with no dependency, sharing, or communication. Each process has its own address space, files, and other objects to complete its task.
• Share all or share nothing: The owner or creator of an object (program, data, or memory area) can declare it to be public or private. In the former case, any other process may access the object, whereas in the latter, only the owner's processes may access it.
• Share via access limitation: This option tells the OS to check the legality of each access when it is made by a specific user to a particular object. This ensures that only authorized access is permitted.
• Share via dynamic capabilities: This allows dynamic creation of sharing rights for objects.
• Limited use of an object: This approach provides a form of protection that limits not just access to an object but the way in which the object may be used. For example, a user may be allowed to view a confidential document but not permitted to print it. Another example is that a user may have permission to query a database for information but no right to modify it.
The preceding items are arranged roughly in increasing order of difficulty of implementation, but at the same time, this shows an increasing order of the fineness of protection that they provide. One of the design objectives of an operating system is thus to strike a balance between allowing sharing of resources by many users to enhance resource usage and ensuring the protection of users' vital resources from any unauthorized access. An operating system, when designed and developed, may incorporate different degrees of protection for different users, objects, or applications. We will discuss here some of the most commonly used protection mechanisms that many operating systems employ to realize protection of their objects.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
A user may be permitted to access a set of objects, and an object may be accessible by a set of users. A protection structure contains information that indicates which users can access which objects and in what manner.
8.9.1 USER-ORIENTED
The access control imposed on users is sometimes, unfortunately, referred to as authentication, since that term is widely used nowadays in the sense of message authentication. We will, however, strictly refrain from using it in that sense here.
The most common technique widely used for user access control on shared systems is the user log-on, which requires a user identifier (ID) and a password. But the ID/password system is unreliable, because enough expertise has emerged to guess users' IDs and passwords. Moreover, in some systems, the ID/password file is accessible to skilled hackers, and as a result, this file becomes a soft target of penetration attempts. Modern protection systems may even resort to methods such as fingerprint or eye-scan identification. Besides, certain other means and measures are now available to counter these attempts, and those will be discussed later in this section.
In a distributed environment, user access control is either centralized or decentralized. In a centralized approach, the network system provides a log-on service that determines who is allowed to access the network and to whom the user is allowed to connect. In a decentralized approach, the network is treated by user access control as a transparent communication link, and the usual log-on procedure is carried out by the destination host. Of course, the security concerns of transmitting passwords over the network must still be addressed. In addition, many networks may provide two levels of access control. In such a system, individual hosts may be provided with a log-on facility to guard host-specific resources and applications, while the network itself as a whole may also provide some protection to limit network access to authorized users. This two-level facility is desirable for the common case in which the network connects disparate hosts and simply provides a convenient means of terminal-host access. In a more uniform network of hosts, an additional centralized access policy could also be enforced in a network-control center.
More about this topic is given on the Support Material at www.routledge.com/9781032467238.
8.9.2 DATA-ORIENTED
Following a successful log-on, the user is granted access to objects, such as one or a set of hosts (hardware) and applications. Every object has a name by which it is referenced and an access privilege, which is a right to carry out a specific, finite set of operations (e.g. read and write operations on a file, or up and down on a semaphore).
It is now obvious that a way is needed to prohibit processes (users) from accessing objects that they are not authorized to access. Associated with each user, there can be a profile provided by the OS that specifies permissible operations or a subset of the legal operations and file accesses when needed. For example, process A may be entitled to read file F but not to write to it. An access descriptor describes such access privileges for a file. Common notations, such as r, w, and x, are used to represent the privileges to read, write, and execute the data or program in a file, respectively. An access descriptor can also be represented as a set; for example, the descriptor {r, w} indicates privileges only to read and write a file. Access control information for a file is a collection of access descriptors for the access privileges held by various users.
Considerations for data-oriented access control in a network parallel those for user-oriented access control. This means that if only certain users are permitted to access certain items of data, then encryption may be useful and needed to protect those items during transmission to authorized target users. Typically, data access control is decentralized; it is usually handled by host-based database management systems. If a database server exists on a network, however, data access control becomes a function of the network.
More about this topic is given on the Support Material at www.routledge.com/9781032467238.
• Subjects: A row in the ACM represents a process (or user) with privileges for accessing different objects. In fact, any user or application gains access to an object by means of a process which represents that user or application.

FIGURE 8.3 A schematic illustration of a representative format of the access control matrix (ACM) used in the protection mechanism built into a generic modern operating system.

• Objects: A column describes access control information for anything (file, database, etc.) to which access is controlled. Examples include files, portions of files, programs, segments of memory, and software objects (e.g. Java objects).
• Access rights: The way in which an object is accessed by a subject. Examples include read, write, execute, copy, and functions in software objects.
The subjects of the ACM typically consist of individual users or user groups, although they can include terminals, hosts, or applications instead of, or in addition to, users for access control. Similarly, the objects, at the finest level of detail, may be individual data fields, but more aggregate groupings, such as records, files, or even an entire database, may also be objects in the matrix. Moreover, access descriptors can be made bit-oriented, with one bit per access right (bit = 1 meaning present, bit = 0 meaning absent) instead of one character per right, for reduced memory usage as well as access efficiency.
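A bit-oriented ACM of this kind can be sketched as a small two-dimensional array of right masks; the sizes and the particular rights below are assumptions made only for illustration.

```c
/* Illustrative access control matrix with bit-oriented access rights:
 * one word per (subject, object) cell, one bit per right. */
#include <stdint.h>

#define R_READ   (1u << 0)
#define R_WRITE  (1u << 1)
#define R_EXEC   (1u << 2)

#define N_SUBJECTS 4
#define N_OBJECTS  6

static uint8_t acm[N_SUBJECTS][N_OBJECTS];      /* all-zero cells mean "no access" */

void grant(int subj, int obj, uint8_t rights)   { acm[subj][obj] |=  rights; }
void revoke(int subj, int obj, uint8_t rights)  { acm[subj][obj] &= ~rights; }

/* Reference-monitor style check: is every requested right present? */
int check_access(int subj, int obj, uint8_t requested)
{
    return (acm[subj][obj] & requested) == requested;
}
```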
One way to alleviate this problem is to reduce the size of the access control information, and that reduction can be realized in two ways: by reducing the number of rows in the ACM or simply by eliminating the null information. Reducing the number of rows means assigning access privileges to groups of users rather than to individual users, but forming such groups may not always be feasible in practice. In fact, it compromises the granularity of protection, which eventually mars the actual objectives of the protection provision. The other alternative is to eliminate the null information in the ACM. Here, the information stored in the ACM is kept in the form of lists instead of a matrix. This approach does not affect the granularity of protection but reduces the size of the protection information, since only non-null entries of the ACM need to be present in a list. Two such list structures are commonly used in practice:
An access control list (ACL) stores the non-null information from a column of the ACM. Thus, it essentially consists of an (ordered) list of access control information for one object (file or account) covering all the users present in the system. A capability list (C-list) stores the non-null information of a row of the ACM. It thus describes all the access privileges held by a user. While an ACL can provide coarse- or medium-grained protection, the C-list provides only medium-grained protection. These two approaches will be discussed in the next section. Fine-grained protection can be obtained by using a protection domain, which will be discussed later.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
corresponding C-list of that user is searched to make a decision. C-lists are usually small in size, which limits the space and time overhead of using them to control file accesses. In fact, C-lists are themselves objects and may be pointed to from other C-lists, thereby facilitating the sharing of sub-domains.
Brief details on this topic with a figure are given on the Support Material at www.routledge.com/9781032467238.
• Protection of capabilities
Like any other object, capability lists, since they are shared, must also be kept protected from any form of tampering attempted by users. Four methods of protecting them are commonly known. The first method requires a tagged architecture in the hardware design, in which each memory word has an extra (tag) bit that specifies whether the word contains a capability. The tag bit is set by the OS, can be accessed only in kernel mode, and cannot be modified by any user instruction. When an operation OPj is to be performed on an object OBk, the CPU checks the compatibility of OPj with OBk's tag and performs the operation only if the two are compatible; otherwise the attempt at executing OPj fails. For example, a fixed-point operation will fail if applied to a float value. Tagged-architecture machines have been built and found to work satisfactorily; the IBM AS/400 (later rebranded as the iSeries) systems are a popular example of this kind. The second method is to maintain the C-list inside the operating system. Capabilities are then referred to by their position in the capability list. A process might then say: "Read 2 KB from the file pointed to by capability 5". File descriptors in UNIX use a similar form of addressing. Hydra also worked this way, as described by Wulf. The third method is to keep the C-list in user space but manage the capabilities cryptographically so that users cannot tamper with them. This approach is particularly suited to distributed systems and has been found to work well. The fourth approach extends the popular segment-based memory management scheme by introducing a third kind of segment, a capability segment, into which capabilities are inserted only by the kernel using a privileged instruction. To access the desired object, the operand field of an instruction contains two fields: the id of a capability segment and an offset (to reach the desired capability in the C-list) into this segment. The address of the object is then obtained from an object table in which each row has two fields: one field contains the object-id, and the other field contains the address of the object in the computer's primary or secondary memory. Protection of capabilities is implicit in the fact that a store operation cannot be performed on a capability segment. This feature prevents tampering with and forgery of capabilities.
ACLs and capabilities are observed to have somewhat complementary properties. Capabilities are very efficient: no searching is needed, because they can be referred to by their positions in the capability list. With ACLs, a search of a long list is required to ascertain the access privilege for a certain object if groups are not supported. Capabilities also allow a process to be easily encapsulated, whereas ACLs do not support this. On the other hand, ACLs allow selective revocation of rights if needed, which capabilities do not. Last but not least, if an object is removed and its capabilities are not, or the capabilities are removed and the object is not, problems arise; ACLs do not face such a problem.
More about this topic is given on the Support Material at www.routledge.com/9781032467238.
• Software Capabilities
An operating system that runs on a computer system with a non-capability architecture can implement capabilities in software through a component of the kernel called an object manager (OM). When a program intends to manipulate its objects, it indicates its requirements to the OM by making a system call:
Before beginning the execution of <opk>, the OM verifies that Cap(objk) contains the necessary access privileges in the C-list. But software implementation of capabilities gives rise to two major issues:
• A process may be able to bypass the capability-based protection arrangement while accessing objects.
• A process may be able to fabricate or tamper with capabilities.
To counter the first issue, one approach is to hide objects from the view of user processes by encrypting the system-wide object table, in which each row contains the object-id and an object address field that indicates the location of the corresponding object in the computer's primary or secondary memory. Processes intending to access an object then cannot locate it, because the locations of objects are not known to them, so they have to depend on the OM (and its object table) to perform object manipulation. The second issue, preventing the fabrication of or tampering with capabilities, can also be addressed using encryption. This approach has been successfully implemented in the capability protection scheme used in the distributed operating system Amoeba.
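The idea behind such cryptographically protected capabilities can be sketched as follows; the capability layout and the toy keyed hash are illustrative assumptions (a real system, Amoeba included, uses a proper cryptographic function), showing only that a forged or upgraded capability fails validation.

```c
/* Tamper-evident software capability, sketched: the object manager derives
 * a check field from the object id, the rights, and a secret only it knows,
 * and re-derives it on every use.  The "toy_mac" below is NOT cryptographic;
 * it only illustrates the mechanism. */
#include <stdint.h>

struct capability {
    uint32_t object_id;
    uint32_t rights;        /* one bit per access privilege */
    uint64_t check;         /* keyed check field            */
};

static const uint64_t OM_SECRET = 0x5deece66d1234567ULL;   /* known only to the OM */

static uint64_t toy_mac(uint32_t id, uint32_t rights)
{
    uint64_t h = OM_SECRET;
    h ^= id;     h *= 0x100000001b3ULL;   /* toy mixing, for illustration only */
    h ^= rights; h *= 0x100000001b3ULL;
    return h;
}

struct capability om_issue(uint32_t id, uint32_t rights)
{
    struct capability c = { id, rights, toy_mac(id, rights) };
    return c;
}

/* Any change to object_id or rights by the holder invalidates the check. */
int om_validate(const struct capability *c)
{
    return c->check == toy_mac(c->object_id, c->rights);
}
```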
The distinct advantage of software capabilities, namely their independence from the underlying hardware, also turns out to be their major drawback. Every operation opk on an object requires a costly, time-consuming system call to invoke the OM to verify the access privilege in the C-list. Moreover, prevention of tampering requires validation of a capability every time before use, which also causes substantial time overhead. All these requirements together lead to appreciable overhead, resulting in significant degradation of overall system performance.
More on this topic is given on the Support Material at www.routledge.com/9781032467238.
Although capabilities are very efficient, implementing them in hardware or software incurs a high cost. Apart from that, the use of capabilities faces some other practical difficulties when implemented. Three such difficulties, out of many, are:
Details on each of the bullet points are described on the Support Material at www.routledge.
com/9781032467238.
FIGURE 8.4.a A schematic layout of ring-type protection domains used in building up the protection mechanism of a generic modern operating system.
The notion of a protection domain addresses the privacy aspect. Figure 8.4a(i) is a pictorial visu-
alization of two protection domains in the form of rings. The inner domain represents programs that
are executed in the supervisor mode in the context of protection; the process is said to execute in
the supervisor domain. The outer ring is the user domain in which the programs are executed in the
user mode. Obviously, programs operating in the supervisor domain enjoy additional access rights
compared to programs operating in the user domain.
The generalization of this two-level scheme is a set of N concentric rings, called a ring
architecture for protection, as shown in Figure 8.4a(ii). This ring architecture was first introduced
in MULTICS (the predecessor of UNIX), which provided 64 such protection domains
organized as concentric rings in a way similar to that shown in Figure 8.4a(ii). Under this scheme,
out of N rings of protection, rings R0 through RS support the operating system domain, and rings
RS+1 through RN-1 are used by applications. Thus, i < j means that Ri has more rights than Rj; in other
words, the farther in the ring, the more privileges. The most critical part of the kernel in terms of
protection executes in R0. The next most secure level of the OS executes in R1, and so on. The most
secure level of user programs executes in ring RS+1, with successively less secure software executing
in outer rings. The hardware supervisor mode in this model would ordinarily be used only when software
executes in the lowest-numbered rings, perhaps only in R0 (as was the case in MULTICS). This part
of the OS must be designed and implemented most carefully and should preferably be proved correct.
A protection domain is conceptually an "execution environment". Software located in a file
that executes in a ring is assigned to that ring. A process operates in a protection domain. Access
privileges are granted to a protection domain rather than to a user. By default, the initial execution
domain of a process does not possess any access privileges. Thus, a process cannot access any
resources while it executes in this domain, even if the resources are owned by the user who initiated
the process. In order to access any specific resource, the process must "enter" a protection domain
that possesses access privileges for that resource. This means that a process may switch between different
protection domains in the course of its execution, and the protection mechanism provides a
means by which a process can safely change domains, that is, cross rings. If a process executes
a file in Ri, then the same process can call any procedure in Rk (k ≥ i) without special permission,
since that call represents a call to a lower (less privileged) protection domain.
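The ring-crossing rule just described can be summarized in a few lines of Python. The sketch below only illustrates the policy; the function names and the idea of a per-ring list of permitted entry gates are assumptions, not MULTICS's actual interface.

    # Illustrative ring-crossing check: outward calls are free, inward calls
    # must pass through an authorized entry point ("gate") and be validated.

    GATES = {0: {"sys_entry"}, 1: {"fs_entry", "io_entry"}}   # hypothetical gates

    def may_call(current_ring, target_ring, entry_point=None):
        if target_ring >= current_ring:
            return True                  # call to an equal or less privileged ring
        # Attempted crossing into an inner ring: validate against the gate list.
        return entry_point in GATES.get(target_ring, set())

    print(may_call(3, 3))                # True  (same ring)
    print(may_call(1, 3))                # True  (outward call)
    print(may_call(3, 1, "io_entry"))    # True  (inward call via a valid gate)
    print(may_call(3, 0, "anything"))    # False (illegal entry request)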
The operating system (kernel) provides system calls through which a process may issue a request
for entry into an inner protection domain. Each attempted crossing of an inner ring causes an internal
authorization mechanism to validate the respective procedure call. A set of conditions would be
defined for the legality of such an entry request. The kernel would apply the conditions and either
honor the request for a change of protection domain or abort the process for making an illegal request.
FIGURE 8.4.b An example, with a representative format, of protection domains created for the different
activities of a user by a generic modern operating system.
Domains themselves, in general, need not be static; their elements can change as objects are deleted
or created and access rights are modified. Domains may even overlap; a single object can participate
in multiple domains, possibly with different access rights defined in each.
Figure 8.4(b) shows three protection domains, D1, D2, and D3, for the different objects (files)
mentioned. Domains D1 and D2 overlap over the object "accounts", while domain D3 is disjoint
from both of them. Assume that a user U1 executes three computations, leave, salary, and job,
which run in domains D1, D2, and D3, respectively. Thus, salary can access only the file accounts and can
only read it. Now consider an OS that does not use protection domains. User U1 would then need read and
write access rights to the files personnel, accounts, inventory, and mails and read access rights to the
file project. When user U1 executes the program salary that is owned by user U2, the program salary
would be able to modify many files accessible to user U1, which is clearly undesirable.
This example demonstrates a protection arrangement involving protection domains
that facilitates implementation of the need-to-know principle with a fine granularity of protection:
only processes that need to access a resource are granted access to it. It also illustrates how
this approach provides privacy of information and thereby improves data integrity and reliability.
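The need-to-know arrangement of Figure 8.4(b) can be modelled directly as a mapping from domains to access rights. The Python sketch below is only a toy model of that figure; the exact rights assigned to D1 and D3 are assumptions where the text does not spell them out.

    # Toy model of Figure 8.4(b): each domain lists the objects it may touch.
    domains = {
        "D1": {"personnel": {"r", "w"}, "accounts": {"r", "w"}},  # leave
        "D2": {"accounts": {"r"}},                                # salary
        "D3": {"inventory": {"r", "w"}, "project": {"r"},
               "mails": {"r", "w"}},                              # job
    }

    def allowed(domain, obj, right):
        return right in domains.get(domain, {}).get(obj, set())

    # salary runs in D2, so it can read accounts but modify nothing else.
    print(allowed("D2", "accounts", "r"))    # True
    print(allowed("D2", "accounts", "w"))    # False
    print(allowed("D2", "personnel", "r"))   # False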
The generalized ring structure does not need to support inner-ring data accesses; rather, it requires
only procedure calls. Data kept in inner rings can then be accessed using a corresponding inner-ring
access procedure, similar to the way an abstract data type allows references to its fields only through
public interfaces.
Ring structures are also applied in hardware in contemporary computer architecture. In the
Intel 80386 microprocessor, for example, a four-level structure is incorporated that exhibits some similarities
to the one described here. In the Intel case, there were three levels of instruction sets. Level 2
and 3 instructions were the normal application program instruction sets, although non-critical portions
of the OS code were also assumed to execute at level 2. Level 1 instructions included I/O instructions.
Level 0 instructions manipulated segmented memory using a system global descriptor table and performed
context switching. This architecture and its successors, such as the 80486 and Pentium microprocessors,
support memory segment management at level 0, while I/O operations execute
at a less privileged level, that is, a higher ring number. The main body of the OS, however, operates
at level 2, where its segments are appropriately protected by the ring structure.
access the object, provided that the key matches the related lock. The owner of the object can also
revoke the access rights of all processes that share the key Ki by simply deleting lock entry Li. This
method has a close resemblance to the storage keys introduced in the IBM 360 systems.
8.10 INTRUDERS
One of the two most commonly observed threats to security is the intruder, and the other is, of
course, the virus. In security literature, people who are nosing around places where they have no
business are called intruders or sometimes adversaries, or they are referred to as hackers or crack-
ers. Intruders act in two different ways. Passive intruders just want to read files they are not authorized
to read. Active intruders are more dangerous; they want to make unauthorized modifications
that may lead to fatal consequences, disrupting the entire system. The main objective of the intruder,
in general, is to somehow gain access to a system or to increase the range of privileges accessible on
a system. Intruders of different classes with different natures and characteristics have been found in
practice. Some common categories are:
• Nontechnical users: Some people have the bad habit of reading other users' files and documents
simply out of curiosity, without any definite reason, if no barriers stand in the way.
Some operating systems, in particular most UNIX systems, have the default that all newly
created files are publicly readable, which indirectly encourages this practice.
Such intruders are mostly insiders.
• Snooping by insiders: A relatively skilled legitimate user often takes it as a personal challenge,
with no specific intention, to exercise their expertise just to break the security of
a local system in order to access data, programs, or resources when such accesses
are not authorized.
• Misusers: A legitimate user may access data, programs, or resources for which access
is not authorized, or may be authorized for such accesses but misuse those rights and
privileges. Such a user is generally an insider.
• Masqueraders: An individual, or sometimes a group, who is not authorized to use the computer
system but still attempts to penetrate a system's access controls in order to intentionally
acquire a legitimate user's account. A masquerader is likely to be an outsider.
• Clandestine user: An individual who steals or by some means seizes supervisory control
of the system and uses this control to evade or suppress authentication or authorization.
This category of user can be either an insider or an outsider.
8.11.1 PASSWORDS
The most common form of authentication in wide use, based on the sharing of a secret, is the
user password, possibly initially assigned by the system or an administrator. Typically, a system
must maintain a file or a password table as the authentication database, in which each entry has the
form (login id, <password-info>) for each legitimate user to be identified at the time of login. Many systems
also allow users to subsequently change their passwords at will. Since password authentication
offers limited protection and is easy to defeat, the password table (file) should be kept protected,
and that can be done in one of two ways:
• Access Control: Access to the password file is limited to one or a very few accounts.
• One-Way Encryption: The system stores only an encrypted form of the users' passwords
in the file. When a user presents a password at the time of login, the system immediately
encrypts the password by using it as input to a one-way (not reversible) transformation
that generates a fixed-length output, which is then
compared with the stored value in the password file (a minimal sketch of this check is shown just after this list).
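The following Python sketch illustrates the one-way scheme; the salted SHA-256 construction and the table layout are illustrative assumptions, not the exact algorithm any particular UNIX variant uses.

    import hashlib, hmac, os

    password_table = {}                  # login id -> (salt, one-way hash)

    def enroll(login, password):
        salt = os.urandom(16)
        digest = hashlib.sha256(salt + password.encode()).hexdigest()
        password_table[login] = (salt, digest)   # only the one-way form is stored

    def login_check(login, password):
        salt, stored = password_table.get(login, (b"", ""))
        digest = hashlib.sha256(salt + password.encode()).hexdigest()
        return hmac.compare_digest(digest, stored)

    enroll("alice", "s3cret")
    print(login_check("alice", "s3cret"))    # True
    print(login_check("alice", "guess"))     # False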
If one or both of these countermeasures are applied, then password cracking will obviously not
be so easy, and some extra effort will still be needed even by a skilled intruder
to obtain passwords. However, the intruder may still launch a variety of ingenious attacks, mainly
based on an unlimited license to guess, using numerous techniques to breach security or learn
passwords. Sometimes the use of a Trojan horse (described later) to bypass access restrictions,
or wire-tapping of the line between a remote user and the host system, can also help the intruder reach the
desired target.
These and other commonly used, purely password-related schemes possess certain merits, but the
password itself always remains a soft target facing all types of threats that attempt to defeat the
security system. It is therefore wise to concentrate on how to protect passwords so that attacks
launched on passwords, or any other form of intrusion, can be prevented. Still, all possible measures to
restrict intrusion can fail, so the system must have a second line of defense, intrusion detection,
so that appropriate measures can be taken. Detection is concerned with determining
the nature and type of an attack, either before or after its success.
Prevention, as a whole, is a challenging aspect and an uphill battle at all
times and, in fact, becomes a particularly demanding security goal when considered in the context of the protection
mechanisms offered to users. The problem is that defenders are always at the receiving end and
must attempt to foil every possible attack organized by the offenders,
whereas the attacker is free to play, always trying to find the weakest link in the chain of defense and
improvising different strategies to mount attacks.
More about this topic is given on the Support Material at www.routledge.com/9781032467238.
• Password aging: This requires or encourages users to regularly change their passwords in order to
create an obstacle to intruder attacks and keep passwords relatively secure. An alternative
may be to use "good" passwords and advise users to keep them unchanged; the amount
of time and effort that intruders must then expend to break them will frustrate
them and make the attack mostly infeasible.
• Encryption of passwords: Storing an encrypted form of each password, known as ciphertext, is a
standard technique for protecting the password file (authentication database), which is stored in a system
file and is usually visible to all users in the system. An intruder in the guise of a registered
user (an insider) could then mount a chosen-plaintext attack by changing their own password
repeatedly, perhaps creating thousands of possible passwords, and then analyzing the
encrypted forms with little resource consumption. An outsider, on the other hand, would
have to use an exhaustive attack by trying to log in with many different passwords.
But if the encrypted password file is invisible to anybody other than the root, any
intruder within or outside the system would have to launch an exhaustive attack through
repeated login attempts. This strategy, although seemingly attractive and reasonable, suffers
from several flaws, as pointed out by many researchers. A few of them are: any
accidental failure of protection might expose the password file and render it readable, thereby
compromising all the user accounts. Second, it has been observed that some users on one
machine have accounts on other machines in other protection domains and often use
the same password. Thus, if the password is read by anyone on one machine, a machine in
another location in another protection domain could easily be accessed.
Both UNIX and Linux encrypt passwords. UNIX uses DES encryption (discussed
later), whereas Linux uses a message digest (MD) technique, which is simply a one-way
hash function that generates a 128- or 160-bit hash value from a password. This technique
has several variants, called MD2, MD4, and MD5; Linux uses MD5. In addition, both UNIX and
Linux provide a shadow password file option. When this option is used, the ciphertext form
of passwords is stored in a shadow file that is accessible only to the root. This arrangement forces
an intruder to go through an exhaustive attack, which is not only
expensive but also time-consuming.
Thus, it can be concluded that a more effective strategy is to always require users to select
good passwords that are difficult to guess. Several password selection strategies
exist, and the ultimate objective of all of them is to eliminate guessable passwords while at the
same time allowing the user to select a password that is memorable.
• Other Methods: One approach is to change the password on every login by using one-time
passwords (OTPs), which are provided by the system to the user as a book containing
a list of passwords. Each login uses the next password in the list. If an intruder
ever discovers a password, it will not serve their purpose, since a different
password must be used the next time. The user, however, must be very careful with the password book
and keep it absolutely secret.
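A minimal illustration of the one-time-password idea follows. The list-based scheme shown is a toy; real OTP systems such as S/KEY derive the list from a hash chain rather than storing it verbatim.

    # Toy one-time-password list: each password is valid exactly once.
    otp_list = ["red-42", "blue-17", "green-08"]    # the shared printed list
    next_index = 0

    def otp_login(presented):
        global next_index
        if next_index < len(otp_list) and presented == otp_list[next_index]:
            next_index += 1              # this password can never be reused
            return True
        return False

    print(otp_login("red-42"))   # True
    print(otp_login("red-42"))   # False: already consumed, an intruder gains nothing
    print(otp_login("blue-17"))  # True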
Another method is based on a variation of the password idea in which each new
user provides a long list of questions and answers that are stored in the computer in encrypted
form. The questions should be chosen so that the user does not need to write the answers down; in
other words, they should be things no one forgets. Typical questions look like: 1. Who is Kamal's
brother? 2. On what street was your maternal uncle's house? 3. What did Mr. Paul teach in your
secondary school? At the time of login, the computer asks one of these at random and checks the
response against the stored answer.
Still another variation of the password idea is known as challenge-response. Here the system
issues a dynamic challenge to the user after login. When signing up, the user picks an algorithm
to be applied as a secret transformation, such as x × 2 or x + 3.
When the user logs in, the computer types a random number as an argument, say 5, in which case
the user types 10 or 8, or whatever the correct answer is. Failure to do so may be used to detect unauthorized
users. The algorithm can be different in the morning and afternoon, on different days of the week,
even from different terminals, and so on.
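The challenge-response exchange can be sketched in a few lines of Python; the secret transformation x × 2 used below is just the text's example, and the function names are made up for the illustration.

    import random

    def secret_transform(x):
        return x * 2                      # the user's chosen secret algorithm

    def issue_challenge():
        return random.randint(1, 100)     # dynamic challenge from the system

    def verify(challenge, response):
        return response == secret_transform(challenge)

    c = issue_challenge()
    print(verify(c, c * 2))               # True: the legitimate user knows the rule
    print(verify(c, c + 1))               # False: an intruder guessing fails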
8.11.3 BIOMETRICS
There exists another major group of authentication mechanisms based on unique
characteristics of each user that are hard to forge. Some user characteristics are so naturally unique
and so completely the user's own that they can be exploited to realize a protection mechanism in the form
of biometric techniques. These user characteristics fall into two basic categories: physiological characteristics
(such as fingerprints, retinal patterns, and facial features) and behavioral characteristics (such as signature
dynamics and keystroke rhythm).
Many other methods can be cited that could provide foolproof identification, for example, urinalysis,
exploiting the way dogs, cats, and other animals mark their territory by urinating around its perimeter.
In our case, each terminal could be equipped with a suitable device along with a sign: "For login,
please deposit your sample here". This might be an absolutely unbreakable system, but it would
probably give rise to fairly serious objections from the user end. After all, whatever authentication
scheme is employed, it must be psychologically acceptable to the user community.
Behavioral characteristics, in general, can vary with a user's physical and mental state and thus
may be susceptible to higher false acceptance and rejection rates. For example, signature patterns
and keystroke rates largely depend on, and may vary with, user stress level and fatigue.
However, detection devices used as attachments to the computer system should usually be
self-contained, easily pluggable into the existing system, and independent of the computer system,
which definitely improves the potential for tamper-proofing. The distinct advantages of biometric
authentication lie in its increased accuracy in the process of authentication and a corresponding reduction
FIGURE 8.5 A schematic block–wise representation of the taxonomy of generic malicious programs used to
launch threats on modern operating systems.
door is basically code that recognizes a special sequence of input, or is triggered by being run from
a certain user-id or by an unlikely sequence of events, in order to activate the program or different
parts of the program.
A trap door can be abused by anyone who is already aware of it or acquires knowledge of its
existence and the related entry procedure. The trap door attack is so potent that it defeated the
most strongly secured system of its day, MULTICS, equipped with 64 hierarchically organized
protection domains, numbered from the innermost to the outermost, each with a set of
specific access privileges, including access privileges to all higher-numbered domains. An Air
Force "tiger team" (simulating intruders) launched an attack on MULTICS through a trap door so
skillfully that the MULTICS developers could not detect it, even after they were later informed
of its presence.
Trap door attacks are dangerous, and it is extremely hard to prevent them even by implementing
an adequate security system using operating-system controls. It is thus suggested
that software developers put more emphasis on implementing appropriate security measures at the
time of system design and development and/or update activities.
More about this topic is given on the Support Material at www.routledge.com/9781032467238.
same intent. Many other examples can be cited to illustrate the various types of damaging functions
performed by different types of Trojan horse programs.
More advanced versions of sophisticated Trojan horse programs can make themselves even
harder to detect by fully emulating the utility that they are meant to impersonate, with the additional
provision of creating many types of damage, such as forwarding data of interest to a perpetrator,
quietly erasing the hard disk or deleting the user's files (a severe violation of the integrity
requirement), or forcing the system to crash or slow down, which amounts to denial of service. Another
typical example of Trojan horse activity is a spoof login program that presents a fake login prompt
to fool a user into revealing password information.
One of the more difficult-to-detect Trojan horse programs is a compiler that modifies certain
programs, such as the system login program, by injecting additional code into them when they are compiled.
This code creates a trap door in the login program that permits the Trojan horse's creator to
log on to the system using a special password. This Trojan horse can never be discovered by reading
the source code of the login program. A Trojan horse was even implanted in a graphics routine
offered on an electronic bulletin board system. Finally, it is worth mentioning that, since a
Trojan horse is loaded explicitly by a user, its authorship or origin cannot be completely concealed;
hence it is not difficult to track.
8.12.4 VIRUSES
The most well-known kind of malware is the virus, which, in recent years, has become a significant
concern for the software industry, particularly because of the evolution of two aspects of computing.
First, compact disks (CDs) and pen drives were widely circulated among personal computer
users and users of distributed computing environments, including client-server systems. These
devices are an ideal carrier for a virus, since the recipient mounts the device and then runs its
programs. Second, the emergence of the internet and its wide use make it a prolific breeding
ground for viruses, particularly because it offers a broad variety of mail, Web pages, newsgroups,
and free software.
Basically, a virus is a piece of code (it behaves like a parasite) that can attach itself to other
programs in the system and also spread to other systems to “infect” them when virally affected
programs are copied or transferred. Analogous to its biological counterpart, a computer virus also
carries in its instructional code the recipe for making many perfect copies of it. In a network envi-
ronment, when a host computer is logged in, the typical virus takes temporary control of the com-
puter’s operating system. After that, whenever an uninfected piece of software comes into contact
with the virally affected computer, a fresh copy of the virus passes into the new incoming program.
In this way, infections are spread from program to program, from one computer to another, without
giving any indication to the users who are dealing with virally affected systems or programs over
network. The ability to share data, applications, and system services on other computers as provided
in a cluster of interconnected computers establishes a perfect culture for the spread of a virus.
A virus can do anything that other programs do. The only difference is that it attaches itself to
another program and executes secretly when the host program is run. While attaching itself to a
program, a virus puts its own address as the execution start address of the program. This way it gets
control when the program is activated and infects other programs on a disk in the system by attach-
ing itself to them. After that, it transfers control to the actual program for execution. The infection
step usually does not consume much CPU time; hence, a user receives no indication and practically
has no way of knowing beforehand that the program being executed carries a virus. Its presence
is felt only after its success. In fact, the way a virus attaches itself to another program makes it far
more difficult to track than a Trojan horse.
Most viruses perform their tasks by exploiting the support and services offered by the underlying
operating system and are often specific to a particular operating system. In some cases, they
carry out their work in a manner that is also specific to a particular hardware platform. Thus, they
are designed mostly with the specific operating environment as a whole in view, so that the
details and weaknesses of particular systems can be deliberately exploited.
Once a virus begins execution, it can perform any function that is permitted by the privileges of
the current user. During its lifetime, a typical virus usually goes through the following four stages.
• Dormant state: The virus is idle and will be activated only when a certain event occurs.
Not all viruses have this state.
• Propagation state: Each infected program contains a clone of the virus, which places an
identical copy of itself into other uninfected programs; each such copy thereby itself enters a
propagation state.
• Triggering state: The triggering state is reached when a certain event occurs; it can be
caused by a variety of system events, including a count of the number of
times that this copy of the virus has already made copies of itself.
• Execution state: The virus now performs its intended function, which is usually harmful,
damaging, and destructive.
• Types of Viruses
New types of viruses, each with its own inherent advantages, emerge very often; viruses are always at the
offending end, whereas antivirus software is always at the defending end and attempts only to foil
all possible attacks organized by the offenders. Hence, there is a war between virus and
antivirus, and an arms race has been continuously going on between virus creators and antivirus
developers. Antivirus software developed using numerous techniques is now quite mature
and able to counter almost all existing types of viruses, which is why more and more new types of viruses
with novel characteristics are continuously being developed and introduced to outmaneuver
existing antivirus software. Many types of viruses presently exist and have been classified;
the following are the most significant:
• Parasitic virus: This is perhaps the most common and traditional type of virus. It
attaches itself to executable files and replicates itself, attempting to infect other
uninfected executable files, when the infected host program is executed.
• Boot sector virus: This virus implants itself in the boot sector of a disk device and
infects a master boot record. It gets an opportunity to execute when the system is booted
and then spreads from the disk containing the virus. Similarly, it gets an opportunity to
replicate when a new disk is made.
• Memory-resident virus: This virus lodges in main memory as part of a resident sys-
tem program and starts infecting other programs whenever a program is brought into
memory for execution.
• Stealth virus: This is a form of virus explicitly designed to hide itself from detection
by antivirus software. A common example of a stealth virus is one that uses
compression techniques so that its presence in an infected program cannot be detected,
since the lengths of the infected program and its uninfected counterpart are the
same. Far more sophisticated techniques can also be used. For example, a virus can place
intercept logic in disk I/O routines so that when the antivirus software attempts to read a suspected
portion of the disk, the virus presents the original uninfected program to foil
the attempt. Strictly speaking, then, stealth is not a type of
virus as such; rather, it is a technique used by a virus at the time of its
creation to evade detection.
• Polymorphic virus: A polymorphic virus is one that mutates with every infection. This
makes identification of the virus by its "signature" during detection almost impossible.
When replicating, this virus creates copies that are functionally equivalent but contain
distinctly different bit patterns. The ultimate target is to evade detection or defeat the
actions taken by antivirus programs; the "signature" of the virus varies
with each copy. To realize this variation, the virus may randomly insert superfluous instructions
or change the order of independent instructions in its own program. A far more
effective approach is to use encryption. A portion of the virus, generally called a
mutation engine, creates a random encryption key to encrypt the remaining portion of
the virus. The key is stored with the virus, and the mutation engine itself is altered during
replication. When an infected program is invoked, the virus uses the stored random
key to decrypt itself. When the virus replicates, a different random encryption key
is selected. As with stealth, this term does not describe a type of virus as such;
rather, it is a technique used by a virus at the time of its creation to
evade detection.
Virus writers often use available virus-creation toolkits, which enable beginners to
quickly create a number of different viruses. Although the products are not as sophisticated
as viruses developed from scratch using innovative schemes, success
encourages beginners to stay with this practice, and they soon become experts. Virus writers
sometimes use another tool, the virus exchange bulletin board, which offers copies of viruses as
well as valuable tips for the creation of more intelligent viruses, all of which can be directly
downloaded. A number of such boards exist in the United States and other countries.
• E-Mail Viruses: One of the later developments in the world of malicious software is
the e-mail virus. The first rapidly spreading e-mail viruses, such as Melissa, made use
of a Microsoft Word macro embedded in a mail attachment. When the e-mail attachment
is opened by a recipient, the Word macro is activated, and the following actions are then
started:
• The e-mail virus present in the attachment sends itself to everyone on the mailing list
in the user's e-mail package.
• The virus starts doing local damage.
A more powerful version of the e-mail virus emerged in late 1999, using the Visual Basic scripting
language supported by the e-mail package. This newer version is activated when the e-mail
containing the virus is merely opened, rather than waiting for the recipient to open an attachment.
The emergence of e-mail viruses opened a new generation of malware that exploits features of existing
e-mail software and replicates itself across the internet. Such a virus begins propagating
as soon as it is activated (whether by opening an e-mail attachment or simply by opening
the e-mail itself) to all of the e-mail addresses known to the infected host. It
accomplishes this task within hours, whereas earlier viruses used to take months or years to propagate.
Consequently, it is becoming very difficult for antivirus software to respond before much damage is
done. A greater degree of security measures is hence urgently needed and must be embedded into
internet utility and application software to counter these constantly evolving intelligent
threats.
• Antivirus
Propagation of viruses and their subsequent attacks on the system cannot be entirely stopped; even
preventing them from getting into the system at all is impossible to achieve. Hence, to protect the system
from virus threats, countermeasures are required. One possible solution is to prevent
viruses from entering the system with best-effort defenses and, if the system is nevertheless found to be
virally infected despite this effort, to destroy the virus immediately before any major damage is done.
To accomplish this, the following actions are to be taken in order to eventually bring the system back to normalcy.
Detection: If the system is suspected to be infected, confirm that the infection has really occurred and locate
the virus.
Identification: Once detection has succeeded, identify the specific virus that caused
the infection.
Removal: Once the specific virus has been identified, take all possible measures to remove
all traces of the virus from infected programs and restore them to their original state.
Finally, remove the virus from all infected systems to prevent further propagation.
The removal process destroys the virus so that it cannot spread again.
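A very simplified illustration of the detection step is signature scanning, sketched below; the byte patterns and file names are made up for the example, and real scanners combine this with heuristics, generic decryption, and the other techniques mentioned later.

    # Toy signature scanner: flag files containing a known virus byte pattern.
    SIGNATURES = {
        "demo-virus-a": b"\xde\xad\xbe\xef",     # hypothetical signatures
        "demo-virus-b": b"EVIL_PAYLOAD",
    }

    def scan_file(path):
        with open(path, "rb") as f:
            data = f.read()
        return [name for name, sig in SIGNATURES.items() if sig in data]

    # Example: create a dummy "infected" file and scan it.
    with open("sample.bin", "wb") as f:
        f.write(b"ordinary data" + b"EVIL_PAYLOAD" + b"more data")
    print(scan_file("sample.bin"))               # ['demo-virus-b']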
In situations where a virus is detected but cannot be identified or removed at that moment, one
relatively safe alternative is to discard the infected program and use a clean backup copy of it.
This approach, however, gives no guarantee that the system or the other programs
in it will remain safe and unaffected thereafter.
The declared war between virus and antivirus continues and becomes more virulent as the technology,
tricks, and techniques used on both sides gradually mature. Advances and innovations
appear at a high pace, and there is no indication of any ceasefire; rather, each side always
attempts to dominate the other. Early viruses, and the early versions of later ones, were relatively
simple code fragments that were comparatively easy to identify and were purged with
relatively simple antivirus software. As the virus arms race began, both camps organized themselves
and started to develop more advanced, complex, and sophisticated products. Antivirus products
with increasingly sophisticated approaches continue to appear. Two of the most important are
generic decryption and the digital immune system developed by IBM. A detailed description of them is
beyond the scope of this book; interested readers are advised to go through the respective write-ups
to acquire a clear understanding of these two approaches.
8.12.5 WORMS
Worms are closely related to viruses but are distinguished from them because a worm is a
free-standing, actively penetrating entity. A worm is a program which may enter the machine as a file,
but it begins its execution on its own and replicates itself, spreading to other computer systems by
exploiting holes in their security setup. Once a file containing a worm has been placed in the file
system, the worm finds a loophole in the process manager in order to execute itself. For example,
one well-known worm program, Morris's worm, was developed to penetrate UNIX systems by
taking advantage of the finger command.
Network worms normally spread by using network connections to transmit their copies
to other computers. Once active within a system, the effects of a worm can be the same as
those of a virus. It could implant Trojan horse programs or perform any type of unwanted and
destructive actions. Worms are known to replicate at unimaginably high rates, thereby creat-
ing congestion in the network and consuming appreciable CPU time during replication. While
replicating itself, a network worm makes use of the facilities and support that the network provides.
Some examples are:
• E-mail facility: A worm often mails a copy of itself to another known system.
• Remote login capability: A worm logs onto a remote system as a legitimate user, then
uses commands to copy itself from one system to another.
• Remote execution capability: A worm on its own can execute a copy of itself on another
system.
The characteristics of a network worm are similar to those of a computer virus. It also goes
through four stages during its lifetime: a dormant stage, a propagation stage, a triggering
stage, and an execution stage. During its journey through different stages, it performs stage-
specific functions. For example, the propagation phase performs the following functions in
general:
• Search for other systems to infect by inspecting host tables or similar other repositories of
remote system addresses.
• Establish a connection with a remote system obtained from such a search.
• Determine whether the connected system has already been infected before copying itself
to the system. If so, the copy operation is abandoned. If not, copy itself to the remote sys-
tem and perform related actions to run the copy.
The behavior of a worm can be the same as that of a virus. Indeed, the distinction between a worm
and a virus cannot always be clearly drawn from their activity, and some malware uses both
methods to spread. Due to its self-replicating nature, a worm is even more difficult to track than a
virus. However, properly designed security measures implanted in both network systems and single-computer
systems can reasonably minimize the threats caused by worms.
8.12.6 ZOMBIES
A zombie is a program that secretly takes over another computer connected to the internet and
then uses that computer to launch various types of attacks that are difficult to trace. Zombies are
mostly used in denial-of-service attacks, typically against targeted web sites. The zombie is normally
planted on hundreds of computers belonging to innocent (unsuspecting) third parties and is then used
to overwhelm the target web site by launching a devastating volume of internet traffic.
There are still other forms of malware, such as spyware, which is mostly related to web browsing
and mainly operates between web browsers and web servers.
Apart from using malware to disrupt or destroy the environment, it is often created to make a
profit. Malware in a for-profit scheme installs a key logger on an infected computer. A key logger
records everything typed at the keyboard. It is then not too difficult to filter these data and extract
the needed information, such as username-password combinations, credit card numbers and expiration
dates, and similar items. This information can then be supplied to a master, where it can
be used or sold for criminal activities.
8.13 ENCRYPTION
Encryption is essentially a technique for protecting all automated network and computer
applications and related data. File management systems often use it to guard information
related to users and their resources. The branch of science that deals with encryption is called
cryptography.
Encryption is the application of an algorithmic transformation Ek to data, where E is the
encryption algorithm with a secret key k as input, called an encryption key. The original form
of the data on which encryption is carried out is called plaintext, and the transformed data are
called encrypted data or ciphertext. The encrypted data are recovered by applying a
transformation Dk′, where k′ is a decryption key. Dk′ is essentially the encryption algorithm
with the secret key k′ run in reverse: it takes the ciphertext and the secret key k′ as input and
produces the original plaintext. A scheme that uses k = k′ is called symmetric encryption or
conventional encryption, and one using k ≠ k′ is called asymmetric encryption or public-key
encryption.
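As a concrete (and deliberately weak) illustration of the symmetric case k = k′, the Python sketch below uses a repeating-key XOR transformation, so the same function serves as both Ek and Dk. This is a toy for exposition only, not a recommendation of XOR as a cipher.

    def xor_transform(data: bytes, key: bytes) -> bytes:
        # Acts as both E_k and D_k because XOR is its own inverse.
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

    key = b"k"                       # symmetric: the same secret key both ways
    plaintext = b"attack at dawn"
    ciphertext = xor_transform(plaintext, key)      # E_k(plaintext)
    recovered = xor_transform(ciphertext, key)      # D_k(ciphertext)

    print(ciphertext.hex())
    print(recovered == plaintext)    # True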
• Ciphertext-only attack: While attempting to guess Dk, an intruder relies on the nature
of the algorithm Ek, which is already known to them, and perhaps on some knowledge of the general
characteristics of the plaintext, such as the frequency with which each letter of the alphabet
appears in English text. If it is known that Ek replaces each letter in a plaintext with
another letter of the alphabet (a method called a substitution cipher), an intruder may use
this information to guess Dk (a small sketch of such frequency analysis follows this list). A ciphertext-only attack is, in essence, more efficient than
its counterpart, an exhaustive attack, if a characteristic feature of the plaintext can be successfully
identified.
• Known-plaintext attack: An intruder here knows the plaintext corresponding to a ciphertext.
This attack can be launched if the intruder can gain a position within the OS,
which is not very difficult to occupy, from which both a plaintext and its corresponding
ciphertext can be observed. In this way, an intruder can collect a sufficient number of
(plaintext, ciphertext) pairs, which may make determining Dk easier.
• Chosen-plaintext attack: An intruder here is able to supply the plaintext and examine its
encrypted form. This possibility helps the intruder systematically build a collection of
(plaintext, ciphertext) pairs that eventually lets them arrive at an approximate guess and
thereby make repeated refinements to that guess to determine an exact Dk.
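The frequency-analysis idea behind a ciphertext-only attack on a substitution cipher can be illustrated as follows; a Caesar-style shift cipher is used only because it is the simplest substitution cipher to demonstrate, and the helper names are made up.

    from collections import Counter
    import string

    def shift_encrypt(text, shift):
        # A trivial substitution cipher: shift every lowercase letter.
        return "".join(chr((ord(c) - 97 + shift) % 26 + 97) if c.islower() else c
                       for c in text)

    def guess_shift(ciphertext):
        # Assume the most frequent ciphertext letter corresponds to 'e'.
        letters = [c for c in ciphertext if c in string.ascii_lowercase]
        most_common = Counter(letters).most_common(1)[0][0]
        return (ord(most_common) - ord("e")) % 26

    msg = "the enemy attacks at dawn and retreats at dusk before the evening"
    ct = shift_encrypt(msg, 7)
    print(guess_shift(ct))           # 7, recovered from letter frequencies alone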
In summary, all these types of attacks exploit the characteristics of the algorithm Ek in
attempting to deduce a specific plaintext or the key being used. If the attack succeeds in deducing
the key, the result is catastrophic: all future and past messages encrypted with that key can then be
easily read.
The quality of encryption in thwarting attacks is believed to improve with the use of a greater
number of bits in the key k. For example, the use of a 56-bit key in an encryption scheme requires about 2^55 trials
at guessing Dk to break it. This large number of trials was believed to make such a scheme computationally
secure against exhaustive attacks. However, exploiting powerful mathematical techniques such as
differential analysis can make guessing Dk much easier than an exhaustive attack.
In addition, with today's computers using massively parallel organizations of microprocessors,
it is now possible to achieve processing rates many orders of magnitude greater. Combining
all such tools and support, the performance of such attempts reaches a level at which an encryption function
Ek with a 56-bit key can no longer be considered computationally secure.
More details on this topic are given on the Support Material at www.routledge.com/9781032467238.
• Block cipher: This strategy is essentially an extension of the classical substitution cipher.
A block cipher processes plaintext input in fixed-size blocks with a key k and produces
a block of ciphertext of equal size for each plaintext block. The produced blocks are
then assembled to obtain the ciphertext. The block cipher strategy is easy to handle and
simple to implement. While it introduces some confusion, it does not introduce sufficient
diffusion, so identical blocks in a plaintext produce identical blocks in the ciphertext. This
feature is a weak point of the approach that makes it vulnerable to attacks based on
frequency analysis and to known- or chosen-plaintext attacks. The chances of such attacks
succeeding can be reduced by using a relatively large block size.
• Stream cipher: A stream cipher treats the plaintext as well as the encryption key as streams
of bits. Encryption is performed using a transformation that involves a few bits of the plaintext
and an equal number of bits of the encryption key. The transformation may be
of various types, but a popular choice is a bit-by-bit transformation of the plaintext,
typically performed as an Exclusive-OR of a bit of the plaintext and a bit
of the encryption key.
A stream cipher is operationally faster than a block cipher. When a bit-by-bit transformation
is used, it provides neither confusion nor diffusion. However, among the many variants
of stream ciphers, the ciphertext autokey cipher, an asynchronous (self-synchronizing)
stream cipher, does introduce diffusion. It employs a key-stream generator that
uses a function of the key stream and the last few bits of the ciphertext generated
so far. In practice, the key stream is used to encrypt the first few bits of the plaintext; the
ciphertext generated for these bits is then used as the key
stream to encrypt the next few bits of the plaintext, and this process continues until the entire
plaintext is encrypted. In this way, diffusion is obtained, since a substring of the plaintext
always influences the encryption of the rest of the plaintext.
• The Data Encryption Standard: DES, essentially a block cipher developed by IBM and adopted in
1977, was for a long time the dominant encryption algorithm. It uses a 56-bit key to encrypt 64-bit data
blocks and, being a block cipher, possesses poor diffusion. To overcome this shortcoming,
DES incorporates cipher block chaining (CBC) mode, in which the first block of
plaintext is combined with an initial vector by an Exclusive-OR operation and then
enciphered; the resulting ciphertext is then combined with the second block of the plaintext
using an Exclusive-OR operation and enciphered, and this process continues until the
entire plaintext is encrypted (a toy sketch of this chaining appears below, after these descriptions). In this algorithm, there are three steps that explicitly incorporate
diffusion and confusion. Diffusion is introduced using permutation of the plaintext.
Confusion is realized through substitution of an m-bit number by an n-bit number, obtained by
selectively omitting some bits, which is then used in the encryption process. These
steps eventually obscure the features of the plaintext and the encryption process to such an
extent that an intruder is forced to resort to an extensive variant of the exhaustive attack to
break the cipher.
DES eventually faded out, primarily because it used a key length of only 56 bits,
as more and more versatile computers with greater speed and lower cost were
introduced that succeeded in breaking DES-based encryption. The life of
DES was extended by the triple DES (3DES) algorithm, which employed a key of size
112 bits and could effectively use keys up to 168 bits in length; this was considered sufficiently
secure against attacks for only a few years and hence was endorsed as an interim
standard until a new standard could be adopted.
The principal drawback of 3DES is that the algorithm is relatively sluggish in software.
Moreover, both DES and 3DES use a 64-bit block size, which is not desirable from the
perspective of either efficiency or security; a larger block size was thus wanted. Work
continued in the quest for a new standard, and ultimately AES was introduced, adopted
in 2001.
• Advanced Encryption Standard: Reviewing all these drawbacks, even 3DES was not
considered a dependable candidate for long-term use. Consequently, the National Institute
of Standards and Technology (NIST) in 1997, along with others, proposed the new AES,
which is stronger than 3DES and is essentially a symmetric block cipher with a block length of
128 bits and support for key lengths of 128, 192, and 256 bits. NIST released AES
in 2001 as a Federal Information Processing Standard (FIPS) that largely fulfilled all the
criteria, including security, computational efficiency, memory requirements, hardware and
software suitability, and flexibility.
AES uses a block size of 128 bits and keys of 128, 192, or 256 bits. It is essentially a
variant of Rijndael, a compact and fast encryption algorithm employing only
substitutions and permutations, with key and block sizes in the range 128-256
bits in multiples of 32 bits. AES operates on an array of 4 × 4 bytes, called the state, which holds
a block of plaintext on which several rounds of operations are carried out. The number of
rounds depends on the key length: 10 rounds are performed for 128-bit keys,
12 rounds for 192-bit keys, and 14 rounds for 256-bit keys. Each round consists
of a set of specified operations: byte substitution, shifting of rows, mixing of columns, and
key addition.
To enable both encryption and decryption to be performed using the same sequence of
steps, a key addition is performed before starting the first round, and the step involving
mixing of columns is skipped in the last round.
More details on each of these schemes, with a figure, are given on the Support Material at www.routledge.com/9781032467238.
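The diffusion benefit of chaining mentioned for DES can be seen even with a toy block transformation. The sketch below is not DES (the per-block "cipher" is just an XOR with the key), but it shows how CBC makes identical plaintext blocks produce different ciphertext blocks, unlike plain block-by-block (ECB-style) operation.

    def toy_block_encrypt(block: bytes, key: bytes) -> bytes:
        # Stand-in for a real block cipher: XOR the 8-byte block with the key.
        return bytes(b ^ k for b, k in zip(block, key))

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    KEY = b"8bytekey"
    IV = b"initvec0"
    blocks = [b"SAMEBLK!", b"SAMEBLK!", b"OTHERBLK"]   # two identical plaintext blocks

    # ECB-style: identical plaintext blocks -> identical ciphertext blocks.
    ecb = [toy_block_encrypt(b, KEY) for b in blocks]

    # CBC-style: each block is first XORed with the previous ciphertext block.
    cbc, prev = [], IV
    for b in blocks:
        c = toy_block_encrypt(xor_bytes(b, prev), KEY)
        cbc.append(c)
        prev = c

    print(ecb[0] == ecb[1])   # True: the repetition leaks through
    print(cbc[0] == cbc[1])   # False: chaining hides it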
key rather than a secret key, to avoid confusion with symmetric encryption, in which the key is
always referred to as a secret key.
Finally, it is worth mentioning that public-key encryption should not be thought of as more secure
against cryptanalysis than symmetric encryption. In fact, the strength of any encryption
scheme against attacks depends mostly on the length of the key and the computational work
involved in breaking the cipher. There is, in principle, no basis for deciding that
either symmetric or public-key encryption is superior to the other in
resisting cryptanalysis. Moreover, there is no reason to believe that public-key encryption, being
a general-purpose, asymmetric technique, has elbowed out symmetric encryption and made it
obsolete. On the contrary, because of the computational overhead of today's public-key encryption
schemes, there is no foreseeable likelihood that symmetric encryption will be
discarded. In fact, symmetric encryption outperforms public-key encryption in many respects,
while public-key techniques are particularly valuable in areas such as key distribution.
• RSA algorithm: Public-key encryption schemes have been implemented in many ways.
One of the first public-key encryption schemes was developed by Ron Rivest, Adi Shamir,
and Len Adleman at MIT in 1977. The RSA scheme soon came to dominate and since
that time has reigned as the most widely accepted and implemented approach to
public-key encryption. In RSA, both the plaintext and the ciphertext are integers between
0 and n - 1 for some n, and encryption uses modular arithmetic.
The secret of its success is the strength of the algorithm, which rests mostly on
the difficulty of factoring numbers into their prime factors, on which the algorithm
is based.
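The modular-arithmetic flavor of RSA can be shown with textbook-sized primes; the sketch below uses the classic toy parameters p = 61 and q = 53 (far too small to be secure) purely to illustrate the mechanics.

    # Toy RSA with tiny primes; real keys use primes hundreds of digits long.
    p, q = 61, 53
    n = p * q                        # 3233: part of both keys
    phi = (p - 1) * (q - 1)          # 3120
    e = 17                           # public exponent, coprime to phi
    d = pow(e, -1, phi)              # 2753: private exponent (Python 3.8+ modular inverse)

    def encrypt(m):                  # plaintext m must satisfy 0 <= m < n
        return pow(m, e, n)

    def decrypt(c):
        return pow(c, d, n)

    m = 65
    c = encrypt(m)
    print(c)                         # 2790
    print(decrypt(c) == m)           # True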
• Access Control: UNIX defines three user classes: file owner, user group, and other users.
The ACL needs to record only the presence of three access rights, r, w, and x, which represent
read, write, and execute, respectively, in bit-encoded form for each of the three user
classes. When any of the 3 bits is 1, the respective access is permitted; otherwise it is denied.
Three sets of such 3-bit groups (3 × 3 = 9 bits), written (rwx)(rwx)(rwx), one set for the file owner,
one for the user group, and one for other users, are used as the access control list of any
specific file. While the 9-bit UNIX scheme is clearly less general than a full-blown ACL
system, in practice it is adequate, and its implementation is much simpler and cheaper.
The directory entry of a file contains the identity of the file owner in one field,
and the bit-encoded access descriptors (r, w, x) for each user class are stored in another field.
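A minimal sketch of how such 9-bit mode checking works follows; the chosen mode value and the uid/gid numbers are arbitrary examples, and the helper ignores details such as the superuser and supplementary groups.

    import stat

    def may_access(mode, file_uid, file_gid, uid, gid, want):
        # want is one of "r", "w", "x"; pick the owner, group, or other bit set.
        if uid == file_uid:
            bits = {"r": stat.S_IRUSR, "w": stat.S_IWUSR, "x": stat.S_IXUSR}
        elif gid == file_gid:
            bits = {"r": stat.S_IRGRP, "w": stat.S_IWGRP, "x": stat.S_IXGRP}
        else:
            bits = {"r": stat.S_IROTH, "w": stat.S_IWOTH, "x": stat.S_IXOTH}
        return bool(mode & bits[want])

    mode = 0o754                     # rwx r-x r--
    print(may_access(mode, 1000, 100, 1000, 100, "w"))   # True  (owner may write)
    print(may_access(mode, 1000, 100, 2000, 100, "w"))   # False (group: r-x only)
    print(may_access(mode, 1000, 100, 2000, 200, "r"))   # True  (others may read)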
The access privileges of a UNIX process are determined by its uid. When a process is created
by the kernel, the kernel sets the uid of the process to the id of the user who created it. However,
temporarily changing the uid of a process is possible; this is accomplished with the
setuid(<id>) system call, which changes the process's uid to <id>, and another setuid call with its own id is used to
revert to the original uid. The setgid feature analogously provides a method of temporarily changing
the group-id of a process.
More details on this topic, with a figure, are given on the Support Material at www.routledge.com/9781032467238.
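Where available, these identity-changing calls are exposed directly to user programs. The short Python sketch below (Unix-only, using the closely related seteuid call so the change can be undone, and succeeding only for a sufficiently privileged process) simply shows the call sequence described above; the uid values are arbitrary examples.

    import os

    print("current uid:", os.getuid())

    # A privileged (root) process could temporarily assume another identity:
    try:
        os.seteuid(1000)             # hypothetical unprivileged user id
        print("now effectively uid", os.geteuid())
        os.seteuid(0)                # revert (assuming the process started as root)
    except PermissionError:
        print("not privileged enough to switch uids; shown for illustration only")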
• Access Control: Linux protects file access through the user-id and group-id of a process.
When a server such as NFS accesses a file on behalf of a user, the access check should be made
with the user's identity rather than the server's own. To enable the server to temporarily gain the
access rights of its users, Linux provides the setfsuid and setfsgid system calls, through which a
server can temporarily assume the identity of its client.
The Linux kernel provides loadable kernel modules through which improved access controls can
be realized; one such framework is Linux Security Modules (LSM), which supports many different security
models. In fact, the Security-Enhanced Linux (SELinux) work of the US National Security Agency has
built additional access control mechanisms through LSM that provide mandatory access control.
More details on this topic are given on the Support Material at www.routledge.com/9781032467238.
with each process, which is analogous to capabilities, and a security descriptor associated with
each object, which also enables interprocess accesses. An important aspect of Windows security is
the concept of impersonation, which simplifies client-server interaction over an RPC
connection: the server can temporarily assume the identity of the client so that it can evaluate
the client's request for access relative to that client's rights. After the access, the server automatically
reverts to its own identity.
• Access Token: When logon to a Windows system succeeds using the name/password
scheme, a process object is created with an access token that determines
which access privileges the process may have; it bears all the necessary security information
and also speeds up access validation. A process can create more access tokens
through the LogonUser function. Generally, the token is initialized with each of these
privileges in a disabled state. Subsequently, if one of the user's processes needs to
perform a privileged operation, the process may enable the appropriate privilege
and attempt access. The general structure of an access token includes fields
such as Security ID (SID), Group SIDs, Privileges, Default Owner,
and Default ACL.
• Security Descriptors: When an object such as a file is created, the creating process assigns
it its own SID or any group SID from its access token, which makes interprocess
access possible. Each object is associated with a security descriptor whose
chief component is an access control list; the descriptor mainly includes fields such as
Flags, Owner, Discretionary Access Control List (DACL), and System Access Control List
(SACL), in order to specify access rights for various users and user groups for this object.
When a process attempts to access the object, the SID of the process is compared against
the ACL of the object to determine whether access is permitted.
• Access Control List: The access control lists at the heart of Windows provide its access control
facilities. Each such list consists of an overall header and a variable number of access
control entries (ACEs). Each entry specifies an individual or group SID and an access
mask that defines the rights to be granted to that SID. When a process attempts to access an
object, the object manager in the Windows executive reads the SID and group SIDs from the access
token and then scans down the object's DACL. If a match is found, that is, if an ACE is
found with a SID that matches one of the SIDs from the access token, then the process
has the access rights specified by the access mask in that ACE (a toy sketch of this scan follows these paragraphs). The access mask has 32
bits, in which each bit or group of consecutive bits carries information about
access rights applicable to all types of objects (generic access types) as well as specific
access types appropriate to a particular type of object.
The high-order half (16 bits) of the mask contains bits relating to four generic access types that
apply to all types of objects; these bits provide a convenient way to set the corresponding specific access types for
many different types of object. The lower 5 bits of this high-order half refer to five standard
access types: Synchronize, Write owner, Write DAC, Read control, and Delete. The least significant
16 bits of the mask specify access rights that apply to a particular type of object. For example, bit
0 for a file object is File_Read_Data access, whereas bit 0 for an event object is Event_Query_Status
access.
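The DACL scan described above can be mimicked in a few lines; the SID strings, mask constants, and list layout below are invented for illustration and do not correspond to the real Windows structures or API, and the "first matching ACE wins" rule follows the simplified description given in the text.

    # Hypothetical miniature of the DACL scan.
    FILE_READ_DATA = 0x0001          # made-up mask bits for the example
    FILE_WRITE_DATA = 0x0002

    dacl = [
        ("S-1-5-21-ALICE",       FILE_READ_DATA | FILE_WRITE_DATA),
        ("S-1-5-21-GROUP-USERS", FILE_READ_DATA),
    ]

    def access_check(token_sids, requested_mask):
        for ace_sid, ace_mask in dacl:               # scan down the DACL
            if ace_sid in token_sids:                # first matching ACE wins
                return (ace_mask & requested_mask) == requested_mask
        return False                                 # no matching ACE: deny

    bob_token = {"S-1-5-21-BOB", "S-1-5-21-GROUP-USERS"}
    print(access_check(bob_token, FILE_READ_DATA))            # True (via the group ACE)
    print(access_check(bob_token, FILE_WRITE_DATA))           # False
    print(access_check({"S-1-5-21-ALICE"}, FILE_WRITE_DATA))  # True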
Another important feature of the Windows security model is that applications and utilities
can exploit the Windows security framework for user-defined objects. A database server, for
example, might create its own security descriptors and attach them to specific portions of a
database. In addition to normal read/write access constraints, the server could secure database-specific
operations, such as deleting an object or performing a join. It would then be the server's
responsibility to define the meaning of these special rights and carry out the access checks,
and such checks would occur in a standard way, using system-wide user/group accounts
and audit logs.
SUMMARY
The operating system offers protection mechanisms on which users build security
policies to ultimately protect the computer system. System security is examined from different
angles to reveal the numerous types of security attacks that may be launched and to estimate the role and
effect of malicious programs as major security threats, and approaches are then implemented
to counter all such threats and protect the related resources. Two popular access-control-based
schemes are access control lists and capability lists; both are implemented
by most popular systems, with ACLs maintaining static information and capabilities created
by the system at runtime. Different types of authentication strategies, including the widely
employed password-based schemes that keep illegitimate users away from systems, are discussed. Various
encryption/decryption-based approaches, including symmetric (private-key)
encryption and asymmetric (public-key) encryption schemes, are demonstrated.
Viruses, the flagship security attackers causing numerous threats, are discussed in detail. Last,
the different types of protection mechanisms offered by the most popular operating systems,
UNIX, Linux, and Windows, to meet their individual objectives, and subsequently how
security systems are built by users on each of these platforms, are narrated in brief as case
studies.
EXERCISES
1. Describe the distinctive differences between security and protection.
2. Protection is implemented by the operating system using two key methods, authentication
and authorization; discuss with your own comments.
3. What is the difference between policies and mechanisms with respect to both security and
protection?
4. Discuss the common requirements or goals that an operating system must meet to prevent
generic security threats.
5. State and characterize the different types of attacks that may be launched on the security of a
system.
6. “It is not the data alone but the entire computer system that is always exposed to threats in
security attacks”. Justify the statement.
7. What is the difference between passive and active security attacks? Discuss briefly the dif-
ferent categories that are commonly observed in active security attacks.
8. What is the spectrum of approaches by which an operating system may provide appropriate
protection to a user?
9. Dynamic relocation hardware is usually considered a basic memory protection mecha-
nism. What is the protection state in relocation hardware? How does the operating system
ensure that the protection state is not changed indiscriminately?
10. Give your argument for conditions under which the access control list method is superior
to the capability list approach for implementing the access matrix.
11. An OS performs validation of software capabilities as follows: When a new capability is
created, the object manager stores a copy of the capability for its own use. When a process
wishes to perform an operation, the capability presented by it is compared with stored
capabilities. The operation is permitted only if a matching capability exists with the object
manager. Do you think that this scheme is foolproof? Using this scheme, is it possible to
perform selective revocation of access privileges?
12. The use of capabilities is very efficient for realizing protection mechanisms. What are the practi-
cal difficulties in implementing them?
13. Passwords are a common form to implement user authentication. Briefy, state the ways in
which the password fle can be protected.
14. Write down the numerous techniques by which an intruder may launch a variety of inge-
nious attacks to breach/crack security or to learn passwords.
15. What are the popular and effective techniques that have been proposed to defeat attacks on
passwords?
16. Password-based authentication often suffers from inherent limitations. What are other
ways by which user authentication can be checked?
17. What is meant by malicious software? How are the threats launched by them differenti-
ated? Write down the different types of malicious software that are commonly found in
computer environments.
18. Can the Trojan horse attack work in a system protected by capabilities?
19. What is meant by a virus? What is the common nature of viruses? Write down the types of
viruses you are aware of.
20. What is an e-mail virus? How does it work? What are the salient features that differentiate
e-mail viruses from other types of viruses?
21. What is the difference between a virus and a worm? How do they each reproduce?
22. What is an antivirus program? How does it work? Write down the steps it takes to restore the system
to normalcy.
23. List the security attacks that cannot be prevented by encryption.
24. What are the two general approaches to attacking a conventional encryption system?
25. Assume that passwords are limited to the 95 printable ASCII characters and that all
passwords are 10 characters in length. Suppose a password cracker is
capable of an encryption rate of 6.5 million encryptions per second. How long will it
take, launching an exhaustive attack, to test all possible passwords on a UNIX system?
26. What are DES and Triple DES? Discuss their relative merits and drawbacks.
27. The encryption scheme used for UNIX passwords is one way; it is not possible to reverse
it. Therefore, would it be more accurate to say that this is, in fact, a hash code rather than
an encryption of the password?
28. It was stated that the inclusion of the salt in the UNIX password scheme increases the dif-
ficulty of guessing by a factor of 4096 (2^12 = 4096). But the salt is stored in plaintext in the
same entry as the corresponding ciphertext password. Therefore, that 12-bit salt value
is known to the attacker and need not be guessed. Why, then, is it asserted that the
salt increases security?
29. How is the AES expected to be an improvement over triple DES?
30. What evaluation criteria will be used in assessing AES candidates?
31. Describe the differences among the terms public key, private key, and secret key.
32. Explain the difference between conventional encryption and public-key encryption.
Saltzer, J. H., Schroeder, M. D. “The Protection of Information in Computer Systems”, Proceedings of
the IEEE, vol. 63, pp. 1278–1308, 1975.
Sandhu, R. S. “Lattice-Based Access Control Models”, Computer, vol. 26, pp. 9–19, 1993.
Shannon, C. E. “Communication Theory of Secrecy Systems”, Bell System Technical Journal, vol. 28, pp. 656–715, 1949.
Schneier, B. Applied Cryptography, New York, Wiley, 1996.
Treese, W. “The State of Security on the Internet”, NetWorker, vol. 8, pp. 13–15, 2004.
Weiss, A. “Spyware Be Gone”, NetWorker, vol. 9, pp. 18–25, 2005.
Wright, C., Cowan, C., et al. “Linux Security Modules: General Security Support for the Linux Kernel”,
Eleventh USENIX Security Symposium, 2002.
RECOMMENDED WEBSITES
AntiVirus On-line: IBM’s site about virus information.
Computer Security Resource Center: Maintained by the National Institute of Standards and Technology
(NIST). It contains a wide range of information on security, threats, technology, and standards.
Intrusion Detection Working Group: Contains all of the documents, including those on protection
and security, generated by this group.
CERT Coordination Center: This organization evolved from the computer emergency response team formed
by the Defense Advanced Research Projects Agency. The site offers good information on Internet security,
vulnerabilities, threats, and attack statistics.
9 Distributed Systems: An Introduction
Learning Objectives
• To describe the evolution of distributed computing systems and their advantages and dis-
advantages, including different forms of their hardware design.
• To describe the forms of software that drive the distributed computing systems.
• To demystify the generic distributed operating system and its design issues.
• To discuss generic multiprocessor operating systems with numerous considerations used in
different forms of multiprocessor architecture.
• To elucidate different management systems of OSs with emphasis on processor manage-
ment, including the different methods used in processor scheduling, process scheduling,
and thread scheduling in multiprocessor environments.
• To present separately in brief the Linux OS and Windows OS in multiprocessor environ-
ments as case study.
• To present the multicomputer system architecture and its different models.
• To discuss the design issues of generic multicomputer operating systems.
• To introduce the concept of middleware and its different models in the design of true dis-
tributed systems, including its services to different application systems.
• To present a rough comparison between various types of operating systems running on
multiple-CPU systems.
• To introduce the concept of distributed systems built in the premises of networks of com-
puters with their related networking issues.
• To present as a case study a brief overview of AMOEBA, a traditional distributed operat-
ing system.
• To discuss in brief internetworking with all its related issues.
• To discuss in brief the design issues of distributed operating systems built in the premises
of workstation–server model.
• To discuss the remote procedure call and the implementation of generic RPCs as well as
the implementation of SUN RPC, presented here as case study.
• To present a brief overview of distributed shared memory as well as its design issues and
implementation aspects.
• To discuss the different aspects of distributed file systems (DFSs) and their various design
issues, along with a brief description of their operation.
• To briefly describe the implementation of the Windows DFS, SUN NFS, and Linux GPFS
as case studies.
• To present a modern approach to distributed computer system design, the cluster, along
with its advantages, classifications, and different methods of clustering.
• To briefly describe the general architecture of clusters and their operating system
aspects.
• To briefly describe the different aspects of implementation of Windows and SUN clusters
as case studies.
remote. Often, a processor together with its allied resources is referred to as a node, site, or even
machine of the distributed computing system.
Advancements in many areas of networking technology continued, and as a result, another major
innovation in networking technology took place in the early 1990s with the introduction of asyn-
chronous transfer mode (ATM) technology, which offered very high-speed data transmission, on
the order of 1.2 gigabits per second, in both LAN and WAN environments. Consequently, it became possible to
support a new class of distributed computing, called multimedia applications, that handles
a mixture of information, including voice, video, and ordinary text data. Such applications were
simply beyond imagination with the existing traditional LANs and WANs.
Distributed systems appeared in the late 1970s and were well defined from the stand-
point of hardware, but the appropriate software (mainly the operating system) that could extract
their power to the fullest extent was not available. They need radically different software from
that used for centralized systems. Although this field is not yet mature, the extensive research already
carried out, and still in progress, has provided enough basic ideas to evolve a formal design for these
operating systems. Commercial distributed operating systems (DOSs) of different forms have
emerged that can support many popular distributed applications. Distributed computing systems
that use DOSs are referred to by the term true distributed systems or simply distributed systems.
The term “distributed system” thus implies the presence of a DOS on some model of a distributed comput-
ing system.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
It is to be noted that all these advantages can be extracted only by an appropriately and carefully
designed operating system that drives the well-organized machines in a distributed computing sys-
tem and manages all the processes in a way that makes them fit properly into a distributed environment.
Since the entire distributed environment is exposed and available to many users, security in such
systems is certainly a critically central issue. Moreover, the continuous increase in user density and
the explosive growth of different models of distributed systems and applications have
made this aspect even more vital. Distributed systems have, however, amplified the dependence of both orga-
nizations and users on the information stored in, and the communications carried over, the interconnecting
networks. This, in turn, means a need to protect data and messages with respect to their authenticity
and authority as well as to protect the entire system from network-based attacks, be they viruses, hack-
ers, or fraud. Fortunately, computer security has by this time become more mature; many suitable
means and measures, including cryptography, are now available to readily enforce security.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
dependence on a communication network, which may cause data or messages to be lost during
transmission across the network, requiring the intervention of additional special software to handle
the situation, which, in turn, results in appreciable degradation of overall system performance and
responsiveness to users. Moreover, when traffic on the network continues to grow, it exhausts the
network capacity; the network saturates and becomes overloaded, and users on the network may then
come to a standstill. Either special software is needed to negotiate this problem, or the communication
network must be upgraded to a higher bandwidth (perhaps using fiber optics), incurring a huge
cost. All these, along with other perennial problems, could ultimately negate most of the advantages
the distributed computing system was built to provide.
The other problem comes from one of the system’s advantages, which is the easy sharing of data
that exposes these data to all users, and consequently, this gives rise to a severe security problem.
Special security measures are additionally needed to protect widely distributed shared resources
and services against intentional or accidental violation of access control and privacy constraints.
Additional mechanisms may also be needed to keep important data dedicated, isolated, and secret
at all costs. Fortunately, several commonly used techniques are available today to serve the purpose
of designing more secure distributed computing systems.
Last but not least is the lack of suitable system software, which is inherently
much more complex and difficult to build than its counterpart for traditional centralized systems. This
increased complexity is mainly due to the fact that, apart from performing its usual responsibilities
by effectively using and efficiently managing a large number of distributed resources, such software
must also be capable of handling communication and security problems that are very different from
those of centralized systems. In fact, the performance and reliability of a distributed computing
system depend to a great extent on the performance and reliability of the large number of
distributed resources attached to it and on the underlying communication network, apart from
the performance of the additional software, as already mentioned, that safeguards the system from any
possible attack to keep it in normal operation.
Despite all these potential problems, as well as the increased complexity and difficulties in build-
ing distributed computing systems, it is observed that their advantages totally outweigh their dis-
advantages, and that is why the use of distributed computing systems is rapidly increasing. In fact,
the major advantages, economic pressures, and increased importance that have led to the growing
popularity of distributed computing systems will eventually push developments one step further,
connecting most computers to form large distributed systems that provide even better,
cheaper, and more convenient service to most users.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
FIGURE 9.1 A schematic block-wise representation relating to the taxonomy of parallel and distributed
computing systems.
bandwidth, since most of the accesses made by each processor are to its local memory, thereby
reducing the latency that eventually results in an increase in processor performance. The nodes
in the machine are equipped with communication interfaces so that they can be connected to one
another through an interconnection network.
Due to the advent of more powerful VLSI technology in the mid-1980s, it became feasible to
develop powerful one-chip microprocessors and larger-capacity RAM at reasonable cost. Large-
scale multiprocessor architectures with radical changes then started to emerge with multiple memo-
ries that are now distributed with the processors. As a result, each CPU can now access its own
local memory quickly, while accessing the memories attached to other CPUs, as well as any
separate common shared memory, is also possible but relatively slow. That is, these physically
separated memories can now be addressed as one logically shared address space, meaning that any
memory location can be addressed by any processor, assuming it has the correct access rights. This,
however, does not discard the fundamental shared memory concept of multiprocessors but supports
it in a broader sense. These machines are historically called distributed shared memory systems
or the scalable shared memory architecture model using non-uniform memory access (NUMA).
The distributed shared memory (DSM) architecture, however, can be considered a loosely coupled
multiprocessor, sometimes referred to as a distributed computing system in contrast to its coun-
terpart, a shared memory multiprocessor (uniform memory access (UMA)), considered a tightly
coupled multiprocessor, often called a parallel processing system.
Tightly coupled systems tend to be used to work on a single program (or problem) to achieve
maximum speed, and the number of processors that can be effectively and efficiently employed is
usually small and constrained by the bandwidth of the shared memory, resulting in limited scalabil-
ity. Loosely coupled systems (multicomputers), on the other hand, often referred to as distributed
computing systems, are designed primarily to allow many users to work together on many unre-
lated problems but occasionally in a cooperative manner that mainly involves sharing of resources.
Because of their loosely coupled architecture, these systems are more freely expandable and theoretically
can contain any number of interconnected processors with no limits, and these processors can
even be located far from each other in order to cover a wider geographical area. Tightly coupled
multiprocessors can exchange data nearly at memory speeds, but some fiber-optic-based multicom-
puters have also been found to work very close to memory speeds. Therefore, although the terms
“tightly coupled” and “loosely coupled” indicate some useful concepts, any distinct demarca-
tion between them is difficult to maintain, because the design spectrum is really a continuum.
Both multiprocessors and multicomputers can each be further divided into two categories
based on the architecture of the interconnection network: bus and switched, as shown in Figure 9.1.
By bus, it is meant that there is a single network, backplane, bus, cable, or other medium that con-
nects all the machines. Switched systems consist of individual wires from machine to machine, with
many different wiring patterns in use, giving rise to a specific topology. Usually, messages move
along the wires, with an explicit switching decision made at each step to route the message along
one of the outgoing wires.
Distributed computing systems with multicomputers can be further classified into two different
categories: homogeneous and heterogeneous. In a homogeneous multicomputer, all processors are
the same and generally have access to the same amount of private memory and also only a single
interconnection network that uses the same technology everywhere. These multicomputers are used
more as parallel systems (working on a single problem), just like multiprocessors. A heterogeneous
multicomputer, in contrast, may contain a variety of different independent computers, which, in
turn, are connected through different networks. For example, a distributed computing system may
be built from a collection of different local-area computer networks, which are then interconnected
through a different communication network, such as a fiber distributed data interface (FDDI) or
ATM-switched backbone.
Multiprocessors that usually give rise to parallel computing systems lie outside the scope of this
chapter. Multicomputers, whether bus-based or switch-based, are distributed computing systems,
although not directly related to our main objective, DOSs, but they still deserve some discussion
because they will shed some light on our present subject, as we will observe that different forms of
such machines use different kinds of operating systems.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
the application database; and the storage, processing, and communication workloads for access to
objects are distributed across many computers with communication links. Each object is replicated
in several computers to further distribute the load and to provide resilience in the event of individual
machine faults or communication link failure (as is inevitable in the large, heterogeneous networks
in which peer-to-peer systems exist). The need to place individual objects and retrieve them and to
maintain replicas among many computers renders this architecture relatively more complex than its
counterparts in other popular forms of architecture. This form of distributed computing system was
used in the early ARPANET and is found to be appropriate for the situations in which resource shar-
ing, such as sharing of files of different types, with each type of file located on a different machine,
is needed by remote users.
Brief details on this topic with a figure are given on the Support Material at www.routledge.com/9781032467238.
this model does not require any migration of the user’s processes to the target server machine for
getting the work executed by those machines.
With the passage of time, this model has become increasingly popular, mainly because it provides an
effective general-purpose means of sharing information and resources in distributed computing
systems. It can be implemented in a variety of hardware and software environments and
possesses a number of characteristics that make it distinct from other types of distributed computing
systems. Some of the variations on this model involve consideration of the following factors:
The term mobile code is used to refer to code that can be sent from one computer to another and run
at the destination. Java applets are a well-known and widely used example of mobile code. Code fit
to run on one computer is not necessarily suitable to run on another, because executable pro-
grams are normally specific both to the instruction set (hardware) and to the host operating system. The
use of the software virtual machine approach (such as the Java virtual machine, JVM), however, pro-
vides a way to make such code executable in any environment (hardware and OS). A mobile agent
is a running program (consisting of both code and data) that travels from one computer to another in
a network carrying out a task (such as collecting information) on someone's behalf, eventually return-
ing with the results. A mobile agent may use many local resources at each site it visits.
Moreover, it has also been observed that both client and server processes can sometimes run
on the same computer, and it is at times difficult to strictly distinguish between a server
process and a client process. In addition, some processes act as both client and server:
a server process may sometimes use the services of another server (as in the case of
three-tier architecture), thereby appearing as a client to the latter.
FIGURE 9.2 A representative scheme of a three-tier architecture used in the client–server model formed
with computer systems in the premises of computer networks.
is the relatively slow-speed interconnection network that communicates between the processors
where the jobs are to be executed and the terminals via which the users talk with the system. That
is why this model is usually considered unsuitable for general environments in which
high-performance interactive applications typically run. However, a distributed computing system based on the
processor-pool model has been implemented in a well-known distributed system, Amoeba (Mullender
et al., 1990).
Brief details on this topic with a figure are given on the Support Material at www.routledge.com/9781032467238.
FIGURE 9.3 A schematic representation of the taxonomy of different combinations of MIMD-category
hardware and related software used in the domain of different types of interconnected multiple-CPU com-
puter systems.
In tightly coupled hardware systems, the OS software system essentially tries to maintain a
single, global view of the resources it manages. In loosely coupled hardware systems, the individual
machines are clearly distinguishable and fundamentally independent of one another, each running
its own operating system to execute its own job. However, these operating systems often interact to
a limited degree whenever necessary and work together to offer their own resources and services
to others.
A tightly coupled operating system, when used for managing tightly coupled multiprocessors,
is generally referred to as multiprocessor operating system, which, when combined with the
underlying architecture of the computing system, gives rise to the concept of a parallel system.
Numerous issues in regard to this system need to be addressed, but a detailed discussion of such an
operating system is outside the scope of this book. However, some important issues in regard to this
system are discussed in brief later in this chapter.
A tightly coupled operating system, when used for managing a loosely coupled multiprocessor
(DSM) and homogeneous multicomputers, is generally referred to as a distributed operating sys-
tem, which, when combined with the underlying architecture of the computing system, gives rise
to the concept of a distributed system. Distributed operating systems are homogeneous, implying
that each node runs the same operating system (kernel). Although the forms and implementation
issues of DOSs that drive a loosely coupled multiprocessor and of those that manage
homogeneous multicomputers are quite different, their main objectives and design issues hap-
pen to be the same. As with conventional uniprocessor operating systems, the main objective
of a DOS is to hide the complexities of managing the underlying distributed hardware
so that it can be enjoyed and shared by multiple processes.
A loosely coupled operating system, when used for managing loosely coupled hardware, such
as a heterogeneous multicomputer system (LAN- and WAN-based), is generally referred to as a
network operating system (NOS). Although a NOS does not manage the underlying hardware in
the way that a conventional uniprocessor operating system usually does, it provides additional
support in that local services are made available to remote clients. In the following sections, we
will first describe in brief the loosely coupled operating system and then focus on tightly coupled
(distributed) operating systems.
Of the many different forms of loosely coupled hardware, the most widely used is the cli-
ent–server model using many heterogeneous computer systems, the benefits of which are mainly
tied to its design approach, such as its modularity and the ability to mix and match different
platforms with applications to offer a business solution. However, a lack of standards
prevents the model managed by a NOS from being a true distributed system. To alleviate
these limitations, and to achieve the real benefits of the client–server approach with the flavor of
an actual distributed system (general-purpose services) that would be able to implement an inte-
grated, multi-vendor, enterprise-wide client–server configuration, there must be a set of tools
that provides a uniform means and style of access to system resources across all platforms.
Enhancements along these lines to the services of NOSs have been carried out to provide distribu-
tion transparency. These enhancements eventually led to the introduction of what is known as middleware,
which lies at the heart of modern distributed systems. Middleware and its main issues will be dis-
cussed later in this chapter.
FIGURE 9.4 A representative block diagram of the general structure of a network operating system used in the
premises of computer networks formed with multiple computers (each of which may have a single CPU or multiple CPUs).
FIGURE 9.5 A block-structured illustration of a representative scheme consisting of autonomous clients and
a server operated under a network operating system used in computer networks formed with multiple computers.
for example, a shared, global file system that is accessible from all client machines. The file
system may be supported by one or more machines called file servers, which accept requests from
user programs running on the other (non-server) machines. Each such incoming request is then exam-
ined and executed at the server's end, and the reply is sent back accordingly. This is illustrated in
Figure 9.5.
File servers usually maintain hierarchical file systems, each with a root directory containing sub-
directories and files. Each machine (or client) can mount these file systems, augmenting its local
file system with those located on the servers. It really does not matter where a server is mounted by
a client in its directory hierarchy, since all machines operate relatively independently of the others. That
is why different clients can have different views of the file system: the name of a file actually
depends on where it is being accessed from and how that machine has built its file system.
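A minimal sketch of this naming effect follows, using invented server and path names: each client keeps its own mount table, so the same exported directory on the same file server ends up with different local path names on different clients.

# Minimal sketch (not a real NOS client): each client keeps a mount table
# mapping a local mount point to (file server, exported directory).
class Client:
    def __init__(self):
        self.mounts = {}                          # local prefix -> (server, export)

    def mount(self, local_prefix, server, export):
        self.mounts[local_prefix] = (server, export)

    def resolve(self, path):
        """Translate a local path name into (server, remote path)."""
        for prefix, (server, export) in self.mounts.items():
            if path.startswith(prefix + "/"):
                return server, export + path[len(prefix):]
        return None, path                         # not under any mount: a local file

a, b = Client(), Client()
a.mount("/work", "fs1", "/export/projects")              # client A's view
b.mount("/remote/projects", "fs1", "/export/projects")   # client B's view

# The same remote file is known by two different local names:
print(a.resolve("/work/report.txt"))              # ('fs1', '/export/projects/report.txt')
print(b.resolve("/remote/projects/report.txt"))   # ('fs1', '/export/projects/report.txt')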
Many NOSs have been developed by different manufacturers on top of UNIX and other operating
systems, including Linux and Windows. One of the best-known and most widely used commercial
networking systems is Sun Microsystems' Network File System, universally known as NFS, which
was primarily used on its UNIX-based workstations, although it supports heterogeneous systems;
for example, Windows running on an Intel Pentium can get service from UNIX file servers running on
Sun SPARC processors.
A network OS is easier to implement but is clearly more primitive than a full-fledged distrib-
uted OS. Still, some of the advantages of an NOS compared to a DOS are that it allows
machines (nodes) to maintain total autonomy and to remain highly independent of each other; it also
facilitates safely adding a machine to or removing one from a common network
without affecting the others, except that the other machines in the network must be informed about the exis-
tence of a new one. The addition of a new server to the internet, for example, is done in precisely this
way. To introduce a newly added machine on the internet, all that is needed is to provide its network
address or, even better, to give the new machine a symbolic name that can subsequently be
placed in the Domain Name Service (DNS) along with its network address.
A few of the major shortcomings of a network OS are that it allows the local operating systems
to remain independent and stand-alone, retaining their identities and managing their own resources, so they
are visible to users, but their functioning cannot be integrated. Consequently, this lack of transpar-
ency means a network OS fails to provide a single, system-wide coherent view, and it cannot
balance or optimize the utilization of resources, because the resources are not under its direct control.
In addition, all access permissions, in general, have to be maintained per machine, and there is
no simple way of changing permissions unless they are the same everywhere. This decentralized
approach to security sometimes makes the system vulnerable and thereby makes it equally hard to protect the
NOS against malicious attacks. Other issues also stand as drawbacks of an NOS, and
all these together draw a clean demarcation between an NOS and its counterpart, a DOS. However,
there are still other factors that make a DOS distinctly different from an NOS.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
A distributed system is one that runs on a collection of interconnected independent computers which
do not have shared memories yet appears to its users as a single coherent system.
This characteristic is also sometimes referred to as the single system image (SSI). A slightly differ-
ent notion of a distributed system is that it is one that runs on a collection of networked machines
but acts like a virtual uniprocessor. No matter how it is expressed, the leading edge in the quest for
distributed systems and their development is mainly focused on the area of DOSs. Although some
commercial systems have already been introduced, fully functional DOSs with most of their neces-
sary attributes are still at the experimental stage in the laboratory. However, the definition of a DOS
can be presented in the following way:
A distributed operating system is one that presents itself to its users like an ordinary centralized oper-
ating system but controls the operations of multiple independent central processing units (nodes) in a
well-integrated manner. The key concept here is transparency. In other words, the use of multiple pro-
cessors should be invisible (transparent) to the user, who views the system as a “virtual uniprocessor”,
not as a collection of distinct machines (processors).
However, the functionality of DOSs is essentially identical to that of traditional operating systems
for uniprocessor machines with the exception that they manage multiple CPUs. The advantages of
distributed computing systems, as mentioned in Section 9.2, are exploited by using a DOS that takes
advantage of the available multiple resources and thereby disperses the processes of an applica-
tion across various machines (or CPUs) in the system to achieve computation speed-up, effective
utilization and sharing of the underlying resources whenever possible, communication
and cooperation between users via the existing network, and above all to provide reliability when-
ever necessary. However, there remains a possibility of communication network failures or break-
down of individual computer systems that sometimes complicates the functioning of the underlying
operating system and necessitates use of special techniques in its design to negotiate these situa-
tions. Users of these operating systems also often require special techniques that facilitate access to
resources over the existing network.
Users of a DOS have user ids and passwords that are valid throughout the system. This
feature makes communication conveniently possible between users in two ways. First, commu-
nication using user-ids automatically invokes the security mechanisms of the OS to intervene
and thereby ensures the authenticity of communication. Second, users can be mobile within the
domain of the distributed system and still be able to communicate with other users of the system
with ease.
Distributed operating systems can be classified into two broad categories: an operating
system for multiprocessors (DSM), which manages the resources of a multiprocessor, and an operating
system for multicomputers, which is developed to handle homogeneous
multicomputers.
all the transparency aspects together that summarily explain the different transparency facets and
their respective implications.
• Location and access transparency: Resources and services are usually made transpar-
ent to users by identifying them simply by their names, which do not depend on
their locations in the system (a toy sketch of such name-based resolution appears after this
list). This aspect also facilitates migration transparency, which
ensures that the movement of data is handled automatically by the system in a user-
transparent manner. Distributed file systems (DFSs) also exploit this transparency aspect
favorably when they store system and user files in different locations (nodes), mostly to opti-
mize disk space usage and network traversal time, again with no indication
to the user.
• Replication transparency: This is related to the creation of replicas of files and resources
to yield better performance and reliability while keeping them transparent to the user. Two
important issues related to replication transparency are the naming of replicas and replication
control, which are automatically handled by the replication management module of the
distributed system in a user-transparent manner.
• Failure transparency: In the face of a partial system failure, such as a machine (node or
processor) failure, a communication link failure, a storage device crash, or other similar
failures, the failure transparency attribute of a DOS will keep these failures
transparent to the user and still enable the system to continue to function, perhaps only
with degraded performance. The OS typically realizes this by implementing resources as a
group that cooperate with one another to perform their respective functions, so that in the event
of the failure of one or more members of the group, the user remains unaffected and will
not notice the failure. The user can continue to obtain the resource's service even
when only one of the resources in the group is up and working. Complete
failure transparency is, however, not possible to achieve with the current state of the art
in DOSs, because all types of failures cannot be handled in a user-transparent manner.
A communication link failure, for example, cannot be kept beyond the notice of
the user, since it directly hampers the user's work. Hence, the design of such a fully failure-
transparent distributed system is theoretically possible but not practically feasible.
• Performance transparency: This aims to improve the performance of the system by
enabling it to be automatically reconfigured as loads vary dynamically. This is
often carried out by rescheduling and uniformly distributing the currently available pro-
cessing capacity of the system among the jobs present within it, using facilities of the
system such as resource allocation and data migration.
• Scaling transparency: This is related to the scalability of the system that allows expan-
sion of the system in scale without affecting the ongoing activities of users. This requires
the system to have an open-system architecture and make use of appropriate scalable algo-
rithms in the design of the components of the DOS.
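A toy sketch of the name-based resolution mentioned under location and access transparency is given below; the registry contents and node names are invented. Clients always present a resource name, never a location, so the system can migrate the resource and merely update the registry without the clients noticing.

# Toy name service: clients address resources only by name; the registry,
# not the client, knows (and may change) the node that currently holds them.
registry = {"print-service": "node-7", "accounts-db": "node-3"}

def lookup(name):
    return registry[name]          # returns the node currently hosting the resource

def migrate(name, new_node):
    registry[name] = new_node      # resource relocated; callers are unaffected

print(lookup("accounts-db"))       # node-3
migrate("accounts-db", "node-9")   # the system moves the resource
print(lookup("accounts-db"))       # node-9: same name, new location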
9.9.2 RELIABILITY
Reliability of a system is closely associated with the availability of its resources, which is ensured by
protecting them against likely faults. Although the presence of multiple instances of resources in a
distributed system is generally assumed to make the system more reliable, the reality is different:
the distributed OS must be designed in such a way that the full advan-
tage of a distributed system is realized with a matching increase in the reliability of the system.
A system failure results when a fault occurs, and such failures can be categorized
into two types depending on the behavior of the failed system. The first is fail-stop failure,
which causes the system to stop functioning after changing to a state in which its cause of failure
can be detected. The second is popularly known as Byzantine failure, which causes the system to
continue functioning but generate erroneous results. Software bugs that remain undetected often
cause Byzantine failures, which are more difficult to handle than fail-stop failures.
Distributed operating systems thus must be designed properly to realize higher reliability: to
avoid faults, to tolerate faults, to detect faults, and to subsequently recover from them. Various
popular methods are available to deal with each of these issues separately.
• Fault avoidance: This is mostly accomplished by designing the components of the system
in such a way that the occurrence of the faults is minimized. Designers of the distributed
OS must test the various software components thoroughly before use to make them highly
reliable.
• Fault tolerance: This is the capability of a system to continue its proper functioning even
in the event of partial system failure, albeit with a little degradation in system performance.
A distributed OS can be equipped with improved fault-tolerance ability by using concepts
such as redundancy techniques and distributed control.
• Redundancy techniques: These essentially exploit the basic principle of replicating
critical hardware and software components in order to handle a single point of
failure, so that if one component fails, another can be used to continue. Many differ-
ent methods to implement the redundancy technique are used for dealing with differ-
ent types of hardware and software resources. Link and node faults are tolerated by
providing redundancy of resources and communication links so that the others can
be used if a fault occurs in these areas. Similarly, a critical process can be executed
simultaneously on two nodes so that if one of the two nodes fails, the execution of the
process can continue to completion at the other node. Likewise, a file is replicated on
two or more nodes of a distributed system. Additional disk space is then required, and
for correct functioning, it is often necessary that all copies of the files be mutually
consistent. Note that what is common to all these approaches is that additional over-
head is required in each case to ensure reliability. Therefore, a distributed OS must be
designed in such a way as to maintain a proper balance between the required degree
of reliability and the amount of overhead incurred. A replication approach in some
situations needs appropriate concurrency control mechanisms. Concurrency of data
becomes a critical issue when data are distributed or replicated. When several parts
of distributed data are to be modified, a fault should not put the system in a state in
which some parts of the data have been updated but others have not, due to a hardware
fault or a software error. A centralized conventional uniprocessor OS generally uses
the technique of atomic action to satisfy this requirement. A distributed OS handling
distributed data employs a technique called the two-phase commit (2PC) protocol for
this purpose.
• Distribution of control functions: Control functions in a distributed system, such
as resource allocation, scheduling, and synchronization and communication of pro-
cessors and processes, may face several problems if implemented centrally. Two of
them are quite obvious. The first is due to communication latency, which frequently
prevents the system from obtaining the latest information with respect to the current
state of processes and resources in all machines (nodes) of the system. The second
is that a centralized control function often becomes the cause of a potential per-
formance bottleneck and poses a threat to system reliability by being a single
point of control, whose failure may sometimes be fatal to the system. Due to
these factors and for many other reasons, a distributed OS must employ a distrib-
uted control mechanism to avoid a single point of failure. A highly available DFS,
for example, should have multiple and independent file servers controlling multiple
and independent storage devices. In addition, a distributed OS implements its control
functions using a distributed control algorithm, the notion of which is to perform
the control function through the cooperative participation of several nodes rather than
relying on any single node.
• Fault detection and recovery: This approach discovers the occurrence of a failure and
then restores the system to a state from which it can once again continue its normal operation. If
a fault (hardware or software failure) occurs during the ongoing execution of a computa-
tion in different nodes of the distributed system, the system must then be able to assess
the damage caused by the fault and judiciously restore the system to normalcy so the
operation can continue. Several methods are available to realize this. Some of the commonly used
techniques implemented in DOSs in this regard are:
• Atomic transactions: These are computations consisting of a set of operations that are
to be executed indivisibly (atomic action) in concurrent computations, even in the face
of failures. This implies that either all of the operations are to be completed success-
fully, or none of their effects prevail if failure occurs during the execution, and other
processes executing concurrently cannot enter the domain of this computation while it
is in progress. In short, it can be called the all-or-nothing property of transactions. In this
way, the consistency of shared data objects is preserved even when failure occurs
during execution, which eventually makes recovery from crashes much easier.
• When a system equipped with such a transaction facility halts unexpectedly due to
the occurrence of a fault or failure before a transaction is completed, it sub-
sequently restores any data objects that were undergoing modification at the time of
failure to their original states by rolling some of the subcomputations back to
previous states already recorded in back-ups (archives). This action is commonly
known as roll back, and the approach is called recovery. If a system does not support this
transaction mechanism, the sudden failure of a process during the execution of an operation
may leave the system and the data objects that were undergoing modification in such an
inconsistent state that, in some cases, it may be difficult or even impossible to restore
(roll back) them to their original states. Atomic transactions are, therefore, considered a
powerful tool that enables the system to come out of such a critical situation.
• Acknowledgements and timeout-based retransmission of messages: Interprocess commu-
nication between two processes often uses a message-passing approach in which
messages may be lost due to an unexpected system fault or failure. To guard against
loss of messages in order to ensure reliability, and to detect lost messages so that they can
be retransmitted, the sender and receiver agree that as soon as a message is received, the
receiver will send back a special message to the sender in the form of an acknowledgement. If
the sender has not received the acknowledgement within a specified timeout period, it assumes
that the message was lost and may then retransmit it; the resulting duplicate messages
are usually handled by a mechanism that automatically generates and assigns appro-
priate sequence numbers to messages (a minimal sketch of this scheme appears after this
list). A detailed discussion of the mechanisms that handle
acknowledgement messages, timeout-based retransmission of messages, and duplicate request
messages for the sake of reliable communication is provided in Chapter 4.
• Stateless servers: In distributed computing systems in the form of a client–server
model, the server can be implemented by one of two service paradigms: stateful or
stateless. These two paradigms are distinguished by one salient aspect: whether the
history of the serviced requests between a client and a server affects the execution of
the next service request. The stateful approach depends on the history of the serviced
requests, whereas the stateless approach does not. A stateless server is said to be one
that does not maintain state information about open fles. Stateless servers possess a
distinct advantage over stateful servers in the event of a system failure, since the state-
less service paradigm makes crash recovery quite easy because no client information
456 Operating Systems
is maintained by the server. On the contrary, the stateful service paradigm requires
complex crash recovery procedures since both the client and server need to reliably
detect crashes. Here, the server needs to detect client crashes so that it can abandon any
state it is holding for the client, and the client must detect server crashes so that it can
initiate necessary error-handling activities. Although stateful service is inevitable in
some cases, the stateless service paradigm must be used whenever possible in order to
simplify failure detection and recovery actions.
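The following minimal sketch illustrates the acknowledgement/timeout/sequence-number scheme referred to in the list above. The "network" is simulated by a random drop, the timeout is represented simply by the failed call returning, and all names are invented; a real implementation would of course use actual timers and message channels.

import random

# Sketch of timeout-based retransmission over an unreliable channel. The sequence
# number lets the receiver acknowledge a message and also discard the duplicates
# that retransmission inevitably creates.
class Receiver:
    def __init__(self):
        self.delivered = set()

    def receive(self, seq, payload):
        if seq not in self.delivered:      # first copy: deliver it
            self.delivered.add(seq)
        return seq                         # always (re)acknowledge, even for duplicates

def unreliable_send(receiver, seq, payload, loss_rate=0.3):
    """Deliver the message unless the simulated network drops it."""
    if random.random() < loss_rate:
        return None                        # message (or its acknowledgement) lost
    return receiver.receive(seq, payload)

def reliable_send(receiver, seq, payload, max_tries=10):
    for attempt in range(1, max_tries + 1):
        ack = unreliable_send(receiver, seq, payload)
        if ack == seq:
            return attempt                 # acknowledgement received within the "timeout"
        # timeout expired (simulated): retransmit under the same sequence number
    raise RuntimeError("receiver unreachable")

r = Receiver()
tries = reliable_send(r, seq=1, payload="hello")
print("delivered after", tries, "transmission(s)")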
However, the major drawback in realizing increased reliability in a distributed system lies in the
costly extra overhead involved in implementing the mechanism, whatever it is. It consumes a good
amount of execution time that may eventually lead to a potential degradation in the performance
of the system as a whole. Obviously, it becomes a hard task for designers to decide to what extent
the system can be made reliable so that a good balance of cost versus mechanism can be effectively
implemented.
9.9.3 FLEXIBILITY
For numerous reasons, one of the major requirements in the design of a distributed OS is its flex-
ibility. Some of the important reasons are:
The flexibility of a distributed OS is critically influenced mostly by the design model of the kernel,
because the kernel is the central part of the system that controls and provides the basic system facilities which, in
turn, offer user-accessible features. Different kernel models are available, each with its own
merits and drawbacks, and different distributed OSs have been built based on them. The ultimate objec-
tive is then to formulate the design of the OS in such a way that easy enhancement in and around
the existing kernel is possible with minimum effort and hindrance, irrespective of the type
of model chosen.
9.9.4 SCALABILITY
Scalability is one of the most important features of open distributed systems and refers to the ability
of a given system to expand by adding new machines or even an entire sub-network to the existing
system so that increased workload can be handled without causing any serious disruption of services
or notable degradation in system performance. Obviously, there exist some accepted principles that
are to be followed as guidelines when designing scalable distributed systems. Some of them are:
• Try to avoid using centralized entities in the design of a distributed system, because the
presence of such entities often hinders making the system scalable. Also, since
such an entity is a single point, it often makes the system suffer from a
bottleneck, which is inherent in such a design, as the number of users increases. In addition,
in the event of failure of this entity, the system may exceed its fault-tolerance limit and
ultimately break down entirely.
• Try to avoid using centralized algorithms in the design of a distributed system. A central-
ized algorithm can be described as one that operates on a single node, collecting infor-
mation from all other nodes and finally distributing the result to them. For reasons similar
to those that disfavor the use of central entities in the design of a distributed system, the
presence of such algorithms in the design may be disastrous, particularly in the event of
failure of the central node that controls the execution of all the algorithms.
• It is always desirable to encourage client-centric execution in a distributed environment,
since this act can relieve the server, the costly common resource, as much as possible from
its increasing accumulated load of continuously providing services to several clients within
its limited time span. Although client-centric execution inherently possesses certain draw-
backs, which may give rise to several other critical issues that need to be resolved, still it
enhances the scalability of the system, since it reduces the contention for shared resources
at the server’s end while the system gradually grows in size.
9.9.5 PERFORMANCE
Realization of good performance from a distributed system under prevailing load conditions is
always an important aspect in design issues, and it can be achieved by properly designing and
organizing the various components of the distributed OS, with the main focus mostly on extracting
the highest potential of the underlying resources. Some of the useful design principles considered
effective for improved system performance are:
• Data migration: Data migration often provides good system performance. It is employed
mostly to reduce network latencies and improve response times of processes.
• Computation migration: This involves moving a computation to a site mostly because the
data needed by the computation is located there. Besides, this approach is often employed
to implement load balancing among the machines (CPUs) present in the system.
• Process migration: A process is sometimes migrated to put it closer to the resources it is
using most heavily, mainly in order to reduce network traffic, which, in turn, avoids many
hazards and thereby improves system performance appreciably. A process migration facil-
ity also provides a way to cluster two or more processes that frequently communicate with
one another on the same node of the system.
• Use of caching: Caching of data is a popular and widely used approach for yielding
improved overall system performance, because it makes data readily available from a relatively
speedy cache whenever needed, saving a large amount of computing time otherwise
spent on repeated visits to slower memory and also preserving network bandwidth
(a minimal sketch appears after this list). Use of the caching technique in general, including
the file caching used in DFSs, also reduces contention for centralized shared resources.
• Minimize data copying: Frequent copying of data often leads to sizeable overhead in
many operations. Data copying overhead is inherently quite large for read/write operations
on block I/O devices, but this overhead can be minimized to a large extent by using a disk
cache. By also using memory management optimally, it is often possible to significantly
reduce data movement as a whole between the participating entities,
such as the kernel, block I/O devices, clients, and servers.
• Minimize network traffic: A reduced traffic load on the network may also help to improve
system performance. One of the many ways to minimize network traffic is to use the process
migration facility, by which two or more processes that frequently communicate with each
other can be clustered on the same node of the system. This consequently reduces
the redundant to-and-fro traffic between those processes over the network and also
reduces the other effects caused by network latencies. Process migration as a whole
can also resolve other critical issues, as already discussed, that eventually reduce overall
network traffic. In addition, avoiding the collection of global state information over the
communication network, whenever possible, when making decisions may also help in reducing
network traffic.
• Batch form: Transferring data across the network batched as a large
chunk rather than as individual pages is sometimes more effective and often greatly
improves system performance. Likewise, piggybacking the acknowledgement of previous
messages onto the next message during transmission of a series of messages between com-
municating entities also yields improved system performance.
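A minimal sketch of the caching idea referred to in the list above is shown below; the block contents and the fetch_from_server() stand-in for a network round trip are, of course, invented.

# Minimal client-side cache: repeated reads of the same block are served
# locally instead of crossing the network again and again.
remote_fetches = 0

def fetch_from_server(block_id):
    global remote_fetches
    remote_fetches += 1                    # stands in for a slow network round trip
    return f"contents of block {block_id}"

cache = {}

def read_block(block_id):
    if block_id not in cache:              # miss: go to the server once
        cache[block_id] = fetch_from_server(block_id)
    return cache[block_id]                 # hit: no network traffic at all

for _ in range(100):
    read_block(42)
print(remote_fetches)                      # 1, not 100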
9.9.6 SECURITY
Security aspects gain a new dimension in a distributed system and are truly more difficult to enforce
than in a centralized system, mainly due to the lack of a single point of control and the presence
of insecure networks attached to widely spread systems for the needed communication. Moreover, in
a client–server model, a frequently used server must have some way to know the client at the time
of offering services, but the client identification field in the message cannot be entirely trusted
due to the likely presence of an intruder and impersonation activities. Moreover, interprocess
messages may sometimes pass through a communication processor that operates under a different
OS. An intruder can at any point gain control of such a computer system during transmission and
either tamper with the messages passing through it or willfully use them to perform impersonation.
Therefore, a distributed system, as compared to a centralized system, should enforce several addi-
tional measures with respect to security.
Similar to the aspects and related mechanisms discussed in Chapter 8 (“Security and Protection”),
designers of a distributed system should equally address all the protection issues and incorporate
different established techniques, including the well-known practical method of cryptography, to
enforce security mechanisms as much as possible to safeguard the entire computing environment.
In addition, special techniques for message security and authentication should be incorporated to
prevent different types of vulnerable attacks that might take place at the time of message passing.
9.9.7 HETEROGENEITY
A heterogeneous distributed system is perhaps the most general one; it consists of interconnected sets of dissimilar hardware and software, providing the flexibility of employing different computer platforms for a diverse spectrum of applications used by different types of users. Incompatibilities in heterogeneous systems also include the presence of a wide range of different types of networks interconnected via gateways, each with its own topology and related communication protocols. Effective design of such systems is therefore critically more difficult than that of its counterpart, homogeneous systems, in which closely related hardware is operated by similar or compatible software.
Heterogeneous distributed systems often make use of information with different internal formats and as such need some form of data conversion between two incompatible systems (nodes) at the time of their interactions. The data conversion job, however, is a critical one that may be
performed using a specific add-on software converter, either at the receiver's node, which must then be able to convert every format used in the system into its own format, or at the sender's node with a similar approach. The software complexity of this conversion process can be reduced by choosing an appropriate intermediate standard format, ideally the most common format in the system, which minimizes the number of conversions needed when various types of different systems (nodes) interact.
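As a small illustration of the intermediate-format idea, the following sketch (purely hypothetical; the record layout and function names are invented for this example) converts a record into network byte order as the agreed canonical format before transmission and back into the host's native order on receipt, so each node needs only one pair of conversion routines regardless of its own internal format.

    #include <stdint.h>
    #include <stdio.h>
    #include <arpa/inet.h>   /* htonl()/ntohl(): host <-> canonical (network) byte order */

    /* A record exchanged between heterogeneous nodes. */
    struct sensor_record {
        uint32_t node_id;
        uint32_t reading;
    };

    /* Sender side: convert from the local format to the agreed canonical format. */
    static void to_canonical(struct sensor_record *r)
    {
        r->node_id = htonl(r->node_id);
        r->reading = htonl(r->reading);
    }

    /* Receiver side: convert from the canonical format back to the local format. */
    static void from_canonical(struct sensor_record *r)
    {
        r->node_id = ntohl(r->node_id);
        r->reading = ntohl(r->reading);
    }

    int main(void)
    {
        struct sensor_record r = { 7, 42 };
        to_canonical(&r);     /* this canonical form is what travels over the network */
        from_canonical(&r);   /* the receiving node reconstructs its local form */
        printf("node %u, reading %u\n", (unsigned)r.node_id, (unsigned)r.reading);
        return 0;
    }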
Another heterogeneity issue in a distributed system is related to the file system that enables the distributed system to accommodate several different storage media. The file system of such a distributed system should be designed in such a way that it can allow the integration of a new type of workstation or storage media in a relatively simple manner.
Shared-bus systems are relatively simple and popular, but their scalability is limited by bus and memory contention. Crossbar systems allow fully parallel connections between processors
and different memory modules, but their cost and complexity grow quadratically with the number of nodes. Hypercubes and multilevel switches are scalable, and their complexities grow only logarithmically with the number of nodes. However, the type of interconnection network used and the nature of the interconnection path have a significant influence on the bandwidth and saturation of system communications, apart from other important issues such as cost, complexity, interprocessor communication, and above all the scalability of the presented architectures, which determines to what extent the system can be expanded to accommodate a larger number of processors.
In multiprocessors, multiple processors communicate with one another and with non-local memory (memory local to some other processor) as well as with commonly shared remote memory, organized as multiple physical banks, using communication networks. Peripherals can also be attached using some other form of sharing. Many variations of this basic scheme are possible. These organizational models, however, may give rise to two primary points of contention: the shared memory and the shared communication network itself. Cache memory is often employed to reduce contention. In addition, each processor may have an additional private cache (local memory) to further speed up operation. Here, shared memory does not mean that there is only a single centralized memory to be shared. Multiprocessors using shared memory give rise to three different models that differ in how the memory and peripheral resources are connected: shared or distributed. Three such common models are found: UMA, NUMA, and no remote memory access (NORMA).
Symmetric Multiprocessors (SMPs): UMA Model: An SMP is a centralized shared-memory machine in which each of n processors can uniformly access any of m memory modules at any point in time. The UMA model of multiprocessors can be divided into two categories: symmetric and asymmetric. When all the processors in the bus-based system share equal access to all I/O devices, through the same channels or different channels that provide paths to the same devices, the multiprocessor is called an SMP. When the SMP uses a crossbar switch as an interconnection network, replacing the common bus, then all the processors in the system are allowed to run I/O-related interrupt service routines and other supervisor-related (kernel) programs. However, other forms
of interconnection network in place of a crossbar switch can also be used. In the asymmetric cat-
egory, not all but only one or a selective number of processors in the multiprocessor system are
permitted to additionally handle all IO and supervisor-related (kernel) activities. Those are treated
as master processor(s) that supervise the execution activities of the remaining processors, known
as attached processors. However, all the processors, as usual, share uniform access to any of m
memory modules.
The UMA model is easy to implement and suitable in general-purpose multi-user applications
under time-sharing environments. However, there are several drawbacks of this model. The dispar-
ity in speed between the processors and the interconnection network consequently results in appre-
ciable degradation in system performance. Interconnection networks with speeds comparable with
the speed of the processor are possible but are costly to afford and equally complex to implement.
Inclusion of caches at different levels (such as L1, L2, and L3) with more than one CPU improves
the performance but may lead to data inconsistencies in different caches due to race conditions. The
architecture should then add the needed cache coherence protocol to ensure consistencies in data
that, in turn, may increase the cost of the architecture and also equally decrease the overall system
performance. In addition, bus-based interconnection networks in the UMA model are not at all conducive to scalability, and the bus becomes a bottleneck when the number of CPUs is increased. Use of a crossbar switch would make the system scalable, but only moderately, and the addition of CPUs requires proportionate expansion of the crossbar switch, whose cost may not vary linearly with the number of CPUs. Moreover, the delays caused by the interconnection network also gradually increase, which clearly indicates that the SMP cannot reasonably scale beyond a small number of CPUs.
In the UMA model, parallel processes must communicate in software using some form of message passing, by putting messages into a buffer in the shared memory or by using lock variables in the shared memory. The simplicity of this approach may, however, lead to a potential resource conflict, which can be resolved by injecting appropriate delays into the execution stages, at the price of a slight degradation in system performance. Normally, interprocessor communication and synchronization are carried out using shared variables in the common memory.
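A minimal user-level sketch of this style of communication is shown below, with POSIX threads standing in for processes running on different CPUs of a UMA machine; the mutex plays the role of the lock variable guarding a message buffer kept in shared memory (the buffer and flag names are illustrative only).

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    /* Shared memory visible to all processors in a UMA system. */
    static char msg_buffer[64];
    static int  msg_ready = 0;                 /* flag protected by the lock variable */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *sender(void *arg)
    {
        pthread_mutex_lock(&lock);             /* acquire the lock variable */
        strcpy(msg_buffer, "hello from CPU 0");
        msg_ready = 1;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    static void *receiver(void *arg)
    {
        int done = 0;
        while (!done) {
            pthread_mutex_lock(&lock);
            if (msg_ready) {                   /* message placed in the shared buffer? */
                printf("received: %s\n", msg_buffer);
                done = 1;
            }
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t s, r;
        pthread_create(&s, NULL, sender, NULL);
        pthread_create(&r, NULL, receiver, NULL);
        pthread_join(s, NULL);
        pthread_join(r, NULL);
        return 0;
    }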
Distributed Shared Memory Multiprocessors: NUMA Model: A comparatively attractive alter-
native form of a shared-memory multiprocessor system is a NUMA multiprocessor, where the shared
memory is physically distributed (attached) directly as local memory to all processors so that each proces-
sor can sustain a high computation rate due to faster access to its local memory. A memory unit local to a
processor can be globally accessed by other processors with an access time that varies by the location of
the memory word. In this way, the collection of all local memory forms a global address space shared by
all processors. NUMA machines are thus called distributed shared-memory (DSM) or scalable shared-memory architectures. The BBN TC-2000 is such a NUMA machine, using a total of 512 Motorola 88100 RISC processors, with each local memory connected to its processor by a butterfly switch (Chakraborty, 2020). A slightly different implementation of a NUMA multiprocessor adds a physical remote (global) shared memory to the usual distributed memory that is local to a processor but global to other processors. As a result, this scheme forms a memory hierarchy in which each processor has the fastest access to its local memory, the next fastest access to global memories that are individually local to other processors, and the slowest access to the remote large shared memory (Chakraborty, 2020).
Similar to SMP, the NUMA model architecture must also ensure coherence between caches
attached to CPUs of a node as well as between existing non-local caches. Consequently, this require-
ment, as usual, may cause memory accesses to be slowed down and consume part of the bandwidth
of interconnection networks, apart from increasing the cost of the architecture and also equally
decreasing overall system performance.
Usually, the nodes in a NUMA (DSM) architecture are high-performance SMPs, each contain-
ing around four or eight CPUs to form a cluster. Due to the presence of a non-local communication
network to connect these clusters, performance of such NUMA architecture is scalable when more
nodes are added. The actual performance of a NUMA system, however, mostly depends on the non-
local memory accesses made by the processes following the memory hierarchy during their execu-
tion. This issue falls within the domain of OS and will be addressed in the next section.
Multiprocessor systems are best suited for general-purpose multi-user applications where major
thrust is on programmability. Shared-memory multiprocessors can form a very cost-effective
approach, but latency tolerance while accessing remote memory is considered a major shortcoming.
Lack of scalability is also a major limitation of such a system.
Brief details on this topic with figures are given on the Support Material at www.
routledge.com/9781032467238.
• UMA kernel on SMP: The fundamental requirement of the operating system driving an
SMP architecture suggests that any CPU present in the system is permitted to execute OS
kernel code at any instant, and different CPUs can execute OS code almost in parallel or at
different times. Temporarily, the processor that executes the OS code has a special role and acts as a master in the sense that it schedules the work of the others. The OS is, however, not bound to any specific processor; it floats from one processor to another. Hence, the symmetric organization is sometimes called floating master. The operating system here is more or less a single, large critical section and is mostly monolithic; very little of its code, if any, is executed in parallel. This, in turn, requires that there be sufficient provision for any CPU to equally communicate with the other CPUs in the system, and any CPU should be able
to initiate an I/O operation of its own on any device in the system at any point in time.
Otherwise, if only one or just a few CPUs have access to I/O devices, the system becomes
asymmetric. To satisfy the condition that each CPU be able to carry out its own I/O opera-
tion, the interconnection network that connects the CPUs in the system must provide some
arrangements to connect the I/O also so that the I/O interrupts are directed to the respec-
tive CPU that initiated the I/O operation or to some other processor in the system that is
kept dedicated to this purpose.
To fulfill the requirement for communication between the CPUs, the kernel reserves an area in its memory known as the communication area (similar to a uniprocessor architecture in which the CPU communicates with a separate I/O processor [Chakraborty, 2020]). Whenever a CPU C1 intends to
communicate with another CPU C2, it places needed information in C2’s communication area and
issues an IPI in C2. The processor C2 then picks up this information from its own communication
area and acts on it accordingly.
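The sketch below suggests one plausible shape for such per-CPU communication areas; the structure layout and the send_ipi() routine are illustrative assumptions only, since the real arrangement is hardware- and kernel-specific.

    #include <string.h>

    #define MAX_CPUS 8
    #define MSG_SIZE 64

    /* One communication area per CPU, kept in kernel memory shared by all CPUs. */
    struct comm_area {
        volatile int has_message;      /* set by the sending CPU, cleared by the receiver */
        char payload[MSG_SIZE];
    };

    static struct comm_area comm[MAX_CPUS];

    /* Hypothetical hardware hook: raise an interprocessor interrupt on 'cpu'. */
    extern void send_ipi(int cpu);

    /* CPU 'from' posts a request into CPU 'to's communication area and interrupts it. */
    void post_request(int from, int to, const char *text)
    {
        strncpy(comm[to].payload, text, MSG_SIZE - 1);
        comm[to].payload[MSG_SIZE - 1] = '\0';
        comm[to].has_message = 1;
        send_ipi(to);                  /* the IPI handler on 'to' then reads comm[to] */
    }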
As the SMP kernel can be accessed and shared by many CPUs in the system at any point in
time, the OS code should be reentrant (see Section 5.8.1.2.4). Some parallelism may be introduced
at the OS level by identifying routines that can access shared data structures concurrently and by
protecting them with appropriate interlocks. Since all communication is done by manipulating data at shared memory locations (the communication area), it is essential to ensure mutual exclusion over these kernel data structures, both to protect the data from simultaneous access and to synchronize processes. This can be accomplished with semaphores, possibly counting semaphores but especially binary semaphores (see Section 4.2.1.4.9), sometimes referred to as mutex locks (variables), or with monitors that carry out lock and unlock operations. The mutex lock can take on only the values 0 and 1. Locking a mutex succeeds only if the mutex is 1; otherwise the calling process is blocked. Similarly, unlocking a mutex means setting its value to 1, unless some waiting process can thereby be unblocked. The semaphore operation itself must also be atomic, meaning that once a semaphore operation has started, no other process can access the semaphore until the ongoing operation is completed (or until a process blocks).
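The following sketch shows a binary semaphore (mutex lock) of the kind just described, with value 1 meaning free and 0 meaning locked, built here on POSIX primitives purely for illustration rather than as the kernel's actual code.

    #include <pthread.h>

    /* A binary semaphore (mutex lock): value 1 means free, 0 means locked. */
    struct mutex_lock {
        int value;                         /* 0 or 1 */
        pthread_mutex_t guard;             /* makes updates to 'value' atomic */
        pthread_cond_t  freed;             /* blocked processes wait here */
    };

    void mutex_init(struct mutex_lock *m)
    {
        m->value = 1;
        pthread_mutex_init(&m->guard, NULL);
        pthread_cond_init(&m->freed, NULL);
    }

    /* Locking succeeds only if the value is 1; otherwise the caller blocks. */
    void mutex_lock_acquire(struct mutex_lock *m)
    {
        pthread_mutex_lock(&m->guard);
        while (m->value == 0)
            pthread_cond_wait(&m->freed, &m->guard);
        m->value = 0;
        pthread_mutex_unlock(&m->guard);
    }

    /* Unlocking sets the value back to 1 and wakes one waiting process, if any. */
    void mutex_lock_release(struct mutex_lock *m)
    {
        pthread_mutex_lock(&m->guard);
        m->value = 1;
        pthread_cond_signal(&m->freed);
        pthread_mutex_unlock(&m->guard);
    }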
The number of locks used in the system to enforce the needed mutual exclusion is a vital design issue, since it directly affects the performance of the system. If a single lock is used to control access to all kernel data structures, then at any instant only one processor can use the data structures. If separate locks are provided to control individual data structures, then many processors can access different data structures in parallel, thereby increasing system performance.
However, the use of many locks may invite a situation of deadlock when a processor attempts to
access more than one data structure. Necessary arrangements should thus be made to ensure that
such deadlocks do not arise.
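One common arrangement for preventing such deadlocks is to impose a fixed global order in which kernel locks must be acquired; the sketch below illustrates that convention (the lock array and pair-locking helpers are invented for this example and do not reflect any particular kernel).

    #include <pthread.h>

    #define NUM_KERNEL_LOCKS 4

    /* One lock per kernel data structure, identified by a fixed index. */
    static pthread_mutex_t kernel_lock[NUM_KERNEL_LOCKS] = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
    };

    /* Acquire two kernel locks in a fixed global order (lower index first),
     * so two processors needing the same pair can never deadlock. */
    void lock_pair(int a, int b)
    {
        int first  = (a < b) ? a : b;
        int second = (a < b) ? b : a;
        pthread_mutex_lock(&kernel_lock[first]);
        pthread_mutex_lock(&kernel_lock[second]);
    }

    void unlock_pair(int a, int b)
    {
        int first  = (a < b) ? a : b;
        int second = (a < b) ? b : a;
        pthread_mutex_unlock(&kernel_lock[second]);   /* release in reverse order */
        pthread_mutex_unlock(&kernel_lock[first]);
    }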
An SMP kernel is a natural first step in OS implementation and is relatively easy to realize. It is equally easy to port an existing uniprocessor operating system, such as UNIX, to a shared-memory UMA multiprocessor. The shared memory contains all of the resident OS code and data structures. The largely monolithic UNIX kernel may then be executed by different processors at different times, and process migration is almost trivial if the state is saved in shared memory. Simultaneous (parallel) execution of different applications is quite easy and can be achieved by maintaining a queue of ready processes in shared memory. Processor allocation then consists only of assigning the first ready process to the first available processor until either all processors are busy or the ready queue of processes is empty. In this way, each processor, whenever available, fetches the next work item from the queue (see the sketch below). Management of such shared queues in multiprocessors is, however, a different area, and is precisely the subject matter of processor synchronization, which will be discussed next.
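The sketch below pictures the arrangement just described: a single ready queue kept in shared memory and protected by one lock, from which any available processor pulls the next process to run (the queue size and function names are illustrative only).

    #include <pthread.h>

    #define QUEUE_SIZE 128

    /* A single ready queue kept in shared memory and used by every CPU. */
    static int ready_queue[QUEUE_SIZE];        /* process ids */
    static int head = 0, tail = 0;
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called by any CPU when it becomes available: fetch the next ready process. */
    int fetch_next_process(void)
    {
        int pid = -1;                          /* -1 means the ready queue is empty */
        pthread_mutex_lock(&queue_lock);
        if (head != tail) {
            pid = ready_queue[head];
            head = (head + 1) % QUEUE_SIZE;
        }
        pthread_mutex_unlock(&queue_lock);
        return pid;
    }

    /* Called when a process becomes ready (e.g. created or unblocked). */
    void add_ready_process(int pid)
    {
        pthread_mutex_lock(&queue_lock);
        ready_queue[tail] = pid;
        tail = (tail + 1) % QUEUE_SIZE;
        pthread_mutex_unlock(&queue_lock);
    }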
Further improvement in the performance of the operating system can be realized if the operating system is designed and developed as a set of cooperating threads, with subsequent scheduling of those threads and synchronization among them carried out using proper mechanisms, such as semaphores or messages, as already discussed. In this environment, threads can be used to exploit true parallelism in an application. If the various threads of an application can be made to run simultaneously on separate processors, potential parallelism in the OS can be attained. Consequently, this will not only yield dramatic gains in performance but at the same time enable the OS to be ported to different equivalent hardware platforms, both tightly coupled and loosely coupled.
One of the distinct advantages of SMP is that it can continue its normal operation even in the event of failure of some CPUs, affected only by a graceful degradation in system performance. Failure of a processor in most situations is not very severe for the operation of the other processors present in the system if the failed processor was not executing kernel code at the time
of failure. At best, only the processes using the service of the failed processor would be affected and would henceforth be barred from receiving its service, which may affect the total performance of the entire system only to some extent.
• NUMA kernel with DSM: The NUMA scheme forms a memory hierarchy where each
CPU has the fastest access to its local memory. The next is access to global memories
which are individually local to other CPUs. The slowest is access to remote shared memory.
The actual performance of a NUMA system thus mainly depends on non-local memory
accesses made by the processes following the memory hierarchy during their execution.
That is why every node in the system must be given its own kernel that can control the pro-
cesses in local memory of the CPUs within the node. This ensures that processes consume
relatively less time in memory accesses, thereby yielding better performance, since most
of their accesses are only to local memory.
Providing a separate kernel to each node in the system exhibits several advantages. The entire
system is then divided into several domains, and there is a separate dedicated kernel that admin-
isters each such domain. The kernel in an individual node should always schedule a process on its
own CPU. This approach is expected to yield better system performance, since it ensures a high hit
ratio in the individual CPU’s own (L1) cache. Similarly, a high hit ratio in the L3 cache (the cache
within the cluster of a group of CPUs forming a node) could also be obtained if the memory is allo-
cated to a process within a single local memory unit.
The kernel of a node always attempts to allocate memory to all processes of a specific application in the same memory unit and assigns them to the same small set of CPUs for their execution. This
idea forms the notion of an application region that usually consists of a resource partition and the
executing kernel code. The resource partition contains one or more CPUs, some local memory units
and a few available I/O devices. The kernel of the application region executes processes of only one
application. In this way, the kernel can optimize the performance of application execution through
willful scheduling and high cache-hit ratios with no interference from the processes of other appli-
cations. Most of the operating systems developed for the NUMA model exploit this approach or an
equivalent one.
The introduction of a separate kernel for each node in the NUMA architecture, or the inclusion of the application region model, can also cause some disadvantages. The separate kernel approach suffers from several problems inherent in this type of partitioning that cause underutilization of resources, because idle resources belonging to one partition cannot be used by processes of other partitions. Similarly, the application region concept affects reliability, because failure of resources in one partition may cause delays in processing, may even require abnormal termination, or may require the support of resources belonging to other partitions that cannot be provided immediately to compensate for the loss due to such a failure. In addition, non-local memory may become a bottleneck, and access to it becomes more complex, since it is used by the domains of more than one kernel.
CPUs in a multiprocessor system can reduce the synchronization delay that usually arises in traditional uniprocessor systems in the form of busy waiting (letting a process loop until the synchronization condition is met) or blocking of a process (wait and signal).
In a multiprocessor system, processes can run in parallel on different CPUs. At the time of synchronization, it is sometimes preferable to let a process loop rather than block it, if the CPU overhead of blocking the process and scheduling another process, followed by activating the blocked process and rescheduling it again, exceeds the amount of time for which the process would loop. In short, busy waiting is preferred only when there is a reasonable expectation that it will be of relatively short duration, for example because the shared resources over which the process loops may be quickly released by processes executing on other CPUs, or because the time needed by the other CPU to execute its critical section is comparatively small. This situation arises if a process looping for entry to a critical section and the process holding the critical section are scheduled almost in parallel.
Additional details on process synchronization are given on the Support Material at www.
routledge.com/9781032467238.
• Queued lock: The traditional lock used in uniprocessor systems for process synchronization is known as a queued lock. When a process Pi executing on CPU Ck attempts to enter a critical section, the operating system performs certain actions on the corresponding lock L. The lock L is tested. If it is available (not set), the kernel sets the lock and allows process Pi to enter the critical section to perform its execution. If the lock is not available (already set by another process), process Pi is preempted, and its request for the lock is recorded in a queue (a wait and signal mechanism). Some other process is then scheduled by the OS on CPU Ck for execution. Since the effect of this lock is to place waiting processes in a queue, the lock is called a queued lock.
Figure 9.6(b) shows that process Pi is blocked because the lock is unavailable, its id is recorded in the queue of lock L, and some other process Px is then scheduled to run on Ck. When the process that is using the lock completes its execution in the critical section, the lock is released, and some other process waiting in L's queue is awarded the lock and activated. The entire activity is supervised and carried out by the kernel. A semaphore can be used to implement a queued lock in a multiprocessor system: the semaphore is declared as a shared variable and is updated as required by the semaphore definition, with the aid of instructions implementing the wait and signal mechanism.
FIGURE 9.6 A schematic graphical representation of queued, spin, and sleep locks used in multiprocessor systems to realize process synchronization.
The average length of the queue of blocked processes for a lock determines whether the solution is
scalable. If the processes do not require locks very often, the length of the queue is relatively small
and is usually limited by a constant value, say, m, so increasing the number of CPUs or the processes
in the system in this situation will not affect (increase) the average delay in gaining access to the
lock. This solution is said to be scalable. But when processes require locks very often, the length of
the queue may be relatively large and will become proportional to the number of processes present
in the system. The solution in this situation is said to be not scalable.
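A small user-level sketch of a queued lock is given below; instead of a kernel queue of preempted processes, it blocks callers on a condition variable and serves them strictly in arrival order, which mimics the wait-and-signal behaviour in miniature (the interface names are invented for this illustration).

    #include <pthread.h>

    /* A queued lock: requests are served strictly in the order they arrive,
     * and a process whose request cannot be granted is blocked (not spinning). */
    struct queued_lock {
        unsigned long next_ticket;     /* next ticket to hand out */
        unsigned long now_serving;     /* ticket currently allowed to hold the lock */
        pthread_mutex_t guard;
        pthread_cond_t  turn;
    };

    void qlock_init(struct queued_lock *q)
    {
        q->next_ticket = 0;
        q->now_serving = 0;
        pthread_mutex_init(&q->guard, NULL);
        pthread_cond_init(&q->turn, NULL);
    }

    void qlock_acquire(struct queued_lock *q)
    {
        pthread_mutex_lock(&q->guard);
        unsigned long my_ticket = q->next_ticket++;   /* join the end of the queue */
        while (q->now_serving != my_ticket)
            pthread_cond_wait(&q->turn, &q->guard);   /* blocked: wait and signal */
        pthread_mutex_unlock(&q->guard);
    }

    void qlock_release(struct queued_lock *q)
    {
        pthread_mutex_lock(&q->guard);
        q->now_serving++;                             /* award the lock to the next in queue */
        pthread_cond_broadcast(&q->turn);
        pthread_mutex_unlock(&q->guard);
    }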
• Spin lock: While a queued lock relies on a wait and signal mechanism, the spin lock is quite different and relies on busy waiting, as already observed in traditional uniprocessor operating systems. When a process Pi attempts to acquire a lock and fails to set it, because it is already set by another process, process Pi is not preempted and does not relinquish control of the CPU on which it is executing. Instead, it keeps checking the lock with repeated attempts to see whether it is free, until it succeeds. In this way, the CPU does no productive work but remains busy testing the lock continually, going around a loop spinning on the lock. That is why such a lock is called a
spin lock. This is depicted in Figure 9.6(c), in which Pi does not relinquish control of CPU Ck, which now spins on lock L, as shown by an arrow. MontaVista Linux Professional Edition, a derivative of the Linux 2.4 kernel with a fully preemptive scheduler, has a fine-grained locking mechanism inside the SMP kernel for improved scalability. The design of this kernel exploits these services to allow user tasks to run concurrently as separate kernel-mode threads on different processors. Threads in Windows running on an SMP use spin locks to implement mutual exclusion when accessing kernel data structures. To guarantee that kernel data structures do not remain locked for a prolonged period of time, the kernel never preempts a thread holding a spin lock if some other thread tries to acquire the spin lock. This way, the thread holding the lock can finish its critical section and release the lock at the earliest possible time.
• TSL instruction: The atomic implementation of a test-and-set-lock instruction with the aid of indivisible memory read-modify-write (RMW) cycles, as introduced for uniprocessor synchronization using a spin lock mechanism, can extend its test-and-set functionality to shared-memory multiprocessors. A semaphore can be implemented for this purpose in a multiprocessor system by declaring it as a shared variable and updating it as required by the semaphore definition with the aid of a test-and-set instruction.
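The idea can be illustrated with C11 atomics, whose atomic_flag_test_and_set performs exactly the kind of indivisible read-modify-write described above; the sketch below is a user-level illustration rather than kernel code.

    #include <stdatomic.h>

    /* A spin lock built on an atomic test-and-set (read-modify-write) primitive. */
    static atomic_flag lock = ATOMIC_FLAG_INIT;

    void spin_acquire(void)
    {
        /* Keep testing and setting until the previous value was clear (lock was free). */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;   /* busy-wait: the CPU spins rather than blocking the process */
    }

    void spin_release(void)
    {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }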
The use of spin locks is disfavored in many situations because they can create severe degradation in system performance, mainly by keeping the CPU engaged in no productive work while denying other deserving processes its service. Besides, the use of spin locks often creates traffic on the memory bus and consumes the bandwidth of the links that connect processors to shared memory. In addition, several processors spinning on a lock can cause contention at the memory module containing the semaphore variable and thus impair access by other processors to the enclosing memory bank. In multiprocessor systems, multiple caches are often used for the sake of performance improvement. Use of spin locks in such systems can result in increased bus traffic needed to maintain consistency among the copies of the semaphore variable that reside in the individual caches of the competing processors. Depending on the type of cache-coherence scheme in use, additional cache-related problems may arise that can further degrade system performance.
The use of spin locks in NUMA systems may sometimes exhibit a critical situation commonly
known as lock starvation. In this situation, a lock might be denied for a considerable duration of time, possibly indefinitely. Assume that a process Pi attempts to set a lock L residing in its non-local
memory. Let two other processes Pk and Pm which exist in the same node as the lock also attempt
to set it. Since access to local memory is much faster than access to non-local memory, processes
Pk and Pm spin much faster on the lock than process Pi does. Hence, they will always get an oppor-
tunity to set the lock before Pi. If they repeatedly set and use the lock, Pi may not be able to get its
turn. The situation may become worse if many other processes arrive by this time that are local to
the lock; process Pi will then be even further delayed in getting its turn to gain access to the lock,
thereby waiting for a considerable duration of time and facing acute starvation. To avoid such star-
vation, many effective schemes have been proposed, the details of which are outside the scope of
our discussion.
However, the use of spin locks has some advantages in certain situations. When the number of processes does not exceed the number of CPUs present in the system, there is no justification for preempting a process, and it is simply unnecessary; rather, it is preferable to allow the CPU to spin on the lock until it succeeds. In addition, it is sometimes profitable to let a process loop rather than block it, as already discussed earlier in this section. Preemption in this situation is simply counterproductive.
Real-time applications, on the other hand, prefer the use of spin locks to synchronize processes. The reason is that a CPU spinning on a lock can still handle incoming interrupts, and the process executing on this CPU can equally handle signals. This feature is essentially important in real-time environments, in which processing of interrupts and signals is highly time-critical, and any delay in this regard may cause a deadline to be missed. However, since spin locks often generate traffic on the memory bus or across the network while the CPU continues spinning on the lock, they are considered not scalable.
• Sleep lock: When a process Pi attempts to acquire a sleep lock which is already set by
another process, the CPU associated with process Pi is not released but is put in an unusual
state called a sleep state. In this state, the CPU neither executes any instructions nor
responds to any interrupts except interprocessor interrupts (IPIs). The CPU simply waits
for the release of the lock to be reported by the kernel and hence generates no additional
traffc on the memory bus or across the network. This is illustrated in Figure 9.6(d) with
an × mark against all interrupts except IPIs. Sleep locks are sometimes preferred when the
memory or network traffc is already high.
The CPU that sets the lock and later releases it has the responsibility of sending IPIs to all the CPUs that are sleeping on the lock. This obligation, in turn, involves increased administrative overhead, since generating the IPIs, as well as servicing them, requires a context switch and execution of the associated kernel code. The sleep lock, however, yields poor performance if there is heavy
contention for a lock, but its performance is moderately good if the traffic density over the lock is comparatively low. The use of sleep locks may be hazardous in real-time applications, since the response time may exceed the specified deadline.
Each and every type of lock, as discussed in process synchronization, provides certain advan-
tages in some situations and equally creates some problems in other situations. Additional hardware
components can be used in the system architecture to avoid the performance problems caused by
different types of locks while retaining all the advantages they usually offer.
FIGURE 9.7 A representative illustration of SLIC bus used as an additional hardware support in multipro-
cessor system to implement process synchronization.
simultaneously, an arbitration mechanism will be used by the hardware to select one of the CPUs to
award the lock to. SLIC has been successfully implemented in the Sequent Balance system.
Use of the SLIC approach provides several advantages. First, the presence of a dedicated special synchronization SLIC bus relieves the memory bus from carrying additional load, thereby reducing traffic congestion on it. Second, the use of a spin lock rather than a sleep lock avoids the need to generate and subsequently service IPIs, thereby alleviating the additional administrative overhead and relieving the system from significant performance degradation. Last but not least, the CPU here spins on a local lock. Since access to local memory is always much faster than access to non-local memory, this improves system performance and at the same time does not generate any additional memory or network traffic.
Under this arrangement, individual processors are dedicated to only one task at a time. Each manager keeps track of the number of its available workers. A work request can be submitted to a manager at any level of the hierarchy. Upon receiving a work request calling for R processors, the respective manager (processor) must secure R or more processors, since some of them may fail. If the selected manager does not find a sufficient number of worker processors, it calls on its own manager at the next higher level for help. This course of action continues until the request is satisfied (i.e. a sufficient number of available processors is secured) or the top of the hierarchy is reached. If the request cannot be filled even after reaching the top level, it may be set aside for the time being, pending availability of the required amount of resources.
Such a hierarchical allocation of processors fits a robust implementation nicely and scales well due to the relatively limited amount of traffic generated. The fault tolerance of both the master and worker processors can be attained to a reasonable degree. The work of a failed master can be assigned to one of its siblings, to one of its subordinates, or even to another processor designated by its superior. Failure of a processor residing at the top of the hierarchy can likewise be negotiated by the selection (election) of a suitable successor. Similarly, the failure of a worker processor does not cause any major hazard, since it can be handled by migrating its work to some other processor (node), where it can then be resumed in a suitable manner.
In spite of having several advantages, wave scheduling exhibits some practical difficulties when implemented. First of all, the manager must always be equipped with the latest status of its available workforce when attempting to allocate processors. This requires additional overhead each time estimates are calculated; moreover, the use of estimates may lead to inefficiencies if they are too conservative or to allocation failures if they are too optimistic. In addition, since multiple allocation requests arrive almost simultaneously and granting activity may occur at nearly the same time, resources estimated to be available may turn out to have been snatched by another party by the time a processor allocation actually occurs.
even more significant in systems with a large number of CPUs. That is why some operating systems do not favor a process-shuffling approach to mitigating scheduling anomalies. These anomalies may also be observed in some SMPs when each individual CPU is given the power to execute the kernel code independently so as to avoid bottlenecks in kernel code execution and to yield improved performance.
Process scheduling often includes scheduling different processes of a specific application concurrently on different CPUs to achieve computational speed-up by exploiting the principle of parallelism. Synchronization and communication among these processes then becomes a vital consideration that influences the choice of scheduling policy, since it directly affects the performance of the system. While synchronization of these processes mostly needs the service of a spin lock when several processes of an application are scheduled on different CPUs (co-scheduling), communication among these processes usually employs the message-passing technique. Again, the way this technique is implemented should be chosen very carefully; otherwise it may sometimes prevent co-scheduling itself from operating. Different approaches are used by different operating systems in this regard, and the kernel is then designed in such a way that it can make appropriate decisions to effectively implement co-scheduling.
1. Sharing loads
As already mentioned, processes in a multiprocessor system are not assigned to a specific processor. Simultaneous (parallel) execution of different threads is quite easy and can be achieved by maintaining a queue of ready threads. Processor allocation is then simply a matter of assigning the first ready thread to the first available processor until either all processors are busy or the ready queue of threads is empty. In this way, each processor, whenever available, fetches the next work item from the queue. This strategy is known as load sharing and is distinguished from the load balancing scheme, in which work is allocated for a relatively longer duration of time (in a more permanent manner). Load sharing is a natural choice and possibly the most fundamental approach
in which the existing load is distributed almost evenly across the processors, thereby offering sev-
eral advantages. Some of them are:
• It ensures that no processors remain idle when executable work is present in the system.
• No additional scheduler is needed by the operating system to handle all the processors. The existing scheduler routine is simple enough that it can be run on whichever processor becomes available to choose the next thread to execute.
• The arrangement and management of the global queue of threads do not require any additional mechanism. It can be treated as if it were running on a uniprocessor system and hence can be readily implemented along the lines of existing proven mechanisms, such as priority-based or execution-history-based (SJF) scheduling, as used in uniprocessor operating systems.
Similar to the strategies employed in process scheduling, as discussed in Chapter 4, there are different types of thread scheduling along much the same lines, of which three types of thread scheduling algorithms are of common interest:
• First Come First Served (FCFS): When a process arrives, all of its threads are placed
consecutively at the end of the ready queue of the threads. When a processor is available,
the thread at the front of the ready queue is assigned to the processor for execution until it
is completed or blocked.
• Smallest Number of Threads First (SNTF): This concept is similar to SJF in process scheduling. Here, the threads of the jobs with the smallest number of unscheduled threads are given the highest priority and hence are placed at the front of the ready queue (see the ordering sketch after this list). These threads are then scheduled for execution until they are completed or blocked. Jobs with equal priority are placed in the queue according to their time of arrival (FCFS).
• Preemptive SNTF (PSNTF): This is almost the same as the last. Here, too, the threads of the jobs with the smallest number of unscheduled threads are given the highest priority and are placed at the front of the ready queue to be scheduled for execution. The only exception is that the arrival of a new job with a smaller number of threads than an executing job preempts the executing thread of the currently running job, and the execution of the threads of the newly arrived job is then started until it is completed or blocked.
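The ordering used by SNTF (and, with preemption added, by PSNTF) can be illustrated with the short sketch below, which simply sorts a set of illustrative jobs by their count of unscheduled threads, breaking ties by arrival time as in FCFS; the job data are invented for the example.

    #include <stdio.h>
    #include <stdlib.h>

    /* Smallest Number of Threads First: order jobs by their count of unscheduled
     * threads, breaking ties by arrival time (FCFS).  Illustrative data only. */
    struct job {
        int id;
        int unscheduled_threads;
        int arrival_time;
    };

    static int by_sntf(const void *a, const void *b)
    {
        const struct job *x = a, *y = b;
        if (x->unscheduled_threads != y->unscheduled_threads)
            return x->unscheduled_threads - y->unscheduled_threads;
        return x->arrival_time - y->arrival_time;      /* FCFS tie-break */
    }

    int main(void)
    {
        struct job ready[] = {
            { 1, 5, 0 }, { 2, 2, 1 }, { 3, 2, 3 }, { 4, 8, 2 }
        };
        size_t n = sizeof ready / sizeof ready[0];

        qsort(ready, n, sizeof ready[0], by_sntf);     /* front of queue = highest priority */
        for (size_t i = 0; i < n; i++)
            printf("schedule job %d (threads left: %d)\n",
                   ready[i].id, ready[i].unscheduled_threads);
        return 0;
    }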
Each of these strategies, as usual, has several merits and certain drawbacks. The FCFS strategy is perhaps superior to the other two when a load-sharing approach is employed over a large number of jobs with a diverse spectrum of characteristics. However, the load-sharing approach also suffers from several disadvantages, mainly:
• Since the single shared ready queue is accessed by all processors and may be accessed
by more than one processor at the same time, a bottleneck may be created at that end;
hence, some additional mechanism is needed to ensure mutual exclusion. With a small
number of processors in the computer system, this problem is not noticeable. However,
with multiprocessors with a large number of processors, this problem truly persists and
may be acute in some situations.
• Thread switching leads to a small increase in overhead, which in total cannot be ignored. Moreover, a preempted thread is less likely to be scheduled once again on the same processor when it resumes its execution. Consequently, caching becomes less efficient if each processor is equipped with a local cache, and the potential of affinity-based scheduling cannot then be utilized.
• Since the load-sharing approach uses a central global queue of threads of all the ready processes, and those threads are usually scheduled at different times, likely onto different processors as and when those processors become available, it is unlikely that all of the threads of a specific program will be running on different processors at the same time. As a result, a program with a high degree of coordination among its threads may not be executed in the desired order at an appropriate time. This requires additional communication overhead and also the cost of extra process switches, which may eventually affect overall performance adversely.
In spite of these disadvantages, the potential advantages of load sharing legitimately outweigh them, which is why this approach is favored as one of the most commonly used schemes in contemporary multiprocessor systems. A further refinement of the load-sharing technique has been made to alleviate some of its potential disadvantages so as to fit it effectively into the environment. Those modifications have been implemented in the Mach operating system developed at Carnegie Mellon University using an SMP kernel structure as the platform.
For brief details on thread-processor scheduling in Mach OS, see the Support Material at www.
routledge.com/9781032467238.
2. Gang scheduling
This strategy is derived from the traditional concept (it predates the use of threads) in which a set of processes is scheduled to run simultaneously on a set of processors, giving rise to the concept of group scheduling, which exhibits several advantages. A few notable ones are:
• Since a single scheduling action covers many processors and processes in one shot, scheduling overhead is naturally reduced, because it avoids the individual scheduling of those processors and processes that would otherwise incur substantial overhead.
• It helps execute closely related processes in parallel on different processors, which consequently may reduce the blocking required at the time of synchronization, thereby avoiding costly process switching and ultimately improving performance as a whole.
Gang scheduling yields good performance if used intelligently. It sometimes requires prior knowledge of job characteristics for its proper handling, namely how many processors to assign to a program at a given time so that it makes acceptable progress. Consequently, gang scheduling in some specific forms is observed to be superior to the load-sharing approach in general.
3. Dedicated processor assignment
This strategy is just the reverse of the load-sharing scheme and close to the gang scheduling scheme. Here, each program is allocated a number of processors equal to the number of threads in the program, and they are kept dedicated for the entire duration of the program's execution. The scheduling of threads is then straightaway implicit, simply defined by the assignment of threads to processors. When the program is completed or terminated, all the processors are deallocated and returned to the pool of processors for subsequent use by other programs. It is evident that this approach appears to suffer from severe drawbacks as far as processor utilization is concerned. First, there is no scope for multiprogramming on any of the processors, which prevents other processes from sharing the processors during the lifetime of the executing program. Second, if a running thread of an application is blocked, perhaps while waiting for I/O or for the sake of synchronization with another thread, then the costly processor dedicated to that particular thread simply remains idle, yielding no productive output. However, counterarguments directly in favor
of this approach in terms of two distinct advantages can also be made:
• Since the processors are dedicated to different threads during the entire tenure of program execution, there is no need for any process switch, which in turn speeds up execution and thereby definitely improves performance as a whole.
• In systems having a large number of processors, in which the cost of the processors is appreciably small compared to the total cost of the system, processor utilization and processor-time wastage are usually not considered dominant parameters in evaluating the performance or efficiency of such a system. In fact, a highly parallel application having tight coordination among its threads fits properly in such a system under the stated approach and can be executed profitably and effectively.
A dedicated processor assignment strategy works efficiently, and processor resources are used effectively, if the number of active threads of an application can be kept limited, or even lower than the number of processors available in the system. Furthermore, the higher the number of threads, the worse the performance, because there is a high possibility of frequent thread preemption and subsequent rescheduling, among many other reasons, which eventually makes the entire system inefficient.
The scheduling issues of both the gang scheduling approach and dedicated processor assignment ultimately culminate in the subject of processor allocation. The problem of processor allocation in multiprocessors is more similar to memory management on a uniprocessor than to scheduling issues on a uniprocessor. The issue finally converges on how many processors are to be assigned to a program at any given instant to make execution efficient, which is analogous to how many page frames to allocate to a given process for its smooth execution. It has thus been proposed to define a term, activity working set, analogous to the virtual memory working set, that indicates the minimum number of activities (threads) which must be simultaneously scheduled on processors for the application to make a desired level of progress. Similar to memory-management schemes, if all of the elements of an activity working set cannot be simultaneously scheduled on their respective processors, a vulnerable situation known as processor thrashing may arise. This normally happens when the execution of some threads is required and thus scheduled, but this induces the de-scheduling of other threads whose services may soon be needed. Another situation, commonly called processor fragmentation, also encompasses processor
scheduling issues. This occurs when some processors are allocated while a few others are left over. The leftover processors are neither sufficient in number nor properly organized to fulfill the requirements of the processes waiting for execution. As a result, a good amount of processor resources is simply left idle. Both of the scheduling strategies under discussion suffer from many such drawbacks and thus need to properly address all these issues in order to avoid the problems they create.
4. Dynamic scheduling
This approach permits the number of threads in a process to be altered dynamically during the execution of the process, so that the operating system can adjust the load to improve processor utilization.
The scheduling decisions involved in this approach are a joint venture of the operating system and the related application. The operating system is simply responsible for partitioning the processors into groups among the jobs. The number of processors in a group may also change dynamically, monitored by the OS, according to the requirements of the currently running job. The responsibility of the operating system is primarily limited to processor allocation. Each job uses its own group of processors to execute a subset of its runnable tasks by mapping these tasks to threads as usual. The application being run largely determines the policy and mechanism of thread scheduling. The policy decides which subset of tasks is to run, which thread is to be suspended when a process is preempted, and similar other choices. The mechanism to implement the policy is typically built using a set of runtime library routines. Not all applications work well with this strategy; some applications that have a single thread by default respond well to it, while others, although they do not fit straightaway, can be programmed in such a way as to advantageously exploit this particular feature of the operating system.
The processor allocation done by the OS is mostly carried out in the following way. For a newly arrived job, only a single processor is allocated, obtained if necessary by taking one away from any currently running job that has been allocated more than one processor. When a job requests one or more additional processors, the request is satisfied if idle processors are available; otherwise the request cannot be serviced at that point in time and is set aside until a processor becomes available for it, or until the job on its own no longer finds any need for an extra processor. A sketch of this policy follows.
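The allocation policy just described is summarized in the sketch below; the counters and helper names are invented for illustration and do not correspond to any particular operating system's interface.

    #define MAX_JOBS 16

    /* Per-job processor counts in a dynamic-scheduling system (illustrative only). */
    static int allocated[MAX_JOBS];       /* processors currently held by each job */
    static int idle_processors = 8;       /* processors not assigned to any job    */

    /* A newly arrived job gets exactly one processor, taken from an idle one if
     * possible, otherwise from some running job that holds more than one. */
    int admit_new_job(int job)
    {
        if (idle_processors > 0) {
            idle_processors--;
            allocated[job] = 1;
            return 1;
        }
        for (int donor = 0; donor < MAX_JOBS; donor++) {
            if (allocated[donor] > 1) {
                allocated[donor]--;       /* take one away from a multi-processor job */
                allocated[job] = 1;
                return 1;
            }
        }
        return 0;                         /* no processor available: request is set aside */
    }

    /* A running job asks for one more processor: grant it only if one is idle. */
    int request_extra_processor(int job)
    {
        if (idle_processors > 0) {
            idle_processors--;
            allocated[job]++;
            return 1;
        }
        return 0;                         /* deferred until a processor is released */
    }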
This strategy requires considerably high overhead, since both the OS and the application are jointly involved in implementing it with the needed operations. This shortcoming often negates the performance advantages that may accrue from the strategy. However, for applications that can be designed and developed to take advantage of dynamic scheduling, this approach is clearly superior to its strong contenders (alternatives), gang scheduling and the dedicated processor assignment strategy.
to the usual coherence problem, which can, of course, be negotiated by employing some variation of the standard techniques used to handle the cache coherence problem.
A shared memory organization also helps enhance the message-passing technique, which eventually improves the performance of interprocess communication. This improvement is realized simply by avoiding the copying of messages when senders and receivers have access to the same physical shared memory. One serious drawback of this approach, however, is that any modification of the message by either party would violate the fundamental requirement of the message-passing mechanism; to get around this problem, a separate copy is ultimately needed for each party, which is again time consuming. This problem can be alleviated by employing the common, cost-effective copy-on-write technique so that high efficiency can ultimately be attained. Mach OS, in particular, exploited the copy-on-write technique rigorously to handle most issues relating to interprocess communication. There are also some operating systems for loosely coupled multiprocessors (such as machines using DSM) that provide the shared memory abstraction and implement the copy-on-write technique while using the message-passing facility.
The existence of shared memory also makes it attractive to extend the file system by mapping files into process virtual address spaces, which is accomplished using some form of appropriate primitives. This often helps to realize a potentially efficient mechanism for sharing open files.
TABLE 9.1
Comparison of three different operating systems on machines with different organizations of N CPUs. A representative table showing the comparison of salient features between multiprocessor (tightly coupled) operating systems, network operating systems, and distributed operating systems (multicomputers using middleware).

Does it appear as a virtual uniprocessor?
  Network OS: No, a collection of distinct machines. Distributed OS: Yes, single-system image. Multiprocessor OS: Yes, single-system image.
Do all nodes have to run the same operating system?
  Network OS: No. Distributed OS: Yes. Multiprocessor OS: Yes.
How many copies of the operating system are there?
  Network OS: N. Distributed OS: N. Multiprocessor OS: 1.
Is there a single run queue?
  Network OS: No. Distributed OS: No. Multiprocessor OS: Yes.
How is communication achieved?
  Network OS: Shared files. Distributed OS: Messages. Multiprocessor OS: Shared memory.
Does it require any agreed-upon communication protocols?
  Network OS: Yes. Distributed OS: Yes. Multiprocessor OS: No.
How is memory organized?
  Network OS: On an individual machine basis. Distributed OS: Distributed shared memory. Multiprocessor OS: Shared memory.
How are devices organized?
  Network OS: Usually on an individual machine basis. Distributed OS: Pool of the same type of devices shared. Multiprocessor OS: Pool of the same type of devices shared.
How is file sharing carried out?
  Network OS: Usually requires no pre-defined semantics. Distributed OS: Requires well-defined semantics. Multiprocessor OS: Requires well-defined semantics.
manner. Whenever a process wishes to use a kernel data structure, it reads the value in the sequence lock associated with the data structure, makes a note of that value, and then starts to perform its own operation. After completing the operation, it checks whether the value in the lock has changed. If so, the operation just performed is deemed to have failed, so the process invalidates the operation just executed and attempts to run it again, and so on until the operation succeeds.
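A minimal user-level sketch in the spirit of this retry scheme is shown below, using C11 atomics: the (single) writer bumps the sequence counter around its update, and a reader retries whenever it observes that the counter was odd or changed while it was reading. It is only an illustration of the general idea, deliberately simplified, and not the Linux kernel's actual sequence-lock code.

    #include <stdatomic.h>

    /* A tiny sequence-lock-style construct: readers retry instead of blocking.
     * Simplified sketch: a production version also needs fences or atomic
     * accesses for the protected data itself. */
    static atomic_uint seq;               /* statically zero-initialized */
    static int shared_x, shared_y;        /* the protected data */

    /* Single writer: make the counter odd while updating, even again when done. */
    void writer_update(int x, int y)
    {
        atomic_fetch_add(&seq, 1);        /* odd: update in progress */
        shared_x = x;
        shared_y = y;
        atomic_fetch_add(&seq, 1);        /* even again: update complete */
    }

    /* Readers: if the counter was odd or changed during the read, the attempt
     * is deemed to have failed and is simply tried again. */
    void reader_snapshot(int *x, int *y)
    {
        unsigned before, after;
        do {
            before = atomic_load(&seq);
            *x = shared_x;
            *y = shared_y;
            after = atomic_load(&seq);
        } while (before != after || (before & 1u));
    }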
The advanced version of Linux 2.6 includes a substantially improved scheduler for traditional non-real-time processes and retains mostly the same real-time scheduling capability as version 2.4 and earlier. Within the scheduler, Linux defines three scheduling classes:
• SCHED_FIFO: first-in-first-out real-time threads
• SCHED_RR: round-robin real-time threads
• SCHED_OTHER: other, non-real-time threads
Besides this, Linux 2.6 also introduces a completely new scheduler known as the O(1) scheduler (an example of the "big-O" notation used for characterizing the time complexity of algorithms), in which certain limitations of the Linux 2.4 scheduler have been substantially removed, particularly with respect to the SCHED_OTHER class, which did not scale well with an increasing number of processors and processes. Moreover, the Linux scheduler is designed to favor I/O-bound tasks over CPU-bound tasks. The scheduler is designed in such a way that the time to select the appropriate process and assign it to a processor is almost constant, irrespective of the total load or the number of available processors in the system.
comparable. This task is performed by a CPU that finds its ready queues empty; it is also performed periodically by the scheduler, which checks whether there is a substantial imbalance among the numbers of tasks assigned to the processors, typically at an interval of every 1 msec if the system is idle and every 20 msecs otherwise. To balance the load, the scheduler can transfer some tasks by invoking the load_balance function with the id of the under-loaded CPU as a parameter. The highest-priority active tasks are selected for such transfer, because it is important to fairly distribute high-priority tasks.
Another salient feature of the Linux 2.6 kernel is that it can also support system architectures that
do not provide a memory management unit, which makes the kernel capable of supporting embed-
ded systems. Thus, the same kernel can now be employed in multiprocessors, servers, desktops,
and even embedded systems. Since the kernel modules are equipped with well-specified interfaces, several distinct features, such as better scalability, an improved scheduler, a speedy synchronization mechanism between processes, and many other notable attributes, have been incorporated into the kernel.
in executing a thread whose priority exceeds T k’s (lower priority). In fact, T k could have been sched-
uled on some other processor if it were a real-time thread.
through the network. Except for that facility, the operating systems on the workstations are fairly
traditional. Workstation networks, however, occupy a place and play a role somewhere in between
computer networks and true multicomputers (distributed systems).
Multicomputers thus designed are best suited primarily for general-purpose multi-user applications in which many users are allowed to work together on many unrelated problems, but occasionally in a cooperative manner that involves sharing of resources. Such machines usually yield cost-effective, higher bandwidth, since most of the accesses made by each processor in the individual machines are to its local memory, thereby reducing latency and eventually resulting in increased system performance. The nodes in the machines are, however, equipped with the needed interfaces so that they can always be connected to one another through the communication network.
In contrast to the tightly coupled multiprocessor system, the individual computers forming
the multicomputer system can be located far from each other and thereby can cover a wider
geographical area. Moreover, in tightly coupled systems, the number of processors that can be effectively and efficiently employed is usually limited and constrained by the bandwidth of the shared memory, resulting in restricted scalability. Multicomputer systems, on the other hand,
with a loosely coupled architecture, are more freely expandable in this regard and theoretically
can contain any number of interconnected computers with no limits as such. On the whole, multi-
processors tend to be more tightly coupled than multicomputers, because they can exchange data
almost at memory speeds, but some fiber-optic-based multicomputers have also been found to
work at close to memory speeds.
shared memory. Instead, the only means of communication is message passing. A
representative scheme of multicomputer operating system organization is depicted in Figure 9.8.
As already mentioned, each machine (node) in the multicomputer system (as shown in Figure 9.8)
has its own kernel that contains different modules for managing its various local resources, such
as local CPU, memory, a local disk, and other peripherals. In addition, each machine has a sepa-
rate module for handling interprocessor communication, which is carried out mostly by sending
and receiving messages to and from other machines. The message-passing technique used here may vary widely in its semantics between different systems, giving rise to several issues, such as whether messages between processes should be buffered and whether the participating processes remain blocked during message-passing operations. Whatever decision is made in this regard at the time of designing the OS depends largely on the underlying system architecture, which consequently determines the reliability of the communication between machines. In fact, the presence or absence of buffers at the sender's or receiver's end, for example, ultimately decides whether reliable communication can be guaranteed, which, in turn, has a tremendous impact on the performance of the system as a whole.
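The buffering and blocking choices just mentioned can be made concrete with a small sketch built on the standard POSIX socket API; apart from the socket calls themselves, the function names are illustrative.

#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Blocking send: the caller is suspended until the kernel has accepted the
 * whole message or an error occurs (partial sends ignored for brevity).    */
int send_blocking(int sock, const void *msg, size_t len)
{
    return send(sock, msg, len, 0) == (ssize_t)len ? 0 : -1;
}

/* Non-blocking send: if no buffer space is available, return immediately
 * instead of blocking, leaving the retry policy to the caller.
 * (MSG_DONTWAIT is a Linux-specific flag.)                                  */
int send_nonblocking(int sock, const void *msg, size_t len)
{
    ssize_t n = send(sock, msg, len, MSG_DONTWAIT);
    if (n == (ssize_t)len) return 0;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 1;                        /* would block: message not sent    */
    return -1;
}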
Within the multicomputer operating system, there exists a common layer of software (a utility
process, as shown in Figure 9.8) just above the local kernel that acts as a virtual machine monitor
implementing the operating system as a virtual machine, thereby multiplexing different underly-
ing kernels to support parallel and concurrent execution of various tasks. By using the available
interprocessor communication facilities, this layer provides a software implementation of shared
memory. The services that are commonly offered by this layer are, for example, assigning a task to
a processor, providing transparent storage, general interprocess communication, masking hardware
failures, and other standard services that any operating system usually provides. Some of the salient
features of multicomputer operating systems are:
• Each machine (node) has a copy of the code necessary for communication and primitive
service to processes (such as setting up mapping registers and preempting at the end of a
quantum). This code is the kernel of the operating system.
In fact, many of the features and issues required to be included in the design of a multicomputer
operating system are equally needed for any distributed system. However, the main difference
between multicomputer operating systems and distributed systems is that the former generally
assume that the underlying hardware is homogeneous and is to be fully controlled. On the other
hand, one important feature of a distributed operating system is migration of processes from one
machine to another to improve the balance of load and to shorten communication paths. Migration
requires a mechanism to gather load information, a distributed policy that decides that a process
should be moved, and a mechanism to effect the transfer. Migration has been demonstrated in a few
UNIX-based DOSs, such as Locus and MOS, and in communication-based DOSs, like Demos/MP.
Many distributed systems nowadays, however, are built on top of existing operating systems.
9.11.3 MIDDLEWARE
The definition of a true distributed system is given in Section 9.8. Neither a NOS nor a DOS truly meets the criteria of a real distributed system. The reason is that a NOS never presents the view of a single coherent system, while a DOS is not aimed at handling a collection of (mostly heterogeneous) independent computers. The obvious question now arises as to whether it would be possible to
develop a distributed system that could have most of the merits of these two different worlds: the
scalability and openness properties of NOSs and the transparency attributes of DOSs. Probably the
most diffcult problem in designing such distributed systems is the need to support network trans-
parency. The solution to this problem can be obtained by injecting an additional layer of software on
top of a NOS to essentially mask (hide) the heterogeneity of the collection of underlying platforms
(such as networks, hardware, operating systems, and many other things) in order to offer a single
coherent system view (network transparency) as well as to improve distribution transparency. Many
contemporary modern operating systems are constructed following this idea by including an additional layer between applications and the NOS, thereby offering a higher level of abstraction:
what is historically called middleware. This layer eventually implements a convenient set of general-purpose services for application programmers. The following discussion on this topic is almost
along the same lines as that of modern approaches (Tanenbaum, 1995).
NOSs often allow processes of distributed applications on different machines to communicate
with each other by passing messages. In addition, several distributed applications make use of interfaces to the local file system that forms part of the underlying NOS. But the drawback of this approach is that distribution is hardly transparent, because the user has to specifically mention the destination point at which an action will be carried out. In order to negotiate this drawback of a NOS (i.e. lack of network transparency) and make it usable as a distributed system, a solution is then to place an additional layer of software between applications and the NOS, thereby offering a higher level of abstraction. This layer is thus legitimately called middleware. Middleware is essentially a set of drivers, APIs, or other software that improves and eases connectivity between a client application (residing on top of it) and a server process (existing below the middleware layer). It provides a uniform computational model for use by the programmers of servers
as well as distributed applications.
Local operating systems running on heterogeneous computers are totally dedicated to perform-
ing everything with regard to their own resource management as well as carrying out simple means
of communication to connect other computers. Middleware never manages an individual node pres-
ent in the network system, but it provides a way to hide the heterogeneity of the underlying plat-
forms from the applications running on top of it. Many middleware systems, therefore, offer an almost complete collection of services and discourage using anything but their own interfaces to those services. Any attempt to bypass the middleware layer and invoke the services of one of the underlying local operating systems directly is generally frowned upon.
there is a need to build a set of higher-level application-independent services to put into systems so
that networked applications can be easily integrated into a single system. This requires defning a
FIGURE 9.9 A representative block diagram of the general structure of a distributed system realized with
the use of a middleware.
common standard for middleware solutions. At present, there are a number of such standards, and
these available standards are generally not compatible with one another. Even worse, products that
implement the same standards but were introduced by different vendors are rarely interoperable.
Again to overcome this undesirable drawback, placement of upperware on top of this middleware
is thus urgently needed.
• A relatively simple model treats everything, including I/O devices such as the mouse, keyboard, disk, network interface, and so on, as a file, along the lines of UNIX and, more rigorously, Plan 9. Essentially, whether a file is local or remote makes no difference. All that matters is that an application opens a file, reads and writes bytes, and finally closes it again. Because files can be shared by many different processes, communication now reduces to simply accessing the same file.
• Another middleware model following a similar line as Plan 9, but less rigid, is centered
around DFS. Such middleware supports distribution transparency only for traditional fles
(i.e. fles that are used merely for storing data). For example, processes are often required
to be started explicitly on specifc machines. This type of middleware is reasonably scal-
able, which makes it quite popular.
• Middleware based on remote procedure calls (RPCs) and group communication sys-
tems such as lists (discussed later) was an important model in the early days. This model
puts more emphasis on hiding network communication by allowing a process to call a pro-
cedure, the implementation of which is located on a remote machine. At the time of calling
such a procedure, parameters are transparently shipped to the remote machine where the
procedure is actually to be executed, and thereafter the results are sent back to the caller. It
therefore appears to the caller as if the procedure call was executed locally, but it actually
keeps the network communication that took place transparent to the calling process, except perhaps for a slight degradation in performance (a sketch of such a client-side stub appears after this list).
• Another model based on object orientation is equally popular today. The success of RPC
established the fact that if procedure calls could cross the machine boundaries, so could
objects, and it would then be possible to invoke objects on remote machines in a transpar-
ent manner too. This led to the introduction of various middleware systems based on the
notion what is called distributed objects. The essence of this concept is that each object
implements an interface that hides all its internal details from its users. An interface essen-
tially consists of the methods that the object implements. The only thing that a process
can see of an object is its interface. Object-oriented middleware products and standards
are widely used. They include Common Object Request Broker (CORBA), Java Remote
Method Invocation (RMI), Web Services, Microsoft’s Distributed Component Object
Model (DCOM), and so on. CORBA provides remote object invocation, which allows an
object in a program running on one computer to invoke a method of an object in a program
running on another computer. Its implementation hides the fact that messages are passed
over a network in order to send the invocation request and its reply.
Distributed objects are often implemented by having (placing) each object itself located
on a single machine and additionally making its interface available on many other machines.
When a process (on any machine except the machine where the object is located) invokes
a method, the interface implementation on the process’s machine simply transforms the
method invocation into a message, which is ultimately sent (request) to the object. The
object executes the requested method and sends (reply) back the result. The interface
implementation on the process’s machine transforms the reply message into a return value,
which is then handed over to the invoking process. Similar to RPC, here also the process
may be kept completely in the dark about the network communication.
• This approach was further refned to give rise to a model based on the concept of dis-
tributed documents, which is probably best illustrated by the World Wide Web. In the
Web model, information is organized into documents, where each document resides on
a machine somewhere in the world. The exact location of the document is transparent to
the user. Documents contain links that refer to other documents. By following a link, the
specifc document to which that link refers is fetched from its location and displayed on the
user’s screen. The documents in question may be of any type, such as text, audio, or video,
as well as all kinds of interactive graphic-based articles.
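As noted in the RPC model above, the client-side stub is what makes a remote call look local. The following minimal C sketch is illustrative only; the message format and the send_to_server/recv_from_server helpers are assumptions rather than part of any particular RPC package.

#include <stddef.h>
#include <stdint.h>

/* Assumed transport helpers provided elsewhere. */
extern void send_to_server(const void *buf, size_t len);
extern void recv_from_server(void *buf, size_t len);

/* Client stub for a remote procedure  int add(int a, int b):
 * it marshals the parameters into a request message, ships it to the remote
 * machine, and unmarshals the reply, so the caller sees an ordinary call.  */
int add(int a, int b)
{
    int32_t request[3] = { 1 /* procedure id */, a, b };
    int32_t reply;

    send_to_server(request, sizeof request);   /* parameters shipped out    */
    recv_from_server(&reply, sizeof reply);    /* result shipped back       */
    return reply;
}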
In spite of having tremendous strength, middleware does have several limitations. Many distributed
applications rely entirely on the services provided by the underlying middleware to support their
needs for communication and data sharing. For example, an application that is suitable for a cli-
ent–server model, such as a database of names and addresses, can rely on a model of middleware
that provides only remote method invocation. Many other examples can also be cited in this regard.
Although much has already been achieved in simplifying the programming of distributed systems through the development of middleware support, some aspects of the dependability of systems still require support at the application level. In addition, some communication-related functions that are carried out at the lowest level can be completely and reliably implemented only with the knowledge and help of the application standing at the upper end point of the communication system. Therefore, providing such a function solely as a feature of the communication system itself, rather than additionally at the application level, is not always wise and sensible. Consequently, this runs counter to the view that all communication activities can be abstracted (hidden) away from the programming of applications by the introduction of appropriate middleware layers.
TABLE 9.2
A comparison of salient features between multiprocessor operating systems, multicomputer operating systems, network operating systems, and distributed systems (middleware-based)

                              Distributed Operating System          Network              Middleware-based
Item                          Multiprocessor    Multicomputer       Operating System     Distributed System
Same OS on all nodes          Yes               Yes                 No                   No
Number of copies of OS        1                 N                   N                    N
Basis for communication       Shared memory     Messages            Files                Model specific
Resource management           Global, central   Global, distributed Per node             Per node
Degree of transparency        Very high         High                Low                  High
Scalability                   Very low          Moderately          Yes                  Varies
Openness                      No (Closed)       No (Closed)         Open                 Open
One aspect with regard to the last row of Table 9.2 needs to be explained. Regarding openness, both NOSs and middleware-based distributed systems have an edge over the others. One of the main reasons is that the different nodes of these systems, even while running under different operating systems, generally support a standard communication protocol (such as TCP/IP) that makes interoperability much easier. One practical aspect that goes against this desirable feature, however, is that the use of many different operating systems causes severe difficulties in porting applications. In fact, DOSs, in general, are never targeted to be open. Instead, they are often designed with more emphasis on performance optimization, which eventually leads to the introduction of many proprietary solutions that ultimately stand in the way of an open system.
Out of these, the first three issues, network type, network topology, and networking technology, are concerned with what is known as the design of networks; all the other issues mentioned (all other rows of Table 9.3 on the Support Material at www.routledge.com/9781032467238) are concerned mostly with message communication and its related aspects. We will now discuss all the issues mentioned above (described in Table 9.3 in the Support Material at www.routledge.com/9781032467238) in brief.
Details on fundamental issues related to networking are given in Table 9.3 on the Support
Material at www.routledge.com/9781032467238.
• Ethernet: This is the most widely used multi-access branching bus topology network, using a circuit that consists of cables linked by repeaters (similar to Figure 9.18 on the Support Material at www.routledge.com/9781032467238) for building distributed systems, because it is relatively fast and economical. Information is transmitted from one station (node) to another by breaking it up into units (packets) called frames. Each frame contains the addresses of its source and destination and a data field. Each station listens to the bus at all times, and it copies a frame into a buffer if the frame is meant for it; otherwise it simply ignores the frame. A bridge used to connect Ethernet LANs is essentially a computer that receives frames on one Ethernet and, depending on the destination addresses, reproduces them on another Ethernet to which it is connected. Every Ethernet hardware interface is assigned by the manufacturer a unique 48-bit address, administered by the IEEE, to uniquely identify a specific interface (station) in the set of interconnected Ethernets forming the site. Since the basic Ethernet topology is a bus-based one, only one transmission can be in progress at any time, using carrier sense multiple access with collision detection (CSMA/CD) technology (protocol). However, if several stations find no signal on the cable and start transmitting their frames at almost the same time, their frames interfere with one another, causing what is called a collision, which is then resolved using appropriate algorithms. A collision is normally detected by the transmitting stations themselves, which monitor the cable while sending; to make such detection possible, a frame must have a minimum size of 512 bits for 10- and 100-Mbit Ethernets and 4096 bits for gigabit Ethernets (a sketch of this retry-with-backoff behaviour appears after this list).
• Token Rings: A network with a ring topology is a well-understood and field-proven technology in which a collection of ring interfaces are connected by point-to-point links using cheap twisted pair, coaxial cable, or fiber optics as the communication medium, and it has almost no wasted bandwidth when all sites are trying to send. Since a ring is fair and also has a known upper bound on channel access, for these and many other reasons IBM chose the ring network as its LAN and adopted this technology as a basis for its distributed system products. The IEEE has also included token ring technology as the IEEE 802.5 standard, which eventually became another commonly used LAN technology for building distributed systems. A ring topology that uses the notion of a token, a special bit pattern carrying a specific control message, is called a token ring network, and the medium-access control protocol used is the token ring protocol. Here, a single token of 3 bytes,
which may either be busy or free, circulates continuously around the ring. When a station
wants to transmit a frame, it is required to seize the free token and remove it from the ring
and then attach its message to the token, changing its status to busy before transmitting.
Therefore, a busy token always has a message packet attached to it, and the message can be
of any length and need not be split into frames of a standard size. Since there exists only
one token, only one station can transmit, and only one message can be in transit at any
instant. However, ring interfaces have two operating modes: listen and transmit. In listen
mode, every station that finds a message checks whether the message is intended for it; if
it is, the destination station copies the message and resets the status bit of the token to free.
Operation of the token ring comes to a halt if the token is lost due to communication errors.
One of the stations is responsible for recovering the system; it listens continuously to the
traffic on the network to check for the presence of a token and then creates a new token if it finds that the token has been lost.
• Asynchronous Transfer Mode (ATM) Technology: ATM is a high-speed connection-oriented switching and multiplexing technology that uses short, fixed-length packets called cells to transmit different types of traffic simultaneously. It is asynchronous (not tied to a master clock) in that information can be sent independently, without the common clock that most long-distance telephone lines rely on. ATM has several salient features that put it at the forefront of networking technologies. Some of the most common are:
• It provides data transmission speeds of 622 Mbps, 2.5 Gbps, and even more, which
facilitates high bandwidth for distributed applications, such as those based on video-on-
demand technique, video–conferencing applications, and several other types of applica-
tions that often need to access remote databases.
• ATM exploits the concept of virtual networking to allow traffc between two locations
that permits the available bandwidth of a physical channel to be shared by multiple
applications, thereby enabling them to simultaneously communicate at different rates
over the same path between two end points. This facilitates the total available band-
width being dynamically distributed among a variety of user applications.
• ATM uses both fundamental approaches to switching (circuit switching and packet
switching) within a single integrated switching mechanism called cell switching, which
is fexible enough to handle distributed applications of both types, such as those that
generate a variable bit rate (VBR; usually data applications), which can tolerate delays
as well as fuctuating throughput rates, and those that generate a constant bit rate (CBR;
usually video, digitized voice applications) that requires guaranteed throughput rates
and service levels. Moreover, digital switching of cells is relatively easy compared to
using traditional multiplexing techniques in high-speed networks (gigabits per sec),
especially using fiber optics.
• ATM allows the use of only a single network to efficiently transport a wide range of multimedia data comprising text, voice, video, broadcast television, and several other types. Normally, each of these data types requires a separate network of distinct technology, and all of these networks have to be provided simultaneously for effective transportation. ATM, with one single network, replaces the simultaneous use of many different types of networks and their underlying technologies, thereby straightaway simplifying the design of communication networks as well as providing substantial savings in costs.
• In ATM, it is possible to offer a user as big or as small a portion of the network bandwidth as is needed, and billing can then be made on the basis of per-cell usage (perhaps on a giga-cell basis).
• ATM, in addition to point-to-point communication in which there is a single sender and a single receiver, also supports a multicasting facility in which there is a single sender and multiple receivers. Such a facility is required for many collaborative distributed applications, such as transmitting broadcast television or video conferencing to many households (users) at the same time.
• ATM technology is equally applicable in both LAN and WAN environments with
respect to having the same switching technology (cell switching) and same cell format.
• The technology used in ATM is nicely scalable both upward and downward with respect
to many parameters, especially data rates and bandwidths.
• ATM technology, by virtue of its having enormous strength, eventually has been inter-
nationally standardized as the basis for B-ISDN.
ATM, by virtue of its many attractive features (already mentioned), has created an immense impact on the design of future distributed systems; hence, it is often legitimately described as the computer networking paradigm of the future, despite the fact that several problems with this technology remain for network designers and users to solve.
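For concreteness, a standard ATM cell is a fixed 53-byte unit: a 5-byte header (carrying, among other things, the virtual path and virtual channel identifiers used by cell switching) followed by a 48-byte payload. The structure below is a simplified C sketch of such a cell; real headers pack these fields into bit-fields, so the layout here is illustrative only.

#include <stdint.h>

#define ATM_PAYLOAD 48                 /* every cell carries exactly 48 data bytes  */

/* Simplified view of a 53-byte ATM cell: 5 header bytes plus 48 payload bytes. */
struct atm_cell {
    uint16_t vpi;                      /* virtual path identifier                   */
    uint16_t vci;                      /* virtual channel identifier                */
    uint8_t  payload_type;             /* e.g. user data vs. management cell        */
    uint8_t  cell_loss_priority;       /* 1 = may be dropped under congestion       */
    uint8_t  header_checksum;          /* header error control (HEC)                */
    uint8_t  payload[ATM_PAYLOAD];     /* fixed-size data field                     */
};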
Brief details on ethernet, token rings, and ATM, with figures, are given on the Support
Material at www.routledge.com/9781032467238.
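As referenced in the Ethernet description above, collision handling under CSMA/CD is essentially a retry loop with randomized (binary exponential) backoff. The following C sketch is illustrative only; the carrier-sense and frame-transmission primitives are assumed helpers, not part of any real driver interface.

#include <stdlib.h>

/* Assumed hardware-access helpers, for illustration only. */
extern int channel_idle(void);               /* carrier sense                       */
extern int transmit_frame(const void *f);    /* 0 on success, -1 if collision seen  */

/* CSMA/CD sender with binary exponential backoff. */
int csma_cd_send(const void *frame)
{
    for (int attempt = 0; attempt < 16; attempt++) {
        while (!channel_idle())
            ;                                /* wait for a silent channel           */
        if (transmit_frame(frame) == 0)
            return 0;                        /* frame went out without collision    */

        /* Collision: wait a random number of slot times, doubling the range
         * after every collision (capped at 2^10 slots). */
        int limit = 1 << (attempt < 10 ? attempt + 1 : 10);
        int slots = rand() % limit;
        for (volatile long i = 0; i < (long)slots * 512; i++)
            ;                                /* crude stand-in for a slot-time delay */
    }
    return -1;                               /* give up after 16 attempts           */
}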
• Circuit switching: A circuit is essentially a connection used exclusively for message pass-
ing by an intending pair of communicating processes, and the related physical circuit is set
up during the circuit set–up phase, that is, before the frst message is transmitted, and is
purged sometime after the last message has been delivered. Circuit set–up actions involve
deciding the actual network path that messages will follow, reservation of the channels
constituting the circuit, and other communication resources. Exclusive reservation of the
channels ensures no need of any buffers between them. Each connection is given a unique
id, and processes specify the connection id while sending and receiving messages.
The main advantage of the circuit-switching technique is that once the circuit is established, the
full capacity of the circuit is for exclusive use by the connected pair of hosts with almost no delay in
transmission, and the time required to send a message can be estimated and guaranteed. However,
the major drawbacks of this technique are that it requires additional overhead and delays during
circuit setup/disconnection phases to tie–up/disconnect a set of communicating resources. Channel
bandwidth may also be wasted if the channel capacities of the path forming the circuit are not uti-
lized efficiently by the connected pair of hosts. This method is, therefore, hard to justify when the overall message density in the system is low; it is appropriate mainly for long continuous transmissions, especially when medium-to-heavy traffic is expected between a pair of communicating hosts. It is also considered suitable in situations where transmissions require a guaranteed maximum transmission delay. This technique is, therefore, favored particularly for the transmission of voice and real-time data in distributed applications.
• Packet switching: Here, a message is split into parts of a standard size called packets, and
the channels are shared for transmitting packets of different sender–receiver pairs instead
of using a dedicated communication path. For each individual packet, a connection is set
up, and the channel is then occupied by a single packet of the message of a particular pair;
the channel may then be used for transmitting either subsequent packets of the same mes-
sage of the same pair or a packet of some other message of a different pair. Moreover, pack-
ets of the same message may travel along different routes (connections) and may arrive out
of sequence at the prescribed destination site. When packet switching is used, two kinds of overhead are primarily involved: first, a packet must carry some identification in its header: the id of the message to which it belongs, a sequence number within the message, and the ids of the sender and destination processes. Second, the packets that have arrived at the destination site have to be properly reassembled so that the original message can be formed.
Packet switching provides efficient usage of channels (links), because the communication bandwidth of a channel is not monopolized by specific pairs of processes but is shared to transmit several messages. Hence, all pairs of communicating processes are supposed to receive fair and unbiased service, which makes this technique attractive, particularly for interactive processes. This technique, as compared to circuit switching, is more appropriate in situations when small amounts of bursty data are required to be transmitted. Furthermore, by virtue of having fixed-size packets,
this approach reduces the cost of retransmission when an error occurs in transmission. In addition,
the dynamic selection of the actual path to be taken by a packet ensures considerable reliability
in the network, because alternate paths in the network could be used in transmission in the event
of channel or PSE failure. However, several drawbacks of this method have also been observed.
Apart from consuming time to set up the connection before transmission, this technique needs to
use buffers to buffer each packet at every host or PSE and again to reassemble the packets at the
destination site; the additional overhead thus incurred per packet is large and eventually makes this
method inefficient for transmitting large messages. Moreover, there is no guarantee as to how long
it takes a message to travel from a source host to its destination site because the time to be taken
for each packet depends on the route chosen for that packet, in addition to the volume of data to be
transferred.
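The two overheads just mentioned, per-packet identification and reassembly at the destination, can be pictured with a small C sketch; the field sizes and the reassembly buffer below are illustrative assumptions.

#include <stdint.h>
#include <string.h>

#define MAX_PACKETS 64
#define PKT_DATA    1024

/* Identification carried in every packet so the destination can rebuild the
 * original message even if packets arrive out of order.                     */
struct packet {
    uint32_t message_id;               /* which message this packet belongs to */
    uint16_t seq_no;                   /* position within that message         */
    uint16_t total;                    /* number of packets in the message     */
    uint32_t src_id, dst_id;           /* sender and destination process ids   */
    uint16_t len;
    uint8_t  data[PKT_DATA];
};

struct reassembly {
    uint8_t  received[MAX_PACKETS];    /* which pieces have arrived            */
    uint8_t  buffer[MAX_PACKETS][PKT_DATA];
    uint16_t lens[MAX_PACKETS];
    uint16_t total, count;
};

/* Store one packet; return 1 once the whole message has been reassembled.
 * (Bounds checking omitted for brevity.)                                    */
int deliver_packet(struct reassembly *r, const struct packet *p)
{
    if (!r->received[p->seq_no]) {
        memcpy(r->buffer[p->seq_no], p->data, p->len);
        r->lens[p->seq_no]     = p->len;
        r->received[p->seq_no] = 1;
        r->total = p->total;
        r->count++;
    }
    return r->count == r->total;
}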
In order to alleviate the additional cost required in any form of connection strategy to set up the
connection between the sender and receiver before the start of actual transmission, connection-
less protocols are often used in practice for transmitting messages or packets. In such a protocol,
the originating node simply selects one of its neighboring nodes or PSE (see Figure 9.18 given in
Support Material at www.routledge.com/9781032467238) and sends the message or the packet to it.
If that node is not the destination node, it saves the message or the packet in its buffer and decides
which of the neighbors to send it to, and so on until the message or packet reaches the ultimate des-
tination site. In this way, the message or the packet is frst stored in a buffer and is then forwarded
to a selected neighboring host or PSE when the next channel becomes available and the neighboring
host or PSE also has a similar available buffer. Here, the actual path taken by a message or packet
to reach its fnal destination is dynamic because the path is established as the message or packet
travels along. That is why this method is also sometimes called store-and-forward communication:
because every message or packet is temporarily stored by each host or PSE along its route before it
is forwarded to another host or PSE.
Connection-less transmission can accommodate better traffic densities in communication channels (links) than message or packet switching, since a node can make the choice of the link when it is ready to send out a message or a packet. It is typically implemented by maintaining a table in each node (essentially a subset of an adjacency matrix for each node) that indicates which neighbor to send to in order to reach a specific destination node, along with the exchange of traffic information among the participating nodes. As usual, each node should be equipped with a large buffer for temporarily storing messages or packets and transmitting them later at convenient times if its outgoing channels are busy or overloaded at any instant.
Brief details on this topic with figures are given on the Support Material at www.
routledge.com/9781032467238.
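The per-node table mentioned above is essentially a next-hop map: for every possible destination it names the neighbor to which a message or packet should be forwarded, and the packet is buffered locally when no forwarding is currently possible. A minimal C sketch follows (node identifiers and helper routines are assumed, for illustration only).

#define MAX_NODES 32

/* next_hop[d] names the neighbor to forward to in order to reach node d
 * (one row of the table kept at this node); -1 means currently unreachable. */
static int next_hop[MAX_NODES];

/* Assumed helpers, for illustration only. */
extern int  this_node;                        /* id of the local node              */
extern void buffer_locally(const void *pkt);  /* store-and-forward buffer          */
extern void send_to_neighbor(int nbr, const void *pkt);

void forward(const void *pkt, int destination)
{
    if (destination == this_node)
        return;                               /* deliver to a local process (omitted) */

    int nbr = next_hop[destination];
    if (nbr < 0) {
        buffer_locally(pkt);                  /* hold until a route/channel is free */
        return;
    }
    send_to_neighbor(nbr, pkt);               /* one hop closer to the destination  */
}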
function (technique) is invoked whenever a connection is to be set up. The choice of routing strategy
has an impact on the ability to adapt to changing traffic patterns in the system and consequently is crucial to the overall performance of the network. A routing strategy is said to be efficient if the underlying routing decision process is as fast as possible so that network latency is minimal. A routing algorithm describes how routing decisions are to be specified and how often they are to be modified, and it is commonly said to be good if it can be easily implemented entirely in hardware. In
LANs, sender–receiver interaction takes place on the communication channel; hence, there is no
need to have any routing strategies, as there is no provision to choose the path to be taken for trans-
mitting the message.
• Fixed (deterministic) routing: In this method, the entire path to be taken for communica-
tion between a pair of nodes is permanently specifed beforehand. Here, the source nodes
or its PSE selects the entire path and also decides which of all other intermediate PSEs
should be used to reach its destination. Each node is equipped with a fairly comprehensive
table and other information about the network environment that indicates paths to all other
nodes in the system at present. All routing information is, however, included along with the
message. When processes running in these nodes intend to communicate, a connection is
set up using this specified path. A fixed routing strategy is simple and easy to implement. The routing decision process is quite efficient because the intermediate PSEs, if any, need not make any routing decisions. However, this strategy provides no flexibility in dealing with fluctuating traffic densities, nor can it cope with node faults or link failures, because the specified path cannot be changed once the information (or packet) has left the source computer (or its PSE). Consequently, it makes poor use of network bandwidth, leading to low throughput and also appreciable delays when a message (or packet) is blocked due to faults or failures of components, even when alternative paths are still available for its transmission.
• Virtual circuit: This strategy specifies a path selected at the beginning of a transmission between a pair of processes, and this path is used for all messages sent during the session. Information relating to traffic densities and other aspects of the network environment in the system is taken into consideration when deciding the best path for the session. Hence, this strategy can adapt to changing traffic patterns from one session to the next, although within a session it remains susceptible to component failures along the chosen path. It ensures better use of network bandwidth and thereby yields considerably improved throughput and enhanced response times.
• Dynamic (adaptive) routing: This method selects a path whenever a message or a packet
is to be sent, so different messages or even different packets of a message between a pair
of processes may use different paths. This strategy is also known as adaptive routing,
because it dynamically adapts to the continuously changing state of the network in normal situations as well as to its changing traffic patterns, responding more effectively in the event of faulty nodes or congested or failed channels. Since this scheme can use alternative paths for packet transmission (as with packet switching among the connection strategies), it makes more efficient use of network bandwidth, leading to better throughput and enhanced
response times compared to when a virtual circuit is used. Its ability to adapt to alternative
paths makes it resilient to failures, which is particularly important to large-scale expand-
ing architectures in which there is a high probability of facing faulty network components
very often. Under this scheme, packets of a message may arrive out of order (as already
described in the packet-switching approach) at the destination site; proper reassembling of
packets thus needs to be carried out based on the sequence number appended already to
each packet at the time of its transmission.
Here, the policy used in the selection of a path may be either minimal or nonminimal. In
the case of a minimal policy, the path being selected is one of the shortest paths between
a source–destination pair of hosts, and therefore each channel that a packet visits brings it closer to the destination. In the nonminimal policy, a packet may have to follow a
relatively long path in order to negotiate current network conditions. In the ARPANET,
which was the progenitor of the internet, network information relating to traffic density and other associated aspects of the network environment along every link in the system was constantly exchanged between nodes to determine the current optimal path under prevailing conditions for a given source–destination pair of nodes.
• Hybrid routing: This method is essentially a combination of both static and dynamic
routing methods in the sense that the source node or its PSE specifies only certain major intermediate PSEs (or nodes) of the entire path to be visited; the subpath between any two of the specified PSEs (or nodes) is decided by each specified PSE (or node), which acts as the source along its subpath and selects a suitable adjacent ordinary PSE (or node) to transmit to. This means that each major specified PSE (or node) maintains all information about the status of its outgoing channels (i.e. channel availability) and of the adjacent ordinary PSEs (i.e. their readiness to receive) that is needed while selecting the subpath for transmitting the packet. As compared to the static routing method, this method makes more efficient use of network bandwidth, leading to better throughput and enhanced response times. Its ability to adapt to alternative paths also makes it resilient to failures.
Brief details of this topic with respective figures are given on the Support Material at www.
routledge.com/9781032467238.
at lower levels deal with data transmission-related aspects. The concept of layering the protocols in
network design provides several advantages, dividing up the problem into manageable pieces, each
of which can be handled independently of the others, and an entity using a protocol in a higher layer
need not be aware of details at its lower layer.
• The ISO/OSI Reference Model: The International Standards Organization (ISO) has
developed the Open Systems Interconnection reference model (OSI model) for communica-
tion between entities in an open system, popularly known as ISO protocol, ISO protocol
stack, or OSI model. This model identifies seven standard layers and defines the jobs to be performed at each layer. It is considered a guide and not a specification, since it essen-
tially provides a framework in which standards can be developed for the needed services
and protocols at each layer. It is to be noted that adherence to the standard protocols is
important for designing open distributed systems, because if standard protocols are used,
separate software components of distributed systems can be developed independently on
computers having different architectures and even while they run under different operating
systems.
Following the OSI model, the information to be transmitted originates at the sender’s end in an
application that presents it to the application layer. This layer adds some control information to it
in the form of a header feld and passes it to the next layer. The information then traverses through
the presentation and session layers, each of which adds its own headers. The presentation layer
performs change of data representation as needed as well as encryption/decryption. The session
layer adds its own header and establishes a connection between the sender and receiver pro-
cesses. The transport layer splits the message into packets and hands over the packet to the next
layer, the network layer, which determines the link via which each packet is to be transmitted
and hands over a link-id along with a packet to the data link layer. The data link layer treats the
packet simply as a string of bits, adds error detection and correction information to it, and hands
it over to the physical layer for actual necessary transmission. At the other end when the message
is received, the data link layer performs error detection and forms frames, the transport layer
forms messages, and the presentation layer puts the data in the representation as desired by the
application. All seven layers of the ISO protocol and their respective functions are briefy sum-
marized in Table 9.3.
TABLE 9.3
Layers of the ISO Protocol

Layer                   Function
1. Physical layer       Provides various mechanisms for the transmission of raw bit streams between two sites over a physical link.
2. Data link layer      Forms frames after organizing the bits thus received. Performs error detection/correction on created frames. Performs flow control of frames between two sites.
3. Network layer        Encapsulates frames into packets. Mainly performs routing and transmission flow control.
4. Transport layer      Forms outgoing packets. Assembles incoming packets. Performs error detection and, if required, retransmission.
5. Session layer        Establishes and terminates sessions for communication. If required, it also provides for restart and recovery. This layer is not required for connectionless communication.
6. Presentation layer   Represents message information, implementing data semantics by performing change of representation, compression, and encryption/decryption.
7. Application layer    Provides services that directly support the end users of the network. The functionality implemented is application-specific.
It is to be noted that in actual implementation, out of the seven layers described, the first three
layers are likely to be realized in hardware, the next two layers in the operating system, the pre-
sentation layer in library subroutines in the user’s address space, and the application layer in the
user’s program.
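The header-wrapping flow described before Table 9.3 can be sketched in a few lines of C: each layer simply prepends its own header to whatever it receives from the layer above. The fragment below is illustrative only; header sizes and contents are placeholders, and the buffer is assumed to have enough spare room.

#include <stdint.h>
#include <string.h>

/* Prepend a layer header to the unit handed down by the layer above. */
static size_t add_header(uint8_t *buf, size_t len,
                         const void *hdr, size_t hdr_len)
{
    memmove(buf + hdr_len, buf, len);          /* make room at the front           */
    memcpy(buf, hdr, hdr_len);                 /* this layer's control information */
    return len + hdr_len;
}

/* Sender side: application data is wrapped successively on its way down. */
size_t encapsulate(uint8_t *buf, size_t app_len)
{
    static const char app_hdr[4]  = "APPL";    /* placeholder headers only         */
    static const char pres_hdr[4] = "PRES";
    static const char sess_hdr[4] = "SESS";
    static const char tran_hdr[4] = "TRAN";
    static const char netw_hdr[4] = "NETW";
    static const char link_hdr[4] = "LINK";

    size_t len = app_len;
    len = add_header(buf, len, app_hdr,  sizeof app_hdr);   /* application  */
    len = add_header(buf, len, pres_hdr, sizeof pres_hdr);  /* presentation */
    len = add_header(buf, len, sess_hdr, sizeof sess_hdr);  /* session      */
    len = add_header(buf, len, tran_hdr, sizeof tran_hdr);  /* transport    */
    len = add_header(buf, len, netw_hdr, sizeof netw_hdr);  /* network      */
    len = add_header(buf, len, link_hdr, sizeof link_hdr);  /* data link    */
    return len;                                /* ready for the physical layer     */
}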
Brief details on this topic with fgures and an example are given on the Support Material at www.
routledge.com/9781032467238.
The lowest layer, the network-access layer or host-to-network layer, is essentially a combina-
tion of the physical and data-link layers of the ISO model [Figure 9.10(a)] that covers the whole
FIGURE 9.10 A schematic block diagram of TCP/IP reference model; a) in comparison to OSI model, and
b) also showing its protocols and different related networks used.
physical interface between a data transmission device (computers, workstations, etc.) and a trans-
mission medium or network. This layer specifes the nature of the signals, the characteristics of the
transmission medium, the data rate, and similar other related aspects. This layer is also concerned
with access to and routing data through a network for the exchange of data between two end sys-
tems (workstation, server, etc.) attached to the same network. The sending computer must provide
the network with the address of the destination computer to enable the network to route the data to
the targeted destination. The sending computer may invoke some specifc services (such as priority)
that might be provided by the network. The specifc software used at this layer relies on the type
of network to be used; different standards in this regard have been introduced for circuit switch-
ing/packet switching (e.g. frame relay), LANs (e.g. Ethernet), WANs (e.g. Satnet), and others. This
is illustrated in Figure 9.10(b). Thus, it makes sense to separate those functions that are related to
network access into a separate layer. By doing this, the remainder of the communication software
above this network-access layer need not be at all concerned about the peculiarities and the charac-
teristics of the network to be used. Consequently, this higher-layer software can function in its own
way regardless of the specifc network to which the hosts are attached.
In cases where two devices intending to exchange data are attached to two different networks,
some procedures are needed so that multiple interconnected networks can enable data to reach
their ultimate destination. This function is provided by the internet layer, which defines an official packet format and protocol called Internet Protocol (IP), which is used at this layer to provide the rout-
ing function across multiple networks [see Figure 9.10(b)]. The IP can run on top of any data-link
protocol. This protocol can be implemented not only in end systems but often also in intermediate
components on networks, such as routers. A router is essentially a processor that connects two
networks and whose primary function is to relay data from one network to the other on a route from
the source to the destination end system.
Packet routing is clearly the major issue here, as is congestion avoidance. In most situations, pack-
ets will require multiple hops (using one or more intermediate routers) to make the journey, but
this is kept transparent to users. For these and other reasons, it is reasonable to say that the TCP/IP
internet layer is quite similar in functionality to the OSI network layer. Figure 9.10(a) illustrates this
correspondence. The message comes down to TCP from the upper layer with instructions to send it
to a specifc location with the destination address (host name, port number). TCP hands the message
down to IP with instructions to send it to a specific location (host name). Note that IP need not be told the identity of the destination port; all it needs to know is that the data are intended for a specific host. IP hands the message down to the network-access layer (e.g. ARPANET) with instructions to send it to the respective router (the first hop on the way to its destined location) present in the network.
The IP performs data transmission between two hosts (host-to-host communication) on the internet.
The address of the destination host is provided in the 32-bit IP address format. Protocols in the
next higher layers provide communication between processes; each host assigns unique 16-bit port
numbers to processes, and a sender process uses a destination process address, which is a pair (IP
address, port number). The use of port numbers permits many processes within a host to use send
and receive messages concurrently. Some well-known services, such as FTP, Telnet, SMTP, and
HTTP, have been assigned standard port numbers authorized by the Internet Assigned Numbers
Authority (IANA). IP is a connectionless, unreliable protocol; it does not give any guarantee that
packets of a message will be delivered without error, only once (no duplication), and in the correct
order (in sequence).
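The (IP address, port number) addressing just described is exactly what the familiar socket interface exposes. Below is a minimal sketch of a UDP sender in C; the address 192.0.2.10 and port 5000 are placeholders.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int send_datagram(const char *msg, size_t len)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);         /* UDP socket             */
    if (sock < 0) return -1;

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(5000);                      /* destination port       */
    inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);   /* destination IP address */

    /* The pair (IP address, port number) identifies the receiving process. */
    ssize_t n = sendto(sock, msg, len, 0,
                       (struct sockaddr *)&dst, sizeof dst);
    close(sock);
    return n == (ssize_t)len ? 0 : -1;
}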
The transport layer is placed just above the internet layer in the TCP/IP model [see Figure
9.10(b)] and is designed to allow peer entities on the source and the destination hosts to carry on a
conversation, the same as in the OSI transport layer. Two end-to-end protocols are defned. The frst
one, TCP, is a connection-oriented reliable protocol. It employs a virtual circuit between two processes that allows a byte stream originating on one machine to be reliably delivered to any other machine in the internetwork. The second protocol [see Figure 9.10(b)] in this layer,
UDP (User Datagram Protocol), is an unreliable, connectionless protocol for applications that do
not want TCP’s sequencing or fow control and wish to provide their own. It incurs low overhead
compared to the TCP, because it does not need to set up and maintain a virtual circuit or ensure
reliable delivery. UDP is employed in multimedia applications and in video conferencing, because
the occasional loss of packets is not a correctness issue in these applications; it only results in poor
picture quality. These applications, however, use their own fow and congestion control mechanisms
by reducing the resolution of pictures and consequently lowering the picture quality if a sender, a
receiver, or the network is overloaded.
On top of the transport layer, the topmost layer in the TCP/IP model is the application layer
that corresponds to layers 5–7 in the ISO model. This is depicted in Figure 9.10(a). This layer essen-
tially contains the logic needed to support the various user applications. For each different type of
application, such as fle transfer, a separate module is needed that is specifc to that application.
Therefore, this layer should be equipped with all the higher-level protocols. The protocols used in
this layer in the early days included virtual terminals (TELNET), file transfer (FTP), and electronic
mail (SMTP), as shown in Figure 9.10(b). Many other protocols have been added to these over the
years, such as the DNS for mapping host names onto their network addresses; HTTP, the protocol
used for fetching pages on the World Wide Web; and many others.
More about the ISO/OSI model and TCP/IP protocol, with fgures, is given on the Support
Material at www.routledge.com/9781032467238.
• Transparency: Distributed systems usually support a process migration facility for efficient utilization of resources. But communication (network) protocols meant for network systems use location-dependent process-ids (such as port addresses that are unique only within a node), which severely hinder the implementation of process migration, because when a process migrates, its process-id changes. Therefore, these protocols cannot be used in distributed
systems and, in fact, communication protocols for distributed systems must use location-
independent process-ids that will remain unchanged even when a process migrates from
one location to another within the domain of distributed system.
• System-wide communication: Communication protocols used in network systems are
mainly employed to transport data between two nodes of a network and mostly serve the
purpose of input/output activities. However, most communication in a distributed system, which consists of a network of computers, follows the client–server model: a client issues a request for some specific service by sending a message to the server and then waits until the server receives the request, acts on it, and sends back a due acknowledgement. Therefore, communication protocols in distributed systems must include a simple, connectionless protocol equipped
with features that support such request/response activities.
• Group communication: Distributed systems often make use of group communication
facilities to enable a sender to reliably send a message to n receivers. Although many network systems provide certain mechanisms to realize such group communication by means of multicast or even broadcast at the data-link layer, their respective protocols often hide these potential facilities from applications. Moreover, when group communication is instead realized by sending n point-to-point messages and subsequently waiting for n acknowledgements, bandwidth is badly wasted and the performance of the corresponding algorithm is severely degraded, making it inefficient. Communication protocols for distributed systems, therefore, must provide certain means to offer more flexible and relatively efficient group communication facilities so that a group address can be mapped onto one or more data-link addresses, and the routing protocol can then use a data-link multicast address to send a message to all the receivers belonging to the group defined by the multicast address (an illustrative sketch using IP multicast appears after this list).
• Network management: Management of computer networks often requires manual intervention to update the network configuration (e.g. adding/removing a node from a network, allocation of a new address, etc.) to reflect its current state of affairs. The communication protocol is expected to be able to automatically handle network management activities by changing the network configuration dynamically as and when needed to reflect its present state.
• Network security: Security in a network environment is a burning problem, and network
security is a vital aspect that often uses encryption mechanisms to protect message data from all types of threats, as far as possible, when the data traverse a network. Encryption methods, however, are expensive to use, and it is also true that not all communication channels and nodes pose a threat for a particular user, and hence not all of them need encryption. Thus, encryption is needed only when there is a possibility of a threat to a critical message while it is in transit from its source node to the destination node through an untrustworthy channel/node. Hence, a communication protocol is particularly required that provides a flexible and efficient mechanism in which a message is encrypted if and only if the path it uses across the network during its journey is not trusted and is critically exposed to possible attack.
• Scalability: The communication protocol for distributed systems must provide equally efficient communication in both LAN and WAN environments, even when these are extended to cover a larger domain. In addition, a single communication protocol should, as far as possible, be workable on both types of networks (LAN and WAN).
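Returning to the group-communication requirement above, on an IP network this kind of group addressing is available through multicast sockets: a receiver joins a group address, after which a single datagram sent to that address reaches every member. The sketch below uses the standard socket API; the group address 239.1.1.1 and port 6000 are placeholders.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Join a multicast group so that datagrams sent to the single group address
 * are delivered to this socket as well as to every other member.            */
int join_group(int sock)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(6000);                /* placeholder port        */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(sock, (struct sockaddr *)&addr, sizeof addr) < 0)
        return -1;

    struct ip_mreq mreq;
    inet_pton(AF_INET, "239.1.1.1", &mreq.imr_multiaddr);   /* group address       */
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);          /* any local interface */
    return setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                      &mreq, sizeof mreq);
}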
or quick response. VMTP is designed to provide group communication facilities and implements
a secure and efficient client–server-based protocol (Cheriton and Williamson, 1989). FLIP, on the other hand, provides transparency, efficient client–server-based communication, group communication, security, and easy network management (Kaashoek et al., 1993).
• VMTP: VMTP is essentially a connectionless transport protocol designed mainly for dis-
tributed operating systems based on the concept of a message transaction with special fea-
tures to support request/response activity and was used in the V-System. Here, a message transaction consists of a client sending a request message to one or more servers, followed by zero or more response messages sent back to the client by the servers. Most message transactions involve a single request message and a single response message, with at most one response per server.
Transparency and group communication facilities are provided by assigning to entities 64-bit identifiers (a portion of which is reserved for identifying group entities) that are unique, stable, and, in particular, independent of host addresses; these enable entities to be migrated and addressed independently of network-layer addressing. A group management protocol is provided for the pur-
pose of creating new groups, adding new members, or deleting members from an existing group, as
are various types of querying and related information about existing groups.
VMTP provides a selective retransmission mechanism to yield better performance. The packets
of a message are divided into packet groups containing a maximum of 16 kilobytes of segment data.
When a packet group is sent and received, the receiver records information that indicates which segment blocks are still outstanding. An acknowledgement is then sent from the
receiver to the sender, and the acknowledgement packet contains information that helps the sender
to selectively retransmit only the missing segment blocks of the packet group.
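The bookkeeping behind such selective retransmission can be pictured as a simple bitmask over the segment blocks of a packet group. The sketch below is illustrative only and does not reflect VMTP's actual packet formats.

#include <stdint.h>

#define BLOCKS_PER_GROUP 32            /* segment blocks in one packet group       */

/* Receiver side: record which segment blocks of the group have arrived. */
struct group_state {
    uint32_t received;                 /* bit i set => block i has arrived         */
};

void note_block(struct group_state *g, int block)
{
    g->received |= (1u << block);
}

/* Sender side: the acknowledgement carries the receiver's bitmask, so only
 * the blocks whose bits are still clear are retransmitted.                  */
void retransmit_missing(uint32_t ack_mask, void (*resend_block)(int block))
{
    for (int i = 0; i < BLOCKS_PER_GROUP; i++)
        if (!(ack_mask & (1u << i)))
            resend_block(i);
}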
Many other important and interesting features are present in VMTP. In fact, VMTP provides a rich collection of optional facilities that extend its functionality and improve performance in a diverse spectrum of situations. One important feature, useful in real-time communication, is the facility of conditional message delivery. A client in a time-critical situation can use this facility to ensure that its message is delivered only if the server is able to process it immediately or within a specified short duration of time. Such optional facilities should be designed carefully so that their inclusion offers useful extensions to the existing basic facilities without degrading performance, especially in the common cases that occur most often.
• Fast Local Internet Protocol (FLIP): FLIP was developed for DOSs and is used in the
Amoeba distributed system. It is a connectionless protocol that includes several salient
features, such as transparency, an efficient client–server-based communication facility, group
communication facility, security, and easy network management. A brief overview of FLIP
is given here describing some of its important features, and further details of this protocol
can be found in Kaashoek et al. (1993).
Transparency is provided by FLIP by assigning location-independent 64-bit identifiers to entities, which are also called network service access points (NSAPs). Sites on an internetwork can have more than one
NSAP, typically one or more for each entity (e.g. process). Each site is connected to the internetwork
by a FLIP box that either can be a software layer in the operating system of the corresponding site or
can be run on a separate communication processor (CP). Each FLIP box maintains a routing table,
basically a dynamic hint cache that maps NSAP addresses to data-link addresses. Special primitives
are provided to dynamically register and unregister NSAP addresses into the routing table of a FLIP
box. An entity can register more than one address in a FLIP box (e.g. its own address to receive
messages directed to the entity itself and the null address to receive broadcast messages). FLIP uses
a one-way mapping between the private address used to register an entity and the public address
used to advertise the entity. A one-way encryption function is used to ensure that one cannot deduce
the private address from the public address. Therefore, entities that know the (public) address of an
NSAP (because they have communicated with it) are not able to receive messages on that address,
because they do not know the corresponding private address.
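The idea of the one-way public/private mapping can be sketched as follows. The FNV-1a hash used here is only an illustrative placeholder; a real system would use a cryptographic one-way function.

    /* A minimal sketch of the one-way public/private address mapping idea. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t one_way(uint64_t private_addr) {
        /* FNV-1a over the 8 bytes of the private address (illustrative only) */
        uint64_t h = 1469598103934665603ULL;
        for (int i = 0; i < 8; i++) {
            h ^= (private_addr >> (8 * i)) & 0xFF;
            h *= 1099511628211ULL;
        }
        return h;
    }

    int main(void) {
        uint64_t private_addr = 0x1122334455667788ULL;  /* registered in the FLIP box */
        uint64_t public_addr  = one_way(private_addr);  /* advertised to other entities */
        printf("public address: %016llx\n", (unsigned long long)public_addr);
        /* Knowing public_addr does not reveal private_addr, so only the owner,
         * which registered private_addr, can receive messages for this NSAP. */
        return 0;
    }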
A FLIP message may be of any size up to 2^32 − 1 bytes and is transmitted unreliably between NSAPs. If a message is too large for a particular network, it is fragmented into smaller chunks, called fragments. A fragment typically fits in a single network packet. The basic function of FLIP is to route an arbitrary-length message from the source NSAP to the destination NSAP. The policy by which a specific path is selected for routing is based on the information stored in the routing tables of each FLIP box about the networks to which it is connected. Two key parameters used for this purpose are the network weight and a security list. A low network weight means that the network is currently suitable for a message to be forwarded over. The network weight can be determined based on, for example, physical properties of the network, such as bandwidth and delay (due to congestion). The security bit, on the other hand, indicates whether sensitive data can be sent unencrypted over the network.
Both point-to-point and group communication facilities are provided by FLIP for sending a message to a public address. In fact, FLIP provides three types of system calls: flip_unicast, flip_multicast, and flip_broadcast. The group communication protocols heavily use flip_multicast. This has the advantage that a group of n processes can be addressed using one FLIP address, even if they are located on multiple networks.
FLIP implements security without itself performing any encryption of messages. It provides two mechanisms to impose security when messages are delivered. In the first mechanism, a sender can mark its message sensitive by setting the security bit. Such messages are routed only over trusted networks. In the second mechanism, messages that are routed over an untrusted network by FLIP are marked unsafe by setting the unsafe bit. When the receiver receives the message, it can tell the sender, by checking the unsafe bit, whether there is any safe route between them. If a safe route exists, the sender then sends its sensitive messages in unencrypted form but with the security bit set. If no trusted path is available for such a message at any point in time during routing (which can only happen due to changes in network configuration), it is returned to the sender with the unreachable bit set. If this happens, the sender encrypts the message and retransmits it with the security bit cleared. Therefore, message encryption is done only when it is required, and then only by the sender, not by FLIP.
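The sender-side decision logic just described can be sketched as follows. All names here are assumed (this is not the real FLIP interface), and the "network" is simulated by a routine that pretends the only available route is untrusted.

    /* A minimal sketch of the sender-side security-bit logic described above. */
    #include <stdbool.h>
    #include <stdio.h>

    struct msg {
        bool security_bit;   /* sender: route only over trusted networks      */
        bool unsafe_bit;     /* set if an untrusted network had to be used    */
        bool unreachable;    /* returned: no trusted path currently exists    */
        bool encrypted;
    };

    /* Hypothetical send routine: pretend the only route is untrusted. */
    static void flip_send_sim(struct msg *m) {
        if (m->security_bit) { m->unreachable = true; return; } /* cannot route safely */
        m->unsafe_bit = true;                /* delivered over an untrusted network */
    }

    static void send_sensitive(struct msg *m) {
        m->security_bit = true;              /* first try: unencrypted, trusted route only */
        m->encrypted = false;
        flip_send_sim(m);
        if (m->unreachable) {                /* no trusted path: encrypt and clear the bit */
            m->encrypted = true;
            m->security_bit = false;
            m->unreachable = false;
            flip_send_sim(m);
        }
    }

    int main(void) {
        struct msg m = {0};
        send_sensitive(&m);
        printf("encrypted=%d unsafe=%d\n", m.encrypted, m.unsafe_bit);
        return 0;
    }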
FLIP handles network management easily because dynamic changes in network configuration are taken care of automatically. Human intervention is seldom needed in this regard, and even when it is, it is required only to declare specifically which networks are trusted and which are not. The system administrator performs exactly this task when configuring FLIP, declaring precisely which network interfaces are to be trusted, since FLIP on its own cannot determine which interfaces are trustworthy.
One of the shortcomings of FLIP is that it is unable to provide full-fledged support for wide-area networking. Although FLIP has been implemented successfully in smaller WAN environments, it is not adequate or suitable enough to be used as a standard WAN communication protocol in a moderately large WAN environment. The root of this shortcoming may be that the designers of FLIP were mostly inclined to make it functionally scalable and also assumed that wide-area communication should be carried out mainly at relatively higher layers, not at the network layer to which FLIP belongs.
9.13.4 SOCKETS
Sockets and socket programming were developed in the 1980s in the Berkeley UNIX environment. A socket is essentially a mechanism that enables communication between a client and a server process and may be either connection-oriented or connectionless. A socket is simply one end of a
communication path. A client socket in one computer uses an address to call a server socket on another computer. Once the proper sockets are engaged, the two computers can exchange data. Sockets can be used for interprocess communication within the UNIX system domain and in the internet domain. Typically, computers with server sockets keep a TCP or UDP port open for unscheduled incoming calls. The client typically determines the socket identification of the targeted server by finding it in a domain name system (DNS) database. Once a connection is made, the server switches the dialogue to a different available port number to free up the main port number and allow additional incoming calls to enter.
Sockets can be used in internet applications, such as TELNET and remote login (rlogin) in which
all the details are kept hidden from the user. However, sockets can be constructed from within a
program (such as in C or Java), thereby enabling the designer and programmer to easily include
semantics of networking functions that consequently permit unrelated processes on different hosts
to communicate with one another.
When sockets are used in a connection-based mode of operation, the processes using them are either clients or servers. Both the client and the server process create a socket. These two sockets are then connected to set up a communication path that can be used to send and receive messages. The naming issue is handled in the following way: the server binds its socket to an address that is valid in the domain in which the socket will be used. This address is then widely advertised in the domain. A client process uses this address to perform a connect between its socket and that of the server. This approach avoids the need to use process ids in communication.
The socket mechanism provides sufficient flexibility so that processes using it can choose a mode of operation that best suits the intended use. For applications in which low overhead is important, the communicating processes can use a connectionless mode of operation based on datagrams. But for applications that critically demand reliability, processes can use a connection-based mode of operation using a virtual circuit for guaranteed, reliable data delivery. The Berkeley Sockets Interface is the de facto standard API for developing networking applications that run over a wide range of different operating systems. The sockets API provides generic access to interprocess communication services. Windows Sockets (WinSock) is, however, essentially based on the Berkeley specifications.
A socket used to define an application program interface (API) is a generic communication interface for writing programs that use TCP or UDP. In practice, when used as an API, a socket is identified by the triple (protocol, local-address, local-process). The local-address is an IP address, and the local-process is a port number. Because port numbers are unique within a system, the port number implies the protocol (TCP or UDP). However, for clarity and ease of implementation, sockets used for an API include the protocol as well as the IP address and port number in defining a unique socket.
Corresponding to the two protocols (TCP and UDP), the Sockets API mainly recognizes two types of sockets: stream sockets and datagram sockets. Stream sockets make use of TCP, which provides connection-based, reliable, guaranteed transfer of data: all blocks of data sent between a pair of sockets are delivered in the same order that they were sent. Datagram sockets make use of UDP, which is connectionless; therefore, use of these sockets never guarantees delivery of data, nor is the order of data necessarily preserved. There is also a third type of socket provided by the Sockets API, known as raw sockets, which allows direct access to lower-layer protocols such as IP.
For stream communication, the functions send() and recv() are used to send and receive data
over the connection identifed by the s parameter. In the recv() call, the buf parameter (similar to
message in send call) points to the buffer for storing incoming data, with an upper limit on the num-
ber of bytes set by the message-length parameter. The close() and shutdown() calls are described on
the Support Material at www.routledge.com/9781032467238.
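The following sketch shows the stream-socket calls named above in a minimal TCP client. The server address 127.0.0.1 and port 7000 are only examples; a complete application would also need a corresponding server that has bound and is listening on that port.

    /* A minimal sketch of a TCP (stream-socket) client. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int s = socket(AF_INET, SOCK_STREAM, 0);          /* stream socket -> TCP */
        if (s < 0) { perror("socket"); return 1; }

        struct sockaddr_in srv = {0};
        srv.sin_family = AF_INET;
        srv.sin_port   = htons(7000);                     /* server port (example) */
        inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);

        if (connect(s, (struct sockaddr *)&srv, sizeof srv) < 0) {
            perror("connect"); close(s); return 1;
        }

        const char *message = "hello";
        send(s, message, strlen(message), 0);             /* send over connection s */

        char buf[128];                                    /* recv: buf plus an upper limit */
        ssize_t n = recv(s, buf, sizeof buf - 1, 0);
        if (n > 0) { buf[n] = '\0'; printf("reply: %s\n", buf); }

        close(s);                                         /* release the socket */
        return 0;
    }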
For datagram communication, the functions sendto() and recvfrom() are used. The sendto() call includes all the parameters of the send() call plus a specification of the destination address (IP address and port). Similarly, the recvfrom() call includes an address parameter, which is filled in when data are received.
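A minimal datagram counterpart is sketched below; again, the destination 127.0.0.1:7001 is only an example, and a peer would have to be bound to that port for a reply to arrive.

    /* A minimal sketch of datagram (UDP) communication with sendto()/recvfrom(). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int s = socket(AF_INET, SOCK_DGRAM, 0);            /* datagram socket -> UDP */
        if (s < 0) { perror("socket"); return 1; }

        struct sockaddr_in dst = {0};
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(7001);
        inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

        const char *msg = "ping";
        /* sendto(): the send() parameters plus the destination address */
        sendto(s, msg, strlen(msg), 0, (struct sockaddr *)&dst, sizeof dst);

        char buf[128];
        struct sockaddr_in from; socklen_t fromlen = sizeof from;
        /* recvfrom(): the address parameter is filled in when data arrive */
        ssize_t n = recvfrom(s, buf, sizeof buf - 1, 0,
                             (struct sockaddr *)&from, &fromlen);
        if (n > 0) { buf[n] = '\0'; printf("got '%s' from port %d\n", buf, ntohs(from.sin_port)); }

        close(s);
        return 0;
    }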
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
• X-terminals,
• A processor pool consisting of a very large number of CPUs, each with tens of megabytes of memory, and
• Servers, such as file and print servers.
The X-terminal is a user station consisting of a keyboard, a mouse, and a bit-mapped terminal connected to a computer. The nodes in the system use the processor pool model, which has the features described in Section 9.5. The object concept is central to Amoeba, and the objects supported this way are files, directories, memory segments, screen windows, processors, disks, and tape drives. This uniform interface to all objects provides generality and simplicity. Servers handle all objects in the system, both hardware and software, and objects are named, protected, and managed by capabilities.
An important feature of Amoeba, unlike most other distributed systems, is that it has no concept of a "home machine". The entire system appears to a user as a whole. Machines do not have owners. The initial shell starts up at login and runs on some arbitrary machine (processor), but as commands are started, in general, they do not run on the same machine as the shell. Instead, the system automatically looks for the most lightly loaded machine on which to run each new command. Similarly, pool processors are not "owned" by any one user. When a user submits a job, the OS dynamically allocates one or more processors from the pool instead of allowing it to use any specific workstation. Thus, a user's computation is spread across the hosts in the system, totally disregarding machine boundaries. This is possible since all the resources in the system are tightly integrated. When the job is completed, the allocated processors are released and go back to the pool. In the event of a shortage in the availability of processors, individual processors may be timeshared.
The Amoeba operating system model consists of two basic pieces: one piece is a microkernel, which runs on every pool processor and server, and the other piece is a collection of servers that provide most of the traditional operating system functionality. The microkernel performs four primary functions:
Like most other operating systems, Amoeba supports the concept of a process. In addition, it also
supports multiple threads of control within a single address space. The thread concept is extended
up to the kernel level. A process with one thread is essentially the same as a process in UNIX. Such a process has a single address space, a set of registers, a program counter, and a stack. Multiple threads might be used in a file server in which every incoming request is assigned to a separate thread to work on; each such thread can be absolutely sequential, even if it has to block waiting for I/O.
Amoeba provides two communication protocols. One protocol supports the client–server com-
munication model through RPC, while the other protocol provides group communication by using
either multicasting or reliable broadcasting. For actual message transmission, both these protocols
use an underlying Internet protocol called FLIP, already discussed in Section 9.13.3, which is a
network layer protocol in the ISO protocol stack.
Many functions performed by traditional kernels are implemented through servers that run on top of the kernel. Thus, actions like booting, process creation, and process scheduling are performed by servers. The file system is also implemented as a file server. This approach reduces the size of the microkernel and makes it suitable for a wide range of computer systems, from servers to pool processors.
As usual, Amoeba also has a file system. However, unlike most other operating systems, the choice of the file system is not dictated by the operating system. The file system runs as a collection of server processes. Users who do not like the standard one are free to write their own. The kernel does not know, and really does not care or want to know, which one is the "real" file system. In fact, different users, if desired, may use different incompatible file systems at the same time. The standard file system consists of three servers. The bullet server handles file storage, the directory server takes care of file naming and directory management, and the replication server handles file replication. The file system has been split into three components to achieve increased flexibility and to make each of the servers straightforward to implement.
Amoeba supports various other servers. One such is the boot server, which is used to provide a degree of fault tolerance to Amoeba by checking that all servers that are supposed to be running are in fact running and taking corrective action when they are not. A server capable of surviving crashes can be included in the boot server's configuration file. Each entry tells how often the boot server should poll and how it should poll. As long as the server responds correctly, the boot server has nothing to do and takes no further action. Similarly, although Amoeba uses the FLIP protocol internally to achieve high performance, sometimes it is necessary to speak TCP/IP, for example, to communicate with X-terminals, to send and receive mail to and from non-Amoeba machines, and to interact with other Amoeba systems via the Internet. To enable Amoeba to do these things, a TCP/IP server has been provided. Apart from these servers, Amoeba includes a disk server (used by the directory server for storing its arrays of capability pairs), various other I/O servers, a time-of-day server, and a random number server (useful for generating ports, capabilities, and FLIP addresses). The so-called Swiss Army Knife server deals with many activities that have to be done later, by starting up processes at a specified time in the future. Mail servers deal with incoming and outgoing electronic mail.
• Bridges: Bridges essentially operate at the bottom two layers of the ISO model (data-link and physical) and hence can be used to connect networks that use the same communication protocols above the data-link layer but may not have the same protocols at the data-link and physical layers. In other words, bridges feature high-level protocol transparency. Bridges, for example, may be used to connect two networks, one of which uses a fiber-optic communication medium and the other of which uses coaxial cable, but both networks must use the same high-level protocols, such as TCP/IP. The use of similar protocols at higher levels implies that bridges do not intervene in the activities carried out by those protocols in different segments. This means that bridges do not modify either the format or the contents of the frames when they transfer them from one network segment to another. In fact, they simply copy the frames, and while transferring data between two network segments, they can even use a third segment in the middle that cannot understand the data passing through it. In this case, the third, intermediate segment simply serves the purpose of routing. Bridges are also useful in network partitioning. When network traffic becomes excessive and the performance of a network segment starts to degrade, the segment can be broken into two segments, with a bridge used in between to interconnect them.
• Routers: Routers operate at the network layer of the OSI model and use the bottom three layers of the OSI model. They are usually employed to interconnect networks that use the same high-level protocols above the network layer. It is to be noted that the protocols used in the data-link and physical layers are transparent to routers. Consequently, if two network segments use different protocols at these two layers, a bridge must be used to connect them. While bridges are aware of the ultimate destination of data, routers only know which is the next router for the data to be transferred across the network. However, routers
are more intelligent than bridges in the sense that a router is essentially a processor that not only copies data from one network segment to another but whose primary function is to relay the data from the source to the destination system over the best possible route, chosen by using information in a routing table. A router, therefore, is equipped with a flow control mechanism that negotiates traffic congestion by making decisions to direct the traffic to a suitable, less congested alternative path.
• Brouters: To make the internetwork more versatile, network segments, apart from using routers, often use bridges to accommodate multiple protocols so that different protocols can be used at the data-link and physical layers. This requirement eventually resulted in the design of a kind of device that is a hybrid of a bridge and a router, called a brouter. These devices provide many of the distinct advantages of both bridges and routers. Although they are complex in design, expensive, and difficult to install, they are most useful for very complex heterogeneous internetworks in which the network segments use the same high-level communication protocol, and in such situations they yield the best possible internetworking solution.
• Gateways: Gateways operate at the top three layers of the OSI model. They are mainly used for interconnecting dissimilar networks that are built on totally different communication architectures (both hardware- and software-wise) and use different communication protocols. For example, a gateway may be used to interconnect two incompatible networks, one of which may use the TCP/IP suite and the other of which may use IBM's SNA (Systems Network Architecture) protocol suite. Since gateways are used to interconnect networks using dissimilar protocols, one of the major responsibilities of gateways is protocol translation and the necessary conversion, apart from occasionally also performing a routing function.
In addition, if the internetwork is heterogeneous in nature, which is usual, then building widely acceptable tools to manage the internetwork is equally difficult. However, several organizations, such as the ISO, the Internet Engineering Task Force (IETF), the Open Software Foundation (OSF), and others, are engaged in defining management standards for communications networks that would be interoperable on multivendor networks. These standards are developed based on several popular reference models and a few already-defined network management frameworks. Among many, three notable standards eventually came out that can be used as network management tools: Simple Network Management Protocol (SNMP), Common Management Information Protocol (CMIP), and Distributed Management Environment (DME).
• The SNMP (Shevenell, 1994; Janet, 1993) standard, introduced in the late 1980s, is essentially a simple and low-cost client–server protocol to monitor and control networks that use the IP suite. SNMP-based tools have been developed by most vendors dealing with network management elements. SNMP version 2, standardized by the IETF, is faster and more secure and is also capable of handling manager-to-manager communication.
• CMIP (Janet, 1993), developed by the OSI (ISO/CCITT), is a network management standard intended to facilitate interoperability and true integration where a large number of separate, isolated network management products and services offered by multiple vendors are present. CMIP is essentially based on a manager–agent model that facilitates communication between managing systems. CMIP-based products are comparatively costly, more complex, and require relatively more processing power to implement. That is why these products, in spite of having several nice features and providing richer functionality, have failed to attain the expected level of growth.
• The DME from the OSF is a set of standards designed based on SNMP, CMIP, and other
de facto standards and has been specifed for distributed network management products
that provide a framework to realize a consistent network management scheme across a
global multi-vendor distributed environment. Several products based on DME, as reported,
are still in the process of enhancement and further development.
systems may enjoy and exploit most of the advantages that the workstation/server model usually
offers.
An operating system that supports multiple systems acting cooperatively must fulfill several requirements that this architecture places on it. These requirements span three basic dimensions: hardware, control, and data. To satisfy them, several issues and the related mechanisms to address them need to be included in the DOS at the time of its development. In the following sections, we discuss some of the most common issues and the frequently used mechanisms that the distributed environment demands for smooth operation.
9.14.1 NAMING
Distributed operating systems manage a number of user-accessible entities, such as nodes, I/O devices, files, processes, services, mailboxes, and so on. Each object is assigned a unique name and resides at some location. At the system level, resources are typically identified by numeric tokens. Naming is essentially a lookup function, and its mechanism provides a means of mapping between the symbolic, user-given names and the low-level system identifiers used mainly in operating-system calls. Basically, a name is assigned that designates the specific object of interest, a location that identifies its address, and a route that indicates how to reach it. Each host is assigned a system-wide unique name, which can be either numeric or symbolic, and each process or resource in a host is assigned an id that is unique, at least within the host. This way, the pair (<host-name>, <process-id>), used as a token, uniquely identifies each specific process (object) at the chosen host and hence can be used as its name. When a process wishes to communicate with another process, it uses a pair like (xx, Pk) as the name of the destination process, where xx is the name of the host to which process Pk belongs. In some distributed systems, low-level tokens are globally unique at the system level. This helps to fully decouple names from locations. Such a name, however, must be translated into a network address to send a message over a network. The name service in a distributed system should possess certain desirable properties, which mainly include:
• Name transparency, which means that an object name should not divulge any hint about
its actual location.
• Location transparency implies that a name or a token should not be changed when the
related object changes its residence.
• Replication transparency, which ensures that replication of an object is kept hidden from users.
• Dynamic adaptability to changes in an object's location, which facilitates migration of objects (and thereby dynamic changes of their addresses); this is sometimes essential for load sharing, availability, and fault tolerance.
A distributed name service is essentially a mapping that may be multi-level and multi-valued. It may be organized as a hierarchy in which certain prefixes designate a specific sub-domain. The internet service represents an example of a naming hierarchy: each host connected to the internet has a unique address known as the IP (Internet Protocol) address. The IP address of a host with a given name is provided by the domain name system (DNS), which is actually a distributed internet directory service. DNS provides a name server in every domain, whose IP address is known to all hosts belonging to that domain. The name server contains a directory giving the IP address of each host in that domain. When a process in a host wishes to communicate with another process with the name (<host-name>, <process-id>), the host performs name resolution to determine the IP address of <host-name>. If <host-name> is not its own name, it sends the name to the name server in its immediate containing domain, which, in turn, may send it to the name server in its immediate containing domain, and so on, until the name reaches the name server of the largest domain contained in <host-name>. The name server then removes its own name from <host-name> and checks
whether the remaining string is the name of a single host. If so, it obtains the IP address of the host
from the directory and passes it back along the same route from which it received <host-name>;
otherwise the remaining name string contains at least one domain name, so it passes the remaining
name string to the name server of that domain, and so on. Once the sending process receives the
IP address of <host-name>, the pair (<IP address>, <process-id>) is used to communicate with the
destination process.
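The last step of this walk, in which a domain name server answers from its own directory, can be sketched as follows. The domain cs.univ.edu, the host names, and the addresses are all made up for illustration; a real resolver would also forward unresolved suffixes to sub-domain servers as described above.

    /* A minimal sketch of one level of the hierarchical name resolution walk. */
    #include <stdio.h>
    #include <string.h>

    struct entry { const char *host; const char *ip; };

    /* Directory of the (made-up) domain "cs.univ.edu". */
    static struct entry dir_cs[] = { { "alpha", "10.1.1.7" }, { "beta", "10.1.1.9" } };

    /* Resolve "<host>.cs.univ.edu" inside the cs.univ.edu name server. */
    static const char *resolve_cs(const char *host_name) {
        const char *suffix = ".cs.univ.edu";
        size_t len = strlen(host_name), slen = strlen(suffix);
        if (len <= slen || strcmp(host_name + len - slen, suffix) != 0)
            return NULL;                         /* not in this domain */
        char local[64];
        snprintf(local, sizeof local, "%.*s", (int)(len - slen), host_name);
        for (size_t i = 0; i < sizeof dir_cs / sizeof dir_cs[0]; i++)
            if (strcmp(dir_cs[i].host, local) == 0)
                return dir_cs[i].ip;             /* a single host: return its IP */
        return NULL;                             /* would be passed to a sub-domain server */
    }

    int main(void) {
        const char *ip = resolve_cs("alpha.cs.univ.edu");
        printf("alpha.cs.univ.edu -> %s\n", ip ? ip : "(not found)");
        /* The caller then uses (<IP address>, <process-id>) to reach the process. */
        return 0;
    }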
Since name resolution using name servers can, in general, be quite lengthy, many distributed systems attempt to improve performance by caching in each workstation a collection of its recently requested name-to-token translations. This technique speeds up repeated name resolution (in the same way a directory cache speeds up repeated references to the directory entry of a file): hits in the name cache result in quick translations that bypass the name server. The difficulty with client caching in distributed systems is that object migrations can invalidate the cached names. In addition, the name server of a domain is often replicated or distributed to enhance its availability and to avoid contention.
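A per-workstation name cache of this kind can be sketched very simply; the fixed-size table and round-robin replacement used here are assumptions, and a real cache would also need a policy for detecting stale entries after migration.

    /* A minimal sketch of a client-side name cache with explicit invalidation. */
    #include <stdio.h>
    #include <string.h>

    #define CACHE_SIZE 8

    struct cache_entry { char name[64]; char ip[16]; int valid; };
    static struct cache_entry cache[CACHE_SIZE];

    static const char *cache_lookup(const char *name) {
        for (int i = 0; i < CACHE_SIZE; i++)
            if (cache[i].valid && strcmp(cache[i].name, name) == 0)
                return cache[i].ip;                 /* hit: bypass the name server */
        return NULL;                                /* miss: ask the name server */
    }

    static void cache_insert(const char *name, const char *ip) {
        static int next;
        struct cache_entry *e = &cache[next++ % CACHE_SIZE];
        snprintf(e->name, sizeof e->name, "%s", name);
        snprintf(e->ip, sizeof e->ip, "%s", ip);
        e->valid = 1;
    }

    static void cache_invalidate(const char *name) {    /* object migrated */
        for (int i = 0; i < CACHE_SIZE; i++)
            if (strcmp(cache[i].name, name) == 0) cache[i].valid = 0;
    }

    int main(void) {
        cache_insert("alpha.cs.univ.edu", "10.1.1.7");
        printf("%s\n", cache_lookup("alpha.cs.univ.edu"));
        cache_invalidate("alpha.cs.univ.edu");           /* stale after migration */
        printf("%s\n", cache_lookup("alpha.cs.univ.edu") ? "hit" : "miss");
        return 0;
    }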
Apart from the name server that implements a naming service in a DOS, there are a variety of other methods, including static maps, broadcasting, and prefix tables. For more details, interested readers can consult Milenkovic (1992).
Load balancing refers to moving processes from heavily loaded workstations to other, relatively idle ones for execution, using the high-speed interconnection network that connects them. When process migration is used to balance the load, it requires a mechanism to gather load information, a distributed policy that decides when a process should be moved, and a mechanism to effect the transfer. The load-balance function is invoked by the scheduler, which can dynamically reassign processes among nodes to negotiate load variations.
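A very simple threshold-based migration policy can be sketched as follows; the node count, the queue-length thresholds, and the load figures are all illustrative assumptions, and a real policy would also have to gather this load information across the network.

    /* A minimal sketch of a threshold-based load-balancing decision. */
    #include <stdio.h>

    #define NODES 4
    #define HIGH  8     /* queue length above which a node tries to shed work */
    #define LOW   3     /* queue length below which a node accepts work       */

    static int load[NODES] = { 11, 2, 6, 1 };   /* gathered load information */

    static int pick_target(int self) {
        int best = -1;
        for (int n = 0; n < NODES; n++)
            if (n != self && load[n] < LOW && (best < 0 || load[n] < load[best]))
                best = n;
        return best;                             /* -1 means: keep the process local */
    }

    int main(void) {
        int self = 0;
        if (load[self] > HIGH) {
            int target = pick_target(self);
            if (target >= 0)
                printf("migrate one process from node %d to node %d\n", self, target);
        }
        return 0;
    }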
Fault tolerance can be assisted by maintaining multiple copies of critical code at strategic points in the system. In this arrangement, execution and service can be restored and continued fairly quickly in the event of a server failure by migrating the runtime state and reactivating the affected process at one of the backup nodes. Performance can be improved by distributing certain applications across the relatively idle workstations. In addition, performance may sometimes be improved by migrating processes to their related data sites when operations on large volumes of data are involved. System resources can be better utilized by using the process migration facility, particularly in situations where special-purpose resources are needed by a process but cannot be accessed remotely. In such cases, the requesting process itself can be migrated to the home site of the requested resource so that it can execute the appropriate routines locally at that end.
While process migration is considered a powerful mechanism and is often used to realize the ultimate objectives of a distributed environment, it may sometimes affect other parts of the system; therefore, the utmost care should be taken so that it interferes with the rest of the
system as little as possible. In particular, when process migration is being effected, the ultimate
aims of the system are:
• to minimize the time during which the migrating process is in transit and not yet operational,
• to reduce the additional load imposed on other nodes for executing such an activity, and
• to decrease the residual dependencies.
In spite of having several merits, the implementation of process migration suffers from several difficulties; the three major ones are:
Systems equipped with virtual memory at individual nodes are quite conducive to implementing process migration. Its main attractions are the built-in mechanism for demand loading of pages and the runtime virtual-to-physical address translation mechanism. This approach accelerates process migration, since it involves moving only a skeletal portion of the state that is actually referenced, along with only a few pages. Consequently, it reduces the load on the network considerably. The rest of the state can be page-faulted as usual across the network in the course of execution of the process at the destination node. The savings are thus appreciable, since only a fraction of the total size of a program's virtual memory is actually involved in migration in a given execution. A potential drawback of paging across the network is the increased load it places on the network in normal operation, caused by the page traffic between the workstations and the backing store (file or block servers, etc.).
Process migration, however, essentially requires a total relocation of the execution environment. This involves relocation of sensitive addresses present in the process, program-transparent redirection of file and device references, proper routing of messages and signals, appropriate management of remote communications and invocations made during and after the migration, and similar other aspects. In fact, efficient and effective implementation of process migration often depends to a large extent on naming and location transparency, apart from relying on many other aspects.
permit users to access remote resources. On the other hand, the basic goal of communication pro-
tocols for distributed systems is not only to allow users to access remote resources but to do so in a
manner transparent to the users.
Several accepted standards and well-implemented protocols for network systems are already available. For wide-area distributed systems, these protocols often take the form of multiple layers, each with its own goals and rules. However, well-defined protocols for distributed systems are still not mature, and as such, no specific standards covering the essential aspects of distributed systems are yet available. A few standard network protocols were described in a previous section. The essential requirements for protocols for distributed systems were already explained (Section 9.13.2), and a few standard communication protocols (Section 9.13.3) have been designed covering those aspects.
Interprocess communication between processes located in different nodes (sites) of a distributed system is often implemented by exchanging messages. The message-passing mechanism in a distributed system may be a straightforward application of messages as they are used in a single uniprocessor system. A different technique, known as the RPC, also exists that relies on message passing as a basic function to handle interprocess communication. Once the location of a destination process is determined, a message meant for it can be sent over the interconnection network. Message delivery is, however, prone to partial failures, which may be due to failures in communication links or faults in nodes located in the network path(s) to the destination process. Hence, processes must make their own arrangements to ensure reliable, fault-free delivery of messages. This arrangement takes the form of an interprocess communication protocol (IPC protocol), which is nothing but a set of rules and conventions that must be adhered to in order to handle transient faults.
Reliable message exchange between processes usually proceeds as follows. When a process sends a message, the protocol issues a system call at the sender's site that raises an interrupt at the end of a specific time interval. This interrupt is commonly called a timeout interrupt. When the message is delivered to a process, the destination process sends a special acknowledgement message to the sender process to inform it that the message has been received. If the timeout interrupt occurs before an acknowledgement is received, the protocol retransmits the message to the destination process and makes a system call to request another timeout interrupt. These actions are repeated a certain number of times, after which it is declared that there has been a fault, either due to a failure in the communication link or a fault at the destination node.
Now consider what happens if the message itself is received correctly, but the acknowledgement
is lost. The sender will retransmit the message, so the receiver will get it twice. It is thus essential
that the receiver be able to distinguish a new message from the retransmission of an old one. Usually
this problem is solved by putting consecutive sequence numbers in each original message. If the receiver gets a message bearing the same sequence number as the previous message, it identifies that
the message is a duplicate and hence ignores it. A similar arrangement may be used to ensure that a reply sent by the receiver process reaches the sender process. The message-passing mechanism was discussed in detail in Chapter 4. Distributed message passing and several of its associated design issues are explained in the following subsections.
• At-most-once semantics: A destination process either receives a message once or does not
receive it. These semantics are realized when a process receiving a message does not send
an acknowledgement and a sender process does not perform retransmission of messages.
• At-least-once semantics: A destination process is guaranteed to receive a message; how-
ever, it may receive several copies of the message. These semantics are realized when a
process receiving a message sends an acknowledgement and a sender process retransmits a
message if it does not receive an acknowledgement before a time-out occurs.
• Exactly-once semantics: A destination process receives a message exactly once. These semantics are obtained when acknowledgements and retransmissions are performed as in at-least-once semantics, but the IPC protocol recognizes duplicate messages and discards them.
The implications of these three semantics differ significantly, and their applications also vary, depending mostly on the situations in which they will be used.
At-most-once semantics result when a protocol does not use acknowledgements or retransmission. Generally, these semantics are used if a lost message does not create any serious problem for the correctness of an application or the application itself knows how to get around such difficulties. For example, an application that receives regular reports from other processes is quite aware when a message is not received as expected, so it may itself communicate with the sender whose message was lost and ask it to resend the message. These semantics are usually accompanied by highly efficient communication mechanisms because acknowledgements and retransmissions are not employed.
At-least-once semantics result when a protocol uses acknowledgements and retransmission, because a destination process receives a message more than once if the acknowledgement is lost in transit due to a communication failure or is delayed as a result of network contention or congestion. A message received for the second or a subsequent time is treated as a duplicate message. An application can use at-least-once semantics only if the presence and processing of duplicate messages cannot affect its correctness, as could happen, for example, through multiple updates of a database record instead of a single update. Adequate arrangements should be made in such database processing so that multiple appearances of a message can be detected before they cause any harm.
Exactly-once semantics result when a protocol uses acknowledgements and retransmissions but discards duplicate messages. These semantics hide transient faults from both sender and receiver processes; the price is that the IPC protocol has to bear the high communication overhead that results from handling faults and dealing with duplicate messages.
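The interplay of timeouts, retransmission, and duplicate discarding can be sketched as follows. This is a simulation with no real network: the "lost acknowledgement" is simply forced on the first attempt so that the retransmission and the duplicate-detection paths are both exercised.

    /* A minimal sketch of acknowledgement, retransmission on timeout, and
     * duplicate discarding combining to give exactly-once behaviour. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_RETRIES 3

    static int last_seq_seen = -1;      /* receiver state for duplicate detection */

    /* Receiver: deliver a message only once, but always acknowledge. */
    static bool receiver_accept(int seq) {
        if (seq == last_seq_seen) {
            printf("receiver: duplicate %d discarded\n", seq);
            return true;                /* still send the (possibly lost) ack */
        }
        last_seq_seen = seq;
        printf("receiver: message %d delivered to the process\n", seq);
        return true;
    }

    /* Sender: retransmit until an acknowledgement arrives or retries run out. */
    static void sender_send(int seq) {
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            bool ack = receiver_accept(seq);
            bool ack_lost = (attempt == 0);     /* simulate: the first ack is lost */
            if (ack && !ack_lost) {
                printf("sender: message %d acknowledged\n", seq);
                return;
            }
            printf("sender: timeout, retransmitting %d\n", seq);
        }
        printf("sender: declaring a fault for message %d\n", seq);
    }

    int main(void) {
        sender_send(1);   /* delivered once; the retransmission is discarded as a duplicate */
        return 0;
    }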
reliability properties as well as on the ability of a sender process to perform actions after sending a message.
Reliable and Unreliable Protocols: A reliable message-passing protocol is one that guarantees delivery of a message and, where applicable, its reply; in other words, they would not be lost. It achieves this through at-least-once or exactly-once semantics for both messages and their replies. To implement these semantics, it makes use of a reliable transport protocol or similar logic and performs error checking, acknowledgement, retransmission, and reordering of misordered messages. Since delivery is guaranteed, it is not strictly necessary to let the sending process know that the message was delivered. However, it might be useful to provide an acknowledgement to the sending process so that it is informed that the delivery has been carried out. If the facility fails to perform delivery (either in the event of a communication link failure or a fault in the destination system), the sending process is alerted accordingly about the occurrence of the failure. Implementation of a reliable protocol is comparatively complex and expensive because of the substantial communication overhead of the needed acknowledgements and retransmissions of messages and replies. At the other extreme, an unreliable protocol may simply send a message into the communication network and report neither success nor failure. It does not guarantee that a message or its reply will not be lost. It provides at-most-once semantics for messages and their replies. This alternative approach, however, greatly reduces the complexity, processing, and communication overhead of the message-passing facility.
Blocking and Non-Blocking Protocols: Blocking and non-blocking protocols are also called process-synchronous and process-asynchronous protocols, respectively. As already explained in Chapter 4, it is common and customary to block a process that executes a Receive system call if no message has yet been sent to it. At the same time, there is no definite reason to block a process that executes a Send system call. Thus, with a non-blocking Send, when a process issues the Send primitive, the operating system returns control to the process as soon as the message has been queued for transmission or a copy has been made in a system buffer. If no copy is made, any changes made to the message by the sending process before or while it is being transmitted are made at the risk of that process. When the message has been transmitted or copied to a safe place for subsequent transmission, the sending process is interrupted only to inform it that the message has been delivered or that the message buffer may be reused. Interrupt(s) may also be generated to notify the non-blocking sending process of the arrival of a reply or an acknowledgement so that it can take appropriate action. Similarly, a non-blocking Receive is issued by a process that then proceeds to run; when a message arrives, the process is informed by an interrupt, or it can poll for status periodically.
Non-blocking primitives (Send, Receive), when used in a message-passing mechanism, make the system quite efficient and flexible. But one of the serious drawbacks of this approach is that it is difficult to detect faults, and it is thereby equally hard to test and debug programs that use these primitives, because irreproducible, timing-dependent sequences can create delicate and difficult problems.
On the other hand, there are blocking, or synchronous, primitives. A blocking Send does not return control to the sending process until the message has been transmitted (unreliable service) or until the message has been sent and a due acknowledgement received (reliable service). Blocking of a sender process, however, may simplify a protocol, reduce its overhead, and also add some desirable features to its semantics. For example, if a sender process is blocked until its message is delivered to a destination process, the message would never have to be retransmitted after the sender is reactivated, so the message need not be buffered by the protocol once the sender is reactivated. Also, blocking of the sender helps provide semantics that are similar to those of the familiar conventional procedure call. Similarly, a blocking Receive does not return control until a message has been placed in the allocated buffer.
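The contrast between blocking and non-blocking Send can be seen concretely with POSIX sockets (one possible realization of these primitives, not the abstract primitives of the text): on systems that support the MSG_DONTWAIT flag, a non-blocking send returns immediately with EWOULDBLOCK once the kernel buffer is full, exactly where a blocking send would wait.

    /* A minimal sketch contrasting blocking and non-blocking Send on a local
     * socket pair. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int fd[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, fd) < 0) { perror("socketpair"); return 1; }

        char chunk[4096];
        memset(chunk, 'x', sizeof chunk);

        /* Non-blocking sends: keep queueing until the kernel buffer is full. */
        long queued = 0;
        for (;;) {
            ssize_t n = send(fd[0], chunk, sizeof chunk, MSG_DONTWAIT);
            if (n < 0) {
                if (errno == EWOULDBLOCK || errno == EAGAIN)
                    printf("buffer full after %ld bytes; a blocking send would now wait\n",
                           queued);
                else
                    perror("send");
                break;
            }
            queued += n;
        }

        close(fd[0]);
        close(fd[1]);
        return 0;
    }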
This protocol, the request (R) protocol, is a non-blocking (asynchronous) and unreliable one and is used by processes in which the destination process has nothing to return as a result of execution and the sending end requires no confirmation that the destination has received the request. Since no acknowledgement or reply message is involved in this protocol, only one message per call is transmitted (from sender to receiver). The sender normally proceeds immediately after sending the request message, as there is no need to wait for a reply message. The protocol provides may-be call semantics and requires no retransmission of the request message. These semantics, therefore, do not offer any guarantees; they are the easiest to implement but also probably the least desirable. Asynchronous message transmission with unreliable transport protocols is generally useful for implementing periodic update services. A node that misses too many update messages can send a special request message to the time server to get a reliable update after a maximum amount of time.
One of the reliable protocols for use by processes that exchange requests and replies is the
request–reply acknowledgement (RRA) protocol. Receipt of replies at the sending end ensures that
the destination process has received the request, so a separate acknowledgement of the request is not
required. The sender, however, sends an explicit acknowledgement of the reply.
The sender process is blocked until it receives a reply, so a single request buffer at the
sender site is suffcient irrespective of the number of messages a process sends out or the number
of processes it sends them to. The destination process is not blocked until it receives an acknowl-
edgement, so it could handle requests from other processes while it waits for acknowledgement.
Consequently, the destination site needs one reply buffer for each sender process. The number of
messages can be reduced through piggybacking, which is the technique of including the acknowl-
edgement of a reply in the next request to the same destination process. Since a sender process is
blocked until it receives a reply, an acknowledgement of a reply is implicit in its next request. That
is why the reply to the last request would require an additional explicit acknowledgement message.
The RRA protocol essentially follows at-least-once semantics because messages and replies can-
not be lost; however, they might be delivered more than once. At the same time, duplicate requests
would have to be discarded at the destination site to provide exactly-once semantics.
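The RRA message flow with piggybacking can be sketched as a toy simulation (no real transport is involved): each request carries the acknowledgement of the previously received reply, and only the final reply needs an explicit acknowledgement.

    /* A minimal sketch of the RRA flow with piggybacked acknowledgements. */
    #include <stdio.h>

    struct request { int seq; int ack_of_reply; };   /* ack piggybacked on the request */

    static void server_handle(struct request r) {
        if (r.ack_of_reply >= 0)
            printf("server: reply %d acknowledged, its buffer can be freed\n",
                   r.ack_of_reply);
        printf("server: processing request %d, sending reply %d\n", r.seq, r.seq);
    }

    int main(void) {
        struct request r1 = { 1, -1 };   /* first request: nothing to acknowledge    */
        server_handle(r1);               /* client blocks until reply 1 arrives       */
        struct request r2 = { 2, 1 };    /* next request piggybacks the ack of reply 1 */
        server_handle(r2);
        printf("client: explicit acknowledgement of the last reply (2)\n");
        return 0;
    }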
The request–reply (RR) protocol simply performs retransmission of a request when a timeout occurs. One of the shortcomings of a non-blocking version of the RR protocol is that the destination process has to buffer its replies indefinitely, because a sender does not explicitly acknowledge a reply; moreover, unlike in the RRA protocol, an acknowledgement is not implicitly piggybacked on the sender's next request, because the sender may have issued the next request before it received the reply to its previous request. Consequently, this protocol requires a large amount of buffer space. The RR protocol essentially follows at-least-once semantics. Consequently, duplicate requests and replies are to be discarded if exactly-once semantics are desired. If the requests issued by a sender are delivered to the destination process in the order in which they were issued, the duplicate-identification and discarding arrangement of the RRA protocol can be used with minor changes. A destination process preserves the sequence numbers and replies of all requests in a pool of buffers. When it recognizes a duplicate request through a comparison of sequence numbers, it searches for the reply to the request in the buffer pool using the sequence number and retransmits the reply if it is found in a buffer; otherwise it simply ignores the request, since a reply will be sent after the request is processed in the near future.
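The server-side reply buffer just described can be sketched as follows; the pool size and the form of the reply are assumptions, and a real server would also have a policy for reclaiming buffers (for example, via the piggybacked acknowledgements of the RRA variant).

    /* A minimal sketch of duplicate handling in the RR protocol: replies of
     * processed requests are cached by sequence number and retransmitted when
     * a duplicate request arrives. */
    #include <stdio.h>
    #include <string.h>

    #define POOL 4

    struct cached_reply { int seq; char reply[32]; int used; };
    static struct cached_reply pool[POOL];

    static void handle_request(int seq) {
        for (int i = 0; i < POOL; i++)
            if (pool[i].used && pool[i].seq == seq) {        /* duplicate request */
                printf("retransmitting cached reply: %s\n", pool[i].reply);
                return;
            }
        /* New request: process it and cache the reply. */
        static int next;
        struct cached_reply *slot = &pool[next++ % POOL];
        slot->seq = seq;
        snprintf(slot->reply, sizeof slot->reply, "result-of-%d", seq);
        slot->used = 1;
        printf("processed request %d, reply %s sent\n", seq, slot->reply);
    }

    int main(void) {
        handle_request(7);
        handle_request(7);   /* retransmitted request: served from the buffer pool */
        return 0;
    }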
The non-blocking RR protocol has relatively little overhead and can be simplified for use in applications involving idempotent computations. A computation is said to be idempotent if it produces the same result when executed repeatedly. For example, the computation k := 5 is idempotent, whereas the computation k := k + 1 is not. When an application involves only idempotent computations, data consistency would not be affected if a request is processed more than once, so it is possible to exclude arrangements for discarding duplicate requests. Similarly, read and write (not update) operations performed on a file are idempotent, so it is possible to employ the simplified RR protocol when using a remote file server. This has the additional advantage that the file server need not maintain information about which requests it has already processed, which helps it to be stateless and more reliable.
Brief details on the R, RRA, and RR protocols, with figures, are given on the Support Material at www.routledge.com/9781032467238.
Group communication supported by systems can also be viewed in other ways. These systems can
be divided into two distinct categories depending on who can send to whom. Some systems support
closed groups, in which only the members of the group can send to the group. Outsiders cannot send
messages to the group as a whole, although they may be able to send messages to individual mem-
bers. In contrast, other systems support open groups, in which any process in the system can send
to any group. The distinction between closed and open groups is often made for implementation rea-
sons. Closed groups are typically used for parallel processing. For example, a collection of processes
working together to play a chess game might form a closed group. These processes have their own
goal and do not intend to interact with the outside world. On the other hand, when the implementation
of the group idea is to support replicated servers, it becomes important that processes that are not
members (clients) can send to the group. In addition, the members of the group themselves may also
need to use group communication, for example, to decide who should carry out a particular request.
Of the three types of group communication mentioned, the first is the one-to-many scheme, also known as multicast communication, in which there are multiple receivers for a message sent by a single sender. A special case of multicast communication is broadcast communication, in which the message is sent to all processors connected to a network. Multicast/broadcast communication is very useful for several practical applications. For example, to locate a processor providing a specific service, an inquiry message may be broadcast. The processors providing that service will respond, and in this case it is not necessary to receive an answer from every one of them; finding one instance of the desired service is sufficient. Several design issues are related to these multicast communication schemes, such as group management, group addressing, group communication primitives, message delivery to receiver processes, buffered and unbuffered multicast, atomic multicast, various types of semantics, and
flexible reliability, and so on. Each such issue requires implementation of different types of strategies to realize the desired solution.
The many-to-one message communication scheme involves multiple senders but a single receiver. The single receiver, in turn, may be categorized as selective or nonselective. A selective receiver specifies a unique sender; a message exchange takes place only if that sender sends a message. In contrast, a nonselective receiver specifies a set of senders, and if any one of them sends a message to this receiver, a message exchange occurs. Thus, the receiver may wait, if it wants, for information from any of a group of senders rather than from one specific sender. Since it is not known in advance which member(s) of the group will have information available first, this behavior is clearly nondeterministic. In some situations, it is useful to be able to dynamically control the group of senders from whom to accept a message. For example, a buffer process may accept a request from a producer process to store an item in the buffer whenever the buffer is not full; it may also accept a request from a consumer process to get an item from the buffer whenever the buffer is not empty. To realize this behavior in a program, a notation is needed to express and control this type of nondeterminism. One such construct is the guarded command statement introduced by Dijkstra (1975). Since this issue is more related to programming languages than to operating systems, it is not discussed further here.
The many-to-many message communication scheme involves multiple senders and multiple receivers. Since this scheme implicitly includes the one-to-many and many-to-one message communication schemes, all the issues related to those two schemes apply equally to the many-to-many communication scheme. Moreover, the many-to-many communication scheme has an important issue of its own, referred to as ordered message delivery, which ensures that all messages are delivered to all receivers in an order acceptable to an application for its correct functioning. For example, assume that two senders send messages to update the same record of a database to two server processes, each holding a replica of the database. If the messages sent by the two senders are received by the two servers in different orders, then the final values of the updated record may be different in the two replicas. This shows that such an application requires all messages to be delivered in the same order to all receivers (servers) for accurate functioning.
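A small worked example makes the divergence concrete: applying the same two (made-up) update messages to two replicas of a record in different orders leaves them with different final values.

    /* Two replicas of a record receive the same updates in different orders. */
    #include <stdio.h>

    int main(void) {
        int replica1 = 100, replica2 = 100;   /* same initial record value */

        /* Replica 1 receives: "add 50" first, then "apply 10% interest". */
        replica1 += 50;
        replica1 += replica1 / 10;            /* 150 + 15 = 165 */

        /* Replica 2 receives the same messages in the opposite order. */
        replica2 += replica2 / 10;            /* 100 + 10 = 110 */
        replica2 += 50;                       /* 160 */

        printf("replica1 = %d, replica2 = %d\n", replica1, replica2);
        return 0;
    }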
Ordering of messages in the many-to-many communication scheme requires message sequencing, because many different messages sent from different senders may arrive at the receivers' ends at different times, and a definite order is required for proper functioning. A special message-handling mechanism is thus required to ensure ordered message delivery. Fortunately, there are some commonly used semantics for ordered delivery of multicast/broadcast messages, such as absolute ordering, consistent ordering, and causal ordering. The details of these schemes are outside the scope of this book.
Many other aspects need to be addressed concerning group communication, especially in relation to the formation of groups and their different attributes. Some of them are: how the groups will be formed, whether they will be peer groups or hierarchical groups, how group membership will be described and implemented, how a group will be addressed, whether the message-passing mechanism will obey atomicity, and, last but not least, the flexibility and scalability of the groups. Many other aspects still remain that need to be properly handled to realize effective and efficient group communication as a whole.
application. But the fact is that an IPC protocol, when developed independently, may often be found fit for a specific application yet not provide a foundation on which a variety of distributed applications can be built. Therefore, a need was felt for a general IPC protocol that could be employed to design various distributed applications. The concept of RPC came out at this juncture to cater to this need, and this facility was further enhanced to provide a relatively convenient mechanism for building distributed systems in general. Although the RPC facility never provides a universal solution for all types of distributed applications, it is still considered a comparatively better communication mechanism that is adequate for building a fairly large number of distributed applications.
The RPC has many attractive features, such as its simplicity, its generality, its efficiency, and, above all, its ease of use, which have eventually made it a widely accepted primary communication mechanism for handling IPC in distributed systems. The idea behind the RPC model is to make it similar to the well-known and well-understood ordinary procedure call model used for transfer of control and data within a program. In fact, the mechanism of RPC is essentially an extension of the traditional procedure call mechanism in the sense that it enables a call to be made to a procedure that does not reside in the same address space as the calling process. The called procedure, commonly called a remote procedure, may reside on the same computer as the calling process or on a different computer. The RPC is made to be transparent; the calling process should not be aware that the called procedure is executing on a different machine, or vice versa. Since the caller and callee processes reside in disjoint address spaces (possibly on different computers), the remote procedure has no access to data and variables in the caller's environment; therefore, the RPC mechanism uses a message-passing scheme to exchange information between the caller and the callee processes in the usual way.
Remote procedures are a natural fit for the client/server model of distributed computing. The caller–callee relationship can be viewed as similar to a client–server relationship in which the remote procedure is a server and a process calling it is a client. Servers may provide common services by means of public server procedures that a number of potential clients can call. The server process is normally dormant, awaiting the arrival of a request message. When one arrives, the server process extracts the procedure's parameters, computes the result, sends a reply message, and then awaits the next call message. This concept is exactly like that of shared subroutine libraries in single-computer (uniprocessor) systems. Thus, in both environments, the public routines being used must be made reentrant or otherwise kept protected from preemption by some form of concurrency control, such as mutual exclusion. However, several design issues associated with RPCs must be addressed; some common ones follow.
probably be many differences in the ways in which the messages (including text as well as data) are to be represented. If a full-fledged communication architecture is used to connect the machines, then this aspect can be handled by the presentation layer. However, in most cases, the communication architecture provides only a basic communication facility, entrusting the conversion responsibility entirely to the RPC mechanism. One of the best approaches to negotiate this problem is to provide a standardized format (probably an ISO format) for the most frequently used objects, such as integers, floating-point numbers, characters, and strings. Parameter representation using this standardized format can then be easily converted to and from the native (local) format on any type of machine at the time of passing a parameter and receiving the results.
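A minimal sketch of such a standardized representation, assuming a hypothetical big-endian wire format (a 32-bit integer, a 64-bit float, and a length-prefixed string), might look as follows; each machine converts between this format and its native one.

import struct

WIRE = ">idI"                                    # big-endian: int32, float64, uint32 string length

def encode_params(count, ratio, name):
    data = name.encode("utf-8")
    return struct.pack(WIRE, count, ratio, len(data)) + data

def decode_params(buf):
    count, ratio, n = struct.unpack_from(WIRE, buf, 0)
    head = struct.calcsize(WIRE)
    return count, ratio, buf[head:head + n].decode("utf-8")

wire = encode_params(3, 0.5, "alpha")
print(decode_params(wire))                       # (3, 0.5, 'alpha') on any machine, whatever its byte order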
connection thus established can then be used for near-future calls. If a given time interval passes with no activity on the established connection, then the connection is automatically terminated. In situations where there are many repeated calls to the same procedure within the specified interval, persistent binding exploits the existing connections to execute the remote procedure, thereby avoiding the overhead required to establish a new connection.
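The following sketch, assuming a hypothetical connection cache and a caller-supplied connect() routine, illustrates the persistent-binding idea: a recently used connection is reused, while an idle one is simply re-established.

import time

BINDING_TIMEOUT = 30.0                    # assumed idle interval, in seconds
_connections = {}                         # (host, procedure) -> (connection, time of last use)

def get_binding(host, procedure, connect):
    now = time.time()
    entry = _connections.get((host, procedure))
    if entry and now - entry[1] < BINDING_TIMEOUT:
        conn = entry[0]                   # persistent binding: reuse the established connection
    else:
        conn = connect(host)              # no recent activity: set up a new connection
    _connections[(host, procedure)] = (conn, now)
    return conn

print(get_binding("serverA", "add", lambda h: "connection-to-" + h))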
where <proc-id> is the id of a remote procedure, <message> is a list of parameters, and the call is implemented using a blocking protocol. The result of the call may be passed back through one of the parameters or through an explicit return value. Implementation of an RPC mechanism usually involves five elements: the client, the client stub, the RPCRuntime, the server stub, and the server. All of these are used to perform name resolution, parameter passing, and return of results during an RPC.
The beauty of the entire mechanism lies in keeping the client completely in the dark about the fact that the work was done remotely instead of being carried out by the local kernel. When the client regains control following the procedure call that it made, all it knows is that the results of the execution of the procedure are available to it. Therefore, from the client's end, it appears that the remote services are accessed (obtained) by making ordinary (conventional) procedure calls and not by using send and receive primitives. The entire details of the message-passing mechanism are kept hidden in the client stub as well as in the server stub, making the steps involved in message passing invisible to both the client and the server.
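How completely the messaging can be hidden is suggested by the following sketch, in which a hypothetical client-side proxy turns every attribute access into a stub, so that the client writes nothing but ordinary-looking calls; the send routine here stands in for the client stub and RPCRuntime.

class ServerProxy:
    """Hypothetical client-side proxy: each attribute becomes a stub that hides the messaging."""
    def __init__(self, send):
        self._send = send                          # stands in for client stub + RPCRuntime
    def __getattr__(self, procedure):
        def stub(*args):
            return self._send(procedure, args)     # marshalling and transport hidden in send()
        return stub

remote = ServerProxy(lambda proc, args: sum(args) if proc == "add" else None)
print(remote.add(2, 3))                            # prints 5; no send/receive primitives are visible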
The RPC is a powerful tool that can be employed as a foundation on which the building blocks of distributed computing are constructed. It has several merits, and its advantages over the conventional client–server paradigm stem especially from two factors. First, it may be possible to set up a remote procedure by simply sending its name and location to the name server, which is much easier than setting up a server. Second, only those processes that are aware of the existence of a remote procedure can invoke it, so the use of remote procedures inherently provides more privacy and thereby offers more security than its counterpart, the client–server paradigm. Its primary disadvantages are that it consumes more processing time to complete, it lacks flexibility, and the remote procedure has to be registered with a name server, so its location cannot be changed easily.
Brief details on RPC implementation with a figure are given on the Support Material at www.routledge.com/9781032467238.
and server need not be machines of the same type. The interface definition, however, contains a specification of the remote procedure and its parameters, which is required to be compiled by rpcgen, which reads special files denoted by a .x suffix. So, to compile an RPCL file, simply enter
rpcgen rpcprog.x
The Sun RPC scheme does not use the services of a name server. Instead, each site contains a port mapper that is similar to a local name server. The port mapper contains the names of procedures and their port ids. A procedure that is to be invoked as a remote procedure is assigned a port, and this information is registered with the port mapper. The client first makes a request to the port mapper of the remote site to find out which port is used by the remote procedure. It then calls the procedure at that port. However, a weakness of this arrangement is that a caller must know the site where a remote procedure exists.
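A minimal sketch of the port-mapper idea follows; the table, the service names, and the send_request routine are all hypothetical stand-ins for the actual registration and network exchange.

PORT_MAPPER = {"add_service": 3771}          # procedure name -> port registered at the remote site

def register(procedure, port):
    PORT_MAPPER[procedure] = port            # done once, when the remote procedure is installed

def send_request(site, port, procedure, args):
    return "called %s%s at %s:%d" % (procedure, args, site, port)   # stand-in for the real call

def call_remote(site, procedure, *args):
    port = PORT_MAPPER[procedure]            # step 1: ask the remote site's port mapper for the port
    return send_request(site, port, procedure, args)                # step 2: call the procedure there

print(call_remote("nodeA", "add_service", 2, 3))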
the beauty of RMI is that the client would have achieved execution of some of its own code at the server's site. Different clients running on different JVMs can use the same service r_eval to get different codes executed at the server's site.
Finally, the RPC model offers more flexibility and versatility while retaining its simplicity. Message-passing mechanisms, on the other hand, are relatively tedious, and moreover, the use of message primitives is somewhat unnatural and also confusing. That is why the RPC mechanism, among other reasons, is always preferred to a message-passing mechanism, at least for the sake of convenience.
The general approach to data access and the related caching of data works as follows: when a process on a node attempts to access data from a memory block in the shared-memory space, the local memory-mapping manager takes control to service the request. If the memory block containing the requested data is resident in the local memory, the request is serviced by supplying the requested data from the local memory. Otherwise a network block fault (similar to a page fault in virtual memory) is generated and control is passed to the operating system. The OS then sends a message to the node on which the desired memory block is located in order to get the block. The targeted block is migrated from the remote node to the client process's node, and the operating system maps it into the application's address space. The faulting instruction is restarted and can now proceed toward completion as usual. This shows that data blocks keep migrating from one node to another only on demand, but no communication is visible to the user processes. In other words, to the user processes, the system looks like a tightly coupled shared-memory multiprocessor system in which multiple processes can freely read and write the shared memory at will. Caching of data in local memory substantially reduces network traffic for a memory access on a cache hit. Significant performance improvement can thus be obtained if network traffic can be minimized by increasing the cache hit ratio, which can be attained by ensuring a high degree of locality of data accesses.
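The handling of a network block fault can be pictured with the following minimal Python sketch; the block directory, node names, and fetch_from routine are hypothetical simplifications of the OS-level machinery.

local_memory = {}                        # block number -> block data resident at this node
block_owner = {7: "nodeB"}               # directory: which node currently holds each block

def fetch_from(node, block_no):          # stand-in for the OS-to-OS message exchange
    return bytearray(4096)               # pretend the 4-KB block has arrived from the remote node

def access(block_no, offset):
    if block_no not in local_memory:                             # network block fault
        owner = block_owner[block_no]
        local_memory[block_no] = fetch_from(owner, block_no)     # migrate the block on demand
        block_owner[block_no] = "this-node"                      # the block now resides here
    return local_memory[block_no][offset]                        # restart the faulting access

print(access(7, 0))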
• Structure: This refers to the layout of the shared data in memory. In fact, the structure of the shared-memory space of a DSM system is not universal but varies, normally depending on the type of application the DSM system is going to support.
• Block Size: The block size of a DSM system is sometimes also referred to as its granularity. Possible units of a block are a few words, a page, or even a few pages, which are considered the unit of data sharing and the unit of data transfer across the network. Proper selection of block size is a major issue that determines the granularity of parallelism and the load generated in network traffic in the event of network block faults.
• Memory Coherence: DSM systems that allow replication of shared data items in the main memories of a number of nodes to handle many different situations often suffer from memory coherence problems (similar to the well-known cache coherence problem in uniprocessor systems and the traditional multi-cache coherence problem in shared-memory multiprocessor systems), which concern the consistency of a piece of shared data lying in the main memories of two or more nodes. To negotiate this problem, different memory coherence protocols can be used, depending on the assumptions and trade-offs that are made with regard to the pattern of memory access.
• Memory Access Synchronization: Concurrent accesses to shared data in a DSM system are a regular feature and require proper synchronization at the time of data access to maintain the consistency of shared data, apart from using only the coherence protocol. Synchronization primitives, such as locks, semaphores, event counts, and so on, are thus needed to negotiate situations of concurrent access to shared data.
• Replacement Strategy: Similar to the cache replacement strategy used in uniprocessor systems, a data block of the local memory sometimes needs to be replaced in certain situations using an appropriate strategy. In a situation when the local memory of a node is full and the needed data are not present in that memory (similar to the occurrence of a cache miss) at that node, this implies not only a fetch of the needed data block from a remote node but also a suitable replacement of an existing data block from the memory of the working node in order to make room for the new (fetched) one. This indicates that a suitable replacement strategy must be included when a DSM system is designed.
• Thrashing: In DSM systems, data blocks are often migrated between nodes on demand.
Therefore, if two nodes compete for simultaneous write access to a single data item, the
corresponding data block may then experience a back-and-forth transfer so often that much
time is spent on this activity, and no useful work can then be done. This situation in which
a data block is involved in back-and-forth journey in quick succession is usually known as
thrashing. A DSM system, therefore, must be designed incorporating a suitable policy so
that thrashing can be avoided as much as possible.
• Heterogeneity: When a DSM system is built for an environment in which the set of com-
puters is heterogeneous, then it must be designed in such a way that it can address all
the issues relating to heterogeneous systems and be able to work properly with a set of
machines with different architectures.
• Migration: The principle involved in the migration algorithm is to maintain only a single physical copy of the shared memory and to migrate it, whenever required, by making a copy at the site where access is desired. The floating copy of the shared memory may be relatively easily integrated into the virtual-memory addressing scheme of its resident host. Implementation of shared memory in this way appears to be simple, but it exhibits poor performance, particularly when the locus of shared-memory activity tends to move quickly among hosts. In addition, when two or more hosts attempt to access shared memory within the same time frame, apart from requiring an appropriate synchronization mechanism to mitigate the situation, it causes excessive migration of data that eventually leads to thrashing. This is especially troublesome when the competing hosts actually access non-overlapping areas of shared memory.
• Central Shared Memory: This approach basically keeps the shared memory maintained centrally, with only a single physical copy held by a central server. In this scheme, reads and writes to shared memory performed at any other site are converted to messages and sent to the central server for further processing. In the case of reads, the server returns the requested values. For writes, the server updates the master copy of the shared memory and returns an acknowledgement as usual. This implementation makes the operations on shared memory relatively easy, as there are only central, single-site semantics of reads and writes, where any target memory object always contains the most recent value.
• Read Replication: This scheme allows the simultaneous coexistence of multiple read copies of shared memory at different hosts. At most one host is allowed to have write access to shared memory. Hence, its operation is similar to a multiple-reader, single-writer scheme (the traditional readers/writers problem). A host intending to read shared memory obtains a local copy exercising its read access and is then able to satisfy repeated read queries locally at its own end. In order to maintain consistency, active reading must preclude (exclude) writing by invalidating and temporarily withholding write-access rights to shared memory elsewhere in the system. Naturally, multiple concurrent read copies are permitted, and they are commonly used in the implementation of distributed shared memory.
When the DSM design allows read replication, for the sake of performance improvement, it often divides the shared memory into logical blocks in which each block is assigned to an owner host, and reading and writing are then performed on a per-block basis in the manner described.
• Full Replication: This scheme allows the simultaneous coexistence of multiple read and write copies of portions of shared memory. Consistency of these copies is maintained by means of appropriate protocols that broadcast writes to all read and write copies, and only the affected blocks modify themselves accordingly. Global writes are handled in whatever different way is deemed fit.
Different systems, however, enjoy the freedom to implement their own patterns of distributed shared memory suitable for the system as well as the environment in which they are being used. Besides, some other factors that effectively influence such implementations are mainly: the frequency of reads versus writes, computational complexity, locality of reference of shared memory, and the expected number of messages in use. A minimal sketch of the read-replication scheme described above follows.
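In this sketch, the block tables, node names, and the invalidate message are hypothetical stand-ins: any number of read copies may exist, a write first invalidates them, and a read withdraws an outstanding write copy.

readers = {}        # block -> set of nodes holding a read copy
writer = {}         # block -> the single node holding a write copy, if any

def invalidate(node, block):                         # stand-in for the invalidation message
    print("invalidate block", block, "at", node)

def acquire_read(node, block):
    if block in writer:
        invalidate(writer.pop(block), block)         # reading withdraws the outstanding write copy
    readers.setdefault(block, set()).add(node)       # the node now holds a local read copy

def acquire_write(node, block):
    for r in readers.get(block, set()):
        invalidate(r, block)                         # invalidate every read copy held elsewhere
    readers[block] = set()
    writer[block] = node                             # at most one writer at a time

acquire_read("node1", 5)
acquire_read("node2", 5)
acquire_write("node3", 5)                            # prints invalidations for node1 and node2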
• Transparency: A DFS allows files to be accessed by processes of any node of the system, keeping them completely unaware of the location of their files in the nodes and disks of the system.
• Remote information sharing: A process on one node can create a file that can be accessed by processes running on any other nodes at different locations at any later point in time.
• File sharing semantics: These specify the rules of file sharing: whether and how the effects of file modifications made by one process are visible to other processes using the file concurrently.
• User mobility: A DFS normally allows a user to work on different nodes at different times without being tied to a specific node, thereby offering the flexibility to work at will with no need to physically relocate the associated secondary storage devices.
• Diskless workstations: A DFS, with its transparent remote file-access capability, allows a system to use diskless workstations in order to make the system more economical and handy, tending to be less noisy and thereby having fewer faults and failures.
• Reliability: A file accessed by a process may exist in different nodes of a distributed system. A fault in a node or a failure of the communication link between two nodes can severely affect file processing activity. Distributed file systems ensure high reliability by providing availability of files through file replication, which keeps multiple copies of a file on different nodes of the system to negotiate the situation of temporary failure of one or more nodes. Moreover, through the use of a stateless file server design, the impact of file server crashes on ongoing file processing activities can be minimized. In an ideal design, both the existence of multiple copies made by file replication and their locations are kept hidden from the clients.
• Performance: Of the many factors that affect performance, one is network latency, which is mostly due to data transfer caused by the processing of remote files. A technique called file caching is often used to minimize frequent network journeys, thereby reducing network traffic in file processing.
• Scalability: This specifies the ability to expand the system whenever needed. But when more new nodes are added to an existing distributed system, the response time to file system commands normally tends to degrade. This shortcoming is commonly addressed through techniques that localize data transfer to sections of a distributed system, called clusters, which have a high-speed LAN.
• Transparency of the file system: This means a user need not know much about the location of files in a system, and the name of a file should not reveal its location in the file system. The notion of transparency has four desirable facets that address these issues. They are:
• Location transparency: This specifies that the name of a file should not reveal its location. In fact, a user should not know the locations or the number of file servers and storage devices. In addition, the file system should be able to change the location of a file without having to change its name (path name). This is commonly known as location independence, which enables a file system to optimize its own performance. For example, when files are accessed from a node presently experiencing heavy network congestion, the result can be poor performance. The DFS in this situation may move a few files from the affected node to other nodes. This operation is called file migration. Location independence can also be used to improve the utilization of storage devices in the system. Most DFSs provide location transparency, but they seldom offer location independence. Consequently, files cannot be migrated to other nodes. This restriction deprives the DFS of an opportunity to optimize file access performance.
• Access transparency: This implies that both local and global files should be accessed in the same way, and the file system should not make any distinction between them. The file system should automatically locate the target file and make the necessary arrangements for transfer of data to the requesting site.
• Naming transparency: This means the name of a file should give no hint as to where the file is physically located. In addition, a file should be allowed to move from one node to another within the jurisdiction of a distributed system without changing the name of the file.
• Replication transparency: If multiple copies of a file exist on multiple nodes, both the existence of the multiple copies and their locations should be hidden from the users.
• User mobility: A DFS normally should allow a user to work on different nodes at different times without enforcing work on a specific node, thereby offering the flexibility to work at will with no need to physically relocate the associated secondary storage devices. The performance characteristics of the file system in this situation should not discourage users from accessing their files from workstations other than the one at which they usually work. One way to support user mobility may be to automatically bring a user's environment (the user's home directory and similar things) to the node where the user logs in at login time.
• High availability: A DFS should continue to function even in the event of partial failures of the system, caused mainly by node faults, communication link failures, or crashes of storage devices. However, such failures may sometimes cause a temporary loss of service to small groups of users and may result in an overall degradation in performance and functionality over the entire system. To realize high availability, the DFS must have multiple independent file servers (in contrast to a central data repository, which may become a performance bottleneck), and each must be equipped with multiple independent storage devices. Replication of files at multiple servers is a frequently used primary mechanism to ensure high availability.
• High reliability: A DFS should have proper arrangements to safeguard the system as far as possible even when stored information is lost. That is why the system should automatically generate backup copies of critical data that can help the system continue to function even in the face of failure of the original. Of the many different available techniques, stable storage is a popular one used by numerous file systems to attain high reliability.
• Data integrity: Multiple users in a DFS often compete to access a shared file concurrently, thereby creating the possibility of a threat to the integrity of the data stored in it. In this situation, requests from multiple users attempting to concurrently access a file must be properly synchronized by some form of concurrency control mechanism. Many different proven techniques are, however, available that can be used by a file system to implement concurrency control for the sake of data integrity.
• Security: A DFS, for accessing distant resources and communicating with other processes, relies on a communication network which may include public communication channels or communication processors that are not under the control of the distributed OS. Hence, the DFS is always exposed to different forms of threats and can expect attacks on any of its nodes and attached resources. That is why a DFS should be made secure, so that its users can be assured of the confidentiality and privacy of their data. The necessary security mechanisms must thus be implemented so that the information stored in a file system is protected against any unauthorized access. In addition, if rights to access a file are passed to a user, they should be used safely. This means the user receiving the rights should in no way be able to pass them on further if not permitted to do so.
• Fault tolerance: The occurrence of a fault often disrupts ongoing file processing activity and results in the file data and control data (metadata) of the file system becoming inconsistent. To protect the consistency of metadata, a DFS may employ a journaling technique like that in a conventional centralized time-sharing file system, or the DFS may use a stateless file server design, which needs no measures to protect the consistency of metadata when a fault occurs. To protect the consistency of file data, the DFS may provide transaction semantics, which are useful in implementing atomic transactions, so that an application may itself perform fault tolerance if it so desires.
• Heterogeneity: The scalability and openness of a distributed system inevitably require it to be a heterogeneous one. This is perhaps the most general formation, consisting of interconnected sets of dissimilar hardware and software (often independent computers) that provide the flexibility of employing different computer platforms, interconnected by a wide range of different types of networks, for a diverse spectrum of applications run by different types of users. Consequently, a DFS should be designed in a way that allows a variety of workstations with different internal formats to participate in effective sharing of files. Given the heterogeneity of a distributed system, the design of a DFS on such a platform is critically difficult to realize, yet it is considered one of the prime issues at the time of designing a DFS.
• Simplicity and ease of use: Several important factors need to be incorporated into the design of a DFS, but they, on the other hand, work against keeping the file system simple and easy to use. Still, the most important consideration is that the user interface to the file system must be as simple as possible. This means the semantics of the file processing commands should be similar to those of a file system for a traditional centralized time-sharing system, and the number of commands should be as small as possible. In addition, while the DFS should be able to support the whole range of applications commonly used by a community of users, it must at the same time be user-friendly and easy to understand, even for a not very skilled user.
• Availability: This refers to the fact that a file (or a copy of it) can be opened and accessed by a client based on its path name (related to its locations). On the other hand, the ability to access a file requires only that the client and server nodes be functional, because a path between the two is guaranteed by the resiliency of the network. To resolve the path name given for a target file, the DFS would usually perform resolution of all path components in the client node itself. In this regard, replication of directories existing in remote nodes, if they appear in the path name components, would be carried out in the client node to improve the availability of a file.
• Robustness: The fault tolerance of a file system depends on its robustness, irrespective of its implementation. A file is said to be robust if it can survive faults, caused mainly by the crashing of storage devices, in a guaranteed manner. Redundancy techniques provide stable storage devices; one such technique is disk mirroring, used in RAID level 1, which maintains multiple copies (usually two) of the data. The backup copy is always kept updated but is normally passive in the sense that it does not respond to client requests. Whenever the primary fails, the backup copy becomes dominant and takes over. Proven techniques using comparison and verification of the primary and backup to ensure their sameness are employed to keep the two in synchrony and to detect failures at the same time. Such stable storage usually works well for applications that require a high degree of fault tolerance, such as atomic transactions.
• Recoverability: This refers to the ability of a file to roll back to its most recent consistent state when an operation on the file fails or is aborted by the user. Of the many available proven mechanisms, one is the atomic update technique used in transaction processing, which can be exploited in file implementation to make a file recoverable. An atomic transaction either completes successfully and transforms a file into a new consistent state or fails without changing the state of the target file. In effect, the previous consistent state of the file is recovered in the event of transaction failure. Generally, to make files recoverable, updates are not performed in place; rather, updates are tentatively written into different blocks, called shadow pages. If the transaction completes successfully, it commits and makes the tentative updates permanent by updating the directory and index structures to point to the new blocks, discarding the old ones.
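A minimal sketch of this shadow-page idea is given below; the block store and index are hypothetical, and the final index switch is assumed to be atomic.

blocks = {0: b"old-A", 1: b"old-B"}       # the file's current (consistent) data blocks
index = {0: 0, 1: 1}                      # directory/index: logical block -> physical block

def atomic_update(updates):
    shadow = {}                           # tentative writes go to fresh (shadow) blocks
    next_free = max(blocks) + 1
    for logical, data in updates.items():
        blocks[next_free] = data          # never overwrite the old block in place
        shadow[logical] = next_free
        next_free += 1
    index.update(shadow)                  # commit: switch the index in one (assumed atomic) step

atomic_update({0: b"new-A"})
print(blocks[index[0]])                   # b'new-A'; a crash before the commit leaves the old state intact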
All three primary attributes of a file system mentioned are independent of one another. Thus, a file may be recoverable without necessarily being robust or available. Similarly, a file may be robust and recoverable without being available. Likewise, a file may be available without being recoverable or robust. This means different techniques can be used to ensure each of these criteria individually.
Different fault tolerance techniques are used for faults (availability) that arise during an open operation and those that occur after a file has been opened (mainly access operations). A DFS usually maintains many copies of the information needed for path name resolution and many copies of a file to negotiate faults. However, availability techniques are very complex and even more expensive if faults that occur after opening and during file processing (file access) are to be tolerated (quorum-based fault tolerance techniques to handle replicated data in such cases can be used). Hence, only a few distributed systems handle these faults. Moreover, the communication media used in many LANs, with their inherent broadcast nature, also provide numerous innovative variations in implementing fault tolerance in distributed systems. For example, processes may checkpoint themselves across the network, and a special node may be given charge of eavesdropping on and recording all interprocess messages. Thus, in the event of a node crash, the affected process may be reconstructed from its checkpoint state and brought up to date by having all outstanding messages relayed to it.
A few commonly used fault tolerance techniques employed by a DFS are cached directories and file replication, which address faults in a file server and in intermediate nodes during an open operation. The stateless file server (see Section 9.9.2) design, however, addresses faults in a file server during file processing.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
the file and a user process to simplify the implementation of file operations (such as a read or write). This design approach of a file system, in which the state information pertaining to the operation a user performed is maintained from one access request to the next, is commonly referred to as stateful design. This recorded current state information is subsequently used when executing the next request. On the other hand, if no state information concerning a file processing activity is maintained by the file system while servicing a user's request, the design is referred to as stateless.
When a client crashes, the file processing activity must be abandoned, and the file has to be restored to its previous consistent state so that the client can restart its file processing activity afresh. In fact, the client and the file server share a virtual circuit which holds the file processing state and resources, like file server metadata, and these become orphans when either a client or server crashes. The crash actually breaks the virtual circuit, so the actions have to be rolled back and the already-created metadata have to be destroyed. This can, however, be carried out, perhaps by the use of a client–server protocol that implements transaction semantics. If a DFS does not provide transaction semantics, a client process would then have to make its own arrangements to restore the file to a recent previous consistent state.
On the other hand, when a file server crashes, the state information pertaining to file processing activity stored in the server metadata is immediately lost, so the ongoing file processing activity has to be abandoned, and the file then has to be restored to its recent previous consistent state to make it once again workable.
Therefore, the service paradigm in a stateful server requires detection as well as complex crash recovery procedures. Both client and server individually need to reliably detect crashes. The server must expend the added effort to detect client crashes so that it can discard any state it is holding for the client and free its resources, and the client likewise must detect server crashes so that it can perform the necessary error-handling activities. Therefore, in order to avoid both these problems that occur with a stateful server in the event of failures, the file server design in a DFS has been proposed to be stateless to negotiate these situations.
• Stateless File Servers: A stateless server does not maintain any state information pertaining to file processing activity, so there exists no implied context between a client and a file server. Consequently, a client must maintain the state information concerning a file processing activity, and therefore, every file system call from a client must be accompanied by all the necessary parameters to successfully carry out the desired operations. Many actions traditionally performed only at file open time are repeated at every file operation. When the client receives the file server's response, it assumes that the file operation (read/write) requested by it has been completed successfully. If the file server crashes or a communication error/failure occurs by this time, time-outs occur and retransmission is carried out by the client. The file server after recovery (recovery is mostly trivial, maybe simply by
A stateless file server, however, cannot detect and discard duplicate requests, because these actions require state information; therefore, it may service a request more than once. Hence, to avoid any harmful effects of reprocessing, client requests must be idempotent. Read/write requests are by nature idempotent, but directory-related requests like the creation and deletion of files are not idempotent. Consequently, a client may face an ambiguous or misleading situation if a file server crashes and is subsequently recovered during a file processing activity.
Two distinct advantages of the stateless server approach are: when a server crashes while serving a request, the client need only resend the request until the server responds; and when a client crashes during request processing, no recovery is necessary for either the client or the server. A stateless server must, therefore, only have the ability to carry out repeatable operations, and data will never be lost due to a server crash.
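The idempotency requirement can be seen in the following sketch of a stateless read: every request carries the file handle, offset, and count, so resending it after a time-out is harmless. The file store and handle here are hypothetical.

FILES = {"fh-17": b"abcdefghij"}                 # hypothetical server store keyed by an opaque file handle

def stateless_read(file_handle, offset, count):
    data = FILES[file_handle]                    # implicit open: no per-client state is kept
    return data[offset:offset + count]           # the same request always yields the same answer

print(stateless_read("fh-17", 2, 4))             # b'cdef'
print(stateless_read("fh-17", 2, 4))             # repeating the request after a time-out changes nothing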
Two potential drawbacks of the stateless server approach are the difficulty of enforcing consistency and a substantial performance penalty in comparison with stateful servers. The performance degradation is mainly due to two reasons. First, the file server, as already mentioned, opens a file at every file operation and passes the state information back to the client. Second, when a client performs a write operation, reliability considerations dictate that the file data should be written immediately, as a direct write-through, into the disk copy of the file at the server. As a result, the file server cannot employ buffering, file caching, or disk caching in order to speed up its own operation. Other distinct drawbacks of this server are that it requires longer request messages from its clients, which must include all the necessary parameters to complete the desired operation, since it does not maintain any client state information. Consequently, request processing becomes slower, since a stateless server does not maintain any state information to speed up processing. However, a hybrid form of file server can be designed that avoids repeated file open operations. A stateless file service can be implemented on top of a datagram network service. It is used in Sun Microsystems' NFS system.
FIGURE 9.11 A schematic representation of the fundamentals of file processing in a distributed file system (DFS) environment without using any special DFS techniques.
the request to the file server. The file server opens the file and builds the FCB. Now, whenever the client performs a read or write operation on the file, the operation is usually implemented through message passing between the client interface and the file server interface. I/O buffers that exist for the file at the server's end participate in the operation, and only one record at a time is passed from the file server to the client.
The DFS can be organized in two ways. At one extreme, the server can be equipped with most of the layers of the file system and connected to (diskless) workstations on a fast LAN that could offer a response time close to that of a local disk. Due to the heavy reliance on the server, the remote-access method tends to increase the load both on the server and on the network. Consequently, DFS performance tends to decrease as the volume of network traffic generated by remote-file accesses increases. In fact, network latencies can completely overshadow the efficiency of access mechanisms even when only a small fraction of data accesses are non-local. This fact motivates undertaking certain measures so that network traffic can be reduced by minimizing data transfers over the network during file processing activity.
Consequently, this gives rise to the concept at the other extreme, in which the client itself contains most of the layers of the file system, with the server providing only a low-level virtual disk abstraction. In this approach, a client can view or use the server simply as an unsophisticated repository of data blocks. The client basically checks out the files of interest, or portions thereof, from the server and performs most of the processing of files locally. Upon closing, or some time thereafter, files are returned to the server only for permanent storage or for optional sharing. In effect, the working set of files (i.e. the portions of files under processing) is cached by the client for processing using its own memory and local disk and then returned to the server only for permanent storage. Under this scheme, the server may be accessed only when a client cache miss occurs.
The client caching approach offers several potential benefits. As a result of the local availability of the cached data, access speed can be significantly improved over remote access, especially for data that are repeatedly referenced. Consequently, the response time is noticeably improved and performance is automatically enhanced due to the reduced dependence on the communication network, thereby avoiding the delays imposed by the network as well as decreasing the load on the server. In addition, one positive advantage of the client caching approach is that it is very conducive to fault tolerance. But this approach, on the other hand, suffers from several distinct drawbacks. First, local file processing requires relatively costly and powerful workstations equipped with larger memories and possibly local disks. Second, concurrent access to shared files by client-cached workstations may lead to the well-known cache inconsistency (cache coherence) problem, which may require the use of additional special client/server protocols to resolve.
In fact, whatever design approach is followed, the DFS design must be scalable, which means that DFS performance should not degrade much with an increase in the size of the distributed system architecture. Scalability is considered an important criterion, especially for environments that gradually grow with time; a lack of scalability in these situations often hinders performance from attaining the desired level.
Several techniques are commonly used in the operation of a DFS that enable the DFS to achieve high performance. Some notable ones are:
• Efficient File Access: File access may be said to be efficient if it provides a lower average response time to client requests or a higher throughput of client requests. These criteria, in turn, depend mostly on the structural design of a file server as well as how its operation is ordered. Two common server structures exist that provide efficient file access.
• Multi-Threaded File Server: In this file server, several threads of operation exist; each thread is capable of servicing one client request at any point in time. Since file processing is basically an I/O-bound activity, the operation of several of these threads servicing different client requests can be performed simultaneously, causing no harm and resulting in fast response to client requests and high throughput at the same time. Moreover, as the number of client requests that are active at any instant increases, the number of threads can also be varied to handle all of them individually, subject, of course, to the availability of the OS resources required to support them, such as thread control blocks (TCBs).
• Hint-Based File Server: A hint-based file server is basically a hybrid design in the sense that it provides features of both a stateful and a stateless file server. Whenever possible, it operates in a stateful manner for the sake of increased efficiency. At other times, it operates in a stateless manner. A hint is basically a piece of information relating to an ongoing file processing activity, for example, the id of the next record in a sequential file that would be accessed in a file processing activity. The file server always maintains a collection of hints in its volatile storage. When a file operation is requested by a client, the file server checks for the presence of a hint that would help in its processing. If a hint is available, the file server effectively uses it to speed up the file operation, which automatically enhances performance; otherwise, the file server operates in a stateless manner: it opens the file and uses the record/byte id provided by the client to access the required record or byte. In either case, after completing the file operation, the server inserts a part of the state of the file processing activity in its volatile storage as a hint and also returns it to the client, as is done in the case of a stateless file server. However, the overall efficiency of this file server depends on the number of file operations that are assisted by the presence of hints, as the sketch below suggests.
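A minimal sketch of a hint-based server, assuming a hypothetical hint table and an open_file routine standing in for the expensive stateless setup work:

STORE = {"f1": ["r0", "r1", "r2"]}
open_cache = {}                                   # per-file data retained opportunistically
hints = {}                                        # (client, file) -> id of the next expected record

def open_file(file_id):                           # stand-in for the costly open/lookup work
    return STORE[file_id]

def read_next(client, file_id, record_id):
    key = (client, file_id)
    if hints.get(key) == record_id and file_id in open_cache:
        records = open_cache[file_id]             # hint hit: stateful fast path, no re-open
    else:
        records = open_file(file_id)              # no usable hint: behave statelessly
        open_cache[file_id] = records
    hints[key] = record_id + 1                    # save a hint for the next sequential request
    return records[record_id]

print(read_next("clientA", "f1", 0))              # stateless path: opens the file
print(read_next("clientA", "f1", 1))              # served with the help of the hint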
• File Caching: As already explained, in more balanced systems, the client holds some portions of the data from a remote file in a buffer (main memory) or on a local disk in its own node, called the file cache. This file cache and the copy of the file on a disk in the server node form a memory hierarchy, so the operation of the file cache and its related advantages are similar to those of a CPU cache and virtual memory. Chunks of file data are loaded from the file server into the file cache. To exploit the advantages of spatial locality (spatial locality refers to the tendency of execution to involve a number of memory locations that are viewed as clustered; this reflects the tendency of processing to access instructions/data sequentially, such as when processing a file of data or a table of data), each chunk is made large enough to service a few file accesses made by a client. Observations and studies on distributions of file size reveal that the average file size is usually small; hence, even whole-file caching is feasible. Considering the fact that chunk size usually varies per client application, the size of the chunk is frequently taken as 8 Kbytes, which probably includes entire files from many different applications, and the file-cache hit ratios with this size are observed to exceed even 0.98. In addition, a DFS may sometimes use a separate attributes cache to cache information relating to file attributes. However, there are several issues that need to be addressed in the design of a file cache, and they eventually play an important role in DFS performance. Some of the key issues in this regard are:
• File cache location: Clients can cache data in main memory, on a local disk, or in both at the client node. Organizing the cache in memory would definitely provide faster access to file data but would result in low reliability, because in the event of a crash of the client node, the entire file cache would be lost, and it may contain modified file data that are yet to be written back to the file copy in the server. Alternatively, the cache can be organized on the local disk in the client node. While this approach would slow down file data access, it would offer better reliability, since in the event of a client node crash, all the data in the file cache would remain unaffected. The reliability of a file cache organized on a local disk can be further enhanced with the use of RAID techniques, like disk mirroring.
• File update policy: cache coherence: The read-only blocks of the cache may be kept in memory, while the read–write blocks of the cache are to be written to the local disk with no delay. When such a write operation is performed on the local disk, the modified file data would have to be written immediately into the file copy at the server at the same time. This policy, called write-through, is probably the simplest to use to enforce concurrency control. The write-through method is also reliable, because it can be implemented as a transaction or an atomic operation to ensure that it completes (a transaction is a sequence of operations on one or more objects that transforms a [current] consistent state of the database/file into a new consistent state; temporary inconsistent values that may occur during the execution of a transaction are hidden by making the affected entities inaccessible to other clients). However, this method temporarily delays some of the conflicting writers. To avoid delaying the client, the system may perform writes only to the cache and thus quickly release writers without making them wait for the disk writes to complete; the update of the file copy (disk write) can be performed at a later time. This policy is called the delayed write policy and can also result in reduced disk I/O, particularly when requests are repeated for the same data blocks which are already in memory. But the problem with delayed writes is that main memory is usually volatile, and any failure in the node system in the meantime can seriously corrupt and damage vital file system data. Adequate arrangements should thus be provided to ensure that the modified data will not be lost even if the client node fails in the meantime. To ameliorate the problematic situation that may arise from a delayed write, some systems flush write-backs to disk at regular time intervals.
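The two update policies can be contrasted with the following sketch, in which the cache, the server copy, and the flush step are hypothetical simplifications.

cache = {}           # client-side file cache: block -> data
server_copy = {}     # the file copy held at the server
dirty = set()        # blocks modified under delayed write but not yet written back

def write_through(block, data):
    cache[block] = data
    server_copy[block] = data         # the server copy is updated before the writer is released

def delayed_write(block, data):
    cache[block] = data               # the writer is released immediately...
    dirty.add(block)                  # ...and the server update is deferred

def flush():                          # periodic write-back bounds what a crash can lose
    for block in dirty:
        server_copy[block] = cache[block]
    dirty.clear()

delayed_write(3, b"v2")
flush()
print(server_copy[3])                 # b'v2' once the deferred write-back has run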
While client caching offers performance advantages, it at the same time relates to the well-known critical problem of cache inconsistency (or cache coherence), irrespective of the write policy. Thus, a DFS should enforce some level of cache consistency. The consistency of caches holds good when they contain exact copies of the remote data. Inconsistency in a cache is caused when multiple clients cache portions of the same shared file and different updates are performed concurrently on the local copies lying in each individual cache. These uncontrolled updates and inconsistent caches can result in the creation of several different and irreconcilable versions of the same file. Inconsistency in a cache also arises when the remote data are changed by one client that modifies the file and the corresponding local cache copies already cached by other clients consequently become invalid but are not removed. The root of this problem perhaps lies in the choice of policy decisions adopted for write operations that change a file copy. However, the choice of a particular policy as well as the consistency guarantees are largely influenced by the nature of the consistency semantics used.
The primary objective is to somehow prevent the presence of invalid data in client caches; hence, a cache validation function is required that identifies invalid data in client caches and deals with them in accordance with the file-sharing semantics (consistency semantics) of the DFS. File-sharing semantics usually specify the visibility of updates when multiple clients are accessing a shared file concurrently. For example, when UNIX semantics are used, file updates made by a client should be immediately visible to other clients of the file, so that the cache validation function can either refresh invalid data or prevent their use by a client. It should be noted that the file-sharing semantics suitable for centralized systems are not necessarily the most appropriate for distributed systems.
• Cache validation: Cache validation can be approached in two basic ways: client-initiated validation and server-initiated validation. Client-initiated validation is performed by the cache manager at a client node. At every file access by a client, it checks whether the desired data are already in the cache. If so, it checks whether the data are valid. If the check succeeds, the cache manager provides the data from the cache to the client; otherwise, it refreshes the data in the cache before supplying them to the client. Such frequent checking can be inefficient, since it consumes processing cycles of both the client and the server. In addition, this approach leads to additional cache validation traffic over the network at every access to the file, resulting in an ultimate increase in the existing traffic density over the network. Such traffic can, however, be reduced if the validation is performed periodically rather than at every file access, provided such validation is not inconsistent with the file-sharing semantics of the DFS. This approach is followed by Sun NFS. Alternatively, in the server-initiated approach, the file server keeps track of which client nodes contain what file data in their caches and uses this information in the following way: when a client updates data in some part k of a file, the file server detects the other client nodes that have k in their file caches and informs their cache managers that their copies of k have become invalid so that they can take appropriate action. Each cache manager then has the option of either deleting the copy of k from its cache or refreshing its cache, either immediately or at the first reference to it.
A simple and relatively easy method to detect invalid data is through the use of time-stamps that indicate when the file was last modified. A time-stamp is associated with a file and with each of its cached chunks. When a chunk of a file is copied into a cache, the file's time-stamp is also copied along with the chunk. The cached chunk is declared invalid if its time-stamp is smaller than the file's time-stamp at any time. This way, a write operation in some part k of a file by one client can invalidate all copies of k in other clients' caches. Each cache manager in that situation deletes the copy of k from its cache and refreshes it by reloading it at the time of its next reference.
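A sketch of this time-stamp check, with hypothetical server and cache tables, follows.

file_timestamp = {"f1": 10}                            # server side: when each file was last modified
cached = {("f1", 0): {"data": b"...", "ts": 10}}       # client cache: (file, chunk) -> data and time-stamp

def is_valid(file_id, chunk_no):
    entry = cached[(file_id, chunk_no)]
    return entry["ts"] >= file_timestamp[file_id]      # stale if the file has been modified since caching

def server_write(file_id, new_time):
    file_timestamp[file_id] = new_time                 # a write advances the file's time-stamp

print(is_valid("f1", 0))      # True
server_write("f1", 11)
print(is_valid("f1", 0))      # False: the chunk must be refreshed at its next reference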
Because it is expensive, the cache validation approach should be avoided if possible. One way to avoid the cache validation overhead is to use file-sharing semantics like session semantics, which do not require that updates made by one client be visible to clients in other nodes. This feature avoids the need for validation altogether. Another approach may be to disable file caching if a client opens a file in update mode. All accesses to such a file are then directly implemented at the server node. Now, all clients wishing to use the file would have to access the file in a way similar to remote file processing.
• Chunk size: Similar to page size in paging systems, chunk size in the file cache is also a significant factor and is considered a vital performance metric in the file caching approach. Determination of the optimum chunk size requires balancing several competing factors. Usually, the chunk size should be large so that spatial locality of file data contributes to a high hit ratio in the file cache. Use of a large chunk size, on the other hand, means a higher probability of data invalidation due to modifications performed by other clients, which eventually leads to more cache validation overhead and thereby more delays than when a small chunk size is employed. So, the size of the chunk in a DFS should be decided as a compromise between these two conflicting considerations. Moreover, a fixed chunk size may not fit all the clients of a DFS. That is why some DFSs have adapted different chunk sizes to each individual client.
• Scalability: The scalability of a DFS depends to a large extent on the architecture of the computing system on which it runs and is mainly achieved through certain techniques that restrict most of the data traffic generated by file-processing activities to a small section of the distributed system called a cluster of nodes or simply a cluster. Clusters and their characteristics are discussed in the next section. In fact, when nodes are organized in the form of clusters, they exhibit certain important features that make this approach quite efficient and effective in terms of scalability. For instance, clusters typically represent subnets in a distributed system in which each cluster is a group of nodes connected by a high-speed LAN but essentially represents a single node of the distributed system. Consequently, the data traffic within a cluster enjoys a high data transfer rate, giving rise to improved response time as well as increased throughput. In addition, when the number of clusters in a distributed system is increased, it does not cause any degradation in performance, because it does not proportionately add much network traffic. Moreover, as the system architecture is expanded, due to the location transparency as well as the location independence of the added nodes, any file can simply be moved to the cluster where the client is actually located. Even if the DFS does not possess location independence, the movement of files and similar other aspects can still be implemented through file replication or file caching for read-only files. Likewise, for read/write files, the use of session semantics makes it possible to locate a file version in the client node without using any cache validation, which automatically eliminates cache validation traffic; hence, network traffic would ultimately be reduced.
Brief details on file caching with a figure are given on the Support Material at www.routledge.com/9781032467238.
contact the next server in the list. Thus, if the list of servers contains two servers, the second server acts as a hot standby for the first server.
Virtual file system layer: The implications of virtual file systems (VFS) were described in Chapter 7. With the use of the VFS layer (module), NFS provides access transparency, which enables user programs to issue file operations for local or remote files without it making any difference. Other DFSs, if they support UNIX system calls, can then also be present and be integrated in the same way. Addition of the VFS to the UNIX kernel enables it to distinguish between local and remote files and to translate between the UNIX-independent file identifiers used by NFS and the internal file identifiers normally used in UNIX and other file systems. The file identifiers used in NFS are called file handles. A file handle is opaque to clients and contains whatever information the server needs to distinguish an individual file. When NFS is implemented in UNIX, the file handle is derived from the file's i-node number by adding two extra fields, the file-system identifier and the i-node generation number.
The VFS layer implements the mount protocol and creates a system-wide unique designator for each file, called the v-node (virtual node). The VFS structure for each mounted file system thus has one v-node per open file. A VFS structure relates a remote file system to the local directory on which it is mounted. The v-node contains an indicator to show whether a file is local or remote. If the file on which an operation is to be performed is located in one of the local file systems, the VFS invokes that file system, and the v-node contains a reference to the index of the local file (an i-node in a UNIX implementation); otherwise it invokes the NFS layer (Figure 9.12), and the v-node in that situation contains the file handle of the remote file. The NFS layer contains the NFS server module, which resides in the kernel on each computer that acts as an NFS server. The NFS interacts with the server module at the remote node (computer) containing the relevant file through NFS protocol operations. The beauty of this architecture is that it permits any node to be both a client and a server at the same time.
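The local/remote dispatch made possible by the v-node can be pictured with the following sketch; the v-node fields and the read routines are hypothetical simplifications of the kernel data structures.

from dataclasses import dataclass
from typing import Optional

@dataclass
class VNode:
    is_remote: bool
    inode: Optional[int] = None            # reference into a local file system, if local
    file_handle: Optional[bytes] = None    # opaque NFS file handle, if remote

def local_read(inode, offset, count):      # stand-in for the local file system
    return b"local-data"[offset:offset + count]

def nfs_read(handle, offset, count):       # stand-in for the NFS client layer
    return b"remote-data"[offset:offset + count]

def vfs_read(vnode, offset, count):
    if vnode.is_remote:
        return nfs_read(vnode.file_handle, offset, count)   # forwarded to the NFS layer
    return local_read(vnode.inode, offset, count)           # handled by the local file system

print(vfs_read(VNode(is_remote=True, file_handle=b"fh"), 0, 6))   # b'remote'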
• Mount protocol: A modified version of the UNIX mount command can be issued by
clients to request mounting of a remote file system. They specify the remote host name, the
pathname of a directory in the remote file system, and the local name with which it is to
be mounted. Each node in the system contains an export list made up of pairs of the form
(<directory>, <list-of-nodes>). Each pair indicates that <directory>, which exists in one of
the local file systems, can be remotely mounted only by the nodes contained in <list-of-
nodes>. The mount command communicates with the mount service process on the remote
host using a mount protocol, which is essentially an RPC protocol. When the superuser of
a node makes a request to mount a remote directory, the NFS checks the validity (access
permission for the relevant file system) of the request, mounts the directory, and finally
returns a file handle containing the identifier of the file system that holds the remote
directory and the i-node of the directory in that file system. The location (IP address and
port number) of the server and the file handle for the remote directory are passed on to the
VFS layer and the NFS client. In effect, users in the node can view a directory hierarchy
constructed through these mount commands. NFS also permits cascaded mounting of file
systems; that is, a file system can be mounted at a mount point in another file system,
which is itself mounted inside another file system, and so on. The mounting of sub-trees
of remote file systems by clients is supported by a mount service process that runs at the
user level on each NFS server computer. On each server, there is a well-known file (/etc/
exports) containing the names of local file systems that are available for remote mounting.
An access list is associated with each file-system name (identifier) indicating which hosts
are permitted to mount the file system. However, the NFS design imposes some restrictions
in this regard that carefully avoid transitivity of the mount mechanism; otherwise each
file server would have to know about all mounts performed by all clients over its file
systems, which eventually would require the file server to be stateful.
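The server-side permission check implied here can be pictured with the following sketch, assuming a simple in-memory form of the export list; a real mount daemon parses /etc/exports and is considerably more involved.

#include <stddef.h>
#include <string.h>

/* One (<directory>, <list-of-nodes>) pair from the export list. */
struct export_entry {
    const char  *directory;      /* local directory that may be exported         */
    const char **allowed_hosts;  /* NULL-terminated list of permitted host names */
};

/* Return 1 if 'host' may mount 'directory', 0 otherwise. */
int may_mount(const struct export_entry *exports, size_t n,
              const char *directory, const char *host)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(exports[i].directory, directory) != 0)
            continue;
        for (const char **h = exports[i].allowed_hosts; *h != NULL; h++)
            if (strcmp(*h, host) == 0)
                return 1;        /* host appears in the access list             */
    }
    return 0;                    /* not exported, or host not permitted         */
}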
• NFS protocol: The NFS protocol employs RPCs to provide remote file processing services
using a client–server model. In fact, Sun's RPC system (described earlier) was developed
for use in NFS. It can be configured to use either UDP or TCP, and the NFS protocol is
compatible with both. The RPC interface to the NFS server is open; any process can send
requests to an NFS server if the requests are valid and they include valid user credentials.
As optional security features, requests may be required to carry signed user credentials,
and data may be encrypted for privacy and integrity. An NFS server does
not provide any means of locking files or records, and as such, users must employ their
own mechanisms to implement concurrency control. A file server here is truly stateless, so
each RPC has parameters that identify the file, the directory containing the file, and the
data to be read or written. In addition, being stateless, the file server performs an implicit
open and close for every file operation, and for this purpose, it does not use the UNIX buf-
fer cache. The NFS protocol provides numerous calls, such as looking up a file within a directory,
reading directory entries, manipulating links and directories, accessing file attributes
(i-node information), and performing file read/write operations.
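Because the server holds no open-file state, every request names the file explicitly; a read, for instance, carries the file handle, the offset, and the byte count in the call itself. The assumed request layout below illustrates the idea; the structure names are not the actual protocol definitions.

#include <stdint.h>

#define NFS_FH_SIZE 32

/* Arguments of a stateless read request: everything the server needs travels
 * in the call itself, so no prior open() at the server is required. */
struct nfs_read_args {
    unsigned char fh[NFS_FH_SIZE]; /* opaque handle of the file to read     */
    uint64_t      offset;          /* byte offset within the file           */
    uint32_t      count;           /* number of bytes requested             */
};

/* The matching reply carries the data plus fresh file attributes, which lets
 * the client revalidate its attribute cache in the same round trip. */
struct nfs_read_reply {
    int      status;               /* 0 on success, an error code otherwise */
    uint32_t bytes_returned;       /* data and attributes follow in reply   */
};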
• Path name translation: UNIX file systems translate (resolve) multi-part file pathnames to
i-node references in a step-by-step manner whenever system calls, such as open, creat, or
stat, are used. In NFS, pathnames cannot be translated at a server, because a name may
cross a mount point at the client; directories holding different parts of a multi-part name
may reside in file systems that are located at different servers. So pathnames are parsed,
and their translation is performed in an iterative manner by the client. Each part of a name
that refers to a remote-mounted directory is translated to a file handle using a separate
lookup request to the remote server. To explain this procedure, let us assume that a user X1
located in node N1 uses a pathname a/b/c/d, where b is the root directory of a mounted file
system. To begin with, the host node N1 creates v-node_a, the v-node for a. The NFS uses
the mount table of N1 when looking up the next component of the pathname and sees that
b is a mounted directory. It then creates v-node_b from the information in the mount table.
Let us assume that v-node_b is for a file in node N2, so the NFS makes a copy of directory b
in node N1. While looking for c in the copy of b, the NFS again uses the mount table of N1. This
action resolves c properly, even if c is a file system that was mounted by a superuser
of node N1 at some point in the remote file system b. The file server in node N2, which con-
tains b, has no need to know about this mounting. However, instead of using this procedural
approach, if the pathname b/c/d were handed over directly to the file server in node
N2, the server would then have to have all the information about each and every mount
performed by all clients over its file system. Consequently, this would require the file server
to be stateful, which directly contradicts the stateless design strategy of the file server. To
make this time-consuming process of entire pathname resolution relatively fast, each client
node is usually equipped with an additional directory name cache.
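The component-at-a-time resolution just described might look roughly like the sketch below on the client side; root_vnode(), mount_table_crossing(), and lookup_component() are placeholders standing in for the client's own tables and the per-component lookup request, not real kernel interfaces.

#include <string.h>

struct vnode;   /* opaque here; one v-node per resolved component */

/* Placeholders for operations the real client module would provide. */
struct vnode *root_vnode(void);                        /* v-node of the root directory */
struct vnode *mount_table_crossing(struct vnode *dir); /* NULL if no mount point here  */
struct vnode *lookup_component(struct vnode *dir,      /* local i-node search or one   */
                               const char *name);      /* remote lookup request        */

/* Resolve a pathname such as "a/b/c/d" one component at a time, consulting the
 * client's own mount table at every step so that mounts made at this client
 * are honoured without the remote server ever knowing about them. */
struct vnode *resolve_path(char *path)
{
    struct vnode *v = root_vnode();
    for (char *comp = strtok(path, "/"); comp != NULL && v != NULL;
         comp = strtok(NULL, "/")) {
        struct vnode *m = mount_table_crossing(v);
        if (m != NULL)
            v = m;                          /* stepped onto a mounted file system */
        v = lookup_component(v, comp);      /* one lookup per path component      */
    }
    return v;                               /* NULL if any component was missing  */
}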
• Server and client caching: Caching in both the server and the client computer is an indis-
pensable feature of NFS implementations in order to achieve improved performance. The
caching techniques used here work in a similar way as in a conventional UNIX environment,
because all read and write requests issued by user-level processes pass through a single cache
that is implemented in the UNIX kernel space. The cache is always kept up to date, and file
accesses cannot bypass the cache.
Caching at an NFS server machine follows the same lines as for other file accesses.
The use of the server's cache to hold recently read disk blocks does not raise any cache consistency
(cache coherence) problems, but when the server performs write operations, additional measures are
needed so clients can be assured that the results of write operations are persistent, even in the event of
a server crash. However, different versions of the NFS protocol handle cache read/write operations
in different styles, providing different options to exploit.
The NFS client module caches the results of read, write, getattr, lookup, and readdir operations
in order to reduce the number of requests transmitted to servers, thereby minimizing the processing
time and maximizing the speed of execution, apart from avoiding network latency. Client caching
introduces the potential for different versions of files, or portions of files, to exist in different client
nodes, because writes by one client do not result in the immediate updating of cached copies of the
same file in other clients. Instead, clients are responsible for polling the server to check the currency
of the cached data that they hold. Usually, an appropriate timestamp-based method is used to vali-
date cached blocks before they are used. The contents of a cached block are assumed to be valid for a
certain period of time. For any access after this time, the cached block is used only if its timestamp
is not older than the last-modification timestamp of the file held at the server.
In general, read-ahead and delayed-write mechanisms are employed, and to implement these
methods, the NFS client needs to perform some reads and writes asynchronously using one or more
bio-daemon (block input-output; the term daemon is often used to refer to user-level processes that
perform system tasks) processes at each client. Bio-daemon processes certainly provide improved
performance, ensuring that the client module does not block waiting for reads to return or writes to
commit at the server. They are not a logical requirement, since in the absence of read-ahead, a read
operation in a user process will trigger a synchronous request to the relevant server, and the results
of writes in user processes will be transferred to the server when the relevant file is closed or when
the virtual file system at the client performs a sync operation.
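A sketch of the freshness test implied above is given below, under the usual assumptions: Tc is the time the entry was last validated, the freshness interval is a few seconds for files and longer for directories, and the entry is revalidated against the server's last-modification time once the interval has expired. The names and the exact rule are illustrative, not the protocol specification.

#include <time.h>

/* A cached block or attribute entry, with the fields needed for validation. */
struct cache_entry {
    time_t tc;            /* time this entry was last validated with the server */
    time_t tm_client;     /* file's last-modification time recorded at caching  */
};

#define FRESH_FILE  3     /* assumed freshness interval for files (seconds)     */
#define FRESH_DIR  30     /* assumed freshness interval for directories         */

/* Return 1 if the cached entry may be used at time 'now'. 'tm_server' is the
 * modification time obtained from the server (e.g. via a getattr request) when
 * the freshness interval has expired and revalidation is required. */
int cache_entry_valid(struct cache_entry *e, time_t now,
                      time_t fresh_interval, time_t tm_server)
{
    if (now - e->tc < fresh_interval)
        return 1;                        /* still within the freshness interval */
    if (e->tm_client == tm_server) {     /* file unchanged at the server        */
        e->tc = now;                     /* revalidation succeeded; reset clock */
        return 1;
    }
    return 0;                            /* stale: the data must be fetched again */
}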
• File operations and file sharing: To follow the remote service paradigm strictly, every file
operation in NFS would require a request to be made to the server. In addition,
NFS employs caching of file blocks (file caching) at each client computer for greatly
enhanced performance. Although this is important for achieving satisfactory
performance, this hybrid arrangement results in some deviation from strict UNIX
one-copy file-update semantics, which consequently complicates the file-sharing semantics
offered.
To speed up file operations, NFS uses two caches. A file-attributes cache caches i-node infor-
mation. Use of this cache is important, since it has been observed that a large percentage of the
requests issued to a file server relate to file attributes. The cached attributes are discarded
after 3 seconds for files and after 30 seconds for directories. The other cache used is the file-
blocks cache, which is the conventional file cache. As usual, it contains data blocks from the
file. The file server uses large (normally 8 Kbyte) data blocks and uses read-ahead and delayed-
write techniques (i.e. buffering techniques, already discussed in Chapter 7) to achieve improved
performance in file access. A modified block is sent to the file server for writing into the file at
an unspecified time (asynchronously). This policy is used even if clients concurrently access the
same file block in conflicting modes, so a modification made by one client is not immediately
visible to other clients accessing the file. A modified block is also sent to the file server when the
relevant file is closed or when a sync operation is performed by the virtual file system at the cli-
ent's end. A directory-names cache is used in each client node to expedite pathname resolution at
the time of file access. It usually contains remote directory names and their v-nodes. New entries
are added to the cache when a new pathname prefix is resolved, and similarly, entries are deleted
when a lookup fails because of a mismatch between the attributes returned by the file server and those
of the cached v-nodes.
• Conclusion: In summary, the Sun NFS design has been implemented for almost every
known operating system and on heterogeneous hardware platforms and is supported by a
variety of file systems. The NFS server implementation is stateless, and its operations are idem-
potent, which enables clients and servers to resume execution after a failure without the need
for recovery procedures. In fact, the failure of a client computer or a user-level process in a
client has no effect on any server that it may be using, since servers hold no state on behalf
of their clients. The design provides good location transparency and access transparency
if the NFS mount service is used properly to produce similar name spaces at all clients.
Migration of files or file systems is not fully achieved by NFS; if file systems are moved
between servers, manual intervention to reconfigure the remote mount tables in each
client must be carried out separately to enable clients to access the file systems in
their new location. The performance of NFS is greatly enhanced by the caching of
file blocks at each client computer, even though this deviates from strict UNIX one-copy file-
update semantics. The published performance figures show that NFS servers can be
built to handle very large real-world loads in an efficient and cost-effective manner. The
performance of a single server can be enhanced easily by the addition of processors, disks,
and controllers, of course within a specified limit. When such limits are reached, additional
servers can be installed, and file systems must then be reallocated between them. The measured
performance of several implementations of NFS and its widespread adoption for use in situ-
ations that generate very heavy loads are clear indications of the efficiency with which the
NFS protocol can be implemented.
• Striping data using an appropriate disk-block size to store data across multiple disks (to the
extent of several thousand disks using RAID technology) attached to multiple nodes.
• Using block-level locking based on a very sophisticated, scalable distributed token man-
agement system to ensure consistency of file system data and metadata when multiple
application nodes in the cluster attempt to access the shared files concurrently.
• Supporting typical access patterns, such as sequential, reverse sequential, and random, and
optimizing I/O access for these patterns.
• Offering scalable metadata (e.g. indirect blocks in the FMT) management that allows all
nodes of the cluster accessing the file system to perform the same file metadata operations.
• Allowing GPFS file systems to be exported to clients outside the cluster through NFS or
Samba, which when integrated form clustered NFS, providing a scalable file service.
This, in turn, permits simultaneous access to a common set of data from multiple nodes,
apart from monitoring of file services, load balancing, and IP address failover.
• Efficient client-side caching.
• Supporting a large block size, configurable by the administrator, to fit I/O requirements.
• Utilizing advanced algorithms that improve read-ahead and write-behind file functions.
• Accomplishing fault-tolerance by using robust clustering features and support for continu-
ous data replication of journal logs, metadata, and file data. The journal is located in the
file system to which the file belongs and is processed there. In the event of a node failure,
other nodes can access its journal and carry out the pending operations.
To provide adequate protection and security, GPFS offers enhanced access control that pro-
tects directories and files by providing a means of specifying who should be granted access. GPFS
supports (on AIX) NFS V4 access control lists (ACLs) in addition to traditional ACL support.
Traditional GPFS ACLs are based on the POSIX model. Access control lists extend the base per-
missions, or standard file-access modes, such as read (r), write (w), and execute (x), beyond the
three standard categories of file owner, file group, and other users, allowing the definition to
include additional users and user groups. In addition, GPFS introduces a fourth access mode, con-
trol (c), which can be used to govern who can manage the ACL itself.
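Purely as an illustration of these four access modes, an ACL entry could be modelled as follows; the structure and names are assumptions made for this sketch, not GPFS's internal representation.

/* Access modes: the three standard bits plus GPFS's additional control mode. */
enum acl_mode {
    ACL_READ    = 1 << 0,   /* r                                  */
    ACL_WRITE   = 1 << 1,   /* w                                  */
    ACL_EXECUTE = 1 << 2,   /* x                                  */
    ACL_CONTROL = 1 << 3    /* c: who may manage the ACL itself   */
};

/* One ACL entry granting a set of modes to a named user or group, beyond the
 * base owner/group/other permission bits of the file. */
struct acl_entry {
    const char  *who;       /* additional user or group name      */
    int          is_group;  /* 0 = user, 1 = group                */
    unsigned int modes;     /* bitwise OR of enum acl_mode values */
};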
systems of this category being currently built (Chakraborty, 2020). This architectural system design
methodology quickly gained much importance due to the range of options it can provide to cater
to many different operational environments, particularly in the area of server applications (Buyya,
1999).
Appropriate system software is thus required to fully exploit a cluster hardware configuration,
and this requires some enhancements over a traditional single-system operating system. It should be
clearly noted that cluster software is not a DOS, but it contains several features that closely resemble
those found in a DOS. Among them: it provides high availability through redundancy of avail-
able resources, such as CPUs and other I/O media. On the one hand, it speeds up computation by
exploiting the presence of several CPUs within the cluster, and on the other hand, with the use of
the same (or heterogeneous) hardware configurations, it spreads the flavor of parallel processing
by providing a single-system image to the user. In addition, with the use of appropriate schedul-
ing software, a cluster effectively exhibits the capability to balance load in the existing computing
system. Last but not least, the software is adequately equipped to provide fault-tolerance as well
as failure management.
• High Availability: Greater availability of the computing resources present in the cluster
comes from its high scalability; it also implies high fault-tolerance. Since each node in a
cluster is an independent stand-alone computer, the occurrence of faults, and thereby the fail-
ure of any node, does not in itself cause any loss of service. In many of today's products,
fault-tolerance is handled automatically in software. Moreover, clustering also possesses
failover capability, using a backup computer placed within the cluster to take charge of
a failed computer to negotiate any exigency.
• Expandability and Scalability: It is possible to configure a cluster in such a way that
new systems can be added to the existing cluster using standard technology (commodity hardware and
software components). This provides expandability, an affordable upgrade path that lets
organizations increase their computing power while preserving their existing investment and
incurring only a little additional expense. The performance of applications also improves
with the aid of a scalable software environment. Clusters also offer high scalability and
greater availability of the computing resources. In fact, this approach offers both absolute
scalability and incremental scalability, which means that a cluster configuration can
be easily extended by adding new systems to the cluster in small increments, of course
within the underlying specified limits.
• Openness: A clustering approach is also capable of hiding the heterogeneity that may exist
in the collection of underlying interconnected machines (computers) and thereby ensuring
interoperability between different implementations.
• High Throughput: The clustering approach offers effectively unbounded processing
power, storage capacity, high performance, and high availability, which together
offer considerably high throughput in all situations.
• Superior Cost/Performance: By using commodity building blocks, it is possible to
realize a cluster that can offer equal or even greater computing power and superior
performance compared to a single large machine, at much lower cost and
complexity.
• Separate server: In this approach to clustering, each computer is a separate server with its own
disks, and no disks are shared between the systems. [Figure 9.39(b) given on the Support
Material at www.routledge.com/9781032467238.] However, this approach requires software
management and scheduling mechanisms to handle continuously arriving client requests and to
assign them to different servers in a way that maintains load balancing and attains high uti-
lization of the available resources. Consequently, this approach can offer high
performance and also high availability. Moreover, to make this approach attractive, failover
capability is required so that in the event of the failure of one computer, any other computer in
the cluster can take up the incompletely executed application from the point of failure and
continue its execution to completion. To implement this, data must be constantly copied
among systems so that each system has easy access to the most current data of the other sys-
tems. However, such data exchange operations essentially involve high communication traffic
as well as server load, incurring additional overhead for the sole purpose of ensuring high
availability, and this, in turn, also results in a substantial degradation in overall performance.
• Servers connected to disks (shared nothing): In order to reduce the network traffic and server
overhead caused mostly by the data exchange operations needed among the systems in a cluster,
most clusters have their servers connected to common disks [Figure 9.39(a) on the
Support Material at www.routledge.com/9781032467238]. One variation of this approach is
simply called shared nothing, in which the common disks (not shared disks) are partitioned into
volumes, and each volume is owned by a single computer. If one computer fails, the cluster
must be reconfigured so that another computer can gain ownership of the volumes owned by
the failed computer. In this way, the constant copying of data among all systems to enable each
system to have easy access to the most current data of the other systems can simply be foregone.
• Shared-disk servers: In the shared-disk approach, multiple computers present in a cluster share
the same disks at the same time, so that each computer has access to all of the volumes on all of
the disks.
• Single entry point: A user normally logs onto the cluster rather than onto an individual
computer.
• Single control point: A default node always exists that is used for cluster management
and control.
• Single memory space: The presence of distributed shared memory allows programs to
share variables.
• Single file hierarchy: The user views a single hierarchy of file directories under the same
root directory.
FIGURE 9.13 A representative block diagram of a cluster computer architecture realized with the use of PC
workstations and cluster middleware.
• Single job-management system: A job scheduler exists in the cluster (not tied to any
individual computer) that receives jobs submitted to the cluster from all users, irrespective
of any specification of the computer host on which a submitted job is to be executed.
• Single virtual networking: Any node can access any other point in the cluster, even
though the actual physical cluster configuration may consist of several interconnected net-
works. There is, in effect, a single virtual network operation.
• Single user interface: Irrespective of the workstation through which a user enters the
cluster, a common graphic interface supports all users at the same time.
• Single process space: A uniform process-identification scheme is used. A process execut-
ing on any node can create or communicate with any other process on any local or remote
node.
• Process migration: Any process running on any node can be migrated to any other node
irrespective of its location, which enables balancing the load in the system.
• Single I/O space: Any node can access any local or remote I/O device, including disks,
without having any prior knowledge of its actual physical location.
• Checkpointing: This function periodically saves the process state, intermediate results,
and other related information of the running process, which allows implementation of roll-
back recovery (a failback function) in the event of a fault and subsequent failure of the
system.
Other similar services are also required from the cluster middleware in order to cast a single-
system image of the cluster. The last four items of the preceding list enhance the availability
of the cluster, while the other items in the list are related to providing a single-system image
of the cluster.
system at any point in time. The Windows Cluster Server is based on the following funda-
mental concepts:
• Cluster Service and Management: A collection of software must reside on each node
that manages all cluster-specific activity. A cluster as a whole is, however, managed using
distributed control algorithms which are implemented through actions performed in all
nodes. These algorithms require that all nodes in a cluster have a consistent view of the
cluster; that is, they must possess identical lists of nodes within the cluster. An application
has to use a special Cluster API and dynamic link library (DLL) to access cluster services.
• Resources: The concept of resources in Windows is somewhat different. All resources in
the cluster server are essentially objects that can be actual physical resources in the sys-
tem, including hardware devices, such as disk drives and network cards; logical resources,
such as logical disk volumes, TCP/IP addresses, entire applications, and databases; or a
resource that can even be a service. A resource is implemented by a dynamic link library
(DLL), so it is specified by providing a DLL interface. Resources are managed by a
resource monitor which interacts with the cluster service via RPC and responds to cluster
service commands to configure and move a collection of resources. A resource is said to
be online at a node when it is connected to that specific node to provide certain services.
• Group: A group is a collection of resources managed as a single unit. A resource belongs
to a group. Usually, a group contains all of the elements needed to run a specific applica-
tion, including the services provided by that application. A group is owned by one node in
the cluster at any time; however, it can be shifted (moved) to another node in the event of
a fault or failure. A resource manager exists in a node that is responsible for starting and
stopping a group. If a resource fails, the resource manager alerts the failover manager and
hands over the group containing the resource so that it can be restarted at another node.
• Fault Tolerance: Windows Cluster Server provides fault-tolerance support in clusters by
using two or more server nodes. Basic fault tolerance is usually provided through shared
RAID storage (typically RAID level 1 or 5) accessible to all server nodes. In addition, when a fault or a shutdown
occurs on one server, the cluster server moves its functions to another server without caus-
ing a disruption in its services.
An illustration of the various important components of Windows Cluster Server and their rela-
tionships in a single system of a cluster is depicted in Figure 9.14. Individual cluster services are
each handled by one of several managers. Each node has a node manager which is responsible for
maintaining this node's membership in the cluster and also the list of nodes in the cluster. Periodically,
it sends messages called heartbeats to the node managers on the other nodes present in the cluster for
the purpose of node fault detection. When one node manager detects a loss of heartbeat messages
from another node in the cluster, it broadcasts a message on the private LAN to the entire cluster,
causing all members to exchange messages to verify their view of the current cluster membership. If a
node manager does not respond, or a node fault is otherwise detected, the node is removed from the cluster,
and each node then corrects its list of nodes accordingly. This event is called a regroup event. The
resource manager concerned with resources now comes into action, and all active groups located
on the faulty node are then “pulled” to other active nodes in the cluster so that the resources in them
can be accessed. Use of a shared disk facilitates this arrangement. When a node is subsequently
restored after a failure, the failover manager concerned with nodes decides which groups can be
handed over to it. This action, called a failback, safeguards and ensures resource efficiency in
the system. The handover and failback actions can also be performed manually.
In effect, the resource manager/failover manager makes all decisions regarding resource
groups and takes appropriate actions to start up, reset, and fail over. In the event of a node failure,
the failover managers on the other active nodes cooperate to effect a distribution of resource groups
from the failed system to the remaining active systems. When a node is subsequently restored after
rectifying its fault, the failover manager can decide to move some groups back to the restored system along
with the others. In particular, any group can be configured with a preferred owner. If that owner
fails and then restarts, it is desirable that the group in question be moved back to the node using a
rollback operation.
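The heartbeat exchange and failure detection described above can be pictured with a sketch like the following; the timeout value and the function names are illustrative assumptions, not the actual Cluster Server interfaces.

#include <time.h>

#define HEARTBEAT_TIMEOUT 5          /* seconds of silence before a node is suspected */

struct node_state {
    int    node_id;
    int    alive;                    /* this node's current view of membership        */
    time_t last_heartbeat;           /* time the last heartbeat was received          */
};

/* Placeholder for broadcasting a regroup request on the private LAN. */
void broadcast_regroup(int suspected_node);

/* Called periodically by the node manager: any node whose heartbeats have
 * stopped is suspected, and a regroup event is triggered so that all members
 * re-verify their view of the cluster membership and correct their node lists. */
void check_heartbeats(struct node_state *nodes, int n, time_t now)
{
    for (int i = 0; i < n; i++) {
        if (nodes[i].alive &&
            now - nodes[i].last_heartbeat > HEARTBEAT_TIMEOUT) {
            nodes[i].alive = 0;                    /* tentatively drop from the view */
            broadcast_regroup(nodes[i].node_id);   /* let all members re-verify      */
        }
    }
}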
The configuration database used by the cluster is maintained by the configuration database
manager. The database contains all information about resources and groups and node ownership
of groups. The database managers on each of the cluster nodes interact cooperatively to maintain
a consistent picture of configuration information. Fault-tolerant transaction software is used to
ensure that changes in the overall cluster configuration before failure, during failure, and after
recovery from failure are performed consistently and correctly.
There are many other managers to perform their respective duties and responsibilities. One such
processing entity is known as an event processor (handler) that coordinates and connects all of the
components of the cluster service, handles common operations, and controls cluster service initial-
ization. The communication manager monitors message exchange with all other nodes present in
the cluster. The global update manager provides a service used by other components within the
cluster service.
Windows Cluster Server balances the incoming network traffic load by distributing the traffic
among the server nodes in a cluster. It is accomplished in the following way: the cluster is assigned
a single IP address; however, incoming messages go to all server nodes in the cluster. Based on the
current load distribution arrangement, exactly one of the servers accepts the message and responds
to it. In the event of a node failure, the load belonging to the failed node is distributed among other
active nodes. Similarly, when a new node is included, the load distribution is reconfigured to direct
some of the incoming traffic to the newly joining node.
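One common way to realize "exactly one of the servers accepts the message" is a deterministic filter that every node evaluates locally on each incoming packet; the hashing rule below is an assumed illustration of the general idea, not Microsoft's actual load-balancing algorithm.

#include <stdint.h>

/* Every node sees every incoming packet (the cluster shares one IP address).
 * Each node applies the same deterministic rule and accepts the packet only
 * if the rule maps this client to itself; all other nodes silently drop it.
 * When membership changes, only 'active_nodes' and 'my_index' change. */
int should_accept(uint32_t client_ip, uint16_t client_port,
                  int my_index, int active_nodes)
{
    uint32_t h = client_ip ^ ((uint32_t)client_port << 16);
    h ^= h >> 13;                                 /* cheap mixing of the bits */
    return (int)(h % (uint32_t)active_nodes) == my_index;
}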
Solaris kernel to work together with virtually no changes required to the kernel, even if
changes are made in any area above the kernel.
• Global process management: Existing process management is enhanced with the use of
global process management, which provides globally unique process ids for each process
in the cluster so that each node is aware of the location and status of each process. This fea-
ture is useful in process migration, wherein a process during its lifetime can be transferred
from one node to another to balance the computational loads across the different nodes
or to achieve computation speed-up. A migrated process in that situation
should be able to continue using the same pathnames to access files from the new node. Use
of a DFS, in particular, facilitates this feature. The threads of a single process, how-
ever, must remain on the same node.
• Disk path monitoring: All disk paths can be monitored and configured to automatically
reboot a node in the case of multiple path failures. Faster reaction in the case of severe disk
path failure provides improved availability.
• Configuration checker: Checks for vulnerable cluster configurations regularly and rap-
idly, thereby attempting to limit failures due to odd configurations throughout the lifetime
of the cluster.
• Networking: A number of alternative approaches can be taken by Sun clusters when handling
network traffic.
• Only a single node in the cluster is selected to have a network connection and is dedi-
cated to performing all network protocol processing. In particular, for TCP/IP-based pro-
cessing, while handling incoming traffic, this node would analyze TCP and IP headers
and would then route the encapsulated data to the appropriate node. Similarly, for out-
going traffic, this node would encapsulate data from other transmitting nodes with TCP/
IP headers for transmission. This approach has several advantages but suffers
from the serious drawback of not being scalable, particularly when the cluster consists of
a large number of nodes, and thus it has fallen out of favor.
• Another approach is to assign a separate unique IP address to each node in the
cluster; each node then executes the network protocols over the external network
directly. One serious difficulty with this approach is that the transparency of
the cluster is adversely affected: the cluster configuration is no longer transparent
(rather, it is exposed) to the outside world. Another vital aspect is the difficulty of handling
the failover situation in the event of a node failure, when it is necessary to transfer an
application running on the failed node to another active node that has a
different network address.
• A packet filter can be used to route packets to the destined node, with all protocol pro-
cessing performed on that node. A cluster in this situation appears to the outside
world to be a single server with a single IP address. Incoming traffic is then appro-
priately distributed among the available nodes of the cluster to balance the load. This
is the approach the Sun cluster adopts.
Incoming packets are first received on the node that has the physical network connection with the
outside world. This receiving node filters the packet and delivers it to the right target node over the
cluster's own internal connections. Similarly, all outgoing packets are routed over the cluster's own
interconnection to the node (or one of multiple alternative nodes) that has an external physical net-
work adapter. However, all protocol processing in relation to outgoing packets is always performed
by the originating node. In addition, the Sun cluster maintains a global network configuration
database in order to keep track of the network traffic to each node.
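A rough sketch of the receive path just described is shown below, with the filter and forwarding steps reduced to placeholders; the function names are assumptions made for this sketch.

#include <stddef.h>

/* Placeholders standing in for the real cluster machinery. */
int  filter_target_node(const unsigned char *pkt, size_t len); /* choose node from headers */
int  my_node_id(void);
void deliver_locally(const unsigned char *pkt, size_t len);
void forward_over_interconnect(int node, const unsigned char *pkt, size_t len);

/* Runs on the node that owns the external network adapter: every incoming
 * packet is filtered and either handled here or pushed to the target node over
 * the cluster's internal interconnect, where protocol processing takes place. */
void on_external_packet(const unsigned char *pkt, size_t len)
{
    int target = filter_target_node(pkt, len);
    if (target == my_node_id())
        deliver_locally(pkt, len);
    else
        forward_over_interconnect(target, pkt, len);
}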
• Availability and scalability: Sun cluster provides availability through failover, whereby
the services that were running at a failed node are transferred (relocated) to another node.
FIGURE 9.16 A representative scheme of a general Sun cluster file system extension.
Scalability is provided by sharing (as well as distributing) the total load across the exist-
ing servers.
• Multiple storage technologies and storage brands: Solaris cluster can be used in com-
bination with different storage technologies, such as FC, SCSI, iSCSI, and NAS storage on
Sun or non-Sun storage.
• Easy-to-use command-line interface: An object-oriented command line interface pro-
vides a consistent and familiar structure across the entire command set, making it easy to
learn and use and limiting human error. Command logging enables tracking and replay.
• Global distributed file system: The beauty and strength of the Sun cluster is its global
file system, as shown in Figure 9.16, which is built on the virtual node (v-node) and virtual
file system (VFS) concepts. The v-node structure is used to provide a powerful, general-
purpose interface to all types of file systems. A v-node is used to map pages of memory
into the address space of a process, to permit access to a file system, and to map a process
to an object in any type of file system. The VFS interface accepts general-purpose com-
mands that operate on entire files and translates them into actions appropriate for the
subject file system. The global file system provides a uniform interface to all files
distributed over the cluster. A process can open a file located anywhere in the cluster,
and processes on all nodes use the same pathname to locate a file. In order to implement
global file access, MC includes a proxy file system built on top of the existing Solaris file
system at the v-node interface. The VFS/v-node operations are converted by
the proxy layer into object invocations. The invoked object may reside on any node in the
system. The invoked object subsequently performs a local v-node/VFS operation on the
underlying file system. No modification is required either at the kernel level or in
the existing file system to support this global file environment. In addition, caching is used
to reduce the number of remote object invocations, which, in turn, minimizes the traffic on
the cluster interconnect. Multiple file systems and volume managers are supported by the
Sun cluster.
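The proxy layer's job of converting a v-node operation into an object invocation that may land on another node can be sketched as below; the object record and the invocation interfaces shown are assumptions for illustration only.

/* A globally visible file object: the node hosting it and an opaque object id. */
struct gfs_object {
    int           home_node;   /* node where the underlying file system lives */
    unsigned long object_id;   /* identifies the object on that node          */
};

/* Placeholders for the real machinery. */
int my_node(void);
int local_vnode_read(unsigned long object_id, void *buf, unsigned long n);
int remote_invoke_read(int node, unsigned long object_id, void *buf, unsigned long n);

/* Proxy-layer read: a VFS/v-node read is turned into an object invocation.
 * If the object is local, the underlying file system is called directly;
 * otherwise the invocation travels over the cluster interconnect and the
 * local v-node/VFS operation is performed at the object's home node. */
int proxy_read(struct gfs_object *obj, void *buf, unsigned long n)
{
    if (obj->home_node == my_node())
        return local_vnode_read(obj->object_id, buf, n);
    return remote_invoke_read(obj->home_node, obj->object_id, buf, n);
}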
SUMMARY
The time-sharing systems of the 1970s could be considered the first stepping stone toward distrib-
uted computing systems: they implemented simultaneous sharing of computer resources by mul-
tiple users located away from the main computer system. Different forms of hardware design
of distributed computing systems, including multiprocessors and multicomputers, and the vari-
ous forms of software that drive these systems are described. The generic DOS and its design
issues are explained. Numerous considerations used in generic multiprocessor operating systems,
with emphasis on processor management for different forms of multiprocessor architecture, are
described. Practical implementations of Linux OS and Windows OS in multiprocessor environ-
ments are presented here as case studies. Distributed systems based on different models of mul-
ticomputers (networks of computers) consisting of a collection of independent computer systems
(homogeneous or heterogeneous) interconnected by a communication network (LAN and WAN)
using Ethernet, token ring, and so on for the purpose of exchanging messages are illustrated. The
formal design issues of generic multicomputer operating systems to be run on any kind of multi-
computer are presented.
In fact, a distributed system essentially provides an environment in which users can conveniently
use both local and remote resources. With the client/server model, the computation of a variety of
applications in computer networks is distributed to users (clients), while the resources to be shared
are maintained on server systems available to all clients. Thus, the client/server model is a blend of decen-
tralized and centralized approaches. The actual application is divided between client and server to
optimize ease of use and performance. The basic design issues of DOSs built on the client/server
model are briefly described. The interprocess communication required in any distributed system
is realized either by a message-passing facility or by an RPC, in which different programs on differ-
ent machines interact using procedure call/return syntax and semantics that act as if the partner
program were running on the same machine. The actual implementation of RPC in Sun systems is
briefly described. A brief overview of distributed shared memory and its implementation aspects
is narrated. A major part of a distributed system is the DFS, and the key design issues and a brief
overview of its operations are described, along with examples of its actual implementation carried
out in Windows, Sun NFS, and Linux GPFS. The most modern approach in distributed computer
system design is the cluster, built on the client/server model, in which all the machines work together
as a unified computing resource using an additional layer of software known as middleware that
casts an illusion of being one machine. Its advantages, classifications, and different methods of
clustering are described in short, along with the general architecture of clusters and their operating
system issues. The different aspects of the implementation of Windows and Sun clusters are shown
here as case studies.
EXERCISES
1. With respect to the salient features in hardware architectures, differentiate among the fol-
lowing types of computing systems: a. time sharing, b. network, c. distributed, and d.
parallel processing.
2. State and explain the salient features of a distributed computing system.
3. What are the main advantages and disadvantages that distributed computing systems
exhibit over centralized ones?
4. What are the commonly used different models for configuring distributed computing sys-
tems? Discuss in brief their relative advantages and disadvantages. Which model is consid-
ered dominant? Give reasons to justify it.
5. State the important issues in the design of the kernel of the operating system in symmetric
multiprocessors. Explain in brief how these issues are being handled.
6. What are the salient features considered in the design of the kernel of the operating sys-
tem in a distributed shared memory multiprocessor? Explain in brief how these issues are
handled.
7. In terms of hardware complexity, operating system complexity, potential parallelism, and
cost involved, compare the following types of systems which consist of a large number of
processors (say, 16 to 32):
a. A multiprocessor system with a single shared memory (SMP).
b. A multiprocessor system in which each processor has its own memory. The processors
are located far from each other and are connected by a low-capacity communication line
forming a network. Each processor can communicate with others by exchanging mes-
sages.
c. A multiprocessor system in which each processor has its own memory in addition to
shared memory used by all processors in the system.
8. Discuss the suitability of various kinds of locks to satisfy the requirements of synchroni-
zation of processors in multiprocessor systems. While spin or sleep locks are used, can
priority inversion occur? Justify your answer.
9. Discuss the different approaches employed in scheduling of threads and related assignment
of processors in multiprocessor systems.
10. What are the salient features that must be considered in the design of a multicomputer
operating system?
11. What is middleware? In spite of having acceptable standards, such as TCP/IP, why is mid-
dleware still needed?
12. State the different models of middleware that are available for use. Furnish in brief the
various middleware services that are implemented in application systems.
13. What are the reasons distributed operating systems are more difficult to design than cen-
tralized time-sharing operating systems?
14. What are the main differences between a network operating system and a distributed oper-
ating system?
15. State and explain the major issues at the time of designing a distributed operating
system.
16. Discuss some of the important concepts that might be used to improve the reliability of a
distributed operating system. What is the main problem faced in making a system highly
reliable?
17. Explain the main guiding principle to be obeyed to enhance the performance of a distrib-
uted operating system.
18. Why is scalability an important feature in the design of a distributed system? Discuss some
of the important issues that must be settled at the time of designing a scalable distributed
system.
19. “Heterogeneity is unavoidable in many distributed systems”. What are the common
types of incompatibilities faced in heterogeneous distributed systems? What are the
common issues that must be dealt with at the time of designing a heterogeneous distrib-
uted system?
20. Compare and contrast between network operating systems, distributed operating systems,
and distributed systems (middleware-based).
21. Most computer networks use fewer layers than those specified in the OSI model. Explain
what might be the reason for this. What problems, if any, could this lead to?
22. Why is the OSI model considered not suitable for use in a LAN environment? Give the
architecture of a communication protocol model suitable for LANs. Briefly describe the
functions of each layer of this architecture.
23. Suggest three different routing strategies for use in networks of computers. Discuss the
relative advantages and disadvantages of the strategies thus suggested.
24. What are the main differences between connection-oriented and connectionless communi-
cation protocols? Discuss their relative merits and drawbacks.
25. What is asynchronous transfer mode technology used in networking? State some of the
most common important features ATM has that put it at the forefront of networking
technologies. What type of impact will each of these features have on future distributed
systems?
26. State the mechanism used in the FLIP protocol for each of the following:
a. Transparent communication
b. Group communication
c. Secure communication
d. Easy network communication
State at least one shortcoming of the FLIP protocol.
27. What is a socket? Explain the mechanism followed to implement sockets. What is the
implication of a socket interface?
28. What is meant by internetworking? What are the main issues in internetworking? In light
of the interconnection technologies used in internetworking, explain the differences among
the following terms: a. bridges, b. router, c. brouter, and d. gateway.
29. What are the main differences between blocking and non-blocking protocols used in inter-
process communication in distributed systems on a workstation-server model?
30. Explain the nature of and reasons for differences in naming of system objects between
centralized and distributed systems in a client–server model.
31. Define process migration in a distributed system. Discuss the situations and the advantages
that can be accrued from process migration activities.
32. What is meant by IPC semantics? Write down the most commonly used IPC semantics
used in distributed systems along with their significant implications.
33. Discuss the relative merits and demerits of blocking and non-blocking protocols.
34. Comment on properties of the following non-blocking protocol:
a. Sender sends a request and continues processing.
b. Receiver sends a reply.
c. Sender sends an acknowledgement when it receives a reply.
35. Requests made using non-blocking send calls may arrive out of sequence at the destina-
tion site when dynamic routing is used. Discuss how a non-blocking RR protocol should
discard duplicate requests when this property holds.
36. Write notes on factors that influence the duration of the timeout interval in the RRA
protocol. How can duplicate replies received in the sender site in the RRA protocol be
discarded?
37. Describe a mechanism for implementing consistent ordering of messages in each of the
following cases (essentially a group communication): a. one-to-many communication, b.
many-to-one communication, and c. many-to-many communication.
38. What was the basic inspiration behind the development of the RPC facility? How does an
RPC facility make the job of the distributed application developer simpler?
39. With reference to the definition of synchronous and asynchronous RPC, discuss their rela-
tive merits and drawbacks.
40. In RPC, the called procedure may be on the same computer as the calling procedure, or it
may be on a different computer. Explain why the term remote procedure call is used even
when the called procedure is on the same computer as the calling procedure.
41. What is a “stub”? How are stubs generated? Explain how the use of stubs helps make the
RPC mechanism transparent.
42. The caller process of an RPC must wait for a reply from the callee process after making a
call. Explain how this can actually be done.
43. List some merits and drawbacks of non-persistent and persistent binding for RPCs.
44. Compare and contrast between RPC and message passing.
45. “Distributed shared memory should be incorporated into systems that are equipped with
high-bandwidth and low-latency communication links”. Justify.
46. Define and explain the following: a. group communication, b. interprocess communica-
tion, and c. Java RMI.
47. How can the performance of distributed shared memory be improved?
48. State the significant issues that must be kept in mind at the time of designing a distributed
shared memory system.
49. What are the main factors that must be supported by a distributed file system?
50. State and explain the primary attributes that can influence the fault tolerance of a distrib-
uted file system.
51. Differentiate between stateful and stateless servers. Why do some distributed applications
use stateless servers in spite of the fact that stateful servers provide an easier programming
paradigm and are typically more efficient than stateless servers?
52. State the influence of stateful and stateless file server designs on the tolerance of faults in cli-
ent–server nodes.
53. Discuss how a client should protect itself against failures in a distributed file system using:
a. a stateful file server design and b. a stateless file server design.
54. State at least two common server structures that can provide efficient file access in a dis-
tributed file system.
55. State some important techniques that are commonly used in the operation of a DFS to enable
the DFS to achieve high performance.
56. Explain how the cache coherence problem in a distributed file system is negotiated.
57. Should a DFS maintain file buffers at a server node or at a client node? What are the signifi-
cance and subsequent influence of this decision on the working of a DFS?
58. State and explain the techniques that are commonly used in the operation of a DFS that
enable the DFS to achieve high performance.
59. Discuss the important issues to be handled during recovery of a failed node in a system that
uses file replication to provide availability.
60. “The clustering concept is an emerging technology in the design of a distributed comput-
ing system”. State the salient features considered to be its major design objectives.
61. State the simplest form of classification of clusters. Describe the various most commonly
used methods of clustering.
62. Explain with an appropriate diagram the general architecture in organizing computers to
form a cluster system.
Cheriton, D. “The V Distributed System”, Communications of the ACM, vol. 31, no. 3, pp. 314–333, 1988.
Cheriton, D. R., Williamson, C. L. “VMTP as the Transport Layer for High-Performance Distributed System”,
IEEE Communication, vol. 27, no. 6, pp. 37–44, 1989.
Coulouris, G., Dollimore, J. Distributed Systems: Concepts and Design, Third Edition, Boston, MA, Addison-
Wesley, 2001.
Culler, D. E., Singh, J. P. Parallel Computer Architecture: A Hardware/Software Approach, Burlington, MA,
Morgan Kaufmann Publishers Inc, 1994.
Dijkstra, E. W. “Guarded Commands, Nondeterminacy, and Formal Derivation of Programs”, Communications
of the ACM, vol. 18, no. 8, pp. 453–457, 1975.
Janet, E. L. “Selecting a Network Management Protocol, Functional Superiority vs. Popular Appeal”,
Telephony, 1993.
Kaashoek, M. F., Van Renesse, et al., “FLIP: An Internetwork Protocol For Supporting Distributed Systems”,
ACM Transactions On Computer Systems, vol. 11, no. 1, pp. 73–106, 1993.
Levy, E., Silberschatz, A. “Distributed File Systems: Concepts and Examples”, Computing Surveys, vol. 22,
no. 4, pp. 321–374, 1990.
McDougall, R., Laudon, J. “Multi-Core Processors are Here”, USENIX, Login: The USENIX Magazine, vol.
31, no. 5, pp. 32–39, 2006.
Milenkovic, M. Operating Systems: Concepts and Design, New York, McGraw–Hill, 1992.
Mukherjee, B., Karsten, S. Operating Systems for Parallel Machines in Parallel Computers: Theory and
Practice (Edited by T. Casavant, et al.), Los Alamitos, CA, IEEE Computer Society Press, 1996.
Mullender, S. J., Tanenbaum, A. S., et al. “Amoeba: A Distributed Operating System for the 1990s”, IEEE
Computer, vol. 23, no. 5, pp. 44–53, 1990.
Russinovich, M. E., Solomon, D. A. Microsoft Windows Internals, Fourth Edition, New York, Microsoft
Press, 2005.
Sandberg, R. The Sun Network File System: Design, Implementation, and Experience, Mountain View, CA,
Sun Microsystems, 1987.
Schneider, F. B. “Synchronization in Distributed Programs”, ACM Transactions on Programming Languages
and Systems, vol. 4, no. 2, pp. 125–148, 1982.
Shevenell, M. “SNMP v2 Needs Reworking to Emerge as a Viable Net Management Platform”. Network World,
March 7, 1994.
Shoch, J. F., Hupp, J. A. “The Worm programs: Early Experiences with a Distributed Computation”,
Communication of the ACM, vol. 25, no. 3, pp. 172–180, 1982.
Short, R., Gamache, R., et al. “Windows NT Clusters for Availability and Scalability”, Proceedings,
COMPCON Spring 97, February, 1997.
Srinivasan, R. RPC: Remote Procedure Call Protocol Specification Version 2. Internet RFC 1831, August
1995.
Tanenbaum, A. Distributed Operating Systems, Englewood Cliffs, NJ, Prentice–Hall, 1995.
Tay, B. H., Ananda, A. L. “A Survey of Remote Procedure Calls”, Operating Systems Review, vol. 24,
pp. 68–79, 1990.
Thekkath, C. A., Mann, T., et al. “Frangipani: A Scalable Distributed File System”, Symposium on Operating Systems
Principles, pp. 224–237, 1997.
Wulf, W. A., Cohen, E. S., et al. “HYDRA: The Kernel of a Multiprocessor Operating System”, Communication
of the ACM, vol. 17, pp. 337–345, 1974.
WEBSITES
http://docs.sun.com/app/docs/doc/817-5093
Sun System Administration Guide: Devices and File Systems
IEEE Computer Society Task Force on Cluster Computing: An International forum to promote cluster com-
puting research and education.
Beowulf: An international forum to promote cluster computing research and education.
10 Real-Time Operating Systems
Learning Objectives
• To describe the background of the evolution of real-time systems and give an overview of
the real-time task and its parameters.
• To explain the different issues involved with real-time systems.
• To articulate the evolution of real-time operating systems.
• To describe the design philosophies, characteristics, requirements, and features of a real-
time operating system.
• To demonstrate the basic components of a real-time operating system, including its kernel
structure and scheduling mechanisms, together with an example of Linux real-time sched-
uling approach.
• To explain the role of clocks and timers to provide time services in the system, along with
an example of clock and timer resolutions in Linux as a case study.
• To describe the mechanism used in the implementation of communication and synchroni-
zation required in this system.
• To explain the signals realized in this system in the form of software interrupts.
• To explain the memory allocation mechanism, including allocation strategies, protection,
and locking.
• To demonstrate practical implementations of RTOSs as case studies by presenting the
Linux real-time extension, KURT system, RT Linux system, Linux OS, pSOSystem, and
also VxWorks, used in Mars Pathfinder.
the sake of clarity, such an individual function can be defined here as a task. Thus, a process can be
viewed as progressing through a sequence of tasks. At any instant, a process is engaged in a single
task, and it is the process/task that must be considered a unit of computation.
Brief details on this topic with an example are given on the Support Materials at www.routledge.
com/9781032467238
• A hard real-time task is one that must meet its deadline (hard deadline); missing this
deadline causes a penalty of a higher order of magnitude, leading to fatal damage or an
unacceptable, even irreparable, error in the system. A hard real-time system (e.g. an
avionic control system) is thus typically dedicated to processing real-time applications and prov-
ably meets the response requirements of an application under the stated conditions. Application
systems, such as guidance and control applications, are typically serviced using hard real-
time systems, because they fail if they cannot meet the response requirements. A ballistic
missile may be shifted from its specified trajectory if the response requirement is not
rigidly obeyed.
• A soft real-time task has an associated deadline that is desirable to meet but not manda-
tory; it still makes sense to schedule and complete the task even if it has passed its deadline.
A soft real-time system makes the best effort to meet the response requirement of a real-
time application but cannot guarantee that it will be able to meet it under all conditions.
Typically, it meets the response requirement in a probabilistic manner, say, within the
prescribed deadline 95 percent of the time. Application systems such as multimedia applications and
applications like reservation and banking systems, which essentially aim to provide good
quality of service but do not have a notion of failure, may thus be serviced using soft real-
time systems. The quality of the picture on a video may deteriorate occasionally if the response
requirement is not met, but one can still watch the video with almost no interruption.
• Another characterization of real-time tasks can be described as follows: a set of related
jobs (activities) is called a task. Jobs in a task may be precedence constrained to execute
in a certain order. Sometimes jobs may be constrained to complete within a certain time
of one another. Jobs may have data dependencies even when they are not precedence
constrained. If p_i is the minimum length of the intervals between the release times of
consecutive jobs (the inter-release interval), that is, the task period, and a_i is the arrival
time, r_i is the ready time, d_i is the deadline, c_i is the worst-case execution time, and φ_i is
the release time of the first job (activity) in task T_i, then:
• Periodic tasks: Task T_i is a sequence of jobs. Task T_i is time-driven. The characteristics
are known a priori, and task T_i is characterized by (p_i, c_i, φ_i). For example, the task is
to monitor the temperature of a furnace in a factory.
• Aperiodic tasks: Task T_i is event-driven. The characteristics are not known a priori,
and task T_i is characterized by (a_i, r_i, c_i, d_i). This task has either soft deadlines by which
it must finish or start or no deadlines (i.e. it may have a constraint on both start and fin-
ish time). An example is a task that is activated upon detecting a change in the furnace's
condition (temperature).
• Sporadic tasks: Aperiodic tasks with a known minimum inter-arrival time.
We want the system to be responsive, that is, to complete each task as soon as possible; on the
other hand, a late response might be annoying but tolerable. It is thus attempted to optimize the
responsiveness of the system for aperiodic tasks but never at the cost of hard real-time tasks, which
require their deadlines to be met religiously at all times.
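For concreteness, the task parameters just listed can be grouped as plain records; the structures below are only a restatement of the notation, with times assumed to be in milliseconds.

/* Periodic task T_i, characterized by (p_i, c_i, phi_i). Times in ms (assumed). */
struct periodic_task {
    long period;    /* p_i: minimum inter-release interval (task period)    */
    long wcet;      /* c_i: worst-case execution time                       */
    long phase;     /* phi_i: release time of the first job                 */
};

/* Aperiodic (or sporadic) task T_i, characterized by (a_i, r_i, c_i, d_i). */
struct aperiodic_task {
    long arrival;   /* a_i: arrival time                                    */
    long ready;     /* r_i: ready time                                      */
    long wcet;      /* c_i: worst-case execution time                       */
    long deadline;  /* d_i: deadline (soft, or absent for no-deadline jobs) */
};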
• Support for real-time operating system’s common activities (such as multiple contexts,
memory management, garbage collection, interrupt handling, clock synchronization).
• Support for real-time language features (such as language constructs for estimating
worst-case execution time of tasks).
• Resource management (RM) issues: Scheduling, fault-tolerance, resource reclaiming,
communication.
Real-time scheduling paradigms:
• Allocate time slots for tasks onto processor(s) (i.e. where and when a given task would
execute).
• Objective: predictably meeting task deadlines (schedulability check, schedule
construction).
Real-time task scheduling can be broadly classified as shown in Figure 10.1. The details are dis-
cussed in later sections.
• Preemptive scheduling: Task execution is preempted and later resumed at an appropriate
time.
– Preemption occurs mainly to execute higher-priority tasks.
– Offers higher schedulability.
– Involves higher scheduling overhead due to frequent context switching.
• Nonpreemptive scheduling:
– Once a task starts executing, it is allowed to continue its execution until it completes.
– Offers lower schedulability.
– Relatively less overhead due to less context switching.
• Optimal scheduling: definition
A static scheduling algorithm is said to be optimal if, for any set of tasks, it always
produces a feasible schedule (i.e. a schedule that satisfies the constraints of the tasks)
whenever any other algorithm can also do so.
A dynamic scheduling algorithm is said to be optimal if it always produces a feasible
schedule whenever a static algorithm with complete prior knowledge of all the pos-
sible tasks can do so.
Static scheduling is used for scheduling periodic tasks, whereas dynamic scheduling is
used to schedule both periodic as well as aperiodic tasks.
• Real-time languages:
• Support for management of time:
– Language constructs for expressing timing constraints, keeping track of resource
utilization.
• Schedulability analysis:
– Aid compile-time schedulability check.
• Reusable real-time software modules:
– Object-oriented methodology.
• Support for distributed programming and fault-tolerance.
• Real-time databases.
Conventional database systems:
• Disk-based.
• Use transaction logging and two-phase locking protocols to ensure transaction atomi-
city and serializability.
• These characteristics preserve data integrity, but they also result in relatively slow and
unpredictable response times.
Real-time database systems: The issues include:
• Transaction scheduling to meet deadlines.
• Explicit semantics for specifying timing and other constraints.
• Checking the database system’s ability to meet transaction deadlines during application
initialization.
An RTOS is one that essentially guarantees a certain capability within a specified time constraint.
For example, an operating system could be designed to ensure that a certain object is made available
for a robot working on an assembly line. In what is usually called a hard real-time operating system,
if the calculation could not be performed to make the object available at the pre-specified time, the
operating system would terminate with a failure. Similarly, in a soft real-time operating system, the
assembly line would not necessarily arrive at a failure but continue to function, though the production
output might be lower as objects failed to appear at their stipulated time, causing the robot to be tem-
porarily unproductive. Some real-time operating systems are designed and developed for a special
application, and others are more or less general purpose. Some existing general-purpose non–real-time
operating systems, however, claim to be real-time operating systems. To some extent, almost any
general-purpose operating system, such as Microsoft’s Windows 2000 or IBM’s OS/390, and to some
extent Linux, can be evaluated for its real-time operating system qualities. Reasons for this choice
include the timing requirement of applications that are not so hard. That is, even if an operating
system does not qualify, it may have some characteristics that enable it to be considered a solution
to a particular problem belonging to the category of real-time application. It is to be noted that the
objective of a true real-time operating system is not necessarily high throughput. In
fact, the specialized scheduling algorithm, the high clock-interrupt rate, and other similar factors
often intrude on the execution of the system and prevent it from yielding high throughput.
A good real-time operating system thus not only provides efficient mechanisms and services to carry
out good real-time scheduling and resource management policies but also keeps its own time and
resource consumptions predictable and accountable. In addition, a real-time operating system should
be more modular and extensible than its counterpart, a general-purpose operating system. Some early
large-scale real-time operating systems were the so-called control program developed by American
Airlines and the Sabre Airline Reservations System introduced by IBM.
• An event-driven operating system that only changes tasks when an event requires service.
• A time-sharing design that switches tasks on a clock interrupt (clock-driven), as well as on
events.
The time-sharing design wastes more CPU time on unnecessary task-switches (context-switches)
but offers better multitasking, the illusion that a user has sole use of a machine.
• Determinism: An operating system is deterministic to the extent that it performs its operations at fixed, predetermined times or within predetermined time intervals, even when the requests come from external events with prescribed timings. To real-
ize this, it depends on several factors, mainly how quickly it can respond to interrupts
and also the availability of both hardware and software resources that are adequate to
manage all such requests within the specifed time. One useful way to assess deter-
minism is by measuring the maximum delay it faces when responding to high-priority
device interrupts. In the case of traditional OSs, this delay might be in the range of tens
to hundreds of milliseconds, while in RTOSs, this delay should not be beyond a few
microseconds to a millisecond.
• Responsiveness: While determinism is concerned with the time to recognize and respond
to an interrupt, responsiveness is related to the time that an operating system takes to ser-
vice the interrupt after acknowledgement. Several aspects contribute to responsiveness.
Some notable ones are:
• Interrupt latency: It is the time the system takes before the start of the execution of an
immediate interrupt service routine, including the time required for process switching
or context switching, if there is any.
• The amount of time required to actually execute the ISR. This is generally dependent
on the hardware platform being used.
• The effect of nested interrupts. It may cause further delay due to the arrival of a high-
priority interrupt when the servicing of one is in progress.
• Dispatch latency: It is the length of time between the completion of an ISR and resump-
tion of the corresponding process (or thread) suspended due to the occurrence of the
interrupt.
In fact, determinism and responsiveness together constitute the response time to external events,
one of the most critical requirements of a real-time operating system, which must keep up with the
events, devices, data flows, and, above all, the individuals located in the domain external
to the system.
• Fail-soft operation: The system continues to operate, as far as possible, even when a fault occurs. In the event of a fault, graceful degradation in performance
is observed, leading only to a reduced level of service, and the system reverts to normal
operation when the fault is rectified. When the system operates at a reduced level, cru-
cial functions are usually assigned high priorities to enable them to perform in a timely
manner. In the event of a failure, the system typically notifies a user or a user process
that it is about to attempt corrective measures (actions) and then continues operation
at a reduced level of service. Even when a shutdown is necessary, an attempt is always
made to preserve file and data consistency.
• Stability: A real-time system is said to be stable if, in situations where it is impossible
to meet the deadlines of all tasks, the system at least meets the deadlines of its most
critical, high-priority tasks, even if that means sacrificing relatively less critical tasks.
It then notifies the users or user processes of the affected tasks about its inability
beforehand, so that default actions with respect to those tasks can be taken by the
system in time.
• Periodic threads: As an RTOS deals with periodic, aperiodic, and sporadic tasks,
described in Section 10.2, so are there periodic threads, aperiodic threads, and sporadic
threads. A periodic task can be implemented in the form of a thread (computational activ-
ity) that executes periodically, called a periodic thread. Such a thread would have to be
created and destroyed repeatedly at every period, which simply wastes costly CPU time,
thereby degrading system performance as a whole. That is why, in an RTOS that supports
periodic tasks (e.g. the Real-Time Mach OS), the kernel avoids such unnecessary redundancy
in repeatedly creating and destroying threads by simply reinitializing the thread and putting
it to sleep when it completes. Here, the kernel keeps track of the passage of time
and releases (brings back to the ready queue) the thread again at the beginning of the next
period. Most commercial operating systems, however, do not support periodic threads.
At best, a periodic task can be implemented at the user level as a thread that alternately
executes the code of the task and sleeps until the beginning of the next period (a minimal
user-level sketch follows this list). This means that the thread does its own reinitialization
and keeps track of time for its own next release without any intervention of the kernel.
• Aperiodic and sporadic threads: When an aperiodic and sporadic task (see Section 10.2)
is implemented in the form of thread, it gives rise to an aperiodic thread or a sporadic
thread. This type of thread is essentially event-driven and is released in response to the
occurrence of the specifed types of events that may be triggered by external interrupts.
Upon its completion, an aperiodic or sporadic thread is reinitialized and suspended as
usual. The differences between these three types of tasks are covered in Section 10.2.
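As a minimal sketch of the user-level approach just described for periodic threads, the following C fragment uses the POSIX clock_nanosleep() call with an absolute wake-up time so that the thread sleeps until the start of its next period; the period value and the loop bound are arbitrary demo choices.

    #include <stdio.h>
    #include <time.h>

    /* User-level periodic activity: do the work, then sleep until the next period. */
    static void periodic_body(long period_ms)
    {
        struct timespec next;
        clock_gettime(CLOCK_MONOTONIC, &next);           /* first release time */
        for (int i = 0; i < 5; i++) {                     /* 5 iterations for the demo */
            printf("job %d released\n", i);               /* ... periodic work here ... */

            /* Compute the absolute start of the next period. */
            next.tv_nsec += period_ms * 1000000L;
            while (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec  += 1;
            }
            /* The thread reinitializes itself: sleep until the next release. */
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        }
    }

    int main(void) { periodic_body(100); return 0; }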
Thread states: Threads and processes are analogous, so their states and state transitions are also
analogous, except that threads do not have resource ownership. Since the thread is an alternative
form of schedulable unit of computation to the traditional notion of process, it similarly goes through
different states in its lifetime, as follows:
• Ready state: When the thread is waiting for its turn to gain access to the CPU which
executes it.
• Running state: A thread is in the running state when the code attached to the thread is
being executed by the CPU.
• Waiting state: A thread is said to be in the waiting state when its ongoing execution is
not completed but it cannot proceed at the moment. It may have to go to sleep, thereby entering
the sleeping state; the other possibility is that some other thread suspends the currently running
thread, which then enters the suspended state. (Suspending the execution of a thread from another
thread is deprecated in newer versions.)
• Dead state: When the thread has finished its execution.
A task (and hence a thread) can be suspended or blocked for many different reasons. The operating
system typically keeps separate queues for threads that are blocked or suspended for different reasons.
Similarly, the kernel normally keeps a number of ready queues; each queue holds ready threads with
specific attributes (e.g. a queue of ready threads with the same priority). A thread scheduler, analo-
gous to the process scheduler, switches the processor among a set of competing threads, thereby caus-
ing a thread to go through its different states. The thread scheduler in some systems is a user program,
and in others, it is part of the OS. Since threads have fewer states and less information to be saved while
changing states, the thread scheduler has less work to do when actually switching from one thread to
another than is required for switching processes; an important aspect in favor of using threads.
Brief details on this topic with a figure are given on the Support Material at www.routledge.
com/9781032467238.
The kernel also deals with many other aspects, such as fault-tolerance, reliability, stability, and
recovery from hardware and software exceptions; but those aspects are deliberately kept outside the
scope of this discussion.
• System Calls: In addition to the several different functions that the kernel usually provides, it also
offers many other functions (e.g. application program interface (API) functions) which,
when called from user programs, do some work in kernel space on behalf of the call-
ing process (or thread). Any call to one of these API functions, some of which are listed
in Figure 10.2, is essentially a system call. In systems that provide memory protection,
user and kernel processes (or threads) are executed in separate memory spaces. When
a system call (API function) is issued by the calling process (or thread), the process is
FIGURE 10.2 A representative structure of a microkernel used in real-time operating systems (RTOS).
blocked, and the kernel saves everything in relation to the calling process (or thread) and
switches from user mode to kernel mode. It then takes the function name and arguments
of the call from the process’s (or thread’s) stack and executes the function on behalf of
the process (or thread). When the execution of the system call completes, the kernel
executes a return from exception, causing the system to return to user mode. The calling
process then resumes from the point where it left if it still has the highest priority. This
line of action, followed in sequence to execute an API function, is called a synchronous
system call. If the system call causes another process (or thread) to have a higher priority,
then the currently executing system call will be interrupted, and the process (or thread)
with the higher priority will be started.
When the call is asynchronous (such as in the case of an asynchronous I/O request), the calling
process (or thread) continues its own execution (without blocking) after issuing the call. The kernel
then provides a separate process (or thread) to execute the called function.
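A minimal sketch contrasting the two styles with standard POSIX calls is shown below: read() is the synchronous case, and aio_read() lets the caller continue while the request is serviced on its behalf. The file name and buffer size are arbitrary demo choices, and the polling loop is kept only to keep the example short.

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[64];
        int fd = open("/etc/hostname", O_RDONLY);    /* illustrative input file */
        if (fd < 0) return 1;

        /* Synchronous system call: the caller blocks until the kernel completes it. */
        ssize_t n = read(fd, buf, sizeof buf);
        printf("synchronous read returned %zd bytes\n", n);

        /* Asynchronous call: the caller continues; the request is serviced separately. */
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;
        if (aio_read(&cb) == 0) {
            /* ... do other useful work here while the request is in progress ... */
            while (aio_error(&cb) == EINPROGRESS)
                ;                                    /* a real program would not spin */
            printf("asynchronous read returned %zd bytes\n", aio_return(&cb));
        }
        close(fd);
        return 0;
    }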
Many embedded operating systems that do not provide memory protection allow the user and the
kernel process to execute in the same space. This is often favored in order to keep the execution-
time overhead small (due to no need for a process/context switch) as well as to avoid the overhead
of consuming extra memory space to provide full memory protection (on the order of an additional
few kilobytes needed per process). A system call in such a system is just like a procedure or function
call within the application.
• Scheduling and Timer Services: The heart of the system kernel is the scheduler which,
in most operating systems, executes periodically as well as whenever the state of a process
(or thread) changes. The scheduler assigns processors to schedulable jobs or, equivalently,
assigns schedulable jobs to processors. The scheduler is triggered to come into action by
means of raising the clock interrupts issued from the system clock device periodically.
The period of clock interrupts is called tick size, which is on the order of 10 milliseconds
in most operating systems. However, in a clock-driven system that uses a cyclic scheduler,
clock interrupts occur only at the beginning of a frame. At each clock interrupt, the kernel
performs several responsibilities to service the interrupt; some notable ones are described below.
The kernel first attempts to process the timer events by checking the queue of pending expiration
times stored in the time order in the queue of all the timers that are bound to the clock. This way,
the kernel can determine whether timer events have occurred since the previous time it checked
the queue. If the kernel finds that a timer event did occur, it carries out the specified action. In this
manner, the kernel processes all the timer events that have occurred and then queues all the speci-
fied actions. It subsequently carries out all these actions at appropriate times before finally returning
control to the user.
The next action that kernel takes is to update the execution budget, which is the time-slice
normally offered by the scheduler when it schedules a process (or thread) for execution based on
policy and considering other constraints. At each clock interrupt, the scheduler usually decrements
the budget of the executing process (or thread) by the tick size. If the process (or thread) is not
completed when the updated budget (i.e. the remaining time-slice) becomes 0, the kernel then sim-
ply decides to preempt the executing process (or thread). Some scheduling policies, such as FIFO,
offer an infinite time slice or do not decrement the budget of the executing process (or thread),
thereby allowing the process (or thread) to continue its execution even when some other processes
(or threads) of equal priority keep on waiting in the queue.
After taking all these actions, the kernel proceeds to update the ready queue. Some threads by
this time may be ready (e.g. released upon timer expirations), and the thread that was executing at
the time of the clock interrupt may need to be preempted. The scheduler accordingly updates the
ready queue to bring it up to date and finally gives control to the process (or thread) that
is lying at the head of the highest-priority queue.
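The following C fragment is a highly simplified, user-level simulation of the per-tick work just described: firing expired timers, charging the tick to the execution budget, and flagging a preemption when the budget is exhausted. The data structures, tick size, and values are illustrative assumptions, not the internals of any particular kernel.

    #include <stdio.h>

    #define TICK_MS 10                       /* assumed tick size */

    struct timer  { long expires_at_ms; const char *action; };
    struct thread { const char *name; long budget_ms; };

    static void on_clock_tick(long now_ms, struct timer *t, int ntimers,
                              struct thread *running)
    {
        /* (1) Process timer events whose expiration time has passed. */
        for (int i = 0; i < ntimers; i++)
            if (t[i].expires_at_ms > 0 && t[i].expires_at_ms <= now_ms) {
                printf("t=%ld: timer fired -> %s\n", now_ms, t[i].action);
                t[i].expires_at_ms = 0;      /* mark as processed */
            }

        /* (2) Charge the tick to the running thread's execution budget. */
        running->budget_ms -= TICK_MS;

        /* (3) Preempt when the remaining time slice reaches zero. */
        if (running->budget_ms <= 0)
            printf("t=%ld: budget of %s exhausted, preempt and update ready queue\n",
                   now_ms, running->name);
    }

    int main(void)
    {
        struct timer  timers[] = { { 30, "release periodic thread" } };
        struct thread cur = { "T1", 20 };
        for (long now = TICK_MS; now <= 50; now += TICK_MS)
            on_clock_tick(now, timers, 1, &cur);
        return 0;
    }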
Now, as the scheduler is periodically activated only at each clock interrupt, the system is criti-
cally dependent on the tick size, and the length of the tick is an important factor. A relatively large
tick size appears suitable for some commonly used scheduling policies, such as round-robin in time-
shared applications, but it may badly affect the schedulability of time-critical applications. On the
other hand, a smaller tick size fits nicely with time-critical environments but degrades system
performance due to increased scheduling overhead caused mainly by frequent servicing of relatively
expensive clock interrupts. That is why most operating systems prefer
to have a combination of time-based (tick) scheduling along with event-driven scheduling. This
way, whenever an event occurs, the kernel invokes the scheduler to update the ready queue, and it
then quickly wakes up or releases a process (or thread), detects a process (or thread) unblocked, or
creates a new process (or thread), and many other similar things. In this way, a process (or thread)
is properly placed in the ready queue as soon as it is set.
• External Interrupts: An interrupt is an important tool by which the system can moni-
tor the environment by gaining control during execution, give control to the deserving
point for needed execution, or facilitate many similar other activities that an operating
system needs to fulfill its objectives. Here, interrupt means hardware interrupts, and those
interrupts that take place due to the occurrence of events in the external system located
outside the domain of the computer system are referred to as external interrupts. For an
operating system, in particular an RTOS, that deals with external interrupts to keep up with
the external events with which it is concerned, proper and timely handling of such interrupts
is an essential functional requirement of the kernel. However, these inter-
rupts may be of various types depending on the nature of the interrupt source, and as such,
the amount of time required to handle an interrupt, including its servicing, varies over a
wide span. That is why interrupt handling in most contemporary operating systems is clas-
sified into two distinct categories. These are:
Immediate interrupt service: Interrupt handling at its first instance is executed depending
on the interrupt-priority level, which is entirely determined by the hardware being used, and as
such, most modern processor architectures provide some form of priority-interrupt with different
interrupt-priority levels. In addition, it is apparent that interrupt-priority levels (hardware-based)
are always higher than all process (or thread) priorities (software-based) and are even higher than
the priority at which the scheduler executes. When many interrupts with the same or different
interrupt-priority levels are raised within a small span of time, which is usually very common, the
interrupt-priority level attached to a particular interrupt then determines the ordering (either by
means of polling or some other suitable mechanism), following which the interrupt is serviced. The
corresponding service routine is then called by the kernel for execution. Moreover, if at any instant
when the execution of an interrupt servicing routine is in progress, a higher-priority interrupt, as
indicated by the interrupt-priority level, is raised, the ongoing interrupt servicing may be inter-
rupted to accommodate the relatively high-priority interrupt for its servicing, or the processor and
the kernel take care of the higher-priority interrupt in another fitting manner.
The interrupt-priority level of an interrupt determines the immediate interrupt service, which is
linked with what is known as responsiveness of the kernel. The total time required from the time the
interrupt is raised to the start of the execution of interrupt servicing of the interrupt after completing
all the needed housekeeping and other related activities is called interrupt latency. The ultimate
design objective of any kernel is to minimize the interrupt latency so as to make the kernel more
responsive. Various attempts in different areas related to this issue have been made to minimize the
latency. One notable one is to make the immediate interrupt handling routine as short as possible.
Another is to modify the design of the kernel so that the device-independent part (which is the code
for saving the processor state) of the interrupt service code can be injected into the interrupt service
routine of the device itself (device-dependent) to enable the processor to directly jump to the inter-
rupt service routines without going through any time-consuming kernel activity.
Scheduled interrupt service: To complete interrupt handling, after the end of the first step
(immediate interrupt service), a second step begins with the execution of another service
routine known as the scheduled interrupt service (not to be confused with the servicing performed
by the interrupting device's own interrupt service routine, which begins only after these two steps
of interrupt handling are over); it is invoked by the first step after the latter completes its own
responsibilities. The work of this second step is carried out by what is known as the scheduled
interrupt handling routine. This routine is preemptible and should be scheduled at a fitting
(software) priority in a priority-driven system. For example, in some RTOSs, such as LynxOS,
the priority of the kernel process (or thread) that executes a scheduled interrupt handling routine
is the priority of the user process (or thread) that opened the interrupting device. That is why
Figure 10.2 shows that, after the execution of the immediate interrupt service, the scheduler inserts
the scheduled interrupt handling process (or thread) into the ready queue and eventually lets the
highest-priority process (or thread) be scheduled, which consequently gains control of the processor.
Brief details of this topic with figures are given on the Support Material at www.routledge.com/
9781032467238.
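The split between immediate and scheduled interrupt service can be imitated at the user level. In the following sketch a signal handler (standing in for the immediate service) does only the async-signal-safe minimum and wakes a separate, preemptible handler thread (standing in for the scheduled service) through a semaphore; the use of SIGUSR1 to represent a device interrupt is purely illustrative.

    #include <pthread.h>
    #include <semaphore.h>
    #include <signal.h>
    #include <stdio.h>

    static sem_t work_pending;               /* signalled by the "immediate" handler */

    /* Immediate service: do the minimum work and defer the rest. */
    static void immediate_handler(int signo)
    {
        (void)signo;
        sem_post(&work_pending);             /* sem_post is async-signal-safe */
    }

    /* Scheduled service: runs as an ordinary, preemptible thread. */
    static void *scheduled_handler(void *arg)
    {
        (void)arg;
        sem_wait(&work_pending);
        printf("scheduled interrupt handling routine running\n");
        return NULL;
    }

    int main(void)
    {
        pthread_t h;
        sem_init(&work_pending, 0, 0);
        signal(SIGUSR1, immediate_handler);  /* SIGUSR1 stands in for a device interrupt */
        pthread_create(&h, NULL, scheduled_handler, NULL);

        raise(SIGUSR1);                      /* simulate the external interrupt */
        pthread_join(h, NULL);
        return 0;
    }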
tasks can be dispatched at runtime. All these factors along with similar other ones, when considered
together, give rise to the following categories of scheduling algorithms (as shown in Figure 10.1):
• Static table-driven scheduling: The approaches used here to realize a scheduling mecha-
nism are applicable to periodic tasks (or jobs). The parameters associated with the tasks
(or jobs), including the periodic arrival time, execution time, periodic ending deadline, and
relative priority of each task, are used as input. Using these parameters, the scheduler
attempts to develop a schedule that meets the requirements of all such periodic
tasks. This is essentially a predictable approach but at the same time an inflexible one,
because any change to any task's requirements demands that the schedule be restructured
afresh. Earliest-deadline-first (EDF) or other periodic deadline scheduling techniques are
typical examples of this category.
• Static priority-driven preemptive scheduling: The approaches employed here assign
priorities to tasks, based on which traditional priority-driven preemptive scheduling is car-
ried out by the scheduler. The mechanism used by this scheduler is similar to one com-
mon to most non–real-time multitasking (multiprogramming) systems in which priority
assignment depends on many factors. But in real-time systems, the priority assignment is
straightaway related to the time constraints associated with each task. The rate-monotonic
(RM) scheduling algorithm uses this approach in which static priorities are assigned to
tasks based on the length of their periods.
• Dynamic planning-based scheduling: The approaches determine the feasibility dynami-
cally (online) during runtime rather than statically (offline) prior to the start of execution.
An arriving task is accepted for execution only if it is feasible to meet its time constraints
(deadlines). One of the outcomes of this feasibility analysis is a schedule or plan that is used
to decide when to dispatch this task. When a task arrives, but before its execution begins, the
scheduling mechanism makes an attempt to create a fresh schedule that incorporates the new
arrival along with the previously scheduled tasks. If the new arrival can be scheduled in such
a way that its deadlines are satisfed, and no currently scheduled task is affected by missing
its deadline, then the schedule is restructured to accommodate the newly arrived task.
• Dynamic best-effort scheduling: The approaches used here do not perform any feasibility
analysis; instead, the system attempts to meet all deadlines and aborts any started task (or job)
whose deadline is missed. Many commercially popular real-time systems favor this approach.
When a task (or job) arrives, the system assigns it a priority based on its characteristics, and
normally some form of deadline scheduling, such as EDF scheduling, is chosen (a minimal
EDF selection sketch follows this list). Since the tasks are, by nature, typically aperiodic, no
static scheduling analysis is workable. With this type of scheduling, it is not possible to know
whether a timing constraint will be met until the deadline arrives or the task completes. This
is one of the major disadvantages of this form of scheduling, although it has the advantage of
being easy to implement.
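As a minimal illustration of deadline-based dispatch such as EDF, the following C fragment picks the ready job with the earliest absolute deadline; the job set is made up for the example.

    #include <stdio.h>

    struct job { const char *name; long abs_deadline; int ready; };

    /* Pick the ready job with the earliest absolute deadline (EDF). */
    static struct job *edf_pick(struct job *jobs, int n)
    {
        struct job *best = NULL;
        for (int i = 0; i < n; i++)
            if (jobs[i].ready &&
                (best == NULL || jobs[i].abs_deadline < best->abs_deadline))
                best = &jobs[i];
        return best;
    }

    int main(void)
    {
        struct job jobs[] = { { "J1", 60, 1 }, { "J2", 25, 1 }, { "J3", 40, 0 } };
        struct job *next = edf_pick(jobs, 3);
        printf("dispatch %s (deadline %ld)\n", next->name, next->abs_deadline); /* J2 */
        return 0;
    }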
This section mostly attempts to give an overview of the scheduling mechanisms and deals
only with those scheduling services that the kernel can easily provide to signifcantly simplify
the implementation of complex algorithms for scheduling aperiodic tasks at the user level.
Many different increasingly powerful and appropriate approaches to real-time task (or
job) scheduling have been proposed. All of these are based on additional information asso-
ciated with each task (or job). The scheduling algorithms are all designed with the primary
objective of starting (or completing) tasks or jobs at the most appropriate (valuable) times,
neither too early nor too late, and hence mostly rely on rapid interrupt handling and task
dispatching, despite dynamic resource demands and conflicts, processing overloads, and
hardware and software faults.
Brief details on this topic with a figure are given on the Support Material at www.routledge.
com/9781032467238.
classes of scheduling algorithm. Such scheduling is precisely applicable for events, such as releases
and completion of jobs or even interruption of executing jobs due to the occurrence of some other
events. That is why priority-driven algorithms are event-driven. This algorithm always attempts
to keep the resources busy whenever they are available by scheduling a ready job which requires
them. So, when a processor or any other resource is available and some job is ready to use it to make
progress, such an algorithm never makes the job wait. This attribute is often called greediness.
Priority-driven scheduling is thus often called greedy scheduling, since this algorithm is always
eager to make decisions that are locally optimal.
A priority-driven scheduler is essentially an online scheduler. It does not precompute a
schedule of the tasks (or jobs). Rather, it is, in general, implemented by assigning priorities to
jobs at release time, before execution. In fact, priority-driven algorithms differ in how priorities
are assigned to jobs. The algorithms for scheduling periodic tasks in this regard are clas-
sified into two main categories: fixed (or static) priority, in which the priority is assigned to each
periodic task (or job) at its release time and is fixed relative to other tasks (or jobs). In a dynamic
priority algorithm, the priority of a task (or job) may change (dynamically) over the time between
its release and completion. In fixed-priority scheduling, jobs ready for execution are usually placed in one
or more job queues arranged by the order of priority as assigned. At each scheduling decision
time, the scheduler updates the ready job queues in the descending order of priorities and then
the job located in front of the highest-priority queue is scheduled and executed on the available
processor(s), and after that the next job in the queue, and so on. Hence, a priority-driven schedul-
ing algorithm essentially arranges the job to a large extent by a list of assigned priorities, and that
is why this approach is sometimes called list scheduling. In fact, the priority list, along with the
other relevant decisions or rules (such as whether preemption will be done) when injected into it,
they all together constitute the scheduling algorithm as a whole.
Most traditional scheduling algorithms are essentially priority-driven. For example, both FIFO
and LIFO algorithms assign priorities to jobs according to their arrival times, and RR scheduling is
the same when preemption is considered. Moreover, in RR scheduling, the priority of the executing
job is often dynamically lowered to the minimum of all jobs waiting for execution by placing
it at the end of the queue once the job has executed for a time slice. The SJF, SPN, and LJF
algorithms assign priorities on the basis of job execution times.
However, most real-time scheduling algorithms of practical interest essentially assign fixed pri-
orities to individual jobs. The priority of each job is assigned upon its release, when it is inserted into
the ready job queue. Once assigned, the priority of the job relative to other jobs in the ready queue
remains fixed. In the other (dynamic) category, priorities remain fixed at the level of individual jobs,
but the priorities at the task level (across the successive jobs of a task) are variable.
Brief details on this topic with a figure are given on the Support Material at www.routledge.
com/9781032467238.
POSIX-compliant systems allow the user to schedule equal-priority threads belonging to the same
ready queue with a choice between round-robin and FIFO policies, and the kernel then conveniently
carries out either policy.
At each scheduling decision time, the scheduler has to find the highest-priority nonempty queue
in order to schedule the highest-priority ready thread for execution. The worst-case time complexity of this
operation is, at least theoretically, O(β), where β is the number of priority levels supported by the
operating system. In fact, by maintaining a bitmap of nonempty priority levels, the number of
comparisons required to find the highest-priority job to schedule at any instant is (β/K) + log2 K − 1,
where K is the word length of the CPU. If a system has 256 priority levels on a 32-bit CPU (K = 32),
then the scheduler would take at most 256/32 + log2 32 − 1 = 8 + 5 − 1 = 12 comparisons to detect
the highest-priority thread to be scheduled.
Brief details on this topic with a figure are given on the Support Material at www.routledge.
com/9781032467238.
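One common way to approach the (β/K) + log2 K − 1 figure is to keep a bitmap of nonempty priority levels; the sketch below (256 levels, 32-bit words, lower number = higher priority) scans the words and then binary-searches within the first nonzero word. It is an illustrative implementation, not the code of any particular kernel.

    #include <stdint.h>
    #include <stdio.h>

    #define LEVELS 256
    #define WORD_BITS 32
    #define WORDS (LEVELS / WORD_BITS)

    static uint32_t ready_bitmap[WORDS];   /* bit p%32 of word p/32 set => level p nonempty */

    static void mark_ready(int prio)
    {
        ready_bitmap[prio / WORD_BITS] |= (1u << (prio % WORD_BITS));
    }

    /* Up to beta/K word comparisons plus log2 K bit-search steps. */
    static int highest_ready(void)
    {
        for (int w = 0; w < WORDS; w++) {
            if (ready_bitmap[w] == 0) continue;
            uint32_t word = ready_bitmap[w];
            int bit = 0;
            for (int span = WORD_BITS / 2; span > 0; span /= 2)     /* binary search */
                if ((word & ((1u << span) - 1)) == 0) { word >>= span; bit += span; }
            return w * WORD_BITS + bit;
        }
        return -1;                          /* no thread is ready */
    }

    int main(void)
    {
        mark_ready(130);
        mark_ready(17);
        printf("schedule level %d\n", highest_ready());   /* prints 17 */
        return 0;
    }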
C1/T1 + C2/T2 + · · · + Cn/Tn ≤ 1
In other words, the sum of the processor utilizations of all the individual tasks cannot exceed a value
1, which corresponds to total utilization of the processor. This inequality also indicates a bound on
the number of tasks that can be successfully scheduled by a perfect scheduling algorithm. For any
particular algorithm, the bound may be even lower. It can be shown that for RM scheduling, the
following inequality also holds:
C1/T1 + C2/T2 + · · · + Cn/Tn ≤ n(2^(1/n) − 1)
When n = 1, the upper bound n(2^(1/n) − 1) = 1; for n = 2, the upper bound is 0.828;
for n = 3, it is 0.779; and as n → ∞, the upper bound n(2^(1/n) − 1) approaches
ln 2 ≈ 0.693. This shows that as the number of tasks increases, the scheduling bound converges to
ln 2 ≈ 0.693.
Brief details on this topic with figures and an example are given on the Support Material at www.
routledge.com/9781032467238.
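The bound can be applied directly as a sufficient schedulability test, as in the following sketch; the execution times and periods are made-up illustrative values.

    #include <math.h>
    #include <stdio.h>

    /* Check the sufficient RM condition: U = sum(Ci/Ti) <= n * (2^(1/n) - 1). */
    int main(void)
    {
        double C[] = { 20.0, 40.0, 100.0 };     /* illustrative execution times */
        double T[] = { 100.0, 150.0, 350.0 };   /* illustrative periods         */
        int n = 3;

        double U = 0.0;
        for (int i = 0; i < n; i++)
            U += C[i] / T[i];

        double bound = n * (pow(2.0, 1.0 / n) - 1.0);   /* 0.779 for n = 3 */
        printf("U = %.3f, RM bound = %.3f -> %s\n", U, bound,
               U <= bound ? "schedulable under RM" : "RM test inconclusive");
        return 0;
    }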
with the existing jobs once again on the basis of their slacks; the smaller the slack, the higher
the priority.
For example, consider a set of jobs Jk(u, v), where k = 1, 2, 3, . . ., in which u is the deadline and
v is the execution (or computation) time of job Jk. Now, assume that a job J1(6, 3) is released at
time 0 with deadline 6, and its execution time is 3. The job starts to execute at time 0. As long as
it executes, its slack s remains at 3, because at any time t before its completion, the slack
s = d − t − x = 6 − t − (3 − t) = 6 − 3 = 3, where x is the remaining execution time. Now, suppose
that it is preempted at time 2 by a job J3, which executes from time 2 to 4. At the end of this interval
(i.e. at t = 4), the slack s of J1 = d − t − x = 6 − 4 − (3 − 2) = 1 [the remaining portion of J1 is
(3 − 2), as 2 units of work out of a total of 3 units were already done in the interval 0 to 2]. Thus,
the slack of J1 decreases from 3 to 1 during the interval from 2 to 4 in which J3 executes.
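The slack computation used in this example reduces to a one-line formula, as the following sketch shows (values taken from the J1 example above).

    #include <stdio.h>

    /* Slack of a job at time t: s = d - t - x, where x is its remaining execution time. */
    static long slack(long deadline, long now, long remaining)
    {
        return deadline - now - remaining;
    }

    int main(void)
    {
        /* J1(6, 3) from the text: executing over [0,2), then preempted until t = 4. */
        printf("slack at t=1 while running:   %ld\n", slack(6, 1, 3 - 1)); /* 3 */
        printf("slack at t=4 after preemption: %ld\n", slack(6, 4, 3 - 2)); /* 1 */
        return 0;
    }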
With the LST algorithm, as discussed, scheduling decisions are made only at the times when
jobs are released (arrive at the ready queue) or completed; this version of the LST algorithm does not
really follow the LST rule of priority assignment rigorously at all times. To be specific, this version
can, at best, be called the nonstrict LST algorithm. If the scheduler were to adhere to the LST rule
strictly, it would have to continuously monitor the slacks of all ready jobs and keep comparing them
with the slack of the executing job. It would have to reassign priorities to jobs whenever their slacks
changed relative to each other. Consequently, the runtime overhead of the strict LST algorithm
includes the additional time required to monitor and compare the slacks of all ready jobs as time
progresses. In addition, if the slacks of more than one job become equal at any time, they need to
be serviced in a round-robin manner, which results in extra consumption of time due to the context
switches suffered by these jobs. For these and many other reasons, the strict LST algorithm is
effectively unattractive in practice and thus has fallen out of favor.
A relevant complicated example of this topic is given on the Support Material at www.
routledge.com/9781032467238.
tasks. More seriously, without having good resource access-control, the duration of a priority inver-
sion can be unbounded. The priority inversion experienced by the Pathfinder software was precisely
unbounded (uncontrolled) and is presented here as a good example of when this undesirable phe-
nomenon occurs.
Brief details on this topic with figures and an example are given on the Support Material at www.
routledge.com/9781032467238.
1. Scheduling Rule: Ready jobs are scheduled on the processor preemptively in a priority-
driven manner according to their current priorities. At its release time t, the current priority
π(t) of every job J is equal to its assigned priority. The job remains at this priority
except under the condition stated in rule 3.
2. Allocation Rule: When a job J requests a resource R at time t:
a). If R is free, R is allocated to J until J releases the resource, and
b). If R is not free, the request is denied, and J is blocked.
3. Priority-Inheritance Rule: When the requesting job J becomes blocked, the job Jk which
blocks J inherits the current priority π(t) of J. The job Jk executes at its inherited priority
π(t) until it releases R; at that time, the priority of Jk returns to its previous priority π(t′),
the priority it had at the time t′ when it acquired the resource R (before its inheritance of
the higher priority).
The priority inversion problem can now be explained in terms of the priority-inheritance protocol
with an example. Assume a low-priority task (or process) P2 already holds a resource that a
high-priority task (or process) P1 needs. The low-priority task P2 would temporarily acquire the
priority of task P1, which would enable it to be scheduled and to exit after finishing its execution
using the resource. This priority change takes place as soon as the higher-priority task blocks on
the resource; the blocking comes to an end when the resource is released by the lower-priority task,
which then gets back the default priority it had when it acquired the resource. In this way, the
problem of unbounded priority inversion, as discussed in the last section, can be resolved with the
use of the priority-inheritance protocol. However, use of the priority-inheritance protocol in many
situations is impractical because it would require the kernel to note minute details of the operation
of processes (as normally happens when deadlock is handled).
Brief details on this topic with a solution to the priority inversion problem are given on the
Support Material at www.routledge.com/9781032467238.
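At the user level, POSIX exposes priority inheritance through the mutex protocol attribute; the following sketch creates such a mutex with PTHREAD_PRIO_INHERIT. It only shows the API usage; the critical-section contents are omitted.

    #include <pthread.h>
    #include <stdio.h>

    int main(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutex_t lock;

        pthread_mutexattr_init(&attr);
        /* Owner inherits the priority of any higher-priority thread it blocks. */
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&lock, &attr);

        pthread_mutex_lock(&lock);
        /* ... critical section protected against unbounded priority inversion ... */
        pthread_mutex_unlock(&lock);

        pthread_mutex_destroy(&lock);
        pthread_mutexattr_destroy(&attr);
        printf("priority-inheritance mutex used\n");
        return 0;
    }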
• Each ready job Jk at any time t is scheduled by the scheduling algorithm and executes at its
current (assigned) priority πk(t), and the assigned priority of all such jobs is fixed.
• A priority is associated with each resource. The resources required by all jobs are known
a priori, before the execution of any job begins.
In this approach, a new parameter called priority ceiling associated with every resource is used.
The priority ceiling of any resource Rx is one level higher than the highest priority of all the jobs
that require Rx and is denoted by U(Rx). It is to be noted that if the resource access-control protocol
includes the priority-inheritance rule, then a task (or job) can inherit a priority as high as k during
its execution if it requires a resource with priority ceiling k.
At any time t, the current priority-ceiling, or simply ceiling, Ũ(t) of the system is equal to the
highest-priority ceiling of the resources that are in use at that time, if resources are in use. If all the
resources are free at the time, the current ceiling Ũ(t) is equal to Ω, a nonexistent priority level that
is lower than the lowest priority of all jobs.
In its simplest form, the priority-ceiling protocol is defned by the following rules, with the
assumption that some of the jobs contend for resources and that every resource has only 1 unit.
1. Scheduling Rule: Ready jobs are scheduled on the processor preemptively in a priority-
driven manner according to their current priorities. At its release time t, the current priority
π(t) of every job J is equal to its assigned priority. The job remains at this priority except
under the condition stated in rule 3.
2. Allocation Rule: When a job J requests a resource R at time t:
a). If R is not free, the request is denied, and J is blocked.
b). If R is free,
– If J's priority π(t) is higher than the current priority ceiling Ũ(t), R is allocated to J.
– If J's priority π(t) is not higher than the current priority ceiling Ũ(t) of the system, R
is allocated to J only if J is the job holding the resource(s) whose priority ceiling is
equal to Ũ(t); otherwise, J's request is denied, and J becomes blocked.
3. Priority-Inheritance Rule: When the requesting job J becomes blocked, the job Jk which
blocks J inherits the current priority π(t) of J. The job Jk executes at its inherited priority
π(t) until it releases every resource whose priority ceiling is equal to or higher than π(t);
at that time, the priority of Jk returns to its previous priority π(t′), the priority it had at the
time t′ when it was granted the resource(s) (before its inheritance of the higher priority).
The priority-ceiling protocol (or ceiling-priority protocol, CPP) can be easily implemented by the
system or at the user level in a fixed-priority system that supports a FIFO-within-equal-priority policy.
The CPP, however, requires prior knowledge of the resource requirements (similar to the methods used in
avoidance of deadlocks) of all threads. From this knowledge, the resource manager generates the pri-
ority ceiling U(R) of every resource R. In addition to the current and assigned priorities of each thread,
the thread’s TCB also contains the names of all resources held by the thread at the current time.
Whenever a thread requests a resource R, it actually requests a lock on R. The resource man-
ager then locks the scheduler and looks up U(R); if the current priority of the requesting thread
is lower than U(R), it sets the thread’s current priority to U(R), allocates R to the thread, and then
unlocks the scheduler. Similarly, when a thread unlocks a resource R, the resource manager checks
whether the thread’s current priority is higher than U(R). The fact that the thread’s current priority
is higher than U(R) indicates that the thread still holds a resource with a priority ceiling higher
than U(R). The thread’s priority should be left unchanged in this case. On the other hand, if the
thread’s current priority is not higher than U(R), the priority may need to be lowered when R is
released. In this case, the resource manager locks the scheduler, changes the current priority of
the thread to the highest-priority ceiling of all resources the thread still holds at that time or
to the thread’s assigned priority (i.e. bringing back the thread’s priority to its previous value at
the time of allocating the resource R) if the thread no longer holds any resources.
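POSIX also offers a user-level form of the ceiling-priority protocol through PTHREAD_PRIO_PROTECT together with a per-mutex ceiling, as in the sketch below. The ceiling chosen here (the maximum SCHED_FIFO priority) is only an example, and boosting a thread to a real-time ceiling typically requires appropriate real-time privileges.

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutex_t lock;
        int ceiling = sched_get_priority_max(SCHED_FIFO);  /* U(R): example choice */

        pthread_mutexattr_init(&attr);
        /* The owner runs at the mutex's preassigned ceiling while holding the lock. */
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
        pthread_mutexattr_setprioceiling(&attr, ceiling);
        pthread_mutex_init(&lock, &attr);

        pthread_mutex_lock(&lock);      /* owner now runs at the ceiling priority   */
        /* ... critical section ... */
        pthread_mutex_unlock(&lock);    /* owner's previous priority is restored    */

        pthread_mutex_destroy(&lock);
        pthread_mutexattr_destroy(&attr);
        printf("ceiling = %d\n", ceiling);
        return 0;
    }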
The priority-inheritance protocol lets the requesting job have a resource whenever the resource is
free. In contrast, according to the allocation policy of the priority-ceiling protocol, a task (or job)
may be denied its requested resource even when the resource is free.
The priority-inheritance rules of these two protocols are by and large the same. Both
agree with the principle that whenever a lower-priority job Jk blocks a job J whose request has just
been denied, the priority of Jk is raised to J's priority π(t). The difference arises mainly from
the non-greedy nature of the priority-ceiling protocol: under it, job J may be blocked by a
lower-priority job that does not even hold the requested resource, which is not possible under
the priority-inheritance protocol. Priority-ceiling blocking is also sometimes referred to as
avoidance blocking. The reason for this term is that the blocking caused by the priority-ceiling
protocol is essentially the price paid for the avoidance of deadlocks among jobs. That is why the
two terms, avoidance blocking and priority-ceiling blocking, are often used interchangeably.
The overhead of priority inheritance is rather high. Since the priority-ceiling protocol also uses
the same mechanism, its overhead is naturally also high, although not as high as simple prior-
ity inheritance, since there is no transitive blocking. Also, each resource acquisition and release
requires a change of priority of at most the executing thread. That is why CPP is sometimes called
the poor person’s priority-ceiling protocol.
Within each class, multiple priorities may be used, with priorities for the real-time processes
(threads) always higher than priorities for non–real-time processes (threads) belonging to the
SCHED_OTHER class. There are altogether 100 priority levels for real-time classes, ranging from
0 to 99 inclusive, and the SCHED_OTHER class ranges from 100 to 139. The rule is: the lower the
number, the higher the priority.
In Linux, processes using the SCHED_FIFO and SCHED_RR policies are scheduled on a fixed-
priority basis, whereas processes using the SCHED_OTHER policy are scheduled on a time-shar-
ing basis. Any process (thread) belonging to the class SCHED_OTHER can only begin its execution
if there are no real-time threads ready to execute. The scheduling policies and mechanisms used for
the non–real-time processes belonging to the class SCHED_OTHER were discussed in Chapter 4
in which scheduling with traditional uniprocessor operating systems was explained, so this area has
not been included in the current discussion.
For FIFO (SCHED_FIFO) threads, the rules are:
• The executing FIFO threads are normally nonpreemptible, but the system will interrupt an
executing FIFO thread when:
1. Another FIFO thread of higher priority becomes ready.
2. The executing FIFO thread becomes blocked for one of many reasons, such as wait-
ing for an I/O event to occur.
3. The executing FIFO thread voluntarily relinquishes control of the processor following a
system call to the primitive sched_yield.
• When an executing FIFO thread is interrupted (blocked), it is placed in the respective
queue associated with its priority. When it becomes unblocked, it returns to the same prior-
ity queue in the active queue list.
• When a FIFO thread becomes ready, and if that thread has a priority higher than the pri-
ority of the currently executing thread, then the executing thread is preempted, and the
available highest-priority ready FIFO thread is scheduled and executed. If, at any instant,
more than one thread has the highest priority, then the thread that has been waiting the
longest is selected for execution.
The SCHED_RR policy, when implemented, is similar to SCHED_FIFO, except that a time-slice is
associated with each thread. When a SCHED_RR thread has executed for its specified time-slice, it
is preempted and returned to its priority queue with the same time-slice value. A real-time thread of
equal or higher priority is then selected for execution. Time-slice values, however, are never changed.
The implementation of FIFO and RR scheduling for a set of four threads of an arbitrary process,
with their relative priorities, is depicted in Figure 10.3, which also shows the distinction
between these two policies. Assume that all these waiting threads are ready for
execution at an instant when the currently executing thread waits or terminates and that no other
higher-priority thread is awakened while a thread is under execution.
Figure 10.3(b) shows the fow with FIFO scheduling of the threads when they all belong to the
SCHED_FIFO class. Thread B executes until it waits or terminates. Next, although C and D have
the same priority, thread C starts because it has been waiting longer (arrived earlier) than D. Thread
C executes until it waits or terminates; then thread D executes until it waits or terminates. Finally,
thread A executes.
Similarly, Figure 10.3(c) shows the fow with RR scheduling of the threads when they all belong
to the SCHED_RR class. Thread B executes until it waits or terminates. Next, threads C and D are
time-sliced because they both have the same priority. Finally, thread A executes.
It is worth noting that the user has the option to control the maximum and minimum priorities
associated with a scheduling policy using the primitives sched_get_priority_min() and sched_get_
priority_max(). Similarly, one can also find the time-slices given to processes that are scheduled
in a round-robin policy using sched_rr_get_interval(). Since the source is already at hand, one can
easily change these parameters at will.
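A minimal sketch of these calls is given below; the SCHED_FIFO priority value 50 is an arbitrary example, and changing the policy normally requires real-time privileges (e.g. root or CAP_SYS_NICE).

    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        /* Query the priority range of a policy (here SCHED_RR) ... */
        printf("SCHED_RR priorities: %d..%d\n",
               sched_get_priority_min(SCHED_RR), sched_get_priority_max(SCHED_RR));

        /* ... and the round-robin time slice of the calling process (pid 0 = self). */
        struct timespec ts;
        if (sched_rr_get_interval(0, &ts) == 0)
            printf("RR time slice: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);

        /* Make the calling process a SCHED_FIFO thread at an example priority. */
        struct sched_param sp = { .sched_priority = 50 };
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
            perror("sched_setscheduler");
        return 0;
    }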
FIGURE 10.3 An example of a representative scheme of Linux real-time scheduling using the FIFO and
round-robin (RR) algorithms with four threads of a process.
The mechanisms commonly provided for these purposes are shared memory, message queues,
synchronization primitives (e.g. condition variables, mutexes, and semaphores), and events and
signals. Shared memory provides a low-level, high-bandwidth, and low-latency means of
interprocess communication and is often used for communication not only among processes that
run on a uniprocessor but also among processes that run on tightly coupled multiprocessors.
However, real-time applications sometimes do not explicitly synchronize accesses to shared
memory; rather, they mostly rely on "synchronization by scheduling"; that is, processes (or
threads) that access the shared memory are scheduled so as to make explicit synchronization
unnecessary. Hence, the entire burden of providing reliable access to shared memory is shifted
from "synchronization" to "scheduling and schedulability analysis". As a result, the use of shared
memory for realizing interprocess communication becomes costly, and it also makes the system
brittle. For these and many other reasons, the use of shared memory for interprocess
synchronization and communication between real-time processes is not helpful and thus is not
considered here.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
FIGURE 10.4 A representative illustration of a message-passing mechanism using message transfer via
message queue.
Brief details on this topic are given on the Support Material at www.routledge.com/9781032467238.
essentially an attribute of the message queue which can be set when the message queue
is opened. Similarly, by default, a thread that calls the receive function mq_receive() may
be blocked if the message queue is empty. It is possible to make the receive call
non-blocking in the same fashion.
• Notification: This is typically a means by which a message queue notifies a process when
the queue changes from being empty to nonempty. Indeed, a notification facility keeps the
service provider aware of the current state of the message queue, which allows it to respond
quickly and appropriately so that send and receive operations by the respective threads can
continue smoothly (a minimal sketch of these calls follows this list).
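The POSIX message-queue calls mentioned above can be exercised with a short program like the sketch below; the queue name "/rt_demo" and its attributes are illustrative choices. (A notification request would additionally use mq_notify(), omitted here for brevity.)

    #include <mqueue.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        struct mq_attr attr = { .mq_flags = 0, .mq_maxmsg = 8,
                                .mq_msgsize = 64, .mq_curmsgs = 0 };
        mqd_t q = mq_open("/rt_demo", O_CREAT | O_RDWR, 0600, &attr);
        if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

        const char *msg = "sensor reading";
        mq_send(q, msg, strlen(msg) + 1, 5);       /* may block if the queue is full  */

        char buf[64];
        unsigned prio;
        ssize_t n = mq_receive(q, buf, sizeof buf, &prio);  /* may block if empty     */
        printf("received %zd bytes at priority %u: %s\n", n, prio, buf);

        mq_close(q);
        mq_unlink("/rt_demo");
        return 0;
    }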
Another kind of communication between tasks (or processes) in RTOSs (especially in embedded
systems) is the passing of what might be called "synchronization information" from one task to
another. Synchronization information is essentially like a command, where some commands could
be positive and some negative. For example, a positive command would be something like: "I am
facing an emergency, and I want your help to handle it”, or more generally, “Please join me in han-
dling”. A negative command, on the other hand, to a task would be something like: “Please do not
print now, because my task is using the printer”, or more generally: “I want to lock . . . for my own
use only”.
In Windows NT, events and asynchronous procedure calls (APCs) serve the same purpose as
signals in UNIX systems. An NT event (object) is, in fact, an efficient and powerful notification and
synchronization mechanism. Being synchronous, NT event delivery has relatively low overhead
compared with the higher overhead of asynchronous delivery. Moreover, the NT event is
many-to-many in the sense that multiple threads can wait on one event and a thread can wait for
multiple events. Each UNIX signal, in contrast, is targeted at only an individual thread (or process),
and each signal is handled independently of other signals.
The pSOSystem introduced by Motorola provides both events and asynchronous signals. The
events are essentially synchronous and point-to-point. An event is usually sent to a specified
receiving task. The event has, in fact, no effect on the receiving task if the task does not call the
event-receive function. An asynchronous signal, in contrast, forces the receiving task to respond.
A real-time POSIX signal, however, provides numerous interesting features that exemplify their
good and practical uses in real-time applications.
Real-time signals can be queued, while traditional UNIX signals cannot. If signals are not queued,
a signal delivered while it is blocked may be lost. Hence, queuing ensures the reliability (reliable
delivery) of signals. Moreover, a queued real-time signal can carry data, while a traditional signal
handler has only one parameter: the number of the associated signal. In contrast, the signal handler
of a real-time signal whose SA_SIGINFO bit in the sa_flags field is set has an additional parameter,
a pointer to a data structure that contains the data value to be passed to the signal handler. This
capability increases the communication bandwidth of the signal mechanism. Using this feature, a
server can notify a client that its requested service has been completed and pass the result back to
the client at the same time. In addition, queued signals are essentially prioritized and are delivered
in order of priority: the lower the signal number, the higher the priority.
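A minimal sketch of a queued, data-carrying real-time signal using SA_SIGINFO and sigqueue() follows; the signal number SIGRTMIN and the value 42 are arbitrary demo choices.

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Three-argument handler: info->si_value carries the data sent with the signal. */
    static void handler(int signo, siginfo_t *info, void *ctx)
    {
        (void)ctx;
        /* printf is not async-signal-safe; used here only to keep the demo short. */
        printf("signal %d delivered with value %d\n", signo, info->si_value.sival_int);
    }

    int main(void)
    {
        struct sigaction sa;
        sa.sa_flags = SA_SIGINFO;            /* request the three-argument handler */
        sa.sa_sigaction = handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGRTMIN, &sa, NULL);

        union sigval v;
        v.sival_int = 42;                     /* data accompanying the signal */
        sigqueue(getpid(), SIGRTMIN, v);      /* queued, prioritized delivery */

        sleep(1);                             /* the signal is normally delivered
                                                 before or shortly after sigqueue returns */
        return 0;
    }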
The signal mechanism is expensive, similar to hardware interrupt servicing.
Changing mode from user to supervisor, the complicated operations needed to service the signal,
and finally the return to user mode are all time-consuming activities when compared to other
commonly used synchronization and communication mechanisms. Moreover, signal handlers,
like interrupt handlers, are executed ahead of thread priorities. Hence, the time needed to
service signal handlers must be kept as short as possible, just as it is equally important to
shorten the execution time of interrupt service routines.
Memory allocators that suffer from fragmentation may pose no real problem for desktop machines
that are rebooted periodically but are unacceptable for many RTOSs and especially for embedded
systems that often run for years without rebooting. One way to alleviate this problem is to use a
simple fixed-size-blocks algorithm, which has been observed to work very well for simple RTOSs
and also for not-very-large embedded systems.
• Use of virtual memory: The introduction of virtual memory, its related aspects, and their
solutions in traditional non–real-time operating systems were described in Chapter 5. Many
RTOSs designed primarily for embedded real-time applications, such as data acquisition,
signal processing, and monitoring, may not require virtual memory and its related map-
ping. For example, the pSOSystem, upon request, creates physically contiguous blocks
of memory for the application. The application can request variable-size segments from its
memory block and define a memory partition consisting of physically contiguous fixed-
size buffers. While virtual memory provides many distinct practical advantages in the
management of memory, its presence also imposes substantial penalties, mainly in space
and time. Indeed, the address translation table required for this purpose itself consumes a
good amount of main memory in the resident part of the operating system, and scanning
the address translation table to map a virtual address to its corresponding physical address
consumes additional time, which consequently slows down the overall execution speed.
Moreover, this scheme often complicates DMA-controlled I/O, which requires physically
contiguous memory space, since the processor must set up the DMA controller multiple
times, once for each physically contiguous block of addresses for data transfer to and from
main memory.
• Use of memory in deeply embedded systems: Most embedded OSes are stored in binary
code in ROM for execution. That is, unlike traditional operating systems, code is never
brought into main memory. The reason is that the code is not expected to change very
often, as embedded devices are usually dedicated to their own individual specifc purposes.
Indeed, the behavior of such systems is usually specified completely at their design time.
• Dynamic memory allocation: Dynamic allocation of main memory creates an issue relat-
ing to determinism of service times. Many general-computing non–real-time operating
systems offer memory allocation services from a buffer known as the “heap”. Offering of
additional memory from the heap and returning this memory to the heap when it is not
needed ultimately give rise to the external fragmentation problem, which produces many
scattered useless small holes in the heap. Consequently, it results in shortage of useful
memory even though enough memory is still available in the heap, which may ultimately
cause heap services to degrade. This external fragmentation problem can, however, be
addressed by so-called garbage collection (defragmentation) software, but such software is
often wildly non-deterministic.
Real-time operating systems thus avoid both memory fragmentation and "garbage collection"
along with all their ill effects. Instead, one of the alternatives that RTOSs offer is non-
fragmenting memory allocation, using a limited number of memory chunks of various fixed
sizes, which are then made available to application software. While this approach is certainly
less flexible than the approach taken by memory heaps, it does avoid external memory fragmen-
tation and the related defragmentation. Additional memory, if required by the application, is
offered from this pool of different-sized memory chunks according to the requested size; it
is subsequently returned when it is no longer required and put onto a free list of buffers of its
own size, available for future reuse.
More about this topic, with a figure, is given on the Support Material at www.routledge.com/9781032467238.
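The essence of such a non-fragmenting, fixed-size-blocks scheme can be sketched as follows (an illustrative outline only, with hypothetical names such as pool_alloc and pool_free, not the code of any particular RTOS): a pool of equal-sized buffers is carved out of a static array at initialization, and allocation and release simply pop and push a free list, so both operations take constant, predictable time and can never fragment the pool. In a real system the two operations would additionally be guarded against concurrent access, typically by briefly disabling interrupts or taking a lock.

    #include <stddef.h>

    #define BLOCK_SIZE   64          /* size of every buffer in this pool   */
    #define BLOCK_COUNT  32          /* number of buffers carved at startup */

    typedef union block {
        union block *next;           /* link used only while the block is free */
        unsigned char data[BLOCK_SIZE];
    } block_t;

    static block_t pool[BLOCK_COUNT];
    static block_t *free_list;

    /* Build the free list once, typically at system initialization. */
    void pool_init(void)
    {
        for (size_t i = 0; i < BLOCK_COUNT - 1; i++)
            pool[i].next = &pool[i + 1];
        pool[BLOCK_COUNT - 1].next = NULL;
        free_list = &pool[0];
    }

    /* O(1) and deterministic: pop one buffer or fail immediately. */
    void *pool_alloc(void)
    {
        block_t *b = free_list;
        if (b != NULL)
            free_list = b->next;
        return b;
    }

    /* O(1) and deterministic: push the buffer back; no coalescing is ever needed. */
    void pool_free(void *p)
    {
        block_t *b = p;
        b->next = free_list;
        free_list = b;
    }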
handle hard real-time application requirements, but even then, this modified Linux falls far short of
compliance with the POSIX real-time extensions. What is more serious about this approach is the
loss of portability: applications written to execute on these extensions are portable neither to standard
UNIX machines nor to any other commercial real-time operating system.
Several shortcomings of Linux have been experienced when it is used for real-time applications.
One of the most crucial arises from the disabling of interrupts by Linux subsystems while they are
in critical sections. While most device drivers disable interrupts for only a few microseconds, the
disk subsystems of Linux may disable interrupts for as long as a few hundred microseconds at a
time, and clock interrupts remain blocked for this entire duration. This seriously affects the
predictability of the system. One solution to this problem might be to rewrite all the offending
drivers afresh to make their nonpreemptible sections as short as possible, as in other standard
real-time operating systems. Unfortunately, neither extension released so far attacks this problem
head on; instead, one tries to live with it, while the other avoids it.
The scheduling mechanism of the revised Linux version 2.6 largely enhanced the scheduling
capability for non-real-time processes but includes essentially the same real-time scheduling
activities as its previous releases. In fact, the Linux scheduler handles real-time processes on a
fixed-priority basis along with the non-real-time processes on a time-sharing basis.
The scheduling activities carried out by Linux were already discussed in detail in Section 10.8.3.3.
The clock and timer resolution in standard Linux was explained in Sections 10.8.4 and 10.8.4.1.
However, Linux improved its time service by introducing a high-resolution time service, UTIME,
designed to provide microsecond clock and timer granularity using both the hardware clock and the
Pentium timestamp counter. With UTIME, system calls that contain time parameters,
such as select and poll, can specify time down to microsecond granularity. Rather than having
the clock device programmed to interrupt periodically, UTIME programs the clock to interrupt in
one-shot mode. At any time, the next timer interrupt occurs at the earliest of all future timer
expiration times. Since the kernel responds as soon as a timer expires, with UTIME the actual timer
resolution is limited only by the few microseconds the kernel takes to service a timer interrupt.
Because of the extra work UTIME does to provide high-resolution time services, the execution time
of the timer interrupt service routine in Linux with UTIME is naturally several times longer than
that in standard Linux.
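The idea behind such one-shot operation can be outlined as follows (a simplified sketch, not the actual UTIME code; program_one_shot() and now_usec() are stand-ins for the hardware-dependent routines): pending expiration times are kept in a table, and after every timer interrupt the clock chip is reprogrammed to fire at the earliest expiration that remains.

    #include <stdint.h>
    #include <stddef.h>

    #define MAX_TIMERS 16

    static uint64_t expiry[MAX_TIMERS];   /* pending expiration times, in microseconds */
    static size_t   ntimers;

    extern void     program_one_shot(uint64_t t);  /* stand-in: arm the clock chip to fire once at t */
    extern uint64_t now_usec(void);                /* stand-in: current time from the timestamp counter */

    static uint64_t earliest(void)
    {
        uint64_t e = UINT64_MAX;
        for (size_t i = 0; i < ntimers; i++)
            if (expiry[i] < e)
                e = expiry[i];
        return e;
    }

    /* Called, for example, when select() or poll() arms a fine-grained timeout. */
    void add_timer(uint64_t t)
    {
        if (ntimers < MAX_TIMERS)
            expiry[ntimers++] = t;
        program_one_shot(earliest());          /* always arm for the nearest deadline */
    }

    /* Clock interrupt handler: expire due timers, then rearm for the next one. */
    void clock_interrupt(void)
    {
        uint64_t t = now_usec();
        size_t i = 0;
        while (i < ntimers) {
            if (expiry[i] <= t)
                expiry[i] = expiry[--ntimers]; /* deliver the event (wake the waiter), then remove it */
            else
                i++;
        }
        if (ntimers > 0)
            program_one_shot(earliest());      /* next interrupt at the earliest remaining expiry */
    }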
With regard to the use of threads, Linux did not provide a thread library until recently.
Rather, it offered only the low-level system call clone(), by which a process can be created that
shares the address space of its parent process, as well as other parts of the parent's context (such
as open file descriptors, message managers, signal handlers, etc.), as specified by the call. Recently,
Leroy developed a thread library consisting of Linux threads, which are essentially UNIX pro-
cesses created using the clone() system call. Thus, Linux threads are one thread per process and are
scheduled by the kernel scheduler just like UNIX processes. Since Linux threads are one thread per
process, the distinct advantage of this model is that it simplifies the implementation of the thread
library and thereby increases its robustness. On the other hand, one of its disadvantages is that
context switches on mutex and condition operations must go through the kernel. Still, context
switches in the Linux kernel are found to be quite efficient.
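The flavor of the underlying mechanism can be seen from the following small example, which uses the standard Linux clone() call directly (error handling is omitted, and this is merely an illustration of the system call, not the code of the thread library itself): the child shares the parent's address space, open file descriptors, and signal handlers, so an update made by the child is visible to the parent.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define STACK_SIZE (64 * 1024)

    static int shared_counter = 0;     /* visible to the child because of CLONE_VM */

    static int worker(void *arg)
    {
        (void) arg;
        shared_counter++;              /* updates the parent's copy: one address space */
        printf("worker: counter = %d\n", shared_counter);
        return 0;
    }

    int main(void)
    {
        char *stack = malloc(STACK_SIZE);

        /* Share address space, file descriptors, and signal handlers with the child;
           the child is still a separate process scheduled by the kernel. */
        pid_t tid = clone(worker, stack + STACK_SIZE,
                          CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                          NULL);

        waitpid(tid, NULL, 0);
        printf("parent: counter = %d\n", shared_counter);
        free(stack);
        return 0;
    }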
Linux threads provide most of the POSIX thread extension API functions and conform to the
standard except for signal handling. In fact, Linux threads use signals SIGUSR1 and SIGUSR2 for
their own work, and these signals are therefore no longer available to applications. Since Linux does
not support the POSIX real-time extensions, no signal is available for application-defined use. In addition,
signals are not queued and may not be delivered in order of priority.
Two extensions of Linux (already mentioned earlier) are the KURT (Kansas University Real-
Time) system and the RT Linux system, which enable applications with hard real-time requirements
to run on the respective Linux platforms.
A brief discussion of this topic is given on the Support Material at www.routledge.com/
9781032467238.
then sets the flag again to enable Linux to resume. In the meantime, the RT (thin) kernel queues all
the pending interrupts to be handled by Linux and passes them to Linux when Linux enables
interrupts again (i.e. when the flag is set). If an interrupt is intended for Linux, the RT (thin) kernel
simply relays it to the Linux kernel for the needed action. In this way, the RT (thin) kernel enables
most real-time tasks to meet their deadlines, except possibly a few missed ones, and at the same time
keeps the Linux kernel and user processes running in the background of the real-time tasks.
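In outline, the dispatching logic of such a thin kernel might look roughly as follows (an illustrative sketch only, not the actual RT Linux source; irq_is_real_time(), real_time_handler(), and deliver_to_linux() are stand-ins for the mechanisms just described, and rt_linux_enable() represents the point at which Linux believes it is re-enabling interrupts):

    #include <stdbool.h>

    #define MAX_IRQ 32

    static volatile bool linux_enabled;     /* the software "interrupt flag" Linux sets and clears */
    static volatile bool pending[MAX_IRQ];  /* interrupts held back while Linux has them disabled  */

    extern bool irq_is_real_time(int irq);  /* stand-in: does a real-time task own this interrupt? */
    extern void real_time_handler(int irq); /* stand-in: the RT kernel's own handler for the IRQ   */
    extern void deliver_to_linux(int irq);  /* stand-in: invoke the ordinary Linux handler         */

    /* Every hardware interrupt is vectored here first, ahead of the Linux kernel. */
    void rt_intercept(int irq)
    {
        if (irq_is_real_time(irq)) {
            real_time_handler(irq);         /* serviced immediately, so deadlines are protected */
        } else if (linux_enabled) {
            deliver_to_linux(irq);          /* Linux has interrupts "enabled": relay at once */
        } else {
            pending[irq] = true;            /* hold it until Linux sets the flag again */
        }
    }

    /* Called when Linux executes what it believes is "enable interrupts". */
    void rt_linux_enable(void)
    {
        linux_enabled = true;
        for (int irq = 0; irq < MAX_IRQ; irq++) {
            if (pending[irq]) {
                pending[irq] = false;
                deliver_to_linux(irq);      /* flush everything queued while Linux was disabled */
            }
        }
    }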
Indeed, the real-time part of the application system is written as one or more loadable kernel
modules that run in kernel space. Tasks in each module may have their own scheduler; the current
version, however, provides both a rate-monotonic (RM) scheduler and an EDF scheduler.
More details on this topic, with a figure, are given on the Support Material at www.routledge.com/9781032467238.
10.9.2 LYNXOS
The current version, LynxOS 3.0, has been upgraded from its initial monolithic design to today's
microkernel design, the core of which mainly provides the essential services of an operating system,
such as scheduling, interrupt dispatch, and synchronization, while the other services are provided
by lightweight kernel service modules, often called kernel plug-ins (KPIs). With KPIs, the system
can be configured to support I/O (devices) and file systems, TCP/IP streams, sockets, and so on.
Consequently, it functions as a multipurpose UNIX operating system, as its earlier versions do.
KPIs are truly multithreaded: each KPI can create as many threads as needed to execute its
routines (responsibilities). In this OS, there is no context switch when sending a message (e.g. RFS)
to a KPI, and moreover, inter-KPI communication needs only a few instructions.
One of the salient features of LynxOS is that it can be configured as a self-hosted system equipped
with tools such as compilers, debuggers, and performance profilers. This means that in such
a system, embedded (real-time) applications can be developed using the tools on the same system on
which they are to be deployed and run. Moreover, the system provides adequate memory protection
mechanisms through a hardware memory management unit (MMU) to protect the operating system and
critically important applications from untrustworthy ones. In addition, it also offers demand paging
to achieve optimal memory usage while handling the large memory demands issued by applications.
Application threads (and processes) in LynxOS make I/O requests through system calls such
as open(), close(), read(), write(), and select(), in the same fashion as in traditional UNIX.
Moreover, each I/O request is sent by the kernel directly to a device driver of the respective I/O
device. The device drivers in LynxOS follow a split interrupt-handling strategy. Each driver
contains: (i) an interrupt handler that carries out the first step of interrupt handling at an interrupt
request priority and (ii) a kernel thread that shares the same address space with the kernel but is
separate from the kernel. If the interrupt handler does not complete the processing of an interrupt, it
sets an asynchronous system trap to interrupt the kernel. When the kernel can respond to this (soft-
ware) interrupt (that is, when the kernel is in a preemptable state), it schedules an instance of the
kernel thread at the priority of the thread that opened the interrupting device. When
the kernel thread executes, it continues interrupt handling and re-enables the interrupt when it com-
pletes. LynxOS calls this mechanism priority tracking, and LynxOS holds a patent for this scheme.
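The split, and the priority-tracking idea, can be pictured with the following sketch (hypothetical names, not LynxOS code): the first-level handler does only the minimum work at interrupt-request priority and hands the rest over to a kernel thread scheduled at the priority of the thread that opened the device.

    #include <stdbool.h>

    struct device {
        int  opener_priority;   /* priority of the application thread that opened the device   */
        bool work_pending;      /* set by the first-level handler, cleared by the kernel thread */
    };

    void kernel_thread_body(struct device *d);                       /* second-level handler */
    extern void ack_hardware(struct device *d);                      /* stand-in: quiet the device */
    extern void schedule_kernel_thread(void (*fn)(struct device *),
                                       struct device *d, int prio);  /* stand-in: run fn as a kernel
                                                                        thread at priority 'prio' */

    /* Step (i): runs at interrupt-request priority and is kept as short as possible. */
    void first_level_handler(struct device *d)
    {
        ack_hardware(d);
        d->work_pending = true;
        /* "Priority tracking": the remaining work is scheduled at the priority of the
           thread that opened the device, so it competes like any other thread. */
        schedule_kernel_thread(kernel_thread_body, d, d->opener_priority);
    }

    /* Step (ii): executes later as an ordinary, preemptable kernel thread. */
    void kernel_thread_body(struct device *d)
    {
        if (d->work_pending) {
            d->work_pending = false;
            /* ... complete the transfer and wake the waiting application thread ... */
        }
        /* finally re-enable the device's interrupt */
    }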
10.9.3 PSOSYSTEM
on a uniprocessor system and provides each task with the choice of either preemptive priority-driven
or time-driven scheduling. In addition, pSOSystem 2.5 and higher versions offer priority inheri-
tance and a priority-ceiling protocol. pSOS+m (Motorola) extends the pSOS+ feature set to
operate seamlessly across multiple, tightly coupled, or distributed processors. pSOS+m has the
same API functions as pSOS+, as well as functions for interprocess communication and synchro-
nization. The most recent release offers a POSIX real-time extension-compliant layer. Additional
optional components provide a TCP/IP protocol stack and target- and host-based debugging tools.
A later release, pSOSystem 3.0, includes many other key technical innovations.
A brief discussion of this topic, with a figure, is given on the Support Material at www.routledge.com/9781032467238.
In addition, VxWorks provides the VxWorks shell, which is essentially a command-line interface
that allows one to interact directly with VxWorks through the use of respective commands. One can
then use commands to load programs. When VxWorks is booted over the network, an automatic
network file system entry is created based on the boot parameters. Last but not least, since VxWorks
performs load-time linking (dynamic linking), it must maintain a symbol table. A symbol in this
context is nothing but a named value.
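Conceptually (a toy illustration, not the actual VxWorks data structure), such a symbol table is simply a mapping from names to values, usually addresses, that is consulted whenever a newly loaded module refers to a symbol by name:

    #include <string.h>
    #include <stddef.h>

    struct symbol {
        const char *name;    /* e.g. the name of a global function or variable */
        void       *value;   /* usually the address at which the name was linked */
    };

    static struct symbol symtab[256];
    static size_t nsyms;

    /* Record a name/value pair as a module is loaded and linked. */
    void sym_add(const char *name, void *value)
    {
        symtab[nsyms].name  = name;
        symtab[nsyms].value = value;
        nsyms++;
    }

    /* Resolve a name referenced by a module being loaded; NULL if undefined. */
    void *sym_find(const char *name)
    {
        for (size_t i = 0; i < nsyms; i++)
            if (strcmp(symtab[i].name, name) == 0)
                return symtab[i].value;
        return NULL;
    }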
A brief description of the problem with Pathfinder and its solution is given on the Support
Material at www.routledge.com/9781032467238.
SUMMARY
This chapter demonstrates the typical characteristics of real-time applications handled by real-time
systems, which are managed by real-time operating systems. We first describe in brief the differ-
ent issues involved with real-time systems and then show how these issues are negotiated by the
different components of an RTOS. The timing constraints specified by jobs or tasks can be expressed
in terms of response time, defined as the length of time from the release time of a job to the
instant when it completes. The timing constraint of a real-time task can be hard or soft, depending
on how strictly the timing constraint must be obeyed (hard) or not (soft). Based on a set of basic
parameters, including timing constraints, tasks can be categorized as periodic, aperiodic,
and sporadic. The different issues closely associated with real-time systems are mainly
architectural aspects, resource management, and software features, including real-time lan-
guages and real-time databases. These issues are ultimately negotiated by RTOSs through
some basic characteristics and requirements, along with the features met by their various
fundamental components, including threads and their different types. The kernel design of an
RTOS is the most critical part and offers some prime services, namely interrupt and system-call
handling, timer services, and scheduling. Different types of scheduling mechanisms, both static and
dynamic, based on numerous approaches, mainly clock-driven and priority-driven, are described,
each of which takes numerous forms to meet certain predefined objectives. Linux's scheduling
mechanism is described as a representative case study. The most important communication and
synchronization issues in RTOSs are described in brief, with their different related aspects. The
critical priority inversion problem and its ill effects are described, with a real-life example that
happened with the Mars Pathfinder. Priority inheritance and priority ceiling are explained, along
with respective comparisons. Lastly, several studies relating to the practical implementation of
RTOSs are described, mainly with their salient features on different platforms, such as Linux,
KURT, RT Linux, LynxOS, pSOSystem, and VxWorks, the last used in the Mars Pathfinder spacecraft.
EXERCISES
1. How does a real-time application differ from a non-real-time one? Define real-time
computing. State the features that make it different from conventional computing.
2. State and explain the differences between hard and soft real-time tasks. Enumerate the
differences that exist between periodic, aperiodic, and sporadic real-time tasks.
3. State and briefly explain the major design issues involved in a representative real-time
system.
4. State the distinctive features that an operating system must possess to be a real-time oper-
ating system.
5. What are the design philosophies of a real-time operating system?
6. State the basic components of the kernel of a representative real-time system.
7. “Interrupts play a vital role in the working of a real-time operating system”: What are the
different types of interrupts present in a representative real-time operating system? What
are the roles played by these interrupts, and how do they work?
8. “The scheduler is commonly described as the heart of a real-time system kernel”.
Justify. Explain the fundamental steps followed by a basic scheduler of a representa-
tive RTOS.
9. State the notable features of a real-time scheduling algorithm. State the metrics that are
used as parameters to measure the performance of scheduling algorithms.
10. Briefly define the different classes of real-time scheduling algorithms and how they differ
from one another. What are the pieces of information about a task (or a job) that might be
useful in real-time scheduling?
11. Compare and contrast offline and online scheduling when applied to hard real-time tasks
(or jobs).
12. Discuss the basic principles and the working mechanism of a priority-driven scheduler.
13. What are the essential requirements of a clock-driven approach in scheduling? State and
explain at least one method that belongs to this category of scheduling.
14. “Scheduling carried out using a priority-driven approach is often called greedy scheduling
as well as list scheduling”. Explain.
15. Priority-based scheduling can be implemented in both a preemptive and a non-preemptive
manner. Discuss the relative merits and drawbacks of these two different
approaches.
16. What are the relative advantages and disadvantages observed between fixed-priority and
dynamic-priority approaches in scheduling of real-time tasks (or jobs)?
17. State and explain with a suitable example the mechanism followed by a rate-monotonic
scheduling algorithm. Enumerate its merits and drawbacks with respect to the situations in
which it is employed.
18. Why is a dynamic-priority scheme preferred in a priority-driven approach to real-time task
scheduling?
19. State and explain with a suitable example the mechanism followed by an earliest-deadline-
first scheduling algorithm. Enumerate its merits and drawbacks with respect to the situa-
tions in which it is employed.
20. Explain why EDF scheduling is called a task-level dynamic-priority algorithm and, on the
other hand, can also be called a job-level fixed-priority algorithm.
21. Consider a set of five aperiodic tasks with the execution profiles given here:

        Task    Arrival Time    Execution Time    Deadline
        A            10               20             100
        B            20               20              30
        C            40               20              60
        D            50               20              80
        E            60               20              70
Develop scheduling diagrams similar to those in Figure 10.8 (given on the Support
Material at www.routledge.com/9781032467238) for this set of tasks.
22. Explain the principle and the mechanisms used by the least-slack-time-first, sometimes
also called minimum-laxity-first, algorithm. Why is it considered superior in a dynamic-
priority approach to its counterpart, EDF scheduling? What are the major shortcomings of
the LST algorithm?
23. Consider a system with three processors P1, P2, and P3 on which five periodic tasks X, Y, Z,
U, and V execute. The periods of X, Y, and Z are 2, and their execution times are equal to 1.
The periods of U and V are 8, and their execution times are 6. The phase of every task is
assumed to be 0. The relative deadline of every task is equal to its period.
a. Show that if the tasks are scheduled dynamically according to the LST algorithm on
three processors, some tasks in the system cannot meet their deadlines.
b. Find a feasible schedule of the five tasks on three processors.
c. Parts (a) and (b) indicate that the LST algorithm is not optimal for scheduling on more
than one processor. However, when all the jobs have the same deadline or the same
release time, the LST algorithm is optimal. Justify this.
24. What is meant by priority inversion? State the adverse impact of this phenomenon on
priority-driven scheduling mechanisms. What are some methods by which the ill effects
of this phenomenon can be avoided?
25. What is meant by priority inheritance? What are the basic rules that must be followed by a
priority-inheritance protocol? Explain with a suitable example how the priority-inheritance
protocol resolves the problem of unbounded priority inversion. What are the limitations of
the priority-inheritance protocol?
26. What is meant by a priority ceiling? What are the basic rules that must be followed by the
priority-ceiling protocol? How does the priority-ceiling approach resolve the shortcomings
of the priority-inheritance protocol?
27. Define a clock. How is a timer implemented in an RTOS? Describe the roles played by clocks
and timers in the proper working of an RTOS.
28. In which ways is synchronization between tasks realized in an RTOS?
29. Describe the message-passing scheme used as a communication mechanism between tasks
in an RTOS.
30. Define a signal. How are signals realized in systems? Explain how signals are actively
involved in the responsive mechanisms of a real-time operating system.
31. Describe the basic principles followed in the management of memory in an RTOS. Describe
the mechanisms used for memory allocation in a running RTOS.