
11

CASE STUDY 2: WINDOWS 8

Windows is a modern operating system that runs on consumer PCs, laptops, tablets and phones as well as business desktop PCs and enterprise servers. Windows is also the operating system used in Microsoft's Xbox gaming system and Azure cloud computing infrastructure. The most recent version is Windows 8.1. In this chapter we will examine various aspects of Windows 8, starting with a brief history, then moving on to its architecture. After this we will look at processes, memory management, caching, I/O, the file system, power management, and finally, security.

11.1 HISTORY OF WINDOWS THROUGH WINDOWS 8.1

Microsoft's development of the Windows operating system for PC-based computers as well as servers can be divided into four eras: MS-DOS, MS-DOS-based Windows, NT-based Windows, and Modern Windows. Technically, each of these systems is substantially different from the others. Each was dominant during different decades in the history of the personal computer. Figure 11-1 shows the dates of the major Microsoft operating system releases for desktop computers. Below we will briefly sketch each of the eras shown in the table.


Year | MS-DOS | MS-DOS-based Windows | NT-based Windows | Modern Windows | Notes
1981 | 1.0    |        |        |     | Initial release for IBM PC
1983 | 2.0    |        |        |     | Support for PC/XT
1984 | 3.0    |        |        |     | Support for PC/AT
1990 |        | 3.0    |        |     | Ten million copies in 2 years
1991 | 5.0    |        |        |     | Added memory management
1992 |        | 3.1    |        |     | Ran only on 286 and later
1993 |        |        | NT 3.1 |     |
1995 | 7.0    | 95     |        |     | MS-DOS embedded in Win 95
1996 |        |        | NT 4.0 |     |
1998 |        | 98     |        |     |
2000 | 8.0    | Me     | 2000   |     | Win Me was inferior to Win 98
2001 |        |        | XP     |     | Replaced Win 98
2006 |        |        | Vista  |     | Vista could not supplant XP
2009 |        |        | 7      |     | Significantly improved upon Vista
2012 |        |        |        | 8   | First Modern version
2013 |        |        |        | 8.1 | Microsoft moved to rapid releases

Figure 11-1. Major releases in the history of Microsoft operating systems for
desktop PCs.

11.1.1 1980s: MS-DOS

In the early 1980s IBM, at the time the biggest and most powerful computer
company in the world, was developing a personal computer based on the Intel 8088
microprocessor. Since the mid-1970s, Microsoft had become the leading provider
of the BASIC programming language for 8-bit microcomputers based on the 8080
and Z-80. When IBM approached Microsoft about licensing BASIC for the new
IBM PC, Microsoft readily agreed and suggested that IBM contact Digital Re-
search to license its CP/M operating system, since Microsoft was not then in the
operating system business. IBM did that, but the president of Digital Research,
Gary Kildall, was too busy to meet with IBM. This was probably the worst blun-
der in all of business history, since had he licensed CP/M to IBM, Kildall would
probably have become the richest man on the planet. Rebuffed by Kildall, IBM
came back to Bill Gates, the cofounder of Microsoft, and asked for help again.
Within a short time, Microsoft bought a CP/M clone from a local company, Seattle
Computer Products, ported it to the IBM PC, and licensed it to IBM. It was then
renamed MS-DOS 1.0 (MicroSoft Disk Operating System) and shipped with the
first IBM PC in 1981.

MS-DOS was a 16-bit real-mode, single-user, command-line-oriented operating system consisting of 8 KB of memory resident code. Over the next decade, both the PC and MS-DOS continued to evolve, adding more features and capabilities. By 1986, when IBM built the PC/AT based on the Intel 286, MS-DOS had grown to be 36 KB, but it continued to be a command-line-oriented, one-application-at-a-time, operating system.

11.1.2 1990s: MS-DOS-based Windows

Inspired by the graphical user interface of a system developed by Doug Engelbart at Stanford Research Institute and later improved at Xerox PARC, and their
commercial progeny, the Apple Lisa and the Apple Macintosh, Microsoft decided
to give MS-DOS a graphical user interface that it called Windows. The first two
versions of Windows (1985 and 1987) were not very successful, due in part to the
limitations of the PC hardware available at the time. In 1990 Microsoft released
Windows 3.0 for the Intel 386, and sold over one million copies in six months.
Windows 3.0 was not a true operating system, but a graphical environment
built on top of MS-DOS, which was still in control of the machine and the file sys-
tem. All programs ran in the same address space and a bug in any one of them
could bring the whole system to a frustrating halt.
In August 1995, Windows 95 was released. It contained many of the features
of a full-blown operating system, including virtual memory, process management,
and multiprogramming, and introduced 32-bit programming interfaces. However,
it still lacked security, and provided poor isolation between applications and the
operating system. Thus, the problems with instability continued, even with the
subsequent releases of Windows 98 and Windows Me, where MS-DOS was still
there running 16-bit assembly code in the heart of the Windows operating system.

11.1.3 2000s: NT-based Windows

By the end of the 1980s, Microsoft realized that continuing to evolve an operating system with MS-DOS at its center was not the best way to go. PC hardware was
continuing to increase in speed and capability and ultimately the PC market would
collide with the desktop, workstation, and enterprise-server computing markets,
where UNIX was the dominant operating system. Microsoft was also concerned
that the Intel microprocessor family might not continue to be competitive, as it was
already being challenged by RISC architectures. To address these issues, Micro-
soft recruited a group of engineers from DEC (Digital Equipment Corporation) led
by Dave Cutler, one of the key designers of DEC’s VMS operating system (among
others). Cutler was chartered to develop a brand-new 32-bit operating system that
was intended to implement OS/2, the operating system API that Microsoft was
jointly developing with IBM at the time. The original design documents by Cut-
ler’s team called the system NT OS/2.

Cutler’s system was called NT for New Technology (and also because the orig-
inal target processor was the new Intel 860, code-named the N10). NT was de-
signed to be portable across different processors and emphasized security and
reliability, as well as compatibility with the MS-DOS-based versions of Windows.
Cutler’s background at DEC shows in various places, with there being more than a
passing similarity between the design of NT and that of VMS and other operating
systems designed by Cutler, shown in Fig. 11-2.

Year | DEC operating system | Characteristics
1973 | RSX-11M              | 16-bit, multiuser, real-time, swapping
1978 | VAX/VMS              | 32-bit, virtual memory
1987 | VAXELN               | Real-time
1988 | PRISM/Mica           | Canceled in favor of MIPS/Ultrix

Figure 11-2. DEC operating systems developed by Dave Cutler.

Programmers familiar only with UNIX find the architecture of NT to be quite different. This is not just because of the influence of VMS, but also because of the differences in the computer systems that were common at the time of design.
UNIX was first designed in the 1970s for single-processor, 16-bit, tiny-memory,
swapping systems where the process was the unit of concurrency and composition,
and fork/exec were inexpensive operations (since swapping systems frequently
copy processes to disk anyway). NT was designed in the early 1990s, when multi-
processor, 32-bit, multimegabyte, virtual memory systems were common. In NT,
threads are the units of concurrency, dynamic libraries are the units of composition,
and fork/exec are implemented by a single operation to create a new process and
run another program without first making a copy.
The first version of NT-based Windows (Windows NT 3.1) was released in
1993. It was called 3.1 to correspond with the then-current consumer Windows
3.1. The joint project with IBM had foundered, so though the OS/2 interfaces were
still supported, the primary interfaces were 32-bit extensions of the Windows APIs,
called Win32. Between the time NT was started and first shipped, Windows 3.0
had been released and had become extremely successful commercially. It too was
able to run Win32 programs, but using the Win32s compatibility library.
Like the first version of MS-DOS-based Windows, NT-based Windows was
not initially successful. NT required more memory, there were few 32-bit applica-
tions available, and incompatibilities with device drivers and applications caused
many customers to stick with MS-DOS-based Windows which Microsoft was still
improving, releasing Windows 95 in 1995. Windows 95 provided native 32-bit
programming interfaces like NT, but better compatibility with existing 16-bit soft-
ware and applications. Not surprisingly, NT’s early success was in the server mar-
ket, competing with VMS and NetWare.

NT did meet its portability goals, with additional releases in 1994 and 1995
adding support for (little-endian) MIPS and PowerPC architectures. The first
major upgrade to NT came with Windows NT 4.0 in 1996. This system had the
power, security, and reliability of NT, but also sported the same user interface as
the by-then very popular Windows 95.
Figure 11-3 shows the relationship of the Win32 API to Windows. Having a
common API across both the MS-DOS-based and NT-based Windows was impor-
tant to the success of NT.
This compatibility made it much easier for users to migrate from Windows 95
to NT, and the operating system became a strong player in the high-end desktop
market as well as servers. However, customers were not as willing to adopt other
processor architectures, and of the four architectures Windows NT 4.0 supported in
1996 (the DEC Alpha was added in that release), only the x86 (i.e., Pentium fam-
ily) was still actively supported by the time of the next major release, Windows
2000.

                    Win32 application program
            Win32 application programming interface
Win32s
Windows 3.0/3.1 | Windows 95/98/98SE/Me | Windows NT/2000/Vista/7 | Windows 8/8.1
Figure 11-3. The Win32 API allows programs to run on almost all versions of
Windows.

Windows 2000 represented a significant evolution for NT. The key technolo-
gies added were plug-and-play (for consumers who installed a new PCI card, elim-
inating the need to fiddle with jumpers), network directory services (for enterprise
customers), improved power management (for notebook computers), and an im-
proved GUI (for everyone).
The technical success of Windows 2000 led Microsoft to push toward the dep-
recation of Windows 98 by enhancing the application and device compatibility of
the next NT release, Windows XP. Windows XP included a friendlier new look-
and-feel to the graphical interface, bolstering Microsoft’s strategy of hooking con-
sumers and reaping the benefit as they pressured their employers to adopt systems
with which they were already familiar. The strategy was overwhelmingly suc-
cessful, with Windows XP being installed on hundreds of millions of PCs over its
first few years, allowing Microsoft to achieve its goal of effectively ending the era
of MS-DOS-based Windows.

Microsoft followed up Windows XP by embarking on an ambitious release to kindle renewed excitement among PC consumers. The result, Windows Vista,
was completed in late 2006, more than five years after Windows XP shipped. Win-
dows Vista boasted yet another redesign of the graphical interface, and new securi-
ty features under the covers. Most of the changes were in customer-visible experi-
ences and capabilities. The technologies under the covers of the system improved
incrementally, with much clean-up of the code and many improvements in per-
formance, scalability, and reliability. The server version of Vista (Windows Server
2008) was delivered about a year after the consumer version. It shares, with Vista,
the same core system components, such as the kernel, drivers, and low-level librar-
ies and programs.
The human story of the early development of NT is related in the book Show-
stopper (Zachary, 1994). The book tells a lot about the key people involved and
the difficulties of undertaking such an ambitious software development project.

11.1.4 Windows Vista

The release of Windows Vista culminated Microsoft's most extensive operating system project to date. The initial plans were so ambitious that a couple of years into its development Vista had to be restarted with a smaller scope. Plans to rely heavily on Microsoft's type-safe, garbage-collected .NET language C# were shelved, as were some significant features such as the WinFS unified storage system for searching and organizing data from many different sources. The size of the full operating system is staggering. The original NT release of 3 million lines of C/C++ had grown to 16 million in NT 4, 30 million in 2000, and 50 million in XP. It is over 70 million lines in Vista and more in Windows 7 and 8.
Much of the size is due to Microsoft’s emphasis on adding many new features
to its products in every release. In the main system32 directory, there are 1600
DLLs (Dynamic Link Libraries) and 400 EXEs (Executables), and that does not
include the other directories containing the myriad of applets included with the op-
erating system that allow users to surf the Web, play music and video, send email,
scan documents, organize photos, and even make movies. Because Microsoft
wants customers to switch to new versions, it maintains compatibility by generally
keeping all the features, APIs, applets (small applications), etc., from the previous
version. Few things ever get deleted. The result is that Windows was growing dra-
matically release to release. Windows’ distribution media had moved from floppy,
to CD, and with Windows Vista, to DVD. Technology had been keeping up, how-
ever, and faster processors and larger memories made it possible for computers to
get faster despite all this bloat.
Unfortunately for Microsoft, Windows Vista was released at a time when cus-
tomers were becoming enthralled with inexpensive computers, such as low-end
notebooks and netbook computers. These machines used slower processors to
save cost and battery life, and in their earlier generations limited memory sizes. At
the same time, processor performance ceased to improve at the same rate it had
previously, due to the difficulties in dissipating the heat created by ever-increasing
clock speeds. Moore’s Law continued to hold, but the additional transistors were
going into new features and multiple processors rather than improvements in sin-
gle-processor performance. All the bloat in Windows Vista meant that it per-
formed poorly on these computers relative to Windows XP, and the release was
never widely accepted.
The issues with Windows Vista were addressed in the subsequent release,
Windows 7. Microsoft invested heavily in testing and performance automation,
new telemetry technology, and extensively strengthened the teams charged with
improving performance, reliability, and security. Though Windows 7 had rela-
tively few functional changes compared to Windows Vista, it was better engineered
and more efficient. Windows 7 quickly supplanted Vista and ultimately Windows
XP to be the most popular version of Windows to date.

11.1.5 2010s: Modern Windows

By the time Windows 7 shipped, the computing industry once again began to
change dramatically. The success of the Apple iPhone as a portable computing de-
vice, and the advent of the Apple iPad, had heralded a sea-change which led to the
dominance of lower-cost Android tablets and phones, much as Microsoft had dom-
inated the desktop in the first three decades of personal computing. Small,
portable, yet powerful devices and ubiquitous fast networks were creating a world
where mobile computing and network-based services were becoming the dominant
paradigm. The old world of portable computers was replaced by machines with
small screens that ran applications readily downloadable from the Web. These ap-
plications were not the traditional variety, like word processing, spreadsheets, and
connecting to corporate servers. Instead, they provided access to services like Web
search, social networking, Wikipedia, streaming music and video, shopping, and
personal navigation. The business models for computing were also changing, with
advertising opportunities becoming the largest economic force behind computing.
Microsoft began a process to redesign itself as a devices and services company
in order to better compete with Google and Apple. It needed an operating system
it could deploy across a wide spectrum of devices: phones, tablets, game consoles,
laptops, desktops, servers, and the cloud. Windows thus underwent an even bigger
evolution than with Windows Vista, resulting in Windows 8. However, this time
Microsoft applied the lessons from Windows 7 to create a well-engineered, per-
formant product with less bloat.
Windows 8 built on the modular MinWin approach Microsoft used in Win-
dows 7 to produce a small operating system core that could be extended onto dif-
ferent devices. The goal was for each of the operating systems for specific devices
to be built by extending this core with new user interfaces and features, yet provide
as common an experience for users as possible. This approach was successfully
applied to Windows Phone 8, which shares most of the core binaries with desktop
and server Windows. Support of phones and tablets by Windows required support
for the popular ARM architecture, as well as new Intel processors targeting those
devices. What makes Windows 8 part of the Modern Windows era are the funda-
mental changes in the programming models, as we will examine in the next sec-
tion.
Windows 8 was not received to universal acclaim. In particular, the lack of the
Start Button on the taskbar (and its associated menu) was viewed by many users as
a huge mistake. Others objected to using a tablet-like interface on a desktop ma-
chine with a large monitor. Microsoft responded to this and other criticisms on
May 14, 2013 by releasing an update called Windows 8.1. This version fixed
these problems while at the same time introducing a host of new features, such as
better cloud integration, as well as a number of new programs. Although we will
stick to the more generic name of ‘‘Windows 8’’ in this chapter, in fact, everything
in it is a description of how Windows 8.1 works.

11.2 PROGRAMMING WINDOWS

It is now time to start our technical study of Windows. Before getting into the
details of the internal structure, however, we will take a look at the native NT API
for system calls, the Win32 programming subsystem introduced as part of NT-
based Windows, and the Modern WinRT programming environment introduced
with Windows 8.
Figure 11-4 shows the layers of the Windows operating system. Beneath the
applet and GUI layers of Windows are the programming interfaces that applica-
tions build on. As in most operating systems, these consist largely of code libraries
(DLLs) to which programs dynamically link for access to operating system fea-
tures. Windows also includes a number of programming interfaces which are im-
plemented as services that run as separate processes. Applications communicate
with user-mode services through RPCs (Remote-Procedure-Calls).
The core of the NT operating system is the NTOS kernel-mode program
(ntoskrnl.exe), which provides the traditional system-call interfaces upon which the
rest of the operating system is built. In Windows, only programmers at Microsoft
write to the system-call layer. The published user-mode interfaces all belong to
operating system personalities that are implemented using subsystems that run on
top of the NTOS layers.
Originally NT supported three personalities: OS/2, POSIX and Win32. OS/2
was discarded in Windows XP. Support for POSIX was finally removed in Win-
dows 8.1. Today all Windows applications are written using APIs that are built on
top of the Win32 subsystem, such as the WinFX API in the .NET programming
model. The WinFX API includes many of the features of Win32, and in fact many

User mode:
  Modern Windows Apps:   Modern app mgr; WinRT: .NET/C++, WWA/JS;
                         AppContainer; Process lifetime mgr
  Windows Services:      Modern broker processes; NT services: smss, lsass,
                         COM services, winlogon; Win32 subsystem process (csrss.exe)
  Windows Desktop Apps:  Desktop mgr (explorer); [.NET: base classes, GC];
                         GUI (shell32, user32, gdi32); Dynamic libraries (ole, rpc);
                         Subsystem API (kernel32)
  Native NT API, C/C++ run-time (ntdll.dll)
Kernel mode:
  NTOS kernel layer (ntoskrnl.exe)
  Drivers: devices, file systems, network | NTOS executive layer (ntoskrnl.exe) | GUI driver (Win32k.sys)
  Hardware abstraction layer (hal.dll)
  Hypervisor (hvix, hvax)

Figure 11-4. The programming layers in Modern Windows.

of the functions in the WinFX Base Class Library are simply wrappers around
Win32 APIs. The advantages of WinFX have to do with the richness of the object
types supported, the simplified consistent interfaces, and use of the .NET Common
Language Run-time (CLR), including garbage collection (GC).
The Modern versions of Windows begin with Windows 8, which introduced
the new WinRT set of APIs. Windows 8 deprecated the traditional Win32 desktop
experience in favor of running a single application at a time on the full screen with
an emphasis on touch over use of the mouse. Microsoft saw this as a necessary
step as part of the transition to a single operating system that would work with
phones, tablets, and game consoles, as well as traditional PCs and servers. The
GUI changes necessary to support this new model require that applications be
rewritten to a new API model, the Modern Software Development Kit, which in-
cludes the WinRT APIs. The WinRT APIs are carefully curated to produce a more
consistent set of behaviors and interfaces. These APIs have versions available for
C++ and .NET programs but also JavaScript for applications hosted in a brow-
ser-like environment wwa.exe (Windows Web Application).
In addition to WinRT APIs, many of the existing Win32 APIs were included in
the MSDK (Microsoft Development Kit). The initially available WinRT APIs
were not sufficient to write many programs. Some of the included Win32 APIs
were chosen to limit the behavior of applications. For example, applications can-
not create threads directly with the MSDK, but must rely on the Win32 thread pool
to run concurrent activities within a process. This is because Modern Windows is
shifting programmers away from a threading model to a task model in order to dis-
entangle resource management (priorities, processor affinities) from the pro-
gramming model (specifying concurrent activities). Other omitted Win32 APIs in-
clude most of the Win32 virtual memory APIs. Programmers are expected to rely
on the Win32 heap-management APIs rather than attempt to manage memory re-
sources directly. APIs that were already deprecated in Win32 were also omitted
from the MSDK, as were all ANSI APIs. The MSDK APIs are Unicode only.
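The task model the MSDK pushes programmers toward can be sketched with a toy thread pool. The sketch below uses POSIX threads rather than the actual Win32 thread-pool API, and all names (submit, run_demo, and so on) are invented for illustration: a fixed set of pool-owned worker threads drains a queue of submitted tasks, so application code specifies the concurrent activities while the pool manages the threads themselves.

```c
#include <pthread.h>

#define MAX_TASKS   64
#define NUM_WORKERS 4

typedef void (*task_fn)(void *);

static struct { task_fn fn; void *arg; } queue[MAX_TASKS];
static int head, tail, done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* Submit a task to the pool; the caller never creates a thread. */
void submit(task_fn fn, void *arg)
{
    pthread_mutex_lock(&lock);
    queue[tail].fn = fn;
    queue[tail].arg = arg;
    tail = (tail + 1) % MAX_TASKS;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

/* Pool-owned worker: runs queued tasks until told to shut down. */
static void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail && !done)
            pthread_cond_wait(&nonempty, &lock);
        if (head == tail && done) {
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        task_fn fn = queue[head].fn;
        void *arg = queue[head].arg;
        head = (head + 1) % MAX_TASKS;
        pthread_mutex_unlock(&lock);
        fn(arg);                       /* the task runs on a pool thread */
    }
}

static void add_one(void *p)
{
    __sync_fetch_and_add((int *)p, 1); /* atomic: tasks run concurrently */
}

/* Start the pool, submit ten tasks, drain the queue, return the result. */
int run_demo(void)
{
    pthread_t pool[NUM_WORKERS];
    int counter = 0;
    head = tail = done = 0;
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&pool[i], NULL, worker, NULL);
    for (int i = 0; i < 10; i++)
        submit(add_one, &counter);
    pthread_mutex_lock(&lock);
    done = 1;                          /* no more tasks; workers exit when drained */
    pthread_cond_broadcast(&nonempty);
    pthread_mutex_unlock(&lock);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(pool[i], NULL);
    return counter;
}
```

Note that the application never sets priorities or affinities for the workers; in the MSDK model those resource-management decisions belong to the pool, not the task.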
The choice of the word Modern to describe a product such as Windows is sur-
prising. Perhaps if a new generation Windows is here ten years from now, it will
be referred to as post-Modern Windows.
Unlike traditional Win32 processes, the processes running modern applications
have their lifetimes managed by the operating system. When a user switches away
from an application, the system gives it a couple of seconds to save its state and
then ceases to give it further processor resources until the user switches back to the
application. If the system runs low on resources, the operating system may termi-
nate the application’s processes without the application ever running again. When
the user switches back to the application at some time in the future, it will be re-
started by the operating system. Applications that need to run tasks in the back-
ground must specifically arrange to do so using a new set of WinRT APIs. Back-
ground activity is carefully managed by the system to improve battery life and pre-
vent interference with the foreground application the user is currently using. These
changes were made to make Windows function better on mobile devices.
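The lifetime policy just described can be modeled as a small state machine. This is only an illustrative sketch; the state names and transition functions below are invented and do not correspond to actual Windows internals.

```c
#include <assert.h>

typedef enum { APP_RUNNING, APP_SUSPENDED, APP_TERMINATED } app_state;

typedef struct {
    app_state state;
    int saved;              /* has the app persisted its state? */
} app;

/* User switches away: the app gets a moment to save its state,
   then receives no further processor time. */
void on_switch_away(app *a)
{
    a->saved = 1;
    a->state = APP_SUSPENDED;
}

/* Under resource pressure the OS may terminate a suspended app
   without the application ever running again. */
void on_low_resources(app *a)
{
    if (a->state == APP_SUSPENDED)
        a->state = APP_TERMINATED;
}

/* User switches back: resume a suspended app, or restart a
   terminated one from the state it saved earlier. */
void on_switch_back(app *a)
{
    if (a->state == APP_TERMINATED)
        assert(a->saved);   /* restart relies on the earlier save */
    a->state = APP_RUNNING;
}
```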
In the Win32 desktop world applications are deployed by running an installer
that is part of the application. Modern applications have to be installed using Win-
dows’ AppStore program, which will deploy only applications that were uploaded
into the Microsoft on-line store by the developer. Microsoft is following the same
successful model introduced by Apple and adopted by Android. Microsoft will not
accept applications into the store unless they pass verification which, among other
checks, ensures that the application is using only APIs available in the MSDK.
When a modern application is running, it always executes in a sandbox called
an AppContainer. Sandboxing process execution is a security technique for iso-
lating less trusted code so that it cannot freely tamper with the system or user data.
The Windows AppContainer treats each application as a distinct user, and uses
Windows security facilities to keep the application from accessing arbitrary system
resources. When an application does need access to a system resource, there are
WinRT APIs that communicate to broker processes which do have access to more
of the system, such as a user’s files.
As shown in Fig. 11-5, NT subsystems are constructed out of four compo-
nents: a subsystem process, a set of libraries, hooks in CreateProcess, and support
in the kernel. A subsystem process is really just a service. The only special prop-
erty is that it is started by the smss.exe (session manager) program—the initial
user-mode program started by NT—in response to a request from CreateProcess
in Win32 or the corresponding API in a different subsystem. Although Win32 is
the only remaining subsystem supported, Windows still maintains the subsystem
model, including the csrss.exe Win32 subsystem process.

Program process                                   Subsystem process
  Subsystem libraries
  Subsystem run-time library
    (CreateProcess hook)
  Native NT API, C/C++ run-time
------------------- user mode / kernel mode -------------------
Native NT system services | Local procedure call (LPC) | Subsystem kernel support
NTOS Executive
Figure 11-5. The components used to build NT subsystems.

The set of libraries both implements higher-level operating-system functions specific to the subsystem and contains the stub routines which communicate between processes using the subsystem (shown on the left) and the subsystem process itself (shown on the right). Calls to the subsystem process normally take place using the kernel-mode LPC (Local Procedure Call) facilities, which implement cross-process procedure calls.
The hook in Win32 CreateProcess detects which subsystem each program re-
quires by looking at the binary image. It then asks smss.exe to start the subsystem
process (if it is not already running). The subsystem process then takes over
responsibility for loading the program.
The NT kernel was designed to have a lot of general-purpose facilities that can
be used for writing operating-system-specific subsystems. But there is also special
code that must be added to correctly implement each subsystem. As examples, the
native NtCreateProcess system call implements process duplication in support of the POSIX fork system call, and the kernel implements a particular kind of string table for Win32 (called atoms) which allows read-only strings to be efficiently shared across processes.
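The idea behind atoms can be sketched with a toy interning table: a read-only string is stored once and thereafter named by a small integer, so equal strings compare as equal integers. The sketch below is a single-process model with invented names; the real kernel atom table is shared across processes and considerably more elaborate.

```c
#include <string.h>

/* Toy string-interning table in the spirit of Win32 atoms. */
#define MAX_ATOMS 128
#define MAX_LEN   64

static char table[MAX_ATOMS][MAX_LEN];  /* zero-initialized storage */
static int natoms;

/* Return an existing atom for s, or add s and return a new atom. */
int add_atom(const char *s)
{
    for (int i = 0; i < natoms; i++)
        if (strcmp(table[i], s) == 0)
            return i;                   /* equal strings share one atom */
    strncpy(table[natoms], s, MAX_LEN - 1);
    return natoms++;
}

/* Map an atom back to its read-only string, or NULL if invalid. */
const char *get_atom_name(int atom)
{
    return (atom >= 0 && atom < natoms) ? table[atom] : 0;
}
```

Once interned, a string like a window-class name can be passed between clients as a small integer instead of being copied, which is the efficiency the kernel's atom table provides to Win32.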
The subsystem processes are native NT programs which use the native system
calls provided by the NT kernel and core services, such as smss.exe and lsass.exe
(local security administration). The native system calls include cross-process facil-
ities to manage virtual addresses, threads, handles, and exceptions in the processes
created to run programs written to use a particular subsystem.

11.2.1 The Native NT Application Programming Interface

Like all other operating systems, Windows has a set of system calls it can per-
form. In Windows, these are implemented in the NTOS executive layer that runs
in kernel mode. Microsoft has published very few of the details of these native
system calls. They are used internally by lower-level programs that ship as part of
the operating system (mainly services and the subsystems), as well as kernel-mode
device drivers. The native NT system calls do not really change very much from
release to release, but Microsoft chose not to make them public so that applications
written for Windows would be based on Win32 and thus more likely to work with
both the MS-DOS-based and NT-based Windows systems, since the Win32 API is
common to both.
Most of the native NT system calls operate on kernel-mode objects of one kind
or another, including files, processes, threads, pipes, semaphores, and so on. Fig-
ure 11-6 gives a list of some of the common categories of kernel-mode objects sup-
ported by the kernel in Windows. Later, when we discuss the object manager, we
will provide further details on the specific object types.

Object category | Examples
Synchronization | Semaphores, mutexes, events, IPC ports, I/O completion queues
I/O             | Files, devices, drivers, timers
Program         | Jobs, processes, threads, sections, tokens
Win32 GUI       | Desktops, application callbacks

Figure 11-6. Common categories of kernel-mode object types.

Sometimes use of the term object regarding the data structures manipulated by
the operating system can be confusing because it is mistaken for object-oriented.
Operating system objects do provide data hiding and abstraction, but they lack
some of the most basic properties of object-oriented systems such as inheritance
and polymorphism.
In the native NT API, calls are available to create new kernel-mode objects or
access existing ones. Every call creating or opening an object returns a result called
a handle to the caller. The handle can subsequently be used to perform operations
on the object. Handles are specific to the process that created them. In general
handles cannot be passed directly to another process and used to refer to the same
object. However, under certain circumstances, it is possible to duplicate a handle
into the handle table of other processes in a protected way, allowing processes to
share access to objects—even if the objects are not accessible in the namespace.
The process duplicating each handle must itself have handles for both the source
and target process.
Every object has a security descriptor associated with it, telling in detail who
may and may not perform what kinds of operations on the object based on the
access requested. When handles are duplicated between processes, new access
restrictions can be added that are specific to the duplicated handle. Thus, a process
can duplicate a read-write handle and turn it into a read-only version in the target
process.
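The per-process handle tables and this access-narrowing duplication can be illustrated with a toy model. The structures and function names below are invented; the real NT handle tables and NtDuplicateObject are far more involved, but the sketch shows the key property that duplication can narrow, never widen, the granted access mask.

```c
/* Toy model of per-process handle tables with access narrowing. */
#define ACCESS_READ  0x1u
#define ACCESS_WRITE 0x2u
#define MAX_HANDLES  16

typedef struct {
    int object_id;          /* stand-in for a kernel object */
    unsigned access;        /* rights granted through this handle */
    int used;
} handle_entry;

typedef struct { handle_entry table[MAX_HANDLES]; } process;

/* Create a handle to an object in process p's table. */
int open_object(process *p, int object_id, unsigned access)
{
    for (int h = 0; h < MAX_HANDLES; h++)
        if (!p->table[h].used) {
            p->table[h] = (handle_entry){ object_id, access, 1 };
            return h;       /* handles are indices private to p */
        }
    return -1;
}

/* Duplicate src's handle h into dst's table. The requested access
   is intersected with the original, so rights can only be dropped. */
int duplicate_handle(process *src, int h, process *dst, unsigned access)
{
    if (h < 0 || h >= MAX_HANDLES || !src->table[h].used)
        return -1;
    unsigned granted = src->table[h].access & access;
    return open_object(dst, src->table[h].object_id, granted);
}

int can_write(process *p, int h)
{
    return (p->table[h].access & ACCESS_WRITE) != 0;
}
```

In the sketch, duplicating a read-write handle while requesting only ACCESS_READ yields a handle in the target process that names the same object but cannot write through it, mirroring the read-only duplication described above.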
Not all system-created data structures are objects and not all objects are kernel-
mode objects. The only ones that are true kernel-mode objects are those that need
to be named, protected, or shared in some way. Usually, they represent some kind
of programming abstraction implemented in the kernel. Every kernel-mode object
has a system-defined type, has well-defined operations on it, and occupies storage
in kernel memory. Although user-mode programs can perform the operations (by
making system calls), they cannot get at the data directly.
Figure 11-7 shows a sampling of the native APIs, all of which use explicit
handles to manipulate kernel-mode objects such as processes, threads, IPC ports,
and sections (which are used to describe memory objects that can be mapped into
address spaces). NtCreateProcess returns a handle to a newly created process object,
representing an executing instance of the program represented by the SectionHandle.
DebugPortHandle is used to communicate with a debugger when giving it
control of the process after an exception (e.g., dividing by zero or accessing invalid
memory). ExceptPortHandle is used to communicate with a subsystem process
when errors occur and are not handled by an attached debugger.

NtCreateProcess(&ProcHandle, Access, SectionHandle, DebugPortHandle, ExceptPortHandle, ...)
NtCreateThread(&ThreadHandle, ProcHandle, Access, ThreadContext, CreateSuspended, ...)
NtAllocateVirtualMemory(ProcHandle, Addr, Size, Type, Protection, ...)
NtMapViewOfSection(SectHandle, ProcHandle, Addr, Size, Protection, ...)
NtReadVirtualMemory(ProcHandle, Addr, Size, ...)
NtWriteVirtualMemory(ProcHandle, Addr, Size, ...)
NtCreateFile(&FileHandle, FileNameDescriptor, Access, ...)
NtDuplicateObject(srcProcHandle, srcObjHandle, dstProcHandle, dstObjHandle, ...)

Figure 11-7. Examples of native NT API calls that use handles to manipulate ob-
jects across process boundaries.

NtCreateThread takes ProcHandle because it can create a thread in any process
for which the calling process has a handle (with sufficient access rights). Similarly,
NtAllocateVirtualMemory, NtMapViewOfSection, NtReadVirtualMemory, and
NtWriteVirtualMemory allow one process not only to operate on its own address
space, but also to allocate virtual addresses, map sections, and read or write virtual
memory in other processes. NtCreateFile is the native API call for creating a new
file or opening an existing one. NtDuplicateObject is the API call for duplicating
handles from one process to another.
Kernel-mode objects are, of course, not unique to Windows. UNIX systems
also support a variety of kernel-mode objects, such as files, network sockets, pipes,
devices, processes, and interprocess communication (IPC) facilities like shared
memory, message ports, semaphores, and I/O devices. In UNIX there are a variety
of ways of naming and accessing objects, such as file descriptors, process IDs, and
integer IDs for System V IPC objects, and i-nodes for devices. The implementation
of each class of UNIX objects is specific to the class. Files and sockets use dif-
ferent facilities than the System V IPC mechanisms or processes or devices.
Kernel objects in Windows use a uniform facility based on handles and names
in the NT namespace to reference kernel objects, along with a unified imple-
mentation in a centralized object manager. Handles are per-process but, as de-
scribed above, can be duplicated into another process. The object manager allows
objects to be given names when they are created, and then opened by name to get
handles for the objects.
The object manager uses Unicode (wide characters) to represent names in the
NT namespace. Unlike UNIX, NT does not generally distinguish between upper-
and lowercase (it is case preserving but case insensitive). The NT namespace is a
hierarchical tree-structured collection of directories, symbolic links and objects.
The object manager also provides unified facilities for synchronization, securi-
ty, and object lifetime management. Whether the general facilities provided by the
object manager are made available to users of any particular object is up to the ex-
ecutive components, as they provide the native APIs that manipulate each object
type.
It is not only applications that use objects managed by the object manager.
The operating system itself can also create and use objects—and does so heavily.
Most of these objects are created to allow one component of the system to store
some information for a substantial period of time or to pass some data structure to
another component, and yet benefit from the naming and lifetime support of the
object manager. For example, when a device is discovered, one or more device
objects are created to represent the device and to logically describe how the device
is connected to the rest of the system. To control the device a device driver is load-
ed, and a driver object is created holding its properties and providing pointers to
the functions it implements for processing the I/O requests. Within the operating
system the driver is then referred to by using its object. The driver can also be ac-
cessed directly by name rather than indirectly through the devices it controls (e.g.,
to set parameters governing its operation from user mode).
Unlike UNIX, which places the root of its namespace in the file system, the
root of the NT namespace is maintained in the kernel’s virtual memory. This
means that NT must recreate its top-level namespace every time the system boots.
Using kernel virtual memory allows NT to store information in the namespace
without first having to start the file system running. It also makes it much easier
for NT to add new types of kernel-mode objects to the system because the formats
of the file systems themselves do not have to be modified for each new object type.
A named object can be marked permanent, meaning that it continues to exist
until explicitly deleted or the system reboots, even if no process currently has a
handle for the object. Such objects can even extend the NT namespace by provid-
ing parse routines that allow the objects to function somewhat like mount points in
UNIX. File systems and the registry use this facility to mount volumes and hives
onto the NT namespace. Accessing the device object for a volume gives access to
the raw volume, but the device object also represents an implicit mount of the vol-
ume into the NT namespace. The individual files on a volume can be accessed by
concatenating the volume-relative file name onto the end of the name of the device
object for that volume.
Permanent names are also used to represent synchronization objects and shared
memory, so that they can be shared by processes without being continually recreat-
ed as processes stop and start. Device objects and often driver objects are given
permanent names, giving them some of the persistence properties of the special i-
nodes kept in the /dev directory of UNIX.
We will describe many more of the features in the native NT API in the next
section, where we discuss the Win32 APIs that provide wrappers around the NT
system calls.

11.2.2 The Win32 Application Programming Interface

The Win32 function calls are collectively called the Win32 API. These inter-
faces are publicly disclosed and fully documented. They are implemented as li-
brary procedures that either wrap the native NT system calls used to get the work
done or, in some cases, do the work right in user mode. Though the native NT
APIs are not published, most of the functionality they provide is accessible through
the Win32 API. The existing Win32 API calls rarely change with new releases of
Windows, though many new functions are added to the API.
Figure 11-8 shows various low-level Win32 API calls and the native NT API
calls that they wrap. What is interesting about the figure is how uninteresting the
mapping is. Most low-level Win32 functions have native NT equivalents, which is
not surprising as Win32 was designed with NT in mind. In many cases the Win32
layer must manipulate the Win32 parameters to map them onto NT, for example,
canonicalizing path names and mapping onto the appropriate NT path names, in-
cluding special MS-DOS device names (like LPT:). The Win32 APIs for creating
processes and threads also must notify the Win32 subsystem process, csrss.exe,
that there are new processes and threads for it to supervise, as we will describe in
Sec. 11.4.
Some Win32 calls take path names, whereas the equivalent NT calls use hand-
les. So the wrapper routines have to open the files, call NT, and then close the
handle at the end. The wrappers also translate the Win32 APIs from ANSI to Uni-
code. The Win32 functions shown in Fig. 11-8 that use strings as parameters are
actually two APIs, for example, CreateProcessW and CreateProcessA. The
strings passed to the latter API must be translated to Unicode before calling the un-
derlying NT API, since NT works only with Unicode.
Win32 call Native NT API call


CreateProcess NtCreateProcess
CreateThread NtCreateThread
SuspendThread NtSuspendThread
CreateSemaphore NtCreateSemaphore
ReadFile NtReadFile
DeleteFile NtSetInformationFile
CreateFileMapping NtCreateSection
VirtualAlloc NtAllocateVirtualMemory
MapViewOfFile NtMapViewOfSection
DuplicateHandle NtDuplicateObject
CloseHandle NtClose

Figure 11-8. Examples of Win32 API calls and the native NT API calls that they
wrap.

Since few changes are made to the existing Win32 interfaces in each release of
Windows, in theory the binary programs that ran correctly on any previous release
will continue to run correctly on a new release. In practice, there are often many
compatibility problems with new releases. Windows is so complex that a few
seemingly inconsequential changes can cause application failures. And applica-
tions themselves are often to blame, since they frequently make explicit checks for
specific operating system versions or fall victim to their own latent bugs that are
exposed when they run on a new release. Nevertheless, Microsoft makes an effort
in every release to test a wide variety of applications to find incompatibilities and
either correct them or provide application-specific workarounds.
Windows supports two special execution environments both called WOW
(Windows-on-Windows). WOW32 is used on 32-bit x86 systems to run 16-bit
Windows 3.x applications by mapping the system calls and parameters between the
16-bit and 32-bit worlds. Similarly, WOW64 allows 32-bit Windows applications
to run on x64 systems.
The Windows API philosophy is very different from the UNIX philosophy. In
the latter, the operating system functions are simple, with few parameters and few
places where there are multiple ways to perform the same operation. Win32 pro-
vides very comprehensive interfaces with many parameters, often with three or
four ways of doing the same thing, and mixing together low-level and high-level
functions, like CreateFile and CopyFile.
This means Win32 provides a very rich set of interfaces, but it also introduces
much complexity due to the poor layering of a system that intermixes both high-
level and low-level functions in the same API. For our study of operating systems,
only the low-level functions of the Win32 API that wrap the native NT API are rel-
evant, so those are what we will focus on.
Win32 has calls for creating and managing both processes and threads. There
are also many calls that relate to interprocess communication, such as creating, de-
stroying, and using mutexes, semaphores, events, communication ports, and other
IPC objects.
Although much of the memory-management system is invisible to pro-
grammers, one important feature is visible: namely the ability of a process to map
a file onto a region of its virtual memory. This allows threads running in a process
the ability to read and write parts of the file using pointers without having to expli-
citly perform read and write operations to transfer data between the disk and mem-
ory. With memory-mapped files the memory-management system itself performs
the I/Os as needed (demand paging).
Windows implements memory-mapped files using three completely different
facilities. First it provides interfaces which allow processes to manage their own
virtual address space, including reserving ranges of addresses for later use. Sec-
ond, Win32 supports an abstraction called a file mapping, which is used to repres-
ent addressable objects like files (a file mapping is called a section in the NT
layer). Most often, file mappings are created to refer to files using a file handle,
but they can also be created to refer to private pages allocated from the system
pagefile.
The third facility maps views of file mappings into a process’ address space.
Win32 allows a view to be created only for the current process, but the underlying
NT facility is more general, allowing views to be created for any process for which
you have a handle with the appropriate permissions. Separating the creation of a
file mapping from the operation of mapping the file into the address space is a
different approach from the one used in the mmap function in UNIX.
In Windows, the file mappings are kernel-mode objects represented by a hand-
le. Like most handles, file mappings can be duplicated into other processes. Each
of these processes can map the file mapping into its own address space as it sees
fit. This is useful for sharing private memory between processes without having to
create files for sharing. At the NT layer, file mappings (sections) can also be made
persistent in the NT namespace and accessed by name.
An important area for many programs is file I/O. In the basic Win32 view, a
file is just a linear sequence of bytes. Win32 provides over 60 calls for creating
and destroying files and directories, opening and closing files, reading and writing
them, requesting and setting file attributes, locking ranges of bytes, and many more
fundamental operations on both the organization of the file system and access to
individual files.
There are also various advanced facilities for managing data in files. In addi-
tion to the primary data stream, files stored on the NTFS file system can have addi-
tional data streams. Files (and even entire volumes) can be encrypted. Files can be
compressed, and/or represented as a sparse stream of bytes where missing regions
of data in the middle occupy no storage on disk. File-system volumes can be
organized out of multiple separate disk partitions using different levels of RAID
storage. Modifications to files or directory subtrees can be detected through a
notification mechanism, or by reading the journal that NTFS maintains for each
volume.
Each file-system volume is implicitly mounted in the NT namespace, accord-
ing to the name given to the volume, so a file \foo\bar might be named, for ex-
ample, \Device\HarddiskVolume\foo\bar. Internal to each NTFS volume, mount
points (called reparse points in Windows) and symbolic links are supported to help
organize the individual volumes.
The low-level I/O model in Windows is fundamentally asynchronous. Once an
I/O operation is begun, the system call can return and allow the thread which initi-
ated the I/O to continue in parallel with the I/O operation. Windows supports can-
cellation, as well as a number of different mechanisms for threads to synchronize
with I/O operations when they complete. Windows also allows programs to speci-
fy that I/O should be synchronous when a file is opened, and many library func-
tions, such as the C library and many Win32 calls, specify synchronous I/O for
compatibility or to simplify the programming model. In these cases the executive
will explicitly synchronize with I/O completion before returning to user mode.
Another area for which Win32 provides calls is security. Every thread is asso-
ciated with a kernel-mode object, called a token, which provides information about
the identity and privileges associated with the thread. Every object can have an
ACL (Access Control List) telling in great detail precisely which users may ac-
cess it and which operations they may perform on it. This approach provides for
fine-grained security in which specific users can be allowed or denied specific ac-
cess to every object. The security model is extensible, allowing applications to add
new security rules, such as limiting the hours access is permitted.
The Win32 namespace is different from the native NT namespace described in
the previous section. Only parts of the NT namespace are visible to Win32 APIs
(though the entire NT namespace can be accessed through a Win32 hack that uses
special prefix strings, like ‘‘\\.’’). In Win32, files are accessed relative to drive let-
ters. The NT directory \DosDevices contains a set of symbolic links from drive
letters to the actual device objects. For example, \DosDevices\C: might be a link
to \Device\HarddiskVolume1. This directory also contains links for other Win32
devices, such as COM1:, LPT:, and NUL: (for the serial and printer ports and the
all-important null device). \DosDevices is really a symbolic link to \?? which
was chosen for efficiency. Another NT directory, \BaseNamedObjects, is used to
store miscellaneous named kernel-mode objects accessible through the Win32 API.
These include synchronization objects like semaphores, shared memory, timers,
communication ports, and device names.
In addition to low-level system interfaces we have described, the Win32 API
also supports many functions for GUI operations, including all the calls for manag-
ing the graphical interface of the system. There are calls for creating, destroying,
managing, and using windows, menus, tool bars, status bars, scroll bars, dialog
boxes, icons, and many more items that appear on the screen. There are calls for
drawing geometric figures, filling them in, managing the color palettes they use,
dealing with fonts, and placing icons on the screen. Finally, there are calls for
dealing with the keyboard, mouse and other human-input devices as well as audio,
printing, and other output devices.
The GUI operations work directly with the win32k.sys driver using special in-
terfaces to access these functions in kernel mode from user-mode libraries. Since
these calls do not involve the core system calls in the NTOS executive, we will not
say more about them.

11.2.3 The Windows Registry

The root of the NT namespace is maintained in the kernel. Storage, such as
file-system volumes, is attached to the NT namespace. Since the NT namespace is
constructed afresh every time the system boots, how does the system know about
any specific details of the system configuration? The answer is that Windows
attaches a special kind of file system (optimized for small files) to the NT name-
space. This file system is called the registry. The registry is organized into sepa-
rate volumes called hives. Each hive is kept in a separate file (in the directory
C:\Windows\system32\config\ of the boot volume). When a Windows system
boots, one particular hive named SYSTEM is loaded into memory by the same boot
program that loads the kernel and other boot files, such as boot drivers, from the
boot volume.
Windows keeps a great deal of crucial information in the SYSTEM hive, in-
cluding information about what drivers to use with what devices, what software to
run initially, and many parameters governing the operation of the system. This
information is used even by the boot program itself to determine which drivers are
boot drivers, being needed immediately upon boot. Such drivers include those that
understand the file system and disk drivers for the volume containing the operating
system itself.
Other configuration hives are used after the system boots to describe infor-
mation about the software installed on the system, particular users, and the classes
of user-mode COM (Component Object-Model) objects that are installed on the
system. Login information for local users is kept in the SAM (Security Access
Manager) hive. Information for network users is maintained by the lsass service
in the security hive and coordinated with the network directory servers so that
users can have a common account name and password across an entire network. A
list of the hives used in Windows is shown in Fig. 11-9.
Prior to the introduction of the registry, configuration information in Windows
was kept in hundreds of .ini (initialization) files spread across the disk. The reg-
istry gathers these files into a central store, which is available early in the process
of booting the system. This is important for implementing Windows plug-and-play
functionality. Unfortunately, the registry has become seriously disorganized over
time as Windows has evolved. There are poorly defined conventions about how the
Hive file Mounted name Use

SYSTEM HKLM\SYSTEM OS configuration information, used by kernel
HARDWARE HKLM\HARDWARE In-memory hive recording hardware detected
BCD HKLM\BCD* Boot Configuration Database
SAM HKLM\SAM Local user account information
SECURITY HKLM\SECURITY lsass’ account and other security information
DEFAULT HKEY_USERS\.DEFAULT Default hive for new users
NTUSER.DAT HKEY_USERS\<user id> User-specific hive, kept in home directory
SOFTWARE HKLM\SOFTWARE Application classes registered by COM
COMPONENTS HKLM\COMPONENTS Manifests and dependencies for sys. components

Figure 11-9. The registry hives in Windows. HKLM is a shorthand for
HKEY_LOCAL_MACHINE.

configuration information should be arranged, and many applications take an ad


hoc approach. Most users, applications, and all drivers run with full privileges and
frequently modify system parameters in the registry directly—sometimes interfer-
ing with each other and destabilizing the system.
The registry is a strange cross between a file system and a database, and yet
really unlike either. Entire books have been written describing the registry (Born,
1998; Hipson, 2002; and Ivens, 1998), and many companies have sprung up to sell
special software just to manage the complexity of the registry.
To explore the registry Windows has a GUI program called regedit that allows
you to open and explore the directories (called keys) and data items (called values).
Microsoft’s PowerShell scripting language can also be useful for walking through
the keys and values of the registry as if they were directories and files. A more
interesting tool to use is procmon, which is available from Microsoft’s tools Web-
site: www.microsoft.com/technet/sysinternals.
Procmon watches all the registry accesses that take place in the system and is
very illuminating. Some programs will access the same key over and over tens of
thousands of times.
As the name implies, regedit allows users to edit the registry—but be very
careful if you ever do. It is very easy to render your system unable to boot, or
damage the installation of applications so that you cannot fix them without a lot of
wizardry. Microsoft has promised to clean up the registry in future releases, but
for now it is a huge mess, far more complicated than the configuration infor-
mation maintained in UNIX. The complexity and fragility of the registry led
designers of new operating systems, in particular iOS and Android, to avoid
anything like it.
The registry is accessible to the Win32 programmer. There are calls to create
and delete keys, look up values within keys, and more. Some of the more useful
ones are listed in Fig. 11-10.
Win32 API function Description


RegCreateKeyEx Create a new registry key
RegDeleteKey Delete a registry key
RegOpenKeyEx Open a key to get a handle to it
RegEnumKeyEx Enumerate the subkeys subordinate to the key of the handle
RegQueryValueEx Look up the data for a value within a key

Figure 11-10. Some of the Win32 API calls for using the registry.

When the system is turned off, most of the registry information is stored on the
disk in the hives. Because their integrity is so critical to correct system func-
tioning, backups are made automatically and metadata writes are flushed to disk to
prevent corruption in the event of a system crash. Loss of the registry requires
reinstalling all software on the system.

11.3 SYSTEM STRUCTURE

In the previous sections we examined Windows as seen by the programmer
writing code for user mode. Now we are going to look under the hood to see how
the system is organized internally, what the various components do, and how they
interact with each other and with user programs. This is the part of the system
seen by the programmer implementing low-level user-mode code, like subsystems
and native services, as well as the view of the system provided to device-driver
writers.
Although there are many books on how to use Windows, there are many fewer
on how it works inside. One of the best places to look for additional information
on this topic is Microsoft Windows Internals, 6th ed., Parts 1 and 2 (Russinovich
and Solomon, 2012).

11.3.1 Operating System Structure

As described earlier, the Windows operating system consists of many layers, as
depicted in Fig. 11-4. In the following sections we will dig into the lowest levels
of the operating system: those that run in kernel mode. The central layer is the
NTOS kernel itself, which is loaded from ntoskrnl.exe when Windows boots.
NTOS itself consists of two layers, the executive, which contains most of the
services, and a smaller layer which is (also) called the kernel and implements the
underlying thread scheduling and synchronization abstractions (a kernel within the
kernel?), as well as implementing trap handlers, interrupts, and other aspects of
how the CPU is managed.
The division of NTOS into kernel and executive is a reflection of NT’s
VAX/VMS roots. The VMS operating system, which was also designed by Cutler,
had four hardware-enforced layers: user, supervisor, executive, and kernel corres-
ponding to the four protection modes provided by the VAX processor architecture.
The Intel CPUs also support four rings of protection, but some of the early target
processors for NT did not, so the kernel and executive layers represent a soft-
ware-enforced abstraction, and the functions that VMS provides in supervisor
mode, such as printer spooling, are provided by NT as user-mode services.
The kernel-mode layers of NT are shown in Fig. 11-11. The kernel layer of
NTOS is shown above the executive layer because it implements the trap and inter-
rupt mechanisms used to transition from user mode to kernel mode.

User mode:
    System library kernel user-mode dispatch routines (ntdll.dll)
Kernel mode:
    NTOS kernel layer: trap/exception/interrupt dispatch;
        CPU scheduling and synchronization: threads, ISRs, DPCs, APCs
    NTOS executive layer: procs and threads, virtual memory, object manager,
        config manager, LPC, cache manager, I/O manager, security monitor,
        executive run-time library
    Drivers: file systems, volume manager, TCP/IP stack, net interfaces,
        graphics devices, all other devices
    Hardware abstraction layer
Hardware:
    CPU, MMU, interrupt controllers, memory, physical devices, BIOS
Figure 11-11. Windows kernel-mode organization.

The uppermost layer in Fig. 11-11 is the system library (ntdll.dll), which ac-
tually runs in user mode. The system library includes a number of support func-
tions for the compiler run-time and low-level libraries, similar to what is in libc in
UNIX. ntdll.dll also contains special code entry points used by the kernel to ini-
tialize threads and dispatch exceptions and user-mode APCs (Asynchronous Pro-
cedure Calls). Because the system library is so integral to the operation of the ker-
nel, every user-mode process created by NTOS has ntdll mapped at the same fixed
address. When NTOS is initializing the system it creates a section object to use
when mapping ntdll, and it also records addresses of the ntdll entry points used by
the kernel.
Below the NTOS kernel and executive layers is a layer of software called the
HAL (Hardware Abstraction Layer) which abstracts low-level hardware details
like access to device registers and DMA operations, and the way the parentboard
firmware represents configuration information and deals with differences in the
CPU support chips, such as various interrupt controllers.
The lowest software layer is the hypervisor, which Windows calls Hyper-V.
The hypervisor is an optional feature (not shown in Fig. 11-11). It is available in
many versions of Windows—including the professional desktop client. The hyper-
visor intercepts many of the privileged operations performed by the kernel and
emulates them in a way that allows multiple operating systems to run at the same
time. Each operating system runs in its own virtual machine, which Windows calls
a partition. The hypervisor uses features in the hardware architecture to protect
physical memory and provide isolation between partitions. An operating system
running on top of the hypervisor executes threads and handles interrupts on
abstractions of the physical processors called virtual processors. The hypervisor
schedules the virtual processors on the physical processors.
The main (root) operating system runs in the root partition. It provides many
services to the other (guest) partitions. Some of the most important services pro-
vide integration of the guests with the shared devices such as networking and the
GUI. While the root operating system must be Windows when running Hyper-V,
other operating systems, such as Linux, can be run in the guest partitions. A guest
operating system may perform very poorly unless it has been modified (i.e., para-
virtualized) to work with the hypervisor.
For example, if a guest operating system kernel is using a spinlock to synchro-
nize between two virtual processors and the hypervisor reschedules the virtual
processor holding the spinlock, the lock hold time may increase by orders of mag-
nitude, leaving other virtual processors running in the partition spinning for very
long periods of time. To solve this problem a guest operating system is enlight-
ened to spin only a short time before calling into the hypervisor to yield its physi-
cal processor to run another virtual processor.
The other major components of kernel mode are the device drivers. Windows
uses device drivers for any kernel-mode facilities which are not part of NTOS or
the HAL. This includes file systems, network protocol stacks, and kernel exten-
sions like antivirus and DRM (Digital Rights Management) software, as well as
drivers for managing physical devices, interfacing to hardware buses, and so on.
The I/O and virtual memory components cooperate to load (and unload) device
drivers into kernel memory and link them to the NTOS and HAL layers. The I/O
manager provides interfaces which allow devices to be discovered, organized, and
operated—including arranging to load the appropriate device driver. Much of the
configuration information for managing devices and drivers is maintained in the
SYSTEM hive of the registry. The plug-and-play subcomponent of the I/O man-
ager maintains information about the hardware detected within the HARDWARE
hive, which is a volatile hive maintained in memory rather than on disk, as it is
completely recreated every time the system boots.
We will now examine the various components of the operating system in a bit
more detail.
The Hardware Abstraction Layer

One goal of Windows is to make the system portable across hardware plat-
forms. Ideally, to bring up an operating system on a new type of computer system
it should be possible to just recompile the operating system on the new platform.
Unfortunately, it is not this simple. While many of the components in some layers
of the operating system can be largely portable (because they mostly deal with in-
ternal data structures and abstractions that support the programming model), other
layers must deal with device registers, interrupts, DMA, and other hardware fea-
tures that differ significantly from machine to machine.
Most of the source code for the NTOS kernel is written in C rather than assem-
bly language (only 2% is assembly on x86, and less than 1% on x64). However, all
this C code cannot just be scooped up from an x86 system, plopped down on, say,
an ARM system, recompiled, and rebooted owing to the many hardware differ-
ences between processor architectures that have nothing to do with the different in-
struction sets and which cannot be hidden by the compiler. Languages like C make
it difficult to abstract away some hardware data structures and parameters, such as
the format of page-table entries and the physical memory page sizes and word
length, without severe performance penalties. All of these, as well as a slew of
hardware-specific optimizations, would have to be manually ported even though
they are not written in assembly code.
Hardware details about how memory is organized on large servers, or what
hardware synchronization primitives are available, can also have a big impact on
higher levels of the system. For example, NT’s virtual memory manager and the
kernel layer are aware of hardware details related to cache and memory locality.
Throughout the system NT uses compare&swap synchronization primitives, and it
would be difficult to port to a system that does not have them. Finally, there are
many dependencies in the system on the ordering of bytes within words. On all the
systems NT has ever been ported to, the hardware was set to little-endian mode.
Besides these larger issues of portability, there are also minor ones even be-
tween motherboards from different manufacturers. Differences in CPU
versions affect how synchronization primitives like spin-locks are implemented.
There are several families of support chips that create differences in how hardware
interrupts are prioritized, how I/O device registers are accessed, how DMA
transfers are managed, how the timers and real-time clock are controlled, how
multiprocessors are synchronized, how firmware facilities such as ACPI (Advanced
Configuration and Power Interface) are used, and so on. Microsoft made a serious
attempt to hide these
types of machine dependencies in a thin layer at the bottom called the HAL, as
mentioned earlier. The job of the HAL is to present the rest of the operating sys-
tem with abstract hardware that hides the specific details of processor version, sup-
port chipset, and other configuration variations. These HAL abstractions are pres-
ented in the form of machine-independent services (procedure calls and macros)
that NTOS and the drivers can use.
SEC. 11.3 SYSTEM STRUCTURE 881

By using the HAL services and not addressing the hardware directly, drivers
and the kernel require fewer changes when being ported to new processors—and in
most cases can run unmodified on systems with the same processor architecture,
despite differences in versions and support chips.
The HAL does not provide abstractions or services for specific I/O devices
such as keyboards, mice, and disks or for the memory management unit. These
facilities are spread throughout the kernel-mode components, and without the HAL
the amount of code that would have to be modified when porting would be sub-
stantial, even when the actual hardware differences were small. Porting the HAL
itself is straightforward because all the machine-dependent code is concentrated in
one place and the goals of the port are well defined: implement all of the HAL ser-
vices. For many releases Microsoft supported a HAL Development Kit allowing
system manufacturers to build their own HAL, which would allow other kernel
components to work on new systems without modification, provided that the hard-
ware changes were not too great.
As an example of what the hardware abstraction layer does, consider the issue
of memory-mapped I/O vs. I/O ports. Some machines have one and some have the
other. How should a driver be programmed: to use memory-mapped I/O or not?
Rather than forcing a choice, which would make the driver not portable to a ma-
chine that did it the other way, the hardware abstraction layer offers three proce-
dures for driver writers to use for reading the device registers and another three for
writing them:
uc = READ_PORT_UCHAR(port);     WRITE_PORT_UCHAR(port, uc);
us = READ_PORT_USHORT(port);    WRITE_PORT_USHORT(port, us);
ul = READ_PORT_ULONG(port);     WRITE_PORT_ULONG(port, ul);

These procedures read and write unsigned 8-, 16-, and 32-bit integers, respectively,
to the specified port. It is up to the hardware abstraction layer to decide whether
memory-mapped I/O is needed here. In this way, a driver can be moved without
modification between machines that differ in the way the device registers are im-
plemented.
Drivers frequently need to access specific I/O devices for various purposes. At
the hardware level, a device has one or more addresses on a certain bus. Since
modern computers often have multiple buses (PCI, PCIe, USB, IEEE 1394, etc.), it
can happen that more than one device may have the same address on different
buses, so some way is needed to distinguish them. The HAL provides a service for
identifying devices by mapping bus-relative device addresses onto systemwide log-
ical addresses. In this way, drivers do not have to keep track of which device is
connected to which bus. This mechanism also shields higher layers from proper-
ties of alternative bus structures and addressing conventions.
Interrupts have a similar problem—they are also bus dependent. Here, too, the
HAL provides services to name interrupts in a systemwide way and also provides
ways to allow drivers to attach interrupt service routines to interrupts in a portable
way, without having to know anything about which interrupt vector is for which
bus. Interrupt request level management is also handled in the HAL.
Another HAL service is setting up and managing DMA transfers in a de-
vice-independent way. Both the systemwide DMA engine and DMA engines on
specific I/O cards can be handled. Devices are referred to by their logical ad-
dresses. The HAL implements software scatter/gather (writing or reading from
noncontiguous blocks of physical memory).
The HAL also manages clocks and timers in a portable way. Time is kept in
units of 100 nanoseconds starting at midnight on 1 January 1601, the first date
of the previous quadricentury, which simplifies leap-year computations. (Quick
Quiz: Was 1800 a leap year? Quick Answer: No.) The time services
decouple the drivers from the actual frequencies at which the clocks run.
Kernel components sometimes need to synchronize at a very low level, espe-
cially to prevent race conditions in multiprocessor systems. The HAL provides
primitives to manage this synchronization, such as spin locks, in which one CPU
simply waits for a resource held by another CPU to be released, particularly in
situations where the resource is typically held only for a few machine instructions.
Finally, after the system has been booted, the HAL talks to the computer’s
firmware (BIOS) and inspects the system configuration to find out which buses and
I/O devices the system contains and how they have been configured. This infor-
mation is then put into the registry. A summary of some of the things the HAL
does is given in Fig. 11-12.
[Figure 11-12 depicts the hardware abstraction layer sitting between the
hardware (CPUs executing code, RAM, disk, printer) and the rest of the system,
managing device registers, device addresses, interrupts, DMA, timers, spin
locks, and firmware.]

Figure 11-12. Some of the hardware functions the HAL manages.

The Kernel Layer

Above the hardware abstraction layer is NTOS, consisting of two layers: the
kernel and the executive. ‘‘Kernel’’ is a confusing term in Windows. It can refer to
all the code that runs in the processor’s kernel mode. It can also refer to the
ntoskrnl.exe file which contains NTOS, the core of the Windows operating system.
Or it can refer to the kernel layer within NTOS, which is how we use it in this sec-
tion. It is even used to name the user-mode Win32 library that provides the wrap-
pers for the native system calls: kernel32.dll.
In the Windows operating system the kernel layer, illustrated above the execu-
tive layer in Fig. 11-11, provides a set of abstractions for managing the CPU. The
most central abstraction is threads, but the kernel also implements exception han-
dling, traps, and several kinds of interrupts. Creating and destroying the data struc-
tures which support threading is implemented in the executive layer. The kernel
layer is responsible for scheduling and synchronization of threads. Having support
for threads in a separate layer allows the executive layer to be implemented using
the same preemptive multithreading model used to write concurrent code in user
mode, though the synchronization primitives in the executive are much more spe-
cialized.
The kernel’s thread scheduler is responsible for determining which thread is
executing on each CPU in the system. Each thread executes until a timer interrupt
signals that it is time to switch to another thread (quantum expired), or until the
thread needs to wait for something to happen, such as an I/O to complete or for a
lock to be released, or a higher-priority thread becomes runnable and needs the
CPU. When switching from one thread to another, the scheduler runs on the CPU
and ensures that the registers and other hardware state have been saved. The
scheduler then selects another thread to run on the CPU and restores the state that
was previously saved from the last time that thread ran.
If the next thread to be run is in a different address space (i.e., process) than
the thread being switched from, the scheduler must also change address spaces.
The details of the scheduling algorithm itself will be discussed later in this chapter
when we come to processes and threads.
In addition to providing a higher-level abstraction of the hardware and han-
dling thread switches, the kernel layer also has another key function: providing
low-level support for two classes of synchronization mechanisms: control objects
and dispatcher objects. Control objects are the data structures that the kernel
layer provides as abstractions to the executive layer for managing the CPU. They
are allocated by the executive but they are manipulated with routines provided by
the kernel layer. Dispatcher objects are the class of ordinary executive objects
that use a common data structure for synchronization.

Deferred Procedure Calls

Control objects include primitive objects for threads, interrupts, timers, syn-
chronization, profiling, and two special objects for implementing DPCs and APCs.
DPC (Deferred Procedure Call) objects are used to reduce the time taken to ex-
ecute ISRs (Interrupt Service Routines) in response to an interrupt from a partic-
ular device. Limiting time spent in ISRs reduces the chance of losing an interrupt.
The system hardware assigns a hardware priority level to interrupts. The CPU
also associates a priority level with the work it is performing. The CPU responds
only to interrupts at a higher-priority level than the one it is currently using. The
normal priority level, which includes all user-mode work, is 0. Device interrupts
occur at priority 3 or higher, and the ISR for a device interrupt normally executes
at the same priority level as the interrupt in order to keep other, less important
interrupts from occurring while it is processing a more important one.
If an ISR executes too long, the servicing of lower-priority interrupts will be
delayed, perhaps causing data to be lost or slowing the I/O throughput of the sys-
tem. Multiple ISRs can be in progress at any one time, with each successive ISR
being due to interrupts at higher and higher-priority levels.
To reduce the time spent processing ISRs, only the critical operations are per-
formed, such as capturing the result of an I/O operation and reinitializing the de-
vice. Further processing of the interrupt is deferred until the CPU priority level is
lowered and no longer blocking the servicing of other interrupts. The DPC object
is used to represent the further work to be done and the ISR calls the kernel layer
to queue the DPC to the list of DPCs for a particular processor. If the DPC is the
first on the list, the kernel registers a special request with the hardware to interrupt
the CPU at priority 2 (which NT calls DISPATCH level). When the last of any ex-
ecuting ISRs completes, the interrupt level of the processor will drop back below 2,
and that will unblock the interrupt for DPC processing. The ISR for the DPC inter-
rupt will process each of the DPC objects that the kernel had queued.
The technique of using software interrupts to defer interrupt processing is a
well-established method of reducing ISR latency. UNIX and other systems started
using deferred processing in the 1970s to deal with the slow hardware and limited
buffering of serial connections to terminals. The ISR would deal with fetching
characters from the hardware and queuing them. After all higher-level interrupt
processing was completed, a software interrupt would run a low-priority ISR to do
character processing, such as implementing backspace by sending control charac-
ters to the terminal to erase the last character displayed and move the cursor back-
ward.
A similar example in Windows today is the keyboard device. After a key is
struck, the keyboard ISR reads the key code from a register and then reenables the
keyboard interrupt but does not do further processing of the key immediately. In-
stead, it uses a DPC to queue the processing of the key code until all outstanding
device interrupts have been processed.
Because DPCs run at level 2 they do not keep device ISRs from executing, but
they do prevent any threads from running until all the queued DPCs complete and
the CPU priority level is lowered below 2. Device drivers and the system itself
must take care not to run either ISRs or DPCs for too long. Because threads are
not allowed to execute, ISRs and DPCs can make the system appear sluggish and
produce glitches when playing music by stalling the threads writing the music
buffer to the sound device. Another common use of DPCs is running routines in
response to a timer interrupt. To avoid blocking threads, timer events which need
to run for an extended time should queue requests to the pool of worker threads the
kernel maintains for background activities.

Asynchronous Procedure Calls

The other special kernel control object is the APC (Asynchronous Procedure
Call) object. APCs are like DPCs in that they defer processing of a system rou-
tine, but unlike DPCs, which operate in the context of particular CPUs, APCs ex-
ecute in the context of a specific thread. When processing a key press, it does not
matter which context the DPC runs in because a DPC is simply another part of in-
terrupt processing, and interrupts only need to manage the physical device and per-
form thread-independent operations such as recording the data in a buffer in kernel
space.
The DPC routine runs in the context of whatever thread happened to be run-
ning when the original interrupt occurred. It calls into the I/O system to report that
the I/O operation has been completed, and the I/O system queues an APC to run in
the context of the thread making the original I/O request, where it can access the
user-mode address space of the thread that will process the input.
At the next convenient time the kernel layer delivers the APC to the thread and
schedules the thread to run. An APC is designed to look like an unexpected proce-
dure call, somewhat similar to signal handlers in UNIX. The kernel-mode APC for
completing I/O executes in the context of the thread that initiated the I/O, but in
kernel mode. This gives the APC access to both the kernel-mode buffer as well as
all of the user-mode address space belonging to the process containing the thread.
When an APC is delivered depends on what the thread is already doing, and even
on what type of system it is running on. In a multiprocessor system the thread receiving the APC may
begin executing even before the DPC finishes running.
User-mode APCs can also be used to deliver notification of I/O completion in
user mode to the thread that initiated the I/O. User-mode APCs invoke a user-
mode procedure designated by the application, but only when the target thread has
blocked in the kernel and is marked as willing to accept APCs. The kernel inter-
rupts the thread from waiting and returns to user mode, but with the user-mode
stack and registers modified to run the APC dispatch routine in the ntdll.dll system
library. The APC dispatch routine invokes the user-mode routine that the applica-
tion has associated with the I/O operation. Besides specifying user-mode APCs as
a means of executing code when I/Os complete, the Win32 API QueueUserAPC
allows APCs to be used for arbitrary purposes.
The executive layer also uses APCs for operations other than I/O completion.
Because the APC mechanism is carefully designed to deliver APCs only when it is
safe to do so, it can be used to safely terminate threads. If it is not a good time to
terminate the thread, the thread will have declared that it was entering a critical re-
gion and defer deliveries of APCs until it leaves. Kernel threads mark themselves
as entering critical regions to defer APCs when acquiring locks or other resources,
so that they cannot be terminated while still holding the resource.

Dispatcher Objects

Another kind of synchronization object is the dispatcher object. This is any
ordinary kernel-mode object (the kind that users can refer to with handles) that
contains a data structure called a dispatcher header, shown in Fig. 11-13. These
objects include semaphores, mutexes, events, waitable timers, and other objects
that threads can wait on to synchronize execution with other threads. They also in-
clude objects representing open files, processes, threads, and IPC ports. The dis-
patcher data structure contains a flag representing the signaled state of the object,
and a queue of threads waiting for the object to be signaled.

[Figure 11-13 depicts an executive object: the object header is followed by the
DISPATCHER_HEADER structure, which holds the notification/synchronization flag,
the signaled state, and the list head for waiting threads, and then by the
object-specific data.]

Figure 11-13. The dispatcher header data structure embedded in many executive
objects (dispatcher objects).

Synchronization primitives, like semaphores, are natural dispatcher objects.
Also timers, files, ports, threads, and processes use the dispatcher-object mechan-
isms for notifications. When a timer fires, I/O completes on a file, data are avail-
able on a port, or a thread or process terminates, the associated dispatcher object is
signaled, waking all threads waiting for that event.
Since Windows uses a single unified mechanism for synchronization with ker-
nel-mode objects, specialized APIs, such as wait3, for waiting for child processes
in UNIX, are not needed to wait for events. Often threads want to wait for multiple
events at once. In UNIX a process can wait for data to be available on any of 64
network sockets using the select system call. In Windows, there is a similar API
WaitForMultipleObjects, but it allows for a thread to wait on any type of dis-
patcher object for which it has a handle. Up to 64 handles can be specified to Wait-
ForMultipleObjects, as well as an optional timeout value. The thread becomes
ready to run whenever any of the events associated with the handles is signaled or
the timeout occurs.
There are actually two different procedures the kernel uses for making the
threads waiting on a dispatcher object runnable. Signaling a notification object
will make every waiting thread runnable. Synchronization objects make only the
first waiting thread runnable and are used for dispatcher objects that implement
locking primitives, like mutexes. When a thread that is waiting for a lock begins
running again, the first thing it does is to retry acquiring the lock. If only one
thread can hold the lock at a time, all the other threads made runnable might im-
mediately block, incurring lots of unnecessary context switching. The difference
between dispatcher objects using synchronization vs. notification is a flag in the
dispatcher header structure.
As a little aside, mutexes in Windows are called ‘‘mutants’’ in the code be-
cause they were required to implement the OS/2 semantics of not automatically
unlocking themselves when a thread holding one exited, something Cutler consid-
ered bizarre.

The Executive Layer

As shown in Fig. 11-11, below the kernel layer of NTOS there is the executive.
The executive layer is written in C, is mostly architecture independent (the memo-
ry manager being a notable exception), and has been ported with only modest
effort to new processors (MIPS, x86, PowerPC, Alpha, IA64, x64, and ARM). The
executive contains a number of different components, all of which run using the
control abstractions provided by the kernel layer.
Each component is divided into internal and external data structures and inter-
faces. The internal aspects of each component are hidden and used only within the
component itself, while the external aspects are available to all the other
components within the executive. A subset of the external interfaces is exported
from the ntoskrnl.exe executable, and device drivers can link to them as if the
executive were a library. Microsoft calls many of the executive components
‘‘managers,’’ because each is in charge of managing some aspect of operating
system services, such as I/O, memory, processes, objects, etc.
As with most operating systems, much of the functionality in the Windows ex-
ecutive is like library code, except that it runs in kernel mode so its data structures
can be shared and protected from access by user-mode code, and so it can access
kernel-mode state, such as the MMU control registers. But otherwise the executive
is simply executing operating system functions on behalf of its caller, and thus runs
in the thread of its caller.
When any of the executive functions block waiting to synchronize with other
threads, the user-mode thread is blocked, too. This makes sense when working on
behalf of a particular user-mode thread, but it can be unfair when doing work relat-
ed to common housekeeping tasks. To avoid hijacking the current thread when the
executive determines that some housekeeping is needed, a number of kernel-mode
threads are created when the system boots and dedicated to specific tasks, such as
making sure that modified pages get written to disk.
For predictable, low-frequency tasks, there is a thread that runs once a second
and has a laundry list of items to handle. For less predictable work there is the
pool of high-priority worker threads mentioned earlier which can be used to run
bounded tasks by queuing a request and signaling the synchronization event that
the worker threads are waiting on.
The object manager manages most of the interesting kernel-mode objects
used in the executive layer. These include processes, threads, files, semaphores,
I/O devices and drivers, timers, and many others. As described previously, kernel-
mode objects are really just data structures allocated and used by the kernel. In
Windows, kernel data structures have enough in common that it is very useful to
manage many of them in a unified facility.
The facilities provided by the object manager include managing the allocation
and freeing of memory for objects, quota accounting, supporting access to objects
using handles, maintaining reference counts for kernel-mode pointer references as
well as handle references, giving objects names in the NT namespace, and provid-
ing an extensible mechanism for managing the lifecycle for each object. Kernel
data structures which need some of these facilities are managed by the object man-
ager.
Object-manager objects each have a type which is used to specify exactly how
the lifecycle of objects of that type is to be managed. These are not types in the
object-oriented sense, but are simply a collection of parameters specified when the
object type is created. To create a new type, an executive component calls an ob-
ject-manager API to create a new type. Objects are so central to the functioning of
Windows that the object manager will be discussed in more detail in the next sec-
tion.
The I/O manager provides the framework for implementing I/O device drivers
and provides a number of executive services specific to configuring, accessing, and
performing operations on devices. In Windows, device drivers not only manage
physical devices but they also provide extensibility to the operating system. Many
functions that are compiled into the kernel on other systems are dynamically load-
ed and linked by the kernel on Windows, including network protocol stacks and
file systems.
Recent versions of Windows have a lot more support for running device drivers
in user mode, and this is the preferred model for new device drivers. There are
hundreds of thousands of different device drivers for Windows working with more
than a million distinct devices. This represents a lot of code to get correct. It is
much better if bugs cause a device to become inaccessible by crashing in a user-
mode process rather than causing the system to crash. Bugs in kernel-mode device
drivers are the major source of the dreaded BSOD (Blue Screen Of Death) where
Windows detects a fatal error within kernel mode and shuts down or reboots the
system. BSODs are comparable to kernel panics on UNIX systems.
In essence, Microsoft has now officially recognized what researchers in the
area of microkernels such as MINIX 3 and L4 have known for years: the more
code there is in the kernel, the more bugs there are in the kernel. Since device driv-
ers make up something in the vicinity of 70% of the code in the kernel, the more
drivers that can be moved into user-mode processes, where a bug will only trigger
the failure of a single driver (rather than bringing down the entire system), the bet-
ter. The trend of moving code from the kernel to user-mode processes is expected
to accelerate in the coming years.
The I/O manager also includes the plug-and-play and device power-man-
agement facilities. Plug-and-play comes into action when new devices are detect-
ed on the system. The plug-and-play subcomponent is first notified. It works with
a service, the user-mode plug-and-play manager, to find the appropriate device
driver and load it into the system. Getting the right one is not always easy and
sometimes depends on sophisticated matching of the specific hardware device ver-
sion to a particular version of the drivers. Sometimes a single device supports a
standard interface which is supported by multiple different drivers, written by dif-
ferent companies.
We will study I/O further in Sec. 11.7 and the most important NT file system,
NTFS, in Sec. 11.8.
Device power management reduces power consumption when possible, ex-
tending battery life on notebooks, and saving energy on desktops and servers. Get-
ting power management correct can be challenging, as there are many subtle
dependencies between devices and the buses that connect them to the CPU and
memory. Power consumption is not affected just by what devices are powered-on,
but also by the clock rate of the CPU, which is also controlled by the device power
manager. We will take a more in depth look at power management in Sec. 11.9.
The process manager manages the creation and termination of processes and
threads, including establishing the policies and parameters which govern them.
But the operational aspects of threads are determined by the kernel layer, which
controls scheduling and synchronization of threads, as well as their interaction
with the control objects, like APCs. Processes contain threads, an address space,
and a handle table containing the handles the process can use to refer to kernel-
mode objects. Processes also include information needed by the scheduler for
switching between address spaces and managing process-specific hardware infor-
mation (such as segment descriptors). We will study process and thread man-
agement in Sec. 11.4.
The executive memory manager implements the demand-paged virtual mem-
ory architecture. It manages the mapping of virtual pages onto physical page
frames, the management of the available physical frames, and management of the
pagefile on disk used to back private instances of virtual pages that are no longer
loaded in memory. The memory manager also provides special facilities for large
server applications such as databases and programming language run-time compo-
nents such as garbage collectors. We will study memory management later in this
chapter, in Sec. 11.5.
The cache manager optimizes the performance of I/O to the file system by
maintaining a cache of file-system pages in the kernel virtual address space. The
cache manager uses virtually addressed caching, that is, organizing cached pages
in terms of their location in their files. This differs from physical block caching, as
in UNIX, where the system maintains a cache of the physically addressed blocks of
the raw disk volume.
Cache management is implemented using mapped files. The actual caching is
performed by the memory manager. The cache manager need be concerned only
with deciding what parts of what files to cache, ensuring that cached data is
flushed to disk in a timely fashion, and managing the kernel virtual addresses used
to map the cached file pages. If a page needed for I/O to a file is not available in
the cache, the page will be faulted in using the memory manager. We will study
the cache manager in Sec. 11.6.
The security reference monitor enforces Windows’ elaborate security mech-
anisms, which support the international standards for computer security called
Common Criteria, an evolution of United States Department of Defense Orange
Book security requirements. These standards specify a large number of rules that a
conforming system must meet, such as authenticated login, auditing, zeroing of al-
located memory, and many more. One rules requires that all access checks be im-
plemented by a single module within the system. In Windows, this module is the
security reference monitor in the kernel. We will study the security system in more
detail in Sec. 11.10.
The executive contains a number of other components that we will briefly de-
scribe. The configuration manager is the executive component which imple-
ments the registry, as described earlier. The registry contains configuration data for
the system in file-system files called hives. The most critical hive is the SYSTEM
hive which is loaded into memory at boot time. Only after the executive layer has
successfully initialized its key components, including the I/O drivers that talk to
the system disk, is the in-memory copy of the hive reassociated with the copy in
the file system. Thus, if something bad happens while trying to boot the system,
the on-disk copy is much less likely to be corrupted.
The LPC component provides highly efficient interprocess communication
between processes running on the same system. It is one of the data transports
used by the standards-based remote procedure call facility to implement the
client/server style of computing. RPC also uses named pipes and TCP/IP as
transports.
LPC was substantially enhanced in Windows 8 (it is now called ALPC, for
Advanced LPC) to provide support for new features in RPC, including RPC from
kernel mode components, like drivers. LPC was a critical component in the origi-
nal design of NT because it is used by the subsystem layer to implement communi-
cation between library stub routines that run in each process and the subsystem
process which implements the facilities common to a particular operating system
personality, such as Win32 or POSIX.
Windows 8 implemented a publish/subscribe service called WNF (Windows
Notification Facility). WNF notifications are based on changes to an instance of
WNF state data. A publisher declares an instance of state data (up to 4 KB) and
SEC. 11.3 SYSTEM STRUCTURE 891

tells the operating system how long to maintain it (e.g., until the next reboot or
permanently). A publisher atomically updates the state as appropriate. Subscri-
bers can arrange to run code whenever an instance of state data is modified by a
publisher. Because the WNF state instances contain a fixed amount of preallocated
data, there is no queuing of data as in message-based IPC—with all the attendant
resource-management problems. Subscribers are guaranteed only that they can see
the latest version of a state instance.
This state-based approach gives WNF its principal advantage over other IPC
mechanisms: publishers and subscribers are decoupled and can start and stop inde-
pendently of each other. Publishers need not execute at boot time just to initialize
their state instances, as those can be persisted by the operating system across
reboots. Subscribers generally need not be concerned about past values of state
instances when they start running, as all they should need to know about the state’s
history is encapsulated in the current state. In scenarios where past state values
cannot be reasonably encapsulated, the current state can provide metadata for man-
aging historical state, say, in a file or in a persisted section object used as a circular
buffer. WNF is part of the native NT APIs and is not (yet) exposed via Win32 in-
terfaces. But it is extensively used internally by the system to implement Win32
and WinRT APIs.
In Windows NT 4.0, much of the code related to the Win32 graphical interface
was moved into the kernel because the then-current hardware could not provide the
required performance. This code previously resided in the csrss.exe subsystem
process which implemented the Win32 interfaces. The kernel-based GUI code
resides in a special kernel-driver, win32k.sys. This change was expected to im-
prove Win32 performance because the extra user-mode/kernel-mode transitions
and the cost of switching address spaces to implement communication via LPC
was eliminated. But it has not been as successful as expected because the re-
quirements on code running in the kernel are very strict, and the additional over-
head of running in kernel-mode offsets some of the gains from reducing switching
costs.

The Device Drivers

The final part of Fig. 11-11 consists of the device drivers. Device drivers in
Windows are dynamic link libraries which are loaded by the NTOS executive.
Though they are primarily used to implement the drivers for specific hardware,
such as physical devices and I/O buses, the device-driver mechanism is also used
as the general extensibility mechanism for kernel mode. As described above,
much of the Win32 subsystem is loaded as a driver.
The I/O manager organizes a data flow path for each instance of a device, as
shown in Fig. 11-14. This path is called a device stack and consists of private
instances of kernel device objects allocated for the path. Each device object in the
device stack is linked to a particular driver object, which contains the table of
routines to use for the I/O request packets that flow through the device stack. In
some cases the devices in the stack represent drivers whose sole purpose is to filter
I/O operations aimed at a particular device, bus, or network driver. Filtering is
used for a number of reasons. Sometimes preprocessing or postprocessing I/O op-
erations results in a cleaner architecture, while other times it is just pragmatic be-
cause the sources or rights to modify a driver are not available and so filtering is
used to work around the inability to modify those drivers. Filters can also imple-
ment completely new functionality, such as turning disks into partitions or multiple
disks into RAID volumes.

[Figure 11-14 (diagram): two parallel device stacks, one for the C: volume and one for D:. From top to bottom, each stack contains device objects for: two file-system filter drivers, the NTFS file-system driver, the volume manager driver, the disk class driver, and the disk miniport driver. An IRP enters at the top of each stack, and each device object links to a driver object holding the function entry points for that level.]

Figure 11-14. Simplified depiction of device stacks for two NTFS file volumes.
The I/O request packet is passed down the stack. The appropriate routines
from the associated drivers are called at each level in the stack. The device stacks
themselves consist of device objects allocated specifically to each stack.

The file systems are loaded as device drivers. Each instance of a volume for a
file system has a device object created as part of the device stack for that volume.
This device object will be linked to the driver object for the file system appropriate
to the volume’s formatting. Special filter drivers, called file-system filter drivers,
can insert device objects before the file-system device object to apply functionality
to the I/O requests being sent to each volume, such as inspecting data read or writ-
ten for viruses.
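The flow of an I/O request packet down such a stack can be sketched in C. The structure and routine names here are illustrative, not the real NT definitions: each device object links to a driver object whose dispatch routine handles the request and then forwards it to the next-lower device.

```c
#include <assert.h>
#include <stddef.h>

struct irp { int flagged_by_filter; int completed_by_fs; };
struct device_object;

struct driver_object {
    /* table of routines for requests that flow through this level */
    int (*dispatch)(struct device_object *dev, struct irp *irp);
};

struct device_object {
    struct driver_object *driver;   /* routines for this level  */
    struct device_object *lower;    /* next device in the stack */
};

int call_driver(struct device_object *dev, struct irp *irp)
{
    return dev->driver->dispatch(dev, irp);
}

/* A filter driver: preprocesses the request, then forwards it down. */
int filter_dispatch(struct device_object *dev, struct irp *irp)
{
    irp->flagged_by_filter = 1;     /* e.g., an antivirus scan happened here */
    return call_driver(dev->lower, irp);
}

/* The file-system driver at the bottom completes the request. */
int fs_dispatch(struct device_object *dev, struct irp *irp)
{
    (void)dev;
    irp->completed_by_fs = 1;
    return 0;                       /* success */
}
```

The filter never needs to know what lies below it; it only forwards to `lower`, which is how filters can be inserted into a stack without modifying the drivers beneath them.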
The network protocols, such as Windows’ integrated IPv4/IPv6 TCP/IP implementation, are also loaded as drivers using the I/O model. For compatibility with
the older MS-DOS-based Windows, the TCP/IP driver implements a special proto-
col for talking to network interfaces on top of the Windows I/O model. There are
other drivers that also implement such arrangements, which Windows calls mini-
ports. The shared functionality is in a class driver. For example, common func-
tionality for SCSI or IDE disks or USB devices is supplied by a class driver, which
miniport drivers for each particular type of such devices link to as a library.
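The class/miniport split can be sketched as a function table that each miniport plugs into (all names here are hypothetical): the class driver holds the logic common to every device of the class, and delegates only the hardware-specific step to the miniport.

```c
#include <assert.h>

/* The hardware-specific part a miniport must supply. */
struct miniport_ops {
    int (*transfer_block)(int lba);
};

/* Common class-driver logic shared by every miniport of this class:
 * validate the request, then delegate the device-specific transfer. */
int class_read(const struct miniport_ops *mp, int lba)
{
    if (lba < 0) return -1;          /* common parameter checking */
    return mp->transfer_block(lba);
}

/* One particular miniport, e.g. for a hypothetical SCSI controller.
 * The doubling is a stand-in for real hardware I/O. */
int scsi_transfer(int lba) { return lba * 2; }
```

A new controller type only needs a new `transfer_block`; everything in `class_read` is reused, which is the point of supplying the shared functionality as a library the miniports link to.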
We will not discuss any particular device driver in this chapter, but will provide
more detail about how the I/O manager interacts with device drivers in Sec. 11.7.

11.3.2 Booting Windows

Getting an operating system to run requires several steps. When a computer is turned on, the first processor is initialized by the hardware, and then set to start ex-
ecuting a program in memory. The only available code is in some form of non-
volatile CMOS memory that is initialized by the computer manufacturer (and
sometimes updated by the user, in a process called flashing). Because the software
persists in memory, and is only rarely updated, it is referred to as firmware. The
firmware is loaded on PCs by the manufacturer of either the parentboard or the
computer system. Historically PC firmware was a program called BIOS (Basic
Input/Output System), but most new computers use UEFI (Unified Extensible
Firmware Interface). UEFI improves over BIOS by supporting modern hard-
ware, providing a more modular CPU-independent architecture, and supporting an
extension model which simplifies booting over networks, provisioning new ma-
chines, and running diagnostics.
The main purpose of any firmware is to bring up the operating system by first
loading small bootstrap programs found at the beginning of the disk-drive parti-
tions. The Windows bootstrap programs know how to read enough information off
a file-system volume or network to find the stand-alone Windows BootMgr pro-
gram. BootMgr determines if the system had previously been hibernated or was in
stand-by mode (special power-saving modes that allow the system to turn back on
without restarting from the beginning of the bootstrap process). If so, BootMgr
loads and executes WinResume.exe. Otherwise it loads and executes WinLoad.exe
to perform a fresh boot. WinLoad loads the boot components of the system into
memory: the kernel/executive (normally ntoskrnl.exe), the HAL (hal.dll), the file
containing the SYSTEM hive, the Win32k.sys driver containing the kernel-mode
parts of the Win32 subsystem, as well as images of any other drivers that are listed
in the SYSTEM hive as boot drivers—meaning they are needed when the system
first boots. If the system has Hyper-V enabled, WinLoad also loads and starts the
hypervisor program.
Once the Windows boot components have been loaded into memory, control is
handed over to the low-level code in NTOS which proceeds to initialize the HAL,
kernel, and executive layers, link in the driver images, and access/update configu-
ration data in the SYSTEM hive. After all the kernel-mode components are ini-
tialized, the first user-mode process is created for running the smss.exe program (which is like /etc/init in UNIX systems).
Recent versions of Windows provide support for improving the security of the
system at boot time. Many newer PCs contain a TPM (Trusted Platform Mod-
ule), which is a chip on the parentboard. This chip is a secure cryptographic processor
which protects secrets, such as encryption/decryption keys. The system’s TPM can
be used to protect system keys, such as those used by BitLocker to encrypt the
disk. Protected keys are not revealed to the operating system until after the TPM has verified that an attacker has not tampered with them. It can also provide other
cryptographic functions, such as attesting to remote systems that the operating sys-
tem on the local system had not been compromised.
The Windows boot programs have logic to deal with common problems users
encounter when booting the system fails. Sometimes installation of a bad device
driver, or running a program like regedit (which can corrupt the SYSTEM hive),
will prevent the system from booting normally. There is support for ignoring re-
cent changes and booting to the last known good configuration of the system.
Other boot options include safe-boot, which turns off many optional drivers, and
the recovery console, which fires up a cmd.exe command-line window, providing
an experience similar to single-user mode in UNIX.
Another common problem for users has been that occasionally some Windows
systems appear to be very flaky, with frequent (seemingly random) crashes of both
the system and applications. Data taken from Microsoft’s Online Crash Analysis
program provided evidence that many of these crashes were due to bad physical
memory, so the boot process in Windows provides the option of running an exten-
sive memory diagnostic. Perhaps future PC hardware will commonly support ECC
(or maybe parity) for memory, but most of the desktop, notebook, and handheld
systems today are vulnerable to even single-bit errors in the tens of billions of
memory bits they contain.

11.3.3 Implementation of the Object Manager

The object manager is probably the single most important component in the
Windows executive, which is why we have already introduced many of its con-
cepts. As described earlier, it provides a uniform and consistent interface for man-
aging system resources and data structures, such as open files, processes, threads,
memory sections, timers, devices, drivers, and semaphores. Even more specialized
objects representing things like kernel transactions, profiles, security tokens, and
Win32 desktops are managed by the object manager. Device objects link together
the descriptions of the I/O system, including providing the link between the NT
namespace and file-system volumes. The configuration manager uses an object of
type key to link in the registry hives. The object manager itself has objects it uses
to manage the NT namespace and implement objects using a common facility. These are directory, symbolic link, and object-type objects.
The uniformity provided by the object manager has various facets. All these
objects use the same mechanism for how they are created, destroyed, and ac-
counted for in the quota system. They can all be accessed from user-mode proc-
esses using handles. There is a unified convention for managing pointer references
to objects from within the kernel. Objects can be given names in the NT name-
space (which is managed by the object manager). Dispatcher objects (objects that
begin with the common data structure for signaling events) can use common syn-
chronization and notification interfaces, like WaitForMultipleObjects. There is the
common security system with ACLs enforced on objects opened by name, and ac-
cess checks on each use of a handle. There are even facilities to help kernel-mode
developers debug problems by tracing the use of objects.
A key to understanding objects is to realize that an (executive) object is just a
data structure in the virtual memory accessible to kernel mode. These data struc-
tures are commonly used to represent more abstract concepts. As examples, exec-
utive file objects are created for each instance of a file-system file that has been
opened. Process objects are created to represent each process.
A consequence of the fact that objects are just kernel data structures is that
when the system is rebooted (or crashes) all objects are lost. When the system
boots, there are no objects present at all, not even the object-type descriptors. All
object types, and the objects themselves, have to be created dynamically by other
components of the executive layer by calling the interfaces provided by the object
manager. When objects are created and a name is specified, they can later be refer-
enced through the NT namespace. So building up the objects as the system boots
also builds the NT namespace.
Objects have a structure, as shown in Fig. 11-15. Each object contains a head-
er with certain information common to all objects of all types. The fields in this
header include the object’s name, the object directory in which it lives in the NT
namespace, and a pointer to a security descriptor representing the ACL for the ob-
ject.
The memory allocated for objects comes from one of two heaps (or pools) of
memory maintained by the executive layer. There are (malloc-like) utility func-
tions in the executive that allow kernel-mode components to allocate either page-
able or nonpageable kernel memory. Nonpageable memory is required for any
data structure or kernel-mode object that might need to be accessed from a CPU
priority level of 2 or more. This includes ISRs and DPCs (but not APCs) and the
thread scheduler itself. The page-fault handler also requires its data structures to
be allocated from nonpageable kernel memory to avoid recursion.
Most allocations from the kernel heap manager are achieved using per-proc-
essor lookaside lists which contain LIFO lists of allocations the same size. These
LIFOs are optimized for lock-free operation, improving the performance and
scalability of the system.
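The lookaside fast path can be sketched as follows (single-threaded for brevity; the real per-processor lists are lock-free): freed fixed-size blocks are pushed on a LIFO list and popped again on the next allocation, bypassing the general-purpose heap.

```c
#include <assert.h>
#include <stdlib.h>

struct block { struct block *next; char payload[64]; };

static struct block *lookaside_head = NULL;

void *lookaside_alloc(void)
{
    if (lookaside_head) {               /* fast path: reuse a freed block */
        struct block *b = lookaside_head;
        lookaside_head = b->next;
        return b;
    }
    return malloc(sizeof(struct block)); /* slow path: the real heap */
}

void lookaside_free(void *p)
{
    struct block *b = p;
    b->next = lookaside_head;           /* LIFO: most recently freed first */
    lookaside_head = b;
}
```

The LIFO order matters for performance: the most recently freed block is the most likely to still be hot in the CPU cache when it is handed out again.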

[Figure 11-15 (diagram): layout of an executive object. The object header contains: the object name; the directory in which the object lives; security information (who can use the object); quota charges (cost to use the object); the list of processes with handles; reference counts; and a pointer to the type object. The object data that follows the header is object-specific. The type object contains: the type name; access types; access rights; quota charges; whether the type is synchronizable and pageable; and the open, close, delete, query name, parse, and security methods.]

Figure 11-15. Structure of an executive object managed by the object manager.


Each object header contains a quota-charge field, which is the charge levied
against a process for opening the object. Quotas are used to keep a user from using
too many system resources. There are separate limits for nonpageable kernel
memory (which requires allocation of both physical memory and kernel virtual ad-
dresses) and pageable kernel memory (which uses up kernel virtual addresses).
When the cumulative charges for either memory type hit the quota limit, alloca-
tions for that process fail due to insufficient resources. Quotas also are used by the
memory manager to control working-set size, and by the thread manager to limit
the rate of CPU usage.
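The quota-charging idea can be sketched like this (simplified, with illustrative names; as the text describes, paged and nonpaged kernel memory are tracked separately):

```c
#include <assert.h>

struct quota_block {
    long paged_used, paged_limit;
    long nonpaged_used, nonpaged_limit;
};

/* Charge the process when an object is opened; fail on insufficient
 * resources if either cumulative charge would exceed its limit. */
int charge_quota(struct quota_block *q, long paged, long nonpaged)
{
    if (q->paged_used + paged > q->paged_limit ||
        q->nonpaged_used + nonpaged > q->nonpaged_limit)
        return -1;
    q->paged_used += paged;
    q->nonpaged_used += nonpaged;
    return 0;
}

/* Credit the charge back when the object is freed. */
void return_quota(struct quota_block *q, long paged, long nonpaged)
{
    q->paged_used -= paged;
    q->nonpaged_used -= nonpaged;
}
```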
Both physical memory and kernel virtual addresses are valuable resources.
When an object is no longer needed, it should be removed and its memory and ad-
dresses reclaimed. But if an object is reclaimed while it is still in use, then the
memory may be allocated to another object, and then the data structures are likely
to become corrupted. It is easy for this to happen in the Windows executive layer
because it is highly multithreaded, and implements many asynchronous operations
(functions that return to their caller before completing work on the data structures
passed to them).
To avoid freeing objects prematurely due to race conditions, the object man-
ager implements a reference counting mechanism and the concept of a referenced
pointer. A referenced pointer is needed to access an object whenever that object is
in danger of being deleted. Depending on the conventions regarding each particu-
lar object type, there are only certain times when an object might be deleted by an-
other thread. At other times the use of locks, dependencies between data struc-
tures, and even the fact that no other thread has a pointer to an object are sufficient
to keep the object from being prematurely deleted.
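A minimal sketch of the referenced-pointer discipline (modeled loosely on the kernel's reference/dereference pattern; the names are illustrative): as long as any referenced pointer is outstanding, the object cannot be reclaimed out from under its user.

```c
#include <assert.h>

struct object {
    long ref_count;
    int  deleted;     /* stands in for actually freeing the memory */
};

/* Take a referenced pointer before touching an object that some
 * other thread might otherwise delete. */
void ob_reference(struct object *o) { o->ref_count++; }

/* Drop the reference; only the last dereference reclaims the object. */
void ob_dereference(struct object *o)
{
    if (--o->ref_count == 0)
        o->deleted = 1;
}
```

In the real executive these counter updates are done with atomic operations, since the whole point is to be safe against the highly multithreaded, asynchronous environment the text describes.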

Handles

User-mode references to kernel-mode objects cannot use pointers because they are too difficult to validate. Instead, kernel-mode objects must be named in some
other way so the user code can refer to them. Windows uses handles to refer to
kernel-mode objects. Handles are opaque values which are converted by the object
manager into references to the specific kernel-mode data structure representing an
object. Figure 11-16 shows the handle-table data structure used to translate hand-
les into object pointers. The handle table is expandable by adding extra layers of
indirection. Each process has its own table, including the system process which
contains all the kernel threads not associated with a user-mode process.
[Figure 11-16 (diagram): a handle-table descriptor whose table pointer points to block A, a single page of 512 handle-table entries, each of which can point to an object.]

Figure 11-16. Handle table data structures for a minimal table using a single
page for up to 512 handles.

Figure 11-17 shows a handle table with two extra levels of indirection, the
maximum supported. It is sometimes convenient for code executing in kernel
mode to be able to use handles rather than referenced pointers. These are called
kernel handles and are specially encoded so that they can be distinguished from
user-mode handles. Kernel handles are kept in the system process’s handle table
and cannot be accessed from user mode. Just as most of the kernel virtual address
space is shared across all processes, the system handle table is shared by all kernel
components, no matter what the current user-mode process is.
Users can create new objects or open existing objects by making Win32 calls
such as CreateSemaphore or OpenSemaphore. These are calls to library proce-
dures that ultimately result in the appropriate system calls being made. The result
of any successful call that creates or opens an object is a 64-bit handle-table entry
that is stored in the process’ private handle table in kernel memory. The 32-bit
index of the handle’s logical position in the table is returned to the user to use on
subsequent calls. The 64-bit handle-table entry in the kernel contains two 32-bit
words. One word contains a 29-bit pointer to the object’s header. The low-order 3
bits are used as flags (e.g., whether the handle is inherited by processes it creates).
These 3 bits are masked off before the pointer is followed. The other word con-
tains a 32-bit rights mask. It is needed because permissions checking is done only

[Figure 11-17 (diagram): a handle-table descriptor whose table pointer points to block D, a top-level page of 32 handle-table pointers. Each of those points to a middle-level page of 1024 handle-table pointers (blocks B and E), and each middle-level pointer points to a leaf page of 512 handle-table entries (blocks A, C, and F) whose entries point to objects.]

Figure 11-17. Handle-table data structures for a maximal table of up to 16 million handles.

at the time the object is created or opened. If a process has only read permission to
an object, all the other rights bits in the mask will be 0s, giving the operating sys-
tem the ability to reject any operation on the object other than reads.
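The entry layout described above can be sketched as follows (the constants and names are illustrative, not the actual NT definitions): one word packs the object pointer with 3 low-order flag bits that are masked off before use, and the other word holds the access mask checked on every use of the handle.

```c
#include <assert.h>
#include <stdint.h>

#define HANDLE_FLAG_MASK 0x7u   /* low 3 bits, e.g. the inherit flag */
#define RIGHT_READ       0x1u
#define RIGHT_WRITE      0x2u

struct handle_entry {
    uint32_t object_and_flags;  /* 29-bit pointer + 3 flag bits */
    uint32_t granted_access;    /* rights granted when the handle was opened */
};

/* Mask the flag bits off before following the pointer. */
uint32_t entry_object(const struct handle_entry *e)
{
    return e->object_and_flags & ~HANDLE_FLAG_MASK;
}

/* Per-use access check: only rights granted at open time are allowed,
 * so a full permissions check is needed only at create/open. */
int check_access(const struct handle_entry *e, uint32_t wanted)
{
    return (e->granted_access & wanted) == wanted;
}
```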

The Object Namespace

Processes can share objects by having one process duplicate a handle to the ob-
ject into the others. But this requires that the duplicating process have handles to
the other processes, and is thus impractical in many situations, such as when the
processes sharing an object are unrelated, or are protected from each other. In
other cases it is important that objects persist even when they are not being used by
any process, such as device objects representing physical devices, or mounted vol-
umes, or the objects used to implement the object manager and the NT namespace
itself. To address general sharing and persistence requirements, the object man-
ager allows arbitrary objects to be given names in the NT namespace when they are
created. However, it is up to the executive component that manipulates objects of a
particular type to provide interfaces that support use of the object manager’s na-
ming facilities.
The NT namespace is hierarchical, with the object manager implementing di-
rectories and symbolic links. The namespace is also extensible, allowing any ob-
ject type to specify extensions of the namespace by specifying a Parse routine.
The Parse routine is one of the procedures that can be supplied for each object type
when it is created, as shown in Fig. 11-18.
The Open procedure is rarely used because the default object-manager behav-
ior is usually what is needed and so the procedure is specified as NULL for almost
all object types.

Procedure   When called                                 Notes

Open        For every new handle                        Rarely used
Parse       For object types that extend the namespace  Used for files and registry keys
Close       At last handle close                        Clean up visible side effects
Delete      At last pointer dereference                 Object is about to be deleted
Security    Get or set object’s security descriptor     Protection
QueryName   Get object’s name                           Rarely used outside kernel

Figure 11-18. Object procedures supplied when specifying a new object type.

The Close and Delete procedures represent different phases of being done with
an object. When the last handle for an object is closed, there may be actions neces-
sary to clean up the state and these are performed by the Close procedure. When
the final pointer reference is removed from the object, the Delete procedure is call-
ed so that the object can be prepared to be deleted and have its memory reused.
With file objects, both of these procedures are implemented as callbacks into the
I/O manager, which is the component that declared the file object type. The ob-
ject-manager operations result in I/O operations that are sent down the device stack
associated with the file object; the file system does most of the work.
The Parse procedure is used to open or create objects, like files and registry
keys, that extend the NT namespace. When the object manager is attempting to
open an object by name and encounters a leaf node in the part of the namespace it
manages, it checks to see if the type for the leaf-node object has specified a Parse
procedure. If so, it invokes the procedure, passing it any unused part of the path
name. Again using file objects as an example, the leaf node is a device object
representing a particular file-system volume. The Parse procedure is implemented
by the I/O manager, and results in an I/O operation to the file system to fill in a file
object to refer to an open instance of the file that the path name refers to on the
volume. We will explore this particular example step-by-step below.
The QueryName procedure is used to look up the name associated with an ob-
ject. The Security procedure is used to get, set, or delete the security descriptors
on an object. For most object types this procedure is supplied as a standard entry
point in the executive’s security reference monitor component.
Note that the procedures in Fig. 11-18 do not perform the most useful opera-
tions for each type of object, such as read or write on files (or down and up on
semaphores). Rather, the object manager procedures supply the functions needed
to correctly set up access to objects and then clean up when the system is finished
with them. The objects are made useful by the APIs that operate on the data struc-
tures the objects contain. System calls, like NtReadFile and NtWriteFile, use the
process’ handle table created by the object manager to translate a handle into a ref-
erenced pointer on the underlying object, such as a file object, which contains the
data that is needed to implement the system calls.

Apart from the object-type callbacks, the object manager also provides a set of
generic object routines for operations like creating objects and object types, dupli-
cating handles, getting a referenced pointer from a handle or name, adding and
subtracting reference counts to the object header, and NtClose (the generic function
that closes all types of handles).
Although the object namespace is crucial to the entire operation of the system,
few people know that it even exists because it is not visible to users without special
viewing tools. One such viewing tool is winobj, available for free at the URL
www.microsoft.com/technet/sysinternals. When run, this tool depicts an object
namespace that typically contains the object directories listed in Fig. 11-19 as well
as a few others.

Directory          Contents

\??                Starting place for looking up MS-DOS devices like C:
\DosDevices        Official name of \??, but really just a symbolic link to \??
\Device            All discovered I/O devices
\Driver            Objects corresponding to each loaded device driver
\ObjectTypes       The type objects such as those listed in Fig. 11-21
\Windows           Objects for sending messages to all the Win32 GUI windows
\BaseNamedObjects  User-created Win32 objects such as semaphores, mutexes, etc.
\Arcname           Partition names discovered by the boot loader
\NLS               National Language Support objects
\FileSystem        File-system driver objects and file system recognizer objects
\Security          Objects belonging to the security system
\KnownDLLs         Key shared libraries that are opened early and held open

Figure 11-19. Some typical directories in the object namespace.

The strangely named directory \?? contains the names of all the MS-DOS-style device names, such as A: for the floppy disk and C: for the first hard disk. These names are actually symbolic links to the directory \Device where the device objects live. The name \?? was chosen to make it alphabetically first so as to speed up lookup of all path names beginning with a drive letter. The contents of the other object directories should be self-explanatory.
As described above, the object manager keeps a separate handle count in every
object. This count is never larger than the referenced pointer count because each
valid handle has a referenced pointer to the object in its handle-table entry. The
reason for the separate handle count is that many types of objects may need to have
their state cleaned up when the last user-mode reference disappears, even though
they are not yet ready to have their memory deleted.
One example is file objects, which represent an instance of an opened file. In
Windows, files can be opened for exclusive access. When the last handle for a file
object is closed it is important to delete the exclusive access at that point rather
than wait for any incidental kernel references to eventually go away (e.g., after the
last flush of data from memory). Otherwise closing and reopening a file from user
mode may not work as expected because the file still appears to be in use.
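The interplay of the two counters can be sketched as follows (illustrative names): the handle count triggers the visible cleanup (the Close procedure) as soon as the last user-mode handle goes away, while the reference count keeps the memory alive until incidental kernel references are gone too (the Delete procedure).

```c
#include <assert.h>

struct object2 {
    long handle_count;
    long ref_count;    /* never smaller than handle_count */
    int  closed;       /* Close callback ran: visible side effects cleaned */
    int  deleted;      /* Delete callback ran: memory may be reused */
};

void ob_close_handle(struct object2 *o)
{
    if (--o->handle_count == 0)
        o->closed = 1;             /* e.g., drop exclusive file access now */
    if (--o->ref_count == 0)       /* each handle also held a reference */
        o->deleted = 1;
}

void ob_deref(struct object2 *o)   /* drop a kernel-only reference */
{
    if (--o->ref_count == 0)
        o->deleted = 1;
}
```

This is why closing and reopening an exclusively opened file works: the exclusive access is released at last handle close, even though a pending flush still holds a reference that delays the actual deletion.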
Though the object manager has comprehensive mechanisms for managing ob-
ject lifetimes within the kernel, neither the NT APIs nor the Win32 APIs provide a
reference mechanism for dealing with the use of handles across multiple concur-
rent threads in user mode. Thus, many multithreaded applications have race condi-
tions and bugs where they will close a handle in one thread before they are finished
with it in another. Or they may close a handle multiple times, or close a handle
that another thread is still using and reopen it to refer to a different object.
Perhaps the Windows APIs should have been designed to require a close API
per object type rather than the single generic NtClose operation. That would have
at least reduced the frequency of bugs due to user-mode threads closing the wrong
handles. Another solution might be to embed a sequence field in each handle in
addition to the index into the handle table.
To help application writers find problems like these in their programs, Win-
dows has an application verifier that software developers can download from
Microsoft. Similar to the verifier for drivers we will describe in Sec. 11.7, the ap-
plication verifier does extensive rules checking to help programmers find bugs that
might not be found by ordinary testing. It can also turn on a FIFO ordering for the
handle free list, so that handles are not reused immediately (i.e., turns off the bet-
ter-performing LIFO ordering normally used for handle tables). Keeping handles
from being reused quickly transforms situations where an operation uses the wrong
handle into use of a closed handle, which is easy to detect.
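The sequence-field idea suggested above can be sketched as follows (the table size and encoding are hypothetical): each handle embeds a generation number alongside its table index, so a handle whose slot has been closed and reused is detected as stale instead of silently referring to a different object.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 8

struct slot { uint16_t seq; int in_use; };
static struct slot table[TABLE_SIZE];

/* Open an object into a slot: bump the slot's generation and encode
 * generation + index into the returned handle value. */
uint32_t open_slot(unsigned idx)
{
    table[idx].in_use = 1;
    table[idx].seq++;
    return ((uint32_t)table[idx].seq << 16) | idx;
}

void close_slot(uint32_t h) { table[h & 0xFFFFu].in_use = 0; }

/* A handle is valid only if it matches the slot's current generation. */
int handle_valid(uint32_t h)
{
    unsigned idx = h & 0xFFFFu;
    return table[idx].in_use && table[idx].seq == (uint16_t)(h >> 16);
}
```

With this check, the wrong-handle races described above fail cleanly at validation time rather than operating on whichever object happens to occupy the reused slot.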
The device object is one of the most important and versatile kernel-mode ob-
jects in the executive. The type is specified by the I/O manager, which along with
the device drivers, are the primary users of device objects. Device objects are
closely related to drivers, and each device object usually has a link to a specific
driver object, which describes how to access the I/O processing routines for the
driver corresponding to the device.
Device objects represent hardware devices, interfaces, and buses, as well as
logical disk partitions, disk volumes, and even file systems and kernel extensions
like antivirus filters. Many device drivers are given names, so they can be accessed
without having to open handles to instances of the devices, as in UNIX. We will
use device objects to illustrate how the Parse procedure is used, as illustrated in
Fig. 11-20:

1. When an executive component, such as the I/O manager implementing the native system call NtCreateFile, calls ObOpenObjectByName in the object manager, it passes a Unicode path name for the NT namespace, say \??\C:\foo\bar.

[Figure 11-20 (diagram): a Win32 CreateFile(C:\foo\bar) call crosses from user mode to NtCreateFile in the I/O manager (steps 1 and 10); the object manager resolves OpenObjectByName(\??\C:\foo\bar) through the \?? directory and the symbolic link to \Devices\Harddisk1 (steps 2, 3, and 9); IopParseDevice builds an IRP and allocates a file object (step 4); IoCallDriver sends the IRP down the C: device stack through the file-system filters to NTFS (steps 5 and 6), where NtfsCreateFile runs (step 7) and IoCompleteRequest returns the result back up the stack (step 8).]

Figure 11-20. I/O and object manager steps for creating/opening a file and getting back a file handle.

2. The object manager searches through directories and symbolic links and ultimately finds that \??\C: refers to a device object (a type defined by the I/O manager). The device object is a leaf node in the part of the NT namespace that the object manager manages.

3. The object manager then calls the Parse procedure for this object
type, which happens to be IopParseDevice implemented by the I/O
manager. It passes not only a pointer to the device object it found (for
C:), but also the remaining string \foo\bar.

4. The I/O manager will create an IRP (I/O Request Packet), allocate a
file object, and send the request to the stack of I/O devices determined
by the device object found by the object manager.

5. The IRP is passed down the I/O stack until it reaches a device object
representing the file-system instance for C:. At each stage, control is
passed to an entry point into the driver object associated with the de-
vice object at that level. The entry point used here is for CREATE
operations, since the request is to create or open a file named
\foo\bar on the volume.
SEC. 11.3 SYSTEM STRUCTURE 903

6. The device objects encountered as the IRP heads toward the file sys-
tem represent file-system filter drivers, which may modify the I/O op-
eration before it reaches the file-system device object. Typically
these intermediate devices represent system extensions like antivirus
filters.

7. The file-system device object has a link to the file-system driver ob-
ject, say NTFS. So, the driver object contains the address of the
CREATE operation within NTFS.

8. NTFS will fill in the file object and return it to the I/O manager,
which returns back up through all the devices on the stack until Iop-
ParseDevice returns to the object manager (see Sec. 11.8).

9. The object manager is finished with its namespace lookup. It received
back an initialized object from the Parse routine (which happens to be
a file object, not the original device object it found). So the object
manager creates a handle for the file object in the handle table of the
current process, and returns the handle to its caller.

10. The final step is to return back to the user-mode caller, which in this
example is the Win32 API CreateFile, which will return the handle to
the application.
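From user mode, the entire ten-step sequence above is hidden behind a single Win32 call. The following is a minimal sketch, not the library's actual implementation; the path and the error handling are illustrative only:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* CreateFileW enters NtCreateFile, which drives the object-manager
     * and I/O-manager steps described above and hands back the handle
     * created in step 9. */
    HANDLE h = CreateFileW(L"C:\\foo\\bar",
                           GENERIC_READ,           /* desired access      */
                           FILE_SHARE_READ,        /* share mode          */
                           NULL,                   /* security attributes */
                           OPEN_EXISTING,          /* open, do not create */
                           FILE_ATTRIBUTE_NORMAL,
                           NULL);                  /* template file       */
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "open failed: %lu\n", GetLastError());
        return 1;
    }
    CloseHandle(h);   /* releases the handle-table entry */
    return 0;
}
```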

Executive components can create new types dynamically, by calling the
ObCreateObjectType interface to the object manager. There is no definitive list of
object types and they change from release to release. Some of the more common
ones in Windows are listed in Fig. 11-21. Let us briefly go over the object types in
the figure.
Process and thread are obvious. There is one object for every process and
every thread, which holds the main properties needed to manage the process or
thread. The next three objects, semaphore, mutex, and event, all deal with
interprocess synchronization. Semaphores and mutexes work as expected, but with
various extra bells and whistles (e.g., maximum values and timeouts). Events can
be in one of two states: signaled or nonsignaled. If a thread waits on an event that
is in signaled state, the thread is released immediately. If the event is in nonsig-
naled state, it blocks until some other thread signals the event, which releases ei-
ther all blocked threads (notification events) or just the first blocked thread (syn-
chronization events). An event can also be set up so that after a signal has been
successfully waited for, it will automatically revert to the nonsignaled state, rather
than staying in the signaled state.
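The distinction between notification and synchronization events is selected by the second parameter of CreateEvent. A brief sketch of the auto-reset behavior just described:

```c
#include <windows.h>

int main(void)
{
    /* FALSE => auto-reset ("synchronization") event: a successful wait
     * releases one thread and reverts the event to nonsignaled.  TRUE
     * would create a manual-reset ("notification") event, which releases
     * all waiters and stays signaled until ResetEvent is called. */
    HANDLE ev = CreateEventW(NULL, FALSE, FALSE, NULL);

    SetEvent(ev);                        /* move event to signaled state */
    WaitForSingleObject(ev, INFINITE);   /* wait succeeds immediately    */
    /* The auto-reset event is now nonsignaled again. */

    CloseHandle(ev);
    return 0;
}
```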
Port, timer, and queue objects also relate to communication and synchroniza-
tion. Ports are channels between processes for exchanging LPC messages. Timers

Type Description
Process User process
Thread Thread within a process
Semaphore Counting semaphore used for interprocess synchronization
Mutex Binary semaphore used to enter a critical region
Event Synchronization object with persistent state (signaled/not)
ALPC port Mechanism for interprocess message passing
Timer Object allowing a thread to sleep for a fixed time interval
Queue Object used for completion notification on asynchronous I/O
Open file Object associated with an open file
Access token Security descriptor for some object
Profile Data structure used for profiling CPU usage
Section Object used for representing mappable files
Key Registry key, used to attach registry to object-manager namespace
Object directory Directory for grouping objects within the object manager
Symbolic link Refers to another object manager object by path name
Device I/O device object for a physical device, bus, driver, or volume instance
Device driver Each loaded device driver has its own object

Figure 11-21. Some common executive object types managed by the object
manager.

provide a way to block for a specific time interval. Queues (known internally as
KQUEUES) are used to notify threads that a previously started asynchronous I/O
operation has completed or that a port has a message waiting. Queues are designed
to manage the level of concurrency in an application, and are also used in high-per-
formance multiprocessor applications, like SQL.
Open file objects are created when a file is opened. Files that are not opened
do not have objects managed by the object manager. Access tokens are security
objects. They identify a user and tell what special privileges the user has, if any.
Profiles are structures used for storing periodic samples of the program counter of
a running thread to see where the program is spending its time.
Sections are used to represent memory objects that applications can ask the
memory manager to map into their address space. They record the section of the
file (or page file) that represents the pages of the memory object when they are on
disk. Keys represent the mount point for the registry namespace on the object
manager namespace. There is usually only one key object, named \REGISTRY,
which connects the names of the registry keys and values to the NT namespace.
Object directories and symbolic links are entirely local to the part of the NT
namespace managed by the object manager. They are similar to their file system
counterparts: directories allow related objects to be collected together. Symbolic

links allow a name in one part of the object namespace to refer to an object in a
different part of the object namespace.
Each device known to the operating system has one or more device objects that
contain information about it and are used to refer to the device by the system.
Finally, each device driver that has been loaded has a driver object in the object
space. The driver objects are shared by all the device objects that represent
instances of the devices controlled by those drivers.
Other objects (not shown) have more specialized purposes, such as interacting
with kernel transactions, or the Win32 thread pool’s worker thread factory.

11.3.4 Subsystems, DLLs, and User-Mode Services

Going back to Fig. 11-4, we see that the Windows operating system consists of
components in kernel mode and components in user mode. We have now com-
pleted our overview of the kernel-mode components; so it is time to look at the
user-mode components, of which three kinds are particularly important to Win-
dows: environment subsystems, DLLs, and service processes.
We have already described the Windows subsystem model; we will not go into
more detail now other than to mention that in the original design of NT, subsys-
tems were seen as a way of supporting multiple operating system personalities with
the same underlying software running in kernel mode. Perhaps this was an attempt
to avoid having operating systems compete for the same platform, as VMS and
Berkeley UNIX did on DEC’s VAX. Or maybe it was just that nobody at Micro-
soft knew whether OS/2 would be a success as a programming interface, so they
were hedging their bets. In any case, OS/2 became irrelevant, and a latecomer, the
Win32 API designed to be shared with Windows 95, became dominant.
A second key aspect of the user-mode design of Windows is the dynamic link
library (DLL) which is code that is linked to executable programs at run time rath-
er than compile time. Shared libraries are not a new concept, and most modern op-
erating systems use them. In Windows, almost all libraries are DLLs, from the
system library ntdll.dll that is loaded into every process to the high-level libraries
of common functions that are intended to allow rampant code-reuse by application
developers.
DLLs improve the efficiency of the system by allowing common code to be
shared among processes, reduce program load times from disk by keeping com-
monly used code around in memory, and increase the serviceability of the system
by allowing operating system library code to be updated without having to recom-
pile or relink all the application programs that use it.
On the other hand, shared libraries introduce the problem of versioning and in-
crease the complexity of the system because changes introduced into a shared li-
brary to help one particular program have the potential of exposing latent bugs in
other applications, or just breaking them due to changes in the implementation—a
problem that in the Windows world is referred to as DLL hell.

The implementation of DLLs is simple in concept. Instead of the compiler
emitting code that calls directly to subroutines in the same executable image, a
level of indirection is introduced: the IAT (Import Address Table). When an ex-
ecutable is loaded it is searched for the list of DLLs that must also be loaded (this
will be a graph in general, as the listed DLLs will themselves generally list
other DLLs needed in order to run). The required DLLs are loaded and the IAT is
filled in for them all.
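Besides load-time linking through the IAT, a program can also resolve imports explicitly at run time. A hedged sketch; the DLL name plugin.dll and its export Frobnicate are hypothetical names used only for illustration:

```c
#include <windows.h>
#include <stdio.h>

/* Assumed export signature of the hypothetical "Frobnicate" function. */
typedef int (*FrobFn)(int);

int main(void)
{
    /* LoadLibraryW maps the DLL into the process (running its attach
     * routine), and GetProcAddress looks up an exported entry point,
     * doing by hand what the loader does for the IAT. */
    HMODULE mod = LoadLibraryW(L"plugin.dll");
    if (mod == NULL)
        return 1;

    FrobFn frob = (FrobFn)GetProcAddress(mod, "Frobnicate");
    if (frob != NULL)
        printf("%d\n", frob(42));

    FreeLibrary(mod);
    return 0;
}
```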
The reality is more complicated. Another problem is that the graphs that
represent the relationships between DLLs can contain cycles, or have nondetermin-
istic behaviors, so computing the list of DLLs to load can result in a sequence that
does not work. Also, in Windows the DLL libraries are given a chance to run code
whenever they are loaded into a process, or when a new thread is created. Gener-
ally, this is so they can perform initialization, or allocate per-thread storage, but
many DLLs perform a lot of computation in these attach routines. If any of the
functions called in an attach routine needs to examine the list of loaded DLLs, a
deadlock can occur, hanging the process.
DLLs are used for more than just sharing common code. They enable a host-
ing model for extending applications. Internet Explorer can download and link to
DLLs called ActiveX controls. At the other end of the Internet, Web servers also
load dynamic code to produce a better Web experience for the pages they display.
Applications like Microsoft Office link and run DLLs to allow Office to be used as
a platform for building other applications. The COM (component object model)
style of programming allows programs to dynamically find and load code written
to provide a particular published interface, which leads to in-process hosting of
DLLs by almost all the applications that use COM.
All this dynamic loading of code has resulted in even greater complexity for
the operating system, as library version management is not just a matter of match-
ing executables to the right versions of the DLLs, but sometimes loading multiple
versions of the same DLL into a process—which Microsoft calls side-by-side. A
single program can host two different dynamic code libraries, each of which may
want to load the same Windows library—yet have different version requirements
for that library.
A better solution would be hosting code in separate processes. But out-of-
process hosting of code has lower performance, and makes for a more com-
plicated programming model in many cases. Microsoft has yet to develop a good
solution for all of this complexity in user mode. It makes one yearn for the relative
simplicity of kernel mode.
One of the reasons that kernel mode has less complexity than user mode is that
it supports relatively few extensibility opportunities outside of the device-driver
model. In Windows, system functionality is extended by writing user-mode ser-
vices. This worked well enough for subsystems, and works even better when only
a few new services are being provided rather than a complete operating system per-
sonality. There are few functional differences between services implemented in the

kernel and services implemented in user-mode processes. Both the kernel and
process provide private address spaces where data structures can be protected and
service requests can be scrutinized.
However, there can be significant performance differences between services in
the kernel vs. services in user-mode processes. Entering the kernel from user mode
is slow on modern hardware, but not as slow as having to do it twice because you
are switching back and forth to another process. Also cross-process communica-
tion has lower bandwidth.
Kernel-mode code can (carefully) access data at the user-mode addresses passed
as parameters to its system calls. With user-mode services, either those data
must be copied to the service process, or some games must be played by mapping
memory back and forth (the ALPC facilities in Windows handle this under the covers).
In the future it is possible that the hardware costs of crossing between address
spaces and protection modes will be reduced, or perhaps even become irrelevant.
The Singularity project in Microsoft Research (Fandrich et al., 2006) uses run-time
techniques, like those used with C# and Java, to make protection a completely soft-
ware issue. No hardware switching between address spaces or protection modes is
required.
Windows makes significant use of user-mode service processes to extend the
functionality of the system. Some of these services are strongly tied to the opera-
tion of kernel-mode components, such as lsass.exe which is the local security
authentication service which manages the token objects that represent user-identity,
as well as managing encryption keys used by the file system. The user-mode plug-
and-play manager is responsible for determining the correct driver to use when a
new hardware device is encountered, installing it, and telling the kernel to load it.
Many facilities provided by third parties, such as antivirus and digital rights man-
agement, are implemented as a combination of kernel-mode drivers and user-mode
services.
The Windows taskmgr.exe has a tab which identifies the services running on
the system. Multiple services can be seen to be running in the same process
(svchost.exe). Windows does this for many of its own boot-time services to reduce
the time needed to start up the system. Services can be combined into the same
process as long as they can safely operate with the same security credentials.
Within each of the shared service processes, individual services are loaded as
DLLs. They normally share a pool of threads using the Win32 thread-pool facility,
so that only the minimal number of threads needs to be running across all the resi-
dent services.
Services are common sources of security vulnerabilities in the system because
they are often accessible remotely (depending on the TCP/IP firewall and IP Secu-
rity settings), and not all programmers who write services are as careful as they
should be to validate the parameters and buffers that are passed in via RPC.
The number of services running constantly in Windows is staggering. Yet few
of those services ever receive a single request, though if they do it is likely to be

from an attacker attempting to exploit a vulnerability. As a result, more and more
services in Windows are turned off by default, particularly on versions of Windows
Server.

11.4 PROCESSES AND THREADS IN WINDOWS


Windows has a number of concepts for managing the CPU and grouping re-
sources together. In the following sections we will examine these, discussing some
of the relevant Win32 API calls, and show how they are implemented.

11.4.1 Fundamental Concepts

In Windows processes are containers for programs. They hold the virtual ad-
dress space, the handles that refer to kernel-mode objects, and threads. In their
role as a container for threads they hold common resources used for thread execu-
tion, such as the pointer to the quota structure, the shared token object, and default
parameters used to initialize threads—including the priority and scheduling class.
Each process has user-mode system data, called the PEB (Process Environment
Block). The PEB includes the list of loaded modules (i.e., the EXE and DLLs),
the memory containing environment strings, the current working directory, and
data for managing the process’ heaps—as well as lots of special-case Win32 cruft
that has been added over time.
Threads are the kernel’s abstraction for scheduling the CPU in Windows. Pri-
orities are assigned to each thread based on the priority value in the containing
process. Threads can also be affinitized to run only on certain processors. This
helps concurrent programs running on multicore chips or multiprocessors to expli-
citly spread out work. Each thread has two separate call stacks, one for execution
in user mode and one for kernel mode. There is also a TEB (Thread Environ-
ment Block) that keeps user-mode data specific to the thread, including per-thread
storage (Thread Local Storage) and fields for Win32, language and cultural local-
ization, and other specialized fields that have been added by various facilities.
Besides the PEBs and TEBs, there is another data structure that kernel mode
shares with each process, namely, user shared data. This is a page that is writable
by the kernel, but read-only in every user-mode process. It contains a number of
values maintained by the kernel, such as various forms of time, version infor-
mation, amount of physical memory, and a large number of shared flags used by
various user-mode components, such as COM, terminal services, and the debug-
gers. The use of this read-only shared page is purely a performance optimization,
as the values could also be obtained by a system call into kernel mode. But system
calls are much more expensive than a single memory access, so for some sys-
tem-maintained fields, such as the time, this makes a lot of sense. The other fields,
such as the current time zone, change infrequently (except on airborne computers),

but code that relies on these fields must query them often just to see if they have
changed. As with many performance hacks, it is a bit ugly, but it works.
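The shared page can be read directly from user mode. The sketch below assumes the well-known fixed mapping address and the documented offsets of the version fields; reading kernel structures by raw offset like this is for illustration only, not a supported programming practice:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* The user shared data page is mapped read-only at the same address
     * in every process (0x7FFE0000 on x86 and x64).  Offsets 0x26C and
     * 0x270 are the NtMajorVersion and NtMinorVersion fields. */
    const unsigned char *shared = (const unsigned char *)0x7FFE0000;
    ULONG major = *(const ULONG *)(shared + 0x26C);
    ULONG minor = *(const ULONG *)(shared + 0x270);

    /* No system call was needed to read these kernel-maintained values. */
    printf("NT version %lu.%lu\n", major, minor);
    return 0;
}
```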

Processes

Processes are created from section objects, each of which describes a memory
object backed by a file on disk. When a process is created, the creating process re-
ceives a handle that allows it to modify the new process by mapping sections, allo-
cating virtual memory, writing parameters and environmental data, duplicating file
descriptors into its handle table, and creating threads. This is very different from
how processes are created in UNIX and reflects the difference in the target systems
for the original designs of UNIX vs. Windows.
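The Win32 layer wraps these handle-based native operations in CreateProcess. A minimal sketch, with error handling reduced to the essentials:

```c
#include <windows.h>

int main(void)
{
    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    wchar_t cmd[] = L"notepad.exe";   /* command line must be writable */

    /* One call takes the place of the UNIX fork/exec pair: the parent
     * receives handles to both the new process and its initial thread,
     * which it can use to further modify the child. */
    if (!CreateProcessW(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL,
                        &si, &pi))
        return 1;

    WaitForSingleObject(pi.hProcess, INFINITE);  /* wait for child exit */
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}
```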
As described in Sec. 11.1, UNIX was designed for 16-bit single-processor sys-
tems that used swapping to share memory among processes. In such systems, hav-
ing the process as the unit of concurrency and using an operation like fork to create
processes was a brilliant idea. To run a new process with small memory and no
virtual memory hardware, processes in memory have to be swapped out to disk to
create space. UNIX originally implemented fork simply by swapping out the par-
ent process and handing its physical memory to the child. The operation was al-
most free.
In contrast, the hardware environment at the time Cutler’s team wrote NT was
32-bit multiprocessor systems with virtual memory hardware to share 1–16 MB of
physical memory. Multiprocessors provide the opportunity to run parts of pro-
grams concurrently, so NT used processes as containers for sharing memory and
object resources, and used threads as the unit of concurrency for scheduling.
Of course, the systems of the next few years will look nothing like either of
these target environments, having 64-bit address spaces with dozens (or hundreds)
of CPU cores per chip socket and dozens or hundreds of gigabytes of physical memo-
ry. This memory may be radically different from current RAM as well. Current
RAM loses its contents when powered off, but phase-change memories now in
the pipeline keep their values (like disks) even when powered off. Also expect
flash devices to replace hard disks, broader support for virtualization, ubiquitous
networking, and support for synchronization innovations like transactional mem-
ory. Windows and UNIX will continue to be adapted to new hardware realities,
but what will be really interesting is to see what new operating systems are de-
signed specifically for systems based on these advances.

Jobs and Fibers

Windows can group processes together into jobs. Jobs group processes in
order to apply constraints to them and the threads they contain, such as limiting re-
source use via a shared quota or enforcing a restricted token that prevents threads
from accessing many system objects. The most significant property of jobs for

resource management is that once a process is in a job, all processes that threads
in those processes create will also be in the job. There is no escape. As suggested by
the name, jobs were designed for situations that are more like batch processing
than ordinary interactive computing.
In Modern Windows, jobs are used to group together the processes that are ex-
ecuting a modern application. The processes that comprise a running application
need to be identified to the operating system so it can manage the entire application
on behalf of the user.
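A sketch of placing a process under a job-object constraint, assuming a simple active-process limit as the example restriction:

```c
#include <windows.h>

int main(void)
{
    HANDLE job = CreateJobObjectW(NULL, NULL);
    JOBOBJECT_BASIC_LIMIT_INFORMATION limits = { 0 };

    /* Allow at most four simultaneous processes in the job. */
    limits.LimitFlags = JOB_OBJECT_LIMIT_ACTIVE_PROCESS;
    limits.ActiveProcessLimit = 4;
    SetInformationJobObject(job, JobObjectBasicLimitInformation,
                            &limits, sizeof(limits));

    /* Put the current process in the job; every process its threads
     * create from now on will be in the job as well -- no escape. */
    AssignProcessToJobObject(job, GetCurrentProcess());
    return 0;
}
```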
Figure 11-22 shows the relationship between jobs, processes, threads, and
fibers. Jobs contain processes. Processes contain threads. But threads do not con-
tain fibers. The relationship of threads to fibers is normally many-to-many.

job

process process

thread thread thread thread thread

fiber fiber fiber fiber fiber fiber fiber fiber

Figure 11-22. The relationship between jobs, processes, threads, and fibers.
Jobs and fibers are optional; not all processes are in jobs or contain fibers.

Fibers are created by allocating a stack and a user-mode fiber data structure for
storing registers and data associated with the fiber. Threads are converted to fibers,
but fibers can also be created independently of threads. Such a fiber will not run
until a fiber already running on a thread explicitly calls SwitchToFiber to run the
fiber. Threads could attempt to switch to a fiber that is already running, so the pro-
grammer must provide synchronization to prevent this.
The primary advantage of fibers is that the overhead of switching between
fibers is much lower than switching between threads. A thread switch requires
entering and exiting the kernel. A fiber switch saves and restores a few registers
without changing modes at all.
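The cooperative switching just described can be sketched with the Win32 fiber calls; this is a minimal single-thread example, not a pattern for production use:

```c
#include <windows.h>
#include <stdio.h>

static LPVOID main_fiber;   /* fiber identity of the original thread */

static VOID CALLBACK worker(LPVOID arg)
{
    printf("running in fiber: %s\n", (const char *)arg);
    SwitchToFiber(main_fiber);   /* cooperative: must yield back explicitly */
}

int main(void)
{
    /* A thread must convert itself to a fiber before it can switch. */
    main_fiber = ConvertThreadToFiber(NULL);

    LPVOID fib = CreateFiber(0, worker, "worker");
    SwitchToFiber(fib);          /* user-mode only: no kernel transition */

    DeleteFiber(fib);
    return 0;
}
```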
Although fibers are cooperatively scheduled, if there are multiple threads
scheduling the fibers, a lot of careful synchronization is required to make sure
fibers do not interfere with each other. To simplify the interaction between threads
and fibers, it is often useful to create only as many threads as there are processors
to run them, and affinitize the threads to each run only on a distinct set of available
processors, or even just one processor.
Each thread can then run a particular subset of the fibers, establishing a one-to-
many relationship between threads and fibers which simplifies synchronization.
Even so there are still many difficulties with fibers. Most of the Win32 libraries

are completely unaware of fibers, and applications that attempt to use fibers as if
they were threads will encounter various failures. The kernel has no knowledge of
fibers, and when a fiber enters the kernel, the thread it is executing on may block
and the kernel will schedule an arbitrary thread on the processor, making it
unavailable to run other fibers. For these reasons fibers are rarely used except
when porting code from other systems that explicitly need the functionality pro-
vided by fibers.

Thread Pools and User-Mode Scheduling

The Win32 thread pool is a facility that builds on top of the Windows thread
model to provide a better abstraction for certain types of programs. Thread crea-
tion is too expensive to be invoked every time a program wants to execute a small
task concurrently with other tasks in order to take advantage of multiple proc-
essors. Tasks can be grouped together into larger tasks but this reduces the amount
of exploitable concurrency in the program. An alternative approach is for a pro-
gram to allocate a limited number of threads, and maintain a queue of tasks that
need to be run. As a thread finishes the execution of a task, it takes another one
from the queue. This model separates the resource-management issues (how many
processors are available and how many threads should be created) from the pro-
gramming model (what is a task and how are tasks synchronized). Windows for-
malizes this solution into the Win32 thread pool, a set of APIs for automatically
managing a dynamic pool of threads and dispatching tasks to them.
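Submitting a small task to the default pool can be sketched with the Vista-era thread-pool APIs; the callback body here is illustrative:

```c
#include <windows.h>
#include <stdio.h>

static VOID CALLBACK task(PTP_CALLBACK_INSTANCE inst, PVOID ctx, PTP_WORK work)
{
    /* Runs on one of the pool's worker threads. */
    printf("task on thread %lu\n", GetCurrentThreadId());
}

int main(void)
{
    /* Create a work object bound to the default process-wide pool and
     * submit it several times, instead of creating a thread per task. */
    PTP_WORK work = CreateThreadpoolWork(task, NULL, NULL);
    if (work == NULL)
        return 1;

    for (int i = 0; i < 4; i++)
        SubmitThreadpoolWork(work);

    WaitForThreadpoolWorkCallbacks(work, FALSE);  /* wait for all runs */
    CloseThreadpoolWork(work);
    return 0;
}
```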
Thread pools are not a perfect solution, because when a thread blocks for some
resource in the middle of a task, the thread cannot switch to a different task. Thus,
the thread pool will inevitably create more threads than there are processors avail-
able, so that runnable threads are available to be scheduled even when other threads
have blocked. The thread pool is integrated with many of the common synchroni-
zation mechanisms, such as awaiting the completion of I/O or blocking until a ker-
nel event is signaled. Synchronization can be used as triggers for queuing a task so
threads are not assigned the task before it is ready to run.
The implementation of the thread pool uses the same queue facility provided
for synchronization with I/O completion, together with a kernel-mode thread fac-
tory which adds more threads to the process as needed to keep the available num-
ber of processors busy. Small tasks exist in many applications, but particularly in
those that provide services in the client/server model of computing, where a stream
of requests are sent from the clients to the server. Use of a thread pool for these
scenarios improves the efficiency of the system by reducing the overhead of creat-
ing threads and moving the decisions about how to manage the threads in the pool
out of the application and into the operating system.
What programmers see as a single Windows thread is actually two threads: one
that runs in kernel mode and one that runs in user mode. This is precisely the same

model that UNIX has. Each of these threads is allocated its own stack and its own
memory to save its registers when not running. The two threads appear to be a sin-
gle thread because they do not run at the same time. The user thread operates as an
extension of the kernel thread, running only when the kernel thread switches to it
by returning from kernel mode to user mode. When a user thread wants to perform
a system call, encounters a page fault, or is preempted, the system enters kernel
mode and switches back to the corresponding kernel thread. It is normally not pos-
sible to switch between user threads without first switching to the corresponding
kernel thread, switching to the new kernel thread, and then switching to its user
thread.
Most of the time the difference between user and kernel threads is transparent
to the programmer. However, in Windows 7 Microsoft added a facility called
UMS (User-Mode Scheduling), which exposes the distinction. UMS is similar to
facilities used in other operating systems, such as scheduler activations. It can be
used to switch between user threads without first having to enter the kernel, provid-
ing the benefits of fibers, but with much better integration into Win32—since it
uses real Win32 threads.
The implementation of UMS has three key elements:

1. User-mode switching: a user-mode scheduler can be written to switch
between user threads without entering the kernel. When a user thread
does enter kernel mode, UMS will find the corresponding kernel
thread and immediately switch to it.
2. Reentering the user-mode scheduler: when the execution of a kernel
thread blocks to await the availability of a resource, UMS switches to
a special user thread and executes the user-mode scheduler so that a
different user thread can be scheduled to run on the current processor.
This allows the current process to continue using the current proc-
essor for its full turn rather than having to get in line behind other
processes when one of its threads blocks.
3. System-call completion: after a blocked kernel thread eventually
finishes, a notification containing the results of the system call is
queued for the user-mode scheduler so that it can switch to the corres-
ponding user thread next time it makes a scheduling decision.

UMS does not include a user-mode scheduler as part of Windows. UMS is in-
tended as a low-level facility for use by run-time libraries used by programming-
language and server applications to implement lightweight threading models that
do not conflict with kernel-level thread scheduling. These run-time libraries will
normally implement a user-mode scheduler best suited to their environment. A
summary of these abstractions is given in Fig. 11-23.

Name Description Notes


Job Collection of processes that share quotas and limits Used in AppContainers
Process Container for holding resources
Thread Entity scheduled by the kernel
Fiber Lightweight thread managed entirely in user space Rarely used
Thread pool Task-oriented programming model Built on top of threads
User-mode thread Abstraction allowing user-mode thread switching An extension of threads

Figure 11-23. Basic concepts used for CPU and resource management.

Threads

Every process normally starts out with one thread, but new ones can be created
dynamically. Threads form the basis of CPU scheduling, as the operating system
always selects a thread to run, not a process. Consequently, every thread has a
state (ready, running, blocked, etc.), whereas processes do not have scheduling
states. Threads can be created dynamically by a Win32 call that specifies the ad-
dress within the enclosing process’ address space at which it is to start running.
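The Win32 call in question is CreateThread; a minimal sketch of creating a thread at a given start address and waiting for it:

```c
#include <windows.h>
#include <stdio.h>

static DWORD WINAPI thread_main(LPVOID arg)
{
    /* The start address supplied to CreateThread runs here, inside the
     * enclosing process' address space. */
    printf("hello from thread %lu\n", GetCurrentThreadId());
    return 0;
}

int main(void)
{
    DWORD tid;   /* receives the new thread's ID (a multiple of four) */
    HANDLE t = CreateThread(NULL, 0, thread_main, NULL, 0, &tid);
    if (t == NULL)
        return 1;

    WaitForSingleObject(t, INFINITE);   /* wait for the thread to exit */
    CloseHandle(t);
    return 0;
}
```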
Every thread has a thread ID, which is taken from the same space as the proc-
ess IDs, so a single ID can never be in use for both a process and a thread at the
same time. Process and thread IDs are multiples of four because they are actually
allocated by the executive using a special handle table set aside for allocating IDs.
The system is reusing the scalable handle-management facility shown in
Figs. 11-16 and 11-17. The handle table does not have references on objects, but
does use the pointer field to point at the process or thread so that the lookup of a
process or thread by ID is very efficient. FIFO ordering of the list of free handles
is turned on for the ID table in recent versions of Windows so that IDs are not im-
mediately reused. The problems with immediate reuse are explored in the prob-
lems at the end of this chapter.
A thread normally runs in user mode, but when it makes a system call it
switches to kernel mode and continues to run as the same thread with the same
properties and limits it had in user mode. Each thread has two stacks, one for use
when it is in user mode and one for use when it is in kernel mode. Whenever a
thread enters the kernel, it switches to the kernel-mode stack. The values of the
user-mode registers are saved in a CONTEXT data structure at the base of the ker-
nel-mode stack. Since the only way for a user-mode thread to not be running is for
it to enter the kernel, the CONTEXT for a thread always contains its register state
when it is not running. The CONTEXT for each thread can be examined and mod-
ified from any process with a handle to the thread.
Threads normally run using the access token of their containing process, but in
certain cases related to client/server computing, a thread running in a service proc-
ess can impersonate its client, using a temporary access token based on the client’s
914 CASE STUDY 2: WINDOWS 8 CHAP. 11

token so it can perform operations on the client’s behalf. (In general a service can-
not use the client’s actual token, as the client and server may be running on dif-
ferent systems.)
Threads are also the normal focal point for I/O. Threads block when perform-
ing synchronous I/O, and the outstanding I/O request packets for asynchronous I/O
are linked to the thread. When a thread is finished executing, it can exit. Any I/O
requests pending for the thread will be canceled. When the last thread still active
in a process exits, the process terminates.
It is important to realize that threads are a scheduling concept, not a re-
source-ownership concept. Any thread is able to access all the objects that belong
to its process. All it has to do is use the handle value and make the appropriate
Win32 call. There is no restriction preventing a thread from accessing an object
because a different thread created or opened it. The system does not even keep track
of which thread created which object. Once an object handle has been put in a
process’ handle table, any thread in the process can use it, even if it is imperson-
ating a different user.
As described previously, in addition to the normal threads that run within user
processes Windows has a number of system threads that run only in kernel mode
and are not associated with any user process. All such system threads run in a spe-
cial process called the system process. This process does not have a user-mode
address space. It provides the environment that threads execute in when they are
not operating on behalf of a specific user-mode process. We will study some of
these threads later when we come to memory management. Some perform admin-
istrative tasks, such as writing dirty pages to the disk, while others form the pool of
worker threads that are assigned to run specific short-term tasks delegated by exec-
utive components or drivers that need to get some work done in the system process.

11.4.2 Job, Process, Thread, and Fiber Management API Calls

New processes are created using the Win32 API function CreateProcess. This
function has many parameters and lots of options. It takes the name of the file to
be executed, the command-line strings (unparsed), and a pointer to the environ-
ment strings. There are also flags and values that control many details such as how
security is configured for the process and first thread, debugger configuration, and
scheduling priorities. A flag also specifies whether open handles in the creator are
to be passed to the new process. The function also takes the current working direc-
tory for the new process and an optional data structure with information about the
GUI Window the process is to use. Rather than returning just a process ID for the
new process, Win32 returns both handles and IDs, both for the new process and for
its initial thread.
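A minimal (Windows-only) use of the call, launching notepad.exe purely as an illustration; the NULL, FALSE, and 0 arguments accept the defaults for the many optional parameters:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    STARTUPINFOW si = { sizeof(si) };    /* optional GUI/window info  */
    PROCESS_INFORMATION pi;              /* receives handles and IDs  */
    wchar_t cmdline[] = L"notepad.exe";  /* unparsed command line     */

    /* Defaults taken for security attributes, handle inheritance,
     * creation flags, environment strings, and working directory. */
    if (!CreateProcessW(NULL, cmdline, NULL, NULL, FALSE, 0,
                        NULL, NULL, &si, &pi)) {
        fprintf(stderr, "CreateProcess failed: %lu\n",
                (unsigned long)GetLastError());
        return 1;
    }

    /* Both handles and IDs come back, for the process and the
     * initial thread. */
    printf("process ID %lu, initial thread ID %lu\n",
           (unsigned long)pi.dwProcessId, (unsigned long)pi.dwThreadId);

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}
```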
The large number of parameters reveals a number of differences from the de-
sign of process creation in UNIX.
SEC. 11.4 PROCESSES AND THREADS IN WINDOWS 915

1. The actual search path for finding the program to execute is buried in
the library code for Win32, but managed more explicitly in UNIX.
2. The current working directory is a kernel-mode concept in UNIX but
a user-mode string in Windows. Windows does open a handle on the
current directory for each process, with the same annoying effect as in
UNIX: you cannot delete the directory, unless it happens to be across
the network, in which case you can delete it.
3. UNIX parses the command line and passes an array of parameters,
while Win32 leaves argument parsing up to the individual program.
As a consequence, different programs may handle wildcards (e.g.,
*.txt) and other special symbols in an inconsistent way.
4. Whether file descriptors can be inherited in UNIX is a property of the
handle. In Windows it is a property of both the handle and a parame-
ter to process creation.
5. Win32 is GUI oriented, so new processes are directly passed infor-
mation about their primary window, while this information is passed
as parameters to GUI applications in UNIX.
6. Windows does not have a SETUID bit as a property of the executable,
but one process can create a process that runs as a different user, as
long as it can obtain a token with that user’s credentials.
7. The process and thread handle returned from Windows can be used at
any time to modify the new process/thread in many substantive ways,
including modifying the virtual memory, injecting threads into the
process, and altering the execution of threads. UNIX makes modifi-
cations to the new process only between the fork and exec calls, and
only in limited ways as exec throws out all the user-mode state of the
process.

Some of these differences are historical and philosophical. UNIX was de-
signed to be command-line oriented rather than GUI oriented like Windows.
UNIX users are more sophisticated, and they understand concepts like PATH vari-
ables. Windows inherited a lot of legacy from MS-DOS.
The comparison is also skewed because Win32 is a user-mode wrapper around
the native NT process execution, much as the system library function wraps
fork/exec in UNIX. The actual NT system calls for creating processes and threads,
NtCreateProcess and NtCreateThread, are simpler than the Win32 versions. The
main parameters to NT process creation are a handle on a section representing the
program file to run, a flag specifying whether the new process should, by default,
inherit handles from the creator, and parameters related to the security model. All
the details of setting up the environment strings and creating the initial thread are
left to user-mode code that can use the handle on the new process to manipulate its
virtual address space directly.
To support the POSIX subsystem, native process creation has an option to cre-
ate a new process by copying the virtual address space of another process rather
than mapping a section object for a new program. This is used only to implement
fork for POSIX, and not by Win32. Since POSIX no longer ships with Windows,
process duplication has little use—though sometimes enterprising developers come
up with special uses, similar to uses of fork without exec in UNIX.
Thread creation passes the CPU context to use for the new thread (which in-
cludes the stack pointer and initial instruction pointer), a template for the TEB, and
a flag saying whether the thread should be immediately run or created in a sus-
pended state (waiting for somebody to call NtResumeThread on its handle). Crea-
tion of the user-mode stack and pushing of the argv/argc parameters is left to user-
mode code calling the native NT memory-management APIs on the process hand-
le.
In the Windows Vista release, a new native API for processes, NtCreateUser-
Process, was added which moves many of the user-mode steps into the kernel-
mode executive, and combines process creation with creation of the initial thread.
The reason for the change was to support the use of processes as security bound-
aries. Normally, all processes created by a user are considered to be equally trust-
ed. It is the user, as represented by a token, that determines where the trust bound-
ary is. NtCreateUserProcess allows processes to also provide trust boundaries, but
this means that the creating process does not have sufficient rights regarding a new
process handle to implement the details of process creation in user mode for proc-
esses that are in a different trust environment. The primary use of a process in a
different trust boundary (called protected processes) is to support forms of digital
rights management, which protect copyrighted material from being used improp-
erly. Of course, protected processes only target user-mode attacks against protect-
ed content and cannot prevent kernel-mode attacks.

Interprocess Communication

Threads can communicate in a wide variety of ways, including pipes, named
pipes, mailslots, sockets, remote procedure calls, and shared files. Pipes have two
modes: byte and message, selected at creation time. Byte-mode pipes work the
same way as in UNIX. Message-mode pipes are somewhat similar but preserve
message boundaries, so that four writes of 128 bytes will be read as four 128-byte
messages, and not as one 512-byte message, as might happen with byte-mode
pipes. Named pipes also exist and have the same two modes as regular pipes.
Named pipes can also be used over a network but regular pipes cannot.
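The server side of a message-mode named pipe can be sketched as follows (Windows-only; the pipe name "demo" and the buffer sizes are made up):

```c
#include <windows.h>

int main(void)
{
    /* "\\.\pipe\demo" is a made-up name in the local pipe namespace. */
    HANDLE pipe = CreateNamedPipeW(
        L"\\\\.\\pipe\\demo",
        PIPE_ACCESS_DUPLEX,                 /* two way, unlike mailslots */
        PIPE_TYPE_MESSAGE | PIPE_READMODE_MESSAGE | PIPE_WAIT,
        1,                                  /* one pipe instance */
        4096, 4096,                         /* out/in buffer sizes */
        0, NULL);
    if (pipe == INVALID_HANDLE_VALUE)
        return 1;

    ConnectNamedPipe(pipe, NULL);           /* wait for a client */

    /* Message boundaries are preserved: four 128-byte WriteFile calls
     * by the client arrive as four 128-byte reads here, never as one
     * 512-byte read. */
    char buf[4096];
    DWORD n;
    while (ReadFile(pipe, buf, sizeof(buf), &n, NULL)) {
        /* process one complete message of n bytes */
    }
    CloseHandle(pipe);
    return 0;
}
```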
Mailslots are a feature of the now-defunct OS/2 operating system imple-
mented in Windows for compatibility. They are similar to pipes in some ways, but
not all. For one thing, they are one way, whereas pipes are two way. They could
be used over a network but do not provide guaranteed delivery. Finally, they allow
the sending process to broadcast a message to many receivers, instead of to just
one receiver. Both mailslots and named pipes are implemented as file systems in
Windows, rather than executive functions. This allows them to be accessed over
the network using the existing remote file-system protocols.
Sockets are like pipes, except that they normally connect processes on dif-
ferent machines. For example, one process writes to a socket and another one on a
remote machine reads from it. Sockets can also be used to connect processes on
the same machine, but since they entail more overhead than pipes, they are gener-
ally only used in a networking context. Sockets were originally designed for
Berkeley UNIX, and the implementation was made widely available. Some of the
Berkeley code and data structures are still present in Windows today, as acknow-
ledged in the release notes for the system.
RPCs are a way for process A to have process B call a procedure in B’s address
space on A’s behalf and return the result to A. Various restrictions on the parame-
ters exist. For example, it makes no sense to pass a pointer to a different process,
so data structures have to be packaged up and transmitted in a nonprocess-specific
way. RPC is normally implemented as an abstraction layer on top of a transport
layer. In the case of Windows, the transport can be TCP/IP sockets, named pipes,
or ALPC. ALPC (Advanced Local Procedure Call) is a message-passing facility in
the kernel-mode executive. It is optimized for communicating between processes
on the local machine and does not operate across the network. The basic design is
for sending messages that generate replies, implementing a lightweight version of
remote procedure call which the RPC package can build on top of to provide a
richer set of features than available in ALPC. ALPC is implemented using a com-
bination of copying parameters and temporary allocation of shared memory, based
on the size of the messages.
Finally, processes can share objects. This includes section objects, which can
be mapped into the virtual address space of different processes at the same time.
All writes done by one process then appear in the address spaces of the other proc-
esses. Using this mechanism, the shared buffer used in producer-consumer prob-
lems can easily be implemented.
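Sharing a section object can be sketched as below (Windows-only; the section name "SharedBuf" is made up, and a second process would map the same section by opening it with OpenFileMapping):

```c
#include <windows.h>
#include <string.h>

int main(void)
{
    /* Create a pagefile-backed section object of 4096 bytes. */
    HANDLE sec = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                    PAGE_READWRITE, 0, 4096,
                                    L"SharedBuf");
    if (sec == NULL)
        return 1;

    /* Map the section into this process' virtual address space. */
    char *buf = MapViewOfFile(sec, FILE_MAP_ALL_ACCESS, 0, 0, 4096);
    if (buf == NULL)
        return 1;

    /* Writes here appear in the address space of every process that
     * has mapped the same section. */
    strcpy(buf, "hello from the producer");

    UnmapViewOfFile(buf);
    CloseHandle(sec);
    return 0;
}
```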

Synchronization

Processes can also use various types of synchronization objects. Just as Win-
dows provides numerous interprocess communication mechanisms, it also provides
numerous synchronization mechanisms, including semaphores, mutexes, critical
regions, and events. All of these mechanisms work with threads, not processes, so
that when a thread blocks on a semaphore, other threads in that process (if any) are
not affected and can continue to run.
A semaphore can be created using the CreateSemaphore Win32 API function,
which can also initialize it to a given value and define a maximum value as well.
Semaphores are kernel-mode objects and thus have security descriptors and hand-
les. The handle for a semaphore can be duplicated using DuplicateHandle and pas-
sed to another process so that multiple processes can synchronize on the same sem-
aphore. A semaphore can also be given a name in the Win32 namespace and have
an ACL set to protect it. Sometimes sharing a semaphore by name is more ap-
propriate than duplicating the handle.
Calls for up and down exist, although they have the somewhat odd names of
ReleaseSemaphore (up) and WaitForSingleObject (down). It is also possible to
give WaitForSingleObject a timeout, so the calling thread can be released eventual-
ly, even if the semaphore remains at 0 (although timers reintroduce races). Wait-
ForSingleObject and WaitForMultipleObjects are the common interfaces used for
waiting on the dispatcher objects discussed in Sec. 11.3. While it would have been
possible to wrap the single-object version of these APIs in a wrapper with a some-
what more semaphore-friendly name, many threads use the multiple-object version
which may include waiting for multiple flavors of synchronization objects as well
as other events like process or thread termination, I/O completion, and messages
being available on sockets and ports.
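Putting the semaphore calls together (Windows-only; the name "DemoSem", the counts, and the timeout are made up for illustration):

```c
#include <windows.h>

int main(void)
{
    /* Initial count 0, maximum 10; "DemoSem" is a made-up name in the
     * Win32 namespace that other processes could open by name. */
    HANDLE sem = CreateSemaphoreW(NULL, 0, 10, L"DemoSem");
    if (sem == NULL)
        return 1;

    ReleaseSemaphore(sem, 1, NULL);            /* up */
    DWORD r = WaitForSingleObject(sem, 1000);  /* down, 1-sec timeout */
    /* r is WAIT_OBJECT_0 here; it would be WAIT_TIMEOUT had the
     * count still been 0 after one second. */

    CloseHandle(sem);
    return r == WAIT_OBJECT_0 ? 0 : 1;
}
```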
Mutexes are also kernel-mode objects used for synchronization, but simpler
than semaphores because they do not have counters. They are essentially locks,
with API functions for locking WaitForSingleObject and unlocking ReleaseMutex.
Like semaphore handles, mutex handles can be duplicated and passed between
processes so that threads in different processes can access the same mutex.
A third synchronization mechanism is called critical sections, which imple-
ment the concept of critical regions. These are similar to mutexes in Windows, ex-
cept local to the address space of the creating thread. Because critical sections are
not kernel-mode objects, they do not have explicit handles or security descriptors
and cannot be passed between processes. Locking and unlocking are done with
EnterCriticalSection and LeaveCriticalSection, respectively. Because these API
functions are performed initially in user space and make kernel calls only when
blocking is needed, they are much faster than mutexes. Critical sections are opti-
mized to combine spin locks (on multiprocessors) with the use of kernel synchroni-
zation only when necessary. In many applications most critical sections are so
rarely contended or have such short hold times that it is never necessary to allocate
a kernel synchronization object. This results in a very significant savings in kernel
memory.
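A typical critical-section pattern looks like this (Windows-only; the shared counter and thread count are invented for illustration):

```c
#include <windows.h>

static CRITICAL_SECTION lock;   /* local to this process; no handle,
                                   no security descriptor */
static int counter;

static DWORD WINAPI worker(LPVOID arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        /* User-mode fast path; falls back to a kernel synchronization
         * object only when the lock is contended. */
        EnterCriticalSection(&lock);
        counter++;
        LeaveCriticalSection(&lock);
    }
    return 0;
}

int main(void)
{
    InitializeCriticalSection(&lock);
    HANDLE t[2];
    t[0] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    t[1] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    WaitForMultipleObjects(2, t, TRUE, INFINITE);
    /* counter is exactly 200000 because the increments were serialized */
    DeleteCriticalSection(&lock);
    return 0;
}
```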
Another synchronization mechanism we discuss uses kernel-mode objects call-
ed events. As we have described previously, there are two kinds: notification
events and synchronization events. An event can be in one of two states: signaled
or not-signaled. A thread can wait for an event to be signaled with WaitForSin-
gleObject. If another thread signals an event with SetEvent, what happens depends
on the type of event. With a notification event, all waiting threads are released and
the event stays set until manually cleared with ResetEvent. With a synchroniza-
tion event, if one or more threads are waiting, exactly one thread is released and
the event is cleared. An alternative operation is PulseEvent, which is like SetEvent
except that if nobody is waiting, the pulse is lost and the event is cleared. In con-
trast, a SetEvent that occurs with no waiting threads is remembered by leaving the
event in the signaled state so a subsequent thread that calls a wait API for the event
will not actually wait.
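The difference between the two kinds of event can be demonstrated in a few lines (Windows-only sketch; the zero timeouts simply poll the event state):

```c
#include <windows.h>

int main(void)
{
    /* The second argument selects the kind: TRUE = notification
     * (manual reset), FALSE = synchronization (auto reset). */
    HANDLE notify = CreateEventW(NULL, TRUE,  FALSE, NULL);
    HANDLE sync   = CreateEventW(NULL, FALSE, FALSE, NULL);

    SetEvent(notify);
    WaitForSingleObject(notify, 0);  /* satisfied */
    WaitForSingleObject(notify, 0);  /* still satisfied: stays signaled */
    ResetEvent(notify);              /* must be cleared manually */

    SetEvent(sync);
    WaitForSingleObject(sync, 0);    /* satisfied, and clears the event */
    /* a second wait here would time out */

    CloseHandle(notify);
    CloseHandle(sync);
    return 0;
}
```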
The number of Win32 API calls dealing with processes, threads, and fibers is
nearly 100, a substantial number of which deal with IPC in one form or another.
Two new synchronization primitives were recently added to Windows, WaitOn-
Address and InitOnceExecuteOnce. WaitOnAddress is called to wait for the value
at the specified address to be modified. The application must call either Wake-
ByAddressSingle (or WakeByAddressAll) after modifying the location to wake ei-
ther the first (or all) of the threads that called WaitOnAddress on that location. The
advantage of this API over using events is that it is not necessary to allocate an ex-
plicit event for synchronization. Instead, the system hashes the address of the loca-
tion to find a list of all the waiters for changes to a given address. WaitOnAddress
functions similarly to the sleep/wakeup mechanism found in the UNIX kernel.
InitOnceExecuteOnce can be used to ensure that an initialization routine is run only
once in a program. Correct initialization of data structures is surprisingly hard in
multithreaded programs. A summary of the synchronization primitives discussed
above, as well as some other important ones, is given in Fig. 11-24.
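The WaitOnAddress pattern can be sketched as follows (Windows-only; it requires Windows 8 or later and linking with Synchronization.lib, and the flag variable and timing are invented for illustration):

```c
#include <windows.h>

static volatile LONG ready;          /* the watched memory location */

static DWORD WINAPI waiter(LPVOID arg)
{
    (void)arg;
    LONG undesired = 0;
    /* Block until the value at &ready is no longer the undesired 0.
     * No explicit event object is needed; the system hashes the
     * address to find the waiters. */
    while (ready == 0)
        WaitOnAddress(&ready, &undesired, sizeof(ready), INFINITE);
    return 0;
}

int main(void)
{
    HANDLE t = CreateThread(NULL, 0, waiter, NULL, 0, NULL);
    Sleep(10);
    InterlockedExchange(&ready, 1);      /* modify the location ...   */
    WakeByAddressSingle((PVOID)&ready);  /* ... then wake one waiter  */
    WaitForSingleObject(t, INFINITE);
    CloseHandle(t);
    return 0;
}
```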
Note that not all of these are just system calls. While some are wrappers, oth-
ers contain significant library code which maps the Win32 semantics onto the
native NT APIs. Still others, like the fiber APIs, are purely user-mode functions
since, as we mentioned earlier, kernel mode in Windows knows nothing about
fibers. They are entirely implemented by user-mode libraries.

11.4.3 Implementation of Processes and Threads

In this section we will get into more detail about how Windows creates a proc-
ess (and the initial thread). Because Win32 is the most documented interface, we
will start there. But we will quickly work our way down into the kernel and under-
stand the implementation of the native API call for creating a new process. We
will focus on the main code paths that get executed whenever processes are creat-
ed, as well as look at a few of the details that fill in gaps in what we have covered
so far.
A process is created when another process makes the Win32 CreateProcess
call. This call invokes a user-mode procedure in kernel32.dll that makes a call to
NtCreateUserProcess in the kernel to create the process in several steps.

1. Convert the executable file name given as a parameter from a Win32
path name to an NT path name. If the executable has just a name
without a directory path name, it is searched for in the directories list-
ed in the default directories (which include, but are not limited to,
those in the PATH variable in the environment).
Win32 API Function Description
CreateProcess Create a new process
CreateThread Create a new thread in an existing process
CreateFiber Create a new fiber
ExitProcess Terminate current process and all its threads
ExitThread Terminate this thread
ExitFiber Terminate this fiber
SwitchToFiber Run a different fiber on the current thread
SetPriorityClass Set the priority class for a process
SetThreadPriority Set the priority for one thread
CreateSemaphore Create a new semaphore
CreateMutex Create a new mutex
OpenSemaphore Open an existing semaphore
OpenMutex Open an existing mutex
WaitForSingleObject Block on a single semaphore, mutex, etc.
WaitForMultipleObjects Block on a set of objects whose handles are given
PulseEvent Set an event to signaled, then to nonsignaled
ReleaseMutex Release a mutex to allow another thread to acquire it
ReleaseSemaphore Increase the semaphore count by 1
EnterCriticalSection Acquire the lock on a critical section
LeaveCriticalSection Release the lock on a critical section
WaitOnAddress Block until the memory is changed at the specified address
WakeByAddressSingle Wake the first thread that is waiting on this address
WakeByAddressAll Wake all threads that are waiting on this address
InitOnceExecuteOnce Ensure that an initialization routine executes only once

Figure 11-24. Some of the Win32 calls for managing processes, threads,
and fibers.

2. Bundle up the process-creation parameters and pass them, along with
the full path name of the executable program, to the native API
NtCreateUserProcess.
3. Running in kernel mode, NtCreateUserProcess processes the parame-
ters, then opens the program image and creates a section object that
can be used to map the program into the new process’ virtual address
space.
4. The process manager allocates and initializes the process object (the
kernel data structure representing a process to both the kernel and ex-
ecutive layers).
5. The memory manager creates the address space for the new process
by allocating and initializing the page directories and the virtual ad-
dress descriptors which describe the kernel-mode portion, including
the process-specific regions, such as the self-map page-directory en-
tries that give each process kernel-mode access to the physical pages
in its entire page table using kernel virtual addresses. (We will de-
scribe the self map in more detail in Sec. 11.5.)
6. A handle table is created for the new process, and all the handles from
the caller that are allowed to be inherited are duplicated into it.
7. The shared user page is mapped, and the memory manager initializes
the working-set data structures used for deciding what pages to trim
from a process when physical memory is low. The pieces of the ex-
ecutable image represented by the section object are mapped into the
new process’ user-mode address space.
8. The executive creates and initializes the user-mode PEB, which is
used by both user-mode processes and the kernel to maintain proc-
esswide state information, such as the user-mode heap pointers and
the list of loaded libraries (DLLs).
9. Virtual memory is allocated in the new process and used to pass pa-
rameters, including the environment strings and command line.
10. A process ID is allocated from the special handle table (ID table) the
kernel maintains for efficiently allocating locally unique IDs for proc-
esses and threads.
11. A thread object is allocated and initialized. A user-mode stack is al-
located along with the Thread Environment Block (TEB). The CON-
TEXT record which contains the thread’s initial values for the CPU
registers (including the instruction and stack pointers) is initialized.
12. The process object is added to the global list of processes. Handles
for the process and thread objects are allocated in the caller’s handle
table. An ID for the initial thread is allocated from the ID table.
13. NtCreateUserProcess returns to user mode with the new process
created, containing a single thread that is ready to run but suspended.
14. If the NT API fails, the Win32 code checks to see if this might be a
process belonging to another subsystem like WOW64. Or perhaps
the program is marked that it should be run under the debugger.
These special cases are handled with special code in the user-mode
CreateProcess code.
15. If NtCreateUserProcess was successful, there is still some work to be
done. Win32 processes have to be registered with the Win32 subsys-
tem process, csrss.exe. Kernel32.dll sends a message to csrss telling it
about the new process along with the process and thread handles so it
can duplicate itself. The process and threads are entered into the
subsystems’ tables so that they have a complete list of all Win32
processes and threads. The subsystem then displays a cursor con-
taining a pointer with an hourglass to tell the user that something is
going on but that the cursor can be used in the meanwhile. When the
process makes its first GUI call, usually to create a window, the cur-
sor is removed (it times out after 2 seconds if no call is forthcoming).
16. If the process is restricted, such as low-rights Internet Explorer, the
token is modified to restrict what objects the new process can access.
17. If the application program was marked as needing to be shimmed to
run compatibly with the current version of Windows, the specified
shims are applied. Shims usually wrap library calls to slightly modi-
fy their behavior, such as returning a fake version number or delaying
the freeing of memory.
18. Finally, call NtResumeThread to unsuspend the thread, and return the
structure to the caller containing the IDs and handles for the process
and thread that were just created.
In earlier versions of Windows, much of the algorithm for process creation was
implemented in the user-mode procedure, which would create a new process using
multiple system calls and by performing other work using the NT native APIs that
support implementation of subsystems. These steps were moved into the kernel to
reduce the ability of the parent process to manipulate the child process in the cases
where the child is running a protected program, such as one that implements DRM
to protect movies from piracy.
The original native API, NtCreateProcess, is still supported by the system, so
much of process creation could still be done within user mode of the parent proc-
ess—as long as the process being created is not a protected process.

Scheduling

The Windows kernel does not have a central scheduling thread. Instead, when
a thread cannot run any more, the thread calls into the scheduler itself to see which
thread to switch to. The following conditions invoke scheduling.
1. A running thread blocks on a semaphore, mutex, event, I/O, etc.
2. The thread signals an object (e.g., does an up on a semaphore).
3. The quantum expires.
In case 1, the thread is already in the kernel to carry out the operation on the dis-
patcher or I/O object. It cannot possibly continue, so it calls the scheduler code to
pick its successor and load that thread’s CONTEXT record to resume running it.
In case 2, the running thread is in the kernel, too. However, after signaling
some object, it can definitely continue because signaling an object never blocks.
Still, the thread is required to call the scheduler to see if the result of its action has
released a thread with a higher scheduling priority that is now ready to run. If so, a
thread switch occurs since Windows is fully preemptive (i.e., thread switches can
occur at any moment, not just at the end of the current thread’s quantum). Howev-
er, in the case of a multicore chip or a multiprocessor, a thread that was made ready
may be scheduled on a different CPU and the original thread can continue to ex-
ecute on the current CPU even though its scheduling priority is lower.
In case 3, an interrupt to kernel mode occurs, at which point the thread ex-
ecutes the scheduler code to see who runs next. Depending on what other threads
are waiting, the same thread may be selected, in which case it gets a new quantum
and continues running. Otherwise a thread switch happens.
The scheduler is also called under two other conditions:

1. An I/O operation completes.
2. A timed wait expires.

In the first case, a thread may have been waiting on this I/O and is now released to
run. A check has to be made to see if it should preempt the running thread since
there is no guaranteed minimum run time. The scheduler is not run in the interrupt
handler itself (since that may keep interrupts turned off too long). Instead, a DPC
is queued for slightly later, after the interrupt handler is done. In the second case, a
thread has done a down on a semaphore or blocked on some other object, but with
a timeout that has now expired. Again it is necessary for the interrupt handler to
queue a DPC to avoid having it run during the clock interrupt handler. If a thread
has been made ready by this timeout, the scheduler will be run and if the newly
runnable thread has higher priority, the current thread is preempted as in case 1.
Now we come to the actual scheduling algorithm. The Win32 API provides
two APIs to influence thread scheduling. First, there is a call SetPriorityClass that
sets the priority class of all the threads in the caller’s process. The allowed values
are: real-time, high, above normal, normal, below normal, and idle. The priority
class determines the relative priorities of processes. The process priority class can
also be used by a process to temporarily mark itself as being background, meaning
that it should not interfere with any other activity in the system. Note that the pri-
ority class is established for the process, but it affects the actual priority of every
thread in the process by setting a base priority that each thread starts with when
created.
The second Win32 API is SetThreadPriority. It sets the relative priority of a
thread (possibly, but not necessarily, the calling thread) with respect to the priority
class of its process. The allowed values are: time critical, highest, above normal,
normal, below normal, lowest, and idle. Time-critical threads get the highest non-
real-time scheduling priority, while idle threads get the lowest, irrespective of the
priority class. The other priority values adjust the base priority of a thread with re-
spect to the normal value determined by the priority class (+2, +1, 0, −1, −2, re-
spectively). The use of priority classes and relative thread priorities makes it easier
for applications to decide what priorities to specify.
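Used together, the two calls look like this (Windows-only; the particular class and relative priority chosen here are arbitrary):

```c
#include <windows.h>

int main(void)
{
    /* Every thread in this process now starts from the
     * "below normal" base priority. */
    SetPriorityClass(GetCurrentProcess(), BELOW_NORMAL_PRIORITY_CLASS);

    /* Raise just the calling thread two steps relative to the
     * class (+2 in the table of relative priorities). */
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);

    /* ... run the latency-sensitive work of this thread ... */
    return 0;
}
```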
The scheduler works as follows. The system has 32 priorities, numbered from
0 to 31. The combinations of priority class and relative priority are mapped onto
32 absolute thread priorities according to the table of Fig. 11-25. The number in
the table determines the thread’s base priority. In addition, every thread has a
current priority, which may be higher (but not lower) than the base priority and
which we will discuss shortly.

                              Win32 process class priority
Win32 thread priority   Real-time   High   Above normal   Normal   Below normal   Idle
Time critical               31       15         15           15          15         15
Highest                     26       15         12           10           8          6
Above normal                25       14         11            9           7          5
Normal                      24       13         10            8           6          4
Below normal                23       12          9            7           5          3
Lowest                      22       11          8            6           4          2
Idle                        16        1          1            1           1          1

Figure 11-25. Mapping of Win32 priorities to Windows priorities.
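The table of Fig. 11-25 can be read as a function from (relative thread priority, process priority class) to base priority. A self-contained sketch, with the enum names invented here and the array values transcribed from the table:

```c
#include <assert.h>

/* Process priority classes, in the column order of Fig. 11-25. */
enum { REALTIME, HIGH, ABOVE_NORMAL, NORMAL_CLASS, BELOW_NORMAL,
       IDLE_CLASS };
/* Relative thread priorities, in the row order of Fig. 11-25. */
enum { TIME_CRITICAL, HIGHEST, ABOVE_NORM, NORM, BELOW_NORM, LOWEST,
       IDLE_REL };

static const int base_priority[7][6] = {
    /*                RT  High  Above  Norm  Below  Idle */
    /* Time crit  */ {31,  15,   15,    15,   15,    15},
    /* Highest    */ {26,  15,   12,    10,    8,     6},
    /* Above norm */ {25,  14,   11,     9,    7,     5},
    /* Normal     */ {24,  13,   10,     8,    6,     4},
    /* Below norm */ {23,  12,    9,     7,    5,     3},
    /* Lowest     */ {22,  11,    8,     6,    4,     2},
    /* Idle       */ {16,   1,    1,     1,    1,     1},
};

/* Base priority a thread starts with, per Fig. 11-25. */
int win32_base_priority(int rel, int cls)
{
    return base_priority[rel][cls];
}
```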

To use these priorities for scheduling, the system maintains an array of 32 lists
of threads, corresponding to priorities 0 through 31 derived from the table of
Fig. 11-25. Each list contains ready threads at the corresponding priority. The
basic scheduling algorithm consists of searching the array from priority 31 down to
priority 0. As soon as a nonempty list is found, the thread at the head of the queue
is selected and run for one quantum. If the quantum expires, the thread goes to the
end of the queue at its priority level and the thread at the front is chosen next. In
other words, when there are multiple threads ready at the highest priority level,
they run round robin for one quantum each. If no thread is ready, the processor is
idled—that is, set to a low power state waiting for an interrupt to occur.
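The basic algorithm just described can be sketched in portable C; the thread structure here is a placeholder, and the real dispatcher of course does much more (locking, per-CPU state, and so on):

```c
#include <assert.h>
#include <stddef.h>

#define NPRI 32

/* One ready queue per priority; a minimal singly linked list. */
struct thread { struct thread *next; };
static struct thread *ready[NPRI];      /* index 31 = highest priority */

/* Dequeue the highest-priority ready thread: scan from 31 down to 0
 * and take the head of the first nonempty list.  NULL means no thread
 * is ready and the processor can be idled. */
struct thread *pick_next(void)
{
    for (int pri = NPRI - 1; pri >= 0; pri--) {
        if (ready[pri] != NULL) {
            struct thread *t = ready[pri];
            ready[pri] = t->next;       /* remove from head of queue */
            return t;
        }
    }
    return NULL;
}

/* Requeue at the tail of its priority level when a thread's quantum
 * expires, which yields round robin among equal-priority threads. */
void requeue(struct thread *t, int pri)
{
    t->next = NULL;
    struct thread **p = &ready[pri];
    while (*p != NULL)
        p = &(*p)->next;
    *p = t;
}
```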
It should be noted that scheduling is done by picking a thread without regard to
which process that thread belongs. Thus, the scheduler does not first pick a proc-
ess and then pick a thread in that process. It only looks at the threads. It does not
consider which thread belongs to which process except to determine if it also needs
to switch address spaces when switching threads.
To improve the scalability of the scheduling algorithm for multiprocessors with
a high number of processors, the scheduler tries hard not to have to take the lock
that protects access to the global array of priority lists. Instead, it sees if it can di-
rectly dispatch a thread that is ready to run to the processor where it should run.
For each thread the scheduler maintains the notion of its ideal processor and
attempts to schedule it on that processor whenever possible. This improves the
performance of the system, as the data used by a thread are more likely to already
be available in the cache belonging to its ideal processor. The scheduler is aware
of multiprocessors in which each CPU has its own memory and which can execute
programs out of any memory—but at a cost if the memory is not local. These sys-
tems are called NUMA (NonUniform Memory Access) machines. The scheduler
tries to optimize thread placement on such machines. The memory manager tries
to allocate physical pages in the NUMA node belonging to the ideal processor for
threads when they page fault.
The array of queue headers is shown in Fig. 11-26. The figure shows that there
are actually four categories of priorities: real-time, user, zero, and idle, which is ef-
fectively −1. These deserve some comment. Priorities 16–31 are called system,
and are intended to build systems that satisfy real-time constraints, such as dead-
lines needed for multimedia presentations. Threads with real-time priorities run
before any of the threads with dynamic priorities, but not before DPCs and ISRs.
If a real-time application wants to run on the system, it may require device drivers
that are careful not to run DPCs or ISRs for any extended time as they might cause
the real-time threads to miss their deadlines.
Ordinary users may not run real-time threads. If a user thread ran at a higher
priority than, say, the keyboard or mouse thread and got into a loop, the keyboard
or mouse thread would never run, effectively hanging the system. The right to set
the priority class to real-time requires a special privilege to be enabled in the proc-
ess’ token. Normal users do not have this privilege.
Application threads normally run at priorities 1–15. By setting the process and
thread priorities, an application can determine which threads get preference. The
ZeroPage system threads run at priority 0 and convert free pages into pages of all
zeroes. There is a separate ZeroPage thread for each real processor.
Each thread has a base priority based on the priority class of the process and
the relative priority of the thread. But the priority used for determining which of
the 32 lists a ready thread is queued on is determined by its current priority, which
is normally the same as the base priority—but not always. Under certain condi-
tions, the current priority of a nonreal-time thread is boosted by the kernel above
the base priority (but never above priority 15). Since the array of Fig. 11-26 is
based on the current priority, changing this priority affects scheduling. No adjust-
ments are ever made to real-time threads.
Let us now see when a thread’s priority is raised. First, when an I/O operation
completes and releases a waiting thread, the priority is boosted to give it a chance
to run again quickly and start more I/O. The idea here is to keep the I/O devices

[Figure: an array of 32 queue headers, one per priority, with the next thread to run taken from the highest-priority nonempty queue. Priorities 16-31 are the system priorities and 1-15 the user priorities; priority 0 is reserved for the zero-page thread, and the idle thread runs, in effect, below priority 0.]

Figure 11-26. Windows supports 32 priorities for threads.

busy. The amount of boost depends on the I/O device, typically 1 for a disk, 2 for
a serial line, 6 for the keyboard, and 8 for the sound card.
Second, if a thread was waiting on a semaphore, mutex, or other event, when it
is released, it gets boosted by 2 levels if it is in the foreground process (the process
controlling the window to which keyboard input is sent) and 1 level otherwise.
This fix tends to raise interactive processes above the big crowd at level 8. Finally,
if a GUI thread wakes up because window input is now available, it gets a boost for
the same reason.
These boosts are not forever. They take effect immediately, and can cause
rescheduling of the CPU. But if a thread uses all of its next quantum, it loses one
priority level and moves down one queue in the priority array. If it uses up another
full quantum, it moves down another level, and so on until it hits its base level,
where it remains until it is boosted again.
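The boost-and-decay behavior can be modeled in a few lines. The base priority and boost amounts below are hypothetical examples; real Windows applies these rules only to nonreal-time threads.

```python
# Sketch of dynamic-priority boost and decay for a nonreal-time thread.
# Boosts raise the current priority, capped at 15 so the real-time range
# is never entered; each fully used quantum decays it one level toward base.
MAX_DYNAMIC = 15

class Thread:
    def __init__(self, base):
        self.base = base
        self.current = base          # current priority starts at the base

    def boost(self, amount):
        self.current = min(self.current + amount, MAX_DYNAMIC)

    def quantum_expired(self):
        if self.current > self.base:
            self.current -= 1        # drift one level back toward base

t = Thread(base=8)
t.boost(6)                  # e.g., woken up by keyboard input
print(t.current)            # 14
for _ in range(10):
    t.quantum_expired()
print(t.current)            # back at its base, 8
```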
There is one other case in which the system fiddles with the priorities. Imag-
ine that two threads are working together on a producer-consumer type problem.
The producer’s work is harder, so it gets a high priority, say 12, compared to the
consumer’s 4. At a certain point, the producer has filled up a shared buffer and
blocks on a semaphore, as illustrated in Fig. 11-27(a).
Before the consumer gets a chance to run again, an unrelated thread at priority
8 becomes ready and starts running, as shown in Fig. 11-27(b). As long as this
thread wants to run, it will be able to, since it has a higher priority than the consu-
mer, and the producer, though even higher, is blocked. Under these circumstances,
the producer will never get to run again until the priority 8 thread gives up. This

[Figure: (a) The priority-12 producer does a down on the semaphore and blocks; the priority-4 consumer is ready. (b) The producer, still at priority 12, waits on the semaphore while an unrelated priority-8 thread runs; the consumer would like to do an up on the semaphore but never gets scheduled.]

Figure 11-27. An example of priority inversion.

problem is well known under the name priority inversion. Windows addresses
priority inversion between kernel threads through a facility in the thread scheduler
called Autoboost. Autoboost automatically tracks resource dependencies between
threads and boosts the scheduling priority of threads that hold resources needed by
higher-priority threads.
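The scenario of Fig. 11-27 is easy to reproduce with a strict-priority scheduler: the ready priority-8 thread starves the priority-4 consumer indefinitely. The sketch below is a toy model (thread names and the "boost" value are made up), not the actual Autoboost mechanism.

```python
# Priority inversion in miniature: the highest-priority ready thread always
# wins, so the consumer that could unblock the producer never runs.
threads = {
    "producer":  {"priority": 12, "state": "blocked"},  # waiting on semaphore
    "unrelated": {"priority": 8,  "state": "ready"},
    "consumer":  {"priority": 4,  "state": "ready"},    # would do the up
}

def schedule():
    ready = [t for t, info in threads.items() if info["state"] == "ready"]
    return max(ready, key=lambda t: threads[t]["priority"])

# No matter how many quanta elapse, "unrelated" is always chosen.
print(all(schedule() == "unrelated" for _ in range(1000)))  # True

# An Autoboost-style fix raises the thread holding the needed resource
# above the interloper; it then runs, does the up, and unblocks the producer.
threads["consumer"]["priority"] = 9   # simulated boost
print(schedule())                     # consumer
```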
Windows runs on PCs, which usually have only a single interactive session ac-
tive at a time. However, Windows also supports a terminal server mode which
supports multiple interactive sessions over the network using RDP (Remote Desk-
top Protocol). When running multiple user sessions, it is easy for one user to in-
terfere with another by consuming too much processor resources. Windows imple-
ments a fair-share algorithm, DFSS (Dynamic Fair-Share Scheduling), which
keeps sessions from running excessively. DFSS uses scheduling groups to
organize the threads in each session. Within each group the threads are scheduled
according to normal Windows scheduling policies, but each group is given more or
less access to the processors based on how much the group has been running in
aggregate. The relative priorities of the groups are adjusted slowly so that short bursts of activity are ignored, and the amount a group is allowed to run is reduced only if it uses excessive processor time over long periods.

11.5 MEMORY MANAGEMENT


Windows has an extremely sophisticated and complex virtual memory system.
It has a number of Win32 functions for using it, implemented by the memory man-
ager—the largest component of the NTOS executive layer. In the following sec-
tions we will look at the fundamental concepts, the Win32 API calls, and finally
the implementation.

11.5.1 Fundamental Concepts

In Windows, every user process has its own virtual address space. For x86 ma-
chines, virtual addresses are 32 bits long, so each process has 4 GB of virtual ad-
dress space, with the user and kernel each receiving 2 GB. For x64 machines, both
the user and kernel receive more virtual addresses than they can reasonably use in
the foreseeable future. For both x86 and x64, the virtual address space is demand
paged, with a fixed page size of 4 KB—though in some cases, as we will see short-
ly, 2-MB large pages are also used (by using a page directory only and bypassing
the corresponding page table).
The virtual address space layouts for three x86 processes are shown in
Fig. 11-28 in simplified form. The bottom and top 64 KB of each process’ virtual
address space is normally unmapped. This choice was made intentionally to help
catch programming errors and mitigate the exploitability of certain types of vulner-
abilities.
[Figure: three 4-GB address spaces, for processes A, B, and C, side by side. In each, the upper 2 GB holds the kernel: the HAL + OS just above the 2-GB boundary, then per-process stacks, data, and page tables, and the paged and nonpaged pools at the top. Just below 2 GB sits a shared system-data area, below that the process' private code and data, and the bottom and top 64 KB are invalid.]

Figure 11-28. Virtual address space layout for three user processes on the x86.
The white areas are private per process. The shaded areas are shared among all
processes.

Starting at 64 KB comes the user’s private code and data. This extends up to
almost 2 GB. The upper 2 GB contains the operating system, including the code,
data, and the paged and nonpaged pools. The upper 2 GB is the kernel’s virtual
memory and is shared among all user processes, except for virtual memory data
like the page tables and working-set lists, which are per-process. Kernel virtual

memory is accessible only while running in kernel mode. The reason for sharing
the process’ virtual memory with the kernel is that when a thread makes a system
call, it traps into kernel mode and can continue running without changing the mem-
ory map. All that has to be done is switch to the thread’s kernel stack. From a per-
formance point of view, this is a big win, and something UNIX does as well. Be-
cause the process’ user-mode pages are still accessible, the kernel-mode code can
read parameters and access buffers without having to switch back and forth be-
tween address spaces or temporarily double-map pages into both. The trade-off
here is less private address space per process in return for faster system calls.
Windows allows threads to attach themselves to other address spaces while
running in the kernel. Attachment to an address space allows the thread to access
all of the user-mode address space, as well as the portions of the kernel address
space that are specific to a process, such as the self-map for the page tables.
Threads must switch back to their original address space before returning to user
mode.

Virtual Address Allocation

Each page of virtual addresses can be in one of three states: invalid, reserved,
or committed. An invalid page is not currently mapped to a memory section ob-
ject and a reference to it causes a page fault that results in an access violation.
Once code or data is mapped onto a virtual page, the page is said to be committed.
A page fault on a committed page results in mapping the page containing the virtu-
al address that caused the fault onto one of the pages represented by the section ob-
ject or stored in the pagefile. Often this will require allocating a physical page and
performing I/O on the file represented by the section object to read in the data from
disk. But page faults can also occur simply because the page-table entry needs to
be updated, as the physical page referenced is still cached in memory, in which
case I/O is not required. These are called soft faults and we will discuss them in
more detail shortly.
A virtual page can also be in the reserved state. A reserved virtual page is
invalid but has the property that those virtual addresses will never be allocated by
the memory manager for another purpose. As an example, when a new thread is
created, many pages of user-mode stack space are reserved in the process’ virtual
address space, but only one page is committed. As the stack grows, the virtual
memory manager will automatically commit additional pages under the covers,
until the reservation is almost exhausted. The reserved pages function as guard
pages to keep the stack from growing too far and overwriting other process data.
Reserving all the virtual pages means that the stack can eventually grow to its max-
imum size without the risk that some of the contiguous pages of virtual address
space needed for the stack might be given away for another purpose. In addition to
the invalid, reserved, and committed attributes, pages also have other attributes,
such as being readable, writable, and executable.
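The three page states and stack-style reservation can be sketched as a small state machine. The addresses and sizes below are made up, and the sketch simplifies by committing any reserved page on first touch rather than modeling a single guard page.

```python
# Sketch of the invalid / reserved / committed page states.  A thread stack
# reserves many pages up front but commits them one at a time as it grows.
PAGE = 4096

class AddressSpace:
    def __init__(self):
        self.state = {}                      # page number -> "reserved"/"committed"

    def reserve(self, start, npages):
        for p in range(start, start + npages):
            assert p not in self.state       # reserved addresses are never reused
            self.state[p] = "reserved"

    def commit(self, page):
        assert self.state.get(page) == "reserved"
        self.state[page] = "committed"

    def touch(self, page):
        s = self.state.get(page)
        if s == "committed":
            return "ok"
        if s == "reserved":
            self.commit(page)                # stack growth: commit on demand
            return "ok"
        return "access violation"            # invalid page

stack = AddressSpace()
stack.reserve(start=0x100, npages=256)       # 1 MB of stack reserved
stack.commit(0x100)                          # but only one page committed
print(stack.touch(0x101))                    # ok (the stack grows)
print(stack.touch(0x500))                    # access violation (invalid page)
```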

Pagefiles

An interesting trade-off occurs with assignment of backing store to committed pages that are not being mapped to specific files. These pages use the pagefile.
The question is how and when to map the virtual page to a specific location in the
pagefile. A simple strategy would be to assign each virtual page to a page in one
of the paging files on disk at the time the virtual page was committed. This would
guarantee that there was always a known place to write out each committed page
should it be necessary to evict it from memory.
Windows uses a just-in-time strategy. Committed pages that are backed by the
pagefile are not assigned space in the pagefile until the time that they have to be
paged out. No disk space is allocated for pages that are never paged out. If the
total virtual memory is less than the available physical memory, a pagefile is not
needed at all. This is convenient for embedded systems based on Windows. It is
also the way the system is booted, since pagefiles are not initialized until the first
user-mode process, smss.exe, begins running.
With a preallocation strategy the total virtual memory in the system used for
private data (stacks, heap, and copy-on-write code pages) is limited to the size of
the pagefiles. With just-in-time allocation the total virtual memory can be almost
as large as the combined size of the pagefiles and physical memory. With disks so
large and cheap vs. physical memory, the savings in space is not as significant as
the increased performance that is possible.
With demand-paging, requests to read pages from disk need to be initiated
right away, as the thread that encountered the missing page cannot continue until
this page-in operation completes. The possible optimizations for faulting pages in-
to memory involve attempting to prepage additional pages in the same I/O opera-
tion. However, operations that write modified pages to disk are not normally syn-
chronous with the execution of threads. The just-in-time strategy for allocating
pagefile space takes advantage of this to boost the performance of writing modified
pages to the pagefile. Modified pages are grouped together and written in big
chunks. Since the allocation of space in the pagefile does not happen until the
pages are being written, the number of seeks required to write a batch of pages can
be optimized by allocating the pagefile pages to be near each other, or even making
them contiguous.
When pages stored in the pagefile are read into memory, they keep their alloca-
tion in the pagefile until the first time they are modified. If a page is never modi-
fied, it will go onto a special list of free physical pages, called the standby list,
where it can be reused without having to be written back to disk. If it is modified,
the memory manager will free the pagefile page and the only copy of the page will
be in memory. The memory manager implements this by marking the page as
read-only after it is loaded. The first time a thread attempts to write the page the
memory manager will detect this situation and free the pagefile page, grant write
access to the page, and then have the thread try again.
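This lazy reclamation of pagefile space can be sketched as follows. The fault handler and the slot number are illustrative only; the real bookkeeping lives in the page-table entry and page-frame database.

```python
# Sketch: a page read back from the pagefile is mapped read-only so the
# memory manager notices the first write and can free the pagefile copy.
class Page:
    def __init__(self):
        self.writable = False        # mapped read-only after page-in
        self.pagefile_slot = 1234    # hypothetical slot still allocated

def write_fault(page):
    # On the first write, the in-memory copy becomes the only copy.
    page.pagefile_slot = None        # free the pagefile space
    page.writable = True             # grant write access; the thread retries

p = Page()
write_fault(p)
print(p.writable, p.pagefile_slot)   # True None
```

Until the fault fires, an unmodified page can go straight to the standby list and be reused without any write-back, exactly as the text describes.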

Windows supports up to 16 pagefiles, normally spread out over separate disks to achieve higher I/O bandwidth. Each one has an initial size and a maximum size it can grow to later if needed, but it is better to create these files to be the maximum size at system installation time. If it becomes necessary to grow a pagefile
mum size at system installation time. If it becomes necessary to grow a pagefile
when the file system is much fuller, it is likely that the new space in the pagefile
will be highly fragmented, reducing performance.
The operating system keeps track of which virtual page maps onto which part
of which paging file by writing this information into the page-table entries for the
process for private pages, or into prototype page-table entries associated with the
section object for shared pages. In addition to the pages that are backed by the
pagefile, many pages in a process are mapped to regular files in the file system.
The executable code and read-only data in a program file (e.g., an EXE or
DLL) can be mapped into the address space of whatever process is using it. Since
these pages cannot be modified, they never need to be paged out but the physical
pages can just be immediately reused after the page-table mappings are all marked
as invalid. When the page is needed again in the future, the memory manager will
read the page in from the program file.
Sometimes pages that start out as read-only end up being modified, for ex-
ample, setting a breakpoint in the code when debugging a process, or fixing up
code to relocate it to different addresses within a process, or making modifications
to data pages that started out shared. In cases like these, Windows, like most mod-
ern operating systems, supports a type of page called copy-on-write. These pages
start out as ordinary mapped pages, but when an attempt is made to modify any
part of the page the memory manager makes a private, writable copy. It then
updates the page table for the virtual page so that it points at the private copy and
has the thread retry the write—which will now succeed. If that copy later needs to
be paged out, it will be written to the pagefile rather than the original file.
Besides mapping program code and data from EXE and DLL files, ordinary
files can be mapped into memory, allowing programs to reference data from files
without doing read and write operations. I/O operations are still needed, but they
are provided implicitly by the memory manager using the section object to repres-
ent the mapping between pages in memory and the blocks in the files on disk.
Section objects do not have to refer to a file. They can refer to anonymous re-
gions of memory. By mapping anonymous section objects into multiple processes,
memory can be shared without having to allocate a file on disk. Since sections can
be given names in the NT namespace, processes can rendezvous by opening sec-
tions by name, as well as by duplicating and passing handles between processes.

11.5.2 Memory-Management System Calls

The Win32 API contains a number of functions that allow a process to manage
its virtual memory explicitly. The most important of these functions are listed in
Fig. 11-29. All of them operate on a region consisting of either a single page or a

sequence of two or more pages that are consecutive in the virtual address space.
Of course, processes do not have to manage their memory; paging happens auto-
matically, but these calls give processes additional power and flexibility.

Win32 API function Description


Vir tualAlloc Reser ve or commit a region
Vir tualFree Release or decommit a region
Vir tualProtect Change the read/write/execute protection on a region
Vir tualQuery Inquire about the status of a region
Vir tualLock Make a region memory resident (i.e., disable paging for it)
Vir tualUnlock Make a region pageable in the usual way
CreateFileMapping Create a file-mapping object and (optionally) assign it a name
MapViewOfFile Map (par t of) a file into the address space
UnmapViewOfFile Remove a mapped file from the address space
OpenFileMapping Open a previously created file-mapping object

Figure 11-29. The principal Win32 API functions for managing virtual memory
in Windows.

The first four API functions are used to allocate, free, protect, and query re-
gions of virtual address space. Allocated regions always begin on 64-KB bound-
aries to minimize porting problems to future architectures with pages larger than
current ones. The actual amount of address space allocated can be less than 64
KB, but must be a multiple of the page size. The next two APIs give a process the
ability to hardwire pages in memory so they will not be paged out and to undo this
property. A real-time program might need pages with this property to avoid page
faults to disk during critical operations, for example. A limit is enforced by the op-
erating system to prevent processes from getting too greedy. The pages actually
can be removed from memory, but only if the entire process is swapped out. When
it is brought back, all the locked pages are reloaded before any thread can start run-
ning again. Although not shown in Fig. 11-29, Windows also has native API func-
tions to allow a process to access the virtual memory of a different process over
which it has been given control, that is, for which it has a handle (see Fig. 11-7).
The last four API functions listed are for managing memory-mapped files. To
map a file, a file-mapping object must first be created with CreateFileMapping (see
Fig. 11-8). This function returns a handle to the file-mapping object (i.e., a section
object) and optionally enters a name for it into the Win32 namespace so that other
processes can use it, too. The next two functions map and unmap views on section
objects from a process' virtual address space. The last API can be used by a process to share a mapping that another process created with CreateFileMapping,
usually one created to map anonymous memory. In this way, two or more proc-
esses can share regions of their address spaces. This technique allows them to
write in limited regions of each other’s virtual memory.
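The same idea is exposed portably by Python's mmap module, which on Windows is built on exactly these section-object calls. A minimal sketch, using a throwaway temporary file rather than a named section:

```python
import mmap
import os
import tempfile

# Map an ordinary file and modify it through memory, with no explicit
# read/write calls; the memory manager performs the I/O on our behalf.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello, mapped world")

with mmap.mmap(fd, 0) as view:           # analogous to MapViewOfFile
    view[0:5] = b"HELLO"                 # a plain memory store
    print(bytes(view[0:12]))             # b'HELLO, mappe'

os.close(fd)
os.remove(path)
```

Sharing between processes works the same way: both map the same file (or, with Win32 named sections, the same anonymous section) and each sees the other's stores.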

11.5.3 Implementation of Memory Management

Windows, on the x86, supports a single linear 4-GB demand-paged address space per process. Segmentation is not supported in any form. Theoretically, page
sizes can be any power of 2 up to 64 KB. On the x86 they are normally fixed at 4
KB. In addition, the operating system can use 2-MB large pages to improve the ef-
fectiveness of the TLB (Translation Lookaside Buffer) in the processor’s memo-
ry management unit. Use of 2-MB large pages by the kernel and large applications
significantly improves performance by improving the hit rate for the TLB and
reducing the number of times the page tables have to be walked to find entries that
are missing from the TLB.
[Figure: the address spaces of processes A and B with their backing store on disk. Each process has stack, data, shared-library, and program regions; A's program region maps Prog1.exe and B's maps Prog2.exe, anonymous regions are backed by the paging file, and lib.dll backs the shared-library region of both processes.]

Figure 11-30. Mapped regions with their shadow pages on disk. The lib.dll file
is mapped into two address spaces at the same time.

Unlike the scheduler, which selects individual threads to run and does not care
much about processes, the memory manager deals entirely with processes and does
not care much about threads. After all, processes, not threads, own the address
space and that is what the memory manager is concerned with. When a region of
virtual address space is allocated, as four of them have been for process A in
Fig. 11-30, the memory manager creates a VAD (Virtual Address Descriptor) for
it, listing the range of addresses mapped, the section representing the backing store
file and offset where it is mapped, and the permissions. When the first page is
touched, the directory of page tables is created and its physical address is inserted
into the process object. An address space is completely defined by the list of its
VADs. The VADs are organized into a balanced tree, so that the descriptor for a

particular address can be found efficiently. This scheme supports sparse address
spaces. Unused areas between the mapped regions use no resources (memory or disk), so they are essentially free.

Page-Fault Handling

When a process starts on Windows, many of the pages mapping the program’s
EXE and DLL image files may already be in memory because they are shared with
other processes. The writable pages of the images are marked copy-on-write so
that they can be shared up to the point they need to be modified. If the operating
system recognizes the EXE from a previous execution, it may have recorded the
page-reference pattern, using a technology Microsoft calls SuperFetch. Super-
Fetch attempts to prepage many of the needed pages even though the process has
not faulted on them yet. This reduces the latency for starting up applications by
overlapping the reading of the pages from disk with the execution of the ini-
tialization code in the images. It improves throughput to disk because it is easier
for the disk drivers to organize the reads to reduce the seek time needed. Process
prepaging is also used during boot of the system, when a background application
moves to the foreground, and when restarting the system after hibernation.
Prepaging is supported by the memory manager, but implemented as a separate
component of the system. The pages brought in are not inserted into the process’
page table, but instead are inserted into the standby list from which they can quick-
ly be inserted into the process as needed without accessing the disk.
Nonmapped pages are slightly different in that they are not initialized by read-
ing from the file. Instead, the first time a nonmapped page is accessed the memory
manager provides a new physical page, making sure the contents are all zeroes (for
security reasons). On subsequent faults a nonmapped page may need to be found
in memory or else must be read back from the pagefile.
Demand paging in the memory manager is driven by page faults. On each
page fault, a trap to the kernel occurs. The kernel then builds a machine-indepen-
dent descriptor telling what happened and passes this to the memory-manager part
of the executive. The memory manager then checks the access for validity. If the
faulted page falls within a committed region, it looks up the address in the list of
VADs and finds (or creates) the process page-table entry. In the case of a shared
page, the memory manager uses the prototype page-table entry associated with the
section object to fill in the new page-table entry for the process page table.
The format of the page-table entries differs depending on the processor archi-
tecture. For the x86 and x64, the entries for a mapped page are shown in
Fig. 11-31. If an entry is marked valid, its contents are interpreted by the hardware
so that the virtual address can be translated into the correct physical page. Unmap-
ped pages also have entries, but they are marked invalid and the hardware ignores
the rest of the entry. The software format is somewhat different from the hardware

format and is determined by the memory manager. For example, for an unmapped
page that must be allocated and zeroed before it may be used, that fact is noted in
the page-table entry.

Bit(s)   Field
63       NX: No eXecute
62-52    AVL: AVaiLable to the OS
51-12    Physical page number
11-9     AVL: AVaiLable to the OS
8        G: Global page
7        PAT: Page Attribute Table
6        D: Dirty (modified)
5        A: Accessed (referenced)
4        PCD: Page Cache Disable
3        PWT: Page Write-Through
2        U/S: User/Supervisor
1        R/W: Read/Write access
0        P: Present (valid)

Figure 11-31. A page-table entry (PTE) for a mapped page on the Intel x86 and
AMD x64 architectures.

Two important bits in the page-table entry are updated by the hardware direct-
ly. These are the access (A) and dirty (D) bits. These bits keep track of when a
particular page mapping has been used to access the page and whether that access
could have modified the page by writing it. This really helps the performance of
the system because the memory manager can use the access bit to implement the
LRU (Least-Recently Used) style of paging. The LRU principle says that pages
which have not been used the longest are the least likely to be used again soon.
The access bit allows the memory manager to determine that a page has been ac-
cessed. The dirty bit lets the memory manager know that a page may have been
modified, or more significantly, that a page has not been modified. If a page has
not been modified since being read from disk, the memory manager does not have
to write the contents of the page to disk before using it for something else.
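Decoding the fields of Fig. 11-31 takes only shifts and masks. The PTE value below is made up for illustration.

```python
# Decode the x86/x64 hardware PTE fields shown in Fig. 11-31.
def decode_pte(pte):
    return {
        "present":  bool(pte & 1),          # P,   bit 0
        "writable": bool(pte >> 1 & 1),     # R/W, bit 1
        "user":     bool(pte >> 2 & 1),     # U/S, bit 2
        "accessed": bool(pte >> 5 & 1),     # A,   bit 5
        "dirty":    bool(pte >> 6 & 1),     # D,   bit 6
        "pfn":      (pte >> 12) & ((1 << 40) - 1),  # bits 51..12
        "nx":       bool(pte >> 63 & 1),    # NX,  bit 63
    }

# A made-up PTE: present, writable, accessed, dirty, page frame 0x1a2b3.
pte = (0x1a2b3 << 12) | (1 << 6) | (1 << 5) | (1 << 1) | 1
f = decode_pte(pte)
print(f["present"], f["dirty"], hex(f["pfn"]))   # True True 0x1a2b3
```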
Both the x86 and x64 use a 64-bit page-table entry, as shown in Fig. 11-31.
Each page fault can be considered as being in one of five categories:
1. The page referenced is not committed.
2. Access to a page has been attempted in violation of the permissions.
3. A shared copy-on-write page was about to be modified.
4. The stack needs to grow.
5. The page referenced is committed but not currently mapped in.
The first and second cases are due to programming errors. If a program at-
tempts to use an address which is not supposed to have a valid mapping, or at-
tempts an invalid operation (like attempting to write a read-only page) this is called

an access violation and usually results in termination of the process. Access viola-
tions are often the result of bad pointers, including accessing memory that was
freed and unmapped from the process.
The third case has the same symptoms as the second one (an attempt to write
to a read-only page), but the treatment is different. Because the page has been
marked as copy-on-write, the memory manager does not report an access violation,
but instead makes a private copy of the page for the current process and then re-
turns control to the thread that attempted to write the page. The thread will retry
the write, which will now complete without causing a fault.
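The copy-on-write sequence can be sketched with two processes sharing one page. The Mapping class here is a toy stand-in for a process' page-table entry, not a real API.

```python
# Sketch of copy-on-write fault handling: two processes share one
# physical page until one of them writes to it.
shared_page = bytearray(b"original data")

class Mapping:
    def __init__(self, page):
        self.page = page
        self.cow = True                        # write-protected, copy-on-write

    def write(self, offset, data):
        if self.cow:
            self.page = bytearray(self.page)   # private copy for this process
            self.cow = False
        self.page[offset:offset + len(data)] = data

a = Mapping(shared_page)
b = Mapping(shared_page)
a.write(0, b"MODIFIED")
print(bytes(a.page))     # b'MODIFIED data'
print(bytes(b.page))     # b'original data'  (b still sees the shared page)
```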
The fourth case occurs when a thread pushes a value onto its stack and crosses
onto a page which has not been allocated yet. The memory manager is program-
med to recognize this as a special case. As long as there is still room in the virtual
pages reserved for the stack, the memory manager will supply a new physical page,
zero it, and map it into the process. When the thread resumes running, it will retry
the access and succeed this time around.
Finally, the fifth case is a normal page fault. However, it has several subcases.
If the page is mapped by a file, the memory manager must search its data struc-
tures, such as the prototype page table associated with the section object to be sure
that there is not already a copy in memory. If there is, say in another process or on
the standby or modified page lists, it will just share it—perhaps marking it as copy-
on-write if changes are not supposed to be shared. If there is not already a copy,
the memory manager will allocate a free physical page and arrange for the file
page to be copied in from disk, unless the page is already transitioning in
from disk, in which case it is only necessary to wait for the transition to complete.
When the memory manager can satisfy a page fault by finding the needed page
in memory rather than reading it in from disk, the fault is classified as a soft fault.
If the copy from disk is needed, it is a hard fault. Soft faults are much cheaper,
and have little impact on application performance compared to hard faults. Soft
faults can occur because a shared page has already been mapped into another proc-
ess, or only a new zero page is needed, or the needed page was trimmed from the
process’ working set but is being requested again before it has had a chance to be
reused. Soft faults can also occur because pages have been compressed to ef-
fectively increase the size of physical memory. For most configurations of CPU,
memory, and I/O in current systems it is more efficient to use compression rather
than incur the I/O expense (performance and energy) required to read a page from
disk.
When a physical page is no longer mapped by the page table in any process it
goes onto one of three lists: free, modified, or standby. Pages that will never be
needed again, such as stack pages of a terminating process, are freed immediately.
Pages that may be faulted again go to either the modified list or the standby list,
depending on whether or not the dirty bit was set for any of the page-table entries
that mapped the page since it was last read from disk. Pages in the modified list
will be eventually written to disk, then moved to the standby list.
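These list transitions, including the modified-page writer that drains the modified list, can be sketched as follows (page names are made up):

```python
# Sketch of the three lists a physical page can land on when it is
# unmapped, and of the modified-page writer that drains the modified list.
free, modified, standby = [], [], []

def unmap(page, dirty, dead=False):
    if dead:                     # e.g., stack page of a terminating process
        free.append(page)
    elif dirty:
        modified.append(page)    # must be written back before reuse
    else:
        standby.append(page)     # clean: reusable or soft-faultable

def modified_page_writer():
    while modified:
        page = modified.pop(0)
        # ... write page contents to the pagefile or mapped file ...
        standby.append(page)     # now clean, so it joins the standby list

unmap("P1", dirty=True)
unmap("P2", dirty=False)
unmap("P3", dirty=False, dead=True)
modified_page_writer()
print(free, modified, standby)   # ['P3'] [] ['P2', 'P1']
```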
SEC. 11.5 MEMORY MANAGEMENT 937

The memory manager can allocate pages as needed using either the free list or
the standby list. Before allocating a page and copying it in from disk, the memory
manager always checks the standby and modified lists to see if it already has the
page in memory. The prepaging scheme in Windows thus converts future hard
faults into soft faults by reading in the pages that are expected to be needed and
pushing them onto the standby list. The memory manager itself does a small
amount of ordinary prepaging by accessing groups of consecutive pages rather than
single pages. The additional pages are immediately put on the standby list. This is
not generally wasteful because the overhead in the memory manager is very much
dominated by the cost of doing a single I/O. Reading a cluster of pages rather than
a single page is negligibly more expensive.
The page-table entries in Fig. 11-31 refer to physical page numbers, not virtual
page numbers. To update page-table (and page-directory) entries, the kernel needs
to use virtual addresses. Windows maps the page tables and page directories for
the current process into kernel virtual address space using self-map entries in the
page directory, as shown in Fig. 11-32. By making page-directory entries point at
the page directory (the self-map), there are virtual addresses that can be used to
refer to page-directory entries (a) as well as page table entries (b). The self-map
occupies the same 8 MB of kernel virtual addresses for every process (on the x86).
For simplicity the figure shows the x86 self-map for 32-bit PTEs (Page-Table
Entries). Windows actually uses 64-bit PTEs so the system can make use of
more than 4 GB of physical memory. With 32-bit PTEs, the self-map uses only
one PDE (Page-Directory Entry) in the page directory, and thus occupies only 4
MB of addresses rather than 8 MB.

The Page Replacement Algorithm

When the number of free physical memory pages starts to get low, the memory
manager starts working to make more physical pages available by removing them
from user-mode processes as well as the system process, which represents kernel-
mode use of pages. The goal is to have the most important virtual pages present in
memory and the others on disk. The trick is in determining what important means.
In Windows this is answered by making heavy use of the working-set concept.
Each process (not each thread) has a working set. This set consists of the map-
ped-in pages that are in memory and thus can be referenced without a page fault.
The size and composition of the working set fluctuates as the process’ threads run,
of course.
Each process’ working set is described by two parameters: the minimum size
and the maximum size. These are not hard bounds, so a process may have fewer
pages in memory than its minimum or (under certain circumstances) more than its
maximum. Every process starts with the same minimum and maximum, but these
bounds can change over time, or can be determined by the job object for processes
contained in a job. The default initial minimum is in the range 20–50 pages and

[Figure 11-32 diagram: CR3 points to the page directory (PD); PD entry 0x300
points back at the PD itself.
Self-map: PD[0xc0300000 >> 22] is the PD (page directory).
Virtual address (a): (PTE *)0xc0300c00 points to PD[0x300], the self-map
page-directory entry.
Virtual address (b): (PTE *)0xc0390c84 points to the PTE for virtual
address 0xe4321000.]
Figure 11-32. The Windows self-map entries are used to map the physical pages
of the page tables and page directory into kernel virtual addresses (shown for
32-bit PTEs).

the default initial maximum is in the range 45–345 pages, depending on the total
amount of physical memory in the system. The system administrator can change
these defaults, however. While few home users will try, server admins might.
Working sets come into play only when the available physical memory is get-
ting low in the system. Otherwise processes are allowed to consume memory as
they choose, often far exceeding the working-set maximum. But when the system
comes under memory pressure, the memory manager starts to squeeze processes
back into their working sets, starting with processes that are over their maximum
by the most. There are three levels of activity by the working-set manager, all
of which are periodic, based on a timer. New activity is added at each level:
1. Lots of memory available: Scan pages resetting access bits and
using their values to represent the age of each page. Keep an estimate
of the unused pages in each working set.
2. Memory getting tight: For any process with a significant proportion
of unused pages, stop adding pages to the working set and start
replacing the oldest pages whenever a new page is needed. The re-
placed pages go to the standby or modified list.
3. Memory is tight: Trim (i.e., reduce) working sets to be below their
maximum by removing the oldest pages.

The working set manager runs every second, called from the balance set man-
ager thread. The working-set manager throttles the amount of work it does to keep
from overloading the system. It also monitors the writing of pages on the modified
list to disk to be sure that the list does not grow too large, waking the
ModifiedPageWriter thread as needed.

Physical Memory Management

Above we mentioned three different lists of physical pages, the free list, the
standby list, and the modified list. There is a fourth list which contains free pages
that have been zeroed. The system frequently needs pages that contain all zeros.
When new pages are given to processes, or the final partial page at the end of a file
is read, a zero page is needed. It is time consuming to write a page with zeros, so
it is better to create zero pages in the background using a low-priority thread.
There is also a fifth list used to hold pages that have been detected as having hard-
ware errors (i.e., through hardware error detection).
All pages in the system either are referenced by a valid page-table entry or are
on one of these five lists, which are collectively called the PFN database (Page
Frame Number database). Fig. 11-33 shows the structure of the PFN Database.
The table is indexed by physical page-frame number. The entries are fixed length,
but different formats are used for different kinds of entries (e.g., shared vs. private).
Valid entries maintain the page’s state and a count of how many page tables point
to the page, so that the system can tell when the page is no longer in use. For
pages that are in a working set, the entry records which working set references
them. There is also a pointer
to the process page table that points to the page (for nonshared pages) or to the
prototype page table (for shared pages).
Additionally there is a link to the next page on the list (if any), and various
other fields and flags, such as read in progress, write in progress, and so on. To
save space, the lists are linked together with fields referring to the next element by
its index within the table rather than pointers. The table entries for the physical
pages are also used to summarize the dirty bits found in the various page table en-
tries that point to the physical page (i.e., because of shared pages). There is also
information used to represent differences in memory pages on larger server sys-
tems which have memory that is faster from some processors than from others,
namely NUMA machines.
Pages are moved between the working sets and the various lists by the work-
ing-set manager and other system threads. Let us examine the transitions. When
the working-set manager removes a page from a working set, the page goes on the
bottom of the standby or modified list, depending on its state of cleanliness. This
transition is shown as (1) in Fig. 11-34.
Pages on both lists are still valid pages, so if a page fault occurs and one of
these pages is needed, it is removed from the list and faulted back into the working
set without any disk I/O (2). When a process exits, its nonshared pages cannot be

[Figure 11-33 diagram: the page-frame number database, indexed by physical
page-frame number. Each entry records a State (Active, Clean, Dirty, Free, or
Zeroed), a reference count (Cnt), the working set (WS) referencing the page,
other flags, a pointer to the page table (PT) that maps the page, and a Next
index linking entries on the same list. Separate list headers point to the
first entry of the standby, modified, free, and zeroed lists; Active entries
are referenced from page tables rather than any list.]
Figure 11-33. Some of the major fields in the page-frame database for a valid
page.

faulted back to it, so the valid pages in its page table and any of its pages on the
modified or standby lists go on the free list (3). Any pagefile space in use by the
process is also freed.
[Figure 11-34 diagram: pages move from the working sets to the modified or
standby list when evicted (1); a soft page fault returns a page from either
list to a working set (2); process exit moves pages to the free list (3); the
modified page writer thread cleans pages from the modified list to the standby
list (4); deallocation moves pages to the free list (5); a referenced free
page is handed to a working set (6); the ZeroPage thread moves free pages to
the zeroed page list (7); a zeroed page is consumed when a zero page is
needed (8). Pages with hardware errors go to a separate bad-memory page list.]
Figure 11-34. The various page lists and the transitions between them.

Other transitions are caused by other system threads. Every 4 seconds the bal-
ance set manager thread runs and looks for processes all of whose threads have
been idle for a certain number of seconds. If it finds any such processes, their

kernel stacks are unpinned from physical memory and their pages are moved to the
standby or modified lists, also shown as (1).
Two other system threads, the mapped page writer and the modified page
writer, wake up periodically to see if there are enough clean pages. If not, they
take pages from the top of the modified list, write them back to disk, and then
move them to the standby list (4). The former handles writes to mapped files and
the latter handles writes to the pagefiles. The result of these writes is to transform
modified (dirty) pages into standby (clean) pages.
The reason for having two threads is that a mapped file might have to grow as
a result of the write, and growing it requires access to on-disk data structures to al-
locate a free disk block. If there is no room in memory to bring them in when a
page has to be written, a deadlock could result. The other thread can solve the
problem by writing out pages to a paging file.
The other transitions in Fig. 11-34 are as follows. If a process unmaps a page,
the page is no longer associated with a process and can go on the free list (5), ex-
cept for the case that it is shared. When a page fault requires a page frame to hold
the page about to be read in, the page frame is taken from the free list (6), if pos-
sible. It does not matter that the page may still contain confidential information
because it is about to be overwritten in its entirety.
The situation is different when a stack grows. In that case, an empty page
frame is needed and the security rules require the page to contain all zeros. For
this reason, another kernel system thread, the ZeroPage thread, runs at the lowest
priority (see Fig. 11-26), erasing pages that are on the free list and putting them on
the zeroed page list (7). Whenever the CPU is idle and there are free pages, they
might as well be zeroed since a zeroed page is potentially more useful than a free
page and it costs nothing to zero the page when the CPU is idle.
The existence of all these lists leads to some subtle policy choices. For ex-
ample, suppose that a page has to be brought in from disk and the free list is empty.
The system is now forced to choose between taking a clean page from the standby
list (which might otherwise have been faulted back in later) or an empty page from
the zeroed page list (throwing away the work done in zeroing it). Which is better?
The memory manager has to decide how aggressively the system threads
should move pages from the modified list to the standby list. Having clean pages
around is better than having dirty pages around (since clean ones can be reused in-
stantly), but an aggressive cleaning policy means more disk I/O and there is some
chance that a newly cleaned page may be faulted back into a working set and dirt-
ied again anyway. In general, Windows resolves these kinds of trade-offs through
algorithms, heuristics, guesswork, historical precedent, rules of thumb, and
administrator-controlled parameter settings.
Modern Windows introduced an additional abstraction layer at the bottom of
the memory manager, called the store manager. This layer makes decisions about
how to optimize the I/O operations to the available backing stores. Persistent stor-
age systems include auxiliary flash memory and SSDs in addition to rotating disks.

The store manager optimizes where and how physical memory pages are backed
by the persistent stores in the system. It also implements optimization techniques
such as copy-on-write sharing of identical physical pages and compression of the
pages in the standby list to effectively increase the available RAM.
Another change in memory management in Modern Windows is the introduc-
tion of a swap file. Historically memory management in Windows has been based
on working sets, as described above. As memory pressure increases, the memory
manager squeezes on the working sets to reduce the footprint each process has in
memory. The modern application model introduces opportunities for new efficien-
cies. Since the process containing the foreground part of a modern application is
no longer given processor resources once the user has switched away, there is no
need for its pages to be resident. As memory pressure builds in the system, the
pages in the process may be removed as part of normal working-set management.
However, the process lifetime manager knows how long it has been since the user
switched to the application’s foreground process. When more memory is needed it
picks a process that has not run in a while and calls into the memory manager to
efficiently swap all the pages in a small number of I/O operations. The pages will
be written to the swap file by aggregating them into one or more large chunks.
This means that the entire process can also be restored in memory with fewer I/O
operations.
All in all, memory management is a highly complex executive component with
many data structures, algorithms, and heuristics. It attempts to be largely
self-tuning, but there are also many knobs that administrators can tweak to affect system
performance. A number of these knobs and the associated counters can be viewed
using tools in the various tool kits mentioned earlier. Probably the most important
thing to remember here is that memory management in real systems is a lot more
than just one simple paging algorithm like clock or aging.

11.6 CACHING IN WINDOWS


The Windows cache improves the performance of file systems by keeping
recently and frequently used regions of files in memory. Rather than cache physi-
cal addressed blocks from the disk, the cache manager manages virtually addressed
blocks, that is, regions of files. This approach fits well with the structure of the
native NT File System (NTFS), as we will see in Sec. 11.8. NTFS stores all of its
data as files, including the file-system metadata.
The cached regions of files are called views because they represent regions of
kernel virtual addresses that are mapped onto file-system files. Thus, the actual
management of the physical memory in the cache is provided by the memory man-
ager. The role of the cache manager is to manage the use of kernel virtual ad-
dresses for views, arrange with the memory manager to pin pages in physical
memory, and provide interfaces for the file systems.

The Windows cache-manager facilities are shared among all the file systems.
Because the cache is virtually addressed according to individual files, the cache
manager is easily able to perform read-ahead on a per-file basis. Requests to ac-
cess cached data come from each file system. Virtual caching is convenient be-
cause the file systems do not have to first translate file offsets into physical block
numbers before requesting a cached file page. Instead, the translation happens
later when the memory manager calls the file system to access the page on disk.
Besides management of the kernel virtual address and physical memory re-
sources used for caching, the cache manager also has to coordinate with file sys-
tems regarding issues like coherency of views, flushing to disk, and correct mainte-
nance of the end-of-file marks—particularly as files expand. One of the most dif-
ficult aspects of a file to manage between the file system, the cache manager, and
the memory manager is the offset of the last byte in the file, called the ValidData-
Length. If a program writes past the end of the file, the blocks that were skipped
have to be filled with zeros, and for security reasons it is critical that the Valid-
DataLength recorded in the file metadata not allow access to uninitialized blocks,
so the zero blocks have to be written to disk before the metadata is updated with
the new length. While it is expected that if the system crashes, some of the blocks
in the file might not have been updated from memory, it is not acceptable that some
of the blocks might contain data previously belonging to other files.
Let us now examine how the cache manager works. When a file is referenced,
the cache manager maps a 256-KB chunk of kernel virtual address space onto the
file. If the file is larger than 256 KB, only a portion of the file is mapped at a time.
If the cache manager runs out of 256-KB chunks of virtual address space, it must
unmap an old file before mapping in a new one. Once a file is mapped, the cache
manager can satisfy requests for its blocks by just copying from kernel virtual ad-
dress space to the user buffer. If the block to be copied is not in physical memory,
a page fault will occur and the memory manager will satisfy the fault in the usual
way. The cache manager is not even aware of whether the block was in memory or
not. The copy always succeeds.
The cache manager also works for pages that are mapped into virtual memory
and accessed with pointers rather than being copied between kernel and user-mode
buffers. When a thread accesses a virtual address mapped to a file and a page fault
occurs, the memory manager may in many cases be able to satisfy the access as a
soft fault. It does not need to access the disk, since it finds that the page is already
in physical memory because it is mapped by the cache manager.

11.7 INPUT/OUTPUT IN WINDOWS


The goals of the Windows I/O manager are to provide a fundamentally exten-
sive and flexible framework for efficiently handling a very wide variety of I/O de-
vices and services, support automatic device discovery and driver installation (plug

and play) and power management for devices and the CPU—all using a fundamen-
tally asynchronous structure that allows computation to overlap with I/O transfers.
There are many hundreds of thousands of devices that work with Windows. For a
large number of common devices it is not even necessary to install a driver, be-
cause there is already a driver that shipped with the Windows operating system.
But even so, counting all the revisions, there are almost a million distinct driver
binaries that run on Windows. In the following sections we will examine some of
the issues relating to I/O.

11.7.1 Fundamental Concepts

The I/O manager is on intimate terms with the plug-and-play manager. The
basic idea behind plug and play is that of an enumerable bus. Many buses, includ-
ing PC Card, PCI, PCIe, AGP, USB, IEEE 1394, EIDE, SCSI, and SATA, have
been designed so that the plug-and-play manager can send a request to each slot
and ask the device there to identify itself. Having discovered what is out there, the
plug-and-play manager allocates hardware resources, such as interrupt levels,
locates the appropriate drivers, and loads them into memory. As each driver is
loaded, a driver object is created for it. And then for each device, at least one de-
vice object is allocated. For some buses, such as SCSI, enumeration happens only
at boot time, but for other buses, such as USB, it can happen at any time, requiring
close cooperation between the plug-and-play manager, the bus drivers (which ac-
tually do the enumerating), and the I/O manager.
In Windows, all the file systems, antivirus filters, volume managers, network
protocol stacks, and even kernel services that have no associated hardware are im-
plemented using I/O drivers. The system configuration must be set to cause some
of these drivers to load, because there is no associated device to enumerate on the
bus. Others, like the file systems, are loaded by special code that detects they are
needed, such as the file-system recognizer that looks at a raw volume and deci-
phers what type of file system format it contains.
An interesting feature of Windows is its support for dynamic disks. These
disks may span multiple partitions and even multiple disks and may be reconfig-
ured on the fly, without even having to reboot. In this way, logical volumes are no
longer constrained to a single partition or even a single disk so that a single file
system may span multiple drives in a transparent way.
The I/O to volumes can be filtered by a special Windows driver to produce
Volume Shadow Copies. The filter driver creates a snapshot of the volume which
can be separately mounted and represents a volume at a previous point in time. It
does this by keeping track of changes after the snapshot point. This is very con-
venient for recovering files that were accidentally deleted, or traveling back in time
to see the state of a file at periodic snapshots made in the past.
But shadow copies are also valuable for making accurate backups of server
systems. The operating system works with server applications to have them reach

a convenient point for making a clean backup of their persistent state on the vol-
ume. Once all the applications are ready, the system initializes the snapshot of the
volume and then tells the applications that they can continue. The backup is made
of the volume state at the point of the snapshot. And the applications were only
blocked for a very short time rather than having to go offline for the duration of the
backup.
Applications participate in the snapshot process, so the backup reflects a state
that is easy to recover in case there is a future failure. Otherwise the backup might
still be useful, but the state it captured would look more like the state if the system
had crashed. Recovering from a system at the point of a crash can be more dif-
ficult or even impossible, since crashes occur at arbitrary times in the execution of
the application. Murphy’s Law says that crashes are most likely to occur at the
worst possible time, that is, when the application data is in a state where recovery
is impossible.
Another aspect of Windows is its support for asynchronous I/O. It is possible
for a thread to start an I/O operation and then continue executing in parallel with
the I/O. This feature is especially important on servers. There are various ways
the thread can find out that the I/O has completed. One is to specify an event ob-
ject at the time the call is made and then wait on it eventually. Another is to speci-
fy a queue to which a completion event will be posted by the system when the I/O
is done. A third is to provide a callback procedure that the system calls when the
I/O has completed. A fourth is to poll a location in memory that the I/O manager
updates when the I/O completes.
The final aspect that we will mention is prioritized I/O. I/O priority is deter-
mined by the priority of the issuing thread, or it can be explicitly set. There are
five priorities specified: critical, high, normal, low, and very low. Critical is re-
served for the memory manager to avoid deadlocks that could otherwise occur
when the system experiences extreme memory pressure. Low and very low priori-
ties are used by background processes, like the disk defragmentation service and
spyware scanners and desktop search, which are attempting to avoid interfering
with normal operations of the system. Most I/O gets normal priority, but multi-
media applications can mark their I/O as high to avoid glitches. Multimedia appli-
cations can alternatively use bandwidth reservation to request guaranteed band-
width to access time-critical files, like music or video. The I/O system will pro-
vide the application with the optimal transfer size and the number of outstanding
I/O operations that should be maintained to allow the I/O system to achieve the re-
quested bandwidth guarantee.

11.7.2 Input/Output API Calls

The system call APIs provided by the I/O manager are not very different from
those offered by most other operating systems. The basic operations are open,
read, write, ioctl, and close, but there are also plug-and-play and power operations,

operations for setting parameters, as well as calls for flushing system buffers, and
so on. At the Win32 layer these APIs are wrapped by interfaces that provide high-
er-level operations specific to particular devices. At the bottom, though, these
wrappers open devices and perform these basic types of operations. Even some
metadata operations, such as file rename, are implemented without specific system
calls. They just use a special version of the ioctl operations. This will make more
sense when we explain the implementation of I/O device stacks and the use of
IRPs by the I/O manager.

I/O system call                  Description

NtCreateFile                     Open new or existing files or devices
NtReadFile                       Read from a file or device
NtWriteFile                      Write to a file or device
NtQueryDirectoryFile             Request information about a directory, including files
NtQueryVolumeInformationFile     Request information about a volume
NtSetVolumeInformationFile       Modify volume information
NtNotifyChangeDirectoryFile      Complete when any file in the directory or subtree is modified
NtQueryInformationFile           Request information about a file
NtSetInformationFile             Modify file information
NtLockFile                       Lock a range of bytes in a file
NtUnlockFile                     Remove a range lock
NtFsControlFile                  Miscellaneous operations on a file
NtFlushBuffersFile               Flush in-memory file buffers to disk
NtCancelIoFile                   Cancel outstanding I/O operations on a file
NtDeviceIoControlFile            Special operations on a device

Figure 11-35. Native NT API calls for performing I/O.

The native NT I/O system calls, in keeping with the general philosophy of
Windows, take numerous parameters, and include many variations. Figure 11-35
lists the primary system-call interfaces to the I/O manager. NtCreateFile is used to
open existing or new files. It provides security descriptors for new files, a rich de-
scription of the access rights requested, and gives the creator of new files some
control over how blocks will be allocated. NtReadFile and NtWriteFile take a file
handle, buffer, and length. They also take an explicit file offset, and allow a key to
be specified for accessing locked ranges of bytes in the file. Most of the parame-
ters are related to specifying which of the different methods to use for reporting
completion of the (possibly asynchronous) I/O, as described above.
NtQueryDirectoryFile is an example of a standard paradigm in the executive
where various Query APIs exist to access or modify information about specific
types of objects. In this case, it is file objects that refer to directories. A parameter
specifies what type of information is being requested, such as a list of the names in

the directory or detailed information about each file that is needed for an extended
directory listing. Since this is really an I/O operation, all the standard ways of
reporting that the I/O completed are supported. NtQueryVolumeInformationFile is
like the directory query operation, but expects a file handle which represents an
open volume which may or may not contain a file system. Unlike for directories,
there are parameters that can be modified on volumes, and thus there is a separate
API NtSetVolumeInformationFile.
NtNotifyChangeDirectoryFile is an example of an interesting NT paradigm.
Threads can do I/O to determine whether any changes occur to objects (mainly
file-system directories, as in this case, or registry keys). Because the I/O is asyn-
chronous the thread returns and continues, and is only notified later when some-
thing is modified. The pending request is queued in the file system as an outstand-
ing I/O operation using an I/O Request Packet. Notifications are problematic if
you want to remove a file-system volume from the system, because the I/O opera-
tions are pending. So Windows supports facilities for canceling pending I/O oper-
ations, including support in the file system for forcibly dismounting a volume with
pending I/O.
NtQueryInformationFile is the file-specific version of the system call for direc-
tories. It has a companion system call, NtSetInformationFile. These interfaces ac-
cess and modify all sorts of information about file names, file features like en-
cryption and compression and sparseness, and other file attributes and details, in-
cluding looking up the internal file id or assigning a unique binary name (object id)
to a file.
These system calls are essentially a form of ioctl specific to files. The set oper-
ation can be used to rename or delete a file. But note that they take handles, not
file names, so a file first must be opened before being renamed or deleted. They
can also be used to rename the alternative data streams on NTFS (see Sec. 11.8).
Separate APIs, NtLockFile and NtUnlockFile, exist to set and remove byte-
range locks on files. NtCreateFile allows access to an entire file to be restricted by
using a sharing mode. An alternative is these lock APIs, which apply mandatory
access restrictions to a range of bytes in the file. Reads and writes must supply a
key matching the key provided to NtLockFile in order to operate on the locked
ranges.
Similar facilities exist in UNIX, but there it is discretionary whether applica-
tions heed the range locks. NtFsControlFile is much like the preceding Query and
Set operations, but is a more generic operation aimed at handling file-specific oper-
ations that do not fit within the other APIs. For example, some operations are spe-
cific to a particular file system.
Finally, there are miscellaneous calls such as NtFlushBuffersFile. Like the
UNIX sync call, it forces file-system data to be written back to disk. NtCancel-
IoFile cancels outstanding I/O requests for a particular file, and NtDeviceIoCon-
trolFile implements ioctl operations for devices. The list of operations is actually
much longer. There are system calls for deleting files by name, and for querying

the attributes of a specific file—but these are just wrappers around the other I/O
manager operations we have listed and did not really need to be implemented as
separate system calls. There are also system calls for dealing with I/O completion
ports, a queuing facility in Windows that helps multithreaded servers make ef-
ficient use of asynchronous I/O operations by readying threads on demand and
reducing the number of context switches required to service I/O on dedicated
threads.

11.7.3 Implementation of I/O

The Windows I/O system consists of the plug-and-play services, the device
power manager, the I/O manager, and the device-driver model. Plug-and-play
detects changes in hardware configuration and builds or tears down the device
stacks for each device, as well as causing the loading and unloading of device driv-
ers. The device power manager adjusts the power state of the I/O devices to reduce
system power consumption when devices are not in use. The I/O manager pro-
vides support for manipulating I/O kernel objects, and IRP-based operations like
IoCallDrivers and IoCompleteRequest. But most of the work required to support
Windows I/O is implemented by the device drivers themselves.

Device Drivers

To make sure that device drivers work well with the rest of Windows, Micro-
soft has defined the WDM (Windows Driver Model) that device drivers are ex-
pected to conform with. The WDK (Windows Driver Kit) contains docu-
mentation and examples to help developers produce drivers which conform to the
WDM. Most Windows drivers start out as copies of an appropriate sample driver
from the WDK, which is then modified by the driver writer.
Microsoft also provides a driver verifier which validates many of the actions
of drivers to be sure that they conform to the WDM requirements for the structure
and protocols for I/O requests, memory management, and so on. The verifier ships
with the system, and administrators can control it by running verifier.exe, which al-
lows them to configure which drivers are to be checked and how extensive (i.e., ex-
pensive) the checks should be.
Even with all the support for driver development and verification, it is still very
difficult to write even simple drivers in Windows, so Microsoft has built a system
of wrappers called the WDF (Windows Driver Foundation) that runs on top of
WDM and simplifies many of the more common requirements, mostly related to
correct interaction with device power management and plug-and-play operations.
To further simplify driver writing, as well as increase the robustness of the sys-
tem, WDF includes the UMDF (User-Mode Driver Framework) for writing driv-
ers as services that execute in processes. And there is the KMDF (Kernel-Mode
Driver Framework) for writing drivers as services that execute in the kernel, but
with many of the details of WDM made automagical. Since underneath it is the
WDM that provides the driver model, that is what we will focus on in this section.
Devices in Windows are represented by device objects. Device objects are also
used to represent hardware, such as buses, as well as software abstractions like file
systems, network protocol engines, and kernel extensions, such as antivirus filter
drivers. All these are organized by producing what Windows calls a device stack,
as previously shown in Fig. 11-14.
I/O operations are initiated by the I/O manager calling an executive API
IoCallDriver with pointers to the top device object and to the IRP representing the
I/O request. This routine finds the driver object associated with the device object.
The operation types that are specified in the IRP generally correspond to the I/O
manager system calls described above, such as create, read, and close.
Figure 11-36 shows the relationships for a single level of the device stack. For
each of these operations a driver must specify an entry point. IoCallDriver takes the
operation type out of the IRP, uses the device object at the current level of the de-
vice stack to find the driver object, and indexes into the driver dispatch table with
the operation type to find the corresponding entry point into the driver. The driver
is then called and passed the device object and the IRP.

[Figure 11-36 sketch: a device object holds instance data, a pointer to the next device object in the stack, and a pointer to its driver object; the driver object points to the loaded driver code and contains the dispatch table, with entry points for CREATE, READ, WRITE, FLUSH, IOCTL, CLEANUP, and CLOSE.]

Figure 11-36. A single level in a device stack.

Once a driver has finished processing the request represented by the IRP, it has
three options. It can call IoCallDriver again, passing the IRP and the next device
object in the device stack. It can declare the I/O request to be completed and re-
turn to its caller. Or it can queue the IRP internally and return to its caller, having
declared that the I/O request is still pending. This latter case results in an asyn-
chronous I/O operation, at least if all the drivers above in the stack agree and also
return to their callers.
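The dispatch and completion flow just described can be sketched as a toy model. All names and structures below are invented for illustration; real WDM drivers work with IRPs, driver objects, and IoCallDriver/IoCompleteRequest rather than these Python classes.

```python
# Toy device stack: each driver has a dispatch table indexed by operation
# type; a driver may pass the IRP down or declare it complete. Completion
# routines recorded on the way down are visited in reverse (bottom-up) order.

class Irp:
    def __init__(self, op):
        self.op = op                   # operation type, e.g. "read"
        self.completion_routines = []  # pushed on the way down the stack
        self.status = None

class Driver:
    def __init__(self, name, handler):
        self.name = name
        self.dispatch = {"read": handler}   # dispatch table for this driver

def io_call_driver(stack, level, irp):
    """Find the driver at this level and invoke its entry point for irp.op."""
    return stack[level].dispatch[irp.op](stack, level, irp)

def passthrough_read(stack, level, irp):
    # Option 1: register a completion routine, pass the IRP to the next
    # device object in the stack.
    irp.completion_routines.append(stack[level].name)
    return io_call_driver(stack, level + 1, irp)

def bottom_read(stack, level, irp):
    # Option 2: declare the request complete. Completion then walks the
    # recorded routines in reverse order, simulating IoCompleteRequest.
    irp.status = "completed"
    irp.completed_by = list(reversed(irp.completion_routines))
    return irp.status

stack = [Driver("filter", passthrough_read),
         Driver("function", passthrough_read),
         Driver("bus", bottom_read)]
irp = Irp("read")
status = io_call_driver(stack, 0, irp)   # "completed"
```

The third option in the text, pending the IRP, would correspond to the bottom handler returning before setting `status`, leaving the request to be completed later from an interrupt or worker thread.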

I/O Request Packets

Figure 11-37 shows the major fields in the IRP. The bottom of the IRP is a dy-
namically sized array containing fields that can be used by each driver for the de-
vice stack handling the request. These stack fields also allow a driver to specify
the routine to call when completing an I/O request. During completion each level
of the device stack is visited in reverse order, and the completion routine assigned
by each driver is called in turn. At each level the driver can continue to complete
the request or decide there is still more work to do and leave the request pending,
suspending the I/O completion for the time being.
[Figure 11-37 sketch: the IRP contains flags, an operation code, buffer pointers (kernel and user buffer addresses), the head of the memory descriptor list (a chain of MDLs), a link to the next IRP and to the thread's IRP chain, completion/cancel information, thread and driver-queuing fields, the APC block, and, at the bottom, the dynamically sized per-driver stack data.]

Figure 11-37. The major fields of an I/O Request Packet.

When allocating an IRP, the I/O manager has to know how deep the particular
device stack is so that it can allocate a sufficiently large IRP. It keeps track of the
stack depth in a field in each device object as the device stack is formed. Note that
there is no formal definition of what the next device object is in any stack. That
information is held in private data structures belonging to the previous driver on
the stack. In fact, the stack does not really have to be a stack at all. At any layer a
driver is free to allocate new IRPs, continue to use the original IRP, send an I/O op-
eration to a different device stack, or even switch to a system worker thread to con-
tinue execution.
The IRP contains flags, an operation code for indexing into the driver dispatch
table, buffer pointers for possibly both kernel and user buffers, and a list of MDLs
(Memory Descriptor Lists) which are used to describe the physical pages repres-
ented by the buffers, that is, for DMA operations. There are fields used for cancel-
lation and completion operations. The fields in the IRP that are used to queue the
IRP to devices while it is being processed are reused when the I/O operation has
finally completed to provide memory for the APC control object used to call the
I/O manager’s completion routine in the context of the original thread. There is
also a link field used to link all the outstanding IRPs to the thread that initiated
them.

Device Stacks

A driver in Windows may do all the work by itself, as the printer driver does in
Fig. 11-38. On the other hand, drivers may also be stacked, which means that a re-
quest may pass through a sequence of drivers, each doing part of the work. Two
stacked drivers are also illustrated in Fig. 11-38.
[Figure 11-38 sketch: a user program in a user process calls Win32, which enters the rest of Windows. Three driver configurations lead down to controllers above the hardware abstraction layer: a monolithic driver, a function driver stacked on a bus driver, and a full stack of filter, function, and bus drivers.]

Figure 11-38. Windows allows drivers to be stacked to work with a specific instance of a device. The stacking is represented by device objects.

One common use for stacked drivers is to separate the bus management from
the functional work of controlling the device. Bus management on the PCI bus is
quite complicated on account of many kinds of modes and bus transactions. By
separating this work from the device-specific part, driver writers are freed from
learning how to control the bus. They can just use the standard bus driver in their
stack. Similarly, USB and SCSI drivers have a device-specific part and a generic
part, with common drivers being supplied by Windows for the generic part.
Another use of stacking drivers is to be able to insert filter drivers into the
stack. We have already looked at the use of file-system filter drivers, which are in-
serted above the file system. Filter drivers are also used for managing physical
hardware. A filter driver performs some transformation on the operations as the
IRP flows down the device stack, as well as during the completion operation, as
the IRP flows back up through the completion routines each driver specified. For
example, a filter driver could compress data on the way to the disk or encrypt data
on the way to the network. Putting the filter here means that neither the applica-
tion program nor the true device driver has to be aware of it, and it works automat-
ically for all data going to (or coming from) the device.
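The transparency of a filter driver can be sketched with a trivial two-level model. Here zlib compression stands in for whatever transformation a real filter would apply; the function names and the dictionary "disk" are invented for illustration.

```python
# A compression filter layered above a storage driver: the filter transforms
# data on the way down and undoes the transformation on the way back up.
# Neither the caller above nor the disk driver below is aware of it.
import zlib

def disk_write(store, data):      # lowest driver: just stores raw bytes
    store["blocks"] = data

def disk_read(store):
    return store["blocks"]

def filter_write(store, data):    # filter compresses on the way down
    disk_write(store, zlib.compress(data))

def filter_read(store):           # and decompresses on the way back up
    return zlib.decompress(disk_read(store))

store = {}
payload = b"NTFS" * 1000
filter_write(store, payload)
round_trip = filter_read(store)   # caller sees the original bytes
```

The application reads back exactly what it wrote, while fewer bytes reach the "disk"; swapping the zlib calls for an encryption routine would model the network-encryption example from the text equally well.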
Kernel-mode device drivers are a serious problem for the reliability and stabil-
ity of Windows. Most of the kernel crashes in Windows are due to bugs in device
drivers. Because kernel-mode device drivers all share the same address space with
the kernel and executive layers, errors in the drivers can corrupt system data struc-
tures, or worse. Some of these bugs are due to the astonishingly large numbers of
device drivers that exist for Windows, or to the development of drivers by less-
experienced system programmers. The bugs are also due to the enormous amount
of detail involved in writing a correct driver for Windows.
The I/O model is powerful and flexible, but all I/O is fundamentally asynchro-
nous, so race conditions can abound. Windows 2000 added the plug-and-play and
device power management facilities from the Win9x systems to the NT-based Win-
dows for the first time. This put a large number of requirements on drivers to deal
correctly with devices coming and going while I/O packets are in the middle of
being processed. Users of PCs frequently dock/undock devices, close the lid and
toss notebooks into briefcases, and generally do not worry about whether the little
green activity light happens to still be on. Writing device drivers that function cor-
rectly in this environment can be very challenging, which is why WDF was devel-
oped to simplify the Windows Driver Model.
Many books are available about the Windows Driver Model and the newer
Windows Driver Foundation (Kanetkar, 2008; Orwick & Smith, 2007; Reeves,
2010; Viscarola et al., 2007; and Vostokov, 2009).

11.8 THE WINDOWS NT FILE SYSTEM


Windows supports several file systems, the most important of which are
FAT-16, FAT-32, and NTFS (NT File System). FAT-16 is the old MS-DOS file
system. It uses 16-bit disk addresses, which limits it to disk partitions no larger
than 2 GB. Mostly it is used to access floppy disks, for those customers that still
use them. FAT-32 uses 32-bit disk addresses and supports disk partitions up to 2
TB. There is no security in FAT-32 and today it is really used only for tran-
sportable media, like flash drives. NTFS is the file system developed specifically
for the NT version of Windows. Starting with Windows XP it became the default
file system installed by most computer manufacturers, greatly improving the secu-
rity and functionality of Windows. NTFS uses 64-bit disk addresses and can (theo-
retically) support disk partitions up to 2^64 bytes, although other considerations
limit it to smaller sizes.
In this chapter we will examine the NTFS file system because it is a modern
one with many interesting features and design innovations. It is large and complex
and space limitations prevent us from covering all of its features, but the material
presented below should give a reasonable impression of it.

11.8.1 Fundamental Concepts

Individual file names in NTFS are limited to 255 characters; full paths are lim-
ited to 32,767 characters. File names are in Unicode, allowing people in countries
not using the Latin alphabet (e.g., Greece, Japan, India, Russia, and Israel) to write
file names in their native language. For example, φιλε is a perfectly legal file
name. NTFS fully supports case-sensitive names (so foo is different from Foo and
FOO). The Win32 API does not fully support case sensitivity for file names and
does not support it at all for directory names. The support for case sensitivity exists when running
the POSIX subsystem in order to maintain compatibility with UNIX. Win32 is not
case sensitive, but it is case preserving, so file names can have different case letters
in them. Though case sensitivity is a feature that is very familiar to users of UNIX,
it is largely inconvenient to ordinary users who do not make such distinctions nor-
mally. For example, the Internet is largely case-insensitive today.
An NTFS file is not just a linear sequence of bytes, as FAT-32 and UNIX files
are. Instead, a file consists of multiple attributes, each represented by a stream of
bytes. Most files have a few short streams, such as the name of the file and its
64-bit object ID, plus one long (unnamed) stream with the data. However, a file
can have two or more (long) data streams as well. Each stream has a name
consisting of the file name, a colon, and the stream name, as in foo:stream1. Each
stream has its own size and is lockable independently of all the other streams. The
idea of multiple streams in a file is not new in NTFS. The file system on the Apple
Macintosh uses two streams per file, the data fork and the resource fork. The first
use of multiple streams for NTFS was to allow an NT file server to serve Macin-
tosh clients. Multiple data streams are also used to represent metadata about files,
such as the thumbnail pictures of JPEG images that are available in the Windows
GUI. But alas, the multiple data streams are fragile and frequently fall off files
when they are transported to other file systems, transported over the network, or
even when backed up and later restored, because many utilities ignore them.
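The file:stream naming convention can be illustrated with a toy parser. This is a sketch only; a real parser must also cope with drive-letter colons (C:) and the trailing :$DATA type tag discussed later in the section.

```python
# Toy parser for the name:stream syntax, e.g. foo:stream1. An empty stream
# name denotes the default (unnamed) data stream.

def parse_stream_name(name):
    """Split 'file' or 'file:stream' into (file, stream)."""
    file_part, _sep, stream = name.partition(":")
    return file_part, stream       # stream == "" means the default stream

parse_stream_name("foo:stream1")   # ('foo', 'stream1')
parse_stream_name("foo")           # ('foo', '')
```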
NTFS is a hierarchical file system, similar to the UNIX file system. The sepa-
rator between component names is ‘‘ \’’, however, instead of ‘‘/’’, a fossil inherited
from the compatibility requirements with CP/M when MS-DOS was created
(CP/M used the slash for flags). Unlike in UNIX, the concept of the current working
directory and the hard links to the current directory (.) and the parent directory (..) are im-
plemented as conventions rather than as a fundamental part of the file-system de-
sign. Hard links are supported, but used only for the POSIX subsystem, as is
NTFS support for traversal checking on directories (the ‘x’ permission in UNIX).
Symbolic links are supported in NTFS. Creation of symbolic links is nor-
mally restricted to administrators to avoid security issues like spoofing, as UNIX
experienced when symbolic links were first introduced in 4.2BSD. The imple-
mentation of symbolic links uses an NTFS feature called reparse points (dis-
cussed later in this section). In addition, compression, encryption, fault tolerance,
journaling, and sparse files are also supported. These features and their imple-
mentations will be discussed shortly.

11.8.2 Implementation of the NT File System

NTFS is a highly complex and sophisticated file system that was developed
specifically for NT as an alternative to the HPFS file system that had been devel-
oped for OS/2. While most of NT was designed on dry land, NTFS is unique
among the components of the operating system in that much of its original design
took place aboard a sailboat out on the Puget Sound (following a strict protocol of
work in the morning, beer in the afternoon). Below we will examine a number of
features of NTFS, starting with its structure, then moving on to file-name lookup,
file compression, journaling, and file encryption.

File System Structure

Each NTFS volume (e.g., disk partition) contains files, directories, bitmaps,
and other data structures. Each volume is organized as a linear sequence of blocks
(clusters in Microsoft’s terminology), with the block size being fixed for each vol-
ume and ranging from 512 bytes to 64 KB, depending on the volume size. Most
NTFS disks use 4-KB blocks as a compromise between large blocks (for efficient
transfers) and small blocks (for low internal fragmentation). Blocks are referred to
by their offset from the start of the volume using 64-bit numbers.
The principal data structure in each volume is the MFT (Master File Table),
which is a linear sequence of fixed-size 1-KB records. Each MFT record describes
one file or one directory. It contains the file’s attributes, such as its name and time-
stamps, and the list of disk addresses where its blocks are located. If a file is ex-
tremely large, it is sometimes necessary to use two or more MFT records to con-
tain the list of all the blocks, in which case the first MFT record, called the base
record, points to the additional MFT records. This overflow scheme dates back to
CP/M, where each directory entry was called an extent. A bitmap keeps track of
which MFT entries are free.
The MFT is itself a file and as such can be placed anywhere within the volume,
thus eliminating the problem with defective sectors in the first track. Furthermore,
the file can grow as needed, up to a maximum size of 2^48 records.
The MFT is shown in Fig. 11-39. Each MFT record consists of a sequence of
(attribute header, value) pairs. Each attribute begins with a header telling which
attribute this is and how long the value is. Some attribute values are variable
length, such as the file name and the data. If the attribute value is short enough to
fit in the MFT record, it is placed there. If it is too long, it is placed elsewhere on
the disk and a pointer to it is placed in the MFT record. This makes NTFS very ef-
ficient for small files, that is, those that can fit within the MFT record itself.
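The resident/nonresident placement decision can be sketched as follows. All sizes here (record header, attribute header, run-information overhead) are assumptions for illustration; real NTFS accounting is considerably more involved.

```python
# Toy model of laying attribute values into a fixed 1-KB MFT record: a value
# that fits goes in the record (resident); one that does not is spilled to
# separate disk blocks and only a pointer stays behind (nonresident).

MFT_RECORD_SIZE = 1024
RECORD_HEADER = 56    # assumed fixed record-header overhead
ATTR_HEADER = 24      # header size for a resident attribute
RUN_INFO = 16         # assumed size of the pointer to an external value

def place_attributes(values):
    """values: list of (name, size) pairs; returns (resident, nonresident)."""
    used = RECORD_HEADER
    resident, nonresident = [], []
    for name, size in values:
        if used + ATTR_HEADER + size <= MFT_RECORD_SIZE:
            resident.append(name)
            used += ATTR_HEADER + size
        else:
            nonresident.append(name)       # value lives elsewhere on disk
            used += ATTR_HEADER + RUN_INFO
    return resident, nonresident

res, nonres = place_attributes(
    [("std info", 72), ("file name", 60), ("data", 5000)])
# A 5000-byte data value cannot fit; a 300-byte one would stay resident.
```

This is why NTFS handles small files so efficiently: a few-hundred-byte data stream needs no disk blocks of its own at all.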
The first 16 MFT records are reserved for NTFS metadata files, as illustrated
in Fig. 11-39. Each record describes a normal file that has attributes and data
blocks, just like any other file. Each of these files has a name that begins with a
dollar sign to indicate that it is a metadata file. The first record describes the MFT
file itself. In particular, it tells where the blocks of the MFT file are located so that
the system can find the MFT file. Clearly, Windows needs a way to find the first
block of the MFT file in order to find the rest of the file-system information. The
way it finds the first block of the MFT file is to look in the boot block, where its
address is installed when the volume is formatted with the file system.

16   (First user file)
15   (Reserved for future use)
14   (Reserved for future use)
13   (Reserved for future use)
12   (Reserved for future use)
11   $Extend    Extensions: quotas, etc.
10   $Upcase    Case conversion table
 9   $Secure    Security descriptors for all files
 8   $BadClus   List of bad blocks
 7   $Boot      Bootstrap loader
 6   $Bitmap    Bitmap of blocks used
 5   $          Root directory
 4   $AttrDef   Attribute definitions
 3   $Volume    Volume file
 2   $LogFile   Log file used for recovery
 1   $MftMirr   Mirror copy of the MFT
 0   $Mft       Master File Table

(Each MFT record is 1 KB; records 0 through 15 are the metadata files.)

Figure 11-39. The NTFS master file table.


Record 1 is a duplicate of the early portion of the MFT file. This information
is so precious that having a second copy can be critical in the event one of the first
blocks of the MFT ever becomes unreadable. Record 2 is the log file. When struc-
tural changes are made to the file system, such as adding a new directory or remov-
ing an existing one, the action is logged here before it is performed, in order to in-
crease the chance of correct recovery in the event of a failure during the operation,
such as a system crash. Changes to file attributes are also logged here. In fact, the
only changes not logged here are changes to user data. Record 3 contains infor-
mation about the volume, such as its size, label, and version.
As mentioned above, each MFT record contains a sequence of (attribute head-
er, value) pairs. The $AttrDef file is where the attributes are defined. Information
about this file is in MFT record 4. Next comes the root directory, which itself is a
file and can grow to arbitrary length. It is described by MFT record 5.
Free space on the volume is kept track of with a bitmap. The bitmap is itself a
file, and its attributes and disk addresses are given in MFT record 6. The next
MFT record points to the bootstrap loader file. Record 8 is used to link all the bad
blocks together to make sure they never occur in a file. Record 9 contains the se-
curity information. Record 10 is used for case mapping. For the Latin letters A-Z
case mapping is obvious (at least for people who speak Latin). Case mapping for
other languages, such as Greek, Armenian, or Georgian (the country, not the state),
is less obvious to Latin speakers, so this file tells how to do it. Finally, record 11 is
a directory containing miscellaneous files for things like disk quotas, object identi-
fiers, reparse points, and so on. The last four MFT records are reserved for future
use.
Each MFT record consists of a record header followed by the (attribute header,
value) pairs. The record header contains a magic number used for validity check-
ing, a sequence number updated each time the record is reused for a new file, a
count of references to the file, the actual number of bytes in the record used, the
identifier (index, sequence number) of the base record (used only for extension
records), and some other miscellaneous fields.
NTFS defines 13 attributes that can appear in MFT records. These are listed in
Fig. 11-40. Each attribute header identifies the attribute and gives the length and
location of the value field along with a variety of flags and other information.
Usually, attribute values follow their attribute headers directly, but if a value is too
long to fit in the MFT record, it may be put in separate disk blocks. Such an
attribute is said to be a nonresident attribute. The data attribute is an obvious
candidate. Some attributes, such as the name, may be repeated, but all attributes
must appear in a fixed order in the MFT record. The headers for resident attributes
are 24 bytes long; those for nonresident attributes are longer because they contain
information about where to find the attribute on disk.
The standard information field contains the file owner, security information,
the timestamps needed by POSIX, the hard-link count, the read-only and archive
bits, and so on. It is a fixed-length field and is always present. The file name is a
Attribute              Description
Standard information   Flag bits, timestamps, etc.
File name              File name in Unicode; may be repeated for MS-DOS name
Security descriptor    Obsolete. Security information is now in $Extend$Secure
Attribute list         Location of additional MFT records, if needed
Object ID              64-bit file identifier unique to this volume
Reparse point          Used for mounting and symbolic links
Volume name            Name of this volume (used only in $Volume)
Volume information     Volume version (used only in $Volume)
Index root             Used for directories
Index allocation       Used for very large directories
Bitmap                 Used for very large directories
Logged utility stream  Controls logging to $LogFile
Data                   Stream data; may be repeated

Figure 11-40. The attributes used in MFT records.

variable-length Unicode string. In order to make files with non–MS-DOS names
accessible to old 16-bit programs, files can also have an 8 + 3 MS-DOS short
name. If the actual file name conforms to the MS-DOS 8 + 3 naming rule, a sec-
ondary MS-DOS name is not needed.
In NT 4.0, security information was put in an attribute, but in Windows 2000
and later, security information all goes into a single file so that multiple files can
share the same security descriptions. This results in significant savings in space
within most MFT records and in the file system overall because the security info
for so many of the files owned by each user is identical.
The attribute list is needed in case the attributes do not fit in the MFT record.
This attribute then tells where to find the extension records. Each entry in the list
contains a 48-bit index into the MFT telling where the extension record is and a
16-bit sequence number to allow verification that the extension record and base
records match up.
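The pairing of a 48-bit MFT index with a 16-bit sequence number can be sketched as a packed 64-bit reference. The exact bit layout below is an assumption for illustration, not NTFS's on-disk format; the point is that a stale reference is detected when the stored sequence number no longer matches the record's.

```python
# Toy packing of an MFT reference: 48-bit record index plus 16-bit sequence
# number in one 64-bit word (layout assumed for illustration).

def pack_mft_ref(index, seq):
    assert 0 <= index < (1 << 48) and 0 <= seq < (1 << 16)
    return (seq << 48) | index

def unpack_mft_ref(ref):
    return ref & ((1 << 48) - 1), ref >> 48

ref = pack_mft_ref(102, 7)
index, seq = unpack_mft_ref(ref)   # (102, 7)
# If record 102 is later reused for a new file, its sequence number is
# bumped, so comparing seq against the record detects the stale reference.
```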
NTFS files have an ID associated with them that is like the i-node number in
UNIX. Files can be opened by ID, but the IDs assigned by NTFS are not always
useful when the ID must be persisted because it is based on the MFT record and
can change if the record for the file moves (e.g., if the file is restored from backup).
NTFS allows a separate object ID attribute which can be set on a file and never
needs to change. It can be kept with the file if it is copied to a new volume, for ex-
ample.
The reparse point tells the procedure parsing the file name that it has to do some-
thing special. This mechanism is used for explicitly mounting file systems and for
symbolic links. The two volume attributes are used only for volume identification.
The next three attributes deal with how directories are implemented. Small ones
are just lists of files but large ones are implemented using B+ trees. The logged
utility stream attribute is used by the encrypting file system.
Finally, we come to the attribute that is the most important of all: the data
stream (or in some cases, streams). An NTFS file has one or more data streams as-
sociated with it. This is where the payload is. The default data stream is
unnamed (i.e., dirpath\filename::$DATA), but the alternate data streams each
have a name, for example, dirpath\filename:streamname:$DATA.
For each stream, the stream name, if present, goes in this attribute header. Fol-
lowing the header is either a list of disk addresses telling which blocks the stream
contains, or for streams of only a few hundred bytes (and there are many of these),
the stream itself. Putting the actual stream data in the MFT record is called an
immediate file (Mullender and Tanenbaum, 1984).
Of course, most of the time the data does not fit in the MFT record, so this
attribute is usually nonresident. Let us now take a look at how NTFS keeps track
of the location of nonresident attributes, in particular data.

Storage Allocation

The model for keeping track of disk blocks is that they are assigned in runs of
consecutive blocks, where possible, for efficiency reasons. For example, if the first
logical block of a stream is placed in block 20 on the disk, then the system will try
hard to place the second logical block in block 21, the third logical block in 22,
and so on. One way to achieve these runs is to allocate disk storage several blocks
at a time, when possible.
The blocks in a stream are described by a sequence of records, each one
describing a sequence of logically contiguous blocks. For a stream with no holes
in it, there will be only one such record. Streams that are written in order from be-
ginning to end all belong in this category. For a stream with one hole in it (e.g.,
only blocks 0–49 and blocks 60–79 are defined), there will be two records. Such a
stream could be produced by writing the first 50 blocks, then seeking forward to
logical block 60 and writing another 20 blocks. When a hole is read back, all the
missing bytes are zeros. Files with holes are called sparse files.
Each record begins with a header giving the offset of the first block within the
stream. Next comes the offset of the first block not covered by the record. In the
example above, the first record would have a header of (0, 50) and would provide
the disk addresses for these 50 blocks. The second one would have a header of
(60, 80) and would provide the disk addresses for these 20 blocks.
Each record header is followed by one or more pairs, each giving a disk ad-
dress and run length. The disk address is the offset of the disk block from the start
of its partition; the run length is the number of blocks in the run. As many pairs as
needed can be in the run record. Use of this scheme for a three-run, nine-block
stream is illustrated in Fig. 11-41.
[Figure 11-41 sketch: an MFT record holding the standard info, file name, and data attribute headers. The data run record consists of a header (0, 9) followed by three (disk address, run length) pairs: (20, 4), (64, 2), and (80, 3), pointing at disk blocks 20-23, 64-65, and 80-82; the rest of the record is unused.]

Figure 11-41. An MFT record for a three-run, nine-block stream.

In this figure we have an MFT record for a short stream of nine blocks (header
0–8). It consists of the three runs of consecutive blocks on the disk. The first run
is blocks 20–23, the second is blocks 64–65, and the third is blocks 80–82. Each
of these runs is recorded in the MFT record as a (disk address, block count) pair.
How many runs there are depends on how well the disk block allocator did in find-
ing runs of consecutive blocks when the stream was created. For an n-block
stream, the number of runs can be anything from 1 through n.
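Translating a logical block number to a disk block through such a run list can be sketched directly. The record structure below mirrors the (header, run pairs) scheme from the text; the disk addresses are invented for the example.

```python
# Toy run-list lookup: each record has a header (first block covered, first
# block not covered) and a list of (disk address, run length) pairs. A block
# that falls in no record is a hole and reads back as zeros.

def block_to_disk(records, logical_block):
    """records: list of ((start, end), [(disk_addr, run_len), ...])."""
    for (start, end), runs in records:
        if start <= logical_block < end:
            offset = logical_block - start
            for disk_addr, run_len in runs:
                if offset < run_len:
                    return disk_addr + offset
                offset -= run_len
    return None   # hole: the caller supplies zero-filled data

# The sparse stream from the text: only blocks 0-49 and 60-79 are defined.
records = [((0, 50), [(20, 30), (100, 20)]),
           ((60, 80), [(200, 20)])]
block_to_disk(records, 0)    # 20
block_to_disk(records, 35)   # 105
block_to_disk(records, 55)   # None (falls in the hole)
block_to_disk(records, 60)   # 200
```

Note how the (start, end) header lets the lookup skip a whole record without scanning its run pairs, which is exactly the seek-efficiency argument the text makes for "overspecifying" the end of each run sequence.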
Several comments are worth making here. First, there is no upper limit to the
size of streams that can be represented this way. In the absence of address com-
pression, each pair requires two 64-bit numbers, for a total of 16 bytes.
However, a pair could represent 1 million or more consecutive disk blocks. In fact,
a 20-MB stream consisting of 20 separate runs of 1 million 1-KB blocks each fits
easily in one MFT record, whereas a 60-KB stream scattered into 60 isolated
blocks does not.
Second, while the straightforward way of representing each pair takes 2 × 8
bytes, a compression method is available to reduce the size of the pairs below 16.
Many disk addresses have multiple high-order zero-bytes. These can be omitted.
The data header tells how many are omitted, that is, how many bytes are actually
used per address. Other kinds of compression are also used. In practice, the pairs
are often only 4 bytes.
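The idea of dropping high-order zero bytes can be sketched in a few lines. The encoding below (little-endian bytes plus a small length header) is an assumption for illustration, not NTFS's exact on-disk run encoding.

```python
# Toy address compression: store only the bytes of a disk address that are
# actually needed, recording the count in a small header.

def compress_addr(addr):
    n = max(1, (addr.bit_length() + 7) // 8)   # bytes actually needed
    return n, addr.to_bytes(n, "little")

def decompress_addr(n, data):
    return int.from_bytes(data[:n], "little")

n, data = compress_addr(20)   # a small address needs 1 byte, not 8
addr = decompress_addr(n, data)
```

A disk address like 20 shrinks from 8 bytes to 1, which is how a (disk address, run length) pair can end up occupying only about 4 bytes in practice.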
Our first example was easy: all the file information fit in one MFT record.
What happens if the file is so large or highly fragmented that the block information
does not fit in one MFT record? The answer is simple: use two or more MFT
records. In Fig. 11-42 we see a file whose base record is in MFT record 102. It
has too many runs for one MFT record, so it computes how many extension
records it needs, say, two, and puts their indices in the base record. The rest of the
record is used for the first k data runs.
[Figure 11-42 sketch: MFT records 100 through 109. The base record, 102, contains runs #1 through #k plus pointers to extension records 105 and 108. The first extension record, 105, holds runs #k+1 through #m; the second extension record, 108, holds runs #m+1 through #n.]

Figure 11-42. A file that requires three MFT records to store all its runs.

Note that Fig. 11-42 contains some redundancy. In theory, it should not be
necessary to specify the end of a sequence of runs because this information can be
calculated from the run pairs. The reason for ‘‘overspecifying’’ this information is
to make seeking more efficient: to find the block at a given file offset, it is neces-
sary to examine only the record headers, not the run pairs.
When all the space in record 102 has been used up, storage of the runs con-
tinues with MFT record 105. As many runs are packed in this record as fit. When
this record is also full, the rest of the runs go in MFT record 108. In this way,
many MFT records can be used to handle large fragmented files.
A problem arises if so many MFT records are needed that there is no room in
the base MFT to list all their indices. There is also a solution to this problem: the
list of extension MFT records is made nonresident (i.e., stored in other disk blocks
instead of in the base MFT record). Then it can grow as large as needed.
An MFT entry for a small directory is shown in Fig. 11-43. The record con-
tains a number of directory entries, each of which describes one file or directory.
Each entry has a fixed-length structure followed by a variable-length file name.
The fixed part contains the index of the MFT entry for the file, the length of the file
name, and a variety of other fields and flags. Looking for an entry in a directory
consists of examining all the file names in turn.
Large directories use a different format. Instead of listing the files linearly, a
B+ tree is used to make alphabetical lookup possible and to make it easy to insert
new names in the directory in the proper place.
The NTFS parsing of the path \ foo \ bar begins at the root directory for C:,
whose blocks can be found from entry 5 in the MFT (see Fig. 11-39). The string
‘‘foo’’ is looked up in the root directory, which returns the index into the MFT for
the directory foo. This directory is then searched for the string ‘‘bar’’, which refers
to the MFT record for this file. NTFS performs access checks by calling back into
the security reference monitor, and if everything is cool it searches the MFT record
for the attribute ::$DATA, which is the default data stream.
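The lookup steps above can be sketched in a toy model (Python stand-ins with a made-up in-memory MFT; every record number except the root's entry 5 is invented):

```python
# Toy model (not real NTFS code) of the lookup described above: start at the
# root directory, MFT entry 5, and resolve each path component by scanning
# the directory's entries for a matching name.

MFT = {
    5:   {"type": "dir",  "entries": {"foo": 40}},   # root directory of C:
    40:  {"type": "dir",  "entries": {"bar": 102}},
    102: {"type": "file", "data": b"contents of bar"},
}

def lookup(path):
    """Resolve a path like r'\foo\bar' to its MFT record index."""
    index = 5                                   # MFT entry 5 = root directory
    for component in path.strip("\\").split("\\"):
        record = MFT[index]
        if record["type"] != "dir":
            raise NotADirectoryError(component)
        index = record["entries"][component]    # a linear scan in a small dir
        # (here the access check via the security reference monitor would run)
    return index

print(lookup(r"\foo\bar"))   # → 102
```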
SEC. 11.8 THE WINDOWS NT FILE SYSTEM 961

A directory entry contains the MFT index for the file, the length of the file
name, the file name itself, and various fields and flags.

 Record header | Standard info | Index root header | directory entries ... | Unused

Figure 11-43. The MFT record for a small directory.


We now have enough information to finish describing how file-name lookup occurs for
a file \ ?? \ C: \ foo \ bar. In Fig. 11-20 we saw how the Win32, the native NT system calls,
and the object and I/O managers cooperated to open a file by sending an I/O request to the
NTFS device stack for the C: volume. The I/O request asks NTFS to fill in a file object for
the remaining path name, \ foo \ bar.

Having found file bar, NTFS will set pointers to its own metadata in the file
object passed down from the I/O manager. The metadata includes a pointer to the
MFT record, information about compression and range locks, various details about
sharing, and so on. Most of this metadata is in data structures shared across all file
objects referring to the file. A few fields are specific only to the current open, such
as whether the file should be deleted when it is closed. Once the open has suc-
ceeded, NTFS calls IoCompleteRequest to pass the IRP back up the I/O stack to
the I/O and object managers. Ultimately a handle for the file object is put in the
handle table for the current process, and control is passed back to user mode. On
subsequent ReadFile calls, an application can provide the handle, specifying that
this file object for C: \ foo \ bar should be included in the read request that gets
passed down the C: device stack to NTFS.
In addition to regular files and directories, NTFS supports hard links in the
UNIX sense, and also symbolic links using a mechanism called reparse points.
NTFS supports tagging a file or directory as a reparse point and associating a block
of data with it. When the file or directory is encountered during a file-name parse,
the operation fails and the block of data is returned to the object manager. The ob-
ject manager can interpret the data as representing an alternative path name and
then update the string to parse and retry the I/O operation. This mechanism is used
to support both symbolic links and mounted file systems, redirecting the search to
a different part of the directory hierarchy or even to a different partition.
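The parse-and-retry behavior can be sketched as follows (hypothetical paths and a much simplified object-manager loop, not the real kernel code):

```python
# Sketch of the reparse-point retry loop described above: when a parse hits
# a reparse point, the failure returns a block of data that the object
# manager may interpret as a substitute path name and then retry.

reparse_points = {r"\docs": r"\volume2\documents"}   # symlink-like mapping
real_files = {r"\volume2\documents\a.txt", r"\etc\hosts"}

def parse(path):
    """Return ('ok', path), ('reparse', new_path), or ('error', path)."""
    for prefix, target in reparse_points.items():
        if path.startswith(prefix):
            return "reparse", target + path[len(prefix):]
    return ("ok", path) if path in real_files else ("error", path)

def open_path(path, max_retries=8):
    """Object-manager style loop: retry the parse with the substituted name."""
    for _ in range(max_retries):
        status, result = parse(path)
        if status != "reparse":
            return status, result
        path = result               # substitute the new name and retry the I/O
    return "error", path            # too many reparse levels (loop protection)

print(open_path(r"\docs\a.txt"))    # resolves through the reparse point
```

The retry cap mirrors the need for loop protection when reparse points chain to one another.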
Reparse points are also used to tag individual files for file-system filter drivers.
In Fig. 11-20 we showed how file-system filters can be installed between the I/O
manager and the file system. I/O requests are completed by calling IoComplete-
Request, which passes control to the completion routines that each driver
represented in the device stack inserted into the IRP as the request was being made. A driver
that wants to tag a file associates a reparse tag and then watches for completion re-
quests for file open operations that failed because they encountered a reparse point.
From the block of data that is passed back with the IRP, the driver can tell if this is
a block of data that the driver itself has associated with the file. If so, the driver
will stop processing the completion and continue processing the original I/O re-
quest. Generally, this will involve proceeding with the open request, but there is a
flag that tells NTFS to ignore the reparse point and open the file.

File Compression

NTFS supports transparent file compression. A file can be created in
compressed mode, which means that NTFS automatically tries to compress the blocks
as they are written to disk and automatically uncompresses them when they are
read back. Processes that read or write compressed files are completely unaware
that compression and decompression are going on.
Compression works as follows. When NTFS writes a file marked for compres-
sion to disk, it examines the first 16 (logical) blocks in the file, irrespective of how
many runs they occupy. It then runs a compression algorithm on them. If the re-
sulting compressed data can be stored in 15 or fewer blocks, they are written to the
disk, preferably in one run, if possible. If the compressed data still take 16 blocks,
the 16 blocks are written in uncompressed form. Then blocks 16–31 are examined
to see if they can be compressed to 15 blocks or fewer, and so on.
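The decision rule just described can be sketched like this (the compressor itself is a stand-in; only the keep-it-if-15-blocks-or-fewer policy is modeled, and a partial final unit is simply stored as is):

```python
# Sketch of the per-unit compression policy: each 16-block unit is kept
# compressed only if it shrinks to 15 blocks or fewer; otherwise the
# original 16 blocks are written unchanged.

UNIT = 16  # blocks per compression unit

def plan_units(n_blocks, compressed_sizes):
    """For each 16-block unit, given the size the compressor achieved,
    decide what is written: (blocks_written, compressed?)."""
    plan = []
    for unit_start in range(0, n_blocks, UNIT):
        unit_len = min(UNIT, n_blocks - unit_start)
        size = compressed_sizes[unit_start // UNIT]
        if unit_len == UNIT and size <= 15:
            plan.append((size, True))        # store the compressed run
        else:
            plan.append((unit_len, False))   # compression did not pay off
    return plan

# The 48-block file of Fig. 11-44: units compress to 8, 16 (no gain), 8 blocks.
print(plan_units(48, [8, 16, 8]))   # → [(8, True), (16, False), (8, True)]
```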
Figure 11-44(a) shows a file in which the first 16 blocks have successfully
compressed to eight blocks, the second 16 blocks failed to compress, and the third
16 blocks have also compressed by 50%. The three parts have been written as
three runs and stored in the MFT record. The ‘‘missing’’ blocks are stored in the
MFT entry with disk address 0 as shown in Fig. 11-44(b). Here the header (0, 48)
is followed by five pairs, two for the first (compressed) run, one for the uncom-
pressed run, and two for the final (compressed) run.
When the file is read back, NTFS has to know which runs are compressed and
which ones are not. It can tell based on the disk addresses. A disk address of 0 in-
dicates that it is the final part of 16 compressed blocks. Disk block 0 may not be
used for storing data, to avoid ambiguity. Since block 0 on the volume contains the
boot sector, using it for data is impossible anyway.
Random access to compressed files is actually possible, but tricky. Suppose
that a process does a seek to block 35 in Fig. 11-44. How does NTFS locate block
35 in a compressed file? The answer is that it has to read and decompress the en-
tire run first. Then it knows where block 35 is and can pass it to any process that
reads it. The choice of 16 blocks for the compression unit was a compromise.
Making it shorter would have made the compression less effective. Making it
longer would have made random access more expensive.
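Assuming the run list of Fig. 11-44(b), the lookup can be sketched as follows (a simplified model: runs are assumed not to straddle compression units, and an uncompressed unit is assumed to occupy one run):

```python
# Sketch of the random-access procedure described above, using the run pairs
# of Fig. 11-44(b). A disk address of 0 marks the compressed tail of a
# 16-block unit; such units must be decompressed in full before any one
# block in them can be returned.

UNIT = 16
runs = [(30, 8), (0, 8), (40, 16), (85, 8), (0, 8)]  # Fig. 11-44(b)

def plan_read(block):
    """Describe how to fetch one file block from the compressed file."""
    unit = block // UNIT                  # which compression unit holds it
    # Collect the runs belonging to this unit (each unit covers 16 file blocks).
    vcn, unit_runs = 0, []
    for addr, length in runs:
        if vcn // UNIT == unit:
            unit_runs.append((addr, length))
        vcn += length
    if any(addr == 0 for addr, _ in unit_runs):
        # Compressed: read the stored run(s), decompress all 16 blocks.
        stored = [(a, l) for a, l in unit_runs if a != 0]
        return ("decompress_unit", stored, block % UNIT)
    # Uncompressed: read the one block directly.
    addr, _ = unit_runs[0]
    return ("direct", addr + block % UNIT)

print(plan_read(35))   # block 35 needs its whole unit read and decompressed
print(plan_read(20))   # block 20 is in the uncompressed middle run
```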

(a) Original uncompressed file: blocks 0 to 47, stored as three runs:

     file blocks 0-15    compressed to 8 blocks     disk addr 30-37
     file blocks 16-31   not compressible, as is    disk addr 40-55
     file blocks 32-47   compressed to 8 blocks     disk addr 85-92

(b) Standard info | File name | 0 48 | 30 8 | 0 8 | 40 16 | 85 8 | 0 8 | Unused
    (header, then five runs, two of which are the empty (0, 8) markers)

Figure 11-44. (a) An example of a 48-block file being compressed to 32 blocks.
(b) The MFT record for the file after compression.

Journaling

NTFS supports two mechanisms for programs to detect changes to files and di-
rectories. First is an operation, NtNotifyChangeDirectoryFile, that passes a buffer
and returns when a change is detected to a directory or directory subtree. The re-
sult is that the buffer has a list of change records. If it is too small, records are lost.
The second mechanism is the NTFS change journal. NTFS keeps a list of all
the change records for directories and files on the volume in a special file, which
programs can read using special file-system control operations, that is, the
FSCTL_QUERY_USN_JOURNAL option to the NtFsControlFile API. The journal
file is normally very large, and there is little likelihood that entries will be reused
before they can be examined.
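The journal's contract can be illustrated with a toy simulation (this models only the monotonically increasing record numbers and resume-from-last-seen behavior, not the real NtFsControlFile interface):

```python
# Toy simulation of the change journal's contract: records carry
# monotonically increasing USNs, and a reader resumes from the last
# USN it has already processed.

class ChangeJournal:
    def __init__(self):
        self.records = []      # (usn, path, reason)
        self.next_usn = 1

    def log(self, path, reason):
        self.records.append((self.next_usn, path, reason))
        self.next_usn += 1

    def read_since(self, usn):
        """What an FSCTL_QUERY_USN_JOURNAL-style query boils down to."""
        return [r for r in self.records if r[0] > usn]

j = ChangeJournal()
j.log(r"\foo\bar", "WRITE")
j.log(r"\foo\baz", "CREATE")
first_batch = j.read_since(0)            # reader consumes both records
j.log(r"\foo\bar", "DELETE")
print(j.read_since(first_batch[-1][0]))  # only the DELETE is new
```

Because the reader keeps only a single USN between queries, no records are missed even if many changes accumulate between reads.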

File Encryption

Computers are used nowadays to store all kinds of sensitive data, including
plans for corporate takeovers, tax information, and love letters, which the owners
do not especially want revealed to anyone. Information loss can happen when a
notebook computer is lost or stolen, a desktop system is rebooted using an MS-
DOS floppy disk to bypass Windows security, or a hard disk is physically removed
from one computer and installed on another one with an insecure operating system.
Windows addresses these problems by providing an option to encrypt files, so
that even in the event the computer is stolen or rebooted using MS-DOS, the files
will be unreadable. The normal way to use Windows encryption is to mark certain
directories as encrypted, which causes all the files in them to be encrypted, and
new files moved to them or created in them to be encrypted as well. The actual en-
cryption and decryption are not managed by NTFS itself, but by a driver called
EFS (Encrypting File System), which registers callbacks with NTFS.
EFS provides encryption for specific files and directories. There is also anoth-
er encryption facility in Windows called BitLocker which encrypts almost all the
data on a volume, which can help protect data no matter what—as long as the user
takes advantage of the mechanisms available for strong keys. Given the number of
systems that are lost or stolen all the time, and the great sensitivity to the issue of
identity theft, making sure secrets are protected is very important. An amazing
number of notebooks go missing every day. Major Wall Street companies sup-
posedly average losing one notebook per week in taxicabs in New York City alone.

11.9 WINDOWS POWER MANAGEMENT


The power manager rides herd on power usage throughout the system. His-
torically, management of power consumption consisted of shutting off the monitor
display and stopping the disk drives from spinning. But the issue is rapidly becom-
ing more complicated due to requirements for extending how long notebooks can
run on batteries, and energy-conservation concerns related to desktop computers
being left on all the time and the high cost of supplying power to the huge server
farms that exist today.
Newer power-management facilities include reducing the power consumption
of components when the system is not in use by switching individual devices to
standby states, or even powering them off completely using soft power switches.
Multiprocessors shut down individual CPUs when they are not needed, and even
the clock rates of the running CPUs can be adjusted downward to reduce power
consumption. When a processor is idle, its power consumption is also reduced
since it needs to do nothing except wait for an interrupt to occur.
Windows supports a special shutdown mode called hibernation, which copies
all of physical memory to disk and then reduces power consumption to a small
trickle (notebooks can run weeks in a hibernated state) with little battery drain.
Because all the memory state is written to disk, you can even replace the battery on
a notebook while it is hibernated. When the system resumes after hibernation it re-
stores the saved memory state (and reinitializes the I/O devices). This brings the
computer back into the same state it was before hibernation, without having to
login again and start up all the applications and services that were running. Win-
dows optimizes this process by ignoring unmodified pages backed by disk already
and compressing other memory pages to reduce the amount of I/O bandwidth re-
quired. The hibernation algorithm automatically tunes itself to balance between
I/O and processor throughput. If there is more processor capacity available, it uses
expensive but more effective compression to reduce the I/O bandwidth needed. When
I/O bandwidth is sufficient, hibernation will skip the compression altogether. With
the current generation of multiprocessors, both hibernation and resume can be per-
formed in a few seconds even on systems with many gigabytes of RAM.
An alternative to hibernation is standby mode where the power manager re-
duces the entire system to the lowest power state possible, using just enough power
to refresh the dynamic RAM. Because memory does not need to be copied to
disk, this is somewhat faster than hibernation on some systems.
Despite the availability of hibernation and standby, many users are still in the
habit of shutting down their PC when they finish working. Windows uses hiberna-
tion to perform a pseudo shutdown and startup, called HiberBoot, that is much fast-
er than normal shutdown and startup. When the user tells the system to shutdown,
HiberBoot logs the user off and then hibernates the system at the point they would
normally log in again. Later, when the user turns the system on again, HiberBoot
will resume the system at the login point. To the user it looks like shutdown was
very, very fast because most of the system initialization steps are skipped. Of
course, sometimes the system needs to perform a real shutdown in order to fix a
problem or install an update to the kernel. If the system is told to reboot rather
than shutdown, the system undergoes a real shutdown and performs a normal boot.
On phones and tablets, as well as the newest generation of laptops, computing
devices are expected to be always on yet consume little power. To provide this
experience Modern Windows implements a special version of power management
called CS (connected standby). CS is possible on systems with special network-
ing hardware which is able to listen for traffic on a small set of connections using
much less power than if the CPU were running. A CS system always appears to be
on, coming out of CS as soon as the screen is turned on by the user. Connected
standby is different from the regular standby mode because a CS system will also
come out of standby when it receives a packet on a monitored connection. Once
the battery begins to run low, a CS system will go into the hibernation state to
avoid completely exhausting the battery and perhaps losing user data.
Achieving good battery life requires more than just turning off the processor as
often as possible. It is also important to keep the processor off as long as possible.
The CS network hardware allows the processors to stay off until data have arrived,
but other events can also cause the processors to be turned back on. In NT-based
Windows device drivers, system services, and the applications themselves fre-
quently run for no particular reason other than to check on things. Such polling
activity is usually based on setting timers to periodically run code in the system or
application. Timer-based polling can produce a cacophony of events turning on the
processor. To avoid this, Modern Windows requires that timers specify an impreci-
sion parameter which allows the operating system to coalesce timer events and re-
duce the number of separate occasions one of the processors will have to be turned
back on. Windows also formalizes the conditions under which an application that
is not actively running can execute code in the background. Operations like check-
ing for updates or freshening content cannot be performed solely by requesting to
run when a timer expires. An application must defer to the operating system about
when to run such background activities. For example, checking for updates might
occur only once a day or at the next time the device is charging its battery. A set of
system brokers provide a variety of conditions which can be used to limit when
background activity is performed. If a background task needs to access a low-cost
network or utilize a user’s credentials, the brokers will not execute the task until
the requisite conditions are present.
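The effect of the imprecision parameter on timers can be sketched with a simple greedy coalescer (illustrative only; Windows' actual coalescing policy is not specified here):

```python
# Sketch of timer coalescing with a tolerance ("imprecision") parameter:
# timers whose [due, due+tolerance] windows overlap can be served by a
# single processor wake-up instead of one wake-up each.

def coalesce(timers):
    """timers: list of (due_time, tolerance). Returns chosen wake-up times."""
    wakeups = []
    # Greedy: sort by the latest acceptable firing time, fire at window ends.
    for due, tol in sorted(timers, key=lambda t: t[0] + t[1]):
        if wakeups and due <= wakeups[-1]:
            continue                    # an already-scheduled wake-up covers it
        wakeups.append(due + tol)       # wake as late as this timer allows
    return wakeups

# Four timers that would cause four wake-ups collapse into two.
timers = [(100, 50), (120, 40), (130, 10), (300, 20)]
print(coalesce(timers))   # → [140, 320]
```

Every timer still fires inside its stated window; the saving is in the number of times a processor must be powered back on.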
Many applications today are implemented with both local code and services in
the cloud. Windows provides WNS (Windows Notification Service) which allows
third-party services to push notifications to a Windows device in CS without re-
quiring the CS network hardware to specifically listen for packets from the third
party’s servers. WNS notifications can signal time-critical events, such as the arri-
val of a text message or a VoIP call. When a WNS packet arrives, the processor
will have to be turned on to process it, but the ability of the CS network hardware
to discriminate between traffic from different connections means the processor
does not have to awaken for every random packet that arrives at the network inter-
face.

11.10 SECURITY IN WINDOWS 8


NT was originally designed to meet the U.S. Department of Defense’s C2 se-
curity requirements (DoD 5200.28-STD), the Orange Book, which secure DoD
systems must meet. This standard requires operating systems to have certain prop-
erties in order to be classified as secure enough for certain kinds of military work.
Although Windows was not specifically designed for C2 compliance, it inherits
many security properties from the original security design of NT, including the fol-
lowing:

1. Secure login with antispoofing measures.
2. Discretionary access controls.
3. Privileged access controls.
4. Address-space protection per process.
5. New pages must be zeroed before being mapped in.
6. Security auditing.

Let us review these items briefly.


Secure login means that the system administrator can require all users to have
a password in order to log in. Spoofing is when a malicious user writes a program
that displays the login prompt or screen and then walks away from the computer in
the hope that an innocent user will sit down and enter a name and password. The
name and password are then written to disk and the user is told that login has
failed. Windows prevents this attack by instructing users to hit CTRL-ALT-DEL to
log in. This key sequence is always captured by the keyboard driver, which then
invokes a system program that puts up the genuine login screen. This procedure
works because there is no way for user processes to disable CTRL-ALT-DEL
processing in the keyboard driver. But NT can and does disable use of the CTRL-ALT-
DEL secure attention sequence in some cases, particularly for consumers and in
systems that have accessibility for the disabled enabled, on phones, tablets, and the
Xbox, where there rarely is a physical keyboard.
Discretionary access controls allow the owner of a file or other object to say
who can use it and in what way. Privileged access controls allow the system
administrator (superuser) to override them when needed. Address-space protection
simply means that each process has its own protected virtual address space not ac-
cessible by any unauthorized process. The next item means that when the process
heap grows, the pages mapped in are initialized to zero so that processes cannot
find any old information put there by the previous owner (hence the zeroed page
list in Fig. 11-34, which provides a supply of zeroed pages for this purpose).
Finally, security auditing allows the administrator to produce a log of certain secu-
rity-related events.
While the Orange Book does not specify what is to happen when someone
steals your notebook computer, in large organizations one theft a week is not
unusual. Consequently, Windows provides tools that a conscientious user can use
to minimize the damage when a notebook is stolen or lost (e.g., secure login, en-
crypted files, etc.). Of course, conscientious users are precisely the ones who do
not lose their notebooks—it is the others who cause the trouble.
In the next section we will describe the basic concepts behind Windows securi-
ty. After that we will look at the security system calls. Finally, we will conclude
by seeing how security is implemented.

11.10.1 Fundamental Concepts

Every Windows user (and group) is identified by an SID (Security ID). SIDs
are binary numbers with a short header followed by a long random component.
Each SID is intended to be unique worldwide. When a user starts up a process, the
process and its threads run under the user’s SID. Most of the security system is de-
signed to make sure that each object can be accessed only by threads with autho-
rized SIDs.
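SIDs are conventionally rendered in a textual "S-revision-authority-subauthorities" form; the convention can be sketched as follows (the example subauthority values are made up):

```python
# Small sketch of the textual SID convention: a short header (revision and
# identifier authority) followed by the long subauthority component that
# makes the SID unique.

def format_sid(revision, authority, subauthorities):
    return "S-" + "-".join(str(x) for x in [revision, authority, *subauthorities])

def parse_sid(text):
    parts = text.split("-")
    assert parts[0] == "S"
    nums = [int(x) for x in parts[1:]]
    return nums[0], nums[1], nums[2:]

# A typical local-account shape: domain component followed by a per-user id.
sid = format_sid(1, 5, [21, 310440588, 250036508, 580389073, 1001])
print(sid)   # → S-1-5-21-310440588-250036508-580389073-1001
```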
Each process has an access token that specifies an SID and other properties.
The token is normally created by winlogon, as described below. The format of the
token is shown in Fig. 11-45. Processes can call GetTokenInformation to acquire
this information. The header contains some administrative information. The expi-
ration time field could tell when the token ceases to be valid, but it is currently not
used. The Groups field specifies the groups to which the process belongs, which is
needed for the POSIX subsystem. The default DACL (Discretionary ACL) is the
access control list assigned to objects created by the process if no other ACL is
specified. The user SID tells who owns the process. The restricted SIDs are to
allow untrustworthy processes to take part in jobs with trustworthy processes but
with less power to do damage.
Finally, the privileges listed, if any, give the process special powers denied or-
dinary users, such as the right to shut the machine down or access files to which
access would otherwise be denied. In effect, the privileges split up the power of
the superuser into several rights that can be assigned to processes individually. In
this way, a user can be given some superuser power, but not all of it. In summary,
the access token tells who owns the process and which defaults and powers are as-
sociated with it.

 Header | Expiration time | Groups | Default DACL | User SID | Group SID |
 Restricted SIDs | Privileges | Impersonation level | Integrity level

Figure 11-45. Structure of an access token.

When a user logs in, winlogon gives the initial process an access token. Subse-
quent processes normally inherit this token on down the line. A process’ access
token initially applies to all the threads in the process. However, a thread can ac-
quire a different access token during execution, in which case the thread’s access
token overrides the process’ access token. In particular, a client thread can pass its
access rights to a server thread to allow the server to access the client’s protected
files and other objects. This mechanism is called impersonation. It is imple-
mented by the transport layers (i.e., ALPC, named pipes, and TCP/IP) and used by
RPC to communicate from clients to servers. The transports use internal interfaces
in the kernel’s security reference monitor component to extract the security context
for the current thread’s access token and ship it to the server side, where it is used
to construct a token which can be used by the server to impersonate the client.
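The override rule can be captured in a few lines (toy classes and made-up SID strings, not the real kernel structures):

```python
# Minimal sketch of the rule just described: a thread's access token, if it
# has acquired one by impersonation, overrides the process' token during
# access checks.

class Token:
    def __init__(self, user_sid):
        self.user_sid = user_sid

class Process:
    def __init__(self, token):
        self.token = token

class Thread:
    def __init__(self, process):
        self.process = process
        self.impersonation_token = None   # set by the server-side transport

    def effective_token(self):
        return self.impersonation_token or self.process.token

server = Process(Token("server-sid"))
t = Thread(server)
print(t.effective_token().user_sid)        # the server's own identity
t.impersonation_token = Token("client-sid")
print(t.effective_token().user_sid)        # now checked as the client
```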
Another basic concept is the security descriptor. Every object has a security
descriptor associated with it that tells who can perform which operations on it.
The security descriptors are specified when the objects are created. The NTFS file
system and the registry maintain a persistent form of security descriptor, which is
used to create the security descriptor for File and Key objects (the object-manager
objects representing open instances of files and keys).
A security descriptor consists of a header followed by a DACL with one or
more ACEs (Access Control Entries). The two main kinds of elements are Allow
and Deny. An Allow element specifies an SID and a bitmap that specifies which
operations processes that SID may perform on the object. A Deny element works
the same way, except a match means the caller may not perform the operation. For
example, Ida has a file whose security descriptor specifies that everyone has read
access, Elvis has no access, Cathy has read/write access, and Ida herself has full
access. This simple example is illustrated in Fig. 11-46. The SID Everyone refers
to the set of all users, but it is overridden by any explicit ACEs that follow.

 File --> Security descriptor:
      Header
      Owner's SID
      Group SID
      DACL --> Header
               Deny    Elvis      111111
               Allow   Cathy      110000
               Allow   Ida        111111
               Allow   Everyone   100000
      SACL --> Header
               Audit   Marilyn    111111
Figure 11-46. An example security descriptor for a file.

In addition to the DACL, a security descriptor also has a SACL (System
Access Control List), which is like a DACL except that it specifies not who may
use the object, but which operations on the object are recorded in the systemwide
security event log. In Fig. 11-46, every operation that Marilyn performs on the file
will be logged. The SACL also contains the integrity level, which we will de-
scribe shortly.

11.10.2 Security API Calls

Most of the Windows access-control mechanism is based on security descrip-
tors. The usual pattern is that when a process creates an object, it provides a secu-
rity descriptor as one of the parameters to the CreateProcess, CreateFile, or other
object-creation call. This security descriptor then becomes the security descriptor
attached to the object, as we saw in Fig. 11-46. If no security descriptor is pro-
vided in the object-creation call, the default security in the caller’s access token
(see Fig. 11-45) is used instead.
Many of the Win32 API security calls relate to the management of security de-
scriptors, so we will focus on those here. The most important calls are listed in
Fig. 11-47. To create a security descriptor, storage for it is first allocated and then
initialized using InitializeSecurityDescriptor. This call fills in the header. If the
owner SID is not known, it can be looked up by name using LookupAccountSid. It
can then be inserted into the security descriptor. The same holds for the group
SID, if any. Normally, these will be the caller's own SID and one of the caller's
groups, but the system administrator can fill in any SIDs.

Win32 API function Description


InitializeSecurityDescriptor Prepare a new security descriptor for use
LookupAccountSid Look up the SID for a given user name
SetSecurityDescriptorOwner Enter the owner SID in the security descriptor
SetSecurityDescriptorGroup Enter a group SID in the security descriptor
InitializeAcl Initialize a DACL or SACL
AddAccessAllowedAce Add a new ACE to a DACL or SACL allowing access
AddAccessDeniedAce Add a new ACE to a DACL or SACL denying access
DeleteAce Remove an ACE from a DACL or SACL
SetSecurityDescriptorDacl Attach a DACL to a security descriptor

Figure 11-47. The principal Win32 API functions for security.

At this point the security descriptor’s DACL (or SACL) can be initialized with
InitializeAcl. ACL entries can be added using AddAccessAllowedAce and
AddAccessDeniedAce. These calls can be repeated multiple times to add as many ACE
entries as are needed. DeleteAce can be used to remove an entry, that is, when
modifying an existing ACL rather than when constructing a new ACL. When the
ACL is ready, SetSecurityDescriptorDacl can be used to attach it to the security de-
scriptor. Finally, when the object is created, the newly minted security descriptor
can be passed as a parameter to have it attached to the object.
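The call sequence of Fig. 11-47 can be mirrored in a toy model (Python stand-ins named after the Win32 calls; these simulate the construction pattern, they are not the real API):

```python
# Toy stand-ins mirroring the construction order described above:
# initialize the descriptor, set the owner, build the ACL entry by entry,
# and finally attach the DACL to the descriptor.

def initialize_security_descriptor():
    return {"owner": None, "group": None, "dacl": None}   # header fields

def set_owner(sd, sid): sd["owner"] = sid
def initialize_acl():   return []
def add_allowed_ace(acl, sid, mask): acl.append(("allow", sid, mask))
def add_denied_ace(acl, sid, mask):  acl.append(("deny", sid, mask))
def set_dacl(sd, acl):  sd["dacl"] = acl

sd = initialize_security_descriptor()
set_owner(sd, "Ida")
acl = initialize_acl()
add_denied_ace(acl, "Elvis", 0b111111)     # Deny entries go first
add_allowed_ace(acl, "Cathy", 0b110000)
add_allowed_ace(acl, "Everyone", 0b100000)
set_dacl(sd, acl)
# sd would now be passed to the object-creation call.
print(sd["dacl"][0])   # → ('deny', 'Elvis', 63)
```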

11.10.3 Implementation of Security

Security in a stand-alone Windows system is implemented by a number of
components, most of which we have already seen (networking is a whole other
story and beyond the scope of this book). Logging in is handled by winlogon and
authentication is handled by lsass. The result of a successful login is a new GUI
shell (explorer.exe) with its associated access token. This process uses the SECU-
RITY and SAM hives in the registry. The former sets the general security policy
and the latter contains the security information for the individual users, as dis-
cussed in Sec. 11.2.3.
Once a user is logged in, security operations happen when an object is opened
for access. Every OpenXXX call requires the name of the object being opened and
the set of rights needed. During processing of the open, the security reference
monitor (see Fig. 11-11) checks to see if the caller has all the rights required. It
performs this check by looking at the caller’s access token and the DACL associ-
ated with the object. It goes down the list of ACEs in the ACL in order. As soon
as it finds an entry that matches the caller’s SID or one of the caller’s groups, the
access found there is taken as definitive. If all the rights the caller needs are avail-
able, the open succeeds; otherwise it fails.
DACLs can have Deny entries as well as Allow entries, as we have seen. For
this reason, it is usual to put entries denying access in front of entries granting ac-
cess in the ACL, so that a user who is specifically denied access cannot get in via a
back door by being a member of a group that has legitimate access.
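The ACE walk just described can be sketched against the DACL of Fig. 11-46 (toy rights bitmap; the first matching ACE is taken as definitive, as in the text, which is why Deny entries come first):

```python
# Sketch of the access check described above, applied to the DACL of
# Fig. 11-46. ACEs are walked in order; the first one matching the caller's
# SID or one of the caller's groups decides the outcome.

READ, WRITE = 0b100000, 0b010000          # toy rights bitmap

dacl = [
    ("deny",  "Elvis",    0b111111),
    ("allow", "Cathy",    0b110000),
    ("allow", "Ida",      0b111111),
    ("allow", "Everyone", 0b100000),
]

def access_check(sids, wanted):
    """sids: caller's SID plus group SIDs (everyone carries 'Everyone')."""
    for kind, sid, mask in dacl:
        if sid in sids:
            return kind == "allow" and (wanted & mask) == wanted
    return False                          # no matching ACE: access denied

print(access_check({"Cathy", "Everyone"}, READ | WRITE))   # → True
print(access_check({"Elvis", "Everyone"}, READ))           # → False (Deny hits first)
print(access_check({"Bob",   "Everyone"}, WRITE))          # → False (Everyone: read only)
```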
After an object has been opened, a handle to it is returned to the caller. On
subsequent calls, the only check that is made is whether the operation now being
tried was in the set of operations requested at open time, to prevent a caller from
opening a file for reading and then trying to write on it. Additionally, calls on
handles may result in entries in the audit logs, as required by the SACL.
Windows added another security facility to deal with common problems secur-
ing the system by ACLs. There are new mandatory integrity-level SIDs in the
process token, and objects specify an integrity-level ACE in the SACL. The integ-
rity level prevents write-access to objects no matter what ACEs are in the DACL.
In particular, the integrity-level scheme is used to protect against an Internet Ex-
plorer process that has been compromised by an attacker (perhaps by the user ill-
advisedly downloading code from an unknown Website). Low-rights IE, as it is
called, runs with an integrity level set to low. By default all files and registry keys
in the system have an integrity level of medium, so IE running with low-integrity
level cannot modify them.
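The mandatory check can be sketched like this (the set of levels and the write-restriction policy are simplified):

```python
# Sketch of the mandatory integrity check described above: regardless of
# what the DACL says, write access fails when the caller's integrity level
# is below the object's.

LEVELS = {"low": 1, "medium": 2, "high": 3}

def can_write(caller_level, object_level, dacl_allows_write):
    if LEVELS[caller_level] < LEVELS[object_level]:
        return False                  # mandatory check overrides the DACL
    return dacl_allows_write          # then the ordinary DACL check applies

# Low-rights IE cannot modify a medium-integrity file even if ACEs allow it.
print(can_write("low", "medium", dacl_allows_write=True))     # → False
print(can_write("medium", "medium", dacl_allows_write=True))  # → True
```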
A number of other security features have been added to Windows in recent
years. Starting with service pack 2 of Windows XP, much of the system was com-
piled with a flag (/GS) that did validation against many kinds of stack buffer over-
flows. Additionally, a facility in the AMD64 architecture, called NX, was used to
limit execution of code on stacks. The NX bit in the processor is available even
when running in x86 mode. NX stands for no execute and allows pages to be
marked so that code cannot be executed from them. Thus, if an attacker uses a
buffer-overflow vulnerability to insert code into a process, it is not so easy to jump
to the code and start executing it.
Windows Vista introduced even more security features to foil attackers. Code
loaded into kernel mode is checked (by default on x64 systems) and only loaded if
it is properly signed by a known and trusted authority. The addresses that DLLs
and EXEs are loaded at, as well as stack allocations, are shuffled quite a bit on
each system to make it less likely that an attacker can successfully use buffer over-
flows to branch into a well-known address and begin executing sequences of code
that can be woven into an elevation of privilege. A much smaller fraction of sys-
tems will be able to be attacked by relying on binaries being at standard addresses.
Systems are far more likely to just crash, converting a potential elevation attack
into a less dangerous denial-of-service attack.
Yet another change was the introduction of what Microsoft calls UAC (User
Account Control). This is to address the chronic problem in Windows where
most users run as administrators. The design of Windows does not require users to
run as administrators, but neglect over many releases had made it just about impos-
sible to use Windows successfully if you were not an administrator. Being an
administrator all the time is dangerous. Not only can user errors easily damage the
system, but if the user is somehow fooled or attacked and runs code that is trying to
compromise the system, the code will have administrative access, and can bury it-
self deep in the system.
With UAC, if an attempt is made to perform an operation requiring administra-
tor access, the system overlays a special desktop and takes control so that only
input from the user can authorize the access (similarly to how CTRL-ALT-DEL
works for C2 security). Of course, without becoming administrator it is possible
for an attacker to destroy what the user really cares about, namely his personal
files. But UAC does help foil existing types of attacks, and it is always easier to
recover a compromised system if the attacker was unable to modify any of the sys-
tem data or files.
The final security feature in Windows is one we have already mentioned.
There is support to create protected processes which provide a security boundary.
Normally, the user (as represented by a token object) defines the privilege boundary in the system. When a process is created, the user has access to the process through any number of kernel facilities for process creation, debugging, path
names, thread injection, and so on. Protected processes are shut off from user ac-
cess. The original use of this facility in Windows was to allow digital rights man-
agement software to better protect content. In Windows 8.1, protected processes
were expanded to more user-friendly purposes, like securing the system against at-
tackers rather than securing content against attacks by the system owner.
Microsoft’s efforts to improve the security of Windows have accelerated in
recent years as more and more attacks have been launched against systems around
the world. Some of these attacks have been very successful, taking entire countries
and major corporations offline, and incurring costs of billions of dollars. Most of
the attacks exploit small coding errors that lead to buffer overruns or use of memory
after it is freed, allowing the attacker to insert code by overwriting return ad-
dresses, exception pointers, virtual function pointers, and other data that control the
execution of programs. Many of these problems could be avoided if type-safe lan-
guages were used instead of C and C++. And even with these unsafe languages
many vulnerabilities could be avoided if students were better trained to understand
the pitfalls of parameter and data validation, and the many dangers inherent in
memory allocation APIs. After all, many of the software engineers who write code
at Microsoft today were students a few years earlier, just as many of you reading
this case study are now. Many books are available on the kinds of small coding er-
rors that are exploitable in pointer-based languages and how to avoid them (e.g.,
Howard and LeBlanc, 2009).
SEC. 11.10 SECURITY IN WINDOWS 8 973

11.10.4 Security Mitigations

It would be great for users if computer software did not have any bugs, particu-
larly bugs that are exploitable by hackers to take control of their computer and
steal their information, or use their computer for illegal purposes such as distrib-
uted denial-of-service attacks, compromising other computers, and distribution of
spam or other illicit materials. Unfortunately, this is not yet feasible in practice,
and computers continue to have security vulnerabilities. Operating system devel-
opers have expended incredible efforts to minimize the number of bugs, with
enough success that attackers are increasing their focus on application software and
browser plug-ins, like Adobe Flash, rather than the operating system itself.
Computer systems can still be made more secure through mitigation techni-
ques that make it more difficult to exploit vulnerabilities when they are found.
Windows has continually added improvements to its mitigation techniques in the
ten years leading up to Windows 8.1.

Mitigation           Description
/GS compiler flag    Add a canary to stack frames to protect branch targets
Exception hardening  Restrict what code can be invoked as exception handlers
NX MMU protection    Mark data pages as non-executable to hinder attack payloads
ASLR                 Randomize the address space to make ROP attacks difficult
Heap hardening       Check for common heap usage errors
VTGuard              Add checks to validate virtual function tables
Code integrity       Verify that libraries and drivers are cryptographically signed
Patchguard           Detect attempts to modify kernel data, e.g., by rootkits
Windows Update       Provide regular security patches to remove vulnerabilities
Windows Defender     Built-in basic antivirus capability

Figure 11-48. Some of the principal security mitigations in Windows.

The mitigations listed undermine different steps required for successful wide-
spread exploitation of Windows systems. Some provide defense-in-depth against
attacks that are able to work around other mitigations. /GS protects against stack
overflow attacks that might allow attackers to modify return addresses, function
pointers, and exception handlers. Exception hardening adds additional checks to
verify that exception handler address chains are not overwritten. No-eXecute protection requires that successful attackers point the program counter not just at a data payload, but at code that the system has marked as executable. Attackers often attempt to circumvent NX protections using return-oriented programming or return-to-libc techniques that point the program counter at fragments of code that allow them to build up an attack. ASLR (Address Space Layout Randomization) foils such attacks by making it difficult for an attacker to know ahead of time exactly where the code, stacks, and other data structures are loaded in
the address space. Recent work shows how running programs can be rerandom-
ized every few seconds, making attacks even more difficult (Giuffrida et al., 2012).
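A back-of-the-envelope simulation shows why randomizing load addresses matters. This is a toy model: the 256-slot figure is an arbitrary assumption, and real ASLR entropy varies by system and by allocation type.

```python
import random

def attack_success_rate(randomize, trials=10_000, slots=256, seed=1):
    """The 'system' loads a module at one of `slots` base addresses;
    the 'attacker' always jumps to the base observed on their own
    machine (slot 0). Returns the fraction of successful attacks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        base = rng.randrange(slots) if randomize else 0
        if base == 0:              # the attacker's hardcoded guess
            hits += 1
    return hits / trials
```

Without randomization every attempt lands on the expected address; with it, only about one attempt in `slots` does, and the rest typically crash the target, converting elevation attempts into denial of service.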
Heap hardening is a series of mitigations added to the Windows imple-
mentation of the heap that make it more difficult to exploit vulnerabilities such as
writing beyond the boundaries of a heap allocation, or some cases of continuing to
use a heap block after freeing it. VTGuard adds additional checks in particularly
sensitive code that prevent exploitation of use-after-free vulnerabilities related to
virtual-function tables in C++.
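One ingredient of heap hardening, guard patterns around allocations, can be sketched as follows. This is a simplified model, not the actual Windows heap; real hardening also encodes block metadata and validates free-list links.

```python
GUARD = 0xAB   # pattern written on both sides of every allocation

class GuardedHeap:
    """Toy model of heap hardening: guard bytes around each block
    let free() detect writes past the end of an allocation."""
    def __init__(self, size):
        self.mem = bytearray(size)
        self.next = 0

    def alloc(self, n):
        start = self.next + 1                 # one guard byte on each side
        self.mem[start - 1] = GUARD
        self.mem[start + n] = GUARD
        self.next = start + n + 1
        return start

    def write(self, addr, data):              # unchecked, like memcpy
        self.mem[addr:addr + len(data)] = data

    def free(self, addr, n):
        # Check both guard bytes before recycling the block.
        if self.mem[addr - 1] != GUARD or self.mem[addr + n] != GUARD:
            raise RuntimeError("heap corruption detected")
```

An in-bounds write passes the check in free; writing even one byte past the end of a block corrupts the guard and is caught when the block is freed.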
Code integrity is kernel-level protection against loading arbitrary executable
code into processes. It checks that programs and libraries were cryptographically
signed by a trustworthy publisher. These checks work with the memory manager
to verify the code on a page-by-page basis whenever individual pages are retrieved
from disk. Patchguard is a kernel-level mitigation that attempts to detect rootkits
designed to hide a successful exploitation from detection.
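The page-by-page verification idea can be sketched with per-page hashes. This is illustrative only: the real mechanism embeds signed per-page hashes in the image's certificate data and validates the signature chain against trusted publishers.

```python
import hashlib

PAGE = 4096

def sign_image(image: bytes):
    """Toy model of code integrity: the 'signature' here is just a list
    of per-page SHA-256 hashes (a real system signs the hash list with
    the publisher's private key)."""
    return [hashlib.sha256(image[i:i + PAGE]).digest()
            for i in range(0, len(image), PAGE)]

def verify_page(image: bytes, page_no: int, hashes) -> bool:
    """Called as each page is brought in from disk."""
    data = image[page_no * PAGE:(page_no + 1) * PAGE]
    return hashlib.sha256(data).digest() == hashes[page_no]
```

Because each page is checked independently, verification can be deferred until the memory manager actually faults the page in, rather than hashing the whole image at load time.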
Windows Update is an automated service providing fixes to security vulnera-
bilities by patching the affected programs and libraries within Windows. Many of
the vulnerabilities fixed were reported by security researchers, and their contributions are acknowledged in the notes attached to each fix. Ironically, the security
updates themselves pose a significant risk. Almost all vulnerabilities used by at-
tackers are exploited only after a fix has been published by Microsoft. This is be-
cause reverse engineering the fixes themselves is the primary way most hackers
discover vulnerabilities in systems. Systems that do not have all known updates applied immediately are thus susceptible to attack. The security research community usually insists that companies patch all vulnerabilities found within a reasonable time. The current monthly patch frequency used by Microsoft is a compromise between keeping the community happy and how often users must deal with patching to keep their systems safe.
The exceptions to this are the so-called zero-day vulnerabilities. These are exploitable bugs that are not known to exist until after their use is detected. Fortunately, zero-day vulnerabilities are considered to be rare, and reliably exploitable zero days are even rarer due to the effectiveness of the mitigation measures described above. There is a black market in such vulnerabilities. The mitigations in the most recent versions of Windows are believed to be causing the market price for a useful zero day to rise steeply.
Finally, antivirus software has become such a critical tool for combating malware that Windows includes a basic version, called Windows Defender. Antivirus software hooks into kernel operations to detect malware inside files, as well as to recognize the behavioral patterns used by specific instances (or general categories) of malware. These behaviors include surviving reboots, modifying the registry to alter system behavior, and launching particular processes and services needed to implement an attack.
Though Windows Defender provides reasonably good protection against common
malware, many users prefer to purchase third-party antivirus software.
Many of these mitigations are under the control of compiler and linker flags.
If applications, kernel device drivers, or plug-in libraries read data into executable
memory or include code without /GS and ASLR enabled, the mitigations are not
present and any vulnerabilities in the programs are much easier to exploit. Fortunately, in recent years the risks of not enabling mitigations have become widely understood by software developers, and mitigations are now generally enabled.
The final two mitigations on the list are under the control of the user or admin-
istrator of each computer system. Allowing Windows Update to patch software
and making sure that updated antivirus software is installed on systems are the best
techniques for protecting systems from exploitation. The versions of Windows
used by enterprise customers include features that make it easier for administrators
to ensure that the systems connected to their networks are fully patched and cor-
rectly configured with antivirus software.

11.11 SUMMARY
Kernel mode in Windows is structured in the HAL, the kernel and executive
layers of NTOS, and a large number of device drivers implementing everything
from device services to file systems and networking to graphics. The HAL hides
certain differences in hardware from the other components. The kernel layer man-
ages the CPUs to support multithreading and synchronization, and the executive
implements most kernel-mode services.
The executive is based on kernel-mode objects that represent the key executive
data structures, including processes, threads, memory sections, drivers, devices,
and synchronization objects—to mention a few. User processes create objects by
calling system services and get back handle references which can be used in subse-
quent system calls to the executive components. The operating system also creates
objects internally. The object manager maintains a namespace into which objects
can be inserted for subsequent lookup.
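The handle-based object model can be sketched as follows. This is a minimal illustration; the real object manager adds reference counting, security checks, and per-object-type operations.

```python
class ObjectManager:
    """Minimal sketch of the executive's object/handle model: objects
    live in a namespace, and user code holds small-integer handles
    rather than raw references."""
    def __init__(self):
        self.namespace = {}        # path -> object
        self.handles = []          # per-process handle table

    def create(self, path, obj):
        self.namespace[path] = obj
        return self.open(path)

    def open(self, path):
        self.handles.append(self.namespace[path])
        return len(self.handles) - 1          # the handle

    def deref(self, handle):
        """What the executive does when a system call passes a handle."""
        return self.handles[handle]
```

Opening the same path twice yields two distinct handles that refer to the same underlying object, which is exactly how shared objects such as named mutexes behave.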
The most important objects in Windows are processes, threads, and sections.
Processes have virtual address spaces and are containers for resources. Threads are
the unit of execution and are scheduled by the kernel layer using a priority algo-
rithm in which the highest-priority ready thread always runs, preempting lower-pri-
ority threads as necessary. Sections represent memory objects, like files, that can
be mapped into the address spaces of processes. EXE and DLL program images
are represented as sections, as is shared memory.
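The dispatch rule described above can be stated in a few lines. This is a simplification: the real dispatcher keeps one ready queue per priority level and also considers affinity and ideal processors.

```python
def pick_next(ready_threads):
    """Pick the highest-priority ready thread; on a tie, max() keeps
    the earliest entry, approximating round-robin within a level."""
    return max(ready_threads, key=lambda t: t["priority"])

def should_preempt(running, newly_ready):
    """A thread made ready at strictly higher priority preempts
    the currently running thread."""
    return newly_ready["priority"] > running["priority"]
```

For example, a priority 24 thread is chosen over priority 8 threads, and a newly readied priority 28 thread preempts it immediately.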
Windows supports demand-paged virtual memory. The paging algorithm is
based on the working-set concept. The system maintains several types of page
lists, to optimize the use of memory. The various page lists are fed by trimming
the working sets using complex formulas that try to reuse physical pages that have
not been referenced in a long time. The cache manager manages virtual addresses
in the kernel that can be used to map files into memory, dramatically improving
I/O performance for many applications because read operations can be satisfied
without accessing the disk.
I/O is performed by device drivers, which follow the Windows Driver Model.
Each driver starts out by initializing a driver object that contains the addresses of
the procedures that the system can call to manipulate devices. The actual devices
are represented by device objects, which are created from the configuration de-
scription of the system or by the plug-and-play manager as it discovers devices
when enumerating the system buses. Devices are stacked and I/O request packets
are passed down the stack and serviced by the drivers for each device in the device
stack. I/O is inherently asynchronous, and drivers commonly queue requests for
further work and return to their caller. File-system volumes are implemented
as devices in the I/O system.
The NTFS file system is based on a master file table, which has one record per
file or directory. All the metadata in an NTFS file system is itself part of an NTFS
file. Each file has multiple attributes, which can be either in the MFT record or
nonresident (stored in blocks outside the MFT). NTFS supports Unicode, com-
pression, journaling, and encryption among many other features.
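The run-based extent encoding NTFS uses for nonresident attributes can be illustrated like this. It is a sketch of the idea only; actual MFT run entries use a compact variable-length byte format with relative disk addresses.

```python
def blocks_to_runs(disk_addresses):
    """Encode a file's block list as NTFS-style runs (start, length).
    None marks a hole in a sparse file, which stores no run at all."""
    runs, i = [], 0
    while i < len(disk_addresses):
        if disk_addresses[i] is None:          # hole: skip, no run stored
            i += 1
            continue
        start, length = disk_addresses[i], 1
        # Extend the run while the next block is physically contiguous.
        while (i + length < len(disk_addresses)
               and disk_addresses[i + length] == start + length):
            length += 1
        i += length
        runs.append((start, length))
    return runs
```

For instance, `blocks_to_runs([20, 21, 22, 30, 31, None, 40])` yields `[(20, 3), (30, 2), (40, 1)]`, with the `None` standing in for a hole in a sparse file.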
Finally, Windows has a sophisticated security system based on access control
lists and integrity levels. Each process has an authentication token that tells the
identity of the user and what special privileges the process has, if any. Each object
has a security descriptor associated with it. The security descriptor points to a dis-
cretionary access control list that contains access control entries that can allow or
deny access to individuals or groups. Windows has added numerous security fea-
tures in recent releases, including BitLocker for encrypting entire volumes, and ad-
dress-space randomization, nonexecutable stacks, and other measures to make suc-
cessful attacks more difficult.
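The core of the DACL access check summarized above can be sketched as follows. This is a simplified model: real ACEs also carry inheritance flags, and features such as generic-right mappings and integrity-level checks are omitted.

```python
READ, WRITE = 0x1, 0x2    # toy access rights; real masks are richer

def access_check(token_sids, requested, dacl):
    """Walk the ACEs in order: a deny ACE matching any requested right
    fails the check immediately, while allow ACEs accumulate rights
    until everything requested has been granted."""
    granted = 0
    for ace_type, sid, mask in dacl:
        if sid not in token_sids:
            continue
        if ace_type == "deny" and (mask & requested):
            return False
        if ace_type == "allow":
            granted |= mask & requested
            if granted == requested:
                return True
    return False              # nothing explicitly granted: deny
```

Because ACEs are order-sensitive, placing deny entries first (as Windows tools conventionally do) ensures that a denied group member cannot regain access through a later allow entry.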

PROBLEMS

1. Give one advantage and one disadvantage of the registry vs. having individual .ini files.
2. A mouse can have one, two, or three buttons. All three types are in use. Does the HAL
hide this difference from the rest of the operating system? Why or why not?
3. The HAL keeps track of time starting in the year 1601. Give an example of an applica-
tion where this feature is useful.
4. In Sec. 11.3.3 we described the problems caused by multithreaded applications closing
handles in one thread while still using them in another. One possibility for fixing this
would be to insert a sequence field. How could this help? What changes to the system
would be required?
5. Many components of the executive (Fig. 11-11) call other components of the executive.
Give three examples of one component calling another one, but use six different components in all.
CHAP. 11 PROBLEMS 977
6. Win32 does not have signals. If they were to be introduced, they could be per process,
per thread, both, or neither. Make a proposal and explain why it is a good idea.
7. An alternative to using DLLs is to statically link each program with precisely those li-
brary procedures it actually calls, no more and no less. If this scheme were to be intro-
duced, would it make more sense on client machines or on server machines?
8. The discussion of Windows User-Mode Scheduling mentioned that user-mode and ker-
nel-mode threads had different stacks. What are some reasons why separate stacks are
needed?
9. Windows uses 2-MB large pages because it improves the effectiveness of the TLB,
which can have a profound impact on performance. Why is this? Why are 2-MB large
pages not used all the time?
10. Is there any limit on the number of different operations that can be defined on an exec-
utive object? If so, where does this limit come from? If not, why not?
11. The Win32 API call WaitForMultipleObjects allows a thread to block on a set of syn-
chronization objects whose handles are passed as parameters. As soon as any one of
them is signaled, the calling thread is released. Is it possible to have the set of syn-
chronization objects include two semaphores, one mutex, and one critical section?
Why or why not? (Hint: This is not a trick question but it does require some careful
thought.)
12. When initializing a global variable in a multithreaded program, a common pro-
gramming error is to allow a race condition where the variable can be initialized twice.
Why could this be a problem? Windows provides the InitOnceExecuteOnce API to
prevent such races. How might it be implemented?
13. Name three reasons why a desktop process might be terminated. What additional rea-
son might cause a process running a modern application to be terminated?
14. Modern applications must save their state to disk every time the user switches away
from the application. This seems inefficient, as users may switch back to an applica-
tion many times and the application simply resumes running. Why does the operating
system require applications to save their state so often rather than just giving them a
chance at the point the application is actually going to be terminated?
15. As described in Sec. 11.4, there is a special handle table used to allocate IDs for proc-
esses and threads. The algorithms for handle tables normally allocate the first avail-
able handle (maintaining the free list in LIFO order). In recent releases of Windows
this was changed so that the ID table always keeps the free list in FIFO order. What is
the problem that the LIFO ordering potentially causes for allocating process IDs, and
why does UNIX not have this problem?
16. Suppose that the quantum is set to 20 msec and the current thread, at priority 24, has
just started a quantum. Suddenly an I/O operation completes and a priority 28 thread
is made ready. About how long does it have to wait to get to run on the CPU?
17. In Windows, the current priority is always greater than or equal to the base priority.
Are there any circumstances in which it would make sense to have the current priority
be lower than the base priority? If so, give an example. If not, why not?
18. Windows uses a facility called Autoboost to temporarily raise the priority of a thread
that holds the resource that is required by a higher-priority thread. How do you think
this works?
19. In Windows it is easy to implement a facility where threads running in the kernel can
temporarily attach to the address space of a different process. Why is this so much
harder to implement in user mode? Why might it be interesting to do so?
20. Name two ways to give better response time to the threads in important processes.
21. Even when there is plenty of free memory available, and the memory manager does not
need to trim working sets, the paging system can still frequently be writing to disk.
Why?
22. Windows swaps the processes for modern applications rather than reducing their work-
ing set and paging them. Why would this be more efficient? (Hint: It makes much less
of a difference when the disk is an SSD.)
23. Why does the self-map used to access the physical pages of the page directory and
page tables for a process always occupy the same 8 MB of kernel virtual addresses (on
the x86)?
24. The x86 can use either 64-bit or 32-bit page table entries. Windows uses 64-bit PTEs
so the system can access more than 4 GB of memory. With 32-bit PTEs, the self-map
uses only one PDE in the page directory, and thus occupies only 4 MB of addresses
rather than 8 MB. Why is this?
25. If a region of virtual address space is reserved but not committed, do you think a VAD
is created for it? Defend your answer.
26. Which of the transitions shown in Fig. 11-34 are policy decisions, as opposed to re-
quired moves forced by system events (e.g., a process exiting and freeing its pages)?
27. Suppose that a page is shared and in two working sets at once. If it is evicted from one
of the working sets, where does it go in Fig. 11-34? What happens when it is evicted
from the second working set?
28. When a process unmaps a clean stack page, it makes the transition (5) in Fig. 11-34.
Where does a dirty stack page go when unmapped? Why is there no transition to the
modified list when a dirty stack page is unmapped?
29. Suppose that a dispatcher object representing some type of exclusive lock (like a
mutex) is marked to use a notification event instead of a synchronization event to
announce that the lock has been released. Why would this be bad? How much would
the answer depend on lock hold times, the length of quantum, and whether the system
was a multiprocessor?
30. To support POSIX, the native NtCreateProcess API supports duplicating a process in
order to support fork. In UNIX fork is shortly followed by an exec most of the time.
One example where this was used historically was in the Berkeley dump(8S) program
which would back up disks to magnetic tape. Fork was used as a way of checkpointing
the dump program so it could be restarted if there was an error with the tape device.
Give an example of how Windows might do something similar using NtCreateProcess.
(Hint: Consider processes that host DLLs to implement functionality provided by a
third party).
31. A file has the following mapping. Give the MFT run entries.
Offset 0 1 2 3 4 5 6 7 8 9 10
Disk address 50 51 52 22 24 25 26 53 54 - 60
32. Consider the MFT record of Fig. 11-41. Suppose that the file grew and a 10th block
was assigned to the end of the file. The number of this block is 66. What would the
MFT record look like now?
33. In Fig. 11-44(b), the first two runs are each of length 8 blocks. Is it just an accident
that they are equal, or does this have to do with the way compression works? Explain
your answer.
34. Suppose that you wanted to build Windows Lite. Which of the fields of Fig. 11-45
could be removed without weakening the security of the system?
35. The mitigation strategy for improving security despite the continuing presence of vul-
nerabilities has been very successful. Modern attacks are very sophisticated, often re-
quiring the presence of multiple vulnerabilities to build a reliable exploit. One of the
vulnerabilities that is usually required is an information leak. Explain how an infor-
mation leak can be used to defeat address-space randomization in order to launch an
attack based on return-oriented programming.
36. An extension model used by many programs (Web browsers, Office, COM servers)
involves hosting DLLs to hook and extend their underlying functionality. Is this a rea-
sonable model for an RPC-based service to use as long as it is careful to impersonate
clients before loading the DLL? Why not?
37. When running on a NUMA machine, whenever the Windows memory manager needs
to allocate a physical page to handle a page fault it attempts to use a page from the
NUMA node for the current thread’s ideal processor. Why? What if the thread is cur-
rently running on a different processor?
38. Give a couple of examples where an application might be able to recover easily from a
backup based on a volume shadow copy rather than from the state of the disk after a system
crash.
39. In Sec. 11.10, providing new memory to the process heap was mentioned as one of the
scenarios that require a supply of zeroed pages in order to satisfy security re-
quirements. Give one or more other examples of virtual memory operations that re-
quire zeroed pages.
40. Windows contains a hypervisor which allows multiple operating systems to run simul-
taneously. This is available on clients, but is far more important in cloud computing.
When a security update is applied to a guest operating system, it is not much different
than patching a server. However, when a security update is applied to the root operat-
ing system, this can be a big problem for the users of cloud computing. What is the
nature of the problem? What can be done about it?
41. The regedit command can be used to export part or all of the registry to a text file
under all current versions of Windows. Save the registry several times during a work
session and see what changes. If you have access to a Windows computer on which
you can install software or hardware, find out what changes when a program or device
is added or removed.
42. Write a UNIX program that simulates writing an NTFS file with multiple streams. It
should accept a list of one or more files as arguments and write an output file that con-
tains one stream with the attributes of all arguments and additional streams with the
contents of each of the arguments. Now write a second program for reporting on the
attributes and streams and extracting all the components.
