Understanding Linux Socket Programming
The second type is rather similar to the first one, except that it preserves packet
boundaries. If the sender makes five separate calls to write, each for 512 bytes, and
the receiver asks for 2560 bytes, with a type 1 socket all 2560 bytes will be re-
turned at once. With a type 2 socket, only 512 bytes will be returned. Four more
calls are needed to get the rest. The third type of socket is used to give the user ac-
cess to the raw network. This type is especially useful for real-time applications,
and for those situations in which the user wants to implement a specialized error-
handling scheme. Packets may be lost or reordered by the network. There are no
guarantees, as in the first two cases. The advantage of this mode is higher per-
formance, which sometimes outweighs reliability (e.g., for multimedia delivery, in
which being fast counts for more than being right).
When a socket is created, one of the parameters specifies the protocol to be
used for it. For reliable byte streams, the most popular protocol is TCP (Transmis-
sion Control Protocol). For unreliable packet-oriented transmission, UDP (User
Datagram Protocol) is the usual choice. Both of these are layered on top of IP
(Internet Protocol). All of these protocols originated with the U.S. Dept. of
Defense’s ARPANET, and now form the basis of the Internet. There is no common
protocol for reliable packet streams.
Before a socket can be used for networking, it must have an address bound to
it. This address can be in one of several naming domains. The most common one
is the Internet naming domain, which uses 32-bit integers for naming endpoints in
Version 4 and 128-bit integers in Version 6 (Version 5 was an experimental system
that never made it to the major leagues).
Once sockets have been created on both the source and destination computers,
a connection can be established between them (for connection-oriented communication). One party makes a listen system call on its local socket to announce that it is prepared to accept incoming connections; a subsequent accept call then blocks until a connection request arrives. The other party makes a connect system call, giving
as parameters the file descriptor for a local socket and the address of a remote
socket. If the remote party accepts the call, the system then establishes a con-
nection between the sockets.
Once a connection has been established, it functions analogously to a pipe. A
process can read and write from it using the file descriptor for its local socket.
When the connection is no longer needed, it can be closed in the usual way, via the
close system call.
Each I/O device in a Linux system generally has a special file associated with
it. Most I/O can be done by just using the proper file, eliminating the need for spe-
cial system calls. Nevertheless, sometimes there is a need for something that is de-
vice specific. Prior to POSIX most UNIX systems had a system call ioctl that per-
formed a large number of device-specific actions on special files. Over the course
of the years, it had gotten to be quite a mess. POSIX cleaned it up by splitting its
SEC. 10.5 INPUT/OUTPUT IN LINUX 771
functions into separate function calls primarily for terminal devices. In Linux and
modern UNIX systems, whether each one is a separate system call or they share a
single system call or something else is implementation dependent.
The first four calls listed in Fig. 10-20 are used to set and get the terminal
speed. Different calls are provided for input and output because some modems op-
erate at split speed. For example, old videotex systems allowed people to access
public databases with short requests from the home to the server at 75 bits/sec with
replies coming back at 1200 bits/sec. This standard was adopted at a time when
1200 bits/sec both ways was too expensive for home use. Times change in the net-
working world. This asymmetry still persists, with some telephone companies
offering inbound service at 20 Mbps and outbound service at 2 Mbps, often under
the name of ADSL (Asymmetric Digital Subscriber Line).
Figure 10-20. The main POSIX calls for managing the terminal.
The last two calls in the list are for setting and reading back all the special
characters used for erasing characters and lines, interrupting processes, and so on.
In addition, they enable and disable echoing, handle flow control, and perform
other related functions. Additional I/O function calls also exist, but they are some-
what specialized, so we will not discuss them further. In addition, ioctl is still avail-
able.
Adding a new device type to Linux means adding a new entry to one
of these tables and supplying the corresponding procedures to handle the various
operations on the device.
Some of the operations which may be associated with different character de-
vices are shown in Fig. 10-21. Each row refers to a single I/O device (i.e., a single
driver). The columns represent the functions that all character drivers must sup-
port. Several other functions also exist. When an operation is performed on a char-
acter special file, the system indexes into the hash table of character devices to
select the proper structure, then calls the corresponding function to have the work
performed. Thus each of the file operations contains a pointer to a function con-
tained in the corresponding driver.
Figure 10-21. Some of the file operations supported for typical character devices.
Each driver is split into two parts, both of which are part of the Linux kernel
and both of which run in kernel mode. The top half runs in the context of the caller
and interfaces to the rest of Linux. The bottom half runs in kernel context and
interacts with the device. Drivers are allowed to make calls to kernel procedures
for memory allocation, timer management, DMA control, and other things. The set
of kernel functions that may be called is defined in a document called the Driver-
Kernel Interface. Writing device drivers for Linux is covered in detail in Cooper-
stein (2009) and Corbet et al. (2009).
The I/O system is split into two major components: the handling of block spe-
cial files and the handling of character special files. We will now look at each of
these components in turn.
The goal of the part of the system that does I/O on block special files (e.g.,
disks) is to minimize the number of transfers that must be done. To accomplish
this goal, Linux has a cache between the disk drivers and the file system, as illus-
trated in Fig. 10-22. Prior to the 2.2 kernel, Linux maintained completely separate
page and buffer caches, so a file residing in a disk block could be cached in both
caches. Newer versions of Linux have a unified cache. A generic block layer holds
these components together, performs the necessary translations between disk sec-
tors, blocks, buffers and pages of data, and enables the operations on them.
The cache is a table in the kernel for holding thousands of the most recently
used blocks. When a block is needed from a disk for whatever reason (i-node,
directory, or data), a check is first made to see if it is in the cache. If it is present in
the cache, the block is taken from there and a disk access is avoided, thereby re-
sulting in great improvements in system performance.
Figure 10-22. The Linux I/O system showing one file system in detail.
If the block is not in the page cache, it is read from the disk into the cache and
from there copied to where it is needed. Since the page cache has room for only a
fixed number of blocks, the page-replacement algorithm described in the previous
section is invoked.
The page cache works for writes as well as for reads. When a program writes a
block, it goes to the cache, not to the disk. The pdflush daemon flushes the
block to disk when the amount of dirty data in the cache grows above a specified
threshold. In addition, to
avoid having blocks stay too long in the cache before being written to the disk, all
dirty blocks are written to the disk every 30 seconds.
In order to reduce the latency of repetitive disk-head movements, Linux relies
on an I/O scheduler. Its purpose is to reorder or bundle read/write requests to
block devices. There are many scheduler variants, optimized for different types of
workloads. The basic Linux scheduler is based on the original Linux elevator
scheduler. The operations of the elevator scheduler can be summarized as fol-
lows: Disk operations are sorted in a doubly linked list, ordered by the address of
the sector of the disk request. New requests are inserted in this list in a sorted man-
ner. This prevents repeated costly disk-head movements. The request list is subse-
quently merged so that adjacent operations are issued via a single disk request. The
basic elevator scheduler can lead to starvation. Therefore, the revised version of
the Linux disk scheduler includes two additional lists, maintaining read or write
operations ordered by their deadlines. The default deadlines are 0.5 sec for reads
774 CASE STUDY 1: UNIX, LINUX, AND ANDROID CHAP. 10
and 5 sec for writes. If a system-defined deadline for the oldest write operation is
about to expire, that write request will be serviced before any of the requests on the
main doubly linked list.
In addition to regular disk files, there are also block special files, also called
raw block files. These files allow programs to access the disk using absolute
block numbers, without regard to the file system. They are most often used for
things like paging and system maintenance.
The interaction with character devices is simple. Since character devices pro-
duce or consume streams of characters, or bytes of data, support for random access
makes little sense. One exception is the use of line disciplines. A line discipline
can be associated with a terminal device, represented via the structure tty_struct,
and it represents an interpreter for the data exchanged with the terminal device. For
instance, local line editing can be done (i.e., erased characters and lines can be re-
moved), carriage returns can be mapped onto line feeds, and other special proc-
essing can be completed. However, if a process wants to interact on every charac-
ter, it can put the line in raw mode, in which case the line discipline will be bypas-
sed. Not all devices have line disciplines.
Output works in a similar way, expanding tabs to spaces, converting line feeds
to carriage returns + line feeds, adding filler characters following carriage returns
on slow mechanical terminals, and so on. Like input, output can go through the line
discipline (cooked mode) or bypass it (raw mode). Raw mode is especially useful
when sending binary data to other computers over a serial line and for GUIs. Here,
no conversions are desired.
The interaction with network devices is different. While network devices also
produce/consume streams of characters, their asynchronous nature makes them less
suitable for easy integration under the same interface as other character devices.
The networking device driver produces packets consisting of multiple bytes of
data, along with network headers. These packets are then routed through a series of
network protocol drivers, and ultimately are passed to the user-space application. A
key data structure is the socket buffer structure, skbuff, which is used to represent
portions of memory filled with packet data. The data in an skbuff buffer do not al-
ways start at the start of the buffer. As they are being processed by various proto-
cols in the networking stack, protocol headers may be removed, or added. The user
processes interact with networking devices via sockets, which in Linux support the
original BSD socket API. The protocol drivers can be bypassed and direct access
to the underlying network device is enabled via raw sockets. Only the superuser is
allowed to create raw sockets.
For decades, UNIX device drivers were statically linked into the kernel so they
were all present in memory whenever the system was booted. Given the environ-
ment in which UNIX grew up, commonly departmental minicomputers and then
high-end workstations, with their small and unchanging sets of I/O devices, this
scheme worked well. Basically, a computer center built a kernel containing drivers
for the I/O devices and that was it. If next year the center bought a new disk, it
relinked the kernel. No big deal.
With the arrival of Linux on the PC platform, suddenly all that changed. The
number of I/O devices available on the PC is orders of magnitude larger than on
any minicomputer. In addition, although all Linux users have (or can easily get)
the full source code, probably the vast majority would have considerable difficulty
adding a driver, updating all the device-driver related data structures, relinking the
kernel, and then installing it as the bootable system (not to mention dealing with
the aftermath of building a kernel that does not boot).
Linux solved this problem with the concept of loadable modules. These are
chunks of code that can be loaded into the kernel while the system is running. Most
commonly these are character or block device drivers, but they can also be entire
file systems, network protocols, performance monitoring tools, or anything else de-
sired.
When a module is loaded, several things have to happen. First, the module has
to be relocated on the fly, during loading. Second, the system has to check to see if
the resources the driver needs are available (e.g., interrupt request levels) and if so,
mark them as in use. Third, any interrupt vectors that are needed must be set up.
Fourth, the appropriate driver switch table has to be updated to handle the new
major device type. Finally, the driver is allowed to run to perform any device-spe-
cific initialization it may need. Once all these steps are completed, the driver is
fully installed, the same as any statically installed driver. Other modern UNIX sys-
tems now also support loadable modules.
The initial Linux file system was the MINIX 1 file system. However, because
it limited file names to 14 characters (in order to be compatible with UNIX Version
7) and its maximum file size was 64 MB (which was overkill on the 10-MB hard
disks of its era), there was interest in better file systems almost from the beginning
of the Linux development, which began about 5 years after MINIX 1 was released.
The first improvement was the ext file system, which allowed file names of 255
characters and files of 2 GB, but it was slower than the MINIX 1 file system, so the
search continued for a while. Eventually, the ext2 file system was invented, with
long file names, long files, and better performance, and it has become the main file
system. However, Linux supports several dozen file systems using the Virtual File
System (VFS) layer (described in the next section). When Linux is linked, a
choice is offered of which file systems should be built into the kernel. Others can
be dynamically loaded as modules during execution, if need be.
A Linux file is a sequence of 0 or more bytes containing arbitrary information.
No distinction is made between ASCII files, binary files, or any other kinds of
files. The meaning of the bits in a file is entirely up to the file’s owner. The system
does not care. File names are limited to 255 characters, and all the ASCII charac-
ters except NUL are allowed in file names, so a file name consisting of three car-
riage returns is a legal file name (but not an especially convenient one).
By convention, many programs expect file names to consist of a base name and
an extension, separated by a dot (which counts as a character). Thus prog.c is
typically a C program, prog.py is typically a Python program, and prog.o is usually an
object file (compiler output). These conventions are not enforced by the operating
system but some compilers and other programs expect them. Extensions may be of
any length, and files may have multiple extensions, as in prog.java.gz, which is
probably a gzip compressed Java program.
Files can be grouped together in directories for convenience. Directories are
stored as files and to a large extent can be treated like files. Directories can contain
subdirectories, leading to a hierarchical file system. The root directory is called /
and always contains several subdirectories. The / character is also used to separate
directory names, so that the name /usr/ast/x denotes the file x located in the direc-
tory ast, which itself is in the /usr directory. Some of the major directories near the
top of the tree are shown in Fig. 10-23.
Directory Contents
bin Binary (executable) programs
dev Special files for I/O devices
etc Miscellaneous system files
lib Libraries
usr User directories
There are two ways to specify file names in Linux, both to the shell and when
opening a file from inside a program. The first way is by means of an absolute
path, which means telling how to get to the file starting at the root directory.
SEC. 10.6 THE LINUX FILE SYSTEM 777
In the example just discussed, we suggested that before linking, the only way
for Fred to refer to Lisa’s file x was by using its absolute path. Actually, this is not
really true. When a directory is created, two entries, . and .., are automatically
made in it. The former refers to the working directory itself. The latter refers to the
directory’s parent, that is, the directory in which it itself is listed. Thus from
/usr/fred, another path to Lisa’s file x is ../lisa/x.
In addition to regular files, Linux also supports character special files and
block special files. Character special files are used to model serial I/O devices,
such as keyboards and printers. Opening and reading from /dev/tty reads from the
keyboard; opening and writing to /dev/lp writes to the printer. Block special files,
often with names like /dev/hd1, can be used to read and write raw disk partitions
without regard to the file system. Thus a seek to byte k followed by a read will be-
gin reading from the kth byte on the corresponding partition, completely ignoring
the i-node and file structure. Raw block devices are used for paging and swapping,
by programs that lay down file systems (e.g., mkfs), and by programs that fix sick
file systems (e.g., fsck).
Many computers have two or more disks. On mainframes at banks, for ex-
ample, it is frequently necessary to have 100 or more disks on a single machine, in
order to hold the huge databases required. Even personal computers often have at
least two disks—a hard disk and an optical (e.g., DVD) drive. When there are mul-
tiple disk drives, the question arises of how to handle them.
One solution is to put a self-contained file system on each one and just keep
them separate. Consider, for example, the situation shown in Fig. 10-25(a). Here
we have a hard disk, which we call C:, and a DVD, which we call D:. Each has its
own root directory and files. With this solution, the user has to specify both the de-
vice and the file when anything other than the default is needed. For instance, to
copy a file x to a directory d (assuming C: is the default), one would type
cp D:/x /a/d/x
This is the approach taken by a number of systems, including Windows 8, which it
inherited from MS-DOS in a century long ago.
The Linux solution is to allow one disk to be mounted in another disk’s file
tree. In our example, we could mount the DVD on the directory /b, yielding the
file system of Fig. 10-25(b). The user now sees a single file tree, and no longer has
to be aware of which file resides on which device. The above copy command now
becomes
cp /b/x /a/d/x
exactly the same as it would have been if everything had been on the hard disk in
the first place.
Another interesting property of the Linux file system is locking. In some ap-
plications, two or more processes may be using the same file at the same time,
which may lead to race conditions. One solution is to program the application with
critical regions. However, if the processes belong to independent users who do not
even know each other, this kind of coordination is generally inconvenient.
Consider, for example, a database consisting of many files in one or more di-
rectories that are accessed by unrelated users. It is certainly possible to associate a
semaphore with each directory or file and achieve mutual exclusion by having
processes do a down operation on the appropriate semaphore before accessing the
data. The disadvantage, however, is that a whole directory or file is then made inac-
cessible, even though only one record may be needed.
For this reason, POSIX provides a flexible and fine-grained mechanism for
processes to lock as little as a single byte and as much as an entire file in one
indivisible operation. The locking mechanism requires the caller to specify the file
to be locked, the starting byte, and the number of bytes. If the operation succeeds,
the system makes a table entry noting that the bytes in question (e.g., a database
record) are locked.
Two kinds of locks are provided, shared locks and exclusive locks. If a por-
tion of a file already contains a shared lock, a second attempt to place a shared lock
on it is permitted, but an attempt to put an exclusive lock on it will fail. If a por-
tion of a file contains an exclusive lock, all attempts to lock any part of that portion
will fail until the lock has been released. In order to successfully place a lock,
every byte in the region to be locked must be available.
When placing a lock, a process must specify whether it wants to block or not
in the event that the lock cannot be placed. If it chooses to block, when the exist-
ing lock has been removed, the process is unblocked and the lock is placed. If the
process chooses not to block when it cannot place a lock, the system call returns
immediately, with the status code telling whether the lock succeeded or not. If it
did not, the caller has to decide what to do next (e.g., wait and try again).
Locked regions may overlap. In Fig. 10-26(a) we see that process A has placed
a shared lock on bytes 4 through 7 of some file. Later, process B places a shared
lock on a range of bytes that overlaps A's, which is permitted because both locks
are shared, as shown in Fig. 10-26(b).
Figure 10-26. (a) A file with one lock. (b) Adding a second lock. (c) A third one.
Many system calls relate to files and the file system. First we will look at the
system calls that operate on individual files. Later we will examine those that
involve directories or the file system as a whole. To create a new file, the creat call
can be used. (When Ken Thompson was once asked what he would do differently
if he had the chance to reinvent UNIX, he replied that he would spell creat as cre-
ate this time.) The parameters provide the name of the file and the protection
mode. Thus
fd = creat("abc", mode);
creates a file called abc with the protection bits taken from mode. These bits deter-
mine which users may access the file and how. They will be described later.
The creat call not only creates a new file, but also opens it for writing. To
allow subsequent system calls to access the file, a successful creat returns a small
nonnegative integer, the file descriptor.
Figure 10-27. Some system calls relating to files. The return code s is −1 if an
error has occurred; fd is a file descriptor, and position is a file offset. The parame-
ters should be self explanatory.
The most heavily used calls are undoubtedly read and write. Each one has
three parameters: a file descriptor (telling which open file to read or write), a buffer
address (telling where to put the data or get the data from), and a count (telling
how many bytes to transfer). That is all there is. It is a very simple design. A typ-
ical call is
n = read(fd, buffer, nbytes);
Although nearly all programs read and write files sequentially, some programs
need to be able to access any part of a file at random. Associated with each file is a
pointer that indicates the current position in the file. When reading (or writing) se-
quentially, it normally points to the next byte to be read (written). If the pointer is
at, say, 4096, before 1024 bytes are read, it will automatically be moved to 5120
after a successful read system call. The lseek call changes the value of the position
pointer, so that subsequent calls to read or write can begin anywhere in the file, or
even beyond the end of it. It is called lseek to avoid conflicting with seek, a now-
obsolete call that was formerly used on 16-bit computers for seeking.
Lseek has three parameters: the first one is the file descriptor for the file; the
second is a file position; the third tells whether the file position is relative to the be-
ginning of the file, the current position, or the end of the file. The value returned by
lseek is the absolute position in the file after the file pointer is changed. Slightly
ironically, lseek is the only file system call that never causes a real disk seek be-
cause all it does is update the current file position, which is a number in memory.
For each file, Linux keeps track of the file mode (regular, directory, special
file), size, time of last modification, and other information. Programs can ask to see
this information via the stat system call. The first parameter is the file name. The
second is a pointer to a structure where the information requested is to be put. The
fields in the structure are shown in Fig. 10-28. The fstat call is the same as stat ex-
cept that it operates on an open file (whose name may not be known) rather than on
a path name.
The pipe system call is used to create shell pipelines. It creates a kind of
pseudofile, which buffers the data between the pipeline components, and returns
file descriptors for both reading and writing the buffer. In a pipeline such as
sort <in | head -30
file descriptor 1 (standard output) in the process running sort would be set (by the
shell) to write to the pipe, and file descriptor 0 (standard input) in the process run-
ning head would be set to read from the pipe. In this way, sort just reads from file
descriptor 0 (set to the file in) and writes to file descriptor 1 (the pipe) without even
being aware that these have been redirected. If they have not been redirected, sort
will automatically read from the keyboard and write to the screen (the default de-
vices). Similarly, when head reads from file descriptor 0, it is reading the data sort
put into the pipe buffer without even knowing that a pipe is in use. This is a clear
example of how a simple concept (redirection) with a simple implementation (file
descriptors 0 and 1) can lead to a powerful tool (connecting programs in arbitrary
ways without having to modify them at all).
The last system call in Fig. 10-27 is fcntl. It is used to lock and unlock files,
apply shared or exclusive locks, and perform a few other file-specific operations.
Now let us look at some system calls that relate more to directories or the file
system as a whole, rather than just to one specific file. Some common ones are list-
ed in Fig. 10-29. Directories are created and destroyed using mkdir and rmdir, re-
spectively. A directory can be removed only if it is empty.
Figure 10-29. Some system calls relating to directories. The return code s is −1
if an error has occurred; dir identifies a directory stream, and dirent is a directory
entry. The parameters should be self explanatory.
As we saw in Fig. 10-24, linking to a file creates a new directory entry that
points to an existing file. The link system call creates the link. The parameters spec-
ify the original and new names, respectively. Directory entries are removed with
unlink. When the last link to a file is removed, the file is automatically deleted. For
a file that has never been linked, the first unlink causes it to disappear.
The working directory is changed by the chdir system call. Doing so has the ef-
fect of changing the interpretation of relative path names.
The last four calls of Fig. 10-29 are for reading directories. They can be open-
ed, closed, and read, analogous to ordinary files. Each call to readdir returns exact-
ly one directory entry in a fixed format. There is no way for users to write in a di-
rectory (in order to maintain the integrity of the file system). Files can be added to
a directory using creat or link and removed using unlink. There is also no way to
seek to a specific file in a directory, but rewinddir allows an open directory to be
read again from the beginning.
In this section we will first look at the abstractions supported by the Virtual
File System layer. The VFS hides from higher-level processes and applications the
differences among many types of file systems supported by Linux, whether they
are residing on local devices or are stored remotely and need to be accessed over
the network. Devices and other special files are also accessed through the VFS
layer. Next, we will describe the implementation of the first widespread Linux file
system, ext2, or the second extended file system. Afterward, we will discuss the
improvements in the ext4 file system. A wide variety of other file systems are also
in use. All Linux systems can handle multiple disk partitions, each with a different
file system on it.
Directory entries resolved during path-name lookup are cached in what is called the dentry cache. For instance, the dentry cache
would contain entries for /, /usr, /usr/ast, and the like. If multiple processes access
the same file through the same hard link (i.e., same path), their file object will
point to the same entry in this cache.
Finally, the file data structure is an in-memory representation of an open file,
and is created in response to the open system call. It supports operations such as
read, write, sendfile, lock, and other system calls described in the previous section.
The actual file systems implemented underneath the VFS need not use the
exact same abstractions and operations internally. They must, however, implement
file-system operations semantically equivalent to those specified with the VFS ob-
jects. The elements of the operations data structures for each of the four VFS ob-
jects are pointers to functions in the underlying file system.
We next describe one of the most popular on-disk file systems used in Linux:
ext2. The first Linux release used the MINIX 1 file system and was limited by
short file names and 64-MB file sizes. The MINIX 1 file system was eventually re-
placed by the first extended file system, ext, which permitted both longer file
names and larger file sizes. Due to its performance inefficiencies, ext was replaced
by its successor, ext2, which is still in widespread use.
An ext2 Linux disk partition contains a file system with the layout shown in
Fig. 10-31. Block 0 is not used by Linux and contains code to boot the computer.
Following block 0, the disk partition is divided into groups of blocks, irrespective
of where the disk cylinder boundaries fall. Each group is organized as follows.
The first block is the superblock. It contains information about the layout of
the file system, including the number of i-nodes, the number of disk blocks, and
the start of the list of free disk blocks (typically a few hundred entries). Next
comes the group descriptor, which contains information about the location of the
bitmaps, the number of free blocks and i-nodes in the group, and the number of di-
rectories in the group. This information is important since ext2 attempts to spread
directories evenly over the disk.
Boot | Block group 0 | Block group 1 | Block group 2 | Block group 3 | Block group 4 | ...
Two bitmaps are used to keep track of the free blocks and free i-nodes, respect-
ively, a choice inherited from the MINIX 1 file system (and in contrast to most
UNIX file systems, which use a free list). Each map is one block long. With a
1-KB block, this design limits a block group to 8192 blocks and 8192 i-nodes. The
former is a real restriction but, in practice, the latter is not. With 4-KB blocks, the
numbers are four times larger.
Following the superblock are the i-nodes themselves. They are numbered from
1 up to some maximum. Each i-node is 128 bytes long and describes exactly one
file. An i-node contains accounting information (including all the information re-
turned by stat, which simply takes it from the i-node), as well as enough informa-
tion to locate all the disk blocks that hold the file’s data.
Following the i-nodes are the data blocks. All the files and directories are stor-
ed here. If a file or directory consists of more than one block, the blocks need not
be contiguous on the disk. In fact, the blocks of a large file are likely to be spread
all over the disk.
I-nodes corresponding to directories are dispersed throughout the disk block
groups. Ext2 makes an effort to collocate ordinary files in the same block group as the parent directory, and data blocks in the same block group as the original file i-node, provided that there is sufficient space. This idea was borrowed from the Berkeley Fast
File System (McKusick et al., 1984). The bitmaps are used to make quick decis-
ions regarding where to allocate new file-system data. When new file blocks are al-
located, ext2 also preallocates a number (eight) of additional blocks for that file, so
as to minimize the file fragmentation due to future write operations. This scheme
balances the file-system load across the entire disk. It also performs well due to its
tendencies for collocation and reduced fragmentation.
To access a file, a process must first issue one of the Linux system calls, such as open,
which requires the file’s path name. The path name is parsed to extract individual
directories. If a relative path is specified, the lookup starts from the process’ cur-
rent directory, otherwise it starts from the root directory. In either case, the i-node
for the first directory can easily be located: there is a pointer to it in the process de-
scriptor, or, in the case of a root directory, it is typically stored in a predetermined
block on disk.
The directory file allows file names up to 255 characters and is illustrated in
Fig. 10-32. Each directory consists of some integral number of disk blocks so that
directories can be written atomically to the disk. Within a directory, entries for files
and directories are in unsorted order, with each entry directly following the one be-
fore it. Entries may not span disk blocks, so often there are some number of unused
bytes at the end of each disk block.
Each directory entry in Fig. 10-32 consists of four fixed-length fields and one
variable-length field. The first field is the i-node number, 19 for the file colossal,
42 for the file voluminous, and 88 for the directory bigdir. Next comes a field
rec_len, telling how big the entry is (in bytes), possibly including some padding after the name. This field is needed to find the next entry for the case that the file name is padded by an unknown length.
SEC. 10.6 THE LINUX FILE SYSTEM 787
Figure 10-32. (a) A Linux directory with three files. (b) The same directory after the file voluminous has been removed. (The fixed fields of each entry: i-node number, entry size, type, and file name length.)
number for the /usr/ast directory can be taken from it. Armed with the i-node num-
ber of the /usr/ast directory, this i-node can be read and the directory blocks locat-
ed. Finally, ‘‘file’’ is looked up and its i-node number found. Thus, the use of a rel-
ative path name is not only more convenient for the user, but it also saves a sub-
stantial amount of work for the system.
If the file is present, the system extracts the i-node number and uses it as an
index into the i-node table (on disk) to locate the corresponding i-node and bring it
into memory. The i-node is put in the i-node table, a kernel data structure that
holds all the i-nodes for currently open files and directories. The format of the i-
node entries, as a bare minimum, must contain all the fields returned by the stat
system call so as to make stat work (see Fig. 10-28). In Fig. 10-33 we show some
of the fields included in the i-node structure supported by the Linux file-system
layer. The actual i-node structure contains many more fields, since the same struc-
ture is also used to represent directories, devices, and other special files. The i-
node structure also contains fields reserved for future use. History has shown that
unused bits do not remain that way for long.
Let us now see how the system reads a file. Remember that a typical call to the
library procedure for invoking the read system call looks like this:
n = read(fd, buffer, nbytes);
When the kernel gets control, all it has to start with are these three parameters and
the information in its internal tables relating to the user. One of the items in the in-
ternal tables is the file-descriptor array. It is indexed by a file descriptor and con-
tains one entry for each open file (up to the maximum number, usually 32 by default).
The idea is to start with this file descriptor and end up with the corresponding
i-node. Let us consider one possible design: just put a pointer to the i-node in the
file-descriptor table. Although simple, unfortunately this method does not work.
The problem is as follows. Associated with every file descriptor is a file position
that tells at which byte the next read (or write) will start. Where should it go? One
possibility is to put it in the i-node table. However, this approach fails if two or
more unrelated processes happen to open the same file at the same time because
each one has its own file position.
A second possibility is to put the file position in the file-descriptor table. In
that way, every process that opens a file gets its own private file position. Unfortun-
ately this scheme fails too, but the reasoning is more subtle and has to do with the
nature of file sharing in Linux. Consider a shell script, s, consisting of two com-
mands, p1 and p2, to be run in order. If the shell script is called by the command
s >x
it is expected that p1 will write its output to x, and then p2 will write its output to x
also, starting at the place where p1 stopped.
When the shell forks off p1, x is initially empty, so p1 just starts writing at file
position 0. However, when p1 finishes, some mechanism is needed to make sure
that the initial file position that p2 sees is not 0 (which it would be if the file posi-
tion were kept in the file-descriptor table), but the value p1 ended with.
The way this is achieved is shown in Fig. 10-34. The trick is to introduce a
new table, the open-file-description table, between the file descriptor table and
the i-node table, and put the file position (and read/write bit) there. In this figure,
the parent is the shell and the child is first p1 and later p2. When the shell forks off
p1, its user structure (including the file-descriptor table) is an exact copy of the
shell’s, so both of them point to the same open-file-description table entry. When
p1 finishes, the shell’s file descriptor is still pointing to the open-file description
containing p1’s file position. When the shell now forks off p2, the new child auto-
matically inherits the file position, without either it or the shell even having to
know what that position is.
However, if an unrelated process opens the file, it gets its own open-file-de-
scription entry, with its own file position, which is precisely what is needed. Thus
the whole point of the open-file-description table is to allow a parent and child to
share a file position, but to provide unrelated processes with their own values.
Getting back to the problem of doing the read, we have now shown how the
file position and i-node are located. The i-node contains the disk addresses of the
first 12 blocks of the file. If the file position falls in the first 12 blocks, the block is
read and the data are copied to the user. For files longer than 12 blocks, a field in
the i-node contains the disk address of a single indirect block, as shown in
Fig. 10-34. This block contains the disk addresses of more disk blocks. For ex-
ample, if a block is 1 KB and a disk address is 4 bytes, the single indirect block
can hold 256 disk addresses. Thus this scheme works for files of up to 268 KB.
Beyond that, a double indirect block is used. It contains the addresses of 256
single indirect blocks, each of which holds the addresses of 256 data blocks. This
mechanism is sufficient to handle files of up to 12 + 256 + 2^16 blocks (67,383,296 bytes with 1-KB blocks). If
790 CASE STUDY 1: UNIX, LINUX, AND ANDROID CHAP. 10
Figure 10-34. The relation between the file-descriptor table, the open-file-description table, and the i-node table. (Each process's file-descriptor table points into the open-file-description table, whose entries hold the file position, the R/W bit, and a pointer to the i-node; an unrelated process's descriptor points to its own entry. The i-node holds the mode, link count, uid, gid, file size, times, the addresses of the first 12 disk blocks, and the single, double, and triple indirect pointers.)
even this is not enough, the i-node has space for a triple indirect block. Its point-
ers point to many double indirect blocks. This addressing scheme can handle file
sizes of about 2^24 1-KB blocks (16 GB). For 8-KB block sizes, the addressing scheme
can support file sizes up to 64 TB.
In order to prevent all data loss after system crashes and power failures, the
ext2 file system would have to write out each data block to disk as soon as it was
created. The latency incurred during the required disk-head seek operation would
be so high that the performance would be intolerable. Therefore, writes are delay-
ed, and changes may not be committed to disk for up to 30 sec, which is a very
long time interval in the context of modern computer hardware.
To improve the robustness of the file system, Linux relies on journaling file
systems. Ext3, a successor of the ext2 file system, is an example of a journaling
file system. Ext4, a follow-on of ext3, is also a journaling file system, but unlike
ext3, it changes the block addressing scheme used by its predecessors, thereby sup-
porting both larger files and larger overall file-system sizes. We will describe some
of its features next.
The basic idea behind a journaling file system is to maintain a journal, which
describes all file-system operations in sequential order. By sequentially writing out
changes to the file-system data or metadata (i-nodes, superblock, etc.), the opera-
tions do not suffer from the overheads of disk-head movement during random disk
accesses. Eventually, the changes will be written out, committed, to the appropriate
disk location, and the corresponding journal entries can be discarded. If a system
crash or power failure occurs before the changes are committed, during restart the
system will detect that the file system was not unmounted properly, traverse the
journal, and apply the file-system changes described in the journal log.
Ext4 is designed to be highly compatible with ext2 and ext3, although its core
data structures and disk layout are modified. Regardless, a file system which has
been unmounted as an ext2 system can be subsequently mounted as an ext4 system
and offer the journaling capability.
The journal is a file managed as a circular buffer. It may be stored on the same device as the main file system or on a separate one. Since the journal operations are not themselves "journaled," they are not handled by the ext4 file system proper. Instead, a separate JBD (Journaling Block Device) layer performs the journal read/write operations.
JBD supports three main data structures: log record, atomic operation handle,
and transaction. A log record describes a low-level file-system operation, typically
resulting in changes within a block. Since a system call such as write includes
changes at multiple places—i-nodes, existing file blocks, new file blocks, list of
free blocks, etc.—related log records are grouped in atomic operations. Ext4 noti-
fies JBD of the start and end of system-call processing, so that JBD can ensure that
either all log records in an atomic operation are applied, or none of them. Finally,
primarily for efficiency reasons, JBD treats collections of atomic operations as
transactions. Log records are stored consecutively within a transaction. JBD will
allow portions of the journal file to be discarded only after all log records be-
longing to a transaction are safely committed to disk.
Since writing out a log entry for each disk change may be costly, ext4 may be
configured to keep a journal of all disk changes, or only of changes related to the
file-system metadata (the i-nodes, superblocks, etc.). Journaling only the metadata incurs less overhead and gives better performance, but makes no guarantee against corruption of file data. Several other journaling file systems
maintain logs of only metadata operations (e.g., SGI’s XFS). In addition, the
reliability of the journal can be further improved via checksumming.
A key modification in ext4 compared to its predecessors is the use of extents. Extents represent contiguous blocks of storage, for instance 128 MB of contiguous 4-KB blocks, in contrast to the individual storage blocks referenced in ext2. Unlike its predecessors, ext4 does not require metadata operations for each block of storage. This
scheme also reduces fragmentation for large files. As a result, ext4 can provide
faster file system operations and support larger files and file system sizes. For
instance, for a block size of 1 KB, ext4 increases the maximum file size from 16
GB to 16 TB, and the maximum file system size to 1 EB (Exabyte).
Another Linux file system is the /proc (process) file system, an idea originally
devised in the 8th edition of UNIX from Bell Labs and later copied in 4.4BSD and
System V. However, Linux extends the idea in several ways. The basic concept is
that for every process in the system, a directory is created in /proc. The name of
the directory is the process PID expressed as a decimal number. For example,
/proc/619 is the directory corresponding to the process with PID 619. In this direc-
tory are files that appear to contain information about the process, such as its com-
mand line, environment strings, and signal masks. In fact, these files do not exist
on the disk. When they are read, the system retrieves the information from the ac-
tual process as needed and returns it in a standard format.
Many of the Linux extensions relate to other files and directories located in
/proc. They contain a wide variety of information about the CPU, disk partitions,
devices, interrupt vectors, kernel counters, file systems, loaded modules, and much
more. Unprivileged user programs may read much of this information to learn
about system behavior in a safe way. Some of these files may be written to in order
to change system parameters.
Networking has played a major role in Linux, and UNIX in general, right from
the beginning (the first UNIX network was built to move new kernels from the
PDP-11/70 to the Interdata 8/32 during the port to the latter). In this section we
will examine Sun Microsystem’s NFS (Network File System), which is used on
all modern Linux systems to join the file systems on separate computers into one
logical whole. Currently, the dominant NFS implementation is version 3, introduced in 1994. NFSv4 was introduced in 2000 and provides several enhancements
over the previous NFS architecture. Three aspects of NFS are of interest: the archi-
tecture, the protocol, and the implementation. We will now examine these in turn,
first in the context of the simpler NFS version 3, then we will turn to the enhance-
ments included in v4.
NFS Architecture
The basic idea behind NFS is to allow an arbitrary collection of clients and ser-
vers to share a common file system. In many cases, all the clients and servers are
on the same LAN, but this is not required. It is also possible to run NFS over a
wide area network if the server is far from the client. For simplicity we will speak
of clients and servers as though they were on distinct machines, but in fact, NFS al-
lows every machine to be both a client and a server at the same time.
Each NFS server exports one or more of its directories for access by remote
clients. When a directory is made available, so are all of its subdirectories, so ac-
tually entire directory trees are normally exported as a unit. The list of directories a
server exports is maintained in a file, often /etc/exports, so these directories can be
exported automatically whenever the server is booted. Clients access exported di-
rectories by mounting them. When a client mounts a (remote) directory, it be-
comes part of its directory hierarchy, as shown in Fig. 10-35.
Figure 10-35. Examples of remote mounted file systems. Directories are shown as squares and files as circles.
In this example, client 1 has mounted the bin directory of server 1 on its own
bin directory, so it can now refer to the shell as /bin/sh and get the shell on server
1. Diskless workstations often have only a skeleton file system (in RAM) and get
all their files from remote servers like this. Similarly, client 1 has mounted server
2’s directory /projects on its directory /usr/ast/work so it can now access file a as
/usr/ast/work/proj1/a. Finally, client 2 has also mounted the projects directory and
can also access file a, only as /mnt/proj1/a. As seen here, the same file can have
different names on different clients due to its being mounted in a different place in
the respective trees. The mount point is entirely local to the clients; the server does
not know where it is mounted on any of its clients.
NFS Protocols
Since one of the goals of NFS is to support a heterogeneous system, with cli-
ents and servers possibly running different operating systems on different hard-
ware, it is essential that the interface between the clients and servers be well de-
fined. Only then is anyone able to write a new client implementation and expect it
to work correctly with existing servers, and vice versa.
NFS accomplishes this goal by defining two client-server protocols. A proto-
col is a set of requests sent by clients to servers, along with the corresponding
replies sent by the servers back to the clients.
The first NFS protocol handles mounting. A client can send a path name to a
server and request permission to mount that directory somewhere in its directory
hierarchy. The place where it is to be mounted is not contained in the message, as
the server does not care where it is to be mounted. If the path name is legal and the
directory specified has been exported, the server returns a file handle to the client.
The file handle contains fields uniquely identifying the file-system type, the disk,
the i-node number of the directory, and security information. Subsequent calls to
read and write files in the mounted directory or any of its subdirectories use the file
handle.
When Linux boots, it runs the /etc/rc shell script before going multiuser. Com-
mands to mount remote file systems can be placed in this script, thus automatically
mounting the necessary remote file systems before allowing any logins. Alterna-
tively, most versions of Linux also support automounting. This feature allows a
set of remote directories to be associated with a local directory. None of these re-
mote directories are mounted (or their servers even contacted) when the client is
booted. Instead, the first time a remote file is opened, the operating system sends a
message to each of the servers. The first one to reply wins, and its directory is
mounted.
Automounting has two principal advantages over static mounting via the
/etc/rc file. First, if one of the NFS servers named in /etc/rc happens to be down, it
is impossible to bring the client up, at least not without some difficulty, delay, and
quite a few error messages. If the user does not even need that server at the
moment, all that work is wasted. Second, by allowing the client to try a set of ser-
vers in parallel, a degree of fault tolerance can be achieved (because only one of
them needs to be up), and the performance can be improved (by choosing the first
one to reply—presumably the least heavily loaded).
On the other hand, it is tacitly assumed that all the file systems specified as al-
ternatives for the automount are identical. Since NFS provides no support for file
or directory replication, it is up to the user to arrange for all the file systems to be
the same. Consequently, automounting is most often used for read-only file sys-
tems containing system binaries and other files that rarely change.
The second NFS protocol is for directory and file access. Clients can send
messages to servers to manipulate directories and read and write files. They can
also access file attributes, such as file mode, size, and time of last modification.
Most Linux system calls are supported by NFS, with the perhaps surprising ex-
ceptions of open and close.
The omission of open and close is not an accident. It is fully intentional. It is
not necessary to open a file before reading it, nor to close it when done. Instead, to
read a file, a client sends the server a lookup message containing the file name,
with a request to look it up and return a file handle, which is a structure that identi-
fies the file (i.e., contains a file system identifier and i-node number, among other
data). Unlike an open call, this lookup operation does not copy any information
into internal system tables. The read call contains the file handle of the file to read,
the offset in the file to begin reading, and the number of bytes desired. Each such
message is self-contained. The advantage of this scheme is that the server does not
have to remember anything about open connections in between calls to it. Thus if a
server crashes and then recovers, no information about open files is lost, because
there is none. A server like this that does not maintain state information about
open files is said to be stateless.
Unfortunately, the NFS method makes it difficult to achieve the exact Linux
file semantics. For example, in Linux a file can be opened and locked so that other
processes cannot access it. When the file is closed, the locks are released. In a
stateless server such as NFS, locks cannot be associated with open files, because
the server does not know which files are open. NFS therefore needs a separate, ad-
ditional mechanism to handle locking.
NFS uses the standard UNIX protection mechanism, with the rwx bits for the
owner, group, and others (mentioned in Chap. 1 and discussed in detail below).
Originally, each request message simply contained the user and group IDs of the
caller, which the NFS server used to validate the access. In effect, it trusted the cli-
ents not to cheat. Several years’ experience abundantly demonstrated that such an
assumption was—how shall we put it?—rather naive. Currently, public key crypto-
graphy can be used to establish a secure key for validating the client and server on
each request and reply. When this option is used, a malicious client cannot imper-
sonate another client because it does not know that client’s secret key.
NFS Implementation
Figure 10-36. The NFS layer structure. (The figure shows the virtual file system layer on both the client and the server, the v-nodes, the message traffic from client to server, and the local disks on each side.)
Both the file system and the i-node are recorded in the v-node because modern Linux systems can support
multiple file systems (e.g., ext2fs, /proc, FAT, etc.). Although VFS was invented to
support NFS, most modern Linux systems now support it as an integral part of the
operating system, even if NFS is not used.
To see how v-nodes are used, let us trace a sequence of mount, open, and read
system calls. To mount a remote file system, the system administrator (or /etc/rc)
calls the mount program specifying the remote directory, the local directory on
which it is to be mounted, and other information. The mount program parses the
name of the remote directory to be mounted and discovers the name of the NFS
server on which the remote directory is located. It then contacts that machine, ask-
ing for a file handle for the remote directory. If the directory exists and is available
for remote mounting, the server returns a file handle for the directory. Finally, it
makes a mount system call, passing the handle to the kernel.
The kernel then constructs a v-node for the remote directory and asks the NFS
client code in Fig. 10-36 to create an r-node (remote i-node) in its internal tables
to hold the file handle. The v-node points to the r-node. Each v-node in the VFS
layer will ultimately contain either a pointer to an r-node in the NFS client code, or
a pointer to an i-node in one of the local file systems (shown as dashed lines in
Fig. 10-36). Thus, from the v-node it is possible to see if a file or directory is local
or remote. If it is local, the correct file system and i-node can be located. If it is
remote, the remote host and file handle can be located.
When a remote file is opened on the client, at some point during the parsing of
the path name, the kernel hits the directory on which the remote file system is
mounted. It sees that this directory is remote and in the directory’s v-node finds
the pointer to the r-node. It then asks the NFS client code to open the file. The
NFS client code looks up the remaining portion of the path name on the remote
server associated with the mounted directory and gets back a file handle for it. It
makes an r-node for the remote file in its tables and reports back to the VFS layer,
which puts in its tables a v-node for the file that points to the r-node. Again here
we see that every open file or directory has a v-node that points to either an r-node
or an i-node.
The caller is given a file descriptor for the remote file. This file descriptor is
mapped onto the v-node by tables in the VFS layer. Note that no table entries are
made on the server side. Although the server is prepared to provide file handles
upon request, it does not keep track of which files happen to have file handles out-
standing and which do not. When a file handle is sent to it for file access, it checks
the handle, and if it is valid, uses it. Validation can include verifying an authentica-
tion key contained in the RPC headers, if security is enabled.
When the file descriptor is used in a subsequent system call, for example, read,
the VFS layer locates the corresponding v-node, and from that determines whether
it is local or remote and also which i-node or r-node describes it. It then sends a
message to the server containing the handle, the file offset (which is maintained on
the client side, not the server side), and the byte count. For efficiency reasons,
transfers between client and server are done in large chunks, normally 8192 bytes,
even if fewer bytes are requested.
When the request message arrives at the server, it is passed to the VFS layer
there, which determines which local file system holds the requested file. The VFS
layer then makes a call to that local file system to read and return the bytes. These
data are then passed back to the client. After the client’s VFS layer has gotten the
8-KB chunk it asked for, it automatically issues a request for the next chunk, so it
will have it should it be needed shortly. This feature, known as read ahead, im-
proves performance considerably.
For writes an analogous path is followed from client to server. Also, transfers
are done in 8-KB chunks here, too. If a write system call supplies fewer than 8 KB
of data, the data are just accumulated locally. Only when the entire 8-KB chunk is
full is it sent to the server. However, when a file is closed, all of its data are sent to
the server immediately.
Another technique used to improve performance is caching, as in ordinary
UNIX. Servers cache data to avoid disk accesses, but this is invisible to the clients.
Clients maintain two caches, one for file attributes (i-nodes) and one for file data.
When either an i-node or a file block is needed, a check is made to see if it can be
satisfied out of the cache. If so, network traffic can be avoided.
While client caching helps performance enormously, it also introduces some
nasty problems. Suppose that two clients are both caching the same file block and
one of them modifies it. When the other one reads the block, it gets the old (stale)
value. The cache is not coherent.
Given the potential severity of this problem, the NFS implementation does sev-
eral things to mitigate it. For one, associated with each cache block is a timer.
When the timer expires, the entry is discarded. Normally, the timer is 3 sec for data
blocks and 30 sec for directory blocks. Doing this reduces the risk somewhat. In
addition, whenever a cached file is opened, a message is sent to the server to find
out when the file was last modified. If the last modification occurred after the local
copy was cached, the cache copy is discarded and the new copy fetched from the
server. Finally, once every 30 sec a cache timer expires, and all the dirty (i.e., mod-
ified) blocks in the cache are sent to the server. While not perfect, these patches
make the system highly usable in most practical circumstances.
NFS Version 4
Version 4 of the Network File System was designed to simplify certain opera-
tions from its predecessor. In contrast to NFSv3, which is described above, NFSv4
is a stateful file system. This permits open operations to be invoked on remote
files, since the remote NFS server will maintain all file-system-related structures,
including the file pointer. Read operations then need not include absolute read
ranges, but can be incrementally applied from the previous file-pointer position.
This results in shorter messages, and also in the ability to bundle multiple NFSv3
operations in one network transaction.
The stateful nature of NFSv4 makes it easy to integrate the variety of NFSv3
protocols described earlier in this section into one coherent protocol. There is no
need to support separate protocols for mounting, caching, locking, or secure opera-
tions. NFSv4 also works better with both Linux (and UNIX in general) and Win-
dows file-system semantics.
Linux, as a clone of MINIX and UNIX, has been a multiuser system almost
from the beginning. This history means that security and control of information
was built in very early on. In the following sections, we will look at some of the
security aspects of Linux.
The user community for a Linux system consists of some number of registered
users, each of whom has a unique UID (User ID). A UID is an integer between 0
and 65,535. Files (but also processes and other resources) are marked with the
SEC. 10.7 SECURITY IN LINUX 799
UID of their owner. By default, the owner of a file is the person who created the
file, although there is a way to change ownership.
Users can be organized into groups, which are also numbered with 16-bit inte-
gers called GIDs (Group IDs). Assigning users to groups is done manually (by
the system administrator) and consists of making entries in a system database tel-
ling which user is in which group. A user could be in one or more groups at the
same time. For simplicity, we will not discuss this feature further.
The basic security mechanism in Linux is simple. Each process carries the UID
and GID of its owner. When a file is created, it gets the UID and GID of the creat-
ing process. The file also gets a set of permissions determined by the creating proc-
ess. These permissions specify what access the owner, the other members of the
owner’s group, and the rest of the users have to the file. For each of these three cat-
egories, potential accesses are read, write, and execute, designated by the letters r,
w, and x, respectively. The ability to execute a file makes sense only if that file is
an executable binary program, of course. An attempt to execute a file that has ex-
ecute permission but which is not executable (i.e., does not start with a valid head-
er) will fail with an error. Since there are three categories of users and 3 bits per
category, 9 bits are sufficient to represent the access rights. Some examples of
these 9-bit numbers and their meanings are given in Fig. 10-37.
The first two entries in Fig. 10-37 allow the owner and the owner’s group full
access, respectively. The next one allows the owner’s group to read the file but not
to change it, and prevents outsiders from any access. The fourth entry is common
for a data file the owner wants to make public. Similarly, the fifth entry is the
usual one for a publicly available program. The sixth entry denies all access to all
users. This mode is sometimes used for dummy files used for mutual exclusion be-
cause an attempt to create such a file will fail if one already exists. Thus if multiple
processes simultaneously attempt to create such a file as a lock, only one of them
will succeed. The last example is strange indeed, since it gives the rest of the world
more access than the owner. However, its existence follows from the protection
rules. Fortunately, there is a way for the owner to subsequently change the protec-
tion mode, even without having any access to the file itself.
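The 9 bits can be pictured as three octal digits, one per category, with each digit built from r = 4, w = 2, and x = 1. As a small illustration (not from the original text), the following C function decodes a mode into the familiar rwx notation:

```c
#include <string.h>

/* Decode the low 9 protection bits of a mode into the familiar
 * "rwxrwxrwx" notation: owner first, then group, then others.
 * Bit 0400 is owner-read, 0200 owner-write, ... down to 0001
 * (execute for others). */
void mode_to_string(unsigned mode, char out[10]) {
    const char *letters = "rwx";
    for (int i = 0; i < 9; i++)
        out[i] = (mode & (0400 >> i)) ? letters[i % 3] : '-';
    out[9] = '\0';
}
```

For example, mode 0755 decodes to rwxr-xr-x, the usual mode for a publicly available program.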
The user with UID 0 is special and is called the superuser (or root). The
superuser has the power to read and write all files in the system, no matter who
owns them and no matter how they are protected. Processes with UID 0 also have
the ability to make a small number of protected system calls denied to ordinary
users. Normally, only the system administrator knows the superuser’s password, al-
though many undergraduates consider it a great sport to try to look for security
flaws in the system so they can log in as the superuser without knowing the pass-
word. Management tends to frown on such activity.
Directories are files and have the same protection modes that ordinary files do
except that the x bits refer to search permission instead of execute permission.
Thus a directory with mode rwxr–xr–x allows its owner to read, modify, and search
the directory, but allows others only to read and search it, not to add or remove
files from it.
Special files corresponding to the I/O devices have the same protection bits as
regular files. This mechanism can be used to limit access to I/O devices. For ex-
ample, the printer special file, /dev/lp, could be owned by the root or by a special
user, daemon, and have mode rw– – – – – – – to keep everyone else from directly
accessing the printer. After all, if everyone could just print at will, chaos would re-
sult.
Of course, having /dev/lp owned by, say, daemon with protection mode
rw– – – – – – – means that nobody else can use the printer. While this would save
many innocent trees from an early death, sometimes users do have a legitimate
need to print something. In fact, there is a more general problem of allowing con-
trolled access to all I/O devices and other system resources.
This problem was solved by adding a new protection bit, the SETUID bit, to
the 9 protection bits discussed above. When a program with the SETUID bit on is
executed, the effective UID for that process becomes the UID of the executable
file’s owner instead of the UID of the user who invoked it. When a process at-
tempts to open a file, it is the effective UID that is checked, not the underlying real
UID. By making the program that accesses the printer be owned by daemon but
with the SETUID bit on, any user could execute it, and have the power of daemon
(e.g., access to /dev/lp) but only to run that program (which might queue print jobs
for printing in an orderly fashion).
Many sensitive Linux programs are owned by the root but with the SETUID
bit on. For example, the program that allows users to change their passwords,
passwd, needs to write in the password file. Making the password file publicly
writable would not be a good idea. Instead, there is a program that is owned by the
root and which has the SETUID bit on. Although the program has complete access
to the password file, it will change only the caller’s password and not permit any
other access to the password file.
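A SETUID program can see both identities: geteuid returns the file owner's UID while getuid still returns the invoking user's UID. The sketch below illustrates the kind of policy check a passwd-like program must make; may_change_entry is a hypothetical helper, not the actual passwd source:

```c
#include <unistd.h>
#include <sys/types.h>

/* geteuid() is what the kernel checks on open; getuid() is who actually
 * ran the program. When a binary is not SETUID, the two are equal. */
int running_setuid(void) {
    return getuid() != geteuid();
}

/* Hypothetical policy check in the style of passwd: a SETUID-root
 * program has full access to the password file, so it must itself
 * verify that the invoking (real) user only touches his own entry. */
int may_change_entry(uid_t real_uid, uid_t entry_uid) {
    return real_uid == entry_uid || real_uid == 0;  /* root may change any */
}
```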
In addition to the SETUID bit there is also a SETGID bit that works analo-
gously, temporarily giving the user the effective GID of the program. In practice,
this bit is rarely used, however.
There are only a small number of system calls relating to security. The most
important ones are listed in Fig. 10-38. The most heavily used security system call
is chmod. It is used to change the protection mode. For example,
s = chmod("/usr/ast/newgame", 0755);
sets newgame to rwxr–xr–x so that everyone can run it (note that 0755 is an octal
constant, which is convenient, since the protection bits come in groups of 3 bits).
Only the owner of a file and the superuser can change its protection bits.
Figure 10-38. Some system calls relating to security. The return code s is −1 if
an error has occurred; uid and gid are the UID and GID, respectively. The param-
eters should be self explanatory.
The access call tests to see if a particular access would be allowed using the
real UID and GID. This system call is needed to avoid security breaches in pro-
grams that are SETUID and owned by the root. Such a program can do anything,
and it is sometimes needed for the program to figure out if the user is allowed to
perform a certain access. The program cannot just try it, because the access will al-
ways succeed. With the access call the program can find out if the access is allow-
ed by the real UID and real GID.
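For instance, a SETUID-root program can ask whether the real user could have opened a file before doing so on his behalf. A minimal sketch (the scratch-file path in the test is an assumption for demonstration):

```c
#include <unistd.h>
#include <fcntl.h>

/* access() answers "could the real UID/GID do this?", which is exactly
 * what a SETUID program needs before acting on the caller's behalf. */
int user_may_read(const char *path) {
    return access(path, R_OK) == 0;
}

/* F_OK merely tests for existence, with no permission implied. */
int file_exists(const char *path) {
    return access(path, F_OK) == 0;
}
```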
The next four system calls return the real and effective UIDs and GIDs. The
last three are allowed only for the superuser. They change a file’s owner, and a
process’ UID and GID.
When a user logs in, the login program, login (which is SETUID root) asks for
a login name and a password. It hashes the password and then looks in the pass-
word file, /etc/passwd, to see if the hash matches the one there (networked systems
work slightly differently). The reason for using hashes is to prevent the password
from being stored in unencrypted form anywhere in the system. If the password is
correct, the login program looks in /etc/passwd to see the name of the user’s pre-
ferred shell, possibly bash, but possibly some other shell such as csh or ksh. The
login program then uses setuid and setgid to give itself the user’s UID and GID
(remember, it started out as SETUID root). Then it opens the keyboard for stan-
dard input (file descriptor 0), the screen for standard output (file descriptor 1), and
the screen for standard error (file descriptor 2). Finally, it executes the preferred
shell, thus terminating itself.
At this point the preferred shell is running with the correct UID and GID and
standard input, output, and error all set to their default devices. All processes that it
forks off (i.e., commands typed by the user) automatically inherit the shell’s UID
and GID, so they also will have the correct owner and group. All files they create
also get these values.
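The tail end of login can be sketched in a few lines of C. This is a simplified illustration, not the real login source: the real program opens the user's terminal rather than an arbitrary descriptor and reads the shell path from /etc/passwd. Note that the order matters: the GID must be dropped before the UID, because after setuid the process no longer has the privilege to call setgid.

```c
#include <unistd.h>
#include <sys/types.h>

/* Drop from SETUID-root identity to the logged-in user's identity. */
int become_user(uid_t uid, gid_t gid) {
    if (setgid(gid) < 0) return -1;   /* drop group identity first */
    if (setuid(uid) < 0) return -1;   /* irreversible for non-root */
    return 0;
}

/* After become_user(), login would arrange the standard descriptors
 * and replace itself with the user's preferred shell:
 *
 *     dup2(tty_fd, 0);  dup2(tty_fd, 1);  dup2(tty_fd, 2);
 *     execl(shell, shell, (char *) 0);    // login terminates here
 */
```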
When any process attempts to open a file, the system first checks the protec-
tion bits in the file’s i-node against the caller’s effective UID and effective GID to
see if the access is permitted. If so, the file is opened and a file descriptor returned.
If not, the file is not opened and −1 is returned. No checks are made on subsequent
read or write calls. As a consequence, if the protection mode changes after a file is
already open, the new mode will not affect processes that already have the file
open.
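This check-at-open behavior is easy to observe: even revoking all permissions on a file does not invalidate a descriptor that is already open. A small demonstration (the scratch-file path in the test is an assumption):

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/* Open a file, then revoke all permissions on it. Reads through the
 * already-open descriptor still succeed, because the protection bits
 * are consulted only at open time. Returns 1 if the late read worked. */
int read_survives_chmod(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    chmod(path, 0);                 /* now nobody may open it anew */
    char buf[8];
    int n = (int) read(fd, buf, sizeof(buf));
    close(fd);
    chmod(path, 0644);              /* restore access for cleanup */
    return n >= 0;
}
```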
The Linux security model and its implementation are essentially the same as in
most other traditional UNIX systems.
10.8 ANDROID
Android is a relatively new operating system designed to run on mobile de-
vices. It is based on the Linux kernel—Android introduces only a few new con-
cepts to the Linux kernel itself, using most of the Linux facilities you are already
familiar with (processes, user IDs, virtual memory, file systems, scheduling, etc.)
in sometimes very different ways than they were originally intended.
In the five years since its introduction, Android has grown to be one of the
most widely used smartphone operating systems. Its popularity has ridden the ex-
plosion of smartphones, and it is freely available for manufacturers of mobile de-
vices to use in their products. It is also an open-source platform, making it cus-
tomizable to a diverse variety of devices. It is popular not only for consumer-
centric devices where its third-party application ecosystem is advantageous (such
as tablets, televisions, game systems, and media players), but is increasingly used
as the embedded OS for dedicated devices that need a graphical user interface
(GUI) such as VOIP phones, smart watches, automotive dashboards, medical de-
vices, and home appliances.
A large amount of the Android operating system is written in a high-level lan-
guage, the Java programming language. The kernel and a number of low-level
libraries are written in C and C++. However, much of the system is written in
Java and, but for some small exceptions, the entire application API is written and
published in Java as well. The parts of Android written in Java tend to follow a
very object-oriented design as encouraged by that language.
Early Development
applications as a single process on a host computer. In fact there are still some
remnants of this old implementation around today, with things like the Applica-
tion.onTerminate() method still in the SDK (Software Development Kit), which
Android programmers use to write applications.
In June 2006, two hardware devices were selected as software-development
targets for planned products. The first, code-named ‘‘Sooner,’’ was based on an
existing smartphone with a QWERTY keyboard and screen without touch input.
The goal of this device was to get an initial product out as soon as possible, by
leveraging existing hardware. The second target device, code-named ‘‘Dream,’’
was designed specifically for Android, to run it as fully envisioned. It included a
large (for that time) touch screen, slide-out QWERTY keyboard, 3G radio (for fast-
er web browsing), accelerometer, GPS and compass (to support Google Maps), etc.
As the software schedule came better into focus, it became clear that the two
hardware schedules did not make sense. By the time it was possible to release
Sooner, that hardware would be well out of date, and the effort put on Sooner was
pushing out the more important Dream device. To address this, it was decided to
drop Sooner as a target device (though development on that hardware continued for
some time until the newer hardware was ready) and focus entirely on Dream.
Android 1.0
The first public availability of the Android platform was a preview SDK re-
leased in November 2007. This consisted of a hardware device emulator running a
full Android device system image and core applications, API documentation, and a
development environment. At this point the core design and implementation were
in place, and in most ways closely resembled the modern Android system architec-
ture we will be discussing. The announcement included video demos of the plat-
form running on top of both the Sooner and Dream hardware.
Early development of Android had been done under a series of quarterly demo
milestones to drive and show continued progress. The SDK release was the first
more formal release for the platform. It required taking all the pieces that had been
put together so far for application development, cleaning them up, documenting
them, and creating a cohesive development environment for third-party developers.
Development now proceeded along two tracks: taking in feedback about the
SDK to further refine and finalize APIs, and finishing and stabilizing the imple-
mentation needed to ship the Dream device. A number of public updates to the
SDK occurred during this time, culminating in a 0.9 release in August 2008 that
contained the nearly final APIs.
The platform itself had been going through rapid development, and in the
spring of 2008 the focus was shifting to stabilization so that Dream could ship.
Android at this point contained a large amount of code that had never been shipped
as a commercial product, all the way from parts of the C library, through the
Dalvik interpreter (which runs the apps), system, and applications.
Android also contained quite a few novel design ideas that had never been
done before, and it was not clear how they would pan out. This all needed to come
together as a stable product, and the team spent a few nail-biting months wonder-
ing if all of this stuff would actually come together and work as intended.
Finally, in August 2008, the software was stable and ready to ship. Builds
went to the factory and started being flashed onto devices. In September Android
1.0 was launched on the Dream device, now called the T-Mobile G1.
Continued Development
A number of key design goals for the Android platform evolved during its de-
velopment:
Android is built on top of the standard Linux kernel, with only a few signifi-
cant extensions to the kernel itself that will be discussed later. Once in user space,
however, its implementation is quite different from a traditional Linux distribution
and uses many of the Linux features you already understand in very different ways.
As in a traditional Linux system, Android’s first user-space process is init,
which is the root of all other processes. The daemons Android’s init process starts
are different, however, focused more on low-level details (managing file systems
and hardware access) rather than higher-level user facilities like scheduling cron
jobs. Android also has an additional layer of processes, those running Dalvik’s
Java language environment, which are responsible for executing all parts of the
system implemented in Java.
Figure 10-39 illustrates the basic process structure of Android. First is the init
process, which spawns a number of low-level daemon processes. One of these is
zygote, which is the root of the higher-level Java language processes.
Figure 10-39. The Android process structure: init, directly above the kernel,
starts the daemons installd, servicemanager, adbd, and zygote; zygote runs Dal-
vik and spawns the Dalvik-based system_server and phone processes.
Android’s init does not run a shell in the traditional way, since a typical
Android device does not have a local console for shell access. Instead, the daemon
process adbd listens for remote connections (such as over USB) that request shell
access, forking shell processes for them as needed.
Since most of Android is written in the Java language, the zygote daemon and
processes it starts are central to the system. The first process zygote always starts
is called system server, which contains all of the core operating system services.
Key parts of this are the power manager, package manager, window manager, and
activity manager.
Other processes will be created from zygote as needed. Some of these are
‘‘persistent’’ processes that are part of the basic operating system, such as the tele-
phony stack in the phone process, which must remain always running. Additional
application processes will be created and stopped as needed while the system is
running.
Applications interact with the operating system through calls to libraries pro-
vided by it, which together compose the Android framework. Some of these li-
braries can perform their work within that process, but many will need to perform
interprocess communication with other processes, often services in the sys-
tem server process.
Figure 10-40 shows the typical design for Android framework APIs that inter-
act with system services, in this case the package manager. The package manager
provides a framework API for applications to call in their local process, here the
PackageManager class. Internally, this class must get a connection to the corres-
ponding service in the system server. To accomplish this, at boot time the sys-
tem server publishes each service under a well-defined name in the service man-
ager, a daemon started by init. The PackageManager in the application process
retrieves a connection from the service manager to its system service using that
same name.
Once the PackageManager has connected with its system service, it can make
calls on it. Most application calls to PackageManager are implemented as
interprocess communication using Android’s Binder IPC mechanism, in this case
making calls to the PackageManagerService implementation in the system server.
The implementation of PackageManagerService arbitrates interactions across all
client applications and maintains state that will be needed by multiple applications.
For the most part, Android includes a stock Linux kernel providing standard
Linux features. Most of the interesting aspects of Android as an operating system
are in how those existing Linux features are used. There are also, however,
several significant extensions to Linux that the Android system relies on.
Wake Locks
executing without an external interrupt such as pressing a power key. While run-
ning, secondary pieces of hardware may be turned on or off as needed, but the
CPU itself and core parts of the hardware must remain in a powered state to handle
incoming network traffic and other such events. Going into the lower-power sleep
state is something that happens relatively rarely: either through the user explicitly
putting the system to sleep, or its going to sleep itself due to a relatively long inter-
val of user inactivity. Coming out of this sleep state requires a hardware interrupt
from an external source, such as pressing a button on a keyboard, at which point
the device will wake up and turn on its screen.
Mobile device users have different expectations. Although the user can turn off
the screen in a way that looks like putting the device to sleep, the traditional sleep
state is not actually desired. While a device’s screen is off, the device still needs to
be able to do work: it needs to be able to receive phone calls, receive and process
data for incoming chat messages, and many other things.
The expectations around turning a mobile device’s screen on and off are also
much more demanding than on a traditional computer. Mobile interaction tends to
be in many short bursts throughout the day: you receive a message and turn on the
device to see it and perhaps send a one-sentence reply, you run into friends walking
their new dog and turn on the device to take a picture of her. In this kind of typical
mobile usage, any delay from pulling the device out until it is ready for use has a
significant negative impact on the user experience.
Given these requirements, one solution would be to just not have the CPU go
to sleep when a device’s screen is turned off, so that it is always ready to turn back
on again. The kernel does, after all, know when there is no work scheduled for any
threads, and Linux (as well as most operating systems) will automatically make the
CPU idle and use less power in this situation.
An idle CPU, however, is not the same thing as true sleep. For example:
1. On many chipsets the idle state uses significantly more power than a
true sleep state.
2. An idle CPU can wake up at any moment if some work happens to
become available, even if that work is not important.
3. Just having the CPU idle does not tell you that you can turn off other
hardware that would not be needed in a true sleep.
Wake locks on Android allow the system to go into a deeper sleep mode, with-
out being tied to an explicit user action like turning the screen off. The default
state of the system with wake locks is that the device is asleep. When the device is
running, to keep it from going back to sleep something needs to be holding a wake
lock.
While the screen is on, the system always holds a wake lock that prevents the
device from going to sleep, so it will stay running, as we expect.
When the screen is off, however, the system itself does not generally hold a
wake lock, so it will stay out of sleep only as long as something else is holding
one. When no more wake locks are held, the system goes to sleep, and it can come
out of sleep only due to a hardware interrupt.
Once the system has gone to sleep, a hardware interrupt will wake it up again,
as in a traditional operating system. Some sources of such an interrupt are time-
based alarms, events from the cellular radio (such as for an incoming call), incom-
ing network traffic, and presses on certain hardware buttons (such as the power
button). Interrupt handlers for these events require one change from standard
Linux: they need to acquire an initial wake lock to keep the system running after it
handles the interrupt.
The wake lock acquired by an interrupt handler must be held long enough to
transfer control up the stack to the driver in the kernel that will continue processing
the event. That kernel driver is then responsible for acquiring its own wake lock,
after which the interrupt wake lock can be safely released without risk of the sys-
tem going back to sleep.
If the driver is then going to deliver this event up to user space, a similar hand-
shake is needed. The driver must ensure that it continues to hold the wake lock un-
til it has delivered the event to a waiting user process and ensured there has been an
opportunity there to acquire its own wake lock. This flow may continue across
subsystems in user space as well; as long as something is holding a wake lock, we
continue performing the desired processing to respond to the event. Once no more
wake locks are held, however, the entire system falls back to sleep and all proc-
essing stops.
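The lifecycle above amounts to a reference count that must never drop to zero while an event is in flight. The sketch below is a toy model in C, not the kernel's actual wakeup-source interface; the function names are invented for illustration:

```c
/* Toy model of wake-lock accounting: the system may enter sleep only
 * when no wake locks are held. Each stage handing off an event acquires
 * its own lock before the previous holder releases, so the count never
 * reaches zero in the middle of processing. */
static int wake_locks_held = 0;

void wake_lock_acquire(void) { wake_locks_held++; }

void wake_lock_release(void) {
    if (wake_locks_held > 0)
        wake_locks_held--;
}

int system_may_sleep(void) { return wake_locks_held == 0; }
```

The hand-off pattern is: interrupt handler acquires; driver acquires its own lock; interrupt lock is released; and so on up into user space, until the last holder releases and the device may sleep.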
Out-Of-Memory Killer
10.8.6 Dalvik
The use of Linux processes and security greatly simplifies the Dalvik environ-
ment, since it is no longer responsible for these critical aspects of system stability
and robustness. Not incidentally, it also allows applications to freely use native
code in their implementation, which is especially important for games, which are
usually built with C++-based engines.
Mixing processes and the Java language like this does introduce some chal-
lenges. Bringing up a fresh Java-language environment can take a second, even on
modern mobile hardware. Recall one of the design goals of Android, to be able to
quickly launch applications, with a target of 200 msec. Requiring that a fresh
Dalvik process be brought up for this new application would be well beyond that
budget. A 200-msec launch is hard to achieve on mobile hardware, even without
needing to initialize a new Java-language environment.
The solution to this problem is the zygote native daemon that we briefly men-
tioned previously. Zygote is responsible for bringing up and initializing Dalvik, to
the point where it is ready to start running system or application code written in
Java. All new Dalvik-based processes (system or application) are forked from
zygote, allowing them to start execution with the environment already ready to go.
It is not just Dalvik that zygote brings up. Zygote also preloads many parts of
the Android framework that are commonly used in the system and applications, as
well as resources and other things that are often needed.
Note that creating a new process from zygote involves a Linux fork, but there is
no exec call. The new process is a replica of the original zygote process, with all
of its preinitialized state already set up and ready to go. Figure 10-41 illustrates
how a new Java application process is related to the original zygote process. After
the fork, the new process has its own separate Dalvik environment, though it is
sharing all of the preloaded and initialized data with zygote through copy-on-write
pages. All that now remains to make the new process ready to run is to give
it the correct identity (UID, etc.), finish any initialization of Dalvik that requires
starting threads, and load the application or system code to be run.
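In Linux terms, the launch is just a fork followed by an identity change, with no exec. The sketch below is a simplified illustration (the function and its notification pipe are invented for exposition; the real zygote also finishes Dalvik initialization in the child):

```c
#include <unistd.h>
#include <sys/types.h>

/* Sketch of zygote spawning an application process. fork() but no
 * exec: the child keeps zygote's preloaded environment via
 * copy-on-write pages, then takes on the application's identity. */
pid_t zygote_spawn(uid_t app_uid, gid_t app_gid, int notify_fd) {
    pid_t pid = fork();
    if (pid == 0) {                  /* child: becomes the application */
        setgid(app_gid);             /* sandbox identity: group first */
        setuid(app_uid);             /* then the per-app UID */
        /* ...here the preloaded application code would start... */
        write(notify_fd, "up", 2);   /* stand-in for "app running" */
        _exit(0);
    }
    return pid;                      /* parent zygote keeps serving */
}
```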
In addition to launch speed, there is another benefit that zygote brings. Because
only a fork is used to create processes from it, the large number of dirty RAM
pages needed to initialize Dalvik and preload classes and resources can be shared
between zygote and all of its child processes. This sharing is especially important
for Android’s environment, where swap is not available; demand paging of clean
pages (such as executable code) from ‘‘disk’’ (flash memory) is available. However,
any dirty pages must stay locked in RAM; they cannot be paged out to ‘‘disk.’’
Figure 10-41. Creating a new Dalvik process from zygote: the forked applica-
tion process shares zygote's preloaded Dalvik environment, framework classes,
and resources through copy-on-write pages, adding its own application classes
and resources on top.
Rather than use existing Linux IPC facilities such as pipes, Binder includes a
special kernel module that implements its own IPC mechanism. The Binder IPC
model is different enough from traditional Linux mechanisms that it cannot be ef-
ficiently implemented on top of them purely in user space. In addition, Android
does not support most of the System V primitives for cross-process interaction
(semaphores, shared memory segments, message queues) because they do not pro-
vide robust semantics for cleaning up their resources from buggy or malicious ap-
plications.
The basic IPC model Binder uses is the RPC (remote procedure call). That
is, the sending process is submitting a complete IPC operation to the kernel, which
is executed in the receiving process; the sender may block while the receiver ex-
ecutes, allowing a result to be returned back from the call. (Senders optionally
may specify they should not block, continuing their execution in parallel with the
receiver.) Binder IPC is thus message based, like System V message queues, rath-
er than stream based as in Linux pipes. A message in Binder is referred to as a
transaction, and at a higher level can be viewed as a function call across proc-
esses.

Figure 10-42. The user-space Binder IPC stack: platform and application meth-
od calls pass through aidl-generated interface definitions to the IBinder/Binder
classes (transact() and onTransact()), which enter the Binder kernel module
through ioctl().
Each transaction that user space submits to the kernel is a complete operation:
it identifies the target of the operation and identity of the sender as well as the
complete data being delivered. The kernel determines the appropriate process to
receive that transaction, delivering it to a waiting thread in the process.
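As a rough illustration, such a transaction can be pictured as a C structure. This layout is invented for exposition and is not the real binder ABI; the point is that the sender's identity is stamped in by the kernel, not supplied by the sender:

```c
#include <stddef.h>
#include <sys/types.h>

/* Illustrative layout of a Binder transaction (not the actual kernel
 * structures): everything the kernel needs is carried in one unit, so
 * a single submission is a complete IPC operation. */
struct transaction {
    unsigned target_handle;   /* which remote object, as a local handle */
    pid_t    sender_pid;      /* filled in by the kernel, not the sender */
    uid_t    sender_uid;      /* ditto: identity cannot be forged */
    unsigned code;            /* which operation is being requested */
    size_t   data_size;       /* length of the marshalled payload */
    char     data[256];       /* marshalled arguments (fixed size here) */
};

/* The kernel's copy step adds the sender's identity to the transaction
 * before delivery, so the receiver can trust who is calling. */
void kernel_stamp(struct transaction *t, pid_t pid, uid_t uid) {
    t->sender_pid = pid;
    t->sender_uid = uid;
}
```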
Figure 10-43 illustrates the basic flow of a transaction. Any thread in the orig-
inating process may create a transaction identifying its target, and submit this to
the kernel. The kernel makes a copy of the transaction, adding to it the identity of
the sender. It determines which process is responsible for the target of the transac-
tion and wakes up a thread in the process to receive it. Once the receiving process
is executing, it determines the appropriate target of the transaction and delivers it.
Figure 10-43. The basic flow of a Binder transaction: a thread in Process 1
builds a transaction (target object, sender identity, and data), submits it to the
kernel, and the kernel copies it to a thread waiting in Process 2's thread pool.
(For the discussion here, we are simplifying the way transaction data
moves through the system as two copies, one to the kernel and one to the receiving
process’s address space. The actual implementation does this in one copy. For
each process that can receive transactions, the kernel creates a shared memory area
with it. When it is handling a transaction, it first determines the process that will
be receiving that transaction and copies the data directly into that shared address
space.)
Note that each process in Fig. 10-43 has a ‘‘thread pool.’’ This is one or more
threads created by user space to handle incoming transactions. The kernel will dis-
patch each incoming transaction to a thread currently waiting for work in that proc-
ess’s thread pool. Calls into the kernel from a sending process however do not
need to come from the thread pool—any thread in the process is free to initiate a
transaction, such as Ta in Fig. 10-43.
We have already seen that transactions given to the kernel identify a target ob-
ject; however, the kernel must determine the receiving process. To accomplish
this, the kernel keeps track of the available objects in each process and maps them
to other processes, as shown in Fig. 10-44. The objects we are looking at here are
simply locations in the address space of that process. The kernel only keeps track
of these object addresses, with no meaning attached to them; they may be the loca-
tion of a C data structure, C++ object, or anything else located in that process’s ad-
dress space.
References to objects in remote processes are identified by an integer handle,
which is much like a Linux file descriptor. For example, consider Object2a in
Process 2—this is known by the kernel to be associated with Process 2, and further
the kernel has assigned Handle 2 for it in Process 1. Process 1 can thus submit a
transaction to the kernel targeted to its Handle 2, and from that the kernel can de-
termine this is being sent to Process 2 and specifically Object2a in that process.
Figure 10-44. The kernel's object and handle tables: each process's Binder ob-
jects (Object1a and Object1b in Process 1; Object2a and Object2b in Process 2)
are tracked by the kernel, which assigns per-process integer handles for refer-
ences to objects in other processes.
Also like file descriptors, the value of a handle in one process does not mean
the same thing as that value in another process. For example, in Fig. 10-44, we can
see that in Process 1, a handle value of 2 identifies Object2a; however, in Process
2, that same handle value of 2 identifies Object1a. Further, it is impossible for one
process to access an object in another process if the kernel has not assigned a hand-
le to it for that process. Again in Fig. 10-44, we can see that Process 2’s Object2b
is known by the kernel, but no handle has been assigned to it for Process 1. There
is thus no path for Process 1 to access that object, even if the kernel has assigned
handles to it for other processes.
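This bookkeeping resembles a per-process file-descriptor table that maps small integers to (process, object) pairs. The toy version below is invented for exposition and is nothing like the real binder driver code, but it captures the two properties just described: only the kernel adds entries, and a handle resolves only within the table it was assigned in:

```c
#include <stddef.h>

/* An entry records which process owns the object and the object's
 * address, which is meaningful only inside that owning process. */
struct handle_entry {
    int   target_pid;
    void *object_addr;
};

#define MAX_HANDLES 16

/* One table per process, maintained by the kernel. */
struct handle_table {
    struct handle_entry entries[MAX_HANDLES];
    int next;                          /* next free handle number */
};

/* Kernel-side: record a remote object, hand back a small integer. */
int assign_handle(struct handle_table *t, int pid, void *obj) {
    if (t->next >= MAX_HANDLES) return -1;
    t->entries[t->next].target_pid = pid;
    t->entries[t->next].object_addr = obj;
    return t->next++;
}

/* Kernel-side: resolve a handle back to its (process, object) pair.
 * A handle that was never assigned in this table does not resolve. */
struct handle_entry *resolve_handle(struct handle_table *t, int handle) {
    if (handle < 0 || handle >= t->next) return NULL;
    return &t->entries[handle];
}
```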
How do these handle-to-object associations get set up in the first place?
Unlike Linux file descriptors, user processes do not directly ask for handles. In-
stead, the kernel assigns handles to processes as needed. This process is illustrated
in Fig. 10-45. Here we are looking at how the reference to Object1b from Process
2 to Process 1 in the previous figure may have come about. The key to this is how
a transaction flows through the system, from left to right at the bottom of the fig-
ure.
The key steps shown in Fig. 10-45 are:
1. Process 1 creates the initial transaction structure, which contains the
local address Object1b.
2. Process 1 submits the transaction to the kernel.
3. The kernel looks at the data in the transaction, finds the address Ob-
ject1b, and creates a new entry for it since it did not previously know
about this address.
Figure 10-45. How a new handle comes about: as a transaction carrying the ad-
dress of Object1b travels from Process 1 through the kernel to Process 2, the
kernel records Object1b in its tables and assigns a handle for it in Process 2.
Most user-space code does not directly interact with the Binder kernel module.
Instead, there is a user-space object-oriented library that provides a simpler API.
The first level of these user-space APIs maps fairly directly to the kernel concepts
we have covered so far, in the form of three classes:
1. IBinder is an abstract interface for a Binder object. Its key method is
transact, which submits a transaction to the object. The imple-
mentation receiving the transaction may be an object either in the
local process or in another process; if it is in another process, this will
be delivered to it through the Binder kernel module as previously dis-
cussed.
2. Binder is a concrete Binder object. Implementing a Binder subclass
gives you a class that can be called by other processes. Its key meth-
od is onTransact, which receives a transaction that was sent to it. The
main responsibility of a Binder subclass is to look at the transaction
data it receives here and perform the appropriate operation.
3. Parcel is a container for reading and writing data that is in a Binder
transaction. It has methods for reading and writing typed data—inte-
gers, strings, arrays—but most importantly it can read and write refer-
ences to any IBinder object, using the appropriate data structure for
the kernel to understand and transport that reference across processes.
Figure 10-46 depicts how these classes work together, modifying Fig. 10-44
that we previously looked at with the user-space classes that are used. Here we see
that Binder1b and Binder2a are instances of concrete Binder subclasses. To per-
form an IPC, a process now creates a Parcel containing the desired data, and sends
it through another class we have not yet seen, BinderProxy. This class is created
whenever a new handle appears in a process, thus providing an implementation of
IBinder whose transact method creates the appropriate transaction for the call and
submits it to the kernel.
The kernel transaction structure we had previously looked at is thus split apart
in the user-space APIs: the target is represented by a BinderProxy and its data is
held in a Parcel. The transaction flows through the kernel as we previously saw
and, upon appearing in user space in the receiving process, its target is used to de-
termine the appropriate receiving Binder object while a Parcel is constructed from
its data and delivered to that object’s onTransact method.
These three classes now make it fairly easy to write IPC code:
1. Subclass from Binder.
2. Implement onTransact to decode and execute incoming calls.
3. Implement corresponding code to create a Parcel that can be passed
to that object’s transact method.
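These three steps can be sketched with simplified stand-ins for the real Parcel and Binder classes. The marshalling format and the PrintService class below are invented for illustration, and a real transaction crosses a process boundary through the kernel rather than via a direct method call.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified, illustrative stand-ins for Android's Parcel and Binder classes
// (not the real API): marshal arguments into a Parcel, submit it via transact,
// and decode it in onTransact.
class Parcel {
    private final Deque<Object> data = new ArrayDeque<>();
    void writeString(String s) { data.addLast(s); }
    String readString()       { return (String) data.removeFirst(); }
}

abstract class Binder {
    // A subclass decodes the transaction data and performs the operation.
    abstract void onTransact(int code, Parcel in, Parcel reply);
    // In-process stand-in for submitting a transaction through the kernel.
    final void transact(int code, Parcel in, Parcel reply) { onTransact(code, in, reply); }
}

public class PrintService extends Binder {
    static final int PRINT = 1;
    final StringBuilder log = new StringBuilder();

    @Override
    void onTransact(int code, Parcel in, Parcel reply) {
        if (code == PRINT) {
            log.append(in.readString());      // unmarshal the argument
            reply.writeString("ok");          // marshal the result
        }
    }

    // Caller side: marshal the argument, submit the transaction, read the reply.
    public static String callPrint(PrintService target, String msg) {
        Parcel in = new Parcel(), reply = new Parcel();
        in.writeString(msg);
        target.transact(PRINT, in, reply);
        return reply.readString();
    }
}
```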
Figure 10-46. Fig. 10-44 redrawn with the user-space classes: Binder1b and
Binder2a are concrete Binder objects, each remote handle is wrapped by a
BinderProxy, and a transaction's data travels in a Parcel from transact() on the
sending side to onTransact() on the receiving side.
The bulk of this work is in the last two steps. This is the unmarshalling and
marshalling code that is needed to turn how we’d prefer to program—using sim-
ple method calls—into the operations that are needed to execute an IPC. This is
boring and error-prone code to write, so we’d like to let the computer take care of
that for us.
The final piece of Binder IPC is the one that is most often used, a high-level in-
terface-based programming model. Instead of dealing with Binder objects and
Parcel data, here we get to think in terms of interfaces and methods.
The main piece of this layer is a command-line tool called AIDL (for Android
Interface Definition Language). This tool is an interface compiler, taking an ab-
stract description of an interface and generating from it the source code necessary
to define that interface and implement the appropriate marshalling and unmar-
shalling code needed to make remote calls with it.
Figure 10-47 shows a simple example of an interface defined in AIDL. This
interface is called IExample and contains a single method, print, which takes a sin-
gle String argument.
package [Link]
interface IExample {
void print(String msg);
}
Figure 10-48. The user-space classes generated by AIDL for the IExample interface.
With these classes in place, there is no longer any need to worry about the
mechanics of an IPC. Implementors of the IExample interface simply derive from
[Link] and implement the interface methods as they normally would. Cal-
lers will receive an IExample interface that is implemented by [Link], al-
lowing them to make regular calls on the interface.
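A hand-written approximation of what the AIDL compiler might generate for IExample is sketched below. The Stub/Proxy split mirrors the shape of generated code, but the transaction encoding and class names here are simplified inventions that stay within one process.

```java
// Hypothetical hand-written equivalent of AIDL-generated code for IExample.
// The real generated Stub/Proxy marshal through Parcel and the kernel; this
// sketch uses a plain string as the "transaction" for illustration.
interface IExample {
    void print(String msg);
}

// Server side: decode the transaction code and dispatch to the real method.
abstract class ExampleStub implements IExample {
    static final int TRANSACTION_print = 1;
    void onTransact(int code, String data) {
        if (code == TRANSACTION_print) print(data);   // unmarshal + dispatch
    }
}

// Client side: marshal the call into a transaction against the remote object.
class ExampleProxy implements IExample {
    private final ExampleStub remote;  // stands in for a BinderProxy + handle
    ExampleProxy(ExampleStub remote) { this.remote = remote; }
    public void print(String msg) {
        remote.onTransact(ExampleStub.TRANSACTION_print, msg);
    }
}

// The implementor simply derives from the stub and writes ordinary methods.
public class ExampleImpl extends ExampleStub {
    public final StringBuilder printed = new StringBuilder();
    public void print(String msg) { printed.append(msg); }
}
```

A caller holding an IExample reference cannot tell whether it is talking to the implementation directly or through the proxy, which is exactly the point of the interface-based layer.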
The way these pieces work together to perform a complete IPC operation is
shown in Fig. 10-49. A simple print call on an IExample interface turns into:
Figure 10-49. Flow of a print("hello") call on an IExample interface: the caller's
proxy marshals it into transact({print hello}), a BinderProxy hands it to the
kernel's binder module with ioctl(), and in the receiving process the Binder object's
onTransact({print hello}) dispatches to ExampleImpl's print("hello").
The bulk of Android’s IPC is written using this mechanism. Most services in
Android are defined through AIDL and implemented as shown here. Recall the
previous Fig. 10-40 showing how the implementation of the package manager in
the system server process uses IPC to publish itself with the service manager for
other processes to make calls to it. Two AIDL interfaces are involved here: one for
the service manager and one for the package manager. For example, Fig. 10-50
shows the basic AIDL description for the service manager; it contains the getSer-
vice method, which other processes use to retrieve the IBinder of system service
interfaces like the package manager.
Android provides an application model that is very different from the normal
command-line environment in the Linux shell or even applications launched from a
graphical user interface. An application is not an executable file with a main entry
point; it is a container of everything that makes up that app: its code, graphical re-
sources, declarations about what it is to the system, and other data.
SEC. 10.8 ANDROID 825
package [Link]
interface IServiceManager {
IBinder getService(String name);
void addService(String name, IBinder binder);
}
1. A manifest describing what the application is, what it does, and how
to run it. The manifest must provide a package name for the applica-
tion, a Java-style scoped string (such as [Link]),
which uniquely identifies it.
2. Resources needed by the application, including strings it displays to
the user, XML data for layouts and other descriptions, graphical bit-
maps, etc.
3. The code itself, which may be Dalvik bytecode as well as native li-
brary code.
4. Signing information, securely identifying the author.
The key part of the application for our purposes here is its manifest, which ap-
pears as a precompiled XML file named [Link] in the root of the
apk’s zip namespace. A complete example manifest declaration for a hypothetical
email application is shown in Fig. 10-51: it allows you to view and compose emails
and also includes components needed for synchronizing its local email storage
with a server even when the user is not currently in the application.
Android applications do not have a simple main entry point which is executed
when the user launches them. Instead, they publish under the manifest’s <applica-
tion> tag a variety of entry points describing the various things the application can
do. These entry points are expressed as four distinct types, defining the core types
of behavior that applications can provide: activity, receiver, service, and content
provider. The example we have presented shows a few activities and one declara-
tion of the other component types, but an application may declare zero or more of
any of these.
Each of the different four component types an application can contain has dif-
ferent semantics and uses within the system. In all cases, the android:name attrib-
ute supplies the Java class name of the application code implementing that compo-
nent, which will be instantiated by the system when needed.
<activity android:name="[Link]">
<intent-filter>
<action android:name="[Link]" />
<category android:name="[Link] [Link]" />
</intent-filter>
</activity>
<activity android:name="[Link]">
<intent-filter>
<action android:name="[Link]" />
<category android:name="[Link] [Link]" />
<data android:mimeType="*/*" />
</intent-filter>
</activity>
<receiver android:name="[Link]">
<intent-filter>
<action android:name="[Link]_STORAGE_LOW" />
</intent-filter>
<intent-filter>
<action android:name="[Link]_STORAGE_OKAY" />
</intent-filter>
</receiver>
<provider android:name="[Link]"
android:authorities="[Link]">
</provider>
</application>
</manifest>
The package manager is the part of Android that keeps track of all application
packages. It parses every application’s manifest, collecting and indexing the infor-
mation it finds in them. With that information, it then provides facilities for clients
to query it about the currently installed applications and retrieve relevant infor-
mation about them. It is also responsible for installing applications (creating stor-
age space for the application and ensuring the integrity of the apk) as well as
everything needed to uninstall (cleaning up everything associated with a previously
installed app).
Activities
An activity is a part of the application that interacts directly with the user
through a user interface. When the user launches an application on their device,
this is actually an activity inside the application that has been designated as such a
main entry point. The application implements code in its activity that is responsi-
ble for interacting with the user.
The example email manifest shown in Fig. 10-51 contains two activities. The
first is the main mail user interface, allowing users to view their messages; the sec-
ond is a separate interface for composing a new message. The first mail activity is
declared as the main entry point for the application, that is, the activity that will be
started when the user launches it from the home screen.
Since the first activity is the main activity, it will be shown to users as an appli-
cation they can launch from the main application launcher. If they do so, the sys-
tem will be in the state shown in Fig. 10-52. Here the activity manager, on the left
side, has made an internal ActivityRecord instance in its process to keep track of
the activity. One or more of these activities are organized into containers called
tasks, which roughly correspond to what the user experiences as an application. At
this point the activity manager has started the email application’s process and an
instance of its MailMainActivity for displaying its main UI, which is associated
with the appropriate ActivityRecord. This activity is in a state called resumed since
it is now in the foreground of the user interface.
Figure 10-52. Starting the email application's main activity: the activity manager
in the system_server process tracks the resumed MailMainActivity in the Email
task with an ActivityRecord, linked to the MailMainActivity instance in the email
app process.
If the user were now to switch away from the email application (not exiting it)
and launch a camera application to take a picture, we would be in the state shown
in Fig. 10-53. Note that we now have a new camera process running the camera’s
main activity, an associated ActivityRecord for it in the activity manager, and it is
now the resumed activity. Something interesting also happens to the previous
email activity: instead of being resumed, it is now stopped and the ActivityRecord
holds this activity’s saved state.
Figure 10-53. After switching to the camera: CameraMainActivity is resumed in
its own task, while the Email task's MailMainActivity is stopped, with its saved
state held in its ActivityRecord.
When an activity is no longer in the foreground, the system asks it to ‘‘save its
state.’’ This involves the application creating a minimal amount of state infor-
mation representing what the user currently sees, which it returns to the activity
manager to be stored in the system server process, in the ActivityRecord associated
with that activity. The saved state for an activity is generally small, containing for
example where you are scrolled in an email message, but not the message itself,
which will be stored elsewhere by the application in its persistent storage.
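A sketch of this idea, with invented names rather than the real Activity/Bundle API: the saved state records only a key into persistent storage plus some UI state, and restoring reloads the heavy data from storage using that key.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not the real Activity API): an activity saves only a
// minimal description of what the user sees -- which message is open and the
// scroll position -- while the message body stays in persistent storage.
public class MailActivityState {
    public static Map<String, Object> saveState(long messageId, int scrollY) {
        Map<String, Object> state = new HashMap<>();
        state.put("messageId", messageId);   // key into persistent storage
        state.put("scrollY", scrollY);       // user-interface state only
        return state;                        // kept by the activity manager
    }

    // Restoring re-reads the heavy data from storage using the saved key.
    public static String restore(Map<String, Object> state, Map<Long, String> mailStore) {
        long id = (Long) state.get("messageId");
        return mailStore.get(id);            // reload the message body itself
    }
}
```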
Recall that although Android does demand paging (it can page in and out clean
RAM that has been mapped from files on disk, such as code), it does not rely on
swap space. This means all dirty RAM pages in an application’s process must stay
in RAM. Having the email’s main activity state safely stored away in the activity
manager gives the system back some of the flexibility in dealing with memory that
swap provides.
For example, if the camera application starts to require a lot of RAM, the sys-
tem can simply get rid of the email process, as shown in Fig. 10-54. The Activi-
tyRecord, with its precious saved state, remains safely tucked away by the activity
manager in the system server process. Since the system server process hosts all of
Android’s core system services, it must always remain running, so the state saved
here will remain around for as long as we might need it.
Figure 10-54. Removing the email process to reclaim RAM for the camera.
Our example email application not only has an activity for its main UI, but in-
cludes another ComposeActivity. Applications can declare any number of activities
they want. This can help organize the implementation of an application, but more
importantly it can be used to implement cross-application interactions. For ex-
ample, this is the basis of Android’s cross-application sharing system, which the
ComposeActivity here is participating in. If the user, while in the camera applica-
tion, decides she wants to share a picture she took, our email application’s Com-
poseActivity is one of the sharing options she has. If it is selected, that activity will
be started and given the picture to be shared. (Later we will see how the camera
application is able to find the email application’s ComposeActivity.)
Performing that share option while in the activity state seen in Fig. 10-54 will
lead to the new state in Fig. 10-55. There are a number of important things to note:
1. The email app’s process must be started again, to run its ComposeAc-
tivity.
2. However, the old MailMainActivity is not started at this point, since it
is not needed. This reduces RAM use.
3. The camera’s task now has two records: the original CameraMainAc-
tivity we had just been in, and the new ComposeActivity that is now
displayed. To the user, these are still one cohesive task: it is the cam-
era currently interacting with them to email a picture.
4. The new ComposeActivity is at the top, so it is resumed; the previous
CameraMainActivity is no longer at the top, so its state has been
saved. We can at this point safely quit its process if its RAM is need-
ed elsewhere.
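The back-stack behavior in points 3 and 4 can be sketched with a hypothetical Task class (not the real implementation): only the top entry of a task is resumed, and everything below it counts as stopped with saved state.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a task's back stack of activity records: the top
// activity is resumed; anything below it has had its state saved.
public class Task {
    private final Deque<String> stack = new ArrayDeque<>();

    // Starting an activity in this task pushes it on top of the stack.
    public void start(String activity) { stack.push(activity); }

    public String resumed() { return stack.peek(); }

    public boolean isStopped(String activity) {
        return stack.contains(activity) && !activity.equals(stack.peek());
    }
}
```

Sharing a picture pushes the email application's ComposeActivity onto the camera's task, so the two applications' activities are interleaved in one stack that the user experiences as a single cohesive task.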
Figure 10-55. Sharing a picture: the Camera task now holds ComposeActivity
(resumed) above CameraMainActivity (stopped, with saved state), while the
Email task's MailMainActivity remains stopped.
Finally, let us look at what would happen if the user left the camera task while in this
last state (that is, composing an email to share a picture) and returned to the email
application. Figure 10-56 shows the new state the system will be in. Note that we
have brought the email task with its main activity back to the foreground. This
makes MailMainActivity the foreground activity, but there is currently no instance
of it running in the application’s process.
Figure 10-56. Returning to the email application: the Email task's MailMainActivity
is resumed in the email app process, while the Camera task's ComposeActivity
and CameraMainActivity are stopped with their saved state.
To return to the previous activity, the system makes a new instance, handing it
back the previously saved state the old instance had provided. This action of
restoring an activity from its saved state must be able to bring the activity back to
the same visual state as the user last left it. To accomplish this, the application will
look in its saved state for the message the user was in, load that message’s data
from its persistent storage, and then apply any scroll position or other user-inter-
face state that had been saved.
Services
The example email manifest shown in Fig. 10-51 contains a service that is used
to perform synchronization of the user’s mailbox. A common implementation
would schedule the service to run at a regular interval, such as every 15 minutes,
starting the service when it is time to run, and stopping itself when done.
This is a typical use of the first style of service, a long-running background op-
eration. Figure 10-57 shows the state of the system in this case, which is quite
simple. The activity manager has created a ServiceRecord to keep track of the ser-
vice, noting that it has been started, and thus created its SyncService instance in the
application’s process. While in this state the service is fully active (barring the en-
tire system going to sleep if not holding a wake lock) and free to do what it wants.
It is possible for the application’s process to go away while in this state, such as if
the process crashes, but the activity manager will continue to maintain its Ser-
viceRecord and can at that point decide to restart the service if desired.
Figure 10-57. A started service: the activity manager's ServiceRecord tracks the
SyncService instance running in the application's process.
To see how one can use a service as a connection point for interaction with
other applications, let us say that we want to extend our existing SyncService to
have an API that allows other applications to control its sync interval. We will
need to define an AIDL interface for this API, like the one shown in Fig. 10-58.
package [Link]
interface ISyncControl {
int getSyncInterval();
void setSyncInterval(int seconds);
}
To use this, another process can bind to our application service, getting access
to its interface. This creates a connection between the two applications, shown in
Fig. 10-59. The steps of this process are:
1. The client application tells the activity manager that it would like to
bind to the service.
2. If the service is not already created, the activity manager creates it in
the service application’s process.
3. The service returns the IBinder for its interface back to the activity
manager, which now holds that IBinder in its ServiceRecord.
4. Now that the activity manager has the service IBinder, it can be sent
back to the original client application.
5. The client application now having the service’s IBinder may proceed
to make any direct calls it would like on its interface.
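A minimal sketch of steps 2 through 4, with invented names: the activity manager creates the service only on the first bind, caches its IBinder (here just an object token) in the ServiceRecord, and hands the same token to every subsequent client.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the bind sequence in Fig. 10-59 (names invented):
// the activity manager creates the service on first bind, caches its IBinder
// in the ServiceRecord, and returns it to each client.
public class ActivityManager {
    // ServiceRecord: service name -> its IBinder (modeled as an Object token)
    private final Map<String, Object> serviceRecords = new HashMap<>();
    private int creations = 0;

    public Object bindService(String name) {
        Object binder = serviceRecords.get(name);
        if (binder == null) {
            creations++;                     // step 2: create the service
            binder = new Object();           // step 3: service returns its IBinder
            serviceRecords.put(name, binder);
        }
        return binder;                       // step 4: send the IBinder to the client
    }

    public int creations() { return creations; }
}
```

Step 5 then happens entirely between the client and the service: once the client holds the IBinder, the activity manager is out of the loop.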
Figure 10-59. Binding to a service: (1) the client asks to bind, (2) the activity
manager creates the service, (3) the service returns its IBinder, which is held in
the ServiceRecord, (4) the IBinder is sent to the client, and (5) the client calls
the service directly.
Receivers
When a broadcast is sent, the activity manager asks the package manager
for a list of all receivers interested in the event, which is placed in a Broadcast-
Record representing that broadcast. The activity manager will then proceed to step
through each entry in the list, having each associated application’s process create
and execute the appropriate receiver class.
Figure 10-60. Delivering a DEVICE_STORAGE_LOW broadcast: the activity
manager's BroadcastRecord steps through the interested receivers, the
SyncControlReceivers in the calendar and email app processes and the email
app's CleanupReceiver.
Receivers only run as one-shot operations. When an event happens, the system
finds any receivers interested in it, delivers that event to them, and once they have
consumed the event they are done. There is no ReceiverRecord like those we have
seen for other application components, because a particular receiver is only a tran-
sient entity for the duration of a single broadcast. Each time a new broadcast is
sent to a receiver component, a new instance of that receiver’s class is created.
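This one-shot behavior can be sketched as follows (the dispatcher and receiver names here are invented): each delivery constructs a fresh receiver instance from a factory, and the instance is discarded afterward.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Sketch of one-shot receiver dispatch (invented names): every broadcast
// instantiates a fresh receiver object, which is discarded after delivery.
public class BroadcastDispatcher {
    public static int instancesCreated = 0;

    interface Receiver { void onReceive(String event); }

    static class SyncControlReceiver implements Receiver {
        SyncControlReceiver() { instancesCreated++; }
        public void onReceive(String event) { /* adjust sync policy */ }
    }

    // Registrations are factories, not instances: no receiver object outlives
    // a single broadcast.
    private final List<Supplier<Receiver>> interested = new ArrayList<>();

    public void register(Supplier<Receiver> factory) { interested.add(factory); }

    public void broadcast(String event) {
        for (Supplier<Receiver> f : interested) f.get().onReceive(event);
    }
}
```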
Content Providers
Our last application component, the content provider, is the primary mechan-
ism that applications use to exchange data with each other. All interactions with a
content provider are through URIs using a content: scheme; the authority of the
URI is used to find the correct content-provider implementation to interact with.
For example, in our email application from Fig. 10-51, the content provider
specifies that its authority is [Link]. Thus URIs operat-
ing on this content provider would start with
content://[Link]/
The suffix to that URI is interpreted by the provider itself to determine which data
within it is being accessed. In the example here, a common convention would be
that the URI
content://[Link]/messages
means the list of all email messages, while
content://[Link]/messages/1
provides access to a single message at key number 1.
To interact with a content provider, applications always go through a system
API called ContentResolver, where most methods have an initial URI argument
indicating the data to operate on. One of the most often used ContentResolver
methods is query, which performs a database query on a given URI and returns a
Cursor for retrieving the structured results. For example, retrieving a summary of
all of the available email messages would look something like:
query("content://[Link]/messages")
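How a content: URI might be split into its authority and path can be sketched as below. The parsing is simplified relative to the real android.net.Uri class, and the authority com.example.email.provider in the test is hypothetical.

```java
// Sketch of splitting a content: URI into the authority (which selects the
// content-provider implementation) and the path (which the provider itself
// interprets). Simplified relative to the real android.net.Uri class.
public class ContentUri {
    public final String authority;
    public final String path;

    private ContentUri(String authority, String path) {
        this.authority = authority;
        this.path = path;
    }

    public static ContentUri parse(String uri) {
        if (!uri.startsWith("content://"))
            throw new IllegalArgumentException("not a content: URI");
        String rest = uri.substring("content://".length());
        int slash = rest.indexOf('/');
        if (slash < 0) return new ContentUri(rest, "");
        return new ContentUri(rest.substring(0, slash), rest.substring(slash + 1));
    }
}
```

The system routes on the authority alone; everything after it is opaque to everyone but the provider.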
Though this does not look like it to applications, what is actually going on
when they use content providers has many similarities to binding to services. Fig-
ure 10-61 illustrates how the system handles our query example:
Content providers are one of the key mechanisms for performing interactions
across applications. For example, if we return to the cross-application sharing sys-
tem previously described in Fig. 10-55, content providers are the way data is ac-
tually transferred. The full flow for this operation is:
1. A share request that includes the URI of the data to be shared is creat-
ed and is submitted to the system.
2. The system asks the ContentResolver for the MIME type of the data
behind that URI; this works much like the query method we just dis-
cussed, but asks the content provider to return a MIME-type string for
the URI.
Figure 10-61. Interacting with the EmailProvider content provider: (1) the
application calls query() on ContentResolver, (2) the authority is looked up,
(3) the EmailProvider and its ProviderRecord are created, (4) and (5) the
provider's IBinder is returned to the application, and (6) the query() then runs
directly against the provider.
3. The system finds all activities that can receive data of the identified
MIME type.
4. A user interface is shown for the user to select one of the possible re-
cipients.
5. When one of these activities is selected, the system launches it.
6. The share-handling activity receives the URI of the data to be shared,
retrieves its data through ContentResolver, and performs its ap-
propriate operation: creates an email, stores it, etc.
10.8.9 Intents
A detail that we have not yet discussed in the application manifest shown in
Fig. 10-51 is the <intent-filter> tags included with the activities and receiver decla-
rations. This is part of the intent feature in Android, which is the cornerstone for
how different applications identify each other in order to be able to interact and
work together.
An intent is the mechanism Android uses to discover and identify activities,
receivers, and services. It is similar in some ways to the Linux shell’s search path,
which the shell uses to look through multiple possible directories in order to find
an executable matching command names given to it.
There are two major types of intents: explicit and implicit. An explicit intent
is one that directly identifies a single specific application component; in Linux
shell terms it is the equivalent to supplying an absolute path to a command. The
most important part of such an intent is a pair of strings naming the component: the
package name of the target application and class name of the component within
that application. Referring back to the activity of Fig. 10-52 in the application of
Fig. 10-51, an explicit intent for this component would be one with package name
[Link] and class name [Link].
The package and class name of an explicit intent are enough information to
uniquely identify a target component, such as the main email activity in Fig. 10-52.
From the package name, the package manager can return everything needed about
the application, such as where to find its code. From the class name, we know
which part of that code to execute.
An implicit intent is one that describes characteristics of the desired compo-
nent, but not the component itself; in Linux shell terms this is the equivalent to
supplying a single command name to the shell, which it uses with its search path to
find a concrete command to be run. This process of finding the component match-
ing an implicit intent is called intent resolution.
Android’s general sharing facility, as we previously saw in Fig. 10-55’s illus-
tration of sharing a photo the user took from the camera through the email applica-
tion, is a good example of implicit intents. Here the camera application builds an
intent describing the action to be done, and the system finds all activities that can
potentially perform that action. A share is requested through the intent action
[Link], and we can see in Fig. 10-51 that the email applica-
tion’s compose activity declares that it can perform this action.
There can be three outcomes to an intent resolution: (1) no match is found, (2)
a single unique match is found, or (3) there are multiple activities that can handle
the intent. An empty match will result in either an empty result or an exception,
depending on the expectations of the caller at that point. If the match is unique,
then the system can immediately proceed to launching the now explicit intent. If
the match is not unique, we need to somehow resolve it in another way to a single
result.
If the intent resolves to multiple possible activities, we cannot just launch all of
them; we need to pick a single one to be launched. This is accomplished through a
trick in the package manager. If the package manager is asked to resolve an intent
down to a single activity, but it finds there are multiple matches, it instead resolves
the intent to a special activity built into the system called the ResolverActivity.
This activity, when launched, simply takes the original intent, asks the package
manager for a list of all matching activities, and displays these for the user to select
a single desired action. When one is selected, it creates a new explicit intent from
the original intent and the selected activity, calling the system to have that new
activity started.
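The three resolution outcomes can be sketched with a toy resolver. It matches only on the action string; real intent filters also match categories and data types, and the class names here are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of implicit intent resolution (simplified): match an action string
// against declared filters; no match is an empty result, a unique match is
// launched directly, and multiple matches are diverted to the chooser
// (the ResolverActivity).
public class IntentResolver {
    static class Filter {
        final String component, action;
        Filter(String component, String action) {
            this.component = component;
            this.action = action;
        }
    }

    private final List<Filter> filters = new ArrayList<>();

    public void declare(String component, String action) {
        filters.add(new Filter(component, action));
    }

    public String resolve(String action) {
        List<String> matches = new ArrayList<>();
        for (Filter f : filters)
            if (f.action.equals(action)) matches.add(f.component);
        if (matches.isEmpty()) return null;               // no match
        if (matches.size() == 1) return matches.get(0);   // unique: launch it
        return "ResolverActivity";                        // ambiguous: user picks
    }
}
```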
Android has another similarity with the Linux shell: Android’s graphical shell,
the launcher, runs in user space like any other application. An Android launcher
performs calls on the package manager to find the available activities and launch
them when selected by the user.
10.8.11 Security
Every IPC in Android can be attributed to the UID of the
caller. Binder IPC explicitly includes this information in every transaction deliv-
ered across processes so a recipient of the IPC can easily ask for the UID of the
caller.
Android predefines a number of standard UIDs for the lower-level parts of the
system, but most applications are dynamically assigned a UID, at first boot or in-
stall time, from a range of ‘‘application UIDs.’’ Figure 10-62 illustrates some com-
mon mappings of UID values to their meanings. UIDs below 10000 are fixed
assignments within the system for dedicated hardware or other specific parts of the
implementation; some typical values in this range are shown here. In the range
10000–19999 are UIDs dynamically assigned to applications by the package man-
ager when it installs them; this means at most 10,000 applications can be installed
on the system. Also note the range starting at 100000, which is used to implement
a traditional multiuser model for Android: an application that is granted UID
10002 as its identity would be identified as 110002 when running as a second user.
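The UID arithmetic described here is easy to check; the class and method names below are mine, not Android's.

```java
// Sketch of the UID scheme described in the text: application UIDs are
// assigned from 10000-19999, and for secondary users the effective UID is
// 100000 * userId + appUid, so app UID 10002 for user 1 becomes 110002.
public class AndroidUid {
    public static final int FIRST_APP_UID = 10000;
    public static final int LAST_APP_UID = 19999;
    public static final int USER_RANGE = 100000;

    public static int uidForUser(int userId, int appUid) {
        return USER_RANGE * userId + appUid;
    }

    // True if the UID falls in the application range for any user.
    public static boolean isAppUid(int uid) {
        int base = uid % USER_RANGE;
        return base >= FIRST_APP_UID && base <= LAST_APP_UID;
    }
}
```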
UID Purpose
0 Root
1000 Core system (system server process)
1001 Telephony services
1013 Low-level media processes
2000 Command line shell access
10000–19999 Dynamically assigned application UIDs
100000 Start of secondary users
Because the camera application's files are owned by its own UID, other applications cannot
access that data, which is what we want, since the pictures there may be sensitive
data to the user.
After the user has taken a picture, she may want to email it to a friend. Email
is a separate application, in its own sandbox, with no access to the pictures in the
camera application. How can the email application get access to the pictures in the
camera application’s sandbox?
The best-known form of access control in Android is application permissions.
Permissions are specific well-defined abilities that can be granted to an application
at install time. The application lists the permissions it needs in its manifest, and
prior to installing the application the user is informed of what it will be allowed to
do based on them.
Figure 10-63 shows how our email application could make use of permissions
to access pictures in the camera application. In this case, the camera application
has associated the READ_PICTURES permission with its pictures, saying that any
application holding that permission can access its picture data. The email applica-
tion declares in its manifest that it requires this permission. The email application
can now access a URI owned by the camera, such as content://pics/1; upon receiv-
ing the request for this URI, the camera app’s content provider asks the package
manager whether the caller holds the necessary permission. If it does, the call suc-
ceeds and appropriate data is returned to the application.
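The check described above can be sketched as follows. The permission name READ_PICTURES comes from this example, while the class shape and its query method are invented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of the permission check in Fig. 10-63 (names invented): the content
// provider asks the package manager whether the calling package was granted
// the permission that protects the data.
public class PackageManager {
    // package name -> set of granted permissions, recorded at install time
    private final Map<String, Set<String>> granted = new HashMap<>();

    public void install(String pkg, Set<String> permissions) {
        granted.put(pkg, permissions);
    }

    public boolean checkPermission(String pkg, String permission) {
        return granted.getOrDefault(pkg, Set.of()).contains(permission);
    }

    // What the camera's content provider would do when serving content://pics/1.
    public String query(String callerPkg, String uri) {
        if (!checkPermission(callerPkg, "READ_PICTURES"))
            throw new SecurityException("permission denied");
        return "picture data for " + uri;
    }
}
```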
Figure 10-63. Accessing the camera's pictures through a permission: the package
manager in the system_server process records the permissions granted to the
email app process (such as INTERNET), which are consulted when
ComposeActivity requests the data.
Permissions are not tied to content providers; any IPC into the system may be
protected by a permission through the system’s asking the package manager if the
caller holds the required permission. Recall that application sandboxing is based
on the UID assigned to each application.
Figure 10-64. A browser app process whose granted permissions (such as
INTERNET) gate what operations it may perform.
Permissions alone, however, are a poor fit for interactions like sharing. The email
application's compose activity
receives a URI of the data to share, but does not know where it came from—in the
figure here it comes from the camera, but any other application could use this to let
the user email its data, from audio files to word-processing documents. The email
application only needs to read that URI as a byte stream to add it as an attachment.
However, with permissions it would also have to specify up-front the permissions
for all of the data of all of the applications it may be asked to send an email from.
We have two problems to solve. First, we do not want to give applications ac-
cess to wide swaths of data that they do not really need. Second, they need to be
given access to any data sources, even ones they do not have a priori knowledge
about.
There is an important observation to make: the act of emailing a picture is ac-
tually a user interaction where the user has expressed a clear intent to use a specific
picture with a specific application. As long as the operating system is involved in
the interaction, it can use this to identify a specific hole to open in the sandboxes
between the two applications, allowing that data through.
Android supports this kind of implicit secure data access through intents and
content providers. Figure 10-65 illustrates how this situation works for our picture
emailing example. The camera application at the bottom-left has created an intent
asking to share one of its images, content://pics/1. In addition to starting the email
compose application as we had seen before, this also adds an entry to a list of
‘‘granted URIs,’’ noting that the new ComposeActivity now has access to this URI.
Now when ComposeActivity looks to open and read the data from the URI it has
been given, the camera application’s PicturesProvider that owns the data behind the
URI can ask the activity manager if the calling email application has access to the
data, which it does, so the picture is returned.
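The grant-and-check bookkeeping just described can be modeled in a few lines of C. This is a toy sketch: the names, data structures, and fixed-size table are invented for illustration, not Android's actual implementation, which lives inside the activity manager.

```c
#include <string.h>

/* Toy model of the activity manager's "granted URIs" list. */
#define MAX_GRANTS 16

struct uri_grant {
    char uri[64];        /* e.g., "content://pics/1" */
    char recipient[32];  /* e.g., "ComposeActivity" */
};

static struct uri_grant grants[MAX_GRANTS];
static int ngrants;

/* Recorded when an intent carrying a content URI is delivered:
   this is the specific hole opened between the two sandboxes. */
void grant_uri(const char *uri, const char *recipient)
{
    if (ngrants < MAX_GRANTS) {
        strncpy(grants[ngrants].uri, uri, sizeof(grants[0].uri) - 1);
        strncpy(grants[ngrants].recipient, recipient,
                sizeof(grants[0].recipient) - 1);
        ngrants++;
    }
}

/* Consulted by the content provider when the recipient opens the URI. */
int check_uri_access(const char *uri, const char *caller)
{
    for (int i = 0; i < ngrants; i++)
        if (strcmp(grants[i].uri, uri) == 0 &&
            strcmp(grants[i].recipient, caller) == 0)
            return 1;   /* access was granted by a user interaction */
    return 0;           /* no grant: the sandbox stays closed */
}
```

In the picture-emailing example, delivering the intent would call grant_uri("content://pics/1", "ComposeActivity"), and PicturesProvider would call check_uri_access() before returning the picture.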
This fine-grained URI access control can also operate the other way. There is
another intent action, GET CONTENT, which an application
can use to ask the user to pick some data and return it. This would be used in
our email application, for example, to operate the other way around: the user while
in the email application can ask to add an attachment, which will launch an activity
in the camera application for them to select one.
Figure 10-66 illustrates this new flow. It is almost identical to Fig. 10-65, the
only difference being in the way the activities of the two applications are com-
posed, with the email application starting the appropriate picture-selection activity
in the camera application. Once an image is selected, its URI is returned back to
the email application, and at this point our URI grant is recorded by the activity
manager.
This approach is extremely powerful, since it allows the system to maintain
tight control over per-application data, granting specific access to data where need-
ed, without the user needing to be aware that this is happening. Many other user
interactions can also benefit from it. An obvious one is drag and drop to create a
similar URI grant, but Android also takes advantage of other information such as
current window focus to determine the kinds of interactions applications can have.
[Figure 10-65 (diagram residue): the camera app process, with CameraActivity stopped and its state saved in task "Pictures," sends SEND content://pics/1; the email app process's ComposeActivity, resumed, opens content://pics/1 and receives the data.]
A final common security method Android uses is explicit user interfaces for
allowing/removing specific types of access. In this approach, there is some way an
application indicates it can optionally provide some functionality, and a
system-supplied trusted user interface that provides control over this access.
A typical example of this approach is Android’s input-method architecture.
An input method is a specific service supplied by a third-party application that al-
lows the user to provide input to applications, typically in the form of an on-screen
keyboard. This is a highly sensitive interaction in the system, since a lot of person-
al data will go through the input-method application, including passwords the user
types.
An application indicates it can be an input method by declaring a service in its
manifest with an intent filter matching the action for the system’s input-method
protocol. This does not, however, automatically allow it to become an input meth-
od, and unless something else happens the application’s sandbox has no ability to
operate like one.
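Concretely, such a declaration looks roughly like the following manifest fragment. The service class name here is made up for this sketch; the android.view.InputMethod action, the BIND_INPUT_METHOD permission, and the android.view.im meta-data key are the system's input-method protocol in the Android versions described here.

```xml
<!-- Hypothetical third-party keyboard declaring itself as an input method. -->
<service android:name=".SoftKeyboardService"
         android:permission="android.permission.BIND_INPUT_METHOD">
    <intent-filter>
        <action android:name="android.view.InputMethod" />
    </intent-filter>
    <meta-data android:name="android.view.im"
               android:resource="@xml/method" />
</service>
```

Even with this declaration in place, the application's sandbox still cannot act as an input method until the user explicitly enables it, as described next.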
Android’s system settings include a user interface for selecting input methods.
This interface shows all available input methods of the currently installed applica-
tions and whether or not they are enabled. If the user wants to use a new input
method after they have installed its application, they must go to this system settings
interface and enable it. When doing that, the system can also inform the user of
the kinds of things this will allow the application to do.
[Figure 10-66 (diagram residue): the activity manager's "granted URIs" list records an entry (to: ComposeActivity, authority: "pics", URI: content://pics/1), which PicturesProvider checks before allowing the open. The email app process's ComposeActivity, stopped with its state saved, issues GET content://pics/1; the camera app's PicturePickerActivity, resumed in task "Pictures," lets the user pick an image, and ComposeActivity receives the data.]
The traditional process model in Linux is a fork to create a new process, fol-
lowed by an exec to initialize that process with the code to be run and then start its
execution. The shell is responsible for driving this execution, forking and execut-
ing processes as needed to run shell commands. When those commands exit, the
process is removed by Linux.
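That fork/exec/wait cycle can be sketched in a few lines of C, a minimal version of what a shell does for each command; real shells add redirection, pipes, job control, and error reporting.

```c
#include <sys/wait.h>
#include <unistd.h>

/* Run one command the way a shell does: fork a child, exec the
   program in the child, and wait for it in the parent. */
int run_command(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;              /* fork failed */
    if (pid == 0) {             /* child: replace our image with the program */
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }
    int status;                 /* parent: wait for the child to terminate */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A shell's main loop parses each command line into an argv array and hands it to a function like this; by the time run_command() returns, Linux has already reaped the child.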
Android uses processes somewhat differently. As discussed in the previous
section on applications, the activity manager is the part of Android responsible for
managing running applications. It coordinates the launching of new application
processes, determines what will run in them, and when they are no longer needed.
Starting Processes
In order to launch new processes, the activity manager must communicate with
the zygote. When the activity manager first starts, it creates a dedicated socket
with zygote, through which it sends a command when it needs to start a process.
The command primarily describes the sandbox to be created: the UID that the new
process should run as and any other security restrictions that will apply to it.
Zygote thus must run as root: when it forks, it does the appropriate setup for the
UID it will run as, finally dropping root privileges and changing the process to the
desired UID.
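The fork-then-drop-privileges sequence can be sketched as follows. This is a simplified model: the real zygote also finishes runtime initialization, sets supplementary groups, and applies further security restrictions before handing control to application code.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child and drop it into the sandbox identified by uid/gid,
   the way zygote does when asked to start an application process. */
pid_t spawn_sandboxed(uid_t uid, gid_t gid)
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;        /* parent (or fork error): zygote keeps running */

    /* Child: drop the group first, then the user. The order matters,
       because after setuid() the process no longer has the privilege
       needed to change its group. */
    if (setgid(gid) != 0 || setuid(uid) != 0)
        _exit(1);          /* could not enter the sandbox: give up */

    /* ...finish runtime initialization and load the app's code here... */
    _exit(0);              /* stand-in for running the application */
}
```

Note the ordering: a zygote that called setuid() first would be unable to call setgid() afterward, leaving the child in the wrong group.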
Recall in our previous discussion about Android applications that the activity
manager maintains dynamic information about the execution of activities (in
Fig. 10-52), services (Fig. 10-57), broadcasts (to receivers as in Fig. 10-60), and
content providers (Fig. 10-61). It uses this information to drive the creation and
management of application processes. For example, when the application launcher
calls in to the system with a new intent to start an activity as we saw in Fig. 10-52,
it is the activity manager that is responsible for making that new application run.
The flow for starting an activity in a new process is shown in Fig. 10-67. The
details of each step in the illustration are:
1. Some existing process (such as the app launcher) calls in to the activ-
ity manager with an intent describing the new activity it would like to
have started.
2. Activity manager asks the package manager to resolve the intent to an
explicit component.
3. Activity manager determines that the application’s process is not al-
ready running, and then asks zygote for a new process of the ap-
propriate UID.
4. Zygote performs a fork, creating a new process that is a clone of itself,
drops privileges and sets its UID appropriately for the application’s
sandbox, and finishes initialization of Dalvik in that process so that
the Java runtime is fully executing. For example, it must start threads
like the garbage collector after it forks.
5. The new process, now a clone of zygote with the Java environment
fully up and running, calls back to the activity manager, asking
‘‘What am I supposed to do?’’
6. Activity manager returns back the full information about the applica-
tion it is starting, such as where to find its code.
7. New process loads the code for the application being run.
[Figure 10-67 (diagram residue): recoverable labels include (1) startActivity(), (2) resolve intent, (8) instantiate class, and "create a new process" via the zygote process.]
Note that when we started this activity, the application’s process may already
have been running. In that case, the activity manager will simply skip to the end,
sending a new command to the process telling it to instantiate and run the ap-
propriate component. This can result in an additional activity instance running in
the application, if appropriate, as we saw previously in Fig. 10-56.
Process Lifecycle
The activity manager is also responsible for determining when processes are
no longer needed. It keeps track of all activities, receivers, services, and content
providers running in a process; from this it can determine how important (or not)
the process is.
Recall that Android's out-of-memory killer in the kernel uses a process's
oom_adj as a strict ordering to determine which processes it should kill first. The
activity manager is responsible for setting each process's oom_adj appropriately
based on the state of that process, by classifying them into major categories of use.
Figure 10-68 shows the main categories, with the most important category first.
The last column shows a typical oom_adj value that is assigned to processes of this
type.
Now, when RAM is getting low, the system has configured the processes so
that the out-of-memory killer will first kill cached processes to try to reclaim
enough needed RAM, followed by home, service, and on up. Within a specific
oom_adj level, it will kill processes with a larger RAM footprint before smaller
ones.
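That two-level policy (category first, RAM footprint second) can be written down as a small comparison function. This is an illustrative model only; the kernel's actual low-memory heuristics take more inputs into account than these two fields.

```c
/* Illustrative model of victim selection when RAM runs low:
   prefer the highest oom_adj (least important category); within a
   category, prefer the process with the largest RAM footprint. */
struct proc_info {
    int  oom_adj;     /* higher value = killed earlier */
    long rss_pages;   /* resident RAM footprint, in pages */
};

/* Return the index of the process to kill first, or -1 if none. */
int pick_victim(const struct proc_info *p, int n)
{
    int victim = -1;
    for (int i = 0; i < n; i++)
        if (victim < 0 ||
            p[i].oom_adj > p[victim].oom_adj ||
            (p[i].oom_adj == p[victim].oom_adj &&
             p[i].rss_pages > p[victim].rss_pages))
            victim = i;
    return victim;
}
```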
We’ve now seen how Android decides when to start processes and how it cate-
gorizes those processes in importance. Now we need to decide when to have proc-
esses exit, right? Or do we really need to do anything more here? The answer is,
we do not. On Android, application processes never cleanly exit. The system just
leaves unneeded processes around, relying on the kernel to reap them as needed.
Cached processes in many ways take the place of the swap space that Android
lacks. As RAM is needed elsewhere, cached processes can be thrown out of active
RAM. If an application later needs to run again, a new process can be created,
restoring any previous state needed to return it to how the user last left it. Behind
the scenes, the operating system is launching, killing, and relaunching processes as
needed so the important foreground operations remain running and cached proc-
esses are kept around as long as their RAM would not be better used elsewhere.
Process Dependencies
We at this point have a good overview of how individual Android processes are
managed. There is a further complication to this, however: dependencies between
processes.
As an example, consider our previous camera application holding the pictures
that have been taken. These pictures are not part of the operating system; they are
10.9 SUMMARY
Linux began its life as an open-source, full-production UNIX clone, and is now
used on machines ranging from smartphones and notebook computers to
supercomputers. Three main interfaces to it exist: the shell, the C library, and the
system calls themselves. In addition, a graphical user interface is often used to sim-
plify user interaction with the system. The shell allows users to type commands for
execution. These may be simple commands, pipelines, or more complex struc-
tures. Input and output may be redirected. The C library contains the system calls
and also many enhanced calls, such as printf for writing formatted output to files.
The actual system call interface is architecture dependent, and on x86 platforms
consists of roughly 250 calls, each of which does what is needed and no more.
The key concepts in Linux include the process, the memory model, I/O, and
the file system. Processes may fork off subprocesses, leading to a tree of processes.
PROBLEMS
6. Write a Linux pipeline that prints the eighth line of file z on standard output.
7. Why does Linux distinguish between standard output and standard error, when both
default to the terminal?
8. A user at a terminal types the following commands:
a|b|c&
d|e|f&
After the shell has processed them, how many new processes are running?
9. When the Linux shell starts up a process, it puts copies of its environment variables,
such as HOME, on the process’ stack, so the process can find out what its home direc-
tory is. If this process should later fork, will the child automatically get these vari-
ables, too?
10. About how long does it take a traditional UNIX system to fork off a child process
under the following conditions: text size = 100 KB, data size = 20 KB, stack size = 10
KB, task structure = 1 KB, user structure = 5 KB. The kernel trap and return takes 1
msec, and the machine can copy one 32-bit word every 50 nsec. Text segments are
shared, but data and stack segments are not.
11. As multimegabyte programs became more common, the time spent executing the fork
system call and copying the data and stack segments of the calling process grew
proportionally. When fork is executed in Linux, the parent’s address space is not cop-
ied, as traditional fork semantics would dictate. How does Linux prevent the child from
doing something that would completely change the fork semantics?
12. Why are negative arguments to nice reserved exclusively for the superuser?
13. A non-real-time Linux process has priority levels from 100 to 139. What is the default
static priority and how is the nice value used to change this?
14. Does it make sense to take away a process’ memory when it enters zombie state? Why
or why not?
15. To what hardware concept is a signal closely related? Give two examples of how sig-
nals are used.
16. Why do you think the designers of Linux made it impossible for a process to send a
signal to another process that is not in its process group?
17. A system call is usually implemented using a software interrupt (trap) instruction.
Could an ordinary procedure call be used as well on the Pentium hardware? If so,
under what conditions and how? If not, why not?
18. In general, do you think daemons have higher or lower priority than interactive proc-
esses? Why?
19. When a new process is forked off, it must be assigned a unique integer as its PID. Is it
sufficient to have a counter in the kernel that is incremented on each process creation,
with the counter used as the new PID? Discuss your answer.
20. In every process’ entry in the task structure, the PID of the parent is stored. Why?
21. The copy-on-write mechanism is used as an optimization in the fork system call, so that
a copy of a page is created only when one of the processes (parent or child) tries to
write on the page. Suppose a process p1 forks processes p2 and p3 in quick succession.
Explain how page sharing may be handled in this case.
22. What combination of the sharing flags bits used by the Linux clone command corre-
sponds to a conventional UNIX fork call? To creating a conventional UNIX thread?
23. Two tasks A and B need to perform the same amount of work. However, task A has
higher priority, and needs to be given more CPU time. Explain how this will be
achieved in each of the Linux schedulers described in this chapter, the O(1) and the
CFS scheduler.
24. Some UNIX systems are tickless, meaning they do not have periodic clock interrupts.
Why is this done? Also, does ticklessness make sense on a computer (such as an em-
bedded system) running only one process?
25. When booting Linux (or most other operating systems for that matter), the bootstrap
loader in sector 0 of the disk first loads a boot program which then loads the operating
system. Why is this extra step necessary? Surely it would be simpler to have the boot-
strap loader in sector 0 just load the operating system directly.
26. A certain editor has 100 KB of program text, 30 KB of initialized data, and 50 KB of
BSS. The initial stack is 10 KB. Suppose that three copies of this editor are started si-
multaneously. How much physical memory is needed (a) if shared text is used, and (b)
if it is not?
27. Why are open-file-descriptor tables necessary in Linux?
28. In Linux, the data and stack segments are paged and swapped to a scratch copy kept on
a special paging disk or partition, but the text segment uses the executable binary file
instead. Why?
29. Describe a way to use mmap and signals to construct an interprocess-communication
mechanism.
30. A file is mapped in using the following mmap system call:
mmap(65536, 32768, READ, FLAGS, fd, 0)
Pages are 8 KB. Which byte in the file is accessed by reading a byte at memory ad-
dress 72,000?
31. After the system call of the previous problem has been executed, the call
munmap(65536, 8192)
is carried out. Does it succeed? If so, which bytes of the file remain mapped? If not,
why does it fail?
32. Can a page fault ever lead to the faulting process being terminated? If so, give an ex-
ample. If not, why not?
33. Is it possible that with the buddy system of memory management it ever occurs that
two adjacent blocks of free memory of the same size coexist without being merged into
one block? If so, explain how. If not, show that it is impossible.
34. It is stated in the text that a paging partition will perform better than a paging file. Why
is this so?
35. Give two examples of the advantages of relative path names over absolute ones.
36. The following locking calls are made by a collection of processes. For each call, tell
what happens. If a process fails to get a lock, it blocks.
(a) A wants a shared lock on bytes 0 through 10.
(b) B wants an exclusive lock on bytes 20 through 30.
(c) C wants a shared lock on bytes 8 through 40.
(d) A wants a shared lock on bytes 25 through 35.
(e) B wants an exclusive lock on byte 8.
37. Consider the locked file of Fig. 10-26(c). Suppose that a process tries to lock bytes 10
and 11 and blocks. Then, before C releases its lock, yet another process tries to lock
bytes 10 and 11, and also blocks. What kinds of problems are introduced into the
semantics by this situation? Propose and defend two solutions.
38. Explain under what situations a process may request a shared lock or an exclusive lock.
What problem may a process requesting an exclusive lock suffer from?
39. If a Linux file has protection mode 755 (octal), what can the owner, the owner’s group,
and everyone else do to the file?
40. Some tape drives have numbered blocks and the ability to overwrite a particular block
in place without disturbing the blocks in front of or behind it. Could such a device hold
a mounted Linux file system?
41. In Fig. 10-24, both Fred and Lisa have access to the file x in their respective directories
after linking. Is this access completely symmetrical in the sense that anything one of
them can do with it the other one can, too?
42. As we have seen, absolute path names are looked up starting at the root directory and
relative path names are looked up starting at the working directory. Suggest an efficient
way to implement both kinds of searches.
43. When the file /usr/ast/work/f is opened, several disk accesses are needed to read i-node
and directory blocks. Calculate the number of disk accesses required under the as-
sumption that the i-node for the root directory is always in memory, and all directories
are one block long.
44. A Linux i-node has 12 disk addresses for data blocks, as well as the addresses of sin-
gle, double, and triple indirect blocks. If each of these holds 256 disk addresses, what
is the size of the largest file that can be handled, assuming that a disk block is 1 KB?
45. When an i-node is read in from the disk during the process of opening a file, it is put
into an i-node table in memory. This table has some fields that are not present on the
disk. One of them is a counter that keeps track of the number of times the i-node has
been opened. Why is this field needed?
46. On multi-CPU platforms, Linux maintains a runqueue for each CPU. Is this a good
idea? Explain your answer.
47. The concept of loadable modules is useful in that new device drivers may be loaded in
the kernel while the system is running. Provide two disadvantages of this concept.
48. Pdflush threads can be awakened periodically to write back to disk very old pages—
older than 30 sec. Why is this necessary?
49. After a system crash and reboot, a recovery program is usually run. Suppose this pro-
gram discovers that the link count in a disk i-node is 2, but only one directory entry
references the i-node. Can it fix the problem, and if so, how?
50. Make an educated guess as to which Linux system call is the fastest.
51. Is it possible to unlink a file that has never been linked? What happens?
52. Based on the information presented in this chapter, if a Linux ext2 file system were to
be put on a 1.44-MB floppy disk, what is the maximum amount of user file data that
could be stored on the disk? Assume that disk blocks are 1 KB.
53. In view of all the trouble that students can cause if they get to be superuser, why does
this concept exist in the first place?
54. A professor shares files with his students by placing them in a publicly accessible di-
rectory on the Computer Science department’s Linux system. One day he realizes that
a file placed there the previous day was left world-writable. He changes the permis-
sions and verifies that the file is identical to his master copy. The next day he finds that
the file has been changed. How could this have happened and how could it have been
prevented?
55. Linux supports a system call fsuid. Unlike setuid, which grants the user all the rights
of the effective id associated with a program he is running, fsuid grants the user who is
running the program special rights only with respect to access to files. Why is this fea-
ture useful?
56. On a Linux system, go to /proc/#### directory, where #### is a decimal number cor-
responding to a process currently running in the system. Answer the following along
with an explanation:
(a) What is the size of most of the files in this directory?
(b) What are the time and date settings of most of the files?
(c) What type of access right is provided to the users for accessing the files?
57. If you are writing an Android activity to display a Web page in a browser, how would
you implement its activity-state saving to minimize the amount of saved state without
losing anything important?
58. If you are writing networking code on Android that uses a socket to download a file,
what should you consider doing that is different than on a standard Linux system?
59. If you are designing something like Android’s zygote process for a system that will
have multiple threads running in each process forked from it, would you prefer to start
those threads in zygote or after the fork?
60. Imagine you use Android’s Binder IPC to send an object to another process. You later
receive an object from a call into your process, and find that what you have received is
the same object as previously sent. What can you assume or not assume about the cal-
ler in your process?
61. Consider an Android system that, immediately after starting, follows these steps:
1. The home (or launcher) application is started.
2. The email application starts syncing its mailbox in the background.
3. The user launches a camera application.
4. The user launches a Web browser application.
The Web page the user is now viewing in the browser application requires increasingly
more RAM, until it needs everything it can get. What happens?
62. Write a minimal shell that allows simple commands to be started. It should also allow
them to be started in the background.
63. Using assembly language and BIOS calls, write a program that boots itself from a flop-
py disk on a Pentium-class computer. The program should use BIOS calls to read the
keyboard and echo the characters typed, just to demonstrate that it is running.
64. Write a dumb terminal program to connect two Linux computers via the serial ports.
Use the POSIX terminal management calls to configure the ports.
65. Write a client-server application which, on request, transfers a large file via sockets.
Reimplement the same application using shared memory. Which version do you expect
to perform better? Why? Conduct performance measurements with the code you have
written and using different file sizes. What are your observations? What do you think
happens inside the Linux kernel which results in this behavior?
66. Implement a basic user-level threads library to run on top of Linux. The library API
should contain function calls like mythreads_init, mythreads_create, mythreads_join,
mythreads_exit, mythreads_yield, mythreads_self, and perhaps a few others. Next,
implement these synchronization variables to enable safe concurrent operations:
mythreads_mutex_init, mythreads_mutex_lock, mythreads_mutex_unlock. Before
starting, clearly define the API and specify the semantics of each of the calls. Next
implement the user-level library with a simple, round-robin preemptive scheduler. You will
also need to write one or more multithreaded applications, which use your library, in
order to test it. Finally, replace the simple scheduling mechanism with another one
which behaves like the Linux 2.6 O(1) scheduler described in this chapter. Compare
the performance your application(s) receive when using each of the schedulers.
67. Write a shell script that displays some important system information such as what
processes you are running, your home directory and current directory, processor type,
current CPU utilization, etc.
11
CASE STUDY 2: WINDOWS 8
Figure 11-1. Major releases in the history of Microsoft operating systems for
desktop PCs.
In the early 1980s IBM, at the time the biggest and most powerful computer
company in the world, was developing a personal computer based on the Intel 8088
microprocessor. Since the mid-1970s, Microsoft had become the leading provider
of the BASIC programming language for 8-bit microcomputers based on the 8080
and Z-80. When IBM approached Microsoft about licensing BASIC for the new
IBM PC, Microsoft readily agreed and suggested that IBM contact Digital Re-
search to license its CP/M operating system, since Microsoft was not then in the
operating system business. IBM did that, but the president of Digital Research,
Gary Kildall, was too busy to meet with IBM. This was probably the worst blun-
der in all of business history, since had he licensed CP/M to IBM, Kildall would
probably have become the richest man on the planet. Rebuffed by Kildall, IBM
came back to Bill Gates, the cofounder of Microsoft, and asked for help again.
Within a short time, Microsoft bought a CP/M clone from a local company, Seattle
Computer Products, ported it to the IBM PC, and licensed it to IBM. It was then
renamed MS-DOS 1.0 (MicroSoft Disk Operating System) and shipped with the
first IBM PC in 1981.
SEC. 11.1 HISTORY OF WINDOWS THROUGH WINDOWS 8.1 859
Cutler’s system was called NT for New Technology (and also because the orig-
inal target processor was the new Intel 860, code-named the N10). NT was de-
signed to be portable across different processors and emphasized security and
reliability, as well as compatibility with the MS-DOS-based versions of Windows.
Cutler’s background at DEC shows in various places, with there being more than a
passing similarity between the design of NT and that of VMS and other operating
systems designed by Cutler, shown in Fig. 11-2.
NT did meet its portability goals, with additional releases in 1994 and 1995
adding support for (little-endian) MIPS and PowerPC architectures. The first
major upgrade to NT came with Windows NT 4.0 in 1996. This system had the
power, security, and reliability of NT, but also sported the same user interface as
the by-then very popular Windows 95.
Figure 11-3 shows the relationship of the Win32 API to Windows. Having a
common API across both the MS-DOS-based and NT-based Windows was impor-
tant to the success of NT.
This compatibility made it much easier for users to migrate from Windows 95
to NT, and the operating system became a strong player in the high-end desktop
market as well as servers. However, customers were not as willing to adopt other
processor architectures, and of the four architectures Windows NT 4.0 supported in
1996 (the DEC Alpha was added in that release), only the x86 (i.e., Pentium fam-
ily) was still actively supported by the time of the next major release, Windows
2000.
[Figure 11-3 (diagram residue): the Win32 API runs on Windows 3.0/3.1 (via
Win32s), Windows 95/98/98SE/Me, Windows NT/2000/Vista/7, and Windows 8/8.1.]
Figure 11-3. The Win32 API allows programs to run on almost all versions of
Windows.
Windows 2000 represented a significant evolution for NT. The key technolo-
gies added were plug-and-play (for consumers who installed a new PCI card, elim-
inating the need to fiddle with jumpers), network directory services (for enterprise
customers), improved power management (for notebook computers), and an im-
proved GUI (for everyone).
The technical success of Windows 2000 led Microsoft to push toward the dep-
recation of Windows 98 by enhancing the application and device compatibility of
the next NT release, Windows XP. Windows XP included a friendlier new look-
and-feel to the graphical interface, bolstering Microsoft’s strategy of hooking con-
sumers and reaping the benefit as they pressured their employers to adopt systems
with which they were already familiar. The strategy was overwhelmingly suc-
cessful, with Windows XP being installed on hundreds of millions of PCs over its
first few years, allowing Microsoft to achieve its goal of effectively ending the era
of MS-DOS-based Windows.
At the same time, processor performance ceased to improve at the same rate it had
previously, due to the difficulties in dissipating the heat created by ever-increasing
clock speeds. Moore’s Law continued to hold, but the additional transistors were
going into new features and multiple processors rather than improvements in sin-
gle-processor performance. All the bloat in Windows Vista meant that it per-
formed poorly on these computers relative to Windows XP, and the release was
never widely accepted.
The issues with Windows Vista were addressed in the subsequent release,
Windows 7. Microsoft invested heavily in testing and performance automation,
new telemetry technology, and extensively strengthened the teams charged with
improving performance, reliability, and security. Though Windows 7 had rela-
tively few functional changes compared to Windows Vista, it was better engineered
and more efficient. Windows 7 quickly supplanted Vista and ultimately Windows
XP to be the most popular version of Windows to date.
By the time Windows 7 shipped, the computing industry once again began to
change dramatically. The success of the Apple iPhone as a portable computing de-
vice, and the advent of the Apple iPad, had heralded a sea-change which led to the
dominance of lower-cost Android tablets and phones, much as Microsoft had dom-
inated the desktop in the first three decades of personal computing. Small,
portable, yet powerful devices and ubiquitous fast networks were creating a world
where mobile computing and network-based services were becoming the dominant
paradigm. The old world of portable computers was replaced by machines with
small screens that ran applications readily downloadable from the Web. These ap-
plications were not the traditional variety, like word processing, spreadsheets, and
connecting to corporate servers. Instead, they provided access to services like Web
search, social networking, Wikipedia, streaming music and video, shopping, and
personal navigation. The business models for computing were also changing, with
advertising opportunities becoming the largest economic force behind computing.
Microsoft began a process to redesign itself as a devices and services company
in order to better compete with Google and Apple. It needed an operating system
it could deploy across a wide spectrum of devices: phones, tablets, game consoles,
laptops, desktops, servers, and the cloud. Windows thus underwent an even bigger
evolution than with Windows Vista, resulting in Windows 8. However, this time
Microsoft applied the lessons from Windows 7 to create a well-engineered, per-
formant product with less bloat.
Windows 8 built on the modular MinWin approach Microsoft used in Win-
dows 7 to produce a small operating system core that could be extended onto dif-
ferent devices. The goal was for each of the operating systems for specific devices
to be built by extending this core with new user interfaces and features, yet provide
as common an experience for users as possible. This approach was successfully
864 CASE STUDY 2: WINDOWS 8 CHAP. 11
applied to Windows Phone 8, which shares most of the core binaries with desktop
and server Windows. Support of phones and tablets by Windows required support
for the popular ARM architecture, as well as new Intel processors targeting those
devices. What makes Windows 8 part of the Modern Windows era are the funda-
mental changes in the programming models, as we will examine in the next sec-
tion.
Windows 8 was not received to universal acclaim. In particular, the lack of the
Start Button on the taskbar (and its associated menu) was viewed by many users as
a huge mistake. Others objected to using a tablet-like interface on a desktop ma-
chine with a large monitor. Microsoft responded to this and other criticisms on
May 14, 2013 by releasing an update called Windows 8.1. This version fixed
these problems while at the same time introducing a host of new features, such as
better cloud integration, as well as a number of new programs. Although we will
stick to the more generic name of ‘‘Windows 8’’ in this chapter, in fact, everything
in it is a description of how Windows 8.1 works.
It is now time to start our technical study of Windows. Before getting into the
details of the internal structure, however, we will take a look at the native NT API
for system calls, the Win32 programming subsystem introduced as part of NT-
based Windows, and the Modern WinRT programming environment introduced
with Windows 8.
Figure 11-4 shows the layers of the Windows operating system. Beneath the
applet and GUI layers of Windows are the programming interfaces that applica-
tions build on. As in most operating systems, these consist largely of code libraries
(DLLs) to which programs dynamically link for access to operating system fea-
tures. Windows also includes a number of programming interfaces which are im-
plemented as services that run as separate processes. Applications communicate
with user-mode services through RPCs (Remote Procedure Calls).
The core of the NT operating system is the NTOS kernel-mode program
(ntoskrnl.exe), which provides the traditional system-call interfaces upon which the
rest of the operating system is built. In Windows, only programmers at Microsoft
write to the system-call layer. The published user-mode interfaces all belong to
operating system personalities that are implemented using subsystems that run on
top of the NTOS layers.
Originally NT supported three personalities: OS/2, POSIX and Win32. OS/2
was discarded in Windows XP. Support for POSIX was finally removed in Win-
dows 8.1. Today all Windows applications are written using APIs that are built on
top of the Win32 subsystem, such as the WinFX API in the .NET programming
model. The WinFX API includes many of the features of Win32, and in fact many
SEC. 11.2 PROGRAMMING WINDOWS 865
of the functions in the WinFX Base Class Library are simply wrappers around
Win32 APIs. The advantages of WinFX have to do with the richness of the object
types supported, the simplified consistent interfaces, and use of the .NET Common
Language Run-time (CLR), including garbage collection (GC).
The Modern versions of Windows begin with Windows 8, which introduced
the new WinRT set of APIs. Windows 8 deprecated the traditional Win32 desktop
experience in favor of running a single application at a time on the full screen with
an emphasis on touch over use of the mouse. Microsoft saw this as a necessary
step as part of the transition to a single operating system that would work with
phones, tablets, and game consoles, as well as traditional PCs and servers. The
GUI changes necessary to support this new model require that applications be
rewritten to a new API model, the Modern Software Development Kit, which in-
cludes the WinRT APIs. The WinRT APIs are carefully curated to produce a more
consistent set of behaviors and interfaces. These APIs have versions available for
C++ and .NET programs, but also for JavaScript applications hosted in a browser-
like WWA (Windows Web Application) environment.
In addition to WinRT APIs, many of the existing Win32 APIs were included in
the MSDK (Modern Software Development Kit). The initially available WinRT APIs
were not sufficient to write many programs. Some of the included Win32 APIs
were chosen to limit the behavior of applications. For example, applications can-
not create threads directly with the MSDK, but must rely on the Win32 thread pool
to run concurrent activities within a process. This is because Modern Windows is
shifting programmers away from a threading model to a task model in order to dis-
entangle resource management (priorities, processor affinities) from the pro-
gramming model (specifying concurrent activities). Other omitted Win32 APIs in-
clude most of the Win32 virtual memory APIs. Programmers are expected to rely
on the Win32 heap-management APIs rather than attempt to manage memory re-
sources directly. APIs that were already deprecated in Win32 were also omitted
from the MSDK, as were all ANSI APIs. The MSDK APIs are Unicode only.
The choice of the word Modern to describe a product such as Windows is sur-
prising. Perhaps if a whole new generation of Windows appears ten years from
now, it will be referred to as post-Modern Windows.
Unlike traditional Win32 processes, the processes running modern applications
have their lifetimes managed by the operating system. When a user switches away
from an application, the system gives it a couple of seconds to save its state and
then ceases to give it further processor resources until the user switches back to the
application. If the system runs low on resources, the operating system may termi-
nate the application’s processes without the application ever running again. When
the user switches back to the application at some time in the future, it will be re-
started by the operating system. Applications that need to run tasks in the back-
ground must specifically arrange to do so using a new set of WinRT APIs. Back-
ground activity is carefully managed by the system to improve battery life and pre-
vent interference with the foreground application the user is currently using. These
changes were made to make Windows function better on mobile devices.
In the Win32 desktop world applications are deployed by running an installer
that is part of the application. Modern applications have to be installed using Win-
dows’ AppStore program, which will deploy only applications that were uploaded
into the Microsoft on-line store by the developer. Microsoft is following the same
successful model introduced by Apple and adopted by Android. Microsoft will not
accept applications into the store unless they pass verification which, among other
checks, ensures that the application is using only APIs available in the MSDK.
When a modern application is running, it always executes in a sandbox called
an AppContainer. Sandboxing process execution is a security technique for iso-
lating less trusted code so that it cannot freely tamper with the system or user data.
The Windows AppContainer treats each application as a distinct user, and uses
Windows security facilities to keep the application from accessing arbitrary system
resources. When an application does need access to a system resource, there are
WinRT APIs that communicate to broker processes which do have access to more
of the system, such as a user’s files.
As shown in Fig. 11-5, NT subsystems are constructed out of four compo-
nents: a subsystem process, a set of libraries, hooks in CreateProcess, and support
in the kernel. A subsystem process is really just a service. The only special prop-
erty is that it is started by the smss.exe (session manager) program—the initial
user-mode program started by NT—in response to a request from CreateProcess
in Win32 or the corresponding API in a different subsystem. Although Win32 is
the only remaining subsystem supported, Windows still maintains the subsystem
model, including the csrss.exe Win32 subsystem process.
[Figure 11-5. The components used to build NT subsystems: a program process linked against subsystem libraries, the subsystem process itself, local procedure call (LPC), native NT system services, and subsystem kernel support in the NTOS executive.]
Like all other operating systems, Windows has a set of system calls it can per-
form. In Windows, these are implemented in the NTOS executive layer that runs
in kernel mode. Microsoft has published very few of the details of these native
system calls. They are used internally by lower-level programs that ship as part of
the operating system (mainly services and the subsystems), as well as kernel-mode
device drivers. The native NT system calls do not really change very much from
release to release, but Microsoft chose not to make them public so that applications
written for Windows would be based on Win32 and thus more likely to work with
both the MS-DOS-based and NT-based Windows systems, since the Win32 API is
common to both.
Most of the native NT system calls operate on kernel-mode objects of one kind
or another, including files, processes, threads, pipes, semaphores, and so on. Fig-
ure 11-6 gives a list of some of the common categories of kernel-mode objects sup-
ported by the kernel in Windows. Later, when we discuss the object manager, we
will provide further details on the specific object types.
Sometimes use of the term object regarding the data structures manipulated by
the operating system can be confusing because it is mistaken for object-oriented.
Operating system objects do provide data hiding and abstraction, but they lack
some of the most basic properties of object-oriented systems such as inheritance
and polymorphism.
In the native NT API, calls are available to create new kernel-mode objects or
access existing ones. Every call creating or opening an object returns a result called
a handle to the caller. The handle can subsequently be used to perform operations
on the object. Handles are specific to the process that created them. In general
handles cannot be passed directly to another process and used to refer to the same
object. However, under certain circumstances, it is possible to duplicate a handle
into the handle table of other processes in a protected way, allowing processes to
share access to objects—even if the objects are not accessible in the namespace.
The process duplicating each handle must itself have handles for both the source
and target process.
Every object has a security descriptor associated with it, telling in detail who
may and may not perform what kinds of operations on the object based on the
access requested. When handles are duplicated between processes, new access
restrictions can be added that are specific to the duplicated handle. Thus, a process
can duplicate a read-write handle and turn it into a read-only version in the target
process.
Not all system-created data structures are objects and not all objects are kernel-
mode objects. The only ones that are true kernel-mode objects are those that need
to be named, protected, or shared in some way. Usually, they represent some kind
of programming abstraction implemented in the kernel. Every kernel-mode object
has a system-defined type, has well-defined operations on it, and occupies storage
in kernel memory. Although user-mode programs can perform the operations (by
making system calls), they cannot get at the data directly.
Figure 11-7 shows a sampling of the native APIs, all of which use explicit
handles to manipulate kernel-mode objects such as processes, threads, IPC ports,
and sections (which are used to describe memory objects that can be mapped into
address spaces). NtCreateProcess returns a handle to a newly created process ob-
ject, representing an executing instance of the program represented by the Section-
Handle. DebugPortHandle is used to communicate with a debugger when giving it
control of the process after an exception (e.g., dividing by zero or accessing invalid
memory). ExceptPortHandle is used to communicate with a subsystem process
when errors occur and are not handled by an attached debugger.
Figure 11-7. Examples of native NT API calls that use handles to manipulate ob-
jects across process boundaries.
handle for the object. Such objects can even extend the NT namespace by provid-
ing parse routines that allow the objects to function somewhat like mount points in
UNIX. File systems and the registry use this facility to mount volumes and hives
onto the NT namespace. Accessing the device object for a volume gives access to
the raw volume, but the device object also represents an implicit mount of the vol-
ume into the NT namespace. The individual files on a volume can be accessed by
concatenating the volume-relative file name onto the end of the name of the device
object for that volume.
Permanent names are also used to represent synchronization objects and shared
memory, so that they can be shared by processes without being continually recreat-
ed as processes stop and start. Device objects and often driver objects are given
permanent names, giving them some of the persistence properties of the special i-
nodes kept in the /dev directory of UNIX.
We will describe many more of the features in the native NT API in the next
section, where we discuss the Win32 APIs that provide wrappers around the NT
system calls.
The Win32 function calls are collectively called the Win32 API. These inter-
faces are publicly disclosed and fully documented. They are implemented as li-
brary procedures that either wrap the native NT system calls used to get the work
done or, in some cases, do the work right in user mode. Though the native NT
APIs are not published, most of the functionality they provide is accessible through
the Win32 API. The existing Win32 API calls rarely change with new releases of
Windows, though many new functions are added to the API.
Figure 11-8 shows various low-level Win32 API calls and the native NT API
calls that they wrap. What is interesting about the figure is how uninteresting the
mapping is. Most low-level Win32 functions have native NT equivalents, which is
not surprising as Win32 was designed with NT in mind. In many cases the Win32
layer must manipulate the Win32 parameters to map them onto NT, for example,
canonicalizing path names and mapping onto the appropriate NT path names, in-
cluding special MS-DOS device names (like LPT:). The Win32 APIs for creating
processes and threads also must notify the Win32 subsystem process, csrss.exe,
that there are new processes and threads for it to supervise, as we will describe in
Sec. 11.4.
Some Win32 calls take path names, whereas the equivalent NT calls use hand-
les. So the wrapper routines have to open the files, call NT, and then close the
handle at the end. The wrappers also translate the Win32 APIs from ANSI to Uni-
code. The Win32 functions shown in Fig. 11-8 that use strings as parameters are
actually two APIs, for example, CreateProcessW and CreateProcessA. The
strings passed to the latter API must be translated to Unicode before calling the un-
derlying NT API, since NT works only with Unicode.
Figure 11-8. Examples of Win32 API calls and the native NT API calls that they
wrap.
Since few changes are made to the existing Win32 interfaces in each release of
Windows, in theory the binary programs that ran correctly on any previous release
will continue to run correctly on a new release. In practice, there are often many
compatibility problems with new releases. Windows is so complex that a few
seemingly inconsequential changes can cause application failures. And applica-
tions themselves are often to blame, since they frequently make explicit checks for
specific operating system versions or fall victim to their own latent bugs that are
exposed when they run on a new release. Nevertheless, Microsoft makes an effort
in every release to test a wide variety of applications to find incompatibilities and
either correct them or provide application-specific workarounds.
Windows supports two special execution environments both called WOW
(Windows-on-Windows). WOW32 is used on 32-bit x86 systems to run 16-bit
Windows 3.x applications by mapping the system calls and parameters between the
16-bit and 32-bit worlds. Similarly, WOW64 allows 32-bit Windows applications
to run on x64 systems.
The Windows API philosophy is very different from the UNIX philosophy. In
the latter, the operating system functions are simple, with few parameters and few
places where there are multiple ways to perform the same operation. Win32 pro-
vides very comprehensive interfaces with many parameters, often with three or
four ways of doing the same thing, and mixing together low-level and high-level
functions, like CreateFile and CopyFile.
This means Win32 provides a very rich set of interfaces, but it also introduces
much complexity due to the poor layering of a system that intermixes both high-
level and low-level functions in the same API. For our study of operating systems,
only the low-level functions of the Win32 API that wrap the native NT API are rel-
evant, so those are what we will focus on.
Win32 has calls for creating and managing both processes and threads. There
are also many calls that relate to interprocess communication, such as creating, de-
stroying, and using mutexes, semaphores, events, communication ports, and other
IPC objects.
Although much of the memory-management system is invisible to pro-
grammers, one important feature is visible: namely the ability of a process to map
a file onto a region of its virtual memory. This allows threads running in a process
the ability to read and write parts of the file using pointers without having to expli-
citly perform read and write operations to transfer data between the disk and mem-
ory. With memory-mapped files the memory-management system itself performs
the I/Os as needed (demand paging).
Windows implements memory-mapped files using three completely different
facilities. First it provides interfaces which allow processes to manage their own
virtual address space, including reserving ranges of addresses for later use. Sec-
ond, Win32 supports an abstraction called a file mapping, which is used to repres-
ent addressable objects like files (a file mapping is called a section in the NT
layer). Most often, file mappings are created to refer to files using a file handle,
but they can also be created to refer to private pages allocated from the system
pagefile.
The third facility maps views of file mappings into a process’ address space.
Win32 allows only a view to be created for the current process, but the underlying
NT facility is more general, allowing views to be created for any process for which
you have a handle with the appropriate permissions. Separating the creation of a
file mapping from the operation of mapping the file into the address space is a dif-
ferent approach than used in the mmap function in UNIX.
In Windows, the file mappings are kernel-mode objects represented by a hand-
le. Like most handles, file mappings can be duplicated into other processes. Each
of these processes can map the file mapping into its own address space as it sees
fit. This is useful for sharing private memory between processes without having to
create files for sharing. At the NT layer, file mappings (sections) can also be made
persistent in the NT namespace and accessed by name.
An important area for many programs is file I/O. In the basic Win32 view, a
file is just a linear sequence of bytes. Win32 provides over 60 calls for creating
and destroying files and directories, opening and closing files, reading and writing
them, requesting and setting file attributes, locking ranges of bytes, and many more
fundamental operations on both the organization of the file system and access to
individual files.
There are also various advanced facilities for managing data in files. In addi-
tion to the primary data stream, files stored on the NTFS file system can have addi-
tional data streams. Files (and even entire volumes) can be encrypted. Files can be
compressed, and/or represented as a sparse stream of bytes where missing regions
of data in the middle occupy no storage on disk. File-system volumes can be
organized out of multiple separate disk partitions using different levels of RAID
drawing geometric figures, filling them in, managing the color palettes they use,
dealing with fonts, and placing icons on the screen. Finally, there are calls for
dealing with the keyboard, mouse and other human-input devices as well as audio,
printing, and other output devices.
The GUI operations work directly with the win32k.sys driver using special in-
terfaces to access these functions in kernel mode from user-mode libraries. Since
these calls do not involve the core system calls in the NTOS executive, we will not
say more about them.
Figure 11-10. Some of the Win32 API calls for using the registry.
When the system is turned off, most of the registry information is stored on the
disk in the hives. Because their integrity is so critical to correct system func-
tioning, backups are made automatically and metadata writes are flushed to disk to
prevent corruption in the event of a system crash. Loss of the registry requires
reinstalling all software on the system.
[Figure 11-11. Windows kernel-mode organization: the NTOS executive layer (drivers; processes and threads; virtual memory; object manager; configuration manager; file systems and volume manager; LPC; cache manager; I/O manager; security monitor; TCP/IP stack and network interfaces; graphics and other devices; executive run-time library) running above the hardware (CPU, MMU, interrupt controllers, memory, physical devices, BIOS).]
The uppermost layer in Fig. 11-11 is the system library (ntdll.dll), which ac-
tually runs in user mode. The system library includes a number of support func-
tions for the compiler run-time and low-level libraries, similar to what is in libc in
UNIX. Ntdll.dll also contains special code entry points used by the kernel to ini-
tialize threads and dispatch exceptions and user-mode APCs (Asynchronous Pro-
cedure Calls). Because the system library is so integral to the operation of the ker-
nel, every user-mode process created by NTOS has ntdll mapped at the same fixed
address. When NTOS is initializing the system it creates a section object to use
when mapping ntdll, and it also records addresses of the ntdll entry points used by
the kernel.
Below the NTOS kernel and executive layers is a layer of software called the
HAL (Hardware Abstraction Layer) which abstracts low-level hardware details
like access to device registers and DMA operations, and the way the parentboard
firmware represents configuration information.
SEC. 11.3 SYSTEM STRUCTURE 879
One goal of Windows is to make the system portable across hardware plat-
forms. Ideally, to bring up an operating system on a new type of computer system
it should be possible to just recompile the operating system on the new platform.
Unfortunately, it is not this simple. While many of the components in some layers
of the operating system can be largely portable (because they mostly deal with in-
ternal data structures and abstractions that support the programming model), other
layers must deal with device registers, interrupts, DMA, and other hardware fea-
tures that differ significantly from machine to machine.
Most of the source code for the NTOS kernel is written in C rather than assem-
bly language (only 2% is assembly on x86, and less than 1% on x64). However, all
this C code cannot just be scooped up from an x86 system, plopped down on, say,
an ARM system, recompiled, and rebooted owing to the many hardware differ-
ences between processor architectures that have nothing to do with the different in-
struction sets and which cannot be hidden by the compiler. Languages like C make
it difficult to abstract away some hardware data structures and parameters, such as
the format of page-table entries and the physical memory page sizes and word
length, without severe performance penalties. All of these, as well as a slew of
hardware-specific optimizations, would have to be manually ported even though
they are not written in assembly code.
Hardware details about how memory is organized on large servers, or what
hardware synchronization primitives are available, can also have a big impact on
higher levels of the system. For example, NT’s virtual memory manager and the
kernel layer are aware of hardware details related to cache and memory locality.
Throughout the system NT uses compare&swap synchronization primitives, and it
would be difficult to port to a system that does not have them. Finally, there are
many dependencies in the system on the ordering of bytes within words. On all the
systems NT has ever been ported to, the hardware was set to little-endian mode.
Besides these larger issues of portability, there are also minor ones even be-
tween different parentboards from different manufacturers. Differences in CPU
versions affect how synchronization primitives like spin-locks are implemented.
There are several families of support chips that create differences in how hardware
interrupts are prioritized, how I/O device registers are accessed, management of
DMA transfers, control of the timers and real-time clock, multiprocessor synchron-
ization, working with firmware facilities such as ACPI (Advanced Configuration
and Power Interface), and so on. Microsoft made a serious attempt to hide these
types of machine dependencies in a thin layer at the bottom called the HAL, as
mentioned earlier. The job of the HAL is to present the rest of the operating sys-
tem with abstract hardware that hides the specific details of processor version, sup-
port chipset, and other configuration variations. These HAL abstractions are pres-
ented in the form of machine-independent services (procedure calls and macros)
that NTOS and the drivers can use.
By using the HAL services and not addressing the hardware directly, drivers
and the kernel require fewer changes when being ported to new processors—and in
most cases can run unmodified on systems with the same processor architecture,
despite differences in versions and support chips.
The HAL does not provide abstractions or services for specific I/O devices
such as keyboards, mice, and disks or for the memory management unit. These
facilities are spread throughout the kernel-mode components, and without the HAL
the amount of code that would have to be modified when porting would be sub-
stantial, even when the actual hardware differences were small. Porting the HAL
itself is straightforward because all the machine-dependent code is concentrated in
one place and the goals of the port are well defined: implement all of the HAL ser-
vices. For many releases Microsoft supported a HAL Development Kit allowing
system manufacturers to build their own HAL, which would allow other kernel
components to work on new systems without modification, provided that the hard-
ware changes were not too great.
As an example of what the hardware abstraction layer does, consider the issue
of memory-mapped I/O vs. I/O ports. Some machines have one and some have the
other. How should a driver be programmed: to use memory-mapped I/O or not?
Rather than forcing a choice, which would make the driver not portable to a ma-
chine that did it the other way, the hardware abstraction layer offers three proce-
dures for driver writers to use for reading the device registers and another three for
writing them:
uc = READ_PORT_UCHAR(port); WRITE_PORT_UCHAR(port, uc);
us = READ_PORT_USHORT(port); WRITE_PORT_USHORT(port, us);
ul = READ_PORT_ULONG(port); WRITE_PORT_ULONG(port, ul);
These procedures read and write unsigned 8-, 16-, and 32-bit integers, respectively,
to the specified port. It is up to the hardware abstraction layer to decide whether
memory-mapped I/O is needed here. In this way, a driver can be moved without
modification between machines that differ in the way the device registers are im-
plemented.
Drivers frequently need to access specific I/O devices for various purposes. At
the hardware level, a device has one or more addresses on a certain bus. Since
modern computers often have multiple buses (PCI, PCIe, USB, IEEE 1394, etc.), it
can happen that more than one device may have the same address on different
buses, so some way is needed to distinguish them. The HAL provides a service for
identifying devices by mapping bus-relative device addresses onto systemwide log-
ical addresses. In this way, drivers do not have to keep track of which device is
connected to which bus. This mechanism also shields higher layers from proper-
ties of alternative bus structures and addressing conventions.
Interrupts have a similar problem—they are also bus dependent. Here, too, the
HAL provides services to name interrupts in a systemwide way and also provides
ways to allow drivers to attach interrupt service routines to interrupts in a portable
way, without having to know anything about which interrupt vector is for which
bus. Interrupt request level management is also handled in the HAL.
Another HAL service is setting up and managing DMA transfers in a de-
vice-independent way. Both the systemwide DMA engine and DMA engines on
specific I/O cards can be handled. Devices are referred to by their logical ad-
dresses. The HAL implements software scatter/gather (writing or reading from
noncontiguous blocks of physical memory).
The HAL also manages clocks and timers in a portable way. Time is kept
track of in units of 100 nanoseconds starting at midnight on 1 January 1601, which
is the first date in the previous quadricentury, which simplifies leap-year computa-
tions. (Quick Quiz: Was 1800 a leap year? Quick Answer: No.) The time services
decouple the drivers from the actual frequencies at which the clocks run.
Kernel components sometimes need to synchronize at a very low level, espe-
cially to prevent race conditions in multiprocessor systems. The HAL provides
primitives to manage this synchronization, such as spin locks, in which one CPU
simply waits for a resource held by another CPU to be released, particularly in
situations where the resource is typically held only for a few machine instructions.
Finally, after the system has been booted, the HAL talks to the computer’s
firmware (BIOS) and inspects the system configuration to find out which buses and
I/O devices the system contains and how they have been configured. This infor-
mation is then put into the registry. A summary of some of the things the HAL
does is given in Fig. 11-12.
[Figure 11-12. Some of the hardware functions the HAL manages: device registers, device addresses, interrupts, DMA, timers, spin locks, and the firmware (BIOS).]
Above the hardware abstraction layer is NTOS, consisting of two layers: the
kernel and the executive. ‘‘Kernel’’ is a confusing term in Windows. It can refer to
all the code that runs in the processor’s kernel mode. It can also refer to the
ntoskrnl.exe file, which contains NTOS, the core of the Windows operating system.
Or it can refer to the kernel layer within NTOS, which is how we use it in this sec-
tion. It is even used to name the user-mode Win32 library that provides the wrap-
pers for the native system calls: kernel32.dll.
In the Windows operating system the kernel layer, illustrated above the execu-
tive layer in Fig. 11-11, provides a set of abstractions for managing the CPU. The
most central abstraction is threads, but the kernel also implements exception han-
dling, traps, and several kinds of interrupts. Creating and destroying the data struc-
tures which support threading is implemented in the executive layer. The kernel
layer is responsible for scheduling and synchronization of threads. Having support
for threads in a separate layer allows the executive layer to be implemented using
the same preemptive multithreading model used to write concurrent code in user
mode, though the synchronization primitives in the executive are much more spe-
cialized.
The kernel’s thread scheduler is responsible for determining which thread is
executing on each CPU in the system. Each thread executes until a timer interrupt
signals that it is time to switch to another thread (quantum expired), or until the
thread needs to wait for something to happen, such as an I/O to complete or for a
lock to be released, or a higher-priority thread becomes runnable and needs the
CPU. When switching from one thread to another, the scheduler runs on the CPU
and ensures that the registers and other hardware state have been saved. The
scheduler then selects another thread to run on the CPU and restores the state that
was previously saved from the last time that thread ran.
If the next thread to be run is in a different address space (i.e., process) than
the thread being switched from, the scheduler must also change address spaces.
The details of the scheduling algorithm itself will be discussed later in this chapter
when we come to processes and threads.
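The scheduling decision just described reduces, at its core, to selecting the highest-priority runnable thread. The sketch below illustrates only that core, with invented structure and field names; the real NTOS scheduler uses per-priority ready queues and considerably more state.

```c
#include <stddef.h>

/* Illustrative thread descriptor; not the real NTOS structures. */
enum state { RUNNABLE, WAITING };

struct thread {
    int        priority;   /* larger = more important */
    enum state state;
};

/* Pick the highest-priority runnable thread, or NULL to idle the CPU. */
static struct thread *pick_next(struct thread *t, size_t n)
{
    struct thread *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (t[i].state == RUNNABLE &&
            (best == NULL || t[i].priority > best->priority))
            best = &t[i];
    return best;
}
```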
In addition to providing a higher-level abstraction of the hardware and han-
dling thread switches, the kernel layer also has another key function: providing
low-level support for two classes of synchronization mechanisms: control objects
and dispatcher objects. Control objects are the data structures that the kernel
layer provides as abstractions to the executive layer for managing the CPU. They
are allocated by the executive but they are manipulated with routines provided by
the kernel layer. Dispatcher objects are the class of ordinary executive objects
that use a common data structure for synchronization.
Control objects include primitive objects for threads, interrupts, timers, syn-
chronization, profiling, and two special objects for implementing DPCs and APCs.
DPC (Deferred Procedure Call) objects are used to reduce the time taken to ex-
ecute ISRs (Interrupt Service Routines) in response to an interrupt from a partic-
ular device. Limiting time spent in ISRs reduces the chance of losing an interrupt.
884 CASE STUDY 2: WINDOWS 8 CHAP. 11
The system hardware assigns a hardware priority level to interrupts. The CPU
also associates a priority level with the work it is performing. The CPU responds
only to interrupts at a higher priority level than the one it is currently using. Normal work, including all user-mode work, runs at priority level 0. Device interrupts occur at priority 3 or higher, and the ISR for a device interrupt normally executes at the same priority level as the interrupt, to keep other less important interrupts from occurring while it is processing a more important one.
If an ISR executes too long, the servicing of lower-priority interrupts will be delayed, perhaps causing data to be lost or slowing the I/O throughput of the system. Multiple ISRs can be in progress at any one time, with each successive ISR due to an interrupt at a higher priority level.
To reduce the time spent processing ISRs, only the critical operations are per-
formed, such as capturing the result of an I/O operation and reinitializing the de-
vice. Further processing of the interrupt is deferred until the CPU priority level is
lowered and no longer blocking the servicing of other interrupts. The DPC object
is used to represent the further work to be done and the ISR calls the kernel layer
to queue the DPC to the list of DPCs for a particular processor. If the DPC is the
first on the list, the kernel registers a special request with the hardware to interrupt
the CPU at priority 2 (which NT calls DISPATCH level). When the last of any ex-
ecuting ISRs completes, the interrupt level of the processor will drop back below 2,
and that will unblock the interrupt for DPC processing. The ISR for the DPC inter-
rupt will process each of the DPC objects that the kernel had queued.
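The ISR/DPC split can be illustrated with a simple queue in C. The names and structures below are invented for illustration (the real kernel uses KDPC objects on per-processor lists); the point is that the ISR only enqueues work, and a separate drain routine runs it later at a lower interrupt level.

```c
#include <stddef.h>

/* Illustrative deferred-procedure-call record. */
struct dpc {
    void      (*routine)(void *arg);
    void       *arg;
    struct dpc *next;
};

static struct dpc *dpc_head, *dpc_tail;   /* per-processor queue */

static void dpc_queue(struct dpc *d)      /* called from the ISR */
{
    d->next = NULL;
    if (dpc_tail) dpc_tail->next = d; else dpc_head = d;
    dpc_tail = d;
    /* the first entry would also request the priority-2 interrupt */
}

static void dpc_drain(void)   /* runs once the IRQL drops below 2 */
{
    while (dpc_head) {
        struct dpc *d = dpc_head;
        dpc_head = d->next;
        if (!dpc_head) dpc_tail = NULL;
        d->routine(d->arg);
    }
}

/* demo routine: tally the arguments of drained DPCs */
static int processed;
static void count_dpc(void *arg) { processed += *(int *)arg; }
```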
The technique of using software interrupts to defer interrupt processing is a
well-established method of reducing ISR latency. UNIX and other systems started
using deferred processing in the 1970s to deal with the slow hardware and limited
buffering of serial connections to terminals. The ISR would deal with fetching
characters from the hardware and queuing them. After all higher-level interrupt
processing was completed, a software interrupt would run a low-priority ISR to do
character processing, such as implementing backspace by sending control charac-
ters to the terminal to erase the last character displayed and move the cursor back-
ward.
A similar example in Windows today is the keyboard device. After a key is
struck, the keyboard ISR reads the key code from a register and then reenables the
keyboard interrupt but does not do further processing of the key immediately. In-
stead, it uses a DPC to queue the processing of the key code until all outstanding
device interrupts have been processed.
Because DPCs run at level 2 they do not keep device ISRs from executing, but
they do prevent any threads from running until all the queued DPCs complete and
the CPU priority level is lowered below 2. Device drivers and the system itself
must take care not to run either ISRs or DPCs for too long. Because threads are
not allowed to execute, ISRs and DPCs can make the system appear sluggish and
produce glitches when playing music by stalling the threads writing the music
buffer to the sound device. Another common use of DPCs is running routines in
response to a timer interrupt. To avoid blocking threads, timer events which need
to run for an extended time should queue requests to the pool of worker threads the
kernel maintains for background activities.
The other special kernel control object is the APC (Asynchronous Procedure
Call) object. APCs are like DPCs in that they defer processing of a system rou-
tine, but unlike DPCs, which operate in the context of particular CPUs, APCs ex-
ecute in the context of a specific thread. When processing a key press, it does not
matter which context the DPC runs in because a DPC is simply another part of in-
terrupt processing, and interrupts only need to manage the physical device and per-
form thread-independent operations such as recording the data in a buffer in kernel
space.
The DPC routine runs in the context of whatever thread happened to be run-
ning when the original interrupt occurred. It calls into the I/O system to report that
the I/O operation has been completed, and the I/O system queues an APC to run in
the context of the thread making the original I/O request, where it can access the
user-mode address space of the thread that will process the input.
At the next convenient time the kernel layer delivers the APC to the thread and
schedules the thread to run. An APC is designed to look like an unexpected proce-
dure call, somewhat similar to signal handlers in UNIX. The kernel-mode APC for
completing I/O executes in the context of the thread that initiated the I/O, but in
kernel mode. This gives the APC access both to the kernel-mode buffer and to all of the user-mode address space belonging to the process containing the thread. When an APC is delivered depends on what the thread is already doing, and even on what type of system it is running on. In a multiprocessor system the thread receiving the APC may begin executing even before the DPC finishes running.
User-mode APCs can also be used to deliver notification of I/O completion in
user mode to the thread that initiated the I/O. User-mode APCs invoke a user-
mode procedure designated by the application, but only when the target thread has
blocked in the kernel and is marked as willing to accept APCs. The kernel inter-
rupts the thread from waiting and returns to user mode, but with the user-mode
stack and registers modified to run the APC dispatch routine in the [Link] system
library. The APC dispatch routine invokes the user-mode routine that the applica-
tion has associated with the I/O operation. Besides specifying user-mode APCs as
a means of executing code when I/Os complete, the Win32 API QueueUserAPC
allows APCs to be used for arbitrary purposes.
The executive layer also uses APCs for operations other than I/O completion.
Because the APC mechanism is carefully designed to deliver APCs only when it is
safe to do so, it can be used to safely terminate threads. If it is not a good time to
terminate the thread, the thread will have declared that it was entering a critical re-
gion and defer deliveries of APCs until it leaves. Kernel threads mark themselves
as entering critical regions to defer APCs when acquiring locks or other resources,
so that they cannot be terminated while still holding the resource.
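The essential difference from DPCs, delivery to a specific thread only at a safe point, can be sketched as a per-thread queue that honors critical regions. All names below are illustrative, not the kernel's actual APC interface.

```c
#include <stddef.h>

/* Illustrative asynchronous-procedure-call record. */
struct apc {
    void (*routine)(void *arg);
    void *arg;
    struct apc *next;
};

struct thread_ctx {
    struct apc *apcs;        /* queue targeted at this one thread */
    int in_critical;         /* set while holding locks, etc. */
};

static void apc_queue(struct thread_ctx *t, struct apc *a)
{
    a->next = t->apcs;
    t->apcs = a;
}

/* Called at safe points (e.g., when the thread blocks in the kernel);
 * returns the number of APCs delivered. */
static int apc_deliver(struct thread_ctx *t)
{
    int n = 0;
    if (t->in_critical) return 0;        /* defer until region is left */
    while (t->apcs) {
        struct apc *a = t->apcs;
        t->apcs = a->next;
        a->routine(a->arg);
        n++;
    }
    return n;
}

/* demo routine recording that delivery happened */
static int fired;
static void demo_apc(void *arg) { (void)arg; fired++; }
```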
Figure 11-13. The dispatcher header data structure embedded in many executive objects (dispatcher objects).
locking primitives, like mutexes. When a thread that is waiting for a lock begins
running again, the first thing it does is to retry acquiring the lock. If only one
thread can hold the lock at a time, all the other threads made runnable might im-
mediately block, incurring lots of unnecessary context switching. The difference
between dispatcher objects using synchronization vs. notification is a flag in the
dispatcher header structure.
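The effect of that flag can be illustrated in C: signaling an object with the synchronization flag set releases exactly one waiter and auto-resets, while a notification object releases all waiters and stays signaled. The structure below is invented for illustration, and waiters are modeled as a simple count.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative dispatcher object; waiters modeled as a count. */
struct dispatcher {
    bool   signaled;
    bool   sync;        /* the flag in the dispatcher header */
    size_t waiters;
};

/* Signal the object; returns how many waiters were released. */
static size_t dispatcher_signal(struct dispatcher *d)
{
    d->signaled = true;
    if (d->sync && d->waiters > 0) {      /* release exactly one waiter */
        d->waiters--;
        d->signaled = false;              /* auto-reset */
        return 1;
    }
    size_t woken = d->waiters;            /* notification: release all */
    d->waiters = 0;
    return woken;
}
```

The synchronization behavior avoids exactly the thundering-herd problem described above: only one waiter is made runnable, so the others do not wake up merely to block again.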
As a little aside, mutexes in Windows are called ‘‘mutants’’ in the code be-
cause they were required to implement the OS/2 semantics of not automatically
unlocking themselves when a thread holding one exited, something Cutler consid-
ered bizarre.
As shown in Fig. 11-11, below the kernel layer of NTOS there is the executive.
The executive layer is written in C, is mostly architecture independent (the memo-
ry manager being a notable exception), and has been ported with only modest
effort to new processors (MIPS, x86, PowerPC, Alpha, IA64, x64, and ARM). The
executive contains a number of different components, all of which run using the
control abstractions provided by the kernel layer.
Each component is divided into internal and external data structures and inter-
faces. The internal aspects of each component are hidden and used only within the
component itself, while the external aspects are available to all the other compo-
nents within the executive. A subset of the external interfaces are exported from
the [Link] executable and device drivers can link to them as if the executive
were a library. Microsoft calls many of the executive components ‘‘managers,’’ because each is in charge of managing some aspect of the operating system’s services, such as I/O, memory, processes, objects, etc.
As with most operating systems, much of the functionality in the Windows ex-
ecutive is like library code, except that it runs in kernel mode so its data structures
can be shared and protected from access by user-mode code, and so it can access
kernel-mode state, such as the MMU control registers. But otherwise the executive
is simply executing operating system functions on behalf of its caller, and thus runs
in the thread of its caller.
When any of the executive functions block waiting to synchronize with other
threads, the user-mode thread is blocked, too. This makes sense when working on
behalf of a particular user-mode thread, but it can be unfair when doing work relat-
ed to common housekeeping tasks. To avoid hijacking the current thread when the
executive determines that some housekeeping is needed, a number of kernel-mode
threads are created when the system boots and dedicated to specific tasks, such as
making sure that modified pages get written to disk.
For predictable, low-frequency tasks, there is a thread that runs once a second
and has a laundry list of items to handle. For less predictable work there is the
pool of high-priority worker threads mentioned earlier which can be used to run
bounded tasks by queuing a request and signaling the synchronization event that
the worker threads are waiting on.
The object manager manages most of the interesting kernel-mode objects
used in the executive layer. These include processes, threads, files, semaphores,
I/O devices and drivers, timers, and many others. As described previously, kernel-
mode objects are really just data structures allocated and used by the kernel. In
Windows, kernel data structures have enough in common that it is very useful to
manage many of them in a unified facility.
The facilities provided by the object manager include managing the allocation
and freeing of memory for objects, quota accounting, supporting access to objects
using handles, maintaining reference counts for kernel-mode pointer references as
well as handle references, giving objects names in the NT namespace, and provid-
ing an extensible mechanism for managing the lifecycle for each object. Kernel
data structures which need some of these facilities are managed by the object man-
ager.
Object-manager objects each have a type which is used to specify exactly how
the lifecycle of objects of that type is to be managed. These are not types in the
object-oriented sense, but are simply a collection of parameters specified when the
object type is created. To create a new type, an executive component calls an ob-
ject-manager API to create a new type. Objects are so central to the functioning of
Windows that the object manager will be discussed in more detail in the next sec-
tion.
The I/O manager provides the framework for implementing I/O device drivers
and provides a number of executive services specific to configuring, accessing, and
performing operations on devices. In Windows, device drivers not only manage
physical devices but they also provide extensibility to the operating system. Many
functions that are compiled into the kernel on other systems are dynamically load-
ed and linked by the kernel on Windows, including network protocol stacks and
file systems.
Recent versions of Windows have a lot more support for running device drivers
in user mode, and this is the preferred model for new device drivers. There are
hundreds of thousands of different device drivers for Windows working with more
than a million distinct devices. This represents a lot of code to get correct. It is
much better if bugs cause a device to become inaccessible by crashing in a user-
mode process rather than causing the system to crash. Bugs in kernel-mode device
drivers are the major source of the dreaded BSOD (Blue Screen Of Death) where
Windows detects a fatal error within kernel mode and shuts down or reboots the
system. BSODs are comparable to kernel panics on UNIX systems.
In essence, Microsoft has now officially recognized what researchers in the
area of microkernels such as MINIX 3 and L4 have known for years: the more
code there is in the kernel, the more bugs there are in the kernel. Since device driv-
ers make up something in the vicinity of 70% of the code in the kernel, the more
drivers that can be moved into user-mode processes, where a bug will only trigger
the failure of a single driver (rather than bringing down the entire system), the bet-
ter. The trend of moving code from the kernel to user-mode processes is expected
to accelerate in the coming years.
The I/O manager also includes the plug-and-play and device power-man-
agement facilities. Plug-and-play comes into action when new devices are detect-
ed on the system. The plug-and-play subcomponent is first notified. It works with
a service, the user-mode plug-and-play manager, to find the appropriate device
driver and load it into the system. Getting the right one is not always easy and
sometimes depends on sophisticated matching of the specific hardware device ver-
sion to a particular version of the drivers. Sometimes a single device supports a
standard interface which is supported by multiple different drivers, written by dif-
ferent companies.
We will study I/O further in Sec. 11.7 and the most important NT file system,
NTFS, in Sec. 11.8.
Device power management reduces power consumption when possible, ex-
tending battery life on notebooks, and saving energy on desktops and servers. Get-
ting power management correct can be challenging, as there are many subtle
dependencies between devices and the buses that connect them to the CPU and
memory. Power consumption is not affected just by what devices are powered-on,
but also by the clock rate of the CPU, which is also controlled by the device power
manager. We will take a more in-depth look at power management in Sec. 11.9.
The process manager manages the creation and termination of processes and
threads, including establishing the policies and parameters which govern them.
But the operational aspects of threads are determined by the kernel layer, which
controls scheduling and synchronization of threads, as well as their interaction
with the control objects, like APCs. Processes contain threads, an address space,
and a handle table containing the handles the process can use to refer to kernel-
mode objects. Processes also include information needed by the scheduler for
switching between address spaces and managing process-specific hardware infor-
mation (such as segment descriptors). We will study process and thread man-
agement in Sec. 11.4.
The executive memory manager implements the demand-paged virtual mem-
ory architecture. It manages the mapping of virtual pages onto physical page
frames, the management of the available physical frames, and management of the
pagefile on disk used to back private instances of virtual pages that are no longer
loaded in memory. The memory manager also provides special facilities for large
server applications such as databases and programming language run-time compo-
nents such as garbage collectors. We will study memory management later in this
chapter, in Sec. 11.5.
The cache manager optimizes the performance of I/O to the file system by
maintaining a cache of file-system pages in the kernel virtual address space. The
cache manager uses virtually addressed caching, that is, organizing cached pages
in terms of their location in their files. This differs from physical block caching, as
in UNIX, where the system maintains a cache of the physically addressed blocks of
the raw disk volume.
Cache management is implemented using mapped files. The actual caching is
performed by the memory manager. The cache manager need be concerned only
with deciding what parts of what files to cache, ensuring that cached data is
flushed to disk in a timely fashion, and managing the kernel virtual addresses used
to map the cached file pages. If a page needed for I/O to a file is not available in
the cache, the page will be faulted in using the memory manager. We will study
the cache manager in Sec. 11.6.
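The distinction between virtually addressed and physically addressed caching comes down to the lookup key. The toy direct-mapped cache below, with invented names and sizes, keys pages by (file, offset) in the Windows style; a UNIX-style block cache would instead key by (device, block number).

```c
#include <stdint.h>
#include <string.h>

#define NSLOTS 64

/* Illustrative cached page, keyed by its position in a file. */
struct cached_page {
    int      valid;
    int      file_id;      /* which file */
    uint64_t offset;       /* page-aligned position within that file */
    char     data[4096];
};

static struct cached_page cache[NSLOTS];

static unsigned slot_for(int file_id, uint64_t offset)
{
    return (unsigned)((file_id * 31 + (offset >> 12)) % NSLOTS);
}

static struct cached_page *cache_lookup(int file_id, uint64_t offset)
{
    struct cached_page *p = &cache[slot_for(file_id, offset)];
    if (p->valid && p->file_id == file_id && p->offset == offset)
        return p;
    return NULL;           /* miss: the memory manager would fault it in */
}

static void cache_insert(int file_id, uint64_t offset, const char *src)
{
    struct cached_page *p = &cache[slot_for(file_id, offset)];
    p->valid = 1;
    p->file_id = file_id;
    p->offset = offset;
    memcpy(p->data, src, sizeof p->data);
}
```

One consequence of file-keyed caching is that the cache works regardless of where (or whether) the file's blocks currently reside on disk, which fits the mapped-file implementation described next.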
The security reference monitor enforces Windows’ elaborate security mech-
anisms, which support the international standards for computer security called
Common Criteria, an evolution of United States Department of Defense Orange
Book security requirements. These standards specify a large number of rules that a
conforming system must meet, such as authenticated login, auditing, zeroing of allocated memory, and many more. One rule requires that all access checks be implemented by a single module within the system. In Windows, this module is the
security reference monitor in the kernel. We will study the security system in more
detail in Sec. 11.10.
The executive contains a number of other components that we will briefly de-
scribe. The configuration manager is the executive component which imple-
ments the registry, as described earlier. The registry contains configuration data for
the system in file-system files called hives. The most critical hive is the SYSTEM
hive which is loaded into memory at boot time. Only after the executive layer has
successfully initialized its key components, including the I/O drivers that talk to
the system disk, is the in-memory copy of the hive reassociated with the copy in
the file system. Thus, if something bad happens while trying to boot the system,
the on-disk copy is much less likely to be corrupted.
The LPC component provides a highly efficient interprocess communication mechanism used between processes running on the same system. It is one of the data transports used by the standards-based remote procedure call facility to implement the client/server style of computing. RPC also uses named pipes and TCP/IP as transports.
LPC was substantially enhanced in Windows 8 (it is now called ALPC, for
Advanced LPC) to provide support for new features in RPC, including RPC from
kernel mode components, like drivers. LPC was a critical component in the origi-
nal design of NT because it is used by the subsystem layer to implement communi-
cation between library stub routines that run in each process and the subsystem
process which implements the facilities common to a particular operating system
personality, such as Win32 or POSIX.
Windows 8 implemented a publish/subscribe service called WNF (Windows
Notification Facility). WNF notifications are based on changes to an instance of
WNF state data. A publisher declares an instance of state data (up to 4 KB) and
tells the operating system how long to maintain it (e.g., until the next reboot or
permanently). A publisher atomically updates the state as appropriate. Subscri-
bers can arrange to run code whenever an instance of state data is modified by a
publisher. Because the WNF state instances contain a fixed amount of preallocated
data, there is no queuing of data as in message-based IPC—with all the attendant
resource-management problems. Subscribers are guaranteed only that they can see
the latest version of a state instance.
This state-based approach gives WNF its principal advantage over other IPC
mechanisms: publishers and subscribers are decoupled and can start and stop inde-
pendently of each other. Publishers need not execute at boot time just to initialize
their state instances, as those can be persisted by the operating system across
reboots. Subscribers generally need not be concerned about past values of state
instances when they start running, as all they should need to know about the state’s
history is encapsulated in the current state. In scenarios where past state values
cannot be reasonably encapsulated, the current state can provide metadata for man-
aging historical state, say, in a file or in a persisted section object used as a circular
buffer. WNF is part of the native NT APIs and is not (yet) exposed via Win32 in-
terfaces. But it is extensively used internally by the system to implement Win32
and WinRT APIs.
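The state-based model can be sketched in C: a publish simply overwrites a fixed-size slot and bumps a sequence number, and a subscriber compares sequence numbers instead of draining a message queue. Sizes and names are illustrative, not the real WNF API.

```c
#include <stdint.h>
#include <string.h>

#define WNF_MAX 4096                     /* up to 4 KB of state data */

/* Illustrative WNF-style state instance. */
struct wnf_state {
    uint64_t seq;                        /* bumped on every publish */
    size_t   len;
    unsigned char data[WNF_MAX];
};

static void wnf_publish(struct wnf_state *s, const void *buf, size_t len)
{
    if (len > WNF_MAX) len = WNF_MAX;
    memcpy(s->data, buf, len);           /* old value simply replaced */
    s->len = len;
    s->seq++;
}

/* A subscriber remembers the last sequence it saw and asks
 * "has anything changed since?" */
static int wnf_changed(const struct wnf_state *s, uint64_t *last_seen)
{
    if (s->seq == *last_seen) return 0;
    *last_seen = s->seq;
    return 1;
}
```

Note that intermediate values are lost by design: a subscriber that was away for two publishes observes only the latest state, which is exactly the "no queuing" property described above.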
In Windows NT 4.0, much of the code related to the Win32 graphical interface
was moved into the kernel because the then-current hardware could not provide the
required performance. This code previously resided in the [Link] subsystem
process which implemented the Win32 interfaces. The kernel-based GUI code
resides in a special kernel-driver, [Link]. This change was expected to im-
prove Win32 performance because the extra user-mode/kernel-mode transitions
and the cost of switching address spaces to implement communication via LPC
was eliminated. But it has not been as successful as expected because the re-
quirements on code running in the kernel are very strict, and the additional over-
head of running in kernel-mode offsets some of the gains from reducing switching
costs.
The final part of Fig. 11-11 consists of the device drivers. Device drivers in
Windows are dynamic link libraries which are loaded by the NTOS executive.
Though they are primarily used to implement the drivers for specific hardware,
such as physical devices and I/O buses, the device-driver mechanism is also used
as the general extensibility mechanism for kernel mode. As described above,
much of the Win32 subsystem is loaded as a driver.
The I/O manager organizes a data flow path for each instance of a device, as
shown in Fig. 11-14. This path is called a device stack and consists of private
instances of kernel device objects allocated for the path. Each device object in the
device stack is linked to a particular driver object, which contains the table of
routines to use for the I/O request packets that flow through the device stack. In
some cases the devices in the stack represent drivers whose sole purpose is to filter
I/O operations aimed at a particular device, bus, or network driver. Filtering is
used for a number of reasons. Sometimes preprocessing or postprocessing I/O op-
erations results in a cleaner architecture, while other times it is just pragmatic be-
cause the sources or rights to modify a driver are not available and so filtering is
used to work around the inability to modify those drivers. Filters can also imple-
ment completely new functionality, such as turning disks into partitions or multiple
disks into RAID volumes.
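A device stack can be sketched as a linked list of device objects, each dispatching through its driver's routine and passing the request to the object below it. The structures below are invented for illustration and omit the real I/O request packet machinery.

```c
#include <stddef.h>

/* Illustrative request and device objects; not the real IRP types. */
struct irp { int op; int status; };

struct device {
    int (*dispatch)(struct device *self, struct irp *irp);
    struct device *lower;                 /* next device down the stack */
};

static int pass_down(struct device *d, struct irp *irp)
{
    return d->lower->dispatch(d->lower, irp);
}

static int disk_dispatch(struct device *d, struct irp *irp)
{
    (void)d;
    irp->status = 0;                      /* bottom of stack: do the I/O */
    return 0;
}

static int filter_dispatch(struct device *d, struct irp *irp)
{
    /* a filter (e.g., a virus scanner) could inspect irp here
     * before forwarding it down the stack */
    irp->op |= 0x100;                     /* mark that the filter saw it */
    return pass_down(d, irp);
}
```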
Figure 11-14. Simplified depiction of device stacks for two NTFS file volumes. The I/O request packet is passed down the stack; the appropriate routines from the associated drivers are called at each level. The device stacks themselves consist of device objects allocated specifically to each stack.
The file systems are loaded as device drivers. Each instance of a volume for a
file system has a device object created as part of the device stack for that volume.
This device object will be linked to the driver object for the file system appropriate
to the volume’s formatting. Special filter drivers, called file-system filter drivers,
can insert device objects before the file-system device object to apply functionality
to the I/O requests being sent to each volume, such as inspecting data read or writ-
ten for viruses.
kernel, and executive layers, link in the driver images, and access/update configuration data in the SYSTEM hive. After all the kernel-mode components are initialized, the first user-mode process is created for running the [Link] program (which is like /etc/init in UNIX systems).
Recent versions of Windows provide support for improving the security of the
system at boot time. Many newer PCs contain a TPM (Trusted Platform Module), which is a chip on the parentboard. The chip is a secure cryptographic processor which protects secrets, such as encryption/decryption keys. The system’s TPM can be used to protect system keys, such as those used by BitLocker to encrypt the disk. Protected keys are not revealed to the operating system until after the TPM has verified that an attacker has not tampered with them. It can also provide other cryptographic functions, such as attesting to remote systems that the operating system on the local system has not been compromised.
The Windows boot programs have logic to deal with common problems users
encounter when booting the system fails. Sometimes installation of a bad device
driver, or running a program like regedit (which can corrupt the SYSTEM hive),
will prevent the system from booting normally. There is support for ignoring re-
cent changes and booting to the last known good configuration of the system.
Other boot options include safe-boot, which turns off many optional drivers, and
the recovery console, which fires up a [Link] command-line window, providing
an experience similar to single-user mode in UNIX.
Another common problem for users has been that occasionally some Windows
systems appear to be very flaky, with frequent (seemingly random) crashes of both
the system and applications. Data taken from Microsoft’s Online Crash Analysis
program provided evidence that many of these crashes were due to bad physical
memory, so the boot process in Windows provides the option of running an exten-
sive memory diagnostic. Perhaps future PC hardware will commonly support ECC
(or maybe parity) for memory, but most of the desktop, notebook, and handheld
systems today are vulnerable to even single-bit errors in the tens of billions of
memory bits they contain.
The object manager is probably the single most important component in the
Windows executive, which is why we have already introduced many of its con-
cepts. As described earlier, it provides a uniform and consistent interface for man-
aging system resources and data structures, such as open files, processes, threads,
memory sections, timers, devices, drivers, and semaphores. Even more specialized
objects representing things like kernel transactions, profiles, security tokens, and
Win32 desktops are managed by the object manager. Device objects link together
the descriptions of the I/O system, including providing the link between the NT
namespace and file-system volumes. The configuration manager uses an object of
type key to link in the registry hives. The object manager itself has objects it uses
Figure 11-15. The structure of an executive object. The object header contains the object name, the directory in which the object lives, security information (which can use the object), quota charges (cost to use the object), the list of processes with handles, the reference counts, and a pointer to the type object; the object-specific data follows. The type object records the type name, access types, access rights, quota charges, whether the object is synchronizable and pageable, and the open, close, delete, query name, parse, and security methods.
Figure 11-16. Handle table data structures for a minimal table using a single
page for up to 512 handles.
Figure 11-17 shows a handle table with two extra levels of indirection, the
maximum supported. It is sometimes convenient for code executing in kernel
mode to be able to use handles rather than referenced pointers. These are called
kernel handles and are specially encoded so that they can be distinguished from
user-mode handles. Kernel handles are kept in the system processes’ handle table
and cannot be accessed from user mode. Just as most of the kernel virtual address
space is shared across all processes, the system handle table is shared by all kernel
components, no matter what the current user-mode process is.
Users can create new objects or open existing objects by making Win32 calls
such as CreateSemaphore or OpenSemaphore. These are calls to library proce-
dures that ultimately result in the appropriate system calls being made. The result
of any successful call that creates or opens an object is a 64-bit handle-table entry
that is stored in the process’ private handle table in kernel memory. The 32-bit
index of the handle’s logical position in the table is returned to the user to use on
subsequent calls. The 64-bit handle-table entry in the kernel contains two 32-bit
words. One word contains a 29-bit pointer to the object’s header. The low-order 3
bits are used as flags (e.g., whether the handle is inherited by processes it creates).
These 3 bits are masked off before the pointer is followed. The other word con-
tains a 32-bit rights mask. It is needed because permissions checking is done only
at the time the object is created or opened. If a process has only read permission to
an object, all the other rights bits in the mask will be 0s, giving the operating sys-
tem the ability to reject any operation on the object other than reads.
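The entry layout just described can be illustrated with some bit manipulation in C. The function and field names are invented; only the 29-bit-pointer-plus-3-flag-bits word and the rights-mask word come from the text.

```c
#include <stdint.h>

/* Illustrative 64-bit handle-table entry: two 32-bit words. */
struct handle_entry {
    uint32_t object_and_flags;   /* 29-bit pointer | 3 low flag bits */
    uint32_t rights;             /* access mask fixed at open time */
};

#define FLAG_MASK   0x7u
#define RIGHT_READ  0x1u
#define RIGHT_WRITE 0x2u

/* Recover the object-header pointer: mask the 3 flag bits off first. */
static uint32_t entry_object(struct handle_entry e)
{
    return e.object_and_flags & ~FLAG_MASK;
}

/* Check a right against the mask; the real permission check happened
 * only when the handle was created or opened. */
static int entry_allows(struct handle_entry e, uint32_t right)
{
    return (e.rights & right) == right;
}
```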
Processes can share objects by having one process duplicate a handle to the ob-
ject into the others. But this requires that the duplicating process have handles to
the other processes, and is thus impractical in many situations, such as when the
processes sharing an object are unrelated, or are protected from each other. In
other cases it is important that objects persist even when they are not being used by
any process, such as device objects representing physical devices, or mounted vol-
umes, or the objects used to implement the object manager and the NT namespace
itself. To address general sharing and persistence requirements, the object man-
ager allows arbitrary objects to be given names in the NT namespace when they are
created. However, it is up to the executive component that manipulates objects of a
particular type to provide interfaces that support use of the object manager’s na-
ming facilities.
The NT namespace is hierarchical, with the object manager implementing di-
rectories and symbolic links. The namespace is also extensible, allowing any ob-
ject type to specify extensions of the namespace by specifying a Parse routine.
The Parse routine is one of the procedures that can be supplied for each object type
when it is created, as shown in Fig. 11-18.
The Open procedure is rarely used because the default object-manager behav-
ior is usually what is needed and so the procedure is specified as NULL for almost
all object types.
SEC. 11.3 SYSTEM STRUCTURE 899
Figure 11-18. Object procedures supplied when specifying a new object type.
The Close and Delete procedures represent different phases of being done with
an object. When the last handle for an object is closed, there may be actions neces-
sary to clean up the state and these are performed by the Close procedure. When
the final pointer reference is removed from the object, the Delete procedure is call-
ed so that the object can be prepared to be deleted and have its memory reused.
With file objects, both of these procedures are implemented as callbacks into the
I/O manager, which is the component that declared the file object type. The ob-
ject-manager operations result in I/O operations that are sent down the device stack
associated with the file object; the file system does most of the work.
The Parse procedure is used to open or create objects, like files and registry
keys, that extend the NT namespace. When the object manager is attempting to
open an object by name and encounters a leaf node in the part of the namespace it
manages, it checks to see if the type for the leaf-node object has specified a Parse
procedure. If so, it invokes the procedure, passing it any unused part of the path
name. Again using file objects as an example, the leaf node is a device object
representing a particular file-system volume. The Parse procedure is implemented
by the I/O manager, and results in an I/O operation to the file system to fill in a file
object to refer to an open instance of the file that the path name refers to on the
volume. We will explore this particular example step-by-step below.
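The hand-off from the object manager's name lookup to a type's Parse routine can be modeled in miniature. Everything here (the structures, fake_parse_device, the flat one-level directory) is an invented sketch, not the real object-manager code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of the object-manager lookup handing the unused part of a
 * path name to the leaf object's Parse routine. */
typedef struct object {
    const char *name;
    /* Non-NULL only for types (like Device) that extend the namespace. */
    const char *(*parse)(struct object *self, const char *remaining);
} object_t;

/* Stand-in for IopParseDevice: the real I/O manager would build an IRP
 * and send it down the device stack; here we just report what it got. */
static const char *fake_parse_device(object_t *self, const char *rest) {
    (void)self;
    return rest;   /* the part of the path the file system must resolve */
}

/* Resolve a path against a one-level directory of leaf objects. */
static const char *lookup(object_t *dir, size_t n, const char *path) {
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(dir[i].name);
        if (strncmp(path, dir[i].name, len) == 0 && path[len] == '\\') {
            /* Leaf found: pass the unused remainder to Parse, if any. */
            if (dir[i].parse)
                return dir[i].parse(&dir[i], path + len + 1);
            return NULL;   /* leaf with no Parse routine */
        }
    }
    return NULL;
}
```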
The QueryName procedure is used to look up the name associated with an ob-
ject. The Security procedure is used to get, set, or delete the security descriptors
on an object. For most object types this procedure is supplied as a standard entry
point in the executive’s security reference monitor component.
Note that the procedures in Fig. 11-18 do not perform the most useful opera-
tions for each type of object, such as read or write on files (or down and up on
semaphores). Rather, the object manager procedures supply the functions needed
to correctly set up access to objects and then clean up when the system is finished
with them. The objects are made useful by the APIs that operate on the data struc-
tures the objects contain. System calls, like NtReadFile and NtWriteFile, use the
process’ handle table created by the object manager to translate a handle into a ref-
erenced pointer on the underlying object, such as a file object, which contains the
data that is needed to implement the system calls.
Apart from the object-type callbacks, the object manager also provides a set of
generic object routines for operations like creating objects and object types, dupli-
cating handles, getting a referenced pointer from a handle or name, adding and
subtracting reference counts to the object header, and NtClose (the generic function
that closes all types of handles).
Although the object namespace is crucial to the entire operation of the system,
few people know that it even exists because it is not visible to users without special
viewing tools. One such viewing tool is winobj, available for free at the URL
[Link]/technet/sysinternals. When run, this tool depicts an object
namespace that typically contains the object directories listed in Fig. 11-19 as well
as a few others.
Directory Contents
\?? Starting place for looking up MS-DOS devices like C:
\DosDevices Official name of \??, but really just a symbolic link to \??
\Device All discovered I/O devices
\Driver Objects corresponding to each loaded device driver
\ObjectTypes The type objects such as those listed in Fig. 11-21
\Windows Objects for sending messages to all the Win32 GUI windows
\BaseNamedObjects User-created Win32 objects such as semaphores, mutexes, etc.
\Arcname Partition names discovered by the boot loader
\NLS National Language Support objects
\FileSystem File-system driver objects and file system recognizer objects
\Security Objects belonging to the security system
\KnownDLLs Key shared libraries that are opened early and held open
Figure 11-19. Some typical directories in the object namespace.
The strangely named directory \?? contains the names of all the MS-DOS-
style device names, such as A: for the floppy disk and C: for the first hard disk.
These names are actually symbolic links to the directory \Device where the device
objects live. The name \?? was chosen to make it alphabetically first so as to
speed up lookup of all path names beginning with a drive letter. The contents of
the other object directories should be self-explanatory.
As described above, the object manager keeps a separate handle count in every
object. This count is never larger than the referenced pointer count because each
valid handle has a referenced pointer to the object in its handle-table entry. The
reason for the separate handle count is that many types of objects may need to have
their state cleaned up when the last user-mode reference disappears, even though
they are not yet ready to have their memory deleted.
One example is file objects, which represent an instance of an opened file. In
Windows, files can be opened for exclusive access. When the last handle for a file
object is closed it is important to delete the exclusive access at that point rather
than wait for any incidental kernel references to eventually go away (e.g., after the
last flush of data from memory). Otherwise closing and reopening a file from user
mode may not work as expected because the file still appears to be in use.
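The two-counter scheme can be sketched as follows; the structure and the single exclusive flag are invented for illustration:

```c
#include <assert.h>

/* Illustrative sketch of the two counters kept in each object header.
 * Every valid handle also holds one pointer reference, so the handle
 * count can never exceed the reference count. */
typedef struct {
    int handles;     /* user-mode handles                   */
    int refs;        /* referenced pointers (>= handles)    */
    int exclusive;   /* e.g., a file opened for exclusive access */
    int freed;
} object_t;

static void deref(object_t *o) {
    if (--o->refs == 0)
        o->freed = 1;            /* Delete: memory can be reused now   */
}

static void close_handle(object_t *o) {
    if (--o->handles == 0)
        o->exclusive = 0;        /* Close: drop exclusive access early */
    deref(o);                    /* each handle also held a pointer ref */
}
```

The point of the split is visible in the test below: the last close releases exclusive access immediately, while a lingering kernel reference keeps the memory alive a little longer.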
Though the object manager has comprehensive mechanisms for managing ob-
ject lifetimes within the kernel, neither the NT APIs nor the Win32 APIs provide a
reference mechanism for dealing with the use of handles across multiple concur-
rent threads in user mode. Thus, many multithreaded applications have race condi-
tions and bugs where they will close a handle in one thread before they are finished
with it in another. Or they may close a handle multiple times, or close a handle
that another thread is still using and reopen it to refer to a different object.
Perhaps the Windows APIs should have been designed to require a close API
per object type rather than the single generic NtClose operation. That would have
at least reduced the frequency of bugs due to user-mode threads closing the wrong
handles. Another solution might be to embed a sequence field in each handle in
addition to the index into the handle table.
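That sequence-field idea is the familiar generation-counter pattern. A minimal sketch, with invented names and an arbitrary 16-bit index/16-bit sequence split:

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS 4

/* A handle packs a table index with a per-slot sequence number that is
 * bumped on every close, so a stale handle no longer matches even if
 * the slot itself has been reused for a new object. */
static uint16_t seq[SLOTS];
static int in_use[SLOTS];

static uint32_t handle_open(int slot) {
    in_use[slot] = 1;
    return ((uint32_t)seq[slot] << 16) | (uint32_t)slot;
}

static int handle_close(uint32_t h) {
    int slot = (int)(h & 0xFFFFu);
    if (!in_use[slot] || (uint16_t)(h >> 16) != seq[slot])
        return -1;               /* stale or double close: detected */
    in_use[slot] = 0;
    seq[slot]++;                 /* invalidate all outstanding copies */
    return 0;
}
```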
To help application writers find problems like these in their programs, Win-
dows has an application verifier that software developers can download from
Microsoft. Similar to the verifier for drivers we will describe in Sec. 11.7, the ap-
plication verifier does extensive rules checking to help programmers find bugs that
might not be found by ordinary testing. It can also turn on a FIFO ordering for the
handle free list, so that handles are not reused immediately (i.e., turns off the bet-
ter-performing LIFO ordering normally used for handle tables). Keeping handles
from being reused quickly transforms situations where an operation uses the wrong
handle into use of a closed handle, which is easy to detect.
The device object is one of the most important and versatile kernel-mode
objects in the executive. The type is specified by the I/O manager, which, along
with the device drivers, is the primary user of device objects. Device objects are
closely related to drivers, and each device object usually has a link to a specific
driver object, which describes how to access the I/O processing routines for the
driver corresponding to the device.
Device objects represent hardware devices, interfaces, and buses, as well as
logical disk partitions, disk volumes, and even file systems and kernel extensions
like antivirus filters. Many device drivers are given names, so they can be accessed
without having to open handles to instances of the devices, as in UNIX. We will
use device objects to illustrate how the Parse procedure is used, as illustrated in
Fig. 11-20:
Figure 11-20. I/O and object manager steps for creating/opening a file and get-
ting back a file handle.
1. When an executive component, such as the I/O manager implement-
ing the NtCreateFile system call, asks the object manager to open a
path name like \??\C:\foo\bar, the object manager begins walking
its own part of the namespace, resolving directories and symbolic
links until \??\C: leads it to a device object.

2. That device object is a leaf node in the part of the namespace the
object manager itself manages, and its type (Device) has supplied a
Parse procedure.
3. The object manager then calls the Parse procedure for this object
type, which happens to be IopParseDevice implemented by the I/O
manager. It passes not only a pointer to the device object it found (for
C:), but also the remaining string \foo\bar.
4. The I/O manager will create an IRP (I/O Request Packet), allocate a
file object, and send the request to the stack of I/O devices determined
by the device object found by the object manager.
5. The IRP is passed down the I/O stack until it reaches a device object
representing the file-system instance for C:. At each stage, control is
passed to an entry point into the driver object associated with the de-
vice object at that level. The entry point used here is for CREATE
operations, since the request is to create or open a file named
\foo\bar on the volume.
6. The device objects encountered as the IRP heads toward the file sys-
tem represent file-system filter drivers, which may modify the I/O op-
eration before it reaches the file-system device object. Typically
these intermediate devices represent system extensions like antivirus
filters.
7. The file-system device object has a link to the file-system driver ob-
ject, say NTFS. So, the driver object contains the address of the
CREATE operation within NTFS.
8. NTFS will fill in the file object and return it to the I/O manager,
which returns back up through all the devices on the stack until Iop-
ParseDevice returns to the object manager (see Sec. 11.8).
9. When IopParseDevice returns to the object manager with the ini-
tialized file object, the object manager creates an entry for the file
object in the process' handle table, as described earlier.
10. The final step is to return back to the user-mode caller, which in this
example is the Win32 API CreateFile, which will return the handle to
the application.
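Steps 4 through 8 can be caricatured as a request handed down a linked stack of device objects, each dispatching through its driver's CREATE entry point. The types and names here are invented; a real IRP carries far more state:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of an IRP traveling down a device stack. */
typedef struct irp { const char *path; int scanned; int handled; } irp_t;

typedef struct device device_t;
typedef int (*create_fn)(device_t *dev, irp_t *irp);
struct device { create_fn create; device_t *lower; };

/* A filter driver (say, antivirus) inspects the request, then passes
 * it on to the next device object below it in the stack. */
static int filter_create(device_t *dev, irp_t *irp) {
    irp->scanned = 1;
    return dev->lower->create(dev->lower, irp);
}

/* The file-system driver (say, NTFS) at the bottom of the stack fills
 * in the file object and completes the request. */
static int fs_create(device_t *dev, irp_t *irp) {
    (void)dev;
    irp->handled = 1;
    return 0;
}
```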
Type Description
Process User process
Thread Thread within a process
Semaphore Counting semaphore used for interprocess synchronization
Mutex Binary semaphore used to enter a critical region
Event Synchronization object with persistent state (signaled/not)
ALPC port Mechanism for interprocess message passing
Timer Object allowing a thread to sleep for a fixed time interval
Queue Object used for completion notification on asynchronous I/O
Open file Object associated with an open file
Access token Security descriptor for some object
Profile Data structure used for profiling CPU usage
Section Object used for representing mappable files
Key Registry key, used to attach registry to object-manager namespace
Object directory Directory for grouping objects within the object manager
Symbolic link Refers to another object manager object by path name
Device I/O device object for a physical device, bus, driver, or volume instance
Device driver Each loaded device driver has its own object
Figure 11-21. Some common executive object types managed by the object
manager.
Timers provide a way to block for a specific time interval. Queues (known internally as
KQUEUES) are used to notify threads that a previously started asynchronous I/O
operation has completed or that a port has a message waiting. Queues are designed
to manage the level of concurrency in an application, and are also used in high-per-
formance multiprocessor applications, like SQL.
Open file objects are created when a file is opened. Files that are not opened
do not have objects managed by the object manager. Access tokens are security
objects. They identify a user and tell what special privileges the user has, if any.
Profiles are structures used for storing periodic samples of the program counter of
a running thread to see where the program is spending its time.
Sections are used to represent memory objects that applications can ask the
memory manager to map into their address space. They record the section of the
file (or page file) that represents the pages of the memory object when they are on
disk. Keys represent the mount point for the registry namespace on the object
manager namespace. There is usually only one key object, named \REGISTRY,
which connects the names of the registry keys and values to the NT namespace.
Object directories and symbolic links are entirely local to the part of the NT
namespace managed by the object manager. They are similar to their file system
counterparts: directories allow related objects to be collected together. Symbolic
links allow a name in one part of the object namespace to refer to an object in a
different part of the object namespace.
Each device known to the operating system has one or more device objects that
contain information about it and are used to refer to the device by the system.
Finally, each device driver that has been loaded has a driver object in the object
space. The driver objects are shared by all the device objects that represent
instances of the devices controlled by those drivers.
Other objects (not shown) have more specialized purposes, such as interacting
with kernel transactions, or the Win32 thread pool’s worker thread factory.
Going back to Fig. 11-4, we see that the Windows operating system consists of
components in kernel mode and components in user mode. We have now com-
pleted our overview of the kernel-mode components; so it is time to look at the
user-mode components, of which three kinds are particularly important to Win-
dows: environment subsystems, DLLs, and service processes.
We have already described the Windows subsystem model; we will not go into
more detail now other than to mention that in the original design of NT, subsys-
tems were seen as a way of supporting multiple operating system personalities with
the same underlying software running in kernel mode. Perhaps this was an attempt
to avoid having operating systems compete for the same platform, as VMS and
Berkeley UNIX did on DEC’s VAX. Or maybe it was just that nobody at Micro-
soft knew whether OS/2 would be a success as a programming interface, so they
were hedging their bets. In any case, OS/2 became irrelevant, and a latecomer, the
Win32 API designed to be shared with Windows 95, became dominant.
A second key aspect of the user-mode design of Windows is the dynamic link
library (DLL) which is code that is linked to executable programs at run time rath-
er than compile time. Shared libraries are not a new concept, and most modern op-
erating systems use them. In Windows, almost all libraries are DLLs, from the
system library [Link] that is loaded into every process to the high-level libraries
of common functions that are intended to allow rampant code-reuse by application
developers.
DLLs improve the efficiency of the system by allowing common code to be
shared among processes, reduce program load times from disk by keeping com-
monly used code around in memory, and increase the serviceability of the system
by allowing operating system library code to be updated without having to recom-
pile or relink all the application programs that use it.
On the other hand, shared libraries introduce the problem of versioning and in-
crease the complexity of the system because changes introduced into a shared li-
brary to help one particular program have the potential of exposing latent bugs in
other applications, or just breaking them due to changes in the implementation—a
problem that in the Windows world is referred to as DLL hell.
kernel and services implemented in user-mode processes. Both the kernel and
process provide private address spaces where data structures can be protected and
service requests can be scrutinized.
However, there can be significant performance differences between services in
the kernel vs. services in user-mode processes. Entering the kernel from user mode
is slow on modern hardware, but not as slow as having to do it twice because you
are switching back and forth to another process. Also cross-process communica-
tion has lower bandwidth.
Kernel-mode code can (carefully) access data at the user-mode addresses passed
as parameters to its system calls. With user-mode services, either those data
must be copied to the service process, or some games must be played by mapping
memory back and forth (the ALPC facilities in Windows handle this under the covers).
In the future it is possible that the hardware costs of crossing between address
spaces and protection modes will be reduced, or perhaps even become irrelevant.
The Singularity project in Microsoft Research (Fandrich et al., 2006) uses run-time
techniques, like those used with C# and Java, to make protection a completely soft-
ware issue. No hardware switching between address spaces or protection modes is
required.
Windows makes significant use of user-mode service processes to extend the
functionality of the system. Some of these services are strongly tied to the
operation of kernel-mode components, such as [Link], the local security
authentication service, which manages the token objects that represent user identity,
as well as managing encryption keys used by the file system. The user-mode plug-
and-play manager is responsible for determining the correct driver to use when a
new hardware device is encountered, installing it, and telling the kernel to load it.
Many facilities provided by third parties, such as antivirus and digital rights man-
agement, are implemented as a combination of kernel-mode drivers and user-mode
services.
The Windows [Link] has a tab which identifies the services running on
the system. Multiple services can be seen to be running in the same process
([Link]). Windows does this for many of its own boot-time services to reduce
the time needed to start up the system. Services can be combined into the same
process as long as they can safely operate with the same security credentials.
Within each of the shared service processes, individual services are loaded as
DLLs. They normally share a pool of threads using the Win32 thread-pool facility,
so that only the minimal number of threads needs to be running across all the resi-
dent services.
Services are common sources of security vulnerabilities in the system because
they are often accessible remotely (depending on the TCP/IP firewall and IP Secu-
rity settings), and not all programmers who write services are as careful as they
should be to validate the parameters and buffers that are passed in via RPC.
The number of services running constantly in Windows is staggering. Yet few
of those services ever receive a single request, though if they do it is likely to be
from an attacker attempting to exploit a vulnerability.
In Windows processes are containers for programs. They hold the virtual ad-
dress space, the handles that refer to kernel-mode objects, and threads. In their
role as a container for threads they hold common resources used for thread execu-
tion, such as the pointer to the quota structure, the shared token object, and default
parameters used to initialize threads—including the priority and scheduling class.
Each process has user-mode system data, called the PEB (Process Environment
Block). The PEB includes the list of loaded modules (i.e., the EXE and DLLs),
the memory containing environment strings, the current working directory, and
data for managing the process’ heaps—as well as lots of special-case Win32 cruft
that has been added over time.
Threads are the kernel’s abstraction for scheduling the CPU in Windows. Pri-
orities are assigned to each thread based on the priority value in the containing
process. Threads can also be affinitized to run only on certain processors. This
helps concurrent programs running on multicore chips or multiprocessors to expli-
citly spread out work. Each thread has two separate call stacks, one for execution
in user mode and one for kernel mode. There is also a TEB (Thread Environ-
ment Block) that keeps user-mode data specific to the thread, including per-thread
storage (Thread Local Storage) and fields for Win32, language and cultural local-
ization, and other specialized fields that have been added by various facilities.
Besides the PEBs and TEBs, there is another data structure that kernel mode
shares with each process, namely, user shared data. This is a page that is writable
by the kernel, but read-only in every user-mode process. It contains a number of
values maintained by the kernel, such as various forms of time, version infor-
mation, amount of physical memory, and a large number of shared flags used by
various user-mode components, such as COM, terminal services, and the debug-
gers. The use of this read-only shared page is purely a performance optimization,
as the values could also be obtained by a system call into kernel mode. But system
calls are much more expensive than a single memory access, so for some sys-
tem-maintained fields, such as the time, this makes a lot of sense. The other fields,
such as the current time zone, change infrequently (except on airborne computers),
SEC. 11.4 PROCESSES AND THREADS IN WINDOWS 909
but code that relies on these fields must query them often just to see if they have
changed. As with many performance hacks, it is a bit ugly, but it works.
Processes
Processes are created from section objects, each of which describes a memory
object backed by a file on disk. When a process is created, the creating process re-
ceives a handle that allows it to modify the new process by mapping sections, allo-
cating virtual memory, writing parameters and environmental data, duplicating file
descriptors into its handle table, and creating threads. This is very different than
how processes are created in UNIX and reflects the difference in the target systems
for the original designs of UNIX vs. Windows.
As described in Sec. 11.1, UNIX was designed for 16-bit single-processor sys-
tems that used swapping to share memory among processes. In such systems, hav-
ing the process as the unit of concurrency and using an operation like fork to create
processes was a brilliant idea. To run a new process with small memory and no
virtual memory hardware, processes in memory have to be swapped out to disk to
create space. UNIX originally implemented fork simply by swapping out the par-
ent process and handing its physical memory to the child. The operation was al-
most free.
In contrast, the hardware environment at the time Cutler’s team wrote NT was
32-bit multiprocessor systems with virtual memory hardware to share 1–16 MB of
physical memory. Multiprocessors provide the opportunity to run parts of pro-
grams concurrently, so NT used processes as containers for sharing memory and
object resources, and used threads as the unit of concurrency for scheduling.
Of course, the systems of the next few years will look nothing like either of
these target environments, having 64-bit address spaces with dozens (or hundreds)
of CPU cores per chip socket and dozens or hundreds of gigabytes of physical
memory. This memory may be radically different from current RAM as well. Current
RAM loses its contents when powered off, but phase-change memories now in
the pipeline keep their values (like disks) even when powered off. Also expect
flash devices to replace hard disks, broader support for virtualization, ubiquitous
networking, and support for synchronization innovations like transactional mem-
ory. Windows and UNIX will continue to be adapted to new hardware realities,
but what will be really interesting is to see what new operating systems are de-
signed specifically for systems based on these advances.
Windows can group processes together into jobs. Jobs group processes in
order to apply constraints to them and the threads they contain, such as limiting re-
source use via a shared quota or enforcing a restricted token that prevents threads
from accessing many system objects. The most significant property of jobs for
Figure 11-22. The relationship between jobs, processes, threads, and fibers.
Jobs and fibers are optional; not all processes are in jobs or contain fibers.
Fibers are created by allocating a stack and a user-mode fiber data structure for
storing registers and data associated with the fiber. Threads are converted to fibers,
but fibers can also be created independently of threads. Such a fiber will not run
until a fiber already running on a thread explicitly calls SwitchToFiber to run the
fiber. Threads could attempt to switch to a fiber that is already running, so the pro-
grammer must provide synchronization to prevent this.
The primary advantage of fibers is that the overhead of switching between
fibers is much lower than switching between threads. A thread switch requires
entering and exiting the kernel. A fiber switch saves and restores a few registers
without changing modes at all.
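Win32 fibers themselves are not portable, but the user-mode register save and restore they rely on can be sketched with POSIX ucontext standing in for SwitchToFiber; the demo and its names are illustrative only:

```c
#define _XOPEN_SOURCE 700
#include <ucontext.h>

/* A fiber switch via swapcontext: registers are saved and restored
 * entirely in user mode, with no kernel transition, which is why
 * fiber switches are so much cheaper than thread switches. */
static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[64 * 1024];
static int steps;

static void fiber_body(void) {
    steps++;                               /* first slice of work    */
    swapcontext(&fiber_ctx, &main_ctx);    /* yield, like a fiber    */
    steps++;                               /* resumed later          */
}

static void run_demo(void) {
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
    fiber_ctx.uc_link = &main_ctx;         /* return here when done  */
    makecontext(&fiber_ctx, fiber_body, 0);

    swapcontext(&main_ctx, &fiber_ctx);    /* SwitchToFiber, roughly */
    swapcontext(&main_ctx, &fiber_ctx);    /* resume it once more    */
}
```

Note that, as in the text, nothing here prevents two threads from switching to the same fiber; real code must add its own synchronization.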
Although fibers are cooperatively scheduled, if there are multiple threads
scheduling the fibers, a lot of careful synchronization is required to make sure
fibers do not interfere with each other. To simplify the interaction between threads
and fibers, it is often useful to create only as many threads as there are processors
to run them, and affinitize the threads to each run only on a distinct set of available
processors, or even just one processor.
Each thread can then run a particular subset of the fibers, establishing a one-to-
many relationship between threads and fibers which simplifies synchronization.
Even so there are still many difficulties with fibers. Most of the Win32 libraries
are completely unaware of fibers, and applications that attempt to use fibers as if
they were threads will encounter various failures. The kernel has no knowledge of
fibers, and when a fiber enters the kernel, the thread it is executing on may block
and the kernel will schedule an arbitrary thread on the processor, making it
unavailable to run other fibers. For these reasons fibers are rarely used except
when porting code from other systems that explicitly need the functionality pro-
vided by fibers.
The Win32 thread pool is a facility that builds on top of the Windows thread
model to provide a better abstraction for certain types of programs. Thread crea-
tion is too expensive to be invoked every time a program wants to execute a small
task concurrently with other tasks in order to take advantage of multiple proc-
essors. Tasks can be grouped together into larger tasks but this reduces the amount
of exploitable concurrency in the program. An alternative approach is for a pro-
gram to allocate a limited number of threads, and maintain a queue of tasks that
need to be run. As a thread finishes the execution of a task, it takes another one
from the queue. This model separates the resource-management issues (how many
processors are available and how many threads should be created) from the pro-
gramming model (what is a task and how are tasks synchronized). Windows for-
malizes this solution into the Win32 thread pool, a set of APIs for automatically
managing a dynamic pool of threads and dispatching tasks to them.
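The fixed-threads-plus-queue model can be sketched with raw POSIX threads; the real Win32 thread pool also grows and shrinks the pool dynamically, and everything below is a simplified stand-in:

```c
#include <assert.h>
#include <pthread.h>

/* A limited number of worker threads drain a queue of small tasks. */
#define NTASKS 8

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task;           /* index of next queued task */
static int done[NTASKS];        /* which tasks have run      */

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int t = next_task < NTASKS ? next_task++ : -1;  /* take a task */
        pthread_mutex_unlock(&lock);
        if (t < 0)
            return NULL;        /* queue drained: worker exits */
        done[t] = 1;            /* "run" the task              */
    }
}

static void run_pool(int nthreads) {
    pthread_t th[4];
    if (nthreads > 4)
        nthreads = 4;           /* keep the sketch's array bound */
    for (int i = 0; i < nthreads; i++)
        pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(th[i], NULL);
}
```

This separates how many threads exist (a resource decision) from how many tasks there are (a program-structure decision), which is exactly the split the Win32 thread pool formalizes.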
Thread pools are not a perfect solution, because when a thread blocks for some
resource in the middle of a task, the thread cannot switch to a different task. Thus,
the thread pool will inevitably create more threads than there are processors
available, so that runnable threads are available to be scheduled even when other
threads have blocked. The thread pool is integrated with many of the common
synchronization mechanisms, such as awaiting the completion of I/O or blocking
until a kernel event is signaled. Synchronization objects can be used as triggers
for queuing a task, so that threads are not assigned the task before it is ready to run.
The implementation of the thread pool uses the same queue facility provided
for synchronization with I/O completion, together with a kernel-mode thread fac-
tory which adds more threads to the process as needed to keep the available num-
ber of processors busy. Small tasks exist in many applications, but particularly in
those that provide services in the client/server model of computing, where a stream
of requests are sent from the clients to the server. Use of a thread pool for these
scenarios improves the efficiency of the system by reducing the overhead of creat-
ing threads and moving the decisions about how to manage the threads in the pool
out of the application and into the operating system.
What programmers see as a single Windows thread is actually two threads: one
that runs in kernel mode and one that runs in user mode. This is precisely the same
model that UNIX has. Each of these threads is allocated its own stack and its own
memory to save its registers when not running. The two threads appear to be a sin-
gle thread because they do not run at the same time. The user thread operates as an
extension of the kernel thread, running only when the kernel thread switches to it
by returning from kernel mode to user mode. When a user thread wants to perform
a system call, encounters a page fault, or is preempted, the system enters kernel
mode and switches back to the corresponding kernel thread. It is normally not pos-
sible to switch between user threads without first switching to the corresponding
kernel thread, switching to the new kernel thread, and then switching to its user
thread.
Most of the time the difference between user and kernel threads is transparent
to the programmer. However, in Windows 7 Microsoft added a facility called
UMS (User-Mode Scheduling), which exposes the distinction. UMS is similar to
facilities used in other operating systems, such as scheduler activations. It can be
used to switch between user threads without first having to enter the kernel, provid-
ing the benefits of fibers, but with much better integration into Win32—since it
uses real Win32 threads.
The implementation of UMS has three key elements:
UMS does not include a user-mode scheduler as part of Windows. UMS is in-
tended as a low-level facility for use by run-time libraries used by programming-
language and server applications to implement lightweight threading models that
do not conflict with kernel-level thread scheduling. These run-time libraries will
normally implement a user-mode scheduler best suited to their environment. A
summary of these abstractions is given in Fig. 11-23.
Figure 11-23. Basic concepts used for CPU and resource management.
Threads
Every process normally starts out with one thread, but new ones can be created
dynamically. Threads form the basis of CPU scheduling, as the operating system
always selects a thread to run, not a process. Consequently, every thread has a
state (ready, running, blocked, etc.), whereas processes do not have scheduling
states. Threads can be created dynamically by a Win32 call that specifies the ad-
dress within the enclosing process’ address space at which it is to start running.
Every thread has a thread ID, which is taken from the same space as the proc-
ess IDs, so a single ID can never be in use for both a process and a thread at the
same time. Process and thread IDs are multiples of four because they are actually
allocated by the executive using a special handle table set aside for allocating IDs.
The system is reusing the scalable handle-management facility shown in
Figs. 11-16 and 11-17. The handle table does not have references on objects, but
does use the pointer field to point at the process or thread so that the lookup of a
process or thread by ID is very efficient. FIFO ordering of the list of free handles
is turned on for the ID table in recent versions of Windows so that IDs are not im-
mediately reused. The problems with immediate reuse are explored in the prob-
lems at the end of this chapter.
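The ID-allocation scheme can be illustrated with a toy allocator (not Windows code; the table size and function names are invented): IDs are slot numbers scaled by four, and freed IDs go to the back of a FIFO queue so they are not reused immediately.

```c
#define MAX_IDS 64          /* toy table size; real tables grow dynamically */

static int fifo[MAX_IDS];   /* FIFO free list of table slot numbers */
static int head, tail, count;

void id_table_init(void) {
    head = tail = count = 0;
    for (int slot = 1; slot <= MAX_IDS; slot++) {  /* slot 0 stays unused */
        fifo[tail] = slot;
        tail = (tail + 1) % MAX_IDS;
        count++;
    }
}

int id_alloc(void) {
    if (count == 0)
        return -1;                    /* table exhausted */
    int slot = fifo[head];            /* oldest free slot first */
    head = (head + 1) % MAX_IDS;
    count--;
    return slot * 4;                  /* IDs are slot numbers times four */
}

void id_free(int id) {
    fifo[tail] = id / 4;              /* back of the queue: delayed reuse */
    tail = (tail + 1) % MAX_IDS;
    count++;
}
```

Because a freed slot goes to the tail, a just-released ID is only handed out again after every other free slot has been used, which is the property the chapter's exercises explore.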
A thread normally runs in user mode, but when it makes a system call it
switches to kernel mode and continues to run as the same thread with the same
properties and limits it had in user mode. Each thread has two stacks, one for use
when it is in user mode and one for use when it is in kernel mode. Whenever a
thread enters the kernel, it switches to the kernel-mode stack. The values of the
user-mode registers are saved in a CONTEXT data structure at the base of the ker-
nel-mode stack. Since the only way for a user-mode thread to not be running is for
it to enter the kernel, the CONTEXT for a thread always contains its register state
when it is not running. The CONTEXT for each thread can be examined and mod-
ified from any process with a handle to the thread.
Threads normally run using the access token of their containing process, but in
certain cases related to client/server computing, a thread running in a service proc-
ess can impersonate its client, using a temporary access token based on the client’s
914 CASE STUDY 2: WINDOWS 8 CHAP. 11
token so it can perform operations on the client’s behalf. (In general a service can-
not use the client’s actual token, as the client and server may be running on dif-
ferent systems.)
Threads are also the normal focal point for I/O. Threads block when perform-
ing synchronous I/O, and the outstanding I/O request packets for asynchronous I/O
are linked to the thread. When a thread is finished executing, it can exit. Any I/O
requests pending for the thread will be canceled. When the last thread still active
in a process exits, the process terminates.
It is important to realize that threads are a scheduling concept, not a re-
source-ownership concept. Any thread is able to access all the objects that belong
to its process. All it has to do is use the handle value and make the appropriate
Win32 call. A thread is not barred from accessing an object merely because a
different thread created or opened it. The system does not even keep track
of which thread created which object. Once an object handle has been put in a
process’ handle table, any thread in the process can use it, even if it is impersonat-
ing a different user.
As described previously, in addition to the normal threads that run within user
processes Windows has a number of system threads that run only in kernel mode
and are not associated with any user process. All such system threads run in a spe-
cial process called the system process. This process does not have a user-mode
address space. It provides the environment that threads execute in when they are
not operating on behalf of a specific user-mode process. We will study some of
these threads later when we come to memory management. Some perform admin-
istrative tasks, such as writing dirty pages to the disk, while others form the pool of
worker threads that are assigned to run specific short-term tasks delegated by exec-
utive components or drivers that need to get some work done in the system process.
New processes are created using the Win32 API function CreateProcess. This
function has many parameters and lots of options. It takes the name of the file to
be executed, the command-line strings (unparsed), and a pointer to the environ-
ment strings. There are also flags and values that control many details such as how
security is configured for the process and first thread, debugger configuration, and
scheduling priorities. A flag also specifies whether open handles in the creator are
to be passed to the new process. The function also takes the current working direc-
tory for the new process and an optional data structure with information about the
GUI Window the process is to use. Rather than returning just a process ID for the
new process, Win32 returns both handles and IDs, both for the new process and for
its initial thread.
The large number of parameters reveals a number of differences from the de-
sign of process creation in UNIX.
1. The actual search path for finding the program to execute is buried in
the library code for Win32, but managed more explicitly in UNIX.
2. The current working directory is a kernel-mode concept in UNIX but
a user-mode string in Windows. Windows does open a handle on the
current directory for each process, with the same annoying effect as in
UNIX: you cannot delete the directory, unless it happens to be across
the network, in which case you can delete it.
3. UNIX parses the command line and passes an array of parameters,
while Win32 leaves argument parsing up to the individual program.
As a consequence, different programs may handle wildcards (e.g.,
*.txt) and other special symbols in an inconsistent way.
4. Whether file descriptors can be inherited in UNIX is a property of the
handle. In Windows it is a property of both the handle and a parame-
ter to process creation.
5. Win32 is GUI oriented, so new processes are directly passed infor-
mation about their primary window, while this information is passed
as parameters to GUI applications in UNIX.
6. Windows does not have a SETUID bit as a property of the executable,
but one process can create a process that runs as a different user, as
long as it can obtain a token with that user’s credentials.
7. The process and thread handle returned from Windows can be used at
any time to modify the new process/thread in many substantive ways,
including modifying the virtual memory, injecting threads into the
process, and altering the execution of threads. UNIX makes modifi-
cations to the new process only between the fork and exec calls, and
only in limited ways as exec throws out all the user-mode state of the
process.
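The UNIX pattern that item 7 contrasts with can be sketched as follows (a minimal illustration assuming a Linux-like system where /bin/true exists; run_child is a made-up helper name):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The parent can adjust the child only in the window between fork and
 * exec, because exec discards all of the child's user-mode state. */
int run_child(void) {
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: the only place for the limited modifications UNIX
         * allows (closing files, changing directory, etc.). */
        execl("/bin/true", "true", (char *)NULL);
        _exit(127);                /* only reached if exec failed */
    }
    int status = 0;
    waitpid(pid, &status, 0);      /* parent waits for the child */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In Windows, by contrast, the process and thread handles returned by CreateProcess stay valid, so the same kinds of modifications can be made at any point in the child's lifetime.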
Some of these differences are historical and philosophical. UNIX was de-
signed to be command-line oriented rather than GUI oriented like Windows.
UNIX users are more sophisticated, and they understand concepts like PATH vari-
ables. Windows inherited a lot of legacy from MS-DOS.
The comparison is also skewed because Win32 is a user-mode wrapper around
the native NT process execution, much as the system library function wraps
fork/exec in UNIX. The actual NT system calls for creating processes and threads,
NtCreateProcess and NtCreateThread, are simpler than the Win32 versions. The
main parameters to NT process creation are a handle on a section representing the
program file to run, a flag specifying whether the new process should, by default,
inherit handles from the creator, and parameters related to the security model. All
the details of setting up the environment strings and creating the initial thread are
left to user-mode code that can use the handle on the new process to manipulate its
virtual address space directly.
To support the POSIX subsystem, native process creation has an option to cre-
ate a new process by copying the virtual address space of another process rather
than mapping a section object for a new program. This is used only to implement
fork for POSIX, and not by Win32. Since POSIX no longer ships with Windows,
process duplication has little use—though sometimes enterprising developers come
up with special uses, similar to uses of fork without exec in UNIX.
Thread creation passes the CPU context to use for the new thread (which in-
cludes the stack pointer and initial instruction pointer), a template for the TEB, and
a flag saying whether the thread should be immediately run or created in a sus-
pended state (waiting for somebody to call NtResumeThread on its handle). Crea-
tion of the user-mode stack and pushing of the argv/argc parameters is left to user-
mode code calling the native NT memory-management APIs on the process hand-
le.
In the Windows Vista release, a new native API for processes, NtCreateUser-
Process, was added which moves many of the user-mode steps into the kernel-
mode executive, and combines process creation with creation of the initial thread.
The reason for the change was to support the use of processes as security bound-
aries. Normally, all processes created by a user are considered to be equally trust-
ed. It is the user, as represented by a token, that determines where the trust bound-
ary is. NtCreateUserProcess allows processes to also provide trust boundaries, but
this means that the creating process does not have sufficient rights regarding a new
process handle to implement the details of process creation in user mode for proc-
esses that are in a different trust environment. The primary use of a process in a
different trust boundary (called protected processes) is to support forms of digital
rights management, which protect copyrighted material from being used improp-
erly. Of course, protected processes only target user-mode attacks against protect-
ed content and cannot prevent kernel-mode attacks.
Interprocess Communication
be used over a network but do not provide guaranteed delivery. Finally, they allow
the sending process to broadcast a message to many receivers, instead of to just
one receiver. Both mailslots and named pipes are implemented as file systems in
Windows, rather than executive functions. This allows them to be accessed over
the network using the existing remote file-system protocols.
Sockets are like pipes, except that they normally connect processes on dif-
ferent machines. For example, one process writes to a socket and another one on a
remote machine reads from it. Sockets can also be used to connect processes on
the same machine, but since they entail more overhead than pipes, they are gener-
ally only used in a networking context. Sockets were originally designed for
Berkeley UNIX, and the implementation was made widely available. Some of the
Berkeley code and data structures are still present in Windows today, as acknow-
ledged in the release notes for the system.
RPCs are a way for process A to have process B call a procedure in B’s address
space on A’s behalf and return the result to A. Various restrictions on the parame-
ters exist. For example, it makes no sense to pass a pointer to a different process,
so data structures have to be packaged up and transmitted in a nonprocess-specific
way. RPC is normally implemented as an abstraction layer on top of a transport
layer. In the case of Windows, the transport can be TCP/IP sockets, named pipes,
or ALPC. ALPC (Advanced Local Procedure Call) is a message-passing facility in
the kernel-mode executive. It is optimized for communicating between processes
on the local machine and does not operate across the network. The basic design is
for sending messages that generate replies, implementing a lightweight version of
remote procedure call which the RPC package can build on top of to provide a
richer set of features than available in ALPC. ALPC is implemented using a com-
bination of copying parameters and temporary allocation of shared memory, based
on the size of the messages.
Finally, processes can share objects. This includes section objects, which can
be mapped into the virtual address space of different processes at the same time.
All writes done by one process then appear in the address spaces of the other proc-
esses. Using this mechanism, the shared buffer used in producer-consumer prob-
lems can easily be implemented.
Synchronization
Processes can also use various types of synchronization objects. Just as Win-
dows provides numerous interprocess communication mechanisms, it also provides
numerous synchronization mechanisms, including semaphores, mutexes, critical
regions, and events. All of these mechanisms work with threads, not processes, so
that when a thread blocks on a semaphore, other threads in that process (if any) are
not affected and can continue to run.
A semaphore can be created using the CreateSemaphore Win32 API function,
which can also initialize it to a given value and define a maximum value as well.
Semaphores are kernel-mode objects and thus have security descriptors and hand-
les. The handle for a semaphore can be duplicated using DuplicateHandle and pas-
sed to another process so that multiple processes can synchronize on the same sem-
aphore. A semaphore can also be given a name in the Win32 namespace and have
an ACL set to protect it. Sometimes sharing a semaphore by name is more ap-
propriate than duplicating the handle.
Calls for up and down exist, although they have the somewhat odd names of
ReleaseSemaphore (up) and WaitForSingleObject (down). It is also possible to
give WaitForSingleObject a timeout, so the calling thread can be released eventual-
ly, even if the semaphore remains at 0 (although timers reintroduce races). Wait-
ForSingleObject and WaitForMultipleObjects are the common interfaces used for
waiting on the dispatcher objects discussed in Sec. 11.3. While it would have been
possible to wrap the single-object version of these APIs in a wrapper with a some-
what more semaphore-friendly name, many threads use the multiple-object version
which may include waiting for multiple flavors of synchronization objects as well
as other events like process or thread termination, I/O completion, and messages
being available on sockets and ports.
Mutexes are also kernel-mode objects used for synchronization, but simpler
than semaphores because they do not have counters. They are essentially locks,
with API functions for locking WaitForSingleObject and unlocking ReleaseMutex.
Like semaphore handles, mutex handles can be duplicated and passed between
processes so that threads in different processes can access the same mutex.
A third synchronization mechanism is called critical sections, which imple-
ment the concept of critical regions. These are similar to mutexes in Windows, ex-
cept local to the address space of the creating thread. Because critical sections are
not kernel-mode objects, they do not have explicit handles or security descriptors
and cannot be passed between processes. Locking and unlocking are done with
EnterCriticalSection and LeaveCriticalSection, respectively. Because these API
functions are performed initially in user space and make kernel calls only when
blocking is needed, they are much faster than mutexes. Critical sections are opti-
mized to combine spin locks (on multiprocessors) with the use of kernel synchroni-
zation only when necessary. In many applications most critical sections are so
rarely contended or have such short hold times that it is never necessary to allocate
a kernel synchronization object. This results in a very significant savings in kernel
memory.
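The fast-path idea behind critical sections can be sketched with C11 atomics (a toy model, not the Win32 implementation; the spin limit and the kernel_waits counter are illustrative stand-ins for the real fallback to a kernel synchronization object):

```c
#include <stdatomic.h>

typedef struct {
    atomic_flag locked;
    long kernel_waits;   /* how often we would have entered the kernel */
} critsec;

void cs_init(critsec *cs) {
    atomic_flag_clear(&cs->locked);
    cs->kernel_waits = 0;
}

void cs_enter(critsec *cs) {
    int spins = 0;
    /* Try to take the lock entirely in user mode with an atomic
     * exchange; only under contention would real code allocate and
     * wait on a kernel event. */
    while (atomic_flag_test_and_set_explicit(&cs->locked,
                                             memory_order_acquire)) {
        if (++spins > 4000) {     /* spin budget exhausted */
            cs->kernel_waits++;   /* real code: block in the kernel */
            spins = 0;
        }
    }
}

void cs_leave(critsec *cs) {
    atomic_flag_clear_explicit(&cs->locked, memory_order_release);
}
```

In the uncontended case the lock is taken and released without ever touching the kernel, which is exactly why critical sections are so much cheaper than mutexes.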
Another synchronization mechanism we discuss uses kernel-mode objects call-
ed events. As we have described previously, there are two kinds: notification
events and synchronization events. An event can be in one of two states: signaled
or not-signaled. A thread can wait for an event to be signaled with WaitForSin-
gleObject. If another thread signals an event with SetEvent, what happens depends
on the type of event. With a notification event, all waiting threads are released and
the event stays set until manually cleared with ResetEvent. With a synchroniza-
tion event, if one or more threads are waiting, exactly one thread is released and
the event is automatically reset to not-signaled.
Implementation of Processes and Threads
In this section we will get into more detail about how Windows creates a proc-
ess (and the initial thread). Because Win32 is the most documented interface, we
will start there. But we will quickly work our way down into the kernel and under-
stand the implementation of the native API call for creating a new process. We
will focus on the main code paths that get executed whenever processes are creat-
ed, as well as look at a few of the details that fill in gaps in what we have covered
so far.
A process is created when another process makes the Win32 CreateProcess
call. This call invokes a user-mode procedure in kernel32.dll that makes a call to
NtCreateUserProcess in the kernel to create the process in several steps.
Figure 11-24. Some of the Win32 calls for managing processes, threads,
and fibers.
5. The memory manager creates the address space for the new process
by allocating and initializing the page directories and the virtual ad-
dress descriptors which describe the kernel-mode portion, including
the process-specific regions, such as the self-map page-directory en-
tries that give each process kernel-mode access to the physical pages
in its entire page table using kernel virtual addresses. (We will de-
scribe the self map in more detail in Sec. 11.5.)
6. A handle table is created for the new process, and all the handles from
the caller that are allowed to be inherited are duplicated into it.
7. The shared user page is mapped, and the memory manager initializes
the working-set data structures used for deciding what pages to trim
from a process when physical memory is low. The pieces of the ex-
ecutable image represented by the section object are mapped into the
new process’ user-mode address space.
8. The executive creates and initializes the user-mode PEB, which is
used by both user-mode processes and the kernel to maintain proc-
esswide state information, such as the user-mode heap pointers and
the list of loaded libraries (DLLs).
9. Virtual memory is allocated in the new process and used to pass pa-
rameters, including the environment strings and command line.
10. A process ID is allocated from the special handle table (ID table) the
kernel maintains for efficiently allocating locally unique IDs for proc-
esses and threads.
11. A thread object is allocated and initialized. A user-mode stack is al-
located along with the Thread Environment Block (TEB). The CON-
TEXT record which contains the thread’s initial values for the CPU
registers (including the instruction and stack pointers) is initialized.
12. The process object is added to the global list of processes. Handles
for the process and thread objects are allocated in the caller’s handle
table. An ID for the initial thread is allocated from the ID table.
13. NtCreateUserProcess returns to user mode with the new process
created, containing a single thread that is ready to run but suspended.
14. If the NT API fails, the Win32 code checks to see if this might be a
process belonging to another subsystem like WOW64, or perhaps
the program is marked to be run under the debugger.
These special cases are handled with special code in the user-mode
CreateProcess code.
Scheduling
The Windows kernel does not have a central scheduling thread. Instead, when
a thread cannot run any more, the thread calls into the scheduler itself to see which
thread to switch to. The following conditions invoke scheduling.
1. A running thread blocks on a semaphore, mutex, event, I/O, etc.
2. The thread signals an object (e.g., does an up on a semaphore).
3. The quantum expires.
In case 1, the thread is already in the kernel to carry out the operation on the dis-
patcher or I/O object. It cannot possibly continue, so it calls the scheduler code to
pick its successor and load that thread’s CONTEXT record to resume running it.
In case 2, the running thread is in the kernel, too. However, after signaling
some object, it can definitely continue because signaling an object never blocks.
Still, the thread is required to call the scheduler to see if the result of its action has
released a thread with a higher scheduling priority that is now ready to run. If so, a
thread switch occurs since Windows is fully preemptive (i.e., thread switches can
occur at any moment, not just at the end of the current thread’s quantum). Howev-
er, in the case of a multicore chip or a multiprocessor, a thread that was made ready
may be scheduled on a different CPU and the original thread can continue to ex-
ecute on the current CPU even though its scheduling priority is lower.
In case 3, an interrupt to kernel mode occurs, at which point the thread ex-
ecutes the scheduler code to see who runs next. Depending on what other threads
are waiting, the same thread may be selected, in which case it gets a new quantum
and continues running. Otherwise a thread switch happens.
The scheduler is also called under two other conditions:
In the first case, a thread may have been waiting on this I/O and is now released to
run. A check has to be made to see if it should preempt the running thread since
there is no guaranteed minimum run time. The scheduler is not run in the interrupt
handler itself (since that may keep interrupts turned off too long). Instead, a DPC
is queued for slightly later, after the interrupt handler is done. In the second case, a
thread has done a down on a semaphore or blocked on some other object, but with
a timeout that has now expired. Again it is necessary for the interrupt handler to
queue a DPC to avoid having it run during the clock interrupt handler. If a thread
has been made ready by this timeout, the scheduler will be run and if the newly
runnable thread has higher priority, the current thread is preempted as in case 1.
Now we come to the actual scheduling algorithm. The Win32 API provides
two APIs to influence thread scheduling. First, there is a call SetPriorityClass that
sets the priority class of all the threads in the caller’s process. The allowed values
are: real-time, high, above normal, normal, below normal, and idle. The priority
class determines the relative priorities of processes. The process priority class can
also be used by a process to temporarily mark itself as being background, meaning
that it should not interfere with any other activity in the system. Note that the pri-
ority class is established for the process, but it affects the actual priority of every
thread in the process by setting a base priority that each thread starts with when
created.
The second Win32 API is SetThreadPriority. It sets the relative priority of a
thread (possibly, but not necessarily, the calling thread) with respect to the priority
class of its process. The allowed values are: time critical, highest, above normal,
normal, below normal, lowest, and idle. Time-critical threads get the highest non-
real-time scheduling priority, while idle threads get the lowest, irrespective of the
priority class. The other priority values adjust the base priority of a thread with re-
spect to the normal value determined by the priority class (+2, +1, 0, −1, −2, re-
spectively). The use of priority classes and relative thread priorities makes it easier
for applications to decide what priorities to specify.
The scheduler works as follows. The system has 32 priorities, numbered from
0 to 31. The combinations of priority class and relative priority are mapped onto
32 absolute thread priorities according to the table of Fig. 11-25. The number in
the table determines the thread’s base priority. In addition, every thread has a
current priority, which may be higher (but not lower) than the base priority and
which we will discuss shortly.
To use these priorities for scheduling, the system maintains an array of 32 lists
of threads, corresponding to priorities 0 through 31 derived from the table of
Fig. 11-25. Each list contains ready threads at the corresponding priority. The
basic scheduling algorithm consists of searching the array from priority 31 down to
priority 0. As soon as a nonempty list is found, the thread at the head of the queue
is selected and run for one quantum. If the quantum expires, the thread goes to the
end of the queue at its priority level and the thread at the front is chosen next. In
other words, when there are multiple threads ready at the highest priority level,
they run round robin for one quantum each. If no thread is ready, the processor is
idled—that is, set to a low power state waiting for an interrupt to occur.
It should be noted that scheduling is done by picking a thread without regard to
which process that thread belongs. Thus, the scheduler does not first pick a proc-
ess and then pick a thread in that process. It only looks at the threads. It does not
consider which thread belongs to which process except to determine if it also needs
to switch address spaces when switching threads.
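The queue-scanning algorithm just described can be sketched as a toy model (not actual Windows code; the array sizes and thread IDs are illustrative):

```c
#define NUM_PRIORITIES 32
#define MAX_READY 16

/* One FIFO ready list per priority level, 0 through 31. */
typedef struct {
    int q[NUM_PRIORITIES][MAX_READY];  /* thread IDs */
    int len[NUM_PRIORITIES];
} ready_queues;

/* Scan from priority 31 down to 0 and dequeue the head of the first
 * nonempty list; with several threads at one level, repeated calls
 * give round robin within that level. */
int pick_next(ready_queues *r) {
    for (int pri = NUM_PRIORITIES - 1; pri >= 0; pri--) {
        if (r->len[pri] > 0) {
            int tid = r->q[pri][0];
            for (int i = 1; i < r->len[pri]; i++)   /* dequeue the head */
                r->q[pri][i - 1] = r->q[pri][i];
            r->len[pri]--;
            return tid;
        }
    }
    return -1;   /* no ready thread: idle the processor */
}

/* A thread whose quantum expires is requeued at the tail of its level. */
void make_ready(ready_queues *r, int tid, int pri) {
    r->q[pri][r->len[pri]++] = tid;
}
```

Note that the model, like the real scheduler, picks among threads only; which process a thread belongs to never enters the decision.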
Figure 11-26. Windows supports 32 priorities for threads: system priorities
span levels 16 through 31, user priorities span levels 1 through 15, the zero-page
thread runs at priority 0, and the idle thread runs below it; the next thread to run
is taken from the highest nonempty priority level.
busy. The amount of boost depends on the I/O device, typically 1 for a disk, 2 for
a serial line, 6 for the keyboard, and 8 for the sound card.
Second, if a thread was waiting on a semaphore, mutex, or other event, when it
is released, it gets boosted by 2 levels if it is in the foreground process (the process
controlling the window to which keyboard input is sent) and 1 level otherwise.
This fix tends to raise interactive processes above the big crowd at level 8. Finally,
if a GUI thread wakes up because window input is now available, it gets a boost for
the same reason.
These boosts are not forever. They take effect immediately, and can cause
rescheduling of the CPU. But if a thread uses all of its next quantum, it loses one
priority level and moves down one queue in the priority array. If it uses up another
full quantum, it moves down another level, and so on until it hits its base level,
where it remains until it is boosted again.
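The boost-and-decay arithmetic described above can be written out in a few lines (the boost values come from the text; the cap at priority 15 is an assumption that boosted user threads stay below the real-time range):

```c
/* Typical boost values from the text, per I/O device. */
enum { BOOST_DISK = 1, BOOST_SERIAL = 2, BOOST_KEYBOARD = 6,
       BOOST_SOUND = 8 };

int boosted_priority(int base, int boost) {
    int pri = base + boost;
    return pri > 15 ? 15 : pri;   /* assumed cap: stay in user range 1-15 */
}

/* After each full quantum, a boosted thread drops one level until it
 * is back at its base priority, where it stays until boosted again. */
int decay_after_quantum(int current, int base) {
    return current > base ? current - 1 : base;
}
```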
There is one other case in which the system fiddles with the priorities. Imag-
ine that two threads are working together on a producer-consumer type problem.
The producer’s work is harder, so it gets a high priority, say 12, compared to the
consumer’s 4. At a certain point, the producer has filled up a shared buffer and
blocks on a semaphore, as illustrated in Fig. 11-27(a).
Before the consumer gets a chance to run again, an unrelated thread at priority
8 becomes ready and starts running, as shown in Fig. 11-27(b). As long as this
thread wants to run, it will be able to, since it has a higher priority than the consu-
mer, and the producer, though even higher, is blocked. Under these circumstances,
the producer will never get to run again until the priority 8 thread gives up. This
Figure 11-27. An example of priority inversion. (a) The priority 12 producer
does a down on the semaphore and blocks, while the priority 4 consumer is
ready. (b) The producer waits on the semaphore; the consumer would like to do
an up on the semaphore but never gets scheduled.
problem is well known under the name priority inversion. Windows addresses
priority inversion between kernel threads through a facility in the thread scheduler
called Autoboost. Autoboost automatically tracks resource dependencies between
threads and boosts the scheduling priority of threads that hold resources needed by
higher-priority threads.
Windows runs on PCs, which usually have only a single interactive session ac-
tive at a time. However, Windows also supports a terminal server mode which
supports multiple interactive sessions over the network using RDP (Remote Desk-
top Protocol). When running multiple user sessions, it is easy for one user to in-
terfere with another by consuming too much processor resources. Windows imple-
ments a fair-share algorithm, DFSS (Dynamic Fair-Share Scheduling), which
keeps sessions from running excessively. DFSS uses scheduling groups to
organize the threads in each session. Within each group the threads are scheduled
according to normal Windows scheduling policies, but each group is given more or
less access to the processors based on how much the group has been running in
aggregate. The relative priorities of the groups are adjusted slowly, so that short
bursts of activity are ignored and a group’s allowed running time is reduced only
if it uses excessive processor time over long periods.
Memory Management
In Windows, every user process has its own virtual address space. For x86 ma-
chines, virtual addresses are 32 bits long, so each process has 4 GB of virtual ad-
dress space, with the user and kernel each receiving 2 GB. For x64 machines, both
the user and kernel receive more virtual addresses than they can reasonably use in
the foreseeable future. For both x86 and x64, the virtual address space is demand
paged, with a fixed page size of 4 KB—though in some cases, as we will see short-
ly, 2-MB large pages are also used (by using a page directory only and bypassing
the corresponding page table).
The virtual address space layouts for three x86 processes are shown in
Fig. 11-28 in simplified form. The bottom and top 64 KB of each process’ virtual
address space is normally unmapped. This choice was made intentionally to help
catch programming errors and mitigate the exploitability of certain types of vulner-
abilities.
Figure 11-28. Virtual address space layout for three user processes (A, B, and
C) on the x86, running from 0 to 4 GB. The white areas are private per process.
The shaded areas are shared among all processes. The bottom and top 64 KB
are invalid.
Starting at 64 KB comes the user’s private code and data. This extends up to
almost 2 GB. The upper 2 GB contains the operating system, including the code,
data, and the paged and nonpaged pools. The upper 2 GB is the kernel’s virtual
memory and is shared among all user processes, except for virtual memory data
like the page tables and working-set lists, which are per-process. Kernel virtual
SEC. 11.5 MEMORY MANAGEMENT 929
memory is accessible only while running in kernel mode. The reason for sharing
the process’ virtual memory with the kernel is that when a thread makes a system
call, it traps into kernel mode and can continue running without changing the mem-
ory map. All that has to be done is switch to the thread’s kernel stack. From a per-
formance point of view, this is a big win, and something UNIX does as well. Be-
cause the process’ user-mode pages are still accessible, the kernel-mode code can
read parameters and access buffers without having to switch back and forth be-
tween address spaces or temporarily double-map pages into both. The trade-off
here is less private address space per process in return for faster system calls.
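The x86 layout just described can be captured in a small classification function (a sketch following the text's description; the exact boundary constants are derived from the stated sizes and are assumptions):

```c
#include <stdint.h>

typedef enum { ADDR_INVALID, ADDR_USER, ADDR_KERNEL } region;

/* Bottom and top 64 KB unmapped, user space below 2 GB, kernel space
 * in the upper 2 GB, per the layout of Fig. 11-28. */
region classify(uint32_t va) {
    if (va < 0x00010000u)      /* bottom 64 KB */
        return ADDR_INVALID;
    if (va >= 0xFFFF0000u)     /* top 64 KB */
        return ADDR_INVALID;
    if (va < 0x80000000u)      /* below the 2-GB boundary */
        return ADDR_USER;
    return ADDR_KERNEL;        /* shared among all processes */
}
```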
Windows allows threads to attach themselves to other address spaces while
running in the kernel. Attachment to an address space allows the thread to access
all of the user-mode address space, as well as the portions of the kernel address
space that are specific to a process, such as the self-map for the page tables.
Threads must switch back to their original address space before returning to user
mode.
Each page of virtual addresses can be in one of three states: invalid, reserved,
or committed. An invalid page is not currently mapped to a memory section ob-
ject and a reference to it causes a page fault that results in an access violation.
Once code or data is mapped onto a virtual page, the page is said to be committed.
A page fault on a committed page results in mapping the page containing the virtu-
al address that caused the fault onto one of the pages represented by the section ob-
ject or stored in the pagefile. Often this will require allocating a physical page and
performing I/O on the file represented by the section object to read in the data from
disk. But page faults can also occur simply because the page-table entry needs to
be updated, as the physical page referenced is still cached in memory, in which
case I/O is not required. These are called soft faults and we will discuss them in
more detail shortly.
A virtual page can also be in the reserved state. A reserved virtual page is
invalid but has the property that those virtual addresses will never be allocated by
the memory manager for another purpose. As an example, when a new thread is
created, many pages of user-mode stack space are reserved in the process’ virtual
address space, but only one page is committed. As the stack grows, the virtual
memory manager will automatically commit additional pages under the covers,
until the reservation is almost exhausted. The reserved pages function as guard
pages to keep the stack from growing too far and overwriting other process data.
Reserving all the virtual pages means that the stack can eventually grow to its max-
imum size without the risk that some of the contiguous pages of virtual address
space needed for the stack might be given away for another purpose. In addition to
the invalid, reserved, and committed attributes, pages also have other attributes,
such as being readable, writable, and executable.
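The three page states and the reserve-then-commit stack pattern described above can be sketched as a toy model (this is illustrative Python, not Windows code; all names are ours):

```python
# A toy model of the three virtual-page states: invalid, reserved, committed.
class VirtualRegion:
    def __init__(self, npages):
        self.state = ["invalid"] * npages  # every page starts invalid

    def reserve(self, start, n):
        # Reserve addresses so the memory manager will not hand them out
        # for another purpose, without committing any storage yet.
        for i in range(start, start + n):
            assert self.state[i] == "invalid"
            self.state[i] = "reserved"

    def commit(self, start, n):
        # Back the pages with storage (a section object or the pagefile).
        for i in range(start, start + n):
            assert self.state[i] in ("invalid", "reserved")
            self.state[i] = "committed"

    def touch(self, page):
        # Accessing an invalid or reserved page is an access violation;
        # real stack growth is handled specially (see Page-Fault Handling).
        if self.state[page] != "committed":
            raise MemoryError("access violation on page %d" % page)

# Thread creation reserves many stack pages but commits only one:
stack = VirtualRegion(16)
stack.reserve(0, 16)
stack.commit(15, 1)     # stacks grow downward; only the top page committed
```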
930 CASE STUDY 2: WINDOWS 8 CHAP. 11
Memory-Management System Calls
The Win32 API contains a number of functions that allow a process to manage
its virtual memory explicitly. The most important of these functions are listed in
Fig. 11-29. All of them operate on a region consisting of either a single page or a
sequence of two or more pages that are consecutive in the virtual address space.
Of course, processes do not have to manage their memory; paging happens auto-
matically, but these calls give processes additional power and flexibility.
Figure 11-29. The principal Win32 API functions for managing virtual memory
in Windows.
The first four API functions are used to allocate, free, protect, and query re-
gions of virtual address space. Allocated regions always begin on 64-KB bound-
aries to minimize porting problems to future architectures with pages larger than
current ones. The actual amount of address space allocated can be less than 64
KB, but must be a multiple of the page size. The next two APIs give a process the
ability to hardwire pages in memory so they will not be paged out and to undo this
property. A real-time program might need pages with this property to avoid page
faults to disk during critical operations, for example. A limit is enforced by the op-
erating system to prevent processes from getting too greedy. The pages actually
can be removed from memory, but only if the entire process is swapped out. When
it is brought back, all the locked pages are reloaded before any thread can start run-
ning again. Although not shown in Fig. 11-29, Windows also has native API func-
tions to allow a process to access the virtual memory of a different process over
which it has been given control, that is, for which it has a handle (see Fig. 11-7).
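The rounding rules just described (regions start on 64-KB boundaries, sizes are multiples of the page size) amount to simple arithmetic; the helper below is a sketch with our own names, not part of the Win32 API:

```python
# Illustrative arithmetic for virtual-address allocation: the base is
# rounded down to a 64-KB boundary, the size up to a page multiple.
GRANULARITY = 64 * 1024   # allocation granularity
PAGE_SIZE = 4096          # x86 page size

def round_region(addr, size):
    base = addr - (addr % GRANULARITY)        # round base down to 64 KB
    end = addr + size
    pages = -(-(end - base) // PAGE_SIZE)     # round size up to whole pages
    return base, pages * PAGE_SIZE

base, size = round_region(0x12345, 1000)
# base is 0x10000 (a 64-KB boundary); size is a multiple of 4096
```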
The last four API functions listed are for managing memory-mapped files. To
map a file, a file-mapping object must first be created with CreateFileMapping (see
Fig. 11-8). This function returns a handle to the file-mapping object (i.e., a section
object) and optionally enters a name for it into the Win32 namespace so that other
processes can use it, too. The next two functions map and unmap views on section
objects from a process’ virtual address space. The last API can be used by a
process to share a mapping that another process created with CreateFileMapping,
usually one created to map anonymous memory. In this way, two or more proc-
esses can share regions of their address spaces. This technique allows them to
write in limited regions of each other’s virtual memory.
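The sharing effect described above can be demonstrated portably: two views mapped onto the same file see each other's writes. Python's mmap wraps MapViewOfFile on Windows and mmap() on UNIX, so this is an analogue of the mechanism, not Win32 code:

```python
import mmap, os, tempfile

# Two views onto the same backing file stand in for two processes
# sharing a region of their address spaces via a file mapping.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4096)                 # one page of shared backing store

view_a = mmap.mmap(fd, 4096)           # stands in for process A's view
view_b = mmap.mmap(fd, 4096)           # stands in for process B's view

view_a[0:5] = b"hello"                 # a write through one view...
shared_ok = (view_b[0:5] == b"hello")  # ...is visible through the other

view_a.close(); view_b.close(); os.close(fd); os.remove(path)
```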
Figure 11-30. Mapped regions with their shadow pages on disk. The lib.dll
file is mapped into two address spaces at the same time.
Unlike the scheduler, which selects individual threads to run and does not care
much about processes, the memory manager deals entirely with processes and does
not care much about threads. After all, processes, not threads, own the address
space and that is what the memory manager is concerned with. When a region of
virtual address space is allocated, as four of them have been for process A in
Fig. 11-30, the memory manager creates a VAD (Virtual Address Descriptor) for
it, listing the range of addresses mapped, the section representing the backing store
file and offset where it is mapped, and the permissions. When the first page is
touched, the directory of page tables is created and its physical address is inserted
into the process object. An address space is completely defined by the list of its
VADs. The VADs are organized into a balanced tree, so that the descriptor for a
particular address can be found efficiently. This scheme supports sparse address
spaces. Unused areas between the mapped regions use no resources (memory or
disk) so they are essentially free.
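The VAD lookup just described can be sketched with a sorted list standing in for the balanced tree; the point is the same O(log n) search, and gaps between VADs cost nothing (fields here are simplified):

```python
import bisect

# Descriptors kept sorted by start address, so the VAD covering a
# faulting address is found by binary search.
class VAD:
    def __init__(self, start, end, protect):
        self.start, self.end, self.protect = start, end, protect

def find_vad(vads, addr):
    # Find the rightmost VAD starting at or below addr, then check
    # that addr actually falls inside it.
    i = bisect.bisect_right([v.start for v in vads], addr) - 1
    if i >= 0 and vads[i].start <= addr < vads[i].end:
        return vads[i]
    return None   # unused gap: no VAD, no resources consumed

vads = [VAD(0x10000, 0x20000, "rw"), VAD(0x7F000000, 0x7F100000, "rx")]
assert find_vad(vads, 0x15000).protect == "rw"
assert find_vad(vads, 0x30000) is None
```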
Page-Fault Handling
When a process starts on Windows, many of the pages mapping the program’s
EXE and DLL image files may already be in memory because they are shared with
other processes. The writable pages of the images are marked copy-on-write so
that they can be shared up to the point they need to be modified. If the operating
system recognizes the EXE from a previous execution, it may have recorded the
page-reference pattern, using a technology Microsoft calls SuperFetch. Super-
Fetch attempts to prepage many of the needed pages even though the process has
not faulted on them yet. This reduces the latency for starting up applications by
overlapping the reading of the pages from disk with the execution of the ini-
tialization code in the images. It improves throughput to disk because it is easier
for the disk drivers to organize the reads to reduce the seek time needed. Process
prepaging is also used during boot of the system, when a background application
moves to the foreground, and when restarting the system after hibernation.
Prepaging is supported by the memory manager, but implemented as a separate
component of the system. The pages brought in are not inserted into the process’
page table, but instead are inserted into the standby list from which they can quick-
ly be inserted into the process as needed without accessing the disk.
Nonmapped pages are slightly different in that they are not initialized by read-
ing from the file. Instead, the first time a nonmapped page is accessed the memory
manager provides a new physical page, making sure the contents are all zeroes (for
security reasons). On subsequent faults a nonmapped page may need to be found
in memory or else must be read back from the pagefile.
Demand paging in the memory manager is driven by page faults. On each
page fault, a trap to the kernel occurs. The kernel then builds a machine-indepen-
dent descriptor telling what happened and passes this to the memory-manager part
of the executive. The memory manager then checks the access for validity. If the
faulted page falls within a committed region, it looks up the address in the list of
VADs and finds (or creates) the process page-table entry. In the case of a shared
page, the memory manager uses the prototype page-table entry associated with the
section object to fill in the new page-table entry for the process page table.
The format of the page-table entries differs depending on the processor archi-
tecture. For the x86 and x64, the entries for a mapped page are shown in
Fig. 11-31. If an entry is marked valid, its contents are interpreted by the hardware
so that the virtual address can be translated into the correct physical page. Unmap-
ped pages also have entries, but they are marked invalid and the hardware ignores
the rest of the entry. The software format is somewhat different from the hardware
format and is determined by the memory manager. For example, for an unmapped
page that must be allocated and zeroed before it may be used, that fact is noted in
the page-table entry.
[Figure body: bit 63 NX; bits 62–52 AVL; bits 51–12 physical page number;
bits 11–9 AVL; then G, PAT, D, A, PCD, PWT, U/S, R/W, and P in bits 8 down to 0.]
Figure 11-31. A page-table entry (PTE) for a mapped page on the Intel x86 and
AMD x64 architectures.
Two important bits in the page-table entry are updated by the hardware direct-
ly. These are the access (A) and dirty (D) bits. These bits keep track of when a
particular page mapping has been used to access the page and whether that access
could have modified the page by writing it. This really helps the performance of
the system because the memory manager can use the access bit to implement the
LRU (Least-Recently Used) style of paging. The LRU principle says that pages
which have not been used the longest are the least likely to be used again soon.
The access bit allows the memory manager to determine that a page has been ac-
cessed. The dirty bit lets the memory manager know that a page may have been
modified, or more significantly, that a page has not been modified. If a page has
not been modified since being read from disk, the memory manager does not have
to write the contents of the page to disk before using it for something else.
Both the x86 and x64 use a 64-bit page-table entry, as shown in Fig. 11-31.
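The hardware fields of the PTE in Fig. 11-31 can be extracted with straightforward bit operations (a sketch for 4-KB pages, using the bit positions from the figure):

```python
# Decoding the hardware PTE fields of Fig. 11-31 (x86/x64, 4-KB pages).
def decode_pte(pte):
    return {
        "present":  bool(pte & 1),              # P, bit 0
        "writable": bool(pte >> 1 & 1),         # R/W, bit 1
        "user":     bool(pte >> 2 & 1),         # U/S, bit 2
        "accessed": bool(pte >> 5 & 1),         # A, bit 5 (set by hardware)
        "dirty":    bool(pte >> 6 & 1),         # D, bit 6 (set by hardware)
        "pfn":      (pte >> 12) & ((1 << 40) - 1),  # bits 51..12
        "nx":       bool(pte >> 63 & 1),        # NX, bit 63
    }

pte = (1 << 63) | (0x1234 << 12) | (1 << 6) | (1 << 5) | 0b011
d = decode_pte(pte)
assert d["present"] and d["writable"] and d["dirty"] and not d["user"]
assert d["pfn"] == 0x1234 and d["nx"]
```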
Each page fault can be considered as being in one of five categories:
1. The page referenced is not committed.
2. Access to a page has been attempted in violation of the permissions.
3. A shared copy-on-write page was about to be modified.
4. The stack needs to grow.
5. The page referenced is committed but not currently mapped in.
The first and second cases are due to programming errors. If a program at-
tempts to use an address which is not supposed to have a valid mapping, or at-
tempts an invalid operation (like attempting to write a read-only page) this is called
an access violation and usually results in termination of the process. Access viola-
tions are often the result of bad pointers, including accessing memory that was
freed and unmapped from the process.
The third case has the same symptoms as the second one (an attempt to write
to a read-only page), but the treatment is different. Because the page has been
marked as copy-on-write, the memory manager does not report an access violation,
but instead makes a private copy of the page for the current process and then re-
turns control to the thread that attempted to write the page. The thread will retry
the write, which will now complete without causing a fault.
The fourth case occurs when a thread pushes a value onto its stack and crosses
onto a page which has not been allocated yet. The memory manager is program-
med to recognize this as a special case. As long as there is still room in the virtual
pages reserved for the stack, the memory manager will supply a new physical page,
zero it, and map it into the process. When the thread resumes running, it will retry
the access and succeed this time around.
Finally, the fifth case is a normal page fault. However, it has several subcases.
If the page is mapped by a file, the memory manager must search its data struc-
tures, such as the prototype page table associated with the section object to be sure
that there is not already a copy in memory. If there is, say in another process or on
the standby or modified page lists, it will just share it—perhaps marking it as copy-
on-write if changes are not supposed to be shared. If there is not already a copy,
the memory manager will allocate a free physical page and arrange for the file
page to be copied in from disk, unless the page is already transitioning in
from disk, in which case it is only necessary to wait for the transition to complete.
When the memory manager can satisfy a page fault by finding the needed page
in memory rather than reading it in from disk, the fault is classified as a soft fault.
If the copy from disk is needed, it is a hard fault. Soft faults are much cheaper,
and have little impact on application performance compared to hard faults. Soft
faults can occur because a shared page has already been mapped into another proc-
ess, or only a new zero page is needed, or the needed page was trimmed from the
process’ working set but is being requested again before it has had a chance to be
reused. Soft faults can also occur because pages have been compressed to ef-
fectively increase the size of physical memory. For most configurations of CPU,
memory, and I/O in current systems it is more efficient to use compression rather
than incur the I/O expense (performance and energy) required to read a page from
disk.
When a physical page is no longer mapped by the page table in any process it
goes onto one of three lists: free, modified, or standby. Pages that will never be
needed again, such as stack pages of a terminating process, are freed immediately.
Pages that may be faulted again go to either the modified list or the standby list,
depending on whether or not the dirty bit was set for any of the page-table entries
that mapped the page since it was last read from disk. Pages in the modified list
will be eventually written to disk, then moved to the standby list.
The memory manager can allocate pages as needed using either the free list or
the standby list. Before allocating a page and copying it in from disk, the memory
manager always checks the standby and modified lists to see if it already has the
page in memory. The prepaging scheme in Windows thus converts future hard
faults into soft faults by reading in the pages that are expected to be needed and
pushing them onto the standby list. The memory manager itself does a small
amount of ordinary prepaging by accessing groups of consecutive pages rather than
single pages. The additional pages are immediately put on the standby list. This is
not generally wasteful because the overhead in the memory manager is very much
dominated by the cost of doing a single I/O. Reading a cluster of pages rather than
a single page is negligibly more expensive.
The page-table entries in Fig. 11-31 refer to physical page numbers, not virtual
page numbers. To update page-table (and page-directory) entries, the kernel needs
to use virtual addresses. Windows maps the page tables and page directories for
the current process into kernel virtual address space using self-map entries in the
page directory, as shown in Fig. 11-32. By making page-directory entries point at
the page directory (the self-map), there are virtual addresses that can be used to
refer to page-directory entries (a) as well as page table entries (b). The self-map
occupies the same 8 MB of kernel virtual addresses for every process (on the x86).
For simplicity the figure shows the x86 self-map for 32-bit PTEs (Page-Table
Entries). Windows actually uses 64-bit PTEs so the system can make use of
more than 4 GB of physical memory. With 32-bit PTEs, the self-map uses only
one PDE (Page-Directory Entry) in the page directory, and thus occupies only 4
MB of addresses rather than 8 MB.
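For the 32-bit layout just described (self-map at page-directory slot 0x300, i.e., base address 0xC0000000), the kernel virtual address of any PDE or PTE follows from the self-map arithmetic, matching the two examples in Fig. 11-32:

```python
# Computing self-map virtual addresses for x86 with 32-bit PTEs.
SELF = 0x300   # the self-map slot in the page directory

def pde_va(pdi):
    # Walking the self-map entry twice lands in the page directory itself,
    # so this address names the PDE for page-directory index pdi.
    return (SELF << 22) | (SELF << 12) | (pdi << 2)

def pte_va(pdi, pti):
    # Walking the self-map entry once lands in the page tables, so this
    # address names the PTE for (page-directory index, page-table index).
    return (SELF << 22) | (pdi << 12) | (pti << 2)

assert pde_va(0x300) == 0xC0300C00         # example (a) in Fig. 11-32
assert pte_va(0x390, 0x321) == 0xC0390C84  # example (b) in Fig. 11-32
```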
When the number of free physical memory pages starts to get low, the memory
manager starts working to make more physical pages available by removing them
from user-mode processes as well as the system process, which represents kernel-
mode use of pages. The goal is to have the most important virtual pages present in
memory and the others on disk. The trick is in determining what important means.
In Windows this is answered by making heavy use of the working-set concept.
Each process (not each thread) has a working set. This set consists of the map-
ped-in pages that are in memory and thus can be referenced without a page fault.
The size and composition of the working set fluctuates as the process’ threads run,
of course.
Each process’ working set is described by two parameters: the minimum size
and the maximum size. These are not hard bounds, so a process may have fewer
pages in memory than its minimum or (under certain circumstances) more than its
maximum. Every process starts with the same minimum and maximum, but these
bounds can change over time, or can be determined by the job object for processes
contained in a job. The default initial minimum is in the range 20–50 pages and
[Figure body: CR3 points to the page directory; self-map slot 0x300 of the
page directory points back at the page directory itself. (a) Virtual address
0xC0300C00 selects a page-directory entry; (b) virtual address 0xC0390C84
selects the page-table entry for page-directory index 0x390, page-table
index 0x321.]
Figure 11-32. The Windows self-map entries are used to map the physical pages
of the page tables and page directory into kernel virtual addresses (shown for
32-bit PTEs).
the default initial maximum is in the range 45–345 pages, depending on the total
amount of physical memory in the system. The system administrator can change
these defaults, however. While few home users will try, server admins might.
Working sets come into play only when the available physical memory is get-
ting low in the system. Otherwise processes are allowed to consume memory as
they choose, often far exceeding the working-set maximum. But when the system
comes under memory pressure, the memory manager starts to squeeze processes
back into their working sets, starting with processes that are over their maximum
by the most. There are three levels of activity by the working-set manager, all of
which run periodically, driven by a timer. New activity is added at each level:
1. Lots of memory available: Scan pages resetting access bits and
using their values to represent the age of each page. Keep an estimate
of the unused pages in each working set.
2. Memory getting tight: For any process with a significant proportion
of unused pages, stop adding pages to the working set and start
replacing the oldest pages whenever a new page is needed. The re-
placed pages go to the standby or modified list.
3. Memory is tight: Trim (i.e., reduce) working sets to be below their
maximum by removing the oldest pages.
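The aging-and-trimming policy in levels 1 and 3 can be sketched as follows; the dictionaries are simplified stand-ins for working-set entries, not real Windows structures:

```python
# Scanning resets access bits and ages pages; trimming evicts oldest first.
def scan(pages):
    # Level 1: reset access bits, using them to age each page.
    for p in pages:
        if p["accessed"]:
            p["age"] = 0
            p["accessed"] = False
        else:
            p["age"] += 1

def trim(pages, target):
    # Level 3: remove the oldest pages until the set reaches its target.
    pages.sort(key=lambda p: p["age"], reverse=True)
    evicted = pages[:len(pages) - target]
    remaining = pages[len(pages) - target:]
    return remaining, evicted   # evicted pages go to standby/modified lists

ws = [{"accessed": i % 2 == 0, "age": 0} for i in range(6)]
scan(ws); scan(ws)              # two scan passes with no further accesses
ws, out = trim(ws, 4)           # squeeze the working set down to 4 pages
assert len(ws) == 4 and len(out) == 2
```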
The working-set manager runs every second, called from the balance set manager
thread. The working-set manager throttles the amount of work it does to keep
from overloading the system. It also monitors the writing of pages on the modified
list to disk to be sure that the list does not grow too large, waking the Modified-
PageWriter thread as needed.
Above we mentioned three different lists of physical pages, the free list, the
standby list, and the modified list. There is a fourth list which contains free pages
that have been zeroed. The system frequently needs pages that contain all zeros.
When new pages are given to processes, or the final partial page at the end of a file
is read, a zero page is needed. It is time consuming to write a page with zeros, so
it is better to create zero pages in the background using a low-priority thread.
There is also a fifth list used to hold pages that have been detected as having hard-
ware errors (i.e., through hardware error detection).
All pages in the system either are referenced by a valid page-table entry or are
on one of these five lists, which are collectively called the PFN database (Page
Frame Number database). Fig. 11-33 shows the structure of the PFN Database.
The table is indexed by physical page-frame number. The entries are fixed length,
but different formats are used for different kinds of entries (e.g., shared vs. private).
Valid entries maintain the page’s state and a count of how many page tables point
to the page, so that the system can tell when the page is no longer in use. Pages
that are in a working set tell which entry references them. There is also a pointer
to the process page table that points to the page (for nonshared pages) or to the
prototype page table (for shared pages).
Additionally there is a link to the next page on the list (if any), and various
other fields and flags, such as read in progress, write in progress, and so on. To
save space, the lists are linked together with fields referring to the next element by
its index within the table rather than pointers. The table entries for the physical
pages are also used to summarize the dirty bits found in the various page table en-
tries that point to the physical page (i.e., because of shared pages). There is also
information used to represent differences in memory pages on larger server sys-
tems which have memory that is faster from some processors than from others,
namely NUMA machines.
Pages are moved between the working sets and the various lists by the work-
ing-set manager and other system threads. Let us examine the transitions. When
the working-set manager removes a page from a working set, the page goes on the
bottom of the standby or modified list, depending on its state of cleanliness. This
transition is shown as (1) in Fig. 11-34.
Pages on both lists are still valid pages, so if a page fault occurs and one of
these pages is needed, it is removed from the list and faulted back into the working
set without any disk I/O (2). When a process exits, its nonshared pages cannot be
[Figure body: a table indexed by physical page-frame number (0–14); each
entry records the frame’s state (Active, Clean, Dirty, Free, or Zeroed)
together with a link field chaining it onto its list, and list headers point
to the first frame on the standby, modified, free, and zeroed lists.]
Figure 11-33. Some of the major fields in the page-frame database for a valid
page.
faulted back to it, so the valid pages in its page table and any of its pages on the
modified or standby lists go on the free list (3). Any pagefile space in use by the
process is also freed.
[Figure body: the working sets and the standby, modified, free, zeroed, and
bad-memory page lists, with the numbered transitions (1)–(8) described in the
text, e.g., (1) page evicted from all working sets, (3) process exit, and
(8) zero page needed.]
Figure 11-34. The various page lists and the transitions between them.
Other transitions are caused by other system threads. Every 4 seconds the bal-
ance set manager thread runs and looks for processes all of whose threads have
been idle for a certain number of seconds. If it finds any such processes, their
kernel stacks are unpinned from physical memory and their pages are moved to the
standby or modified lists, also shown as (1).
Two other system threads, the mapped page writer and the modified page
writer, wake up periodically to see if there are enough clean pages. If not, they
take pages from the top of the modified list, write them back to disk, and then
move them to the standby list (4). The former handles writes to mapped files and
the latter handles writes to the pagefiles. The result of these writes is to transform
modified (dirty) pages into standby (clean) pages.
The reason for having two threads is that a mapped file might have to grow as
a result of the write, and growing it requires access to on-disk data structures to al-
locate a free disk block. If there is no room in memory to bring them in when a
page has to be written, a deadlock could result. The other thread can solve the
problem by writing out pages to a paging file.
The other transitions in Fig. 11-34 are as follows. If a process unmaps a page,
the page is no longer associated with a process and can go on the free list (5), ex-
cept for the case that it is shared. When a page fault requires a page frame to hold
the page about to be read in, the page frame is taken from the free list (6), if pos-
sible. It does not matter that the page may still contain confidential information
because it is about to be overwritten in its entirety.
The situation is different when a stack grows. In that case, an empty page
frame is needed and the security rules require the page to contain all zeros. For
this reason, another kernel system thread, the ZeroPage thread, runs at the lowest
priority (see Fig. 11-26), erasing pages that are on the free list and putting them on
the zeroed page list (7). Whenever the CPU is idle and there are free pages, they
might as well be zeroed since a zeroed page is potentially more useful than a free
page and it costs nothing to zero the page when the CPU is idle.
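The list transitions described above can be modeled as a handful of queues; the function names are ours and the model is deliberately minimal:

```python
from collections import deque

# A toy model of the page-list transitions of Fig. 11-34.
free, zeroed, standby, modified = deque(), deque(), deque(), deque()

def evict(page, dirty):            # (1) page leaves all working sets
    (modified if dirty else standby).append(page)

def writer():                      # (4) modified page writer cleans a page
    if modified:
        standby.append(modified.popleft())

def zero_thread():                 # (7) ZeroPage thread runs when CPU idle
    while free:
        zeroed.append(free.popleft())

def need_page(must_be_zero):       # (6)/(8) satisfy a fault or stack growth
    src = zeroed if must_be_zero else (free or standby or zeroed)
    return src.popleft()

evict("p1", dirty=True)
writer()                           # p1 moves from modified to standby
assert list(standby) == ["p1"]
free.append("p2"); zero_thread()
assert need_page(must_be_zero=True) == "p2"
```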
The existence of all these lists leads to some subtle policy choices. For ex-
ample, suppose that a page has to be brought in from disk and the free list is empty.
The system is now forced to choose between taking a clean page from the standby
list (which might otherwise have been faulted back in later) or an empty page from
the zeroed page list (throwing away the work done in zeroing it). Which is better?
The memory manager has to decide how aggressively the system threads
should move pages from the modified list to the standby list. Having clean pages
around is better than having dirty pages around (since clean ones can be reused in-
stantly), but an aggressive cleaning policy means more disk I/O and there is some
chance that a newly cleaned page may be faulted back into a working set and dirt-
ied again anyway. In general, Windows resolves these kinds of trade-offs through
algorithms, heuristics, guesswork, historical precedent, rules of thumb, and
administrator-controlled parameter settings.
Modern Windows introduced an additional abstraction layer at the bottom of
the memory manager, called the store manager. This layer makes decisions about
how to optimize the I/O operations to the available backing stores. Persistent stor-
age systems include auxiliary flash memory and SSDs in addition to rotating disks.
The store manager optimizes where and how physical memory pages are backed
by the persistent stores in the system. It also implements optimization techniques
such as copy-on-write sharing of identical physical pages and compression of the
pages in the standby list to effectively increase the available RAM.
Another change in memory management in Modern Windows is the introduc-
tion of a swap file. Historically memory management in Windows has been based
on working sets, as described above. As memory pressure increases, the memory
manager squeezes on the working sets to reduce the footprint each process has in
memory. The modern application model introduces opportunities for new efficien-
cies. Since the process containing the foreground part of a modern application is
no longer given processor resources once the user has switched away, there is no
need for its pages to be resident. As memory pressure builds in the system, the
pages in the process may be removed as part of normal working-set management.
However, the process lifetime manager knows how long it has been since the user
switched to the application’s foreground process. When more memory is needed it
picks a process that has not run in a while and calls into the memory manager to
efficiently swap all the pages in a small number of I/O operations. The pages will
be written to the swap file by aggregating them into one or more large chunks.
This means that the entire process can also be restored in memory with fewer I/O
operations.
All in all, memory management is a highly complex executive component with
many data structures, algorithms, and heuristics. It attempts to be largely self-
tuning, but there are also many knobs that administrators can tweak to affect system
performance. A number of these knobs and the associated counters can be viewed
using tools in the various tool kits mentioned earlier. Probably the most important
thing to remember here is that memory management in real systems is a lot more
than just one simple paging algorithm like clock or aging.
The Windows cache-manager facilities are shared among all the file systems.
Because the cache is virtually addressed according to individual files, the cache
manager is easily able to perform read-ahead on a per-file basis. Requests to ac-
cess cached data come from each file system. Virtual caching is convenient be-
cause the file systems do not have to first translate file offsets into physical block
numbers before requesting a cached file page. Instead, the translation happens
later when the memory manager calls the file system to access the page on disk.
Besides management of the kernel virtual address and physical memory re-
sources used for caching, the cache manager also has to coordinate with file sys-
tems regarding issues like coherency of views, flushing to disk, and correct mainte-
nance of the end-of-file marks—particularly as files expand. One of the most dif-
ficult aspects of a file to manage between the file system, the cache manager, and
the memory manager is the offset of the last byte in the file, called the ValidData-
Length. If a program writes past the end of the file, the blocks that were skipped
have to be filled with zeros, and for security reasons it is critical that the Valid-
DataLength recorded in the file metadata not allow access to uninitialized blocks,
so the zero blocks have to be written to disk before the metadata is updated with
the new length. While it is expected that if the system crashes, some of the blocks
in the file might not have been updated from memory, it is not acceptable that some
of the blocks might contain data previously belonging to other files.
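The ordering constraint on ValidDataLength can be made concrete with a sketch (names and structure are illustrative, not the file-system code): zero the skipped range on disk first, then write the data, and only then advance the recorded length.

```python
# Writing past the current valid data length: zero the gap before the
# metadata is updated, so a crash never exposes uninitialized blocks.
def write_past_eof(disk, valid_data_length, offset, data):
    log = []
    if offset > valid_data_length:
        gap = offset - valid_data_length
        disk[valid_data_length:offset] = b"\0" * gap
        log.append("zeroed %d bytes" % gap)        # step 1: zero the gap
    disk[offset:offset + len(data)] = data          # step 2: write the data
    valid_data_length = max(valid_data_length, offset + len(data))
    log.append("metadata updated")                  # step 3: only now update
    return valid_data_length, log

disk = bytearray(64)
vdl, log = write_past_eof(disk, 8, 16, b"abcd")
assert vdl == 20 and disk[8:16] == b"\0" * 8 and disk[16:20] == b"abcd"
assert log == ["zeroed 8 bytes", "metadata updated"]
```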
Let us now examine how the cache manager works. When a file is referenced,
the cache manager maps a 256-KB chunk of kernel virtual address space onto the
file. If the file is larger than 256 KB, only a portion of the file is mapped at a time.
If the cache manager runs out of 256-KB chunks of virtual address space, it must
unmap an old file before mapping in a new one. Once a file is mapped, the cache
manager can satisfy requests for its blocks by just copying from kernel virtual ad-
dress space to the user buffer. If the block to be copied is not in physical memory,
a page fault will occur and the memory manager will satisfy the fault in the usual
way. The cache manager is not even aware of whether the block was in memory or
not. The copy always succeeds.
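The view bookkeeping just described reduces to two ideas: a file offset selects one 256-KB chunk, and a bounded set of mapped views is recycled when slots run out. A sketch (the recycling policy here is simple LRU; the text says only that an old file is unmapped):

```python
VIEW_SIZE = 256 * 1024   # the 256-KB mapping granularity

# Each file offset falls in exactly one 256-KB view; a bounded set of
# views is recycled, oldest first, when no slot is available.
class CacheViews:
    def __init__(self, max_views):
        self.max_views, self.views = max_views, []   # oldest view first

    def view_for(self, file_id, offset):
        key = (file_id, offset // VIEW_SIZE)   # which 256-KB chunk of the file
        if key in self.views:
            self.views.remove(key)             # already mapped: refresh it
        elif len(self.views) >= self.max_views:
            self.views.pop(0)                  # unmap the oldest view
        self.views.append(key)
        return key

c = CacheViews(max_views=2)
assert c.view_for("f", 300_000) == ("f", 1)    # offsets 256 KB..512 KB
c.view_for("g", 0); c.view_for("h", 0)         # forces ("f", 1) out
assert ("f", 1) not in c.views
```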
The cache manager also works for pages that are mapped into virtual memory
and accessed with pointers rather than being copied between kernel and user-mode
buffers. When a thread accesses a virtual address mapped to a file and a page fault
occurs, the memory manager may in many cases be able to satisfy the access as a
soft fault. It does not need to access the disk, since it finds that the page is already
in physical memory because it is mapped by the cache manager.
and play) and power management for devices and the CPU—all using a fundamen-
tally asynchronous structure that allows computation to overlap with I/O transfers.
There are many hundreds of thousands of devices that work with Windows. For a
large number of common devices it is not even necessary to install a driver, be-
cause there is already a driver that shipped with the Windows operating system.
But even so, counting all the revisions, there are almost a million distinct driver
binaries that run on Windows. In the following sections we will examine some of
the issues relating to I/O.
The I/O manager is on intimate terms with the plug-and-play manager. The
basic idea behind plug and play is that of an enumerable bus. Many buses, includ-
ing PC Card, PCI, PCIe, AGP, USB, IEEE 1394, EIDE, SCSI, and SATA, have
been designed so that the plug-and-play manager can send a request to each slot
and ask the device there to identify itself. Having discovered what is out there, the
plug-and-play manager allocates hardware resources, such as interrupt levels,
locates the appropriate drivers, and loads them into memory. As each driver is
loaded, a driver object is created for it. And then for each device, at least one de-
vice object is allocated. For some buses, such as SCSI, enumeration happens only
at boot time, but for other buses, such as USB, it can happen at any time, requiring
close cooperation between the plug-and-play manager, the bus drivers (which ac-
tually do the enumerating), and the I/O manager.
In Windows, all the file systems, antivirus filters, volume managers, network
protocol stacks, and even kernel services that have no associated hardware are im-
plemented using I/O drivers. The system configuration must be set to cause some
of these drivers to load, because there is no associated device to enumerate on the
bus. Others, like the file systems, are loaded by special code that detects they are
needed, such as the file-system recognizer that looks at a raw volume and deci-
phers what type of file system format it contains.
An interesting feature of Windows is its support for dynamic disks. These
disks may span multiple partitions and even multiple disks and may be reconfig-
ured on the fly, without even having to reboot. In this way, logical volumes are no
longer constrained to a single partition or even a single disk so that a single file
system may span multiple drives in a transparent way.
The I/O to volumes can be filtered by a special Windows driver to produce
Volume Shadow Copies. The filter driver creates a snapshot of the volume which
can be separately mounted and represents a volume at a previous point in time. It
does this by keeping track of changes after the snapshot point. This is very con-
venient for recovering files that were accidentally deleted, or traveling back in time
to see the state of a file at periodic snapshots made in the past.
But shadow copies are also valuable for making accurate backups of server
systems. The operating system works with server applications to have them reach
SEC. 11.7 INPUT/OUTPUT IN WINDOWS 945
a convenient point for making a clean backup of their persistent state on the vol-
ume. Once all the applications are ready, the system initializes the snapshot of the
volume and then tells the applications that they can continue. The backup is made
of the volume state at the point of the snapshot. And the applications were only
blocked for a very short time rather than having to go offline for the duration of the
backup.
Applications participate in the snapshot process, so the backup reflects a state
that is easy to recover in case there is a future failure. Otherwise the backup might
still be useful, but the state it captured would look more like the state if the system
had crashed. Recovering from a system at the point of a crash can be more dif-
ficult or even impossible, since crashes occur at arbitrary times in the execution of
the application. Murphy’s Law says that crashes are most likely to occur at the
worst possible time, that is, when the application data is in a state where recovery
is impossible.
Another aspect of Windows is its support for asynchronous I/O. It is possible
for a thread to start an I/O operation and then continue executing in parallel with
the I/O. This feature is especially important on servers. There are various ways
the thread can find out that the I/O has completed. One is to specify an event ob-
ject at the time the call is made and then wait on it eventually. Another is to speci-
fy a queue to which a completion event will be posted by the system when the I/O
is done. A third is to provide a callback procedure that the system calls when the
I/O has completed. A fourth is to poll a location in memory that the I/O manager
updates when the I/O completes.
The final aspect that we will mention is prioritized I/O. I/O priority is deter-
mined by the priority of the issuing thread, or it can be explicitly set. There are
five priorities specified: critical, high, normal, low, and very low. Critical is re-
served for the memory manager to avoid deadlocks that could otherwise occur
when the system experiences extreme memory pressure. Low and very low priori-
ties are used by background processes, like the disk defragmentation service and
spyware scanners and desktop search, which are attempting to avoid interfering
with normal operations of the system. Most I/O gets normal priority, but multi-
media applications can mark their I/O as high to avoid glitches. Multimedia appli-
cations can alternatively use bandwidth reservation to request guaranteed band-
width to access time-critical files, like music or video. The I/O system will pro-
vide the application with the optimal transfer size and the number of outstanding
I/O operations that should be maintained to allow the I/O system to achieve the re-
quested bandwidth guarantee.
The system call APIs provided by the I/O manager are not very different from
those offered by most other operating systems. The basic operations are open,
read, write, ioctl, and close, but there are also plug-and-play and power operations,
946 CASE STUDY 2: WINDOWS 8 CHAP. 11
operations for setting parameters, as well as calls for flushing system buffers, and
so on. At the Win32 layer these APIs are wrapped by interfaces that provide high-
er-level operations specific to particular devices. At the bottom, though, these
wrappers open devices and perform these basic types of operations. Even some
metadata operations, such as file rename, are implemented without specific system
calls. They just use a special version of the ioctl operations. This will make more
sense when we explain the implementation of I/O device stacks and the use of
IRPs by the I/O manager.
The native NT I/O system calls, in keeping with the general philosophy of
Windows, take numerous parameters, and include many variations. Figure 11-35
lists the primary system-call interfaces to the I/O manager. NtCreateFile is used to
open existing or new files. It provides security descriptors for new files, a rich de-
scription of the access rights requested, and gives the creator of new files some
control over how blocks will be allocated. NtReadFile and NtWriteFile take a file
handle, buffer, and length. They also take an explicit file offset, and allow a key to
be specified for accessing locked ranges of bytes in the file. Most of the parame-
ters are related to specifying which of the different methods to use for reporting
completion of the (possibly asynchronous) I/O, as described above.
NtQueryDirectoryFile is an example of a standard paradigm in the executive
where various Query APIs exist to access or modify information about specific
types of objects. In this case, it is file objects that refer to directories. A parameter
specifies what type of information is being requested, such as a list of the names in
the directory or detailed information about each file that is needed for an extended
directory listing. Since this is really an I/O operation, all the standard ways of
reporting that the I/O completed are supported. NtQueryVolumeInformationFile is
like the directory query operation, but expects a file handle which represents an
open volume which may or may not contain a file system. Unlike for directories,
there are parameters that can be modified on volumes, and thus there is a separate
API NtSetVolumeInformationFile.
NtNotifyChangeDirectoryFile is an example of an interesting NT paradigm.
Threads can do I/O to determine whether any changes occur to objects (mainly
file-system directories, as in this case, or registry keys). Because the I/O is asyn-
chronous the thread returns and continues, and is only notified later when some-
thing is modified. The pending request is queued in the file system as an outstand-
ing I/O operation using an I/O Request Packet. Notifications are problematic if
you want to remove a file-system volume from the system, because the I/O opera-
tions are pending. So Windows supports facilities for canceling pending I/O oper-
ations, including support in the file system for forcibly dismounting a volume with
pending I/O.
NtQueryInformationFile is the file-specific version of the system call for direc-
tories. It has a companion system call, NtSetInformationFile. These interfaces ac-
cess and modify all sorts of information about file names, file features like en-
cryption and compression and sparseness, and other file attributes and details, in-
cluding looking up the internal file id or assigning a unique binary name (object id)
to a file.
These system calls are essentially a form of ioctl specific to files. The set oper-
ation can be used to rename or delete a file. But note that they take handles, not
file names, so a file first must be opened before being renamed or deleted. They
can also be used to rename the alternative data streams on NTFS (see Sec. 11.8).
Separate APIs, NtLockFile and NtUnlockFile, exist to set and remove byte-
range locks on files. NtCreateFile allows access to an entire file to be restricted by
using a sharing mode. An alternative is these lock APIs, which apply mandatory
access restrictions to a range of bytes in the file. Reads and writes must supply a
key matching the key provided to NtLockFile in order to operate on the locked
ranges.
Similar facilities exist in UNIX, but there it is discretionary whether applica-
tions heed the range locks. NtFsControlFile is much like the preceding Query and
Set operations, but is a more generic operation aimed at handling file-specific oper-
ations that do not fit within the other APIs. For example, some operations are spe-
cific to a particular file system.
Finally, there are miscellaneous calls such as NtFlushBuffersFile. Like the
UNIX sync call, it forces file-system data to be written back to disk. NtCancel-
IoFile cancels outstanding I/O requests for a particular file, and NtDeviceIoCon-
trolFile implements ioctl operations for devices. The list of operations is actually
much longer. There are system calls for deleting files by name, and for querying
the attributes of a specific file—but these are just wrappers around the other I/O
manager operations we have listed and did not really need to be implemented as
separate system calls. There are also system calls for dealing with I/O completion
ports, a queuing facility in Windows that helps multithreaded servers make ef-
ficient use of asynchronous I/O operations by readying threads by demand and
reducing the number of context switches required to service I/O on dedicated
threads.
The Windows I/O system consists of the plug-and-play services, the device
power manager, the I/O manager, and the device-driver model. Plug-and-play
detects changes in hardware configuration and builds or tears down the device
stacks for each device, as well as causing the loading and unloading of device driv-
ers. The device power manager adjusts the power state of the I/O devices to reduce
system power consumption when devices are not in use. The I/O manager pro-
vides support for manipulating I/O kernel objects, and IRP-based operations like
IoCallDriver and IoCompleteRequest. But most of the work required to support
Windows I/O is implemented by the device drivers themselves.
Device Drivers
To make sure that device drivers work well with the rest of Windows, Micro-
soft has defined the WDM (Windows Driver Model) that device drivers are ex-
pected to conform with. The WDK (Windows Driver Kit) contains docu-
mentation and examples to help developers produce drivers which conform to the
WDM. Most Windows drivers start out as copies of an appropriate sample driver
from the WDK, which is then modified by the driver writer.
Microsoft also provides a driver verifier which validates many of the actions
of drivers to be sure that they conform to the WDM requirements for the structure
and protocols for I/O requests, memory management, and so on. The verifier ships
with the system, and administrators can control it by running verifier.exe, which al-
lows them to configure which drivers are to be checked and how extensive (i.e., ex-
pensive) the checks should be.
Even with all the support for driver development and verification, it is still very
difficult to write even simple drivers in Windows, so Microsoft has built a system
of wrappers called the WDF (Windows Driver Foundation) that runs on top of
WDM and simplifies many of the more common requirements, mostly related to
correct interaction with device power management and plug-and-play operations.
To further simplify driver writing, as well as increase the robustness of the sys-
tem, WDF includes the UMDF (User-Mode Driver Framework) for writing driv-
ers as services that execute in processes. And there is the KMDF (Kernel-Mode
Driver Framework) for writing drivers as services that execute in the kernel, but
with many of the details of WDM made automagical. Since underneath it is the
WDM that provides the driver model, that is what we will focus on in this section.
Devices in Windows are represented by device objects. Device objects are also
used to represent hardware, such as buses, as well as software abstractions like file
systems, network protocol engines, and kernel extensions, such as antivirus filter
drivers. All these are organized by producing what Windows calls a device stack,
as previously shown in Fig. 11-14.
I/O operations are initiated by the I/O manager calling an executive API
IoCallDriver with pointers to the top device object and to the IRP representing the
I/O request. This routine finds the driver object associated with the device object.
The operation types that are specified in the IRP generally correspond to the I/O
manager system calls described above, such as create, read, and close.
Figure 11-36 shows the relationships for a single level of the device stack. For
each of these operations a driver must specify an entry point. IoCallDriver takes the
operation type out of the IRP, uses the device object at the current level of the de-
vice stack to find the driver object, and indexes into the driver dispatch table with
the operation type to find the corresponding entry point into the driver. The driver
is then called and passed the device object and the IRP.
Figure 11-36. A single level in a device stack: the device object points to its
driver object, whose dispatch table holds the driver's entry points for each
operation (CREATE, READ, WRITE, FLUSH, IOCTL, CLEANUP, CLOSE, …), along
with instance data and a pointer to the next device object in the stack.
Once a driver has finished processing the request represented by the IRP, it has
three options. It can call IoCallDriver again, passing the IRP and the next device
object in the device stack. It can declare the I/O request to be completed and re-
turn to its caller. Or it can queue the IRP internally and return to its caller, having
declared that the I/O request is still pending. This latter case results in an asyn-
chronous I/O operation, at least if all the drivers above in the stack agree and also
return to their callers.
Figure 11-37 shows the major fields in the IRP. The bottom of the IRP is a dy-
namically sized array containing fields that can be used by each driver for the de-
vice stack handling the request. These stack fields also allow a driver to specify
the routine to call when completing an I/O request. During completion each level
of the device stack is visited in reverse order, and the completion routine assigned
by each driver is called in turn. At each level the driver can continue to complete
the request or decide there is still more work to do and leave the request pending,
suspending the I/O completion for the time being.
Figure 11-37. The major fields of an I/O Request Packet: flags, the operation
code, kernel and user buffer addresses and buffer pointers, completion and
cancellation information, links to the issuing thread and driver, and an APC
block reused for completion queuing and communication.
When allocating an IRP, the I/O manager has to know how deep the particular
device stack is so that it can allocate a sufficiently large IRP. It keeps track of the
stack depth in a field in each device object as the device stack is formed. Note that
there is no formal definition of what the next device object is in any stack. That
information is held in private data structures belonging to the previous driver on
the stack. In fact, the stack does not really have to be a stack at all. At any layer a
driver is free to allocate new IRPs, continue to use the original IRP, send an I/O op-
eration to a different device stack, or even switch to a system worker thread to con-
tinue execution.
The IRP contains flags, an operation code for indexing into the driver dispatch
table, buffer pointers for possibly both kernel and user buffers, and a list of MDLs
(Memory Descriptor Lists) which are used to describe the physical pages repres-
ented by the buffers, that is, for DMA operations. There are fields used for cancel-
lation and completion operations. The fields in the IRP that are used to queue the
IRP to devices while it is being processed are reused when the I/O operation has
finally completed to provide memory for the APC control object used to call the
I/O manager’s completion routine in the context of the original thread. There is
also a link field used to link all the outstanding IRPs to the thread that initiated
them.
Device Stacks
A driver in Windows may do all the work by itself, as the printer driver does in
Fig. 11-38. On the other hand, drivers may also be stacked, which means that a re-
quest may pass through a sequence of drivers, each doing part of the work. Two
stacked drivers are also illustrated in Fig. 11-38.
Figure 11-38. Windows allows drivers to be stacked to work with a specific in-
stance of a device. The stacking is represented by device objects.
One common use for stacked drivers is to separate the bus management from
the functional work of controlling the device. Bus management on the PCI bus is
quite complicated on account of many kinds of modes and bus transactions. By
separating this work from the device-specific part, driver writers are freed from
learning how to control the bus. They can just use the standard bus driver in their
stack. Similarly, USB and SCSI drivers have a device-specific part and a generic
part, with common drivers being supplied by Windows for the generic part.
Another use of stacking drivers is to be able to insert filter drivers into the
stack. We have already looked at the use of file-system filter drivers, which are in-
serted above the file system. Filter drivers are also used for managing physical
hardware. A filter driver performs some transformation on the operations as the
IRP flows down the device stack, as well as during the completion operation with
the IRP flows back up through the completion routines each driver specified. For
example, a filter driver could compress data on the way to the disk or encrypt data
on the way to the network. Putting the filter here means that neither the applica-
tion program nor the true device driver has to be aware of it, and it works automat-
ically for all data going to (or coming from) the device.
Kernel-mode device drivers are a serious problem for the reliability and stabil-
ity of Windows. Most of the kernel crashes in Windows are due to bugs in device
drivers. Because kernel-mode device drivers all share the same address space with
the kernel and executive layers, errors in the drivers can corrupt system data struc-
tures, or worse. Some of these bugs are due to the astonishingly large numbers of
device drivers that exist for Windows, or to the development of drivers by less-
experienced system programmers. The bugs are also due to the enormous amount
of detail involved in writing a correct driver for Windows.
The I/O model is powerful and flexible, but all I/O is fundamentally asynchro-
nous, so race conditions can abound. Windows 2000 added the plug-and-play and
device power management facilities from the Win9x systems to the NT-based Win-
dows for the first time. This put a large number of requirements on drivers to deal
correctly with devices coming and going while I/O packets are in the middle of
being processed. Users of PCs frequently dock/undock devices, close the lid and
toss notebooks into briefcases, and generally do not worry about whether the little
green activity light happens to still be on. Writing device drivers that function cor-
rectly in this environment can be very challenging, which is why WDF was devel-
oped to simplify the Windows Driver Model.
Many books are available about the Windows Driver Model and the newer
Windows Driver Foundation (Kanetkar, 2008; Orwick & Smith, 2007; Reeves,
2010; Viscarola et al., 2007; and Vostokov, 2009).
Windows supports several file systems, the most important of which are FAT-32
and NTFS. FAT-32 uses 32-bit disk addresses and supports disk partitions up to 2
TB. There is no security in FAT-32 and today it is really used only for tran-
sportable media, like flash drives. NTFS is the file system developed specifically
for the NT version of Windows. Starting with Windows XP it became the default
file system installed by most computer manufacturers, greatly improving the secu-
rity and functionality of Windows. NTFS uses 64-bit disk addresses and can (theo-
retically) support disk partitions up to 2^64 bytes, although other considerations
limit it to smaller sizes.
In this chapter we will examine the NTFS file system because it is a modern
one with many interesting features and design innovations. It is large and complex
and space limitations prevent us from covering all of its features, but the material
presented below should give a reasonable impression of it.
Individual file names in NTFS are limited to 255 characters; full paths are lim-
ited to 32,767 characters. File names are in Unicode, allowing people in countries
not using the Latin alphabet (e.g., Greece, Japan, India, Russia, and Israel) to write
file names in their native language. For example, φιλε is a perfectly legal file
name. NTFS fully supports case-sensitive names (so foo is different from Foo and
FOO). The Win32 API does not fully support case sensitivity for file names and
does not support it at all for directory names. The support for case sensitivity exists when running
the POSIX subsystem in order to maintain compatibility with UNIX. Win32 is not
case sensitive, but it is case preserving, so file names can have different case letters
in them. Though case sensitivity is a feature that is very familiar to users of UNIX,
it is largely inconvenient to ordinary users who do not make such distinctions nor-
mally. For example, the Internet is largely case-insensitive today.
An NTFS file is not just a linear sequence of bytes, as FAT-32 and UNIX files
are. Instead, a file consists of multiple attributes, each represented by a stream of
bytes. Most files have a few short streams, such as the name of the file and its
64-bit object ID, plus one long (unnamed) stream with the data. However, a file
can also have two or more (long) data streams as well. Each stream has a name
consisting of the file name, a colon, and the stream name, as in foo:stream1. Each
stream has its own size and is lockable independently of all the other streams. The
idea of multiple streams in a file is not new in NTFS. The file system on the Apple
Macintosh uses two streams per file, the data fork and the resource fork. The first
use of multiple streams for NTFS was to allow an NT file server to serve Macin-
tosh clients. Multiple data streams are also used to represent metadata about files,
such as the thumbnail pictures of JPEG images that are available in the Windows
GUI. But alas, the multiple data streams are fragile and frequently fall off files
when they are transported to other file systems, transported over the network, or
even when backed up and later restored, because many utilities ignore them.
NTFS is a hierarchical file system, similar to the UNIX file system. The sepa-
rator between component names is ‘‘\’’, however, instead of ‘‘/’’, a fossil inherited
from the compatibility requirements with CP/M when MS-DOS was created
(CP/M used the slash for flags). Unlike in UNIX, the concepts of the current
working directory and of hard links to the current directory (.) and the parent
directory (..) are implemented as conventions rather than as a fundamental part
of the file-system design. Hard links are supported, but used only for the POSIX subsystem, as is
NTFS support for traversal checking on directories (the ‘x’ permission in UNIX).
Symbolic links are supported in NTFS. Creation of symbolic links is nor-
mally restricted to administrators to avoid security issues like spoofing, as UNIX
experienced when symbolic links were first introduced in 4.2BSD. The imple-
mentation of symbolic links uses an NTFS feature called reparse points (dis-
cussed later in this section). In addition, compression, encryption, fault tolerance,
journaling, and sparse files are also supported. These features and their imple-
mentations will be discussed shortly.
NTFS is a highly complex and sophisticated file system that was developed
specifically for NT as an alternative to the HPFS file system that had been devel-
oped for OS/2. While most of NT was designed on dry land, NTFS is unique
among the components of the operating system in that much of its original design
took place aboard a sailboat out on the Puget Sound (following a strict protocol of
work in the morning, beer in the afternoon). Below we will examine a number of
features of NTFS, starting with its structure, then moving on to file-name lookup,
file compression, journaling, and file encryption.
Each NTFS volume (e.g., disk partition) contains files, directories, bitmaps,
and other data structures. Each volume is organized as a linear sequence of blocks
(clusters in Microsoft’s terminology), with the block size being fixed for each vol-
ume and ranging from 512 bytes to 64 KB, depending on the volume size. Most
NTFS disks use 4-KB blocks as a compromise between large blocks (for efficient
transfers) and small blocks (for low internal fragmentation). Blocks are referred to
by their offset from the start of the volume using 64-bit numbers.
The principal data structure in each volume is the MFT (Master File Table),
which is a linear sequence of fixed-size 1-KB records. Each MFT record describes
one file or one directory. It contains the file’s attributes, such as its name and time-
stamps, and the list of disk addresses where its blocks are located. If a file is ex-
tremely large, it is sometimes necessary to use two or more MFT records to con-
tain the list of all the blocks, in which case the first MFT record, called the base
record, points to the additional MFT records. This overflow scheme dates back to
SEC. 11.8 THE WINDOWS NT FILE SYSTEM 955
CP/M, where each directory entry was called an extent. A bitmap keeps track of
which MFT entries are free.
The MFT is itself a file and as such can be placed anywhere within the volume,
thus eliminating the problem with defective sectors in the first track. Furthermore,
the file can grow as needed, up to a maximum size of 2^48 records.
The MFT is shown in Fig. 11-39. Each MFT record consists of a sequence of
(attribute header, value) pairs. Each attribute begins with a header telling which
attribute this is and how long the value is. Some attribute values are variable
length, such as the file name and the data. If the attribute value is short enough to
fit in the MFT record, it is placed there. If it is too long, it is placed elsewhere on
the disk and a pointer to it is placed in the MFT record. This makes NTFS very ef-
ficient for small files, that is, those that can fit within the MFT record itself.
The first 16 MFT records are reserved for NTFS metadata files, as illustrated
in Fig. 11-39. Each record describes a normal file that has attributes and data
blocks, just like any other file. Each of these files has a name that begins with a
dollar sign to indicate that it is a metadata file. The first record describes the MFT
file itself. In particular, it tells where the blocks of the MFT file are located so that
the system can find the MFT file. Clearly, Windows needs a way to find the first
block of the MFT file in order to find the rest of the file-system information. The
way it finds the first block of the MFT file is to look in the boot block, where its
address is installed when the volume is formatted with the file system.
Figure 11-39. The NTFS master file table, a linear sequence of 1-KB records;
the first 16 records are reserved for metadata files such as $Mft, $MftMirr,
$LogFile, $Volume, $AttrDef, the root directory, $Bitmap, $Boot, $BadClus,
$Secure, $UpCase, and $Extend.
Record 1 is a duplicate of the early portion of the MFT file. This information
is so precious that having a second copy can be critical in the event one of the first
blocks of the MFT ever becomes unreadable. Record 2 is the log file. When struc-
tural changes are made to the file system, such as adding a new directory or remov-
ing an existing one, the action is logged here before it is performed, in order to in-
crease the chance of correct recovery in the event of a failure during the operation,
such as a system crash. Changes to file attributes are also logged here. In fact, the
only changes not logged here are changes to user data. Record 3 contains infor-
mation about the volume, such as its size, label, and version.
As mentioned above, each MFT record contains a sequence of (attribute head-
er, value) pairs. The $AttrDef file is where the attributes are defined. Information
about this file is in MFT record 4. Next comes the root directory, which itself is a
file and can grow to arbitrary length. It is described by MFT record 5.
Free space on the volume is kept track of with a bitmap. The bitmap is itself a
file, and its attributes and disk addresses are given in MFT record 6. The next
MFT record points to the bootstrap loader file. Record 8 is used to link all the bad
blocks together to make sure they never occur in a file. Record 9 contains the se-
curity information. Record 10 is used for case mapping. For the Latin letters A-Z
case mapping is obvious (at least for people who speak Latin). Case mapping for
other languages, such as Greek, Armenian, or Georgian (the country, not the state),
is less obvious to Latin speakers, so this file tells how to do it. Finally, record 11 is
a directory containing miscellaneous files for things like disk quotas, object identi-
fiers, reparse points, and so on. The last four MFT records are reserved for future
use.
Each MFT record consists of a record header followed by the (attribute header,
value) pairs. The record header contains a magic number used for validity check-
ing, a sequence number updated each time the record is reused for a new file, a
count of references to the file, the actual number of bytes in the record used, the
identifier (index, sequence number) of the base record (used only for extension
records), and some other miscellaneous fields.
NTFS defines 13 attributes that can appear in MFT records. These are listed in
Fig. 11-40. Each attribute header identifies the attribute and gives the length and
location of the value field along with a variety of flags and other information.
Usually, attribute values follow their attribute headers directly, but if a value is too
long to fit in the MFT record, it may be put in separate disk blocks. Such an
attribute is said to be a nonresident attribute. The data attribute is an obvious
candidate. Some attributes, such as the name, may be repeated, but all attributes
must appear in a fixed order in the MFT record. The headers for resident attributes
are 24 bytes long; those for nonresident attributes are longer because they contain
information about where to find the attribute on disk.
The standard information field contains the file owner, security information,
the timestamps needed by POSIX, the hard-link count, the read-only and archive
bits, and so on. It is a fixed-length field and is always present. The file name is a
variable-length Unicode string.
Attribute              Description
Standard information   Flag bits, timestamps, etc.
File name              File name in Unicode; may be repeated for MS-DOS name
Security descriptor    Obsolete. Security information is now in $Extend$Secure
Attribute list         Location of additional MFT records, if needed
Object ID              64-bit file identifier unique to this volume
Reparse point          Used for mounting and symbolic links
Volume name            Name of this volume (used only in $Volume)
Volume information     Volume version (used only in $Volume)
Index root             Used for directories
Index allocation       Used for very large directories
Bitmap                 Used for very large directories
Logged utility stream  Controls logging to $LogFile
Data                   Stream data; may be repeated

Figure 11-40. The attributes used in MFT records.
The next three attributes deal with how directories are implemented. Small ones are just lists of files, but large ones are implemented using B+ trees. The logged utility stream attribute is used by the encrypting file system.
Finally, we come to the attribute that is the most important of all: the data
stream (or in some cases, streams). An NTFS file has one or more data streams as-
sociated with it. This is where the payload is. The default data stream is
unnamed (i.e., dirpath\filename::$DATA), but the alternate data streams each have a name, for example, dirpath\filename:streamname:$DATA.
For each stream, the stream name, if present, goes in this attribute header. Fol-
lowing the header is either a list of disk addresses telling which blocks the stream
contains, or for streams of only a few hundred bytes (and there are many of these),
the stream itself. Putting the actual stream data in the MFT record is called an
immediate file (Mullender and Tanenbaum, 1984).
Of course, most of the time the data does not fit in the MFT record, so this
attribute is usually nonresident. Let us now take a look at how NTFS keeps track
of the location of nonresident attributes, in particular data.
Storage Allocation
The model for keeping track of disk blocks is that they are assigned in runs of
consecutive blocks, where possible, for efficiency reasons. For example, if the first
logical block of a stream is placed in block 20 on the disk, then the system will try
hard to place the second logical block in block 21, the third logical block in 22,
and so on. One way to achieve these runs is to allocate disk storage several blocks
at a time, when possible.
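The "try hard to stay contiguous" policy can be sketched as a toy allocator. The free set and the fallback rule are invented for illustration; a real allocator works against the volume bitmap:

```python
# Toy allocator: if the previous logical block went to disk block b,
# prefer b+1 for the next one; otherwise fall back to any free block.
def allocate(free_blocks, n, prev=None):
    """Allocate n disk blocks from the set free_blocks,
    preferring blocks contiguous with prev."""
    out = []
    for _ in range(n):
        want = prev + 1 if prev is not None else None
        blk = want if want in free_blocks else min(free_blocks)
        free_blocks.remove(blk)
        out.append(blk)
        prev = blk
    return out

free = set(range(20, 30))
assert allocate(free, 3) == [20, 21, 22]   # one run of consecutive blocks
```

Allocating several blocks at a time, as the text suggests, amounts to calling this once with a larger n instead of once per block.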
The blocks in a stream are described by a sequence of records, each one
describing a sequence of logically contiguous blocks. For a stream with no holes
in it, there will be only one such record. Streams that are written in order from be-
ginning to end all belong in this category. For a stream with one hole in it (e.g.,
only blocks 0–49 and blocks 60–79 are defined), there will be two records. Such a
stream could be produced by writing the first 50 blocks, then seeking forward to
logical block 60 and writing another 20 blocks. When a hole is read back, all the
missing bytes are zeros. Files with holes are called sparse files.
Each record begins with a header giving the offset of the first block within the
stream. Next comes the offset of the first block not covered by the record. In the
example above, the first record would have a header of (0, 50) and would provide
the disk addresses for these 50 blocks. The second one would have a header of
(60, 80) and would provide the disk addresses for these 20 blocks.
Each record header is followed by one or more pairs, each giving a disk ad-
dress and run length. The disk address is the offset of the disk block from the start
of its partition; the run length is the number of blocks in the run. As many pairs as
needed can be in the run record. Use of this scheme for a three-run, nine-block
stream is illustrated in Fig. 11-41.
In this figure we have an MFT record for a short stream of nine blocks (header
0–8). It consists of the three runs of consecutive blocks on the disk. The first run
is blocks 20–23, the second is blocks 64–65, and the third is blocks 80–82. Each
of these runs is recorded in the MFT record as a (disk address, block count) pair.
How many runs there are depends on how well the disk block allocator did in find-
ing runs of consecutive blocks when the stream was created. For an n-block
stream, the number of runs can be anything from 1 through n.
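The record scheme just described can be made concrete in Python. The tuples below are invented stand-ins for the binary records; the lookup walks the record headers and then the (disk address, run length) pairs:

```python
# Each record is ((first, first_not_covered), [(disk_addr, run_len), ...]).
# Logical blocks not covered by any record are holes and read as zeros.
def logical_to_disk(records, lba):
    """Map a logical block number to a disk block, or None for a hole."""
    for (first, end), pairs in records:
        if first <= lba < end:
            off = lba - first
            for disk_addr, run_len in pairs:
                if off < run_len:
                    return disk_addr + off
                off -= run_len
    return None

# the nine-block stream of Fig. 11-41: runs 20-23, 64-65, 80-82
stream = [((0, 9), [(20, 4), (64, 2), (80, 3)])]
assert logical_to_disk(stream, 0) == 20
assert logical_to_disk(stream, 5) == 65
assert logical_to_disk(stream, 8) == 82

# the sparse stream of the text: only blocks 0-49 and 60-79 defined
sparse = [((0, 50), [(100, 50)]), ((60, 80), [(300, 20)])]
assert logical_to_disk(sparse, 55) is None   # hole: reads back as zeros
assert logical_to_disk(sparse, 60) == 300
```

The disk addresses 100 and 300 in the sparse example are arbitrary; the point is that the two records' headers, (0, 50) and (60, 80), leave blocks 50–59 uncovered.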
Several comments are worth making here. First, there is no upper limit to the
size of streams that can be represented this way. In the absence of address compression, each pair requires two 64-bit numbers, for a total of 16 bytes.
However, a pair could represent 1 million or more consecutive disk blocks. In fact,
a 20-MB stream consisting of 20 separate runs of 1 million 1-KB blocks each fits
easily in one MFT record, whereas a 60-KB stream scattered into 60 isolated
blocks does not.
Second, while the straightforward way of representing each pair takes 2 × 8 bytes, a compression method is available to reduce the size of the pairs below 16 bytes.
Many disk addresses have multiple high-order zero-bytes. These can be omitted.
The data header tells how many are omitted, that is, how many bytes are actually
used per address. Other kinds of compression are also used. In practice, the pairs
are often only 4 bytes.
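The zero-byte compression can be sketched as follows. This is a simplified model of the idea (store only the low-order bytes and a count), not NTFS's exact encoding:

```python
# Drop high-order zero bytes of a 64-bit disk address; the header
# records how many bytes were actually kept.
def compress_addr(addr):
    raw = addr.to_bytes(8, "little")
    while len(raw) > 1 and raw[-1] == 0:
        raw = raw[:-1]                    # strip a high-order zero byte
    return len(raw), raw                  # (bytes used, payload)

def decompress_addr(nbytes, payload):
    return int.from_bytes(payload, "little")

n, payload = compress_addr(20)
assert n == 1                             # a small address fits in one byte
assert decompress_addr(n, payload) == 20

n, _ = compress_addr(0x12345678)
assert n == 4                             # larger addresses keep more bytes
```

Run lengths can be compressed the same way, which is how a pair often shrinks to about 4 bytes in practice.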
Our first example was easy: all the file information fit in one MFT record.
What happens if the file is so large or highly fragmented that the block information
does not fit in one MFT record? The answer is simple: use two or more MFT
records. In Fig. 11-42 we see a file whose base record is in MFT record 102. It
has too many runs for one MFT record, so it computes how many extension
records it needs, say, two, and puts their indices in the base record. The rest of the
record is used for the first k data runs.
960 CASE STUDY 2: WINDOWS 8 CHAP. 11
[Figure: MFT records 100–109. Record 102 holds the base record: pointers "MFT 105" and "MFT 108" plus Run #1 … Run #k. Record 105 is the first extension record (Run #k+1 … Run #m); record 108 is the second extension record (Run #m+1 … Run #n).]
Figure 11-42. A file that requires three MFT records to store all its runs.
Note that Fig. 11-42 contains some redundancy. In theory, it should not be
necessary to specify the end of a sequence of runs because this information can be
calculated from the run pairs. The reason for ‘‘overspecifying’’ this information is
to make seeking more efficient: to find the block at a given file offset, it is neces-
sary to examine only the record headers, not the run pairs.
When all the space in record 102 has been used up, storage of the runs con-
tinues with MFT record 105. As many runs are packed in this record as fit. When
this record is also full, the rest of the runs go in MFT record 108. In this way,
many MFT records can be used to handle large fragmented files.
A problem arises if so many MFT records are needed that there is no room in
the base MFT to list all their indices. There is also a solution to this problem: the
list of extension MFT records is made nonresident (i.e., stored in other disk blocks
instead of in the base MFT record). Then it can grow as large as needed.
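The spill-over into extension records can be sketched like this. The capacities and record shapes are invented for illustration; real MFT records are fixed-size binary structures:

```python
# Pack runs into the base record until it is full, then spill the rest
# into extension records and note their MFT indices in the base record,
# as in Fig. 11-42.
def pack_runs(runs, base_capacity, ext_capacity, next_free_mft):
    base = {"runs": runs[:base_capacity], "extensions": []}
    rest = runs[base_capacity:]
    records, idx = {}, next_free_mft
    while rest:
        records[idx] = rest[:ext_capacity]    # fill one extension record
        base["extensions"].append(idx)
        rest = rest[ext_capacity:]
        idx += 1
    return base, records

runs = [(20 + 10 * i, 1) for i in range(7)]   # 7 one-block runs
base, exts = pack_runs(runs, base_capacity=3, ext_capacity=2,
                       next_free_mft=105)
assert len(base["runs"]) == 3
assert base["extensions"] == [105, 106]       # two extension records needed
assert exts[106] == runs[5:7]
```

Making the extension list itself nonresident, as the text describes, corresponds to moving the `extensions` list out of the base record when even it no longer fits.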
An MFT entry for a small directory is shown in Fig. 11-43. The record con-
tains a number of directory entries, each of which describes one file or directory.
Each entry has a fixed-length structure followed by a variable-length file name.
The fixed part contains the index of the MFT entry for the file, the length of the file
name, and a variety of other fields and flags. Looking for an entry in a directory
consists of examining all the file names in turn.
Large directories use a different format. Instead of listing the files linearly, a B+ tree is used to make alphabetical lookup possible and to make it easy to insert new names in the directory in the proper place.
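The small-directory search just described is a plain linear scan. As a sketch (the entry structure is simplified to a pair; real entries carry flags and other fields):

```python
# Look for a name in a small directory by examining the file names
# in turn; each entry pairs an MFT index with a file name.
def lookup(entries, name):
    for mft_index, entry_name in entries:
        if entry_name == name:
            return mft_index
    return None                # not found in this directory

root = [(64, "foo"), (65, "bar"), (66, "baz")]
assert lookup(root, "bar") == 65
assert lookup(root, "nope") is None
```

For a large directory the same interface would be backed by a B+ tree, turning the O(n) scan into an O(log n) descent and keeping the names in sorted order for cheap insertion.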
The NTFS parsing of the path \ foo \ bar begins at the root directory for C:,
whose blocks can be found from entry 5 in the MFT (see Fig. 11-39). The string
‘‘foo’’ is looked up in the root directory, which returns the index into the MFT for
the directory foo. This directory is then searched for the string ‘‘bar’’, which refers
to the MFT record for this file. NTFS performs access checks by calling back into
the security reference monitor, and if everything is cool it searches the MFT record
for the attribute ::$DATA, which is the default data stream.
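The component-by-component walk of \foo\bar can be sketched as a loop. Directories are modeled as dicts from name to MFT index, and the MFT indices below (other than the root's entry 5) are invented:

```python
# Simplified MFT: entry 5 is the root directory, per Fig. 11-39.
MFT = {
    5:  {"foo": 70},     # root directory's entries
    70: {"bar": 71},     # directory foo's entries
    71: None,            # file bar (a file, not a directory)
}

def resolve(path):
    idx = 5                                # start at the root directory
    for component in path.strip("\\").split("\\"):
        idx = MFT[idx][component]          # search directory for the name
    return idx

assert resolve("\\foo\\bar") == 71
```

A real lookup also performs the access check at each step and ends by locating the ::$DATA attribute in the final MFT record.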
[Figure 11-43. The MFT record for a small directory: record header, standard info, the directory entries, and unused space.]
Having found file bar, NTFS will set pointers to its own metadata in the file
object passed down from the I/O manager. The metadata includes a pointer to the
MFT record, information about compression and range locks, various details about
sharing, and so on. Most of this metadata is in data structures shared across all file
objects referring to the file. A few fields are specific only to the current open, such
as whether the file should be deleted when it is closed. Once the open has suc-
ceeded, NTFS calls IoCompleteRequest to pass the IRP back up the I/O stack to
the I/O and object managers. Ultimately a handle for the file object is put in the
handle table for the current process, and control is passed back to user mode. On
subsequent ReadFile calls, an application can provide the handle, specifying that
this file object for C:\foo\bar should be included in the read request that gets passed down the C: device stack to NTFS.
In addition to regular files and directories, NTFS supports hard links in the
UNIX sense, and also symbolic links using a mechanism called reparse points.
NTFS supports tagging a file or directory as a reparse point and associating a block
of data with it. When the file or directory is encountered during a file-name parse,
the operation fails and the block of data is returned to the object manager. The ob-
ject manager can interpret the data as representing an alternative path name and
then update the string to parse and retry the I/O operation. This mechanism is used
to support both symbolic links and mounted file systems, redirecting the search to
a different part of the directory hierarchy or even to a different partition.
Reparse points are also used to tag individual files for file-system filter drivers.
In Fig. 11-20 we showed how file-system filters can be installed between the I/O
manager and the file system. I/O requests are completed by calling IoCompleteRequest, which passes control to the completion routines that each driver in the device stack inserted into the IRP as the request was being made. A driver
that wants to tag a file associates a reparse tag and then watches for completion re-
quests for file open operations that failed because they encountered a reparse point.
From the block of data that is passed back with the IRP, the driver can tell if this is
a block of data that the driver itself has associated with the file. If so, the driver
will stop processing the completion and continue processing the original I/O re-
quest. Generally, this will involve proceeding with the open request, but there is a
flag that tells NTFS to ignore the reparse point and open the file.
File Compression
[Figure: (a) a 48-block compressed file: one compressed region stored at disk addresses 30–37, an uncompressed region at 40–55, and another compressed region at 85–92. (b) its MFT record: record header, standard info, file name, stream header (0, 48), five runs (two of them empty): (30, 8), (0, 8), (40, 16), (85, 8), (0, 8), and unused space.]
Journaling
NTFS supports two mechanisms for programs to detect changes to files and directories. The first is an operation, NtNotifyChangeDirectoryFile, that passes a buffer and returns when a change to a directory or directory subtree is detected. The result is that the buffer has a list of change records. If it is too small, records are lost.
The second mechanism is the NTFS change journal. NTFS keeps a list of all
the change records for directories and files on the volume in a special file, which
programs can read using special file-system control operations, that is, the FSCTL_QUERY_USN_JOURNAL option to the NtFsControlFile API. The journal
file is normally very large, and there is little likelihood that entries will be reused
before they can be examined.
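The idea of a bounded journal that readers catch up on can be sketched generically. This is not the NTFS on-disk format; the class, its capacity, and the record shapes are all invented to show the concept:

```python
from collections import deque

# A bounded change journal: records carry increasing sequence numbers
# (USNs), and a reader asks for everything at or after the last number
# it has seen. If the journal wraps, old records are silently dropped.
class ChangeJournal:
    def __init__(self, capacity):
        self.records = deque(maxlen=capacity)
        self.next_usn = 0

    def append(self, change):
        self.records.append((self.next_usn, change))
        self.next_usn += 1

    def read_since(self, usn):
        return [(u, c) for u, c in self.records if u >= usn]

j = ChangeJournal(capacity=1000)
for name in ("a.txt", "b.txt", "c.txt"):
    j.append(("modified", name))
assert [c for _, c in j.read_since(1)] == [("modified", "b.txt"),
                                           ("modified", "c.txt")]
```

Making the journal file very large, as the text notes, corresponds to choosing a capacity big enough that records are rarely reused before readers examine them.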
File Encryption
Computers are used nowadays to store all kinds of sensitive data, including
plans for corporate takeovers, tax information, and love letters, which the owners
do not especially want revealed to anyone. Information loss can happen when a
notebook computer is lost or stolen, a desktop system is rebooted using an MS-
DOS floppy disk to bypass Windows security, or a hard disk is physically removed
from one computer and installed on another one with an insecure operating system.
Windows addresses these problems by providing an option to encrypt files, so
that even in the event the computer is stolen or rebooted using MS-DOS, the files
will be unreadable. The normal way to use Windows encryption is to mark certain
directories as encrypted, which causes all the files in them to be encrypted, and
new files moved to them or created in them to be encrypted as well. The actual en-
cryption and decryption are not managed by NTFS itself, but by a driver called
EFS (Encryption File System), which registers callbacks with NTFS.
EFS provides encryption for specific files and directories. There is also another encryption facility in Windows called BitLocker, which encrypts almost all the data on a volume and can help protect data no matter what, as long as the user
takes advantage of the mechanisms available for strong keys. Given the number of
systems that are lost or stolen all the time, and the great sensitivity to the issue of
identity theft, making sure secrets are protected is very important. An amazing
number of notebooks go missing every day. Major Wall Street companies sup-
posedly average losing one notebook per week in taxicabs in New York City alone.
On the current generation of multiprocessors, both hibernation and resume can be performed in a few seconds even on systems with many gigabytes of RAM.
An alternative to hibernation is standby mode where the power manager re-
duces the entire system to the lowest power state possible, using just enough power
to refresh the dynamic RAM. Because memory does not need to be copied to
disk, this is somewhat faster than hibernation on some systems.
Despite the availability of hibernation and standby, many users are still in the
habit of shutting down their PC when they finish working. Windows uses hiberna-
tion to perform a pseudo shutdown and startup, called HiberBoot, that is much fast-
er than normal shutdown and startup. When the user tells the system to shutdown,
HiberBoot logs the user off and then hibernates the system at the point where the user would normally log in again. Later, when the user turns the system on again, HiberBoot
will resume the system at the login point. To the user it looks like shutdown was
very, very fast because most of the system initialization steps are skipped. Of
course, sometimes the system needs to perform a real shutdown in order to fix a
problem or install an update to the kernel. If the system is told to reboot rather
than shutdown, the system undergoes a real shutdown and performs a normal boot.
On phones and tablets, as well as the newest generation of laptops, computing
devices are expected to be always on yet consume little power. To provide this
experience Modern Windows implements a special version of power management
called CS (connected standby). CS is possible on systems with special network-
ing hardware which is able to listen for traffic on a small set of connections using
much less power than if the CPU were running. A CS system always appears to be
on, coming out of CS as soon as the screen is turned on by the user. Connected
standby is different from the regular standby mode because a CS system will also
come out of standby when it receives a packet on a monitored connection. Once
the battery begins to run low, a CS system will go into the hibernation state to
avoid completely exhausting the battery and perhaps losing user data.
Achieving good battery life requires more than just turning off the processor as
often as possible. It is also important to keep the processor off as long as possible.
The CS network hardware allows the processors to stay off until data have arrived,
but other events can also cause the processors to be turned back on. In NT-based Windows, device drivers, system services, and the applications themselves frequently run for no particular reason other than to check on things. Such polling
activity is usually based on setting timers to periodically run code in the system or
application. Timer-based polling can produce a cacophony of events turning on the
processor. To avoid this, Modern Windows requires that timers specify an impreci-
sion parameter which allows the operating system to coalesce timer events and re-
duce the number of separate occasions one of the processors will have to be turned
back on. Windows also formalizes the conditions under which an application that
is not actively running can execute code in the background. Operations like check-
ing for updates or freshening content cannot be performed solely by requesting to
run when a timer expires. An application must defer to the operating system about
when to run such background activities. For example, checking for updates might
occur only once a day or at the next time the device is charging its battery. A set of
system brokers provides a variety of conditions that can be used to limit when
background activity is performed. If a background task needs to access a low-cost
network or utilize a user’s credentials, the brokers will not execute the task until
the requisite conditions are present.
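Timer coalescing with an imprecision parameter can be sketched as a classic interval-stabbing problem. The function below is a simplified model of the policy, not Windows' actual implementation:

```python
# Each timer may fire anywhere in [due, due + tolerance], so a single
# wakeup can serve every timer whose window contains it. Greedy rule:
# sort by deadline and fire at each deadline not already covered.
def coalesce(timers):
    """timers: list of (due, tolerance) pairs. Returns wakeup times."""
    wakeups = []
    served_until = None
    for due, tol in sorted(timers, key=lambda t: t[0] + t[1]):
        if served_until is not None and due <= served_until:
            continue        # the previous wakeup falls inside this window
        served_until = due + tol
        wakeups.append(served_until)
    return wakeups

# three timers whose windows all contain t = 110: one wakeup, not three
assert coalesce([(100, 10), (105, 20), (108, 5)]) == [110]
# a distant timer still needs its own wakeup
assert coalesce([(100, 10), (500, 10)]) == [110, 510]
```

The larger the tolerances applications grant, the more windows overlap and the fewer times a processor has to be turned back on.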
Many applications today are implemented with both local code and services in
the cloud. Windows provides WNS (Windows Notification Service) which allows
third-party services to push notifications to a Windows device in CS without re-
quiring the CS network hardware to specifically listen for packets from the third
party’s servers. WNS notifications can signal time-critical events, such as the arri-
val of a text message or a VoIP call. When a WNS packet arrives, the processor
will have to be turned on to process it, but the ability of the CS network hardware
to discriminate between traffic from different connections means the processor
does not have to awaken for every random packet that arrives at the network inter-
face.
Every Windows user (and group) is identified by an SID (Security ID). SIDs
are binary numbers with a short header followed by a long random component.
Each SID is intended to be unique worldwide. When a user starts up a process, the
process and its threads run under the user’s SID. Most of the security system is de-
signed to make sure that each object can be accessed only by threads with autho-
rized SIDs.
Each process has an access token that specifies an SID and other properties.
The token is normally created by winlogon, as described below. The format of the
token is shown in Fig. 11-45. Processes can call GetTokenInformation to acquire
this information. The header contains some administrative information. The expi-
ration time field could tell when the token ceases to be valid, but it is currently not
used. The Groups field specifies the groups to which the process belongs, which is
needed for the POSIX subsystem. The default DACL (Discretionary ACL) is the
access control list assigned to objects created by the process if no other ACL is
specified. The user SID tells who owns the process. The restricted SIDs are to
allow untrustworthy processes to take part in jobs with trustworthy processes but
with less power to do damage.
Finally, the privileges listed, if any, give the process special powers denied or-
dinary users, such as the right to shut the machine down or access files to which
access would otherwise be denied. In effect, the privileges split up the power of
the superuser into several rights that can be assigned to processes individually. In
this way, a user can be given some superuser power, but not all of it. In summary,
the access token tells who owns the process and which defaults and powers are as-
sociated with it.
When a user logs in, winlogon gives the initial process an access token. Subse-
quent processes normally inherit this token on down the line. A process’ access
token initially applies to all the threads in the process. However, a thread can ac-
quire a different access token during execution, in which case the thread’s access
token overrides the process’ access token. In particular, a client thread can pass its
access rights to a server thread to allow the server to access the client’s protected
files and other objects. This mechanism is called impersonation. It is imple-
mented by the transport layers (i.e., ALPC, named pipes, and TCP/IP) and used by
RPC to communicate from clients to servers. The transports use internal interfaces
in the kernel’s security reference monitor component to extract the security context
for the current thread’s access token and ship it to the server side, where it is used
to construct a token which can be used by the server to impersonate the client.
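The token-override rule for impersonation reduces to a simple precedence check. A minimal sketch (tokens are stand-in strings; real tokens are the structures of Fig. 11-45):

```python
# A thread normally uses its process' access token, but a thread that
# has acquired an impersonation token uses that instead.
class Process:
    def __init__(self, token):
        self.token = token

class Thread:
    def __init__(self, process):
        self.process = process
        self.impersonation_token = None   # set while impersonating a client

    def effective_token(self):
        return self.impersonation_token or self.process.token

server = Process(token="SID-server")
worker = Thread(server)
assert worker.effective_token() == "SID-server"

worker.impersonation_token = "SID-client"   # client shipped its context
assert worker.effective_token() == "SID-client"
```

Clearing `impersonation_token` when the request completes returns the thread to the process' own identity.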
Another basic concept is the security descriptor. Every object has a security
descriptor associated with it that tells who can perform which operations on it.
The security descriptors are specified when the objects are created. The NTFS file
system and the registry maintain a persistent form of security descriptor, which is
used to create the security descriptor for File and Key objects (the object-manager
objects representing open instances of files and keys).
A security descriptor consists of a header followed by a DACL with one or
more ACEs (Access Control Entries). The two main kinds of elements are Allow
and Deny. An Allow element specifies an SID and a bitmap that specifies which
operations processes that SID may perform on the object. A Deny element works
the same way, except a match means the caller may not perform the operation. For
example, Ida has a file whose security descriptor specifies that everyone has read
access, Elvis has no access, Cathy has read/write access, and Ida herself has full
SEC. 11.10 SECURITY IN WINDOWS 8 969
access. This simple example is illustrated in Fig. 11-46. The SID Everyone refers to the set of all users, but it is overridden by the explicit ACEs that precede it.
[Figure 11-46. A file and its security descriptor. The descriptor consists of a header (owner's SID, group SID, DACL, SACL) followed by the DACL's ACEs: Deny Elvis 111111; Allow Cathy 110000; Allow Ida 111111; Allow Everyone 100000. The SACL holds an audit ACE: Audit Marilyn 111111.]
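The example can be sketched as an in-order walk of the ACE list. This simplified checker lets the first matching entry decide; real Windows access checks accumulate granted bits across ACEs and handle inheritance, so treat this as a model of the ordering rule only:

```python
# Two bits of the 6-bit operation bitmap, named for readability.
READ, WRITE = 0b100000, 0b010000

def access_check(dacl, sid, wanted):
    """Walk the ACEs in order; the first matching entry decides."""
    for kind, ace_sid, mask in dacl:
        if ace_sid not in (sid, "Everyone"):
            continue                          # ACE does not apply to caller
        if kind == "deny" and wanted & mask:
            return False                      # any denied bit blocks access
        if kind == "allow" and wanted & mask == wanted:
            return True                       # all requested bits granted
    return False                              # no matching ACE: denied

dacl = [("deny",  "Elvis",    0b111111),
        ("allow", "Cathy",    0b110000),
        ("allow", "Ida",      0b111111),
        ("allow", "Everyone", 0b100000)]

assert not access_check(dacl, "Elvis", READ)       # denied outright
assert access_check(dacl, "Cathy", READ | WRITE)   # read/write granted
assert access_check(dacl, "Bob", READ)             # via Allow Everyone
assert not access_check(dacl, "Bob", WRITE)        # Everyone grants read only
```

Because the Deny for Elvis comes before Allow Everyone, Elvis is refused even though he is also a member of Everyone, which is exactly the ordering effect the text describes.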