Kernel development
Brief items
Kernel release status
The current development kernel is 4.7-rc2, released on June 5. Linus said: "There's a late non-fix I took even though the merge window is over, because I've been wanting it for a while. I doubt anybody notices the actual effects of a pty change/cleanup that means that our old disgusting DEVPTS_MULTIPLE_INSTANCES kernel config option is gone, because the cleanup means that it is no longer needed." For details on this change, see this article from last week's Kernel Page.
Stable updates: 4.6.2, 4.5.7, 4.4.13, and 3.14.72 were released on June 7. Note that 4.5.7 is the end of the 4.5 series.
Kernel development news
Reconsidering swapping
"Swapping" is generally considered to be a bit of a dirty word among long-time Linux users, who will often go to considerable lengths to avoid it. The memory-management (MM) subsystem has been designed to facilitate that avoidance whenever possible. Now, though, MM developer Johannes Weiner is suggesting that, in light of recent developments in hardware, swapping deserves another look. His associated patch set includes benchmark results indicating that he may be on to something.
Is swapping still bad?
User-accessible memory on a Linux system is divided into two broad classes: file-backed and anonymous. File-backed pages (or page-cache pages) correspond to a segment of a file on disk; if they do not contain newly written data that has not yet made it back to persistent storage, these pages can be easily reclaimed for other uses. Anonymous pages do not correspond to a file on disk; they hold the run-time data generated and used by a process. Reclaiming an anonymous page requires writing its contents to the swap device.
As a general rule, reclaiming anonymous pages (swapping) is seen as being considerably more expensive than reclaiming file-backed pages. One of the key reasons for this difference is that file-backed pages can be read from (and written to) persistent storage in large, contiguous chunks, while anonymous pages tend to be scattered randomly on the swap device. On a rotating storage device, scattered I/O operations are expensive, so a system that is doing a lot of swapping will slow down considerably. It is far faster to read a bunch of sequentially stored file-backed pages — and, since the file is usually current on disk, those pages may not need to be written at reclaim time at all.
Swapping is so much slower that many administrators try to configure their systems to do as little swapping as possible. At its most extreme, this can involve not setting up a swap device at all; this common practice deprives the kernel of any way to reclaim anonymous pages, regardless of whether that memory could be put to better use elsewhere. An intermediate step is to use the swappiness tuning knob (described here in 2004) to bias the system strongly toward reclaiming file-backed pages. Setting swappiness to zero will cause the kernel to swap only when memory pressure reaches dire levels.
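For concreteness, this bias is usually set persistently through sysctl; the file name and value below are only an example, not a recommendation:

```
# /etc/sysctl.d/99-swappiness.conf (example)
# 0 biases reclaim strongly toward file-backed pages;
# 100 treats the two page types equally.
vm.swappiness = 10
```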
Johannes starts off his patch set by noting that this mechanism was designed around the characteristics of rotating storage. Anytime the drive used for swapping needed to perform a seek — which would happen often with randomly placed I/O — throughput would drop dramatically. Hence the strong aversion to swapping if it could possibly be avoided. But, Johannes notes, technology has moved on, and some of these decisions should be reconsidered.
Not only should the system be more willing to swap out anonymous memory, Johannes claims, but, at times, swapping may well be a better option than reclaiming page-cache pages. That could be true if the swap device is faster than the drives used to hold files; it is also true if the system is reclaiming needed file-backed pages while memory is clogged with unused anonymous pages.
Deciding when to swap
The first step in the patch set is to widen the range of possible settings for the swappiness knob. In current kernels, it can go from zero (no swapping at all if possible) to 100 (reclaim anonymous and file-backed pages equally). Johannes raises the maximum to 200; at that value, the system will strongly favor swapping. That is a possibility nobody has ever wanted before, but fast drives have now made it useful.
While there may always be a use for knobs like swappiness, the best kind of system is one that tunes itself without the need for administrator intervention. So Johannes goes on to change the mechanism that decides whether to reclaim pages from the anonymous least-recently-used (LRU) list or the file-backed LRU. For each list, he introduces the concept of the "cost" of reclaiming a page from that list; the reclaim code then directs its efforts toward the list that costs the least to reclaim pages from.
The first step is to track the cost of "rotations" on each LRU. The MM code does its best to reclaim pages that are not in active use. This is done by occasionally passing through the list and clearing the "referenced" bit on each page. The pages that are used thereafter will have that bit set again; those that still have the referenced bit cleared on a subsequent scan have not been touched in the meantime. Those pages are the least likely to be missed and are, thus, the first to be reclaimed. Pages which have been referenced, instead, are "rotated" to the head of the list, giving them a period of time before they are again considered for reclaim.
That rotation costs a bit of CPU time. If a particular LRU list has a lot of referenced pages in it, scanning that list will use a relatively large amount of time for a relatively small payoff in reclaimable pages; in this case, the kernel may well be better off scanning the other list, which may have more unused pages. To that end, Johannes's patch set tracks the number of rotated pages and uses it to establish the cost of reclaiming from each list.
While rotation has a cost, that cost pales relative to that of reclaiming a page that will be quickly faulted back into memory — even if it is written to a fast device in the meantime. As it happens, Johannes added a mechanism to track "refaulted" pages back in 2012; it is used in current kernels to determine how large the active working set is at any given time. This mechanism can also tell the kernel whether it is reclaiming too many anonymous or file-backed pages. The final patch in the set uses refault information to adjust the cost of reclaiming from each LRU; if pages taken from one LRU are quickly faulted back in, the kernel will turn its attention to the other LRU instead.
In the current patch set, the cost of a refaulted page is set to be 32 times the cost of a rotated page. Johannes suggests in the comments that this value is arbitrary and may change in the future. For now, the intent is to cause refaults to dominate in the cost calculation, but, he says, there may well be settings where refaults cost less than rotations.
The patch set comes with a number of benchmarks to show its impact on performance. A PostgreSQL benchmark goes from 81 to 105 transactions per second with the patches applied; the refault rate is halved, and kernel CPU time is reduced. A streaming I/O benchmark, which shouldn't create serious memory pressure, runs essentially unchanged. So, as far as Johannes's testing goes, the numbers look good.
Memory-management changes are fraught with potential hazards, though, and it is entirely possible that other workloads will be hurt by these changes. The only way to gain confidence that this won't happen is wider testing and review. This patch set is quite young; there have been some favorable reviews, but that testing has not yet happened. Thus, it may be a while before this code goes anywhere near a mainline kernel. But it has been clear for a while that the MM subsystem is going to need a number of changes to bring its design in line with current hardware; this work may be a promising step in that direction.
Sparse: a look under the hood
"Sparse" is a C language "semantic parser" originally written by Linus Torvalds to support his work on the Linux kernel. It was designed, according to the README file, to be "small - and simple" and particularly to be "easy to use".
Reasons to use a simple C parser could include data mining (to summarize particular features of some code, for example), analysis (possibly to look for troublesome patterns), or visualization (to make it easier to understand or navigate around a large code set). In support of this reuse, sparse is licensed under the permissive MIT License and is structured as a library that other tools can easily incorporate. This library is accompanied by a number of tools that demonstrate some of those reuse possibilities.
Unfortunately, though, sparse comes with little documentation to help a potential user get started. In the hope of correcting this omission, the following is an attempt to make the internals of sparse more approachable and to highlight some of the various uses that can be made of it.
Background patterns
Before getting into the details of the interfaces, some observations on overall style will be helpful. The first point to note is that sparse makes free use of global variables. Various aspects of the current state of the parser, and various configuration options set from the command line, are stored in global variables. Consequently, many functions have side effects that are not at all obvious at first glance, so caution and thorough research are advisable when exploring the code. Beyond this general pattern, there are two particular details of sparse worth exploring: memory allocation and list manipulation.
For memory allocation, sparse does not use the familiar malloc() and free() interfaces from the C library but, instead, provides a dedicated allocator that uses mmap() to allocate large blocks of memory which are then subdivided. In an approach somewhat reminiscent of the "slab allocator" used in the Linux kernel, the sparse allocator allocates multiple "blobs" that are each used for a distinct type of object such as tokens, identifiers, expressions, etc. The allocator is optimized for a usage pattern characterized by lots of allocations with few or no deallocations happening until a time comes where all objects of a particular type are released. Freeing individual objects is supported for fixed-sized allocations only, in which case the freed space is simply placed on a free list to satisfy a subsequent allocation request.
The wholesale freeing of all objects of a given type typically happens after a particular file has been completely processed, thus allowing a number of files to be processed sequentially without needing to store all of them in memory at once. When multiple files are processed, there may be some preamble that should apply to each file. The -D command-line option, which provides an initial definition of a macro, is a common example, but there are others, such as -include, which identifies a file to be included before the main file. In order to preserve the results of processing this preamble when memory is freed after processing the first file, a protect_foo() interface is provided for each allocator. This interface ensures every "foo" allocated so far will never be freed. It is used after the preamble has been parsed to preserve those results indefinitely.
The foo in protect_foo() above is any of the different defined allocators, of which there are 20. Sparse uses the C pre-processor to effect a mechanism similar to C++ templates so that allocators can be defined that are type-safe, returning or consuming a particular type rather than just a void pointer as malloc() and free() use. This leads to one of my personal least favorite coding styles, where the definition of a function, being constructed inside a macro, cannot be found by git grep or etags. A search for a string like " __alloc_statement" finds the single use of this function, but does not report its definition. We will get back to this problem later.
The generic lists used in sparse are quite different from the "list_head" based lists used in the Linux kernel. They consist of a simple linked list of arrays of pointers. This two-level structure (list of arrays) makes iteration over the list a little more complex, but means that generic insert, delete, and concatenate operations can be performed efficiently, and that direct indexing is straightforward after a linearization step.
Parsing phases
The README file describes the parsing phases as:
- full-file tokenization
- pre-processing (which can cause another tokenization phase of another file)
- semantic parsing
- lazy type evaluation
- inline function expansion and tree simplification
which is reasonably accurate, but is not quite what you see when you look in the code. From the perspective of a program making use of the sparse library there are three main phases and two high-level data types that need to be understood.
The first phase is embodied in the sparse_initialize() and sparse() functions. The former is given the command-line arguments (argc, argv) passed to the program and an empty list. The command-line arguments are processed to emulate gcc or a similar compiler, so macro definitions (-D), warning levels (-W), machine types (-m), and preliminary include files (-include) are all handled, among others. All target file names are added to the passed-in list, all flags that modify global state are reflected in the relevant global variables, and all preliminary code, such as -D and -include, is parsed to produce a list of symbols which is returned. The parsing also modifies global variables such as hash_table[], which stores all identifiers.
The list of symbols represents all the top-level definitions of functions and variables. Declarations of types, typedefs, and references to external functions do not appear in this list: the relevant details will be found within the substructure of the symbols where these declarations are used. Sparse comes with a tool called test_inspect that allows the symbol list parsed from a given file to be displayed and some of the substructure of each symbol to be inspected. This is useful for getting a feel for what sparse is producing.
Once sparse_initialize() has been called and the file list is no longer empty, sparse() can be called in turn on each file in that list. This will parse each file in the context extracted from the arguments and return a separate symbol_list for each file. The symbol_list is the result of nearly all of the parsing phases listed earlier.
tokenize_stream() is first used to convert a stream (an abstraction over either a file or a text buffer) into a linked list of tokens (not using the generic list framework, just a simple linked list). Each token includes a position so that the results of parsing, such as warnings, can be accurately linked back to the original code.
The token list, once completely extracted, is passed to preprocess(), which performs the various substitutions expected of a C preprocessor. While this is a distinct, well-defined phase, the call to preprocess() is hidden inside the sparse_tokenstream() function, which then repeatedly calls external_declaration() on the stream of preprocessed tokens to add declarations to a global hash_table and build the list of external symbols.
At this stage, the detail within each symbol is just an abstract syntax tree representation of the parsed code. There are statements, expressions, argument lists, and all the details that can be extracted from a purely syntactic analysis. The only analysis that has been performed beyond local syntax is the connection of the use of each symbol to its declaration. This work is necessary because, as discussed in the Wikipedia article on "The lexer hack", correct syntactic analysis of C requires that symbols declared by typedef be distinguished from other symbols.
There is one more parsing step applied to the symbol list before it is returned by the sparse() entry point: evaluate_symbol() is called. This combines the "lazy type evaluation" and "inline function expansion" phases mentioned in the README. It resolves details of the type of each symbol, such as storage size and alignment, and then checks all initializers and code for type compatibility. This determines the type of every expression, or reports errors and warnings when unacceptable or undesirable constructs are found. Exactly which warnings are generated here and which are left until later seems a little ad hoc. For example, sparse produces a warning if a simple assignment is found in the condition of an if statement:
if (variable = value)
statement;
While this could be detected during sparse_tokenstream(), it is actually handled in evaluate_symbol(). This is a common pattern: functionality is often to be found somewhere convenient rather than somewhere meaningful. This doesn't affect the functionality of the code, but can detract from its transparency.
Tree simplification
Based on the parsing phases listed in the README, all that is left after the call to sparse() is "tree simplification". This simplification happens in two stages that must, if wanted, be called by the main program after the call to sparse().
expand_symbol() can be called on each symbol in the list and primarily performs constant folding. For example, if it finds an expression that adds "3" to "4", it will replace it with the constant "7". One detail that the parser tracks is whether a given symbol has ever been assigned to or otherwise had its value changed. If a symbol has an initializer but has never been changed, then expand_symbol() will use the initialized value wherever the symbol is found, thus achieving a higher level of code simplification.
expand_symbol() does a little bit of dead-code elimination when the constant propagation determines that &&, ||, or ?: can only have one possible outcome. For example, if sparse encounters code like:
3 - 3 && foo()
the call to foo() will be removed, since it will never be executed. Dead code caused by a construct that resolves to "if (0)" is not eliminated at this stage as a jump into the body of the if could keep some of the code alive.
Finally, expand_symbol() performs a little bit of optimization. If a conditional expression (one using the "? :" operator) is found to have no side effects, and its computation is not expensive, then the type of the expression is changed from EXPR_CONDITIONAL to EXPR_SELECT, implying that it can be implemented without using any jump instructions, since a cmov (conditional move) or similar will suffice.
This parsing stage transforms the expressions and statements within each symbol in place, so no new data structure is needed to report the results. The final stage is quite different and is much closer to code generation than to parsing.
linearize_symbol() takes a symbol that has been parsed, evaluated, and expanded; if that symbol represents a function it will produce a network of "basic blocks" represented by a single entry point. A basic block is a sequence of instructions with no jumps except at the end, so all jumps or control transfers are from the end of one basic block to the beginning of another. Performing the EXPR_SELECT optimization before this step can result in fewer basic blocks.
The details of this conversion and its usefulness are fairly impenetrable until you know about Static Single Assignment form (SSA), at which point they become relatively straightforward. The key elements of SSA are the basic blocks, the links between them representing jumps, and versioned variables. When a variable is assigned to multiple times in the code, SSA requires that variable be cloned, once per assignment, so that each final variable receives a single assignment. In sparse these versioned variables are referred to as pseudos.
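As a sketch of what linearization does to a small function (the notation here is invented for illustration; the actual output of sparse's test-linearize tool differs in detail):

```
int f(int a)            .L0:  (entry)
{                             br  a > 0, .L1, .L2
    int x;
    if (a > 0)          .L1:  x.1 <- a + 1
        x = a + 1;            br .L3
    else
        x = 0;          .L2:  x.2 <- 0
    return x;                 br .L3
}
                        .L3:  x.3 <- phi(x.1, x.2)
                              ret x.3
```

Each basic block ends in a single control transfer, and each pseudo (x.1, x.2, x.3) is assigned exactly once; a "phi" node merges the competing versions of x where the two paths join.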
Using SSA form simplifies a number of optimizations, including the dead-code removal in if (0) statements mentioned above. These optimizations are not only important when the aim is code generation, they are valuable for providing high-quality warnings, which is the main use case for sparse today. A simple example of this is the __range__ statement that sparse adds to C. It is given three values such as:
__range__ sizeof(struct foo), 0, 128
with the implication that the first value must be within the range given by the other two. Once sparse has performed all the parsing steps and has a network of basic blocks, one of the tests it performs is to examine every instruction in every basic block and give the "value out of range" warning if the range-check operation (OP_RANGE) is present. This works because one of the late optimizations is to discard OP_RANGE instructions when all three values are constant and the first value is within the required range. In a language like C, where inline functions and macro expansion can place lots of dead code in unexpected places, it is important to remove as much of it as possible before passing judgment on that code's quality.
Use cases
This window into the purpose and structure of the various interfaces of libsparse is focused on the miniature: the steps and details. As such it doesn't give much hint as to why anyone would care, or what sorts of tools can be built with it. Undoubtedly there are possibilities that haven't been implemented or even imagined yet, but a quick look at the tools that come with sparse might be useful for sparking new ideas.
Along with test_inspect, which has already been mentioned and allows some elements of structure to be viewed, there is graph, which converts the basic-block network into a graph description in graphviz format. This, together with similar tools, can be helpful for students trying to understand how compilation works.
A data-extraction tool of a different type is ctags. A "tags" file lists locations in a set of files that are particularly interesting, typically the locations of the definitions or declarations of different names in a program. Text editors can use these tags files to help the user navigate around the code. A tags file is typically generated by a fairly simple regular-expression-based parse of the file. While this is often effective, it is not perfect. As mentioned above, when a function is defined using a macro, a regular expression isn't going to be able to identify the function name, so the standard ctags and etags (an Emacs-specific version) tools do not find such function definitions.
Sparse comes with a ctags tool that examines the symbol table generated during parsing of C code and creates a tags file recording exactly where every global symbol was defined, even when that was the result of multi-layered macro expansion. While git grep cannot tell me where the __alloc_statement() function was defined, the tags file created by ctags tells me it was on line 70 of allocate.h:
DECLARE_ALLOCATOR(statement);
If I can fit this ctags into my workflow, I might need to find a new least favorite coding style.
The main tool that uses the sparse library is, of course, sparse itself, which reports various errors and warnings while examining the code. Some of these, such as the test for assignment in an if condition, are applicable to C in general, but a large class of the warnings that sparse generates comes from extending the C language in various ways. Sparse defines the macro __CHECKER__ so that the use of these extensions can be made visible only to sparse, not to other C compilers.
Some of these extensions, like __range__, are new statements, but most are attributes that can be attached to variable and type declarations using the GCC attribute syntax extension. These can provide extra information about how a variable or type should be used so sparse can warn when the expectations are not met. For example, there are two ways to initialize a structure, one of which is with positional initializers:
struct foo { int a,b; } positional = { 1, 2 };
The alternative is to use designated initializers:
struct foo { int a,b; } designated = { b:2, a:1 };
Positional initializers provide a simple list of values that are assigned to the fields of the structure based on their position in the list. Designated initializers, instead, attach the field name to each value in the initializer to avoid simple ordering errors. In 2009, sparse gained support for a designated_init attribute that, when attached to a structure type, will trigger a warning if a positional initializer is ever used to initialize a structure of that type. This same attribute was copied into GCC in 2014, so we no longer really need sparse to get that warning, but there are other extensions that are less generally applicable and so less likely to make their way into GCC.
Two such extensions are related to the type system and have been discussed previously in these pages: bitwise, which creates a "new" type (in the Ada sense) that is identical to some other integer type except that it is incompatible with it, and address_space, which provides similar functionality for pointers. bitwise can be used to avoid confusing big-endian and little-endian values, or to avoid accidentally using bitmasks on the wrong variable. The most obvious use of address spaces in the Linux kernel is to distinguish user-space pointers from kernel-space pointers, though there are other uses.
All of these extensions are amenable to simple static analysis: they enhance the type information in a way that allows certain operations to be easily seen as incorrect. Sometimes it would be nice to perform some more dynamic analysis, where a particular operation is valid only when preceded or followed by some matching operation. A memory allocation must be followed by either releasing the memory or storing a reference somewhere, a pointer may only be dereferenced if it has been assigned a non-NULL value, a lock that has been taken must always be released, and so on.
The final SSA stage of sparse does allow for some dynamic analysis, but only at a very coarse level. It can often detect when a variable can be used without ever being assigned, but it cannot, for example, track if a variable is within a given range; that only works for constants.
One small step toward more general data-flow analysis is found in the "context" tracking that sparse does to help catch errors where a lock is taken but not released. While the implementation is useful, it is extremely simplistic. It does not track individual locks at all but, instead, stores a single integer "context" counter for each basic block. Any "lock" event on any variable increments this counter; any unlock decrements it. As long as all paths through the code lead to the same context value at each location, it is assumed that the code is correct. This test would be easy to fool, but code designed to fool sparse would likely be quite obvious to humans, while a forgotten unlock call on an error path, which humans may miss, would be obvious to sparse.
Building on sparse
This observation that sparse, while powerful, is sometimes simplistic is where we will leave sparse for now, though the interested reader is encouraged to explore and experiment with the code which is, of course, open. But this is not the end of our little foray into the internals of static analyzers. Smatch is a tool built on top of sparse which fills in some of the gaps left by sparse. If you have a desire to define some extensions to C to help catch more errors and sparse doesn't seem to be up to your task, smatch may be the tool for you.
Mount namespaces and shared subtrees
Mount namespaces are a powerful and flexible tool for creating per-user and per-container filesystem trees. They are also a surprisingly complex feature; in this continuation of our series on namespaces we unravel some of that complexity. In particular, we will take a close look at the shared subtrees feature, which allows mount and unmount events to be propagated between mount namespaces in an automatic, controlled fashion.
Introduction
Mount namespaces were the first namespace type added to Linux, appearing in 2002 in Linux 2.4.19. They isolate the list of mount points seen by the processes in a namespace. Or, to put things another way, each mount namespace has its own list of mount points, meaning that processes in different namespaces see and are able to manipulate different views of the single directory hierarchy.
When the system is first booted, there is a single mount namespace, the so-called "initial namespace". New mount namespaces are created by using the CLONE_NEWNS flag with either the clone() system call (to create a new child process in the new namespace) or the unshare() system call (to move the caller into the new namespace). When a new mount namespace is created, it receives a copy of the mount point list replicated from the namespace of the caller of clone() or unshare().
Following the clone() or unshare() call, mount points can be independently added and removed in each namespace (via mount() and umount()). Changes to the mount point list are (by default) visible only to processes in the mount namespace where the process resides; the changes are not visible in other mount namespaces.
Mount namespaces serve a variety of purposes. For example, they can be used to provide per-user views of the filesystem. Other uses include mounting a /proc filesystem for a new PID namespace without causing side effects for other processes, and chroot()-style isolation of a process to a portion of the single directory hierarchy. In some use cases, mount namespaces are combined with bind mounts.
Shared subtrees
Once the implementation of mount namespaces was completed, user-space programmers encountered a usability problem: mount namespaces provided too much isolation between namespaces. Suppose, for example, that a new disk is loaded into an optical disk drive. In the original implementation, the only way to make that disk visible in all mount namespaces was to mount the disk separately in each namespace. In many cases, it would instead be preferable to perform a single mount operation that makes the disk visible in all (or perhaps some subset) of the mount namespaces on the system.
Because of the problem just described, the shared subtrees feature was added in Linux 2.6.15 (in early 2006, around three years after the initial implementation of mount namespaces). The key benefit of shared subtrees is to allow automatic, controlled propagation of mount and unmount events between namespaces. This means, for example, that mounting an optical disk in one mount namespace can trigger a mount of that disk in all other namespaces.
Under the shared subtrees feature, each mount point is marked with a "propagation type", which determines whether mount points created and removed under this mount point are propagated to other mount points. There are four different propagation types:
- MS_SHARED: This mount point shares mount and unmount events with other mount points that are members of its "peer group" (which is described in more detail below). When a mount point is added or removed under this mount point, the change will propagate to the peer group, so that the mount or unmount will also take place under each of the peer mount points. Propagation also occurs in the reverse direction, so that mount and unmount events on a peer mount will also propagate to this mount point.
- MS_PRIVATE: This is the converse of a shared mount point. The mount point does not propagate events to any peers, and does not receive propagation events from any peers.
- MS_SLAVE: This propagation type sits midway between shared and private. A slave mount has a master—a shared peer group whose members propagate mount and unmount events to the slave mount. However, the slave mount does not propagate events to the master peer group.
- MS_UNBINDABLE: This mount point is unbindable. Like a private mount point, this mount point does not propagate events to or from peers. In addition, this mount point can't be the source for a bind mount operation.
It's worth expanding on a few points that were glossed over above. The first is that the propagation type is a per-mount-point setting. Within a namespace, some mount points might be marked shared, while others are marked private (or slave or unbindable).
The second point to emphasize is that the propagation type determines the propagation of mount and unmount events immediately under the mount point. Thus, if, under a shared mount, X, we create a child mount, Y, that child mount will propagate to other mount points in the peer group. However, the propagation type of X would have no effect for mount points created and removed under Y; whether or not events under Y are propagated would depend on the propagation type that is defined for Y. Analogously, whether an unmount event would be propagated when X itself is unmounted would depend on the propagation type of the parent mount of X.
In passing, it is perhaps worth clarifying that the word "event" is used here as an abstract term, in the sense of "something happened". The notion of event propagation does not imply some sort of message passing between mount points. Rather, it carries the idea that a mount or unmount operation on one mount point triggers a matching operation on one or more other mount points.
Finally, it is possible for a mount to be both the slave of a master peer group as well as sharing events with a set of peers of its own—a so-called slave-and-shared mount. In this case, the mount might receive propagation events from the master, and those events would then be propagated to its peers.
Peer groups
A peer group is a set of mount points that propagate mount and unmount events to one another. A peer group acquires new members when a mount point whose propagation type is shared is either replicated during the creation of a new namespace or is used as the source for a bind mount. (For a bind mount, the details are more complex than we describe here; details can be found in the kernel source file Documentation/filesystems/sharedsubtree.txt.) In both cases, the new mount point is made a member of the same peer group as the existing mount point. Conversely, a mount point ceases to be a member of a peer group when it is unmounted, either explicitly, or implicitly when a mount namespace is torn down because the last member process terminates or moves to another namespace.
For example, suppose that in a shell running in the initial mount namespace, we make the root mount point private and create two shared mount points:
sh1# mount --make-private /
sh1# mount --make-shared /dev/sda3 /X
sh1# mount --make-shared /dev/sda5 /Y
As indicated by the "#" in the shell prompts, privilege is required for the various mount commands that we employ in the example shell sessions to create mount points and change their propagation types.
Then, on a second terminal, we use the unshare command to create a new mount namespace where we run a shell:
sh2# unshare -m --propagation unchanged sh
(The -m option creates a new mount namespace; the purpose of the --propagation unchanged option is explained later.)
Returning to the first terminal, we then create a bind mount from the /X mount point:
sh1# mkdir /Z
sh1# mount --bind /X /Z
Following these steps, we have the situation shown in the diagram below.
In this scenario, there are two peer groups:
- The first peer group contains the mount points X, X' (the duplicate of mount point X that was created when the second namespace was created), and Z (the bind mount created from the source mount point X in the initial namespace).
- The second peer group contains the mount points Y and Y' (the duplicate of mount point Y that was created when the second namespace was created).
Note that the bind mount Z, which was created in the initial namespace after the second namespace was created, was not replicated in the second namespace because the parent mount (/) was marked private.
Examining propagation types and peer groups via /proc/PID/mountinfo
The /proc/PID/mountinfo file (documented in the proc(5) manual page) displays a range of information about the mount points for the mount namespace in which the process PID resides. All processes that reside in the same mount namespace will see the same view in this file. This file was designed to provide more information about mount points than was possible with the older, non-extensible /proc/PID/mounts file. Included in each record in this file is a (possibly empty) set of so-called "optional fields", which display information about the propagation type and peer group (for shared mounts) of each mount.
For a shared mount, the optional fields in the corresponding record in /proc/PID/mountinfo will contain a tag of the form shared:N. Here, the shared tag indicates that the mount is sharing propagation events with a peer group. The peer group is identified by N, an integer value that uniquely identifies the peer group. These IDs are numbered starting at 1, and may be recycled when a peer group ceases to exist because all of its members departed the group. All mount points that are members of the same peer group will show a shared:N tag with the same N in the /proc/PID/mountinfo file.
Thus, for example, if we list the contents of /proc/self/mountinfo in the first of the shells discussed in the example above, we see the following (with a little bit of sed filtering to trim some irrelevant information from the output):
sh1# cat /proc/self/mountinfo | sed 's/ - .*//'
61 0 8:2 / / rw,relatime
81 61 8:3 / /X rw,relatime shared:1
124 61 8:5 / /Y rw,relatime shared:2
228 61 8:3 / /Z rw,relatime shared:1
From this output, we first see that the root mount point is private. This is indicated by the absence of any tags in the optional fields. We also see that the mount points /X and /Z are shared mount points in the same peer group (with ID 1), which means that mount and unmount events under either of these two mounts will propagate to the other. The mount /Y is a shared mount in a different peer group (ID 2), which, by definition, does not propagate events to or from the mounts in peer group 1.
The /proc/PID/mountinfo file also enables us to see the parental relationship between mount points. The first field in each record is a unique ID for each mount point. The second field is the ID for the parent mount. From the above output, we can see that the mount points /X, /Y, and /Z are all children of the root mount because their parent IDs are all 61.
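These relationships can also be extracted mechanically with a little awk. The sketch below classifies the trimmed records shown above: field 1 is the mount ID, field 2 the parent ID, field 5 the mount point, and field 7, when present, the peer-group tag:

```shell
# Summarize each record from the trimmed mountinfo output:
# mount ID, mount point, parent ID, and peer-group tag (if any).
cat <<'EOF' | awk '{ tag = ($7 ~ /^shared:/) ? $7 : "private"
                     printf "mount %s on %s (parent %s): %s\n", $1, $5, $2, tag }'
61 0 8:2 / / rw,relatime
81 61 8:3 / /X rw,relatime shared:1
124 61 8:5 / /Y rw,relatime shared:2
228 61 8:3 / /Z rw,relatime shared:1
EOF
```

This prints one line per mount, confirming that /X, /Y, and /Z all have parent 61 (the root mount) and that /X and /Z share peer group 1.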
Running the same command in the second shell (in the second namespace), we see:
sh2# cat /proc/self/mountinfo | sed 's/ - .*//'
147 146 8:2 / / rw,relatime
221 147 8:3 / /X rw,relatime shared:1
224 147 8:5 / /Y rw,relatime shared:2
Again, we see that the root mount point is private. Then we see that /X is a shared mount in peer group 1, the same peer group as the mounts /X and /Z in the initial mount namespace. Finally, we see that /Y is a shared mount in peer group 2, the same peer group as the mount /Y in the initial mount namespace. One final point to note is that the mount points that were replicated in the second namespace have their own unique IDs that differ from the IDs of the corresponding mounts in the initial namespace.
Debating defaults
Because the situation is a little complex, we have so far avoided discussing what the default propagation type is for a new mount point. From the kernel's perspective, the default when a new device mount is created is as follows:
- If the mount point has a parent (i.e., it is a non-root mount point) and the propagation type of the parent is MS_SHARED, then the propagation type of the new mount is also MS_SHARED.
- Otherwise, the propagation type of the new mount is MS_PRIVATE.
According to these rules, the root mount would be MS_PRIVATE, and all descendant mounts would by default also be MS_PRIVATE. However, MS_SHARED would arguably have been a better default, since it is the more commonly employed propagation type. For that reason, systemd sets the propagation type of all mount points to MS_SHARED. Thus, on most modern Linux distributions, the default propagation type is effectively MS_SHARED.
This is not the final word on the subject, however, since the util-linux unshare utility also has something to say. When creating a new mount namespace, unshare assumes that the user wants a fully isolated namespace, and makes all mount points private by performing the equivalent of the following command (which recursively marks all mounts under the root directory as private):
mount --make-rprivate /
To prevent this, we can use an additional option when creating the new namespace:
unshare -m --propagation unchanged <cmd>
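Which of these defaults is actually in effect on a given system can be checked straight from mountinfo. The following sketch inspects the optional fields of the root mount (field 5 is the mount point; field 7 holds the first optional tag, or "-" if there is none):

```shell
# Report the propagation type of the root mount: a shared:N tag means
# shared, a master:N tag means slave, and no tag at all means private.
awk '$5 == "/" {
    if ($7 ~ /^shared:/)      print "shared (" $7 ")"
    else if ($7 ~ /^master:/) print "slave (" $7 ")"
    else                      print "private"
    exit
}' /proc/self/mountinfo
```

On a typical systemd-based distribution this reports the root mount as shared; inside a namespace created by a bare unshare -m, it reports private.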
Concluding remarks
In this article, we introduced the "theory" of mount namespaces and shared subtrees. We now have enough information to demonstrate and understand the semantics of the various propagation types; that will be the subject of a follow-on article.
Page editor: Jonathan Corbet
