Namespaces in operation, part 4: more on PID namespaces
In this article, we continue last week's discussion of PID namespaces (and extend our ongoing series on namespaces). One use of PID namespaces is to implement a package of processes (a container) that behaves like a self-contained Linux system. A key part of a traditional system—and likewise a PID namespace container—is the init process. Thus, we'll look at the special role of the init process and note one or two areas where it differs from the traditional init process. In addition, we'll look at some other details of the namespaces API as it applies to PID namespaces.
The PID namespace init process
The first process created inside a PID namespace gets a process ID of 1 within the namespace. This process has a similar role to the init process on traditional Linux systems. In particular, the init process can perform initializations required for the PID namespace as a whole (e.g., perhaps starting other processes that should be a standard part of the namespace) and becomes the parent for processes in the namespace that become orphaned.
In order to explain the operation of PID namespaces, we'll make use of a few purpose-built example programs. The first of these programs, ns_child_exec.c, has the following command-line syntax:
ns_child_exec [options] command [arguments]
The ns_child_exec program uses the clone() system call to create a child process; the child then executes the given command with the optional arguments. The main purpose of the options is to specify new namespaces that should be created as part of the clone() call. For example, the -p option causes the child to be created in a new PID namespace, as in the following example:
    $ su                  # Need privilege to create a PID namespace
    Password:
    # ./ns_child_exec -p sh -c 'echo $$'
    1
That command line creates a child in a new PID namespace to execute a shell echo command that displays the shell's PID. With a PID of 1, the shell was the init process for the PID namespace that (briefly) existed while the shell was running.
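The core of what happens here can be sketched in a few lines of C. This is not the source of ns_child_exec.c; it is a minimal illustration, assuming glibc's clone() wrapper, in which the function names pid_seen_in_new_pid_ns() and report_pid() and the STACK_SIZE constant are our own inventions. The child is created with CLONE_NEWPID and passes the PID that it sees for itself back to the caller through its exit status.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)    /* Stack size for the cloned child */

/* Runs in the child: the return value becomes the child's exit status. */
static int report_pid(void *arg)
{
    (void) arg;
    return (int) getpid();          /* Expected to be 1 in a new PID namespace */
}

/* Create a child in a new PID namespace and return the PID that the
   child observed for itself, or -1 on failure (for example, when the
   caller lacks the privilege needed to create a PID namespace). */
int pid_seen_in_new_pid_ns(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on most architectures, so pass its top */
    pid_t child = clone(report_pid, stack + STACK_SIZE,
                        CLONE_NEWPID | SIGCHLD, NULL);
    if (child == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t w = waitpid(child, &status, 0);
    free(stack);
    if (w == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

When run with sufficient privilege, the function should return 1, matching the output of the shell command above; without privilege, clone() fails and the function returns -1.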
Our next example program, simple_init.c, is a program that we'll execute as the init process of a PID namespace. This program is designed to allow us to demonstrate some features of PID namespaces and the init process.
The simple_init program performs the two main functions of init. One of these functions is "system initialization". Most init systems are more complex programs that take a table-driven approach to system initialization. Our (much simpler) simple_init program provides a simple shell facility that allows the user to manually execute any shell commands that might be needed to initialize the namespace; this approach also allows us to freely execute shell commands in order to conduct experiments in the namespace. The other function performed by simple_init is to reap the status of its terminated children using waitpid().
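The reaping function can be illustrated with a short sketch. This is not the code of simple_init.c; it is a simplified example in which the names reap_children() and spawn_and_reap() are hypothetical. Passing WNOHANG gives the non-blocking behavior that suits a SIGCHLD handler, while passing 0 blocks until no children remain.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap terminated children and return how many were reaped.
   With flags == WNOHANG the call never blocks (suitable for use from a
   SIGCHLD handler); with flags == 0 it blocks until no children remain. */
int reap_children(int flags)
{
    int reaped = 0;
    int saved_errno = errno;        /* waitpid() may change errno */

    for (;;) {
        pid_t pid = waitpid(-1, NULL, flags);
        if (pid > 0)
            reaped++;               /* One more child status collected */
        else if (pid == -1 && errno == EINTR)
            continue;               /* Interrupted by a signal: retry */
        else
            break;                  /* 0: none ready; -1: no children left */
    }

    errno = saved_errno;
    return reaped;
}

/* Demonstration helper (hypothetical, not part of simple_init): create
   'n' children that exit at once, then reap them all. */
int spawn_and_reap(int n)
{
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == -1)
            break;
        if (pid == 0)
            _exit(0);               /* Child: terminate immediately */
    }
    return reap_children(0);
}
```

An init process would typically call reap_children(WNOHANG) from its SIGCHLD handler, so that every terminated child, including adopted orphans, has its status collected.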
Thus, for example, we can use the ns_child_exec program in conjunction with simple_init to fire up an init process that runs in a new PID namespace:
    # ./ns_child_exec -p ./simple_init
    init$
The init$ prompt indicates that the simple_init program is ready to read and execute a shell command.
We'll now use the two programs we've presented so far in conjunction with another small program, orphan.c, to demonstrate that processes that become orphaned inside a PID namespace are adopted by the PID namespace init process, rather than the system-wide init process.
The orphan program performs a fork() to create a child process. The parent process then exits while the child continues to run; when the parent exits, the child becomes an orphan. The child executes a loop that continues until it becomes an orphan (i.e., getppid() returns 1); once the child becomes an orphan, it terminates. The parent and the child print messages so that we can see when the two processes terminate and when the child becomes an orphan.
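The logic just described can be sketched as follows. This is not the source of orphan.c: the function name adopter_of_orphan() is hypothetical, the messages are omitted, and, as a deliberate variation, the child loops until its parent PID differs from the original parent's PID rather than until getppid() returns 1, so that the sketch also works on systems where orphans are adopted by a process other than PID 1. The child reports its adopter through a pipe.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Fork a "Parent" that creates a "Child" and exits at once. Return the
   parent PID that the orphaned child sees after adoption, or -1 on error. */
pid_t adopter_of_orphan(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t parent = fork();
    if (parent == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (parent == 0) {                      /* "Parent" process */
        pid_t self = getpid();
        pid_t child = fork();
        if (child == 0) {                   /* "Child" process */
            struct timespec delay = { 0, 1000000 };     /* 1 ms */
            while (getppid() == self)       /* Still has original parent */
                nanosleep(&delay, NULL);
            pid_t adopter = getppid();      /* 1 if adopted by init */
            ssize_t n = write(fds[1], &adopter, sizeof(adopter));
            _exit(n == (ssize_t) sizeof(adopter) ? 0 : 1);
        }
        _exit(child == -1 ? 1 : 0);         /* Exit, orphaning the child */
    }

    close(fds[1]);                          /* Keep only the read end */
    waitpid(parent, NULL, 0);               /* Reap the "Parent" */

    pid_t adopter = -1;
    if (read(fds[0], &adopter, sizeof(adopter)) != (ssize_t) sizeof(adopter))
        adopter = -1;
    close(fds[0]);
    return adopter;
}
```

Called from a process inside a PID namespace, the function would be expected to return 1, the PID of the namespace's init process.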
In order to see that our simple_init program reaps the orphaned child process, we'll employ that program's -v option, which causes it to produce verbose messages about the children that it creates and the terminated children whose status it reaps:
    # ./ns_child_exec -p ./simple_init -v
            init: my PID is 1
    init$ ./orphan
            init: created child 2
    Parent (PID=2) created child with PID 3
    Parent (PID=2; PPID=1) terminating
            init: SIGCHLD handler: PID 2 terminated
    init$                   # simple_init prompt interleaved with output from child
    Child (PID=3) now an orphan (parent PID=1)
    Child (PID=3) terminating
            init: SIGCHLD handler: PID 3 terminated
In the above output, the indented messages prefixed with init: are printed by the simple_init program's verbose mode. All of the other messages (other than the init$ prompts) are produced by the orphan program. From the output, we can see that the child process (PID 3) becomes an orphan when its parent (PID 2) terminates. At that point, the child is adopted by the PID namespace init process (PID 1), which reaps the child when it terminates.
Signals and the init process
The traditional Linux init process is treated specially with respect to signals. The only signals that can be delivered to init are those for which the process has established a signal handler; all other signals are ignored. This prevents the init process—whose presence is essential for the stable operation of the system—from being accidentally killed, even by the superuser.
PID namespaces implement some analogous behavior for the namespace-specific init process. Other processes in the namespace (even privileged processes) can send only those signals for which the init process has established a handler. This prevents members of the namespace from inadvertently killing a process that has an essential role in the namespace. Note, however, that (as for the traditional init process) the kernel can still generate signals for the PID namespace init process in all of the usual circumstances (e.g., hardware exceptions, terminal-generated signals such as SIGTTOU, and expiration of a timer).
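In practice, this means that an init process that wants to be told to shut down must establish a handler for the signal in question. The following is a minimal sketch of how that might be done with sigaction(); the names shutdown_requested, on_shutdown_signal(), and accept_signal() are hypothetical and are not taken from simple_init.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler; polled by the init process's main loop. */
volatile sig_atomic_t shutdown_requested = 0;

static void on_shutdown_signal(int sig)
{
    (void) sig;
    shutdown_requested = 1;         /* Only async-signal-safe work here */
}

/* Establish a handler for 'sig' so that the signal can be delivered to
   a PID namespace init process. Returns 0 on success, -1 on error. */
int accept_signal(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_shutdown_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;       /* Restart interrupted system calls */
    return sigaction(sig, &sa, NULL);
}
```

After a call such as accept_signal(SIGTERM), a process in the namespace could use kill() to ask init to shut down. Handlers cannot be established for SIGKILL and SIGSTOP, so the call fails for those signals.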
Signals can also (subject to the usual permission checks) be sent to the PID namespace init process by processes in ancestor PID namespaces. Again, only the signals for which the init process has established a handler can be sent, with two exceptions: SIGKILL and SIGSTOP. When a process in an ancestor PID namespace sends these two signals to the init process, they are forcibly delivered (and can't be caught). The SIGSTOP signal stops the init process; SIGKILL terminates it. Since the init process is essential to the functioning of the PID namespace, if the init process is terminated by SIGKILL (or it terminates for any other reason), the kernel terminates all other processes in the namespace by sending them a SIGKILL signal.
Normally, a PID namespace will also be destroyed when its init process terminates. However, there is an unusual corner case: the namespace won't be destroyed as long as a /proc/PID/ns/pid file for one of the processes in that namespace is bind mounted or held open. However, it is not possible to create new processes in the namespace (via setns() plus fork()): the lack of an init process is detected during the fork() call, which fails with an ENOMEM error (the traditional error indicating that a PID cannot be allocated). In other words, the PID namespace continues to exist, but is no longer usable.
Mounting a procfs filesystem (revisited)
In the previous article in this series, the /proc filesystems (procfs) for the PID namespaces were mounted at various locations other than the traditional /proc mount point. This allowed us to use shell commands to look at the contents of the /proc/PID directories that corresponded to each of the new PID namespaces while at the same time using the ps command to look at the processes visible in the root PID namespace.
However, tools such as ps rely on the contents of the procfs mounted at /proc to obtain the information that they require. Therefore, if we want ps to operate correctly inside a PID namespace, we need to mount a procfs for that namespace. Since the simple_init program permits us to execute shell commands, we can perform this task from the command line, using the mount command:
    # ./ns_child_exec -p -m ./simple_init
    init$ mount -t proc proc /proc
    init$ ps a
      PID TTY      STAT   TIME COMMAND
        1 pts/8    S      0:00 ./simple_init
        3 pts/8    R+     0:00 ps a
The ps a command lists all processes accessible via /proc. In this case, we see only two processes, reflecting the fact that there are only two processes running in the namespace.
When running the ns_child_exec command above, we employed that program's -m option, which places the child that it creates (i.e., the process running simple_init) inside a separate mount namespace. As a consequence, the mount command does not affect the /proc mount seen by processes outside the namespace.
unshare() and setns()
In the second article in this series, we described two system calls that are part of the namespaces API: unshare() and setns(). Since Linux 3.8, these system calls can be employed with PID namespaces, but they have some idiosyncrasies when used with those namespaces.
Specifying the CLONE_NEWPID flag in a call to unshare() creates a new PID namespace, but does not place the caller in the new namespace. Rather, any children created by the caller will be placed in the new namespace; the first such child will become the init process for the namespace.
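This behavior can be checked with a small experiment, sketched below. The function names first_child_pid_after_unshare() and exit_status_of() are hypothetical. The unshare() call is made in a helper process so that the caller's own "PID namespace for children" is left alone; the helper confirms that its PID is unchanged and then forks a child, which reports the PID it sees through its exit status.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait for 'pid' and return its exit status, or -1 if it did not exit
   normally. */
static int exit_status_of(pid_t pid)
{
    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Return the PID seen by the first child created after
   unshare(CLONE_NEWPID), or -1 on failure (e.g., insufficient privilege). */
int first_child_pid_after_unshare(void)
{
    pid_t helper = fork();
    if (helper == -1)
        return -1;

    if (helper == 0) {
        pid_t before = getpid();
        if (unshare(CLONE_NEWPID) == -1)
            _exit(255);                 /* Could not create the namespace */
        if (getpid() != before)
            _exit(254);                 /* Not expected: our PID is unchanged */

        pid_t child = fork();           /* First child: init of the new namespace */
        if (child == -1)
            _exit(255);
        if (child == 0)
            _exit((int) getpid());      /* Expected to be 1 */

        int st = exit_status_of(child);
        _exit(st == -1 ? 255 : st);
    }

    int st = exit_status_of(helper);
    return (st == -1 || st >= 254) ? -1 : st;
}
```

With sufficient privilege, the function should return 1: the helper stays in its original PID namespace, while its first child becomes the init process of the new one.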
The setns() system call now supports PID namespaces:
    setns(fd, 0);     /* Second argument can be CLONE_NEWPID to force a
                         check that 'fd' refers to a PID namespace */
The fd argument is a file descriptor that identifies a PID namespace that is a descendant of the PID namespace of the caller; that file descriptor is obtained by opening the /proc/PID/ns/pid file for one of the processes in the target namespace. As with unshare(), setns() does not move the caller to the PID namespace; instead, children that are subsequently created by the caller will be placed in the namespace.
We can use an enhanced version of the ns_exec.c program that we presented in the second article in this series to demonstrate some aspects of using setns() with PID namespaces that appear surprising until we understand what is going on. The new program, ns_run.c, has the following syntax:
ns_run [-f] [-n /proc/PID/ns/FILE]... command [arguments]
The program uses setns() to join the namespaces specified by the /proc/PID/ns files contained within -n options. It then goes on to execute the given command with optional arguments. If the -f option is specified, it uses fork() to create a child process that is used to execute the command.
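The sequence of steps for the PID namespace case with -f can be sketched as follows. This is not the source of ns_run.c; the function name run_in_pid_ns() and the use of exit status 255 to signal a setup failure are our own assumptions. As in the other sketches, the setns() call is made in a helper process, and only the child that the helper forks afterward runs in the target namespace.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Join the PID namespace referred to by 'ns_path' (a /proc/PID/ns/pid
   file) and run 'cmd' there. Returns the command's exit status, or -1
   if the namespace could not be joined or the command did not exit
   normally. */
int run_in_pid_ns(const char *ns_path, char *const cmd[])
{
    pid_t helper = fork();
    if (helper == -1)
        return -1;

    if (helper == 0) {
        int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
        if (fd == -1)
            _exit(255);
        /* CLONE_NEWPID makes setns() fail unless 'fd' is a PID namespace */
        if (setns(fd, CLONE_NEWPID) == -1)
            _exit(255);
        close(fd);

        pid_t child = fork();       /* Only this child enters the namespace */
        if (child == -1)
            _exit(255);
        if (child == 0) {
            execvp(cmd[0], cmd);
            _exit(127);             /* exec failed */
        }

        int status;
        if (waitpid(child, &status, 0) == -1 || !WIFEXITED(status))
            _exit(255);
        _exit(WEXITSTATUS(status)); /* Pass the command's status upward */
    }

    int status;
    if (waitpid(helper, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

Note that the helper, not the init process of the target namespace, is the parent of the process that runs the command, and so it is the helper that reaps that process.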
Suppose that, in one terminal window, we fire up our simple_init program in a new PID namespace in the usual manner, with verbose logging so that we are informed when it reaps child processes:
    # ./ns_child_exec -p ./simple_init -v
            init: my PID is 1
    init$
Then we switch to a second terminal window where we use the ns_run program to execute our orphan program. This will have the effect of creating two processes in the PID namespace governed by simple_init:
    # ps -C sleep -C simple_init
      PID TTY          TIME CMD
     9147 pts/8    00:00:00 simple_init
    # ./ns_run -f -n /proc/9147/ns/pid ./orphan
    Parent (PID=2) created child with PID 3
    Parent (PID=2; PPID=0) terminating
    #
    Child (PID=3) now an orphan (parent PID=1)
    Child (PID=3) terminating
Looking at the output from the "Parent" process (PID 2) created when the orphan program is executed, we see that its parent process ID is 0. This reflects the fact that the process that started the orphan process (ns_run) is in a different namespace—one whose members are invisible to the "Parent" process. As already noted in the previous article, getppid() returns 0 in this case.
The following diagram shows the relationships of the various processes before the orphan "Parent" process terminates. The arrows indicate parent-child relationships between processes.
Returning to the window running the simple_init program, we see the following output:
init: SIGCHLD handler: PID 3 terminated
The "Child" process (PID 3) created by the orphan program was reaped by simple_init, but the "Parent" process (PID 2) was not. This is because the "Parent" process was reaped by its parent (ns_run) in a different namespace. The following diagram shows the processes and their relationships after the orphan "Parent" process has terminated and before the "Child" terminates.
It's worth emphasizing that setns() and unshare() treat PID namespaces specially. For other types of namespaces, these system calls do change the namespace of the caller. They do not change the PID namespace of the calling process because becoming a member of another PID namespace would change the process's idea of its own PID, since getpid() reports the process's PID with respect to the PID namespace in which the process resides. Many user-space programs and libraries rely on the assumption that a process's PID (as reported by getpid()) is constant (in fact, the GNU C library getpid() wrapper function caches the PID); those programs would break if a process's PID changed. To put things another way: a process's PID namespace membership is determined when the process is created, and (unlike other types of namespace membership) cannot be changed thereafter.
Concluding remarks
In this article we've looked at the special role of the PID namespace init process, shown how to mount a procfs for a PID namespace so that it can be used by tools such as ps, and looked at some of the peculiarities of unshare() and setns() when employed with PID namespaces. This completes our discussion of PID namespaces; in the next article, we'll turn to look at user namespaces.
Posted Jan 23, 2013 19:09 UTC (Wed)
by luto (subscriber, #39314)
[Link] (2 responses)
http://web.mit.edu/luto/www/linux/nnp/newns.c
Have fun! If any of you find it useful, let me know -- I can probably polish it a bit and send it to util-linux or something.
Also, on very new kernels (3.8+), a lot of this stuff can be done without privilege if you're willing to accept a few restrictions.
Posted Jan 24, 2013 16:56 UTC (Thu)
by dashesy (guest, #74652)
[Link] (1 responses)
Posted Jan 25, 2013 16:44 UTC (Fri)
by fishface60 (subscriber, #88700)
[Link]
I'm quite looking forward to being able to launch a container in shell.
Posted Jan 23, 2013 21:44 UTC (Wed)
by dashesy (guest, #74652)
[Link] (6 responses)
Is there a way I can bookmark the entire series rather than individual articles?
Posted Jan 23, 2013 21:50 UTC (Wed)
by corbet (editor, #1)
[Link] (5 responses)
Posted Jan 23, 2013 23:31 UTC (Wed)
by dashesy (guest, #74652)
[Link]
Posted Jan 24, 2013 4:44 UTC (Thu)
by xxiao (guest, #9631)
[Link] (3 responses)
Thanks!
Posted Jan 25, 2013 15:03 UTC (Fri)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted Jan 27, 2013 0:23 UTC (Sun)
by xxiao (guest, #9631)
[Link] (1 responses)
Posted Jan 27, 2013 18:06 UTC (Sun)
by nix (subscriber, #2304)
[Link]
Posted Jan 25, 2013 10:00 UTC (Fri)
by sorokin (guest, #88478)
[Link] (3 responses)
Looks like some dirty hack. Why can't the init process disable signals itself?
Posted Jan 25, 2013 10:27 UTC (Fri)
by andresfreund (subscriber, #69562)
[Link] (2 responses)
Posted Jan 25, 2013 10:35 UTC (Fri)
by mpr22 (subscriber, #60784)
[Link] (1 responses)
Posted Jan 25, 2013 18:07 UTC (Fri)
by ebiederm (subscriber, #35028)
[Link]
The reason for ignoring the others is that this is the way things have worked for "init" processes as far back in the Linux history as I have looked, and maintaining backwards compatibility is important.
Posted Jan 31, 2013 16:26 UTC (Thu)
by alex2 (guest, #73934)
[Link] (2 responses)
Posted Feb 5, 2013 12:50 UTC (Tue)
by Lennie (subscriber, #49641)
[Link]
This is because I'm not sure how well you can control, from the parent namespace, which packets can and cannot be sent from the network namespace.
Posted Feb 5, 2013 16:18 UTC (Tue)
by bjencks (subscriber, #80303)
[Link]
(Note that you can still connect to filesystem-namespace unix sockets if you can access them as files -- you need to chroot or use mount namespaces if you want to hide them as well. I believe abstract namespace unix sockets are isolated per-namespace.)
Posted Jun 27, 2013 20:00 UTC (Thu)
by Urhixidur (guest, #91620)
[Link]
Posted Mar 5, 2015 2:04 UTC (Thu)
by apollock (subscriber, #14629)
[Link] (6 responses)
I'm creating a new everything, i.e.
and then unmounting a filesystem from that shell, and it's getting unmounted in another shell that hasn't been interacting with the namespace, which isn't what I would have expected? Similarly, if I mount /proc in the last two example invocations above, it clobbers the systemwide /proc mount with what's going on inside my new PID namespace. Also not what I would have expected?
I'm using 3.19.0
Posted Mar 5, 2015 9:25 UTC (Thu)
by mkerrisk (subscriber, #1978)
[Link] (5 responses)
So, in the new namespace, you need to disable propagation of mount events on /, either by making it a private mount (prevents propagation in both directions) or by making it a slave mount (allows propagation of mount events under / into the new namespace, but doesn't propagate events outside the new namespace). So, for example, in the shell session under the heading Mounting a procfs filesystem (revisited), we should add one further shell command:
For more info about mount propagation, see the kernel source file Documentation/filesystems/sharedsubtree.txt and the mount(8) man page.
Posted Mar 5, 2015 23:55 UTC (Thu)
by apollock (subscriber, #14629)
[Link] (4 responses)
It looks like I have to do the same thing to /proc prior to mounting it
Posted Mar 6, 2015 9:19 UTC (Fri)
by mkerrisk (subscriber, #1978)
[Link] (3 responses)
Actually, it was quite by chance. I happened to be checking some details in these articles myself.
> It looks like I have to do the same thing to /proc prior to mounting it
I don't believe that should be necessary. What makes you think that it is?
Posted Mar 7, 2015 7:53 UTC (Sat)
by apollock (subscriber, #14629)
[Link] (2 responses)
I was basically testing two scenarios:
1) Unmounting a filesystem that was mounted inside and outside the new namespace. Expected behaviour: it was only unmounted inside the new namespace
2) Mounting /proc inside the new namespace. Expected behaviour: only seeing the process entries for processes inside the new namespace inside the namespace, and there being no impact outside this namespace
Posted Mar 7, 2015 10:04 UTC (Sat)
by mkerrisk (subscriber, #1978)
[Link] (1 responses)
> It looks like I have to do the same thing to /proc prior to mounting it
Yes, you're right. I was getting confused with another case, where if we mount a procfs at a location other than the usual /proc, then we need to make / a private or slave mount in order not to have that mount appear in the initial mount namespace.
So, in fact all that's needed if we're mounting at /proc inside the simple_init program is
Posted Oct 10, 2017 1:54 UTC (Tue)
by marcosps (subscriber, #115562)
[Link]
What do you think about changing the article to add the --make-slave mount command? It made me turn off my computer twice, as the system gets unstable (at least on my Fedora 26)...
Posted Dec 15, 2016 2:26 UTC (Thu)
by orbisvicis (guest, #113024)
[Link] (1 responses)
Posted Feb 13, 2019 9:48 UTC (Wed)
by mkerrisk (subscriber, #1978)
[Link]
Posted Feb 12, 2019 6:30 UTC (Tue)
by ywchang (guest, #130070)
[Link] (4 responses)
However, for the last part, I am thinking, it probably also should work if there is no `-f` option for the `ns_run` command.
If there is no `-f`, that means, the `ns_run` process itself will first update pid namespace, and then run the `./orphan` itself. Then the forked child will show up in the new pid namespace, while `ns_run` will die. The forked child process should also be reaped by the init process in the new pid namespace, right?
I tried and it didn't work as I described. Anyone shed some light please? Thanks in advance.
Posted Feb 13, 2019 10:02 UTC (Wed)
by mkerrisk (subscriber, #1978)
[Link] (3 responses)
As described in the article, the setns() and unshare() system calls treat PID namespaces specially. For other types of namespaces, these system calls do change the namespace of the caller. For PID namespaces, these system calls do not change the PID namespace of the calling process. Instead, any subsequently created children of the caller will be created in the target/new namespace.
If the -f option was omitted, the ns_run command would execute the orphan program in the PID namespace where ns_run resides, rather than the namespace specified by the -n option.
Posted Feb 14, 2019 9:58 UTC (Thu)
by ywchang (guest, #130070)
[Link] (2 responses)
I get your point that PID namespace is special, that using `setns()` the calling process itself will not join the new namespace, while the children forked will be joining the new namespace.
I asked whether we can drop the `-f` because, looking at the implementation of the `orphan` code, the orphan process is actually forked from `ns_run`, which has already called `setns()`. That means `ns_run` serves as the parent, forks the orphan, and then exits, so the orphan process should in theory be in the new namespace.
I did the experiment, dropping the `-f` option, and found that the orphan did show up in the new namespace: using readlink on the orphan process from outside the PID namespace gives the same result as for the PID namespace `init` process.
I mounted the proc to /proc2, and run `ls -d /proc2/[1-9]*`
The results is like
/proc2/1 /proc2/2
And I run this command to explore /proc2/2, `cat /proc2/2/status | egrep '^(Name|PP*id)'`
it shows me this,
Name: orphan
At this moment, I'm confused. Why the orphan process is not reaped by `init` process? And why the PPid is becoming 0?
Posted Feb 15, 2019 7:59 UTC (Fri)
by mkerrisk (subscriber, #1978)
[Link] (1 responses)
When a process becomes orphaned it is reparented to the init process *in the PID namespace of its parent*.
When you run the 'orphan' program without the '-f' option, then the child is in the new PID namespace, but its parent is still in the initial PID NS. Thus the orphan gets reparented to init in the initial PID namespace.
PPid == 0 is the system's way of telling the child process that it has no visible parent (because the parent is in another PID namespace).
Posted Feb 16, 2019 7:04 UTC (Sat)
by ywchang (guest, #130070)
[Link]
The initial article is gaining links to the rest as we go along. The namespaces section in the Kernel Index might also prove useful.
Bookmarking the series
It's much better to log in to LWN and then see all my favourite links bookmarked on this site.
I would hope so. (The easy way to check is to sit in front of a machine you don't mind having crash, and try sending SIGKILL to PID 1.)
Can't get mount namespaces to behave as expected
sudo unshare --mount --uts --net --pid --fork --mount-proc /bin/bash
sudo /tmp/newns --uts --mount --pid --init --net /bin/bash
sudo /tmp/ns_child_exec -p -m /tmp/simple_init
@apollock: yes, I recently commented on this in another article in this series. Basically, some distros (e.g., Fedora) these days enable mount propagation by default, which means that when you mount /proc in the new mount namespace, you do indeed clobber /proc in the initial mount namespace.
# ./ns_child_exec -p -m ./simple_init
init$ mount --make-slave / # <== NEW
init$ mount -t proc proc /proc
init$ ps a
So, going back to your earlier comment:
# ./ns_child_exec -p -m ./simple_init
init$ mount --make-slave /proc # <== NEW
init$ mount -t proc proc /proc
init$ ps a
Nothing needs to be done to /, as far as I can tell.
Pid: 2
PPid: 0
> And why the PPid is becoming 0?