Namespace file descriptors

By Jake Edge
September 29, 2010

Giving different groups of processes their own view of global kernel resources—network environments and filesystem trees for example—is one of the goals of the kernel container developers. These views, or namespaces, are created as part of a clone() with one of the CLONE_NEW* flags and are only visible to the new process and its children. Eric Biederman has proposed a mechanism that would allow other processes, outside of the namespace-creator's descendants, to see and access those namespaces.

When we looked at an earlier version back in March, Biederman had proposed two new system calls, nsfd() and setns(). Since that time, he has eliminated the nsfd() call by adding a new /proc/<pid>/ns directory with files that can be opened to provide a file descriptor for the different kinds of namespaces. That removes the need for a dedicated system call to find and return an fd to a namespace.

Currently, there must be a process running in a namespace to keep it around, but there are use cases where it is rather cumbersome to have a dedicated process for keeping the namespace alive. With the new patches, doing a bind mount of the proc file for a namespace:

    mount --bind /proc/self/ns/net /some/path

for example, will keep the namespace alive until it is unmounted.
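The same bind mount can be made from a program with mount(2). The sketch below is one way it might look; pin_namespace() and unpin_namespace() are hypothetical names, and the calls need the privileges that mounting normally requires. A bind mount of a regular file needs an existing file as its mount point, so the helper creates one first.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mount.h>
#include <unistd.h>

/* Hypothetical helper: keep the namespace behind ns_file alive by
 * bind-mounting it on target, like "mount --bind ns_file target".
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int pin_namespace(const char *ns_file, const char *target)
{
    /* Create an empty file to serve as the mount point if none exists. */
    int fd = open(target, O_RDONLY | O_CREAT, 0444);
    if (fd < 0)
        return -1;
    close(fd);

    return mount(ns_file, target, NULL, MS_BIND, NULL);
}

/* Hypothetical helper: drop the reference taken by pin_namespace(). */
int unpin_namespace(const char *target)
{
    return umount(target);
}
```

For instance, pin_namespace("/proc/self/ns/net", "/some/path") mirrors the command shown above, and unpin_namespace("/some/path") releases the namespace again.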

The setns() call is unchanged from the earlier proposal:

    int setns(unsigned int nstype, int nsfd);

It will set the namespace of the process to that indicated by the file descriptor nsfd, which should be a reference to an open namespace /proc file. nstype is either zero or the name of the namespace type the caller is trying to switch to ("net", "ipc", "uts", and "mnt" are implemented); if it is non-zero, the call will fail when the namespace referred to by nsfd is not of that type. The call will also fail unless the caller has the CAP_SYS_ADMIN capability (root privileges, essentially).

For this round, Biederman has also added something of a convenience function, in the form of the socketat() system call:

    int socketat(int nsfd, int family, int type, int protocol);

The call parallels socket(), but takes an nsfd parameter for the namespace to create the socket in. As pointed out in the discussion of that patch, socketat() could be implemented using setns():

    setns(0, nsfd);
    sock = socket(...);
    setns(0, original_nsfd);

Biederman agrees that it could be done in user space, but is concerned about race conditions in an implementation of that kind. In addition, unlike for the other namespace types, he has some specific use cases in mind for network namespaces:

The use case are applications are the handful of networking applications that find that it makes sense to listen to sockets from multiple network namespaces at once. Say a home machine that has a vpn into your office network and the vpn into the office network runs in a different network namespace so you don't have to worry about address conflicts between the two networks, the chance of accidentally bridging between them, and so you can use different dns resolvers for the different networks.

But he also realized that it might be a somewhat controversial addition. Overall, there has been relatively little discussion of the patchset on linux-kernel, and Biederman said that it had received positive reviews on the containers mailing list. He posted the patches so that other kernel developers could review the ABI additions, and there seem to be no complaints with setns() and the /proc filesystem additions.

Changes for the "pid" namespace were not included in these patches as there is some work needed before that namespace can be safely unshared. That work doesn't affect the ABI, though. Once the pid namespace is added in, it seems likely we will see these patches return, perhaps without socketat(), sometime soon. Allowing suitably privileged processes to access others' namespaces will be a useful addition, and one that may not be too far off.


Namespace file descriptors

Posted Jul 22, 2011 4:49 UTC (Fri) by ebiederm (subscriber, #35028) [Link] (4 responses)

There were some small changes between when this article was written and when setns was merged.

- socketat was dropped from the patchset. It can be implemented race free in userspace and there are not yet enough userspace applications to care.

- setns had its arguments slightly changed and swapped. setns is now

    int setns(int fd, int nstype);

  where nstype is a clone flag, instead of the original, overly clever ASCII name encoded in an integer without using a define.

Eric

Namespace file descriptors

Posted Jul 22, 2011 12:26 UTC (Fri) by razb (guest, #43424) [Link]

Can you share an example with code?

Namespace file descriptors

Posted Jul 27, 2011 16:31 UTC (Wed) by renzo (guest, #77450) [Link] (1 responses)

I think it would be simpler for programmers to have network stacks on the file system as special files.

I proposed this approach two years ago at FOSDEM. msocket is similar
to socketat but it has a pathname instead of a file descriptor as its
first argument.
If the network stacks were special files, a sysadmin could provide more than one
stack to the users, and each application could decide which stack to use.
A "default" stack can be defined for backwards compatibility: the "socket" syscall uses the default stack.

for details see: http://wiki.virtualsquare.org/wiki/index.php/Multi_stack_...

and: http://archive.fosdem.org/2009/schedule/events/ipn_msockets

renzo

Namespace file descriptors

Posted Oct 2, 2011 16:20 UTC (Sun) by uriel (guest, #20754) [Link]

Plan 9 always used a file system interface to its network stack, and you could mount multiple IP stacks, and remote network stacks, and 'virtual network stacks' (e.g., providing tunneling to a remote non-Plan 9 system using sshnet).

See:

http://doc.cat-v.org/plan_9/4th_edition/papers/net/
http://man.cat-v.org/plan_9/3/ip

Namespace file descriptors

Posted Jul 27, 2011 19:15 UTC (Wed) by chloe_zen (guest, #8258) [Link]

How can it be race-free?