Namespace file descriptors
Giving different groups of processes their own view of global kernel resources—network environments and filesystem trees for example—is one of the goals of the kernel container developers. These views, or namespaces, are created as part of a clone() with one of the CLONE_NEW* flags and are only visible to the new process and its children. Eric Biederman has proposed a mechanism that would allow other processes, outside of the namespace-creator's descendants, to see and access those namespaces.
When we looked at an earlier version back in March, Biederman had proposed two new system calls, nsfd() and setns(). Since that time, he has eliminated the nsfd() call by adding a new /proc/<pid>/ns directory with files that can be opened to provide a file descriptor for the different kinds of namespaces. That removes the need for a dedicated system call to find and return an fd to a namespace.
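As a rough illustration of how a process might obtain such a descriptor, the following C sketch opens one of the files under /proc/<pid>/ns. The helper name open_ns_fd() is hypothetical and not part of the patches; the sketch only assumes the directory layout described above, with one file per namespace type (for example "net").

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the patchset): open /proc/<pid>/ns/<name>
 * read-only. Returns a file descriptor that refers to the namespace, or
 * -1 with errno set on failure. */
int open_ns_fd(pid_t pid, const char *name)
{
    char path[64];
    int n = snprintf(path, sizeof(path), "/proc/%ld/ns/%s", (long)pid, name);

    /* Reject names that would not fit in the path buffer. */
    if (n < 0 || (size_t)n >= sizeof(path)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return open(path, O_RDONLY | O_CLOEXEC);
}
```

A caller would pass the resulting descriptor to setns() and close it when it is no longer needed.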
Currently, there must be a process running in a namespace to keep it around, but there are use cases where it is rather cumbersome to have a dedicated process for keeping the namespace alive. With the new patches, doing a bind mount of the proc file for a namespace:
    mount --bind /proc/self/ns/net /some/path
for example, will keep the namespace alive until it is unmounted.
The setns() call is unchanged from the earlier proposal:
    int setns(unsigned int nstype, int nsfd);
It will set the namespace of the process to that indicated by the file descriptor nsfd, which should be a reference to an open namespace /proc file. nstype is either zero or the name of the namespace type the caller is trying to switch to ("net", "ipc", "uts", and "mnt" are implemented); if it is nonzero, the call will fail when the namespace referred to by nsfd is not of that type. The call will also fail unless the caller has the CAP_SYS_ADMIN capability (root privileges, essentially).
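A minimal sketch of joining a namespace by path follows. Note one assumption: it uses the form of setns() that was eventually merged in Linux 3.0 and is exposed by glibc, int setns(int fd, int nstype), where nstype is zero or a CLONE_NEW* flag, rather than the prototype from the proposal shown above. The helper name enter_namespace() is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Hypothetical helper: switch the calling thread into the namespace named
 * by ns_path (for example "/proc/1234/ns/net" or a bind-mounted copy).
 * nstype is 0 to accept any namespace type, or a CLONE_NEW* flag such as
 * CLONE_NEWNET to make the call fail on a type mismatch.
 * Returns 0 on success, -1 with errno set on failure. */
int enter_namespace(const char *ns_path, int nstype)
{
    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    int rc = setns(fd, nstype);
    int saved_errno = errno;

    /* The descriptor is not needed once the switch has been attempted. */
    close(fd);
    errno = saved_errno;
    return rc;
}
```

A privileged caller might use enter_namespace("/some/path", CLONE_NEWNET) to join a network namespace kept alive by a bind mount.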
For this round, Biederman has also added something of a convenience function, in the form of the socketat() system call:
    int socketat(int nsfd, int family, int type, int protocol);
The call parallels socket(), but takes an nsfd parameter for the namespace to create the socket in. As pointed out in the discussion of that patch, socketat() could be implemented using setns():
    setns(0, nsfd);
    sock = socket(...);
    setns(0, original_nsfd);
Biederman agrees that it could be done in user space, but is concerned about race conditions in an implementation of that kind. In addition, unlike for the other namespace types, he has some specific use cases in mind for network namespaces.
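The user-space emulation of socketat() outlined above can be fleshed out a little. This is a sketch under stated assumptions, not Biederman's code: socket_in_netns() is a hypothetical name, the code uses the setns(fd, nstype) form that was eventually merged in Linux 3.0 rather than the proposed prototype, and it assumes a single-threaded caller whose signal handlers do not create sockets, so it does not by itself settle the race concerns.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical emulation of socketat(): create a socket inside the network
 * namespace referred to by nsfd, then return to the original namespace.
 * Returns the new socket, or -1 with errno set on failure. */
int socket_in_netns(int nsfd, int family, int type, int protocol)
{
    /* Remember the current network namespace so it can be restored.
     * Assumption: single-threaded caller, so /proc/self is this task. */
    int orig = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
    if (orig < 0)
        return -1;

    if (setns(nsfd, CLONE_NEWNET) < 0) {
        int saved_errno = errno;
        close(orig);
        errno = saved_errno;
        return -1;
    }

    /* The socket stays bound to the namespace it was created in. */
    int sock = socket(family, type, protocol);
    int saved_errno = errno;

    /* Switch back; if that fails, do not hand out a socket while the
     * caller is stranded in the wrong namespace without knowing it. */
    if (setns(orig, CLONE_NEWNET) < 0) {
        saved_errno = errno;
        if (sock >= 0)
            close(sock);
        sock = -1;
    }

    close(orig);
    errno = saved_errno;
    return sock;
}
```

Because setns() changes the namespace of the calling thread only for the duration of the call, anything else that runs in that thread between the two setns() calls would also see the target namespace, which is the kind of window a dedicated system call would avoid.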
But he also realized that it might be a somewhat controversial addition. Overall, there has been relatively little discussion of the patchset on linux-kernel, and Biederman said that it had received positive reviews on the containers mailing list. He posted the patches so that other kernel developers could review the ABI additions, and there seem to be no complaints with setns() and the /proc filesystem additions.
Changes for the "pid" namespace were not included in these patches as there is some work needed before that namespace can be safely unshared. That work doesn't affect the ABI, though. Once the pid namespace is added in, it seems likely we will see these patches return, perhaps without socketat(), sometime soon. Allowing suitably privileged processes to access others' namespaces will be a useful addition, and one that may not be too far off.
Posted Jul 22, 2011 4:49 UTC (Fri)
by ebiederm (subscriber, #35028)
- socketat was dropped from the patchset. It can be implemented race free in userspace and there are not yet enough userspace applications to care.
- setns had its arguments slightly changed and swapped. setns is now
    int setns(int fd, int nstype);
Where nstype is a clone flag, instead of the original overly clever
ascii encoded in an integer without using a define.
Eric
Posted Jul 22, 2011 12:26 UTC (Fri)
by razb (guest, #43424)
Posted Jul 27, 2011 16:31 UTC (Wed)
by renzo (guest, #77450)
I proposed this approach two years ago at FOSDEM. msocket is similar
to socketat but it has a pathname instead of a file descriptor as its
first argument.
If the network stacks were special files, sysadm may provide more than one
stack to the users, each application can decide which stack to use.
A "default" stack can be defined for backwards compatibility: the "socket" syscall uses the default stack.
for details see: http://wiki.virtualsquare.org/wiki/index.php/Multi_stack_...
and: http://archive.fosdem.org/2009/schedule/events/ipn_msockets
renzo
Posted Oct 2, 2011 16:20 UTC (Sun)
by uriel (guest, #20754)
See:
http://doc.cat-v.org/plan_9/4th_edition/papers/net/
http://man.cat-v.org/plan_9/3/ip
Posted Jul 27, 2011 19:15 UTC (Wed)
by chloe_zen (guest, #8258)