

Container mechanics

in rkt and Linux

Alban Crequy

LinuxCon Europe 2015 Dublin


Alban Crequy

✤ Working on rkt
✤ One of the maintainers of rkt
✤ Previously worked on D-Bus and AF_BUS

https://github.com/alban
Container mechanics in rkt and Linux

✤ Containers
✤ Linux namespaces
✤ Cgroups
✤ How rkt uses them
Containers vs virtual machines
virtual machines

[Diagram: each virtual machine runs its apps on its own guest Linux kernel; the guest kernels run on a hypervisor, which runs on the host Linux kernel and the hardware]
Containers
containers or pods

[Diagram: apps grouped into containers or pods, started by the Docker daemon or by rkt, all running directly on the host Linux kernel and the hardware]
rkt architecture

stage 2: apps
stage 1: systemd-nspawn or lkvm
stage 0: rkt


Containers: no guest kernel

system calls: open(), sethostname()

[Diagram: apps started by rkt call directly into the kernel API of the host Linux kernel, which runs on the hardware; there is no guest kernel]
Containers with an example

Getting and setting the hostname:

✤ The system calls for getting and setting the hostname are older than containers

int uname(struct utsname *buf);

int gethostname(char *name, size_t len);

int sethostname(const char *name, size_t len);


[Diagram: two containers with hostnames "thunderstorm" and "sunshine", on a host with hostname "rainbow"]
Linux namespaces
Processes in namespaces

[Diagram: processes 1, 2, 3, 6 and 9 split across two UTS namespaces; gethostname() returns "rainbow" in one and "thunderstorm" in the other]


Linux Namespaces
Several independent namespaces

✤ uts (Unix Timesharing System) namespace


✤ mount namespace
✤ pid namespace
✤ network namespace
✤ user namespace
Creating new namespaces
unshare(CLONE_NEWUTS);

[Diagram: process 6 calls unshare(CLONE_NEWUTS) and gets its own copy of the UTS namespace; both namespaces initially have the hostname "rainbow"]
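A minimal sketch of this sequence (illustrative, not rkt's code), assuming the caller has CAP_SYS_ADMIN:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char name[64];

    /* Get a private copy of the UTS namespace (needs CAP_SYS_ADMIN). */
    if (unshare(CLONE_NEWUTS) < 0) { perror("unshare"); return 1; }

    /* Change the hostname only inside the new namespace. */
    if (sethostname("thunderstorm", strlen("thunderstorm")) < 0) {
        perror("sethostname");
        return 1;
    }

    gethostname(name, sizeof(name));
    printf("hostname in the new UTS namespace: %s\n", name);
    /* Other processes on the host still see the original hostname. */
    return 0;
}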
PID namespace
Hiding processes and PID translation
✤ the host sees all processes
✤ the container sees only its own processes

[Diagram: a process seen as PID 1 inside the container is actually pid 30920 on the host]
Initial PID namespace

[Diagram: processes 2, 6 and 7 in the initial PID namespace]
Creating a new namespace
clone(CLONE_NEWPID, ...);
[Diagram: process 6 calls clone(CLONE_NEWPID, ...); the child is PID 7 in the initial PID namespace and PID 1 in the new PID namespace]
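A minimal sketch (illustrative, not rkt's actual code) of starting a child in a new PID namespace with clone(): the child sees itself as PID 1, while the parent sees the PID from the initial namespace.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

/* Runs as PID 1 inside the new PID namespace. */
static int child(void *arg)
{
    printf("child: getpid() = %d\n", getpid());   /* prints 1 */
    return 0;
}

int main(void)
{
    /* Needs CAP_SYS_ADMIN. The child gets a fresh PID namespace. */
    pid_t pid = clone(child, child_stack + sizeof(child_stack),
                      CLONE_NEWPID | SIGCHLD, NULL);
    if (pid < 0) { perror("clone"); exit(1); }

    printf("parent: child is pid %d in the initial namespace\n", pid);
    waitpid(pid, NULL, 0);
    return 0;
}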
rkt
rkt run …

✤ uses unshare() to create a new network namespace


✤ uses clone() to start the first process in the container with a new pid namespace

rkt enter …

✤ uses setns() to enter an existing namespace


Joining an existing namespace

setns(..., CLONE_NEWPID);

[Diagram: a process calls setns(..., CLONE_NEWPID) to join the existing PID namespace; children it creates afterwards are born inside that namespace]
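A minimal sketch of what rkt enter does conceptually: open a namespace file under /proc and join it with setns(). The target PID is taken from the command line here; this is an illustration, not rkt's code.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    /* argv[1]: PID of a process already inside the target container. */
    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/ns/pid", argv[1]);

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }

    /* Join the PID namespace; only children created afterwards are inside it. */
    if (setns(fd, CLONE_NEWPID) < 0) { perror("setns"); exit(1); }

    pid_t pid = fork();
    if (pid == 0) {
        /* This child runs inside the container's PID namespace. */
        execlp("sh", "sh", NULL);
        perror("execlp");
        exit(1);
    }
    waitpid(pid, NULL, 0);
    return 0;
}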
When does PID translation happen?
✤ the kernel always translates PIDs to the caller's PID namespace:
✣ getpid(), getppid()
✣ /proc
✣ /sys/fs/cgroup/<subsys>/.../cgroup.procs
✣ credentials passed in Unix sockets (SCM_CREDS)
✣ pid = fork()

Future:

✤ Possibly: getvpid() patch being discussed


Mount namespaces
[Diagram: the host mount namespace has / with /etc, /var, /home and the user's data; the container mount namespace has its own / with /my-app]
Storing the container data (Copy-on-write)

✤ Container filesystem
✤ Overlay fs "upper" directory: /var/lib/rkt/pods/run/<pod-uuid>/overlay/sha512-.../upper/
✤ Application Container Image: /var/lib/rkt/cas/tree/sha512-...
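A minimal sketch of the overlayfs mount this layering corresponds to; all paths below are illustrative placeholders, and the workdir is an overlayfs requirement not shown on the slide.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Read-only image tree as the lower layer, per-pod upper/work dirs
     * for copy-on-write. Paths are placeholders, not real rkt paths. */
    const char *opts =
        "lowerdir=/var/lib/rkt/cas/tree/sha512-example,"
        "upperdir=/var/lib/rkt/pods/run/pod-uuid/overlay/sha512-example/upper,"
        "workdir=/var/lib/rkt/pods/run/pod-uuid/overlay/sha512-example/work";

    if (mount("overlay", "/var/lib/rkt/pods/run/pod-uuid/rootfs",
              "overlay", 0, opts) < 0) {
        perror("mount overlay");
        return 1;
    }
    return 0;
}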
rkt directories
/var/lib/rkt
├─ cas
│  └─ tree
│     ├─ deps-sha512-19bf...
│     └─ deps-sha512-a5c2...
└─ pods
   └─ run
      └─ e0ccc8d8
         ├─ overlay/sha512-19bf.../upper
         └─ stage1/rootfs/
unshare(..., CLONE_NEWNS);

[Diagram: process 7 calls unshare(..., CLONE_NEWNS) and gets its own copy of the mount tree (/ with /etc, /var, /home); processes 1 and 3 keep the original mount namespace]
Changing root with MS_MOVE

mount($ROOTFS, "/", MS_MOVE)

[Diagram: the mount tree rooted at $ROOTFS (containing my-app) is moved over /, becoming the new root of the container's mount namespace]
$ROOTFS = /var/lib/rkt/pods/run/e0ccc8d8.../stage1/rootfs
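A minimal sketch of the MS_MOVE root switch; the surrounding chdir()/chroot() steps are assumptions about how such a switch is usually completed, not a transcription of rkt's code, and the rootfs path is a placeholder.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
    const char *rootfs = "/path/to/new/rootfs";  /* placeholder for the stage1 rootfs mount */

    /* Work from inside the new root so "." refers to it after the move. */
    if (chdir(rootfs) < 0) { perror("chdir"); return 1; }

    /* Move the rootfs mount on top of / in this mount namespace. */
    if (mount(rootfs, "/", NULL, MS_MOVE, NULL) < 0) {
        perror("mount MS_MOVE");
        return 1;
    }

    /* Make the moved mount the root directory of this process. */
    if (chroot(".") < 0) { perror("chroot"); return 1; }
    if (chdir("/") < 0) { perror("chdir /"); return 1; }
    return 0;
}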
Mount propagation events

Relationship between the two mounts:
- shared
- master / slave
- private

[Diagram: the same mount tree (/ with /etc, /var, /home) seen from two mount namespaces]
Mount propagation events

[Diagram: a /home mount seen from two namespaces under the three propagation modes: private (no events propagate), shared (events propagate both ways), master and slave (events propagate only from master to slave)]


How rkt uses mount propagation events
✤ / in the container namespace is recursively set as slave:

mount(NULL, "/", NULL, MS_SLAVE|MS_REC, NULL)

[Diagram: mount events on the host propagate into the container's mount tree (/ with /etc, /var, /home), but mounts made in the container do not propagate back to the host]
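A minimal sketch combining the two steps above: create a new mount namespace, then mark everything under / as a recursive slave (illustrative, not rkt's code).

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* New mount namespace: a private copy of the host's mount tree. */
    if (unshare(CLONE_NEWNS) < 0) { perror("unshare"); return 1; }

    /* Recursively make every mount a slave of its host counterpart:
     * host events still arrive, container mounts do not leak out. */
    if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) < 0) {
        perror("mount MS_SLAVE|MS_REC");
        return 1;
    }
    return 0;
}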
Network namespace
Network isolation
Goal:

✤ each container has its own network interfaces
✤ a container cannot see the network traffic outside it (e.g. with tcpdump)

[Diagram: container1 and container2 each have their own eth0, separate from the host's eth0]
Network tooling
✤ Linux can create pairs of virtual net interfaces
✤ they can be linked in a bridge

[Diagram: each container's eth0 is one end of a veth pair (veth1, veth2); the other ends are plugged into a bridge on the host, and the host's eth0 does IP masquerading via iptables]


How does rkt do it?
✤ rkt uses the network plugins implemented by the Container Network Interface (CNI,
https://github.com/appc/cni)

[Diagram: rkt creates the pod's network namespace, keeps a reference at /var/lib/rkt/pods/run/$POD_UUID/netns and exec()s systemd-nspawn inside it; the network plugins join the namespace with setns() and configure it via netlink]
User namespaces
History of Linux namespaces
✓ 1991: Linux

✓ 2002: namespaces in Linux 2.4.19

✓ 2008: LXC
✓ 2011: systemd-nspawn
✓ 2013: user namespaces in Linux 3.8
✓ 2013: Docker
✓ 2014: rkt

… development still active


Why user namespaces?
✤ Better isolation
✤ Run applications which would need more capabilities
✤ Per user limits
✤ Future:
✣ Unprivileged containers: possibility to have containers without root
User ID ranges
[Diagram: the host uses the full 32-bit UID range (0 to 4,294,967,295); container 1 and container 2 each see UIDs 0 to 65535, mapped to two disjoint ranges on the host]
User ID mapping
/proc/$PID/uid_map: “0 1048576 65536”

[Diagram: the 65536 container UIDs (0-65535) are mapped to host UIDs starting at 1048576; all other host UIDs are unmapped inside the container]
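A minimal sketch of writing such a mapping for a child created with CLONE_NEWUSER; it assumes it runs as root on the host, the PID comes from the command line, and the gid_map write is an assumption added for completeness.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write one line ("inside-id host-id count") to /proc/<pid>/<file>. */
static void write_map(const char *pid, const char *file, const char *mapping)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/%s", pid, file);

    int fd = open(path, O_WRONLY);
    if (fd < 0 || write(fd, mapping, strlen(mapping)) < 0) {
        perror(path);
        exit(1);
    }
    close(fd);
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid of process in new user namespace>\n", argv[0]);
        return 1;
    }
    /* Map container IDs 0-65535 onto host IDs starting at 1048576,
     * matching the "0 1048576 65536" example above. */
    write_map(argv[1], "uid_map", "0 1048576 65536");
    write_map(argv[1], "gid_map", "0 1048576 65536");
    return 0;
}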
Problems with container images
[Diagram: one downloaded Application Container Image (ACI), e.g. a web server, is shared by container 1 and container 2; each container's filesystem is the ACI plus its own overlayfs "upper" directory]


Problems with container images
✤ file UIDs / GIDs in the image do not match the container's UID mapping
✤ rkt currently only supports user namespaces without overlayfs
✣ Performance loss: no COW from overlayfs
✣ “chown -R” for every file in each container
Problems with volumes
✤ a volume can be bind-mounted (rw / ro) into several containers
✤ no UID translation on the bind mount
✤ dynamic UID maps: each container has a different mapping

[Diagram: /data on the host is bind-mounted at /data inside the container filesystem, next to /my-app]
User namespace and filesystem problem

✤ Possible solution: add options to mount() to apply a UID mapping


✤ rkt would use it when mounting:
✣ the overlay rootfs
✣ volumes
✤ Idea suggested on kernel mailing lists
Namespace lifecycle
Namespace references

[Diagram: processes 1, 2, 3, 6 and 9 holding references to their namespaces]
Namespace file descriptor

These files (/proc/$PID/ns/*) can be opened, bind mounted, fd-passed (SCM_RIGHTS); holding such a reference keeps the namespace alive
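A minimal sketch of pinning a network namespace by bind-mounting its /proc file, similar in spirit to the /var/lib/rkt/pods/run/$POD_UUID/netns reference mentioned earlier; the target path is a placeholder and its parent directory is assumed to exist.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
    const char *target = "/run/mypod/netns";  /* placeholder reference file */

    /* New network namespace for this process (needs CAP_SYS_ADMIN). */
    if (unshare(CLONE_NEWNET) < 0) { perror("unshare"); return 1; }

    /* Create the target file, then bind-mount the namespace onto it.
     * The bind mount keeps the namespace alive after this process exits. */
    int fd = open(target, O_CREAT | O_WRONLY, 0600);
    if (fd < 0) { perror("open"); return 1; }
    close(fd);

    if (mount("/proc/self/ns/net", target, NULL, MS_BIND, NULL) < 0) {
        perror("mount MS_BIND");
        return 1;
    }
    return 0;
}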


Isolators
Isolators in rkt

✤ specified in an image manifest
✤ limiting capabilities or resources
Isolators in rkt

Currently implemented:
✤ capabilities
✤ cpu
✤ memory

Possible additions:
✤ block-bandwidth
✤ block-iops
✤ network-bandwidth
✤ disk-space
Capabilities (1/3)
✤ Old model (before Linux 2.2):
✣ User root (user id = 0) can do everything
✣ Regular users are limited

✤ Now: processes have capabilities

Configuring the network: CAP_NET_ADMIN
Mounting a filesystem: CAP_SYS_ADMIN
Creating a block device: CAP_MKNOD
etc. (37 different capabilities today)
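A minimal sketch of dropping one capability from the bounding set (needs CAP_SETPCAP); the choice of CAP_MKNOD is only illustrative.

#define _GNU_SOURCE
#include <linux/capability.h>
#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
    /* Drop CAP_MKNOD from the bounding set: this process and all of its
     * descendants can no longer create block or character devices. */
    if (prctl(PR_CAPBSET_DROP, CAP_MKNOD, 0, 0, 0) < 0) {
        perror("prctl PR_CAPBSET_DROP");
        return 1;
    }

    /* Check: PR_CAPBSET_READ returns 0 once the capability is gone. */
    printf("CAP_MKNOD in bounding set: %d\n",
           (int)prctl(PR_CAPBSET_READ, CAP_MKNOD, 0, 0, 0));
    return 0;
}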


Capabilities (2/3)
✤ Each process has several capability sets:

✣ Permitted
✣ Inheritable
✣ Effective
✣ Ambient (soon!)
✣ Bounding set
Capabilities (3/3)
Other security mechanisms:

✤ Mandatory Access Control (MAC) with Linux Security Modules (LSMs):


✣ SELinux
✣ AppArmor…
✤ seccomp
Isolator: memory and cpu
✤ based on cgroups
cgroups
What’s a control group (cgroup)
✤ group processes together
✤ organised in trees
✤ applying limits to them as a group
cgroups
cgroup API

/sys/fs/cgroup/*/

/proc/cgroups

/proc/$PID/cgroup
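A minimal sketch of using this filesystem API directly with the cgroup-v1 memory controller; the group name "demo" and the 500M limit are illustrative placeholders, not rkt behaviour.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Helper: write a string to a cgroup control file. */
static int write_file(const char *path, const char *value)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror(path); return -1; }
    int ret = write(fd, value, strlen(value)) < 0 ? -1 : 0;
    close(fd);
    return ret;
}

int main(void)
{
    /* Create a new cgroup in the memory hierarchy (name is a placeholder). */
    mkdir("/sys/fs/cgroup/memory/demo", 0755);

    /* Limit the group to 500 MB, like MemoryLimit=500M does. */
    write_file("/sys/fs/cgroup/memory/demo/memory.limit_in_bytes", "500M");

    /* Move the current process into the group. */
    char pid[32];
    snprintf(pid, sizeof(pid), "%d", getpid());
    write_file("/sys/fs/cgroup/memory/demo/cgroup.procs", pid);
    return 0;
}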
List of cgroup controllers
/sys/fs/cgroup/
├─ cpu
├─ devices
├─ freezer
├─ memory
├─ ...
└─ systemd
How systemd units use cgroups
/sys/fs/cgroup/
├─ systemd
│ ├─ user.slice
│ ├─ system.slice
│ │ ├─ NetworkManager.service
│ │ │ └─ cgroup.procs
│ │ ...
│ └─ machine.slice
How systemd units use cgroups w/ containers
/sys/fs/cgroup/
├─ systemd
│ ├─ user.slice
│ ├─ system.slice
│ └─ machine.slice
│   └─ machine-rkt….scope
│     └─ system.slice
│       └─ app.service
├─ cpu
│ ├─ user.slice
│ ├─ system.slice
│ └─ machine.slice
│   └─ machine-rkt….scope
│     └─ system.slice
│       └─ app.service
├─ memory
│ ├─ user.slice
│ ├─ system.slice
│ └─ machine.slice
│   └─ ... (same layout)
...
cgroups mounted in the container
✤ the same cgroup hierarchies (systemd, cpu, memory, ...) are mounted inside the container
✤ they are mounted read-only (RO), except the pod's own sub-tree under machine-rkt….scope, which is read-write (RW)
Memory isolator

✤ Application Image Manifest: "limit": "500M"
✤ systemd service file: [Service] ExecStart= MemoryLimit=500M
✤ systemd action: write to memory.limit_in_bytes
CPU isolator

✤ Application Image Manifest: "limit": "500m"
✤ systemd service file: [Service] ExecStart= CPUShares=512
✤ systemd action: write to cpu.shares
Unified cgroup hierarchy
✤ Multiple hierarchies:
✣ one cgroup mount point for each controller (memory, cpu, etc.)
✣ flexible but complex
✣ cannot remount with a different set of controllers
✣ difficult to give to containers in a safe way
✤ Unified hierarchy:
✣ cgroup filesystem mounted only one time
✣ still in development in Linux: mount with option “__DEVEL__sane_behavior”
✣ initial implementation in systemd-v226 (September 2015)
✣ no support in rkt yet
Isolator: network
✤ limit the network bandwidth
✤ cgroup controller “net_cls” to tag packets emitted by a process
✤ iptables / traffic control to apply on tagged packets
✤ open question: allocation of tags?
✤ not implemented in rkt yet
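A minimal sketch of the net_cls tagging step described above; the cgroup name and class id are placeholders, matching the tag with iptables or tc is left out, and this is not rkt code.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Create a net_cls cgroup (placeholder name) and give it a class id;
     * tc or iptables rules can then match packets carrying this tag. */
    mkdir("/sys/fs/cgroup/net_cls/demo", 0755);

    int fd = open("/sys/fs/cgroup/net_cls/demo/net_cls.classid", O_WRONLY);
    if (fd < 0) { perror("open classid"); return 1; }
    if (write(fd, "0x00100001", strlen("0x00100001")) < 0) {
        perror("write classid");
        return 1;
    }
    close(fd);

    /* Move the current process into the group so its traffic is tagged. */
    char pid[32];
    snprintf(pid, sizeof(pid), "%d", getpid());
    fd = open("/sys/fs/cgroup/net_cls/demo/cgroup.procs", O_WRONLY);
    if (fd < 0) { perror("open cgroup.procs"); return 1; }
    write(fd, pid, strlen(pid));
    close(fd);
    return 0;
}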
Isolator: disk quotas
Disk quotas
Not implemented in rkt

✤ loop device
✤ btrfs subvolumes
✣ systemd-nspawn can use them
✤ per user and group quotas
✣ not suitable for containers
✤ per project quotas: in xfs and soon in ext4
✣ open question: allocation of project id?
Conclusion
We talked about:

✤ the isolation provided by rkt


✤ namespaces
✤ cgroups
✤ how rkt uses the namespace & cgroup API
Thanks

CC-BY-SA

Thanks Chris for the theme!
