Container Mechanics in rkt and Linux
Alban Crequy
✤ Working on rkt
✤ One of the maintainers of rkt
✤ Previously worked on D-Bus and AF_BUS
https://github.com/alban
Container mechanics in rkt and Linux
✤ Containers
✤ Linux namespaces
✤ Cgroups
✤ How rkt uses them
Containers vs virtual machines
[Diagram: virtual machines run their apps on top of a hypervisor on the hardware; containers or pods run apps directly on the host kernel on the hardware]
rkt architecture
[Diagram: apps inside rkt pods make system calls such as open() and sethostname() to the kernel running on the hardware]
Containers with an example
✤ The system calls for getting and setting the hostname are older than containers
[Diagram: a host with hostname “rainbow”]
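A minimal sketch of those two calls, predating containers by decades (setting the hostname needs CAP_SYS_ADMIN in the caller's UTS namespace):

    /* Sketch: the pre-container hostname system calls. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        char name[64];

        /* Read the hostname of the current UTS namespace. */
        if (gethostname(name, sizeof(name)) == 0)
            printf("hostname: %s\n", name);

        /* Change it; requires CAP_SYS_ADMIN in this UTS namespace. */
        if (sethostname("rainbow", 7) != 0)
            perror("sethostname");

        return 0;
    }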
Linux namespaces
Processes in namespaces
[Diagram: a tree of processes (PIDs 1, 2, 3, 6, 9) all sharing the same UTS namespace, whose hostname is “rainbow”]
Creating new namespaces
[Diagram: process 6 moves into a new UTS namespace; both namespaces start with the hostname “rainbow”, but they are now independent copies]
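A sketch of that step, under the assumption that unshare() is used (clone() with CLONE_NEWUTS would work the same way); the hostname change is then invisible to the host:

    /* Sketch: give the current process its own copy of the UTS namespace.
     * Needs CAP_SYS_ADMIN (run as root). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* The new UTS namespace starts as a copy of the host's ("rainbow"). */
        if (unshare(CLONE_NEWUTS) != 0) {
            perror("unshare");
            return 1;
        }
        /* Only this namespace sees the new name; the host keeps its own. */
        if (sethostname("container", 9) != 0)
            perror("sethostname");
        return 0;
    }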
PID namespace
Hiding processes and PID translation
✤ the host sees all processes
✤ the container only its own processes
[Diagram: processes 2, 6 and 7 as seen from the host]
Creating a new namespace
clone(CLONE_NEWPID, ...);
[Diagram: process 6 calls clone(CLONE_NEWPID, …); the new child is PID 7 on the host but PID 1 inside the new PID namespace]
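A sketch of that clone() call; the flags and the stack handling are simplified assumptions, but it shows the two views of the same child:

    /* Sketch: create a child in a new PID namespace with clone().
     * Needs CAP_SYS_ADMIN (run as root). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char child_stack[1024 * 1024];

    static int child(void *arg) {
        /* Inside the new namespace, this process is PID 1. */
        printf("in the container: getpid() = %d\n", (int)getpid());
        return 0;
    }

    int main(void) {
        pid_t pid = clone(child, child_stack + sizeof(child_stack),
                          CLONE_NEWPID | SIGCHLD, NULL);
        if (pid == -1) { perror("clone"); exit(1); }

        /* The parent still sees the child's PID in the host namespace. */
        printf("on the host: child pid = %d\n", (int)pid);
        waitpid(pid, NULL, 0);
        return 0;
    }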
Joining an existing namespace
✤ rkt run … starts the pod; rkt enter … joins it later
setns(..., CLONE_NEWPID);
[Diagram: rkt enter calls setns() on the pod's existing PID namespace, so the new process runs alongside the pod's processes (the pod's init is PID 1 inside, PID 7 on the host)]
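A sketch of what a command like rkt enter has to do with setns(); the PID 7 is illustrative (the host PID of the pod's init), and note that only children forked afterwards actually land in the joined namespace:

    /* Sketch: join an existing PID namespace through its /proc entry. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/proc/7/ns/pid", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        if (setns(fd, CLONE_NEWPID) != 0) { perror("setns"); return 1; }
        close(fd);

        /* setns() does not move the caller itself into the PID namespace;
         * only children forked from now on appear inside the pod. */
        if (fork() == 0)
            execlp("sh", "sh", (char *)NULL);
        wait(NULL);
        return 0;
    }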
When does PID translation happen?
✤ the kernel always shows PIDs translated into the reader's PID namespace:
✤ getpid(), getppid()
✤ /proc
✤ /sys/fs/cgroup/<subsys>/.../cgroup.procs
✤ credentials passed in Unix sockets (SCM_CREDS)
✤ pid = fork()
Future:
[Diagram: a container running /my-app]
Storing the container data (Copy-on-write)
Container filesystem
[Diagram: at first the host and the container share one view of the mount tree; after creating a new mount namespace, each has its own independent mount tree]
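A minimal sketch of that duplication step, assuming unshare() is used before the container's filesystem is set up:

    /* Sketch: give this process a private copy of the host's mount tree.
     * Needs CAP_SYS_ADMIN (run as root). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        /* The new mount namespace starts as an identical copy of the
         * caller's; mounts and umounts done here can then diverge. */
        if (unshare(CLONE_NEWNS) != 0) { perror("unshare"); return 1; }
        return 0;
    }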
Changing root with MS_MOVE
[Diagram: the prepared rootfs, containing my-app, is moved over / so that it becomes the container's root filesystem]
$ROOTFS = /var/lib/rkt/pods/run/e0ccc8d8.../stage1/rootfs
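A sketch of the MS_MOVE dance, assuming a new mount namespace and that $ROOTFS (taken from the environment here) is already a mount point:

    /* Sketch: move the prepared rootfs over / and make it the new root. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mount.h>
    #include <unistd.h>

    int main(void) {
        const char *rootfs = getenv("ROOTFS");   /* e.g. …/stage1/rootfs */
        if (!rootfs) return 1;

        if (chdir(rootfs) != 0)                      { perror("chdir");  return 1; }
        if (mount(rootfs, "/", NULL, MS_MOVE, NULL)) { perror("mount");  return 1; }
        if (chroot("."))                             { perror("chroot"); return 1; }
        if (chdir("/"))                              { perror("chdir");  return 1; }

        /* "/" is now the container's rootfs, where my-app will run. */
        return 0;
    }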
Mount propagation events
Relationship between the two mounts:
- shared
- master / slave
- private
[Diagram: the host's mount tree and the container's mount tree, linked by a propagation relationship]
[Diagram: a new mount on /home appears in the host's tree; whether it also shows up in the container's tree depends on the propagation mode]
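A sketch of how those relationships are set: propagation is a per-mount property changed with mount(2); choosing MS_SLAVE here (host mounts like /home still appear in the container, but not the other way around) is only one possible policy:

    /* Sketch: change the propagation mode of the whole mount tree. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void) {
        /* shared:  events propagate in both directions
         * slave:   events from the master propagate here, not back
         * private: no propagation at all */
        if (mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL) != 0) {
            perror("mount MS_SLAVE");
            return 1;
        }
        return 0;
    }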
Network namespace
Network isolation
Goal: give the container its own network stack, isolated from the host's interfaces (e.g. eth0)
Network tooling
✤ Linux can create pairs of virtual net interfaces (veth)
✤ Can be linked in a bridge
[Diagram: container1 and container2 each see an eth0; the other ends of the veth pairs (veth1, veth2) sit on the host and are attached to a bridge]
[Diagram: rkt invokes network plugins to set up the pod's network namespace; the namespace is kept at the path below so that other tools (systemd-nspawn, the HTTP API) can refer to it]
/var/lib/rkt/pods/run/$POD_UUID/netns
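A sketch of what a tool can do with that file: because the network namespace is kept alive as a bind mount, any process can open it later and join it with setns() (the pod UUID stays a placeholder, as above):

    /* Sketch: enter a pod's network namespace kept under /var/lib/rkt. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[]) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s /var/lib/rkt/pods/run/$POD_UUID/netns\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Unlike PID namespaces, the caller switches immediately. */
        if (setns(fd, CLONE_NEWNET) != 0) { perror("setns"); return 1; }
        close(fd);

        /* For example, list the pod's interfaces. */
        execlp("ip", "ip", "addr", (char *)NULL);
        perror("execlp");
        return 1;
    }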
User namespaces
History of Linux namespaces
✓ 1991: Linux
✓ 2008: LXC
✓ 2011: systemd-nspawn
✓ 2013: user namespaces in Linux 3.8
✓ 2013: Docker
✓ 2014: rkt
[Diagram: the host and each container (container 1, container 2) use UIDs 0–65535, but each container's range is mapped to a different range of host UIDs]
User ID mapping
/proc/$PID/uid_map: “0 1048576 65536”
[Diagram: container UIDs 0–65535 are mapped to host UIDs 1048576–1114111; UIDs outside the mapped range are unmapped on both sides]
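A sketch of how that mapping is installed: a privileged helper on the host writes the line into /proc/$PID/uid_map once the target process has created its user namespace (the matching gid_map and setgroups writes are left out for brevity):

    /* Sketch: install the mapping "0 1048576 65536" for a process that
     * has just created a new user namespace. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[]) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <pid of the namespaced process>\n", argv[0]);
            return 1;
        }

        char path[64];
        snprintf(path, sizeof(path), "/proc/%s/uid_map", argv[1]);

        int fd = open(path, O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Format: "<first uid inside> <first uid on the host> <count>" */
        const char map[] = "0 1048576 65536";
        if (write(fd, map, sizeof(map) - 1) < 0)
            perror("write");

        close(fd);
        return 0;
    }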
Problems with container images
[Diagram: one Application Container Image (ACI), e.g. a web server, provides the container filesystem (/data, /var, /home) for both container 1 and container 2]
✤ mounted in several containers
✤ No UID translation
✤ Dynamic UID maps
User namespace and filesystem problem
Namespace file descriptor
Isolators in rkt
✤ specified in an image manifest
✤ limiting capabilities or resources
✤ capabilities
✤ cpu
✤ memory
✤ disk-space
✤ block-bandwidth
✤ block-iops
✤ network-bandwidth
Capabilities (1/3)
✤ Old model (before Linux 2.2):
✣ User root (user id = 0) can do everything
✣ Regular users are limited
Capabilities (2/3)
[Diagram: the capability sets of a process: Permitted, Inheritable, Effective, the Bounding set, and Ambient (soon!)]
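A sketch of one of those knobs: removing a capability from the bounding set with prctl(), so it can never be re-acquired, even across execve(). The choice of CAP_NET_RAW is only an example:

    /* Sketch: drop CAP_NET_RAW from the bounding set of this process.
     * Needs CAP_SETPCAP (run as root). */
    #define _GNU_SOURCE
    #include <linux/capability.h>
    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void) {
        /* Once dropped from the bounding set, the capability cannot be
         * regained by this process or by anything it executes. */
        if (prctl(PR_CAPBSET_DROP, CAP_NET_RAW, 0, 0, 0) != 0) {
            perror("prctl PR_CAPBSET_DROP");
            return 1;
        }
        /* PR_CAPBSET_READ tells whether a capability is still in the set. */
        printf("CAP_NET_RAW in bounding set: %d\n",
               (int)prctl(PR_CAPBSET_READ, CAP_NET_RAW, 0, 0, 0));
        return 0;
    }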
Capabilities (3/3)
Other security mechanisms:
Cgroups
/sys/fs/cgroup/*/
/proc/cgroups
/proc/$PID/cgroup
List of cgroup controllers
/sys/fs/cgroup/
├─ cpu
├─ devices
├─ freezer
├─ memory
├─ ...
└─ systemd
How systemd units use cgroups
/sys/fs/cgroup/
├─ systemd
│ ├─ user.slice
│ ├─ system.slice
│ │ ├─ NetworkManager.service
│ │ │ └─ cgroup.procs
│ │ ...
│ └─ machine.slice
How systemd units use cgroups w/ containers
/sys/fs/cgroup/
├─ systemd
│ ├─ user.slice
│ ├─ system.slice
│ └─ machine.slice
│   └─ machine-rkt….scope
│     └─ system.slice
│       └─ app.service
├─ cpu
│ ├─ user.slice
│ ├─ system.slice
│ └─ machine.slice
│   └─ machine-rkt….scope
│     └─ system.slice
│       └─ app.service
├─ memory
│ ├─ user.slice
│ ├─ system.slice
│ └─ machine.slice
...
cgroups mounted in the container
/sys/fs/cgroup/
├─ systemd (RO)
│ ├─ user.slice
│ ├─ system.slice
│ └─ machine.slice
│   └─ machine-rkt….scope
│     └─ system.slice (RW)
│       └─ app.service
├─ cpu (RO)
│ ├─ user.slice
│ ├─ system.slice
│ └─ machine.slice
│   └─ machine-rkt….scope
│     └─ system.slice (RW)
│       └─ app.service
├─ memory (RO)
│ ├─ user.slice
│ ├─ system.slice
│ └─ machine.slice
...
Memory isolator
Application Image Manifest:  “limit”: “500M”
→ systemd service file:  [Service]  ExecStart=  MemoryLimit=500M
→ systemd action:  write to memory.limit_in_bytes
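A minimal sketch of what that last step amounts to: writing the byte value into the memory controller's limit file (the cgroup path below is a made-up example):

    /* Sketch: apply a 500M memory limit by writing the cgroup file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Hypothetical cgroup path for the app's service. */
        const char *path =
            "/sys/fs/cgroup/memory/machine.slice/app.service/memory.limit_in_bytes";

        int fd = open(path, O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }

        const char limit[] = "524288000";   /* 500M in bytes */
        if (write(fd, limit, sizeof(limit) - 1) < 0)
            perror("write");

        close(fd);
        return 0;
    }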
CPU isolator
Application Image Manifest:  “limit”: “500m”
→ systemd service file:  [Service]  ExecStart=  CPUShares=512
→ systemd action:  write to cpu.shares
Unified cgroup hierarchy
✤ Multiple hierarchies:
✣ one cgroup mount point for each controller (memory, cpu, etc.)
✣ flexible but complex
✣ cannot remount with a different set of controllers
✣ difficult to give to containers in a safe way
✤ Unified hierarchy:
✣ cgroup filesystem mounted only once
✣ still in development in Linux: mount with option “__DEVEL__sane_behavior”
✣ initial implementation in systemd-v226 (September 2015)
✣ no support in rkt yet
Isolator: network
✤ limit the network bandwidth
✤ cgroup controller “net_cls” to tag packets emitted by a process
✤ iptables / traffic control to apply limits on tagged packets
✤ open question: allocation of tags?
✤ not implemented in rkt yet
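A sketch of the tagging half of that idea (not how rkt does it, since it is not implemented there): put a process into a net_cls cgroup and set net_cls.classid, which tc / iptables rules can then match on. The group path and the classid value are assumptions, and the cgroup directory is assumed to exist already:

    /* Sketch: tag packets of the current process via the net_cls cgroup. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Write a short string to a cgroup file. */
    static int write_str(const char *path, const char *s) {
        int fd = open(path, O_WRONLY);
        if (fd < 0) { perror(path); return -1; }
        int ret = write(fd, s, strlen(s)) < 0 ? -1 : 0;
        close(fd);
        return ret;
    }

    int main(void) {
        /* classid 0x00100001 = tc handle 10:1 (an arbitrary example). */
        write_str("/sys/fs/cgroup/net_cls/pod/net_cls.classid", "0x100001");
        /* Move ourselves into the group; our packets now carry the tag. */
        write_str("/sys/fs/cgroup/net_cls/pod/cgroup.procs", "0");
        return 0;
    }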
Isolator: disk quotas
Disk quotas
Not implemented in rkt
✤ loop device
✤ btrfs subvolumes
✣ systemd-nspawn can use them
✤ per user and group quotas
✣ not suitable for containers
✤ per project quotas: in xfs and soon in ext4
✣ open question: allocation of project id?
Conclusion
We talked about: containers, Linux namespaces, cgroups, and how rkt uses them.
CC-BY-SA