

Container mechanics

in rkt and Linux

Alban Crequy

LinuxCon Europe 2015 Dublin


Alban Crequy

✤ Working on rkt
✤ One of the maintainers of rkt
✤ Previously worked on D-Bus and AF_BUS

https://github.com/alban
Container mechanics in rkt and Linux

✤ Containers
✤ Linux namespaces
✤ Cgroups
✤ How rkt uses them
Containers vs virtual machines
virtual machines

[Diagram: each virtual machine runs its apps on its own guest Linux kernel; the guest kernels run on a hypervisor, which runs on the host Linux kernel and the hardware]
Containers
containers or pods

[Diagram: apps grouped into containers or pods, started by the Docker daemon or by rkt, all running directly on the host Linux kernel and the hardware]
rkt architecture

stage 2: apps
stage 1: systemd-nspawn or lkvm
stage 0: rkt


Containers: no guest kernel

system calls: open(), sethostname()

[Diagram: apps started by rkt call directly into the kernel API of the host Linux kernel, which runs on the hardware; there is no guest kernel]
Containers with an example

Getting and setting the hostname:

✤ The system calls for getting and setting the hostname are older than containers

int uname(struct utsname *buf);

int gethostname(char *name, size_t len);

int sethostname(const char *name, size_t len);


[Diagram: two containers with hostnames "thunderstorm" and "sunshine", on a host with hostname "rainbow"]
Linux namespaces
Processes in namespaces

[Diagram: processes 1, 2, 3, 6 and 9 split across two UTS namespaces; gethostname() returns "rainbow" in one and "thunderstorm" in the other]


Linux Namespaces
Several independent namespaces

✤ uts (Unix Timesharing System) namespace


✤ mount namespace
✤ pid namespace
✤ network namespace
✤ user namespace
Creating new namespaces
unshare(CLONE_NEWUTS);

[Diagram: process 6 calls unshare(CLONE_NEWUTS) and gets its own copy of the UTS namespace; both namespaces initially have the hostname "rainbow"]
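A minimal sketch of this sequence (illustrative, not rkt's code), assuming the caller has CAP_SYS_ADMIN:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char name[64];

    /* Get a private copy of the UTS namespace (needs CAP_SYS_ADMIN). */
    if (unshare(CLONE_NEWUTS) < 0) { perror("unshare"); return 1; }

    /* Change the hostname only inside the new namespace. */
    if (sethostname("thunderstorm", strlen("thunderstorm")) < 0) {
        perror("sethostname");
        return 1;
    }

    gethostname(name, sizeof(name));
    printf("hostname in the new UTS namespace: %s\n", name);
    /* Other processes on the host still see the original hostname. */
    return 0;
}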
PID namespace
Hiding processes and PID translation
✤ the host sees all processes
✤ the container sees only its own processes

[Diagram: a process seen as PID 1 inside the container is actually pid 30920 on the host]
Initial PID namespace

[Diagram: processes 2, 6 and 7 in the initial PID namespace]
Creating a new namespace
clone(CLONE_NEWPID, ...);
[Diagram: process 6 calls clone(CLONE_NEWPID, ...); the child is PID 7 in the initial PID namespace and PID 1 in the new PID namespace]
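A minimal sketch (illustrative, not rkt's actual code) of starting a child in a new PID namespace with clone(): the child sees itself as PID 1, while the parent sees the PID from the initial namespace.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

/* Runs as PID 1 inside the new PID namespace. */
static int child(void *arg)
{
    printf("child: getpid() = %d\n", getpid());   /* prints 1 */
    return 0;
}

int main(void)
{
    /* Needs CAP_SYS_ADMIN. The child gets a fresh PID namespace. */
    pid_t pid = clone(child, child_stack + sizeof(child_stack),
                      CLONE_NEWPID | SIGCHLD, NULL);
    if (pid < 0) { perror("clone"); exit(1); }

    printf("parent: child is pid %d in the initial namespace\n", pid);
    waitpid(pid, NULL, 0);
    return 0;
}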
rkt
rkt run …

✤ uses unshare() to create a new network namespace


✤ uses clone() to start the first process in the container with a new pid namespace

rkt enter …

✤ uses setns() to enter an existing namespace


Joining an existing namespace

setns(..., CLONE_NEWPID);

[Diagram: a process calls setns(..., CLONE_NEWPID) to join the existing PID namespace; children it creates afterwards are born inside that namespace]
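A minimal sketch of what rkt enter does conceptually: open a namespace file under /proc and join it with setns(). The target PID is taken from the command line here; this is an illustration, not rkt's code.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    /* argv[1]: PID of a process already inside the target container. */
    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/ns/pid", argv[1]);

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }

    /* Join the PID namespace; only children created afterwards are inside it. */
    if (setns(fd, CLONE_NEWPID) < 0) { perror("setns"); exit(1); }

    pid_t pid = fork();
    if (pid == 0) {
        /* This child runs inside the container's PID namespace. */
        execlp("sh", "sh", NULL);
        perror("execlp");
        exit(1);
    }
    waitpid(pid, NULL, 0);
    return 0;
}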
When does PID translation happen?
✤ the kernel always translates PIDs to the caller's PID namespace:
✣ getpid(), getppid()
✣ /proc
✣ /sys/fs/cgroup/<subsys>/.../cgroup.procs
✣ credentials passed in Unix sockets (SCM_CREDS)
✣ pid = fork()

Future:

✤ Possibly: getvpid() patch being discussed


Mount namespaces
[Diagram: the host mount namespace has / with /etc, /var, /home and the user's data; the container mount namespace has its own / with /my-app]
Storing the container data (Copy-on-write)

✤ Container filesystem
✤ Overlay fs "upper" directory: /var/lib/rkt/pods/run/<pod-uuid>/overlay/sha512-.../upper/
✤ Application Container Image: /var/lib/rkt/cas/tree/sha512-...
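A minimal sketch of the overlayfs mount this layering corresponds to; all paths below are illustrative placeholders, and the workdir is an overlayfs requirement not shown on the slide.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Read-only image tree as the lower layer, per-pod upper/work dirs
     * for copy-on-write. Paths are placeholders, not real rkt paths. */
    const char *opts =
        "lowerdir=/var/lib/rkt/cas/tree/sha512-example,"
        "upperdir=/var/lib/rkt/pods/run/pod-uuid/overlay/sha512-example/upper,"
        "workdir=/var/lib/rkt/pods/run/pod-uuid/overlay/sha512-example/work";

    if (mount("overlay", "/var/lib/rkt/pods/run/pod-uuid/rootfs",
              "overlay", 0, opts) < 0) {
        perror("mount overlay");
        return 1;
    }
    return 0;
}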
rkt directories
/var/lib/rkt
├─ cas
│  └─ tree
│     ├─ deps-sha512-19bf...
│     └─ deps-sha512-a5c2...
└─ pods
   └─ run
      └─ e0ccc8d8
         ├─ overlay/sha512-19bf.../upper
         └─ stage1/rootfs/
unshare(..., CLONE_NEWNS);

[Diagram: process 7 calls unshare(..., CLONE_NEWNS) and gets its own copy of the mount tree (/ with /etc, /var, /home); processes 1 and 3 keep the original mount namespace]
Changing root with MS_MOVE

mount($ROOTFS, "/", MS_MOVE)

[Diagram: the mount tree rooted at $ROOTFS (containing my-app) is moved over /, becoming the new root of the container's mount namespace]
$ROOTFS = /var/lib/rkt/pods/run/e0ccc8d8.../stage1/rootfs
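A minimal sketch of the MS_MOVE root switch; the surrounding chdir()/chroot() steps are assumptions about how such a switch is usually completed, not a transcription of rkt's code, and the rootfs path is a placeholder.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
    const char *rootfs = "/path/to/new/rootfs";  /* placeholder for the stage1 rootfs mount */

    /* Work from inside the new root so "." refers to it after the move. */
    if (chdir(rootfs) < 0) { perror("chdir"); return 1; }

    /* Move the rootfs mount on top of / in this mount namespace. */
    if (mount(rootfs, "/", NULL, MS_MOVE, NULL) < 0) {
        perror("mount MS_MOVE");
        return 1;
    }

    /* Make the moved mount the root directory of this process. */
    if (chroot(".") < 0) { perror("chroot"); return 1; }
    if (chdir("/") < 0) { perror("chdir /"); return 1; }
    return 0;
}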
Mount propagation events

Relationship between the two mounts:
- shared
- master / slave
- private

[Diagram: the same mount tree (/ with /etc, /var, /home) seen from two mount namespaces]
Mount propagation events

[Diagram: a /home mount seen from two namespaces under the three propagation modes: private (no events propagate), shared (events propagate both ways), master and slave (events propagate only from master to slave)]


How rkt uses mount propagation events
✤ / in the container namespace is recursively set as slave:

mount(NULL, "/", NULL, MS_SLAVE|MS_REC, NULL)

[Diagram: mount events on the host propagate into the container's mount tree (/ with /etc, /var, /home), but mounts made in the container do not propagate back to the host]
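A minimal sketch combining the two steps above: create a new mount namespace, then mark everything under / as a recursive slave (illustrative, not rkt's code).

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* New mount namespace: a private copy of the host's mount tree. */
    if (unshare(CLONE_NEWNS) < 0) { perror("unshare"); return 1; }

    /* Recursively make every mount a slave of its host counterpart:
     * host events still arrive, container mounts do not leak out. */
    if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) < 0) {
        perror("mount MS_SLAVE|MS_REC");
        return 1;
    }
    return 0;
}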
Network namespace
Network isolation
Goal:

✤ each container has its own network interfaces
✤ a container cannot see the network traffic outside it (e.g. with tcpdump)

[Diagram: container1 and container2 each have their own eth0, separate from the host's eth0]
Network tooling
✤ Linux can create pairs of virtual net interfaces
✤ they can be linked in a bridge

[Diagram: each container's eth0 is one end of a veth pair (veth1, veth2); the other ends are plugged into a bridge on the host, and the host's eth0 does IP masquerading via iptables]


How does rkt do it?
✤ rkt uses the network plugins implemented by the Container Network Interface (CNI,
https://github.com/appc/cni)

[Diagram: rkt creates the pod's network namespace, keeps a reference at /var/lib/rkt/pods/run/$POD_UUID/netns and exec()s systemd-nspawn inside it; the network plugins join the namespace with setns() and configure it via netlink]
User namespaces
History of Linux namespaces
✓ 1991: Linux

✓ 2002: namespaces in Linux 2.4.19

✓ 2008: LXC
✓ 2011: systemd-nspawn
✓ 2013: user namespaces in Linux 3.8
✓ 2013: Docker
✓ 2014: rkt

… development still active


Why user namespaces?
✤ Better isolation
✤ Run applications which would need more capabilities
✤ Per user limits
✤ Future:
✣ Unprivileged containers: possibility to have containers without root
User ID ranges
[Diagram: the host uses the full 32-bit UID range (0 to 4,294,967,295); container 1 and container 2 each see UIDs 0 to 65535, mapped to two disjoint ranges on the host]
User ID mapping
/proc/$PID/uid_map: “0 1048576 65536”

[Diagram: the 65536 container UIDs (0-65535) are mapped to host UIDs starting at 1048576; all other host UIDs are unmapped inside the container]
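A minimal sketch of writing such a mapping for a child created with CLONE_NEWUSER; it assumes it runs as root on the host, the PID comes from the command line, and the gid_map write is an assumption added for completeness.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write one line ("inside-id host-id count") to /proc/<pid>/<file>. */
static void write_map(const char *pid, const char *file, const char *mapping)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/%s", pid, file);

    int fd = open(path, O_WRONLY);
    if (fd < 0 || write(fd, mapping, strlen(mapping)) < 0) {
        perror(path);
        exit(1);
    }
    close(fd);
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid of process in new user namespace>\n", argv[0]);
        return 1;
    }
    /* Map container IDs 0-65535 onto host IDs starting at 1048576,
     * matching the "0 1048576 65536" example above. */
    write_map(argv[1], "uid_map", "0 1048576 65536");
    write_map(argv[1], "gid_map", "0 1048576 65536");
    return 0;
}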
Problems with container images
[Diagram: one downloaded Application Container Image (ACI), e.g. a web server, is shared by container 1 and container 2; each container's filesystem is the ACI plus its own overlayfs "upper" directory]


Problems with container images
✤ file UIDs / GIDs in the image do not match the container's UID mapping
✤ rkt currently only supports user namespaces without overlayfs
✣ Performance loss: no COW from overlayfs
✣ “chown -R” for every file in each container
Problems with volumes
✤ a volume can be bind-mounted (rw / ro) into several containers
✤ no UID translation on the bind mount
✤ dynamic UID maps: each container has a different mapping

[Diagram: /data on the host is bind-mounted at /data inside the container filesystem, next to /my-app]
User namespace and filesystem problem

✤ Possible solution: add options to mount() to apply a UID mapping


✤ rkt would use it when mounting:
✣ the overlay rootfs
✣ volumes
✤ Idea suggested on kernel mailing lists
Namespace lifecycle
Namespace references

[Diagram: processes 1, 2, 3, 6 and 9 holding references to their namespaces]
Namespace file descriptor

These files (/proc/$PID/ns/*) can be opened, bind mounted, fd-passed (SCM_RIGHTS); holding such a reference keeps the namespace alive
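A minimal sketch of pinning a network namespace by bind-mounting its /proc file, similar in spirit to the /var/lib/rkt/pods/run/$POD_UUID/netns reference mentioned earlier; the target path is a placeholder and its parent directory is assumed to exist.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
    const char *target = "/run/mypod/netns";  /* placeholder reference file */

    /* New network namespace for this process (needs CAP_SYS_ADMIN). */
    if (unshare(CLONE_NEWNET) < 0) { perror("unshare"); return 1; }

    /* Create the target file, then bind-mount the namespace onto it.
     * The bind mount keeps the namespace alive after this process exits. */
    int fd = open(target, O_CREAT | O_WRONLY, 0600);
    if (fd < 0) { perror("open"); return 1; }
    close(fd);

    if (mount("/proc/self/ns/net", target, NULL, MS_BIND, NULL) < 0) {
        perror("mount MS_BIND");
        return 1;
    }
    return 0;
}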


Isolators
Isolators in rkt

✤ specified in an image manifest
✤ limiting capabilities or resources
Isolators in rkt

Currently implemented:
✤ capabilities
✤ cpu
✤ memory

Possible additions:
✤ block-bandwidth
✤ block-iops
✤ network-bandwidth
✤ disk-space
Capabilities (1/3)
✤ Old model (before Linux 2.2):
✣ User root (user id = 0) can do everything
✣ Regular users are limited

✤ Now: processes have capabilities

Configuring the network: CAP_NET_ADMIN
Mounting a filesystem: CAP_SYS_ADMIN
Creating a block device: CAP_MKNOD
etc. (37 different capabilities today)
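A minimal sketch of dropping one capability from the bounding set (needs CAP_SETPCAP); the choice of CAP_MKNOD is only illustrative.

#define _GNU_SOURCE
#include <linux/capability.h>
#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
    /* Drop CAP_MKNOD from the bounding set: this process and all of its
     * descendants can no longer create block or character devices. */
    if (prctl(PR_CAPBSET_DROP, CAP_MKNOD, 0, 0, 0) < 0) {
        perror("prctl PR_CAPBSET_DROP");
        return 1;
    }

    /* Check: PR_CAPBSET_READ returns 0 once the capability is gone. */
    printf("CAP_MKNOD in bounding set: %d\n",
           (int)prctl(PR_CAPBSET_READ, CAP_MKNOD, 0, 0, 0));
    return 0;
}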


Capabilities (2/3)
✤ Each process has several capability sets:

✣ Permitted
✣ Inheritable
✣ Effective
✣ Ambient (soon!)
✣ Bounding set
Capabilities (3/3)
Other security mechanisms:

✤ Mandatory Access Control (MAC) with Linux Security Modules (LSMs):


✣ SELinux
✣ AppArmor…
✤ seccomp
Isolator: memory and cpu
✤ based on cgroups
cgroups
What’s a control group (cgroup)
✤ group processes together
✤ organised in trees
✤ applying limits to them as a group
cgroups
cgroup API

/sys/fs/cgroup/*/

/proc/cgroups

/proc/$PID/cgroup
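A minimal sketch of using this filesystem API directly with the cgroup-v1 memory controller; the group name "demo" and the 500M limit are illustrative placeholders, not rkt behaviour.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Helper: write a string to a cgroup control file. */
static int write_file(const char *path, const char *value)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror(path); return -1; }
    int ret = write(fd, value, strlen(value)) < 0 ? -1 : 0;
    close(fd);
    return ret;
}

int main(void)
{
    /* Create a new cgroup in the memory hierarchy (name is a placeholder). */
    mkdir("/sys/fs/cgroup/memory/demo", 0755);

    /* Limit the group to 500 MB, like MemoryLimit=500M does. */
    write_file("/sys/fs/cgroup/memory/demo/memory.limit_in_bytes", "500M");

    /* Move the current process into the group. */
    char pid[32];
    snprintf(pid, sizeof(pid), "%d", getpid());
    write_file("/sys/fs/cgroup/memory/demo/cgroup.procs", pid);
    return 0;
}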
List of cgroup controllers
/sys/fs/cgroup/
├─ cpu
├─ devices
├─ freezer
├─ memory
├─ ...
└─ systemd
How systemd units use cgroups
/sys/fs/cgroup/
├─ systemd
│ ├─ user.slice
│ ├─ system.slice
│ │ ├─ NetworkManager.service
│ │ │ └─ cgroup.procs
│ │ ...
│ └─ machine.slice
How systemd units use cgroups w/ containers
/sys/fs/cgroup/
├─ systemd
│ ├─ user.slice
│ ├─ system.slice
│ └─ machine.slice
│   └─ machine-rkt….scope
│     └─ system.slice
│       └─ app.service
├─ cpu
│ ├─ user.slice
│ ├─ system.slice
│ └─ machine.slice
│   └─ machine-rkt….scope
│     └─ system.slice
│       └─ app.service
├─ memory
│ ├─ user.slice
│ ├─ system.slice
│ └─ machine.slice
│   └─ ... (same layout)
...
cgroups mounted in the container
✤ the same cgroup hierarchies (systemd, cpu, memory, ...) are mounted inside the container
✤ they are mounted read-only (RO), except the pod's own sub-tree under machine-rkt….scope, which is read-write (RW)
Memory isolator

✤ Application Image Manifest: "limit": "500M"
✤ systemd service file: [Service] ExecStart= MemoryLimit=500M
✤ systemd action: write to memory.limit_in_bytes
CPU isolator

✤ Application Image Manifest: "limit": "500m"
✤ systemd service file: [Service] ExecStart= CPUShares=512
✤ systemd action: write to cpu.shares
Unified cgroup hierarchy
✤ Multiple hierarchies:
✣ one cgroup mount point for each controller (memory, cpu, etc.)
✣ flexible but complex
✣ cannot remount with a different set of controllers
✣ difficult to give to containers in a safe way
✤ Unified hierarchy:
✣ cgroup filesystem mounted only one time
✣ still in development in Linux: mount with option “__DEVEL__sane_behavior”
✣ initial implementation in systemd-v226 (September 2015)
✣ no support in rkt yet
Isolator: network
✤ limit the network bandwidth
✤ cgroup controller “net_cls” to tag packets emitted by a process
✤ iptables / traffic control to apply on tagged packets
✤ open question: allocation of tags?
✤ not implemented in rkt yet
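A minimal sketch of the net_cls tagging step described above; the cgroup name and class id are placeholders, matching the tag with iptables or tc is left out, and this is not rkt code.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Create a net_cls cgroup (placeholder name) and give it a class id;
     * tc or iptables rules can then match packets carrying this tag. */
    mkdir("/sys/fs/cgroup/net_cls/demo", 0755);

    int fd = open("/sys/fs/cgroup/net_cls/demo/net_cls.classid", O_WRONLY);
    if (fd < 0) { perror("open classid"); return 1; }
    if (write(fd, "0x00100001", strlen("0x00100001")) < 0) {
        perror("write classid");
        return 1;
    }
    close(fd);

    /* Move the current process into the group so its traffic is tagged. */
    char pid[32];
    snprintf(pid, sizeof(pid), "%d", getpid());
    fd = open("/sys/fs/cgroup/net_cls/demo/cgroup.procs", O_WRONLY);
    if (fd < 0) { perror("open cgroup.procs"); return 1; }
    write(fd, pid, strlen(pid));
    close(fd);
    return 0;
}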
Isolator: disk quotas
Disk quotas
Not implemented in rkt

✤ loop device
✤ btrfs subvolumes
✣ systemd-nspawn can use them
✤ per user and group quotas
✣ not suitable for containers
✤ per project quotas: in xfs and soon in ext4
✣ open question: allocation of project id?
Conclusion
We talked about:

✤ the isolation provided by rkt


✤ namespaces
✤ cgroups
✤ how rkt uses the namespace & cgroup API
Thanks

CC-BY-SA

Thanks Chris for the theme!
