
Evolution of the PikeOS Microkernel

Robert Kaiser
SYSGO AG, Klein-Winternheim, and
Distributed Systems Lab,
University of Applied Sciences, Wiesbaden

Stephan Wagner
SYSGO AG, Klein-Winternheim

E-mail: {rob,swa}@sysgo.com

Abstract

The PikeOS microkernel is targeted at real-time embedded systems. Its main goal is to provide a partitioned environment for multiple operating systems with different design goals to coexist in a single machine. It was initially modelled after the L4 microkernel and has gradually evolved over the years of its application to the real-time, embedded systems space. This paper describes the concepts that were added or removed during this evolution and provides the rationale behind these design decisions.

1 Introduction

Microkernels have been receiving new attention in recent years. After being discarded in the mid 1990s on the grounds of causing too much performance impact, the approach now seems to be the answer to today's computer problems: today, computers generally do not suffer from a lack of performance, but they often have severe reliability problems. This is especially relevant in the field of embedded systems: while the modern PC user may have (grudgingly) come to accept the occasional system crash as a fact of life, crashing cellular phones or video recorders are embarrassing for their manufacturers and a potential cause of loss of reputation and market share. More critically though, a malfunction in an electronic control unit of, e.g., a car or an airplane can be a severe threat to the life of humans.

Software complexity is the core problem here, and microkernels offer the possibility to tackle it with a "divide and conquer" approach: a microkernel only provides basic functionality which can be used to divide the system's resources (memory, I/O devices, CPU time) into separate subsets. Each of these subsets, which we will further refer to as partitions, can be regarded as a virtual machine¹ and as such it can host an operating system along with its world of application programs (see Figure 1). Since partitions operate on separate sets of resources, they are completely isolated and there is no way for a program in one partition to affect another partition². In this way, multiple "guest" operating systems are able to coexist in a single machine and their individual functionalities can be tailored to match the requirements of their application programs. Thus, applications are no longer forced to unconditionally trust a huge monolithic kernel containing a lot of complex functionality that the application may or may not need. Instead, each subsystem can choose the amount of code that it wants to trust: it can trade complexity for trust.

¹ We consider virtual machine monitors such as Xen [2] or VMware ESX Server [22] to be specialised microkernels.
² Unless, of course, both sides explicitly agree on a transaction.

In complex embedded systems, there frequently exist applications with very different temporal requirements: some must guarantee timely service under all circumstances, while others have no such constraint but are instead expected to work "as fast as possible" by making good use of all resources they can possibly get. The differences between such real-time and non-real-time programs are reflected in the functionality they need their underlying operating system to provide: there are distinct real-time and non-real-time operating system functions. This presents a problem in monolithic systems, because there exists only one operating system interface which has to incorporate all the real-time and non-real-time functionality. In contrast, a microkernel can host multiple operating systems, so it is possible to have distinct real-time and non-real-time interfaces coexisting in separate partitions of a single machine. However, in such a scenario, the microkernel must guarantee timely availability of sufficient computational resources to its real-time guests so that they can in turn fulfil their requirements. Not all microkernels are suitable in this respect.

Since the late 1990s, SYSGO have been developing their own microkernel. Initially, it was called "P4" and it was a faithful re-implementation of the L4 version 2.0 microkernel API as specified by Jochen Liedtke in [17].
It was targeted at embedded systems; therefore, unlike the other then-current L4 implementations, it was written almost entirely in C to facilitate porting, and it was designed to be fully preemptive to support real-time applications.

Figure 1. Partitioned system

Over the years of using this kernel in the embedded space, we identified a number of issues with the original L4 version 2.0 interface. These prompted changes of the programming interface and thus, although it is still based on the principles laid out in [16], the PikeOS microkernel's interface today resembles none of the other existing L4 API versions. In the following sections, we will discuss the problems we encountered, and we will outline some of the solutions we chose to apply.

2 Issues

Scheduler Functionality

As already mentioned, there is often a need in embedded systems for programs and operating systems with different temporal requirements to coexist. Real-time programs need to have guaranteed amounts of time at their disposal so they can meet their deadlines. Non-real-time programs need to use all the resources they can get in order to deliver optimal performance. Time allocations for real-time programs must be dimensioned according to worst-case assumptions which are conceivable, but are generally very unlikely to apply. The discrepancy between such a worst case and the average case can easily span multiple orders of magnitude, mainly due to the caches found in every modern computer architecture. Therefore, in the average case, real-time programs usually complete well before their deadlines. In order to put the remaining, excess time to good use, the system should be able to dynamically re-allocate it for use by non-real-time programs.

The classical priority-driven scheduling used by L4 supports this: by assigning sufficiently high priorities to all threads which have temporal requirements, these threads obtain access to the CPU first, and any time not consumed by them is then dynamically made available to the lower-priority, non-real-time parts of the system.

However, there is a general problem with this: any program with a sufficiently high priority has the ability to block all other programs with priorities below its own indefinitely. So, conversely, every program with a given priority is forced to trust in the cooperation of all programs that have a priority higher than its own. There is an implicit dependency between the priority level and the level of trust attributed to a program. But these two attributes do not necessarily coincide: a high priority level may need to be granted to a program because it has a requirement regarding timely execution, whereas a high level of trust should only follow from a thorough inspection of the program code. If a high-priority program exists in one partition and a lower-priority one exists in another, the low-priority program must still trust the high-priority one, so this defeats the secure isolation between partitions.

A method is needed here to guarantee (and to enforce) sufficient time quanta for partitions hosting real-time programs, thus enabling them, on the one hand, to provide timely service while, on the other hand, keeping them from affecting other partitions by consuming more CPU time than they are entitled to.

Memory Requirements and Memory Accountability

Embedded systems are designed for a specific purpose and they are expected to perform their job at the lowest possible cost. Therefore, an embedded system is usually equipped with just enough memory to do what it is made for, and it is difficult to justify a new concept in the embedded market if it implies a need for more memory.

However, experience using a microkernel like L4 shows that it does increase a system's overall memory requirements. One reason for this is related to the need (or tendency) to use threads to implement certain concepts: interrupt handlers, for example, are implemented as threads. In L4Linux, every user process needs two L4 threads for its implementation [10]. User-level synchronisation objects (e.g. counting semaphores) are implemented as threads. As a result, systems based on L4 tend to employ a rather large number of threads. Each of them requires at least one page of kernel memory for kernel stack and data structures and another page of user memory for the user stack. With a typical load of more than a hundred threads, the resulting memory consumption (8 KB per thread) can be prohibitive for many embedded applications.

Another reason is the kernel's mapping database, which is used to store the mapping trees of all of the system's physical pages. It grows and shrinks dynamically as mappings are created or deleted. There is no conceptual limit to the amount of memory that the mapping database might consume. In [18], Liedtke et al. describe a problem which has been a topic of ongoing work in the L4 community until today:
Malicious (or faulty) programs are able to exhaust the kernel's memory resources, for example, by requesting a sufficiently large number of mappings to be made. In this way, programs running in one partition can adversely affect programs in other partitions, thereby again defeating the secure isolation between partitions. Furthermore, the mapping database requires a kernel heap allocator³ to deliver memory blocks which are used to store mapping entries. These entries are made on behalf of different user programs. So, while programs can be charged individually for the kernel memory they consume to store page tables, there is no straightforward way to do this also for the amounts of kernel memory that the mapping database consumes on their behalf.

³ The PikeOS kernel originally employed a "slab" allocator ([4]) for this purpose.

Code Complexity

The PikeOS kernel is targeted for use in safety-critical applications. Thus it must be prepared for a comprehensive validation according to safety standards such as [20]. Since the kernel runs in privileged mode, all of its code contributes to the trusted code base of every application that might run on top of it. Therefore, the amount of kernel code must be kept minimal. L4 implementations have traditionally been very good in this respect. However, even in L4 there is a kernel mapping database which, besides its problematic consumption of kernel memory, is also a complex piece of code, and so is the underlying slab allocator. This makes it hard (read: costly) to validate.

On the other hand, some L4 concepts tend to force unnecessary complexity on user-level code. An example of this is the construction of address spaces by means of IPC: the creator of an address space has to provide a client thread to run in the new address space for the sole purpose of accepting its mappings. Creation of an address space is an operation that all user-level programs (including safety-critical ones) need to do at some point, so it does not help to reduce kernel complexity at the cost of making these operations overly complex at the user level: either way, the complex code will be part of the trusted code base.

Access Privileges

Application programs running under a guest operating system need not (and, according to the principle of least privilege, should not) be able to access the microkernel's system call interface directly. In fact, they should not even be aware of the microkernel's presence. Otherwise, for example, a Linux user process could change its address space layout without the Linux kernel knowing about it. However, the L4 version 2.0 interface does not provide any means to restrict access to its system call interface. Thus, any thread is able to consume kernel resources (e.g. by installing mappings), to manipulate system settings or to delay a given schedule, etc.

Furthermore, L4 lacks a flexible yet powerful IPC control mechanism to manage the information flow of a complex system easily and effectively. The "clans and chiefs" method introduced in [15] is generally regarded as a simple but too inflexible solution.

3 Approaches

The scope of this paper does not allow for an exhaustive discussion of the details of all the changes that were made. Therefore, we will concentrate mainly on the two changes that will probably be considered the most radical ones by the L4 community, namely the addition of partition scheduling and the removal of the mapping database. Subsequently, we will briefly discuss some more modifications, ordered by relevance.

Partition Scheduling

The goal of PikeOS is to provide partitions (or virtual machines) that comprise a subset of the system's resources. Processing time is one of those resources. We expect the partitions to host a variety of guest operating systems with different requirements regarding timely execution. There will typically be real-time as well as non-real-time systems, and the real-time systems will generally fall into one of two categories:

• Time-triggered: Threads are executed according to a static schedule which activates each thread at predetermined points in time and for a predetermined amount of time.

• Event-triggered: Threads are activated in response to external events. Scheduling priorities are used to decide which thread is activated first in case multiple events occur at the same time.

Both approaches have their specific advantages and disadvantages [7]. They are mutually exclusive: allowing events to interrupt a time-triggered schedule affects its determinism (it increases jitter). On the other hand, reserving a time slot in the schedule for processing any pending events leads to poor worst-case response times and –again– jitter of event-triggered threads. So, whenever the two approaches are combined in a system, one of them has to be given precedence and, as a consequence, the other is destined to perform poorly.

We can not expect guest operating systems to trust each other.
Therefore, a scheduling technique had to be devised that on the one hand allows real-time and non-real-time systems to coexist efficiently, but that on the other hand avoids the implicit dependency between priority and trust we described earlier. This method had to be efficient and simple in order to not add to the complexity of trusted code. To our knowledge, the scheduling method used in the PikeOS kernel is both unique and new ([13]). It is a superset of the method described in the ARINC 653 standard [1], which is commonly applied in the field of avionics:

Like L4, the PikeOS microkernel uses threads with static priority levels to represent activities. But unlike L4, they are grouped into sets which we refer to as "time partitions", τi. The microkernel supports a configurable number of such time partitions. Each of them is represented in the microkernel as an array of linked lists (one list per priority level). Threads that have the same priority level are linked into the same list in a first-in/first-out manner, and the thread at the head of the list is the first to be executed. When a thread blocks, it is appended to the end of the list. So, within each time partition, there is the same scheduling method that L4 uses, i.e. a classical, priority-driven scheduling with round-robin scheduling between threads at the same priority. However, unlike L4, the PikeOS microkernel supports multiple time partitions instead of just one. Threads can only execute while their corresponding time partition is active, regardless of their priority. If we cycled through the time partitions, activating each one at a time for a fixed duration, we would obtain the behaviour of an ARINC 653 scheduler. But in contrast to this, the PikeOS kernel allows two of the time partitions to be active at the same time:

• The first time partition, τ0, plays a special role in that it is always active. It is referred to as the "background" partition.

• Of all other partitions τi (i ≠ 0), only one can be active at a time. The microkernel provides a (privileged) system call to select the currently active time partition. It is referred to as the "foreground" partition. Switching happens cyclically, according to a pre-configured, static schedule.

The microkernel scheduler always selects for execution the thread with the highest priority from the set union of τ0 and the currently active τi. Figure 2 shows the principle.

Figure 2. PikeOS partition scheduler: principle of operation.

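This selection rule can be illustrated with a short sketch. The following C fragment is not taken from the PikeOS sources; the identifiers (time_partition, pick_next(), NPRIO) are invented and the per-priority ready lists are reduced to plain singly linked queues. It only shows the essential point: no matter how many time partitions are configured, the scheduler ever inspects just two of them, τ0 and the currently active foreground partition.

    #include <stddef.h>

    #define NPRIO 256                  /* number of static priority levels */

    struct thread {
        struct thread *next;           /* FIFO link within one ready list  */
        int            prio;           /* static priority, 0 .. NPRIO-1    */
    };

    /* A time partition is an array of FIFO ready lists, one per priority. */
    struct time_partition {
        struct thread *ready[NPRIO];
    };

    static struct time_partition tau0;        /* background partition, always active */
    static struct time_partition *foreground; /* active tau_i, set by switch_tp()    */

    /* Return the highest-priority ready thread from the union of tau0 and the
     * active foreground partition.  A real kernel would use priority bitmaps
     * instead of this linear scan to find the highest occupied level quickly. */
    static struct thread *pick_next(void)
    {
        for (int prio = NPRIO - 1; prio >= 0; prio--) {
            if (foreground && foreground->ready[prio] != NULL)
                return foreground->ready[prio];    /* head of the list runs first */
            if (tau0.ready[prio] != NULL)
                return tau0.ready[prio];
        }
        return NULL;                               /* no ready thread: idle */
    }

Which of the two partitions wins when both hold a ready thread at the same priority level is not specified here; the sketch arbitrarily favours the foreground partition.
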
The threads have different semantics, depending on their priorities and on their time partitions:

• τi (i ≠ 0): The system cyclically activates each of the possible τi in turn, giving each of them a configurable portion of the cycle time. So, the totality of all threads in any of these time partitions receives a fixed amount of time at fixed points in time. During their active time slice, the threads compete for the CPU according to their priorities. Generally, these threads will be configured to use a mid-level range of priorities (though this is not technically necessary).

• τ0: The semantics of the threads assigned to the background partition, τ0, depend on their respective priorities.

  – Low-priority threads within τ0 will receive the processing time that was assigned to, but not used by, the mid-priority threads in τi. All threads that do not have any real-time requirements are therefore assigned to τ0, and they are all given the same, low priority level. Since their priorities are equal, they run under a round-robin scheduler, sharing their amount of computation time evenly.

  – Mid- or high-priority threads in τ0 compete with the currently active foreground partition. If their priorities are higher, they can preempt any low- or mid-priority threads at any time. This is used to implement event-driven real-time threads.

The relation between the priorities of the foreground and the background partition decides whether time-driven threads take precedence over event-driven ones or vice versa: if an event-driven thread's priority is higher, it can preempt time-driven threads at any time, effectively "stealing" their execution time. Therefore, the points in time when time-driven threads are activated become nondeterministic to some extent (i.e. they "jitter"). The maximum jitter is increased by the cumulative worst-case execution time of all high-priority threads (which has to be bounded). It should be noted that in this case, the high-priority threads have to be trusted to not exceed their execution time.

In the opposite case, i.e. the time-triggered threads having a higher priority, the situation is reversed: the event-triggered threads need to trust the time-triggered threads to leave over sufficient time for processing events. This can simply be done by leaving a part of the schedule unallocated.

The PikeOS partition scheduler can not completely solve the dilemma between time-triggered and event-triggered real-time systems, but it does offer good flexibility in this respect: the decision whether to give precedence to time-triggered or event-triggered threads can be made individually for every time-triggered partition, by selecting the priorities of its threads to be either above or below those of the event-triggered threads.

The microkernel implements only the mechanism⁴ to select one of the time partitions as the active foreground partition. The policy part of deciding which of the time partitions is activated when is left to the user level. This can be done by an interrupt handler thread which runs at a high priority in τ0. In a typical configuration, this thread gets activated by an external one-shot timer whenever the timeslice of the currently active foreground partition has expired. It then activates the next time partition, re-programs the one-shot timer to trigger an interrupt when the next partition's time allocation expires and then waits for this interrupt. However, this is only one possible scenario: since it is implemented at the user level, the partition switching policy can easily be replaced without any changes to the microkernel.

⁴ Called switch_tp() in Figure 2.

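As an illustration, such a user-level switching policy could look roughly like the sketch below. The timer and kernel primitives (switch_time_partition(), program_oneshot_timer(), wait_for_irq()) and the schedule table are assumptions made for the example, not the actual PikeOS interfaces; the point is only the division of labour: the kernel supplies the switch mechanism, while the cyclic, ARINC 653-style schedule lives entirely in a high-priority τ0 thread.

    /* Hypothetical user-level partition switcher: a high-priority thread in the
     * background partition walks a static, pre-configured schedule and asks the
     * kernel to activate the next foreground partition on every timer expiry. */

    struct tp_slot {
        unsigned partition;       /* time partition to activate              */
        unsigned duration_us;     /* length of its time slot in microseconds */
    };

    /* Example: a 10 ms major cycle shared by two real-time partitions, with
     * the rest of the cycle left unallocated for event-driven tau0 threads. */
    static const struct tp_slot schedule[] = {
        { 1, 4000 },              /* tau_1 for 4 ms                  */
        { 2, 4000 },              /* tau_2 for 4 ms                  */
        { 0, 2000 },              /* no foreground demand: tau0 only */
    };
    #define NSLOTS (sizeof(schedule) / sizeof(schedule[0]))

    /* Assumed primitives; these are not the real P4/PikeOS system call names. */
    extern void switch_time_partition(unsigned partition);   /* privileged call */
    extern void program_oneshot_timer(unsigned microseconds);
    extern void wait_for_irq(void);                 /* block until the timer IRQ */

    void partition_switcher(void)
    {
        for (unsigned i = 0; ; i = (i + 1) % NSLOTS) {
            switch_time_partition(schedule[i].partition);   /* mechanism: kernel     */
            program_oneshot_timer(schedule[i].duration_us); /* policy: at user level */
            wait_for_irq();
        }
    }

Replacing this policy, for example by a different cycle length or a dynamically computed schedule, only requires changing this thread, not the microkernel.
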
In practice, for a safe coexistence of multiple real-time systems, each of them should have its own time slot, i.e. its threads should be assigned to one of the foreground partitions τi (i ≠ 0). In this way, each of them can rely on receiving a guaranteed amount of time at cyclically repeated (i.e. fixed) points in time. This is a necessary requirement when applying scheduling analysis to a real-time subsystem (see, for example, [3]). The possibility to override this strict time-driven scheduling scheme by high-priority background partition threads is reserved for selected activities which can be trusted to consume only negligible amounts of CPU time⁵. With all real-time systems being confined to their individual time slots, their worst-case response time and jitter directly depend on the cycle frequency at which their time slots are repeated (see [12]). It would be desirable to make this frequency as high as possible, but this is limited by the cost of switching between time partitions. Therefore, the scheduling algorithm that is executed as part of these switches has to be both fast (i.e. non-complex) and bounded. The PikeOS partition scheduler meets these requirements: since it has to consider only two partitions for each switch, its execution time is constant (i.e. O(1)). Practical experience with a PowerPC platform⁶ has shown worst-case partition switch times to be in the range of 25µs. Therefore, if an application can live –for example– with a context switch overhead of 10%, minimum per-partition time allocations can be made as low as 250µs.

⁵ The microkernel maintains a "maximum controlled priority" as defined in the L4 version 2.0 API [17] to ensure that untrusted activities can not assume priorities above their limit.
⁶ Motorola MPC5200 at 400 MHz.

Recently, a new scheduling method called "adaptive partitioning" [5] has been announced by QNX Software Systems. Like the PikeOS scheduler, this approach pursues the goal of combining deterministic time allocation with good processor utilisation. It also represents groups of threads as partitions and uses a combination of priority-driven and time-driven scheduling, but no per-partition time slots are defined, i.e. all partitions can compete for the CPU at all times, based on their threads' priorities. This would suggest an O(n) complexity of the scheduler, but sufficient details about the implementation are not available. The guaranteed time allocations of partitions are specified as relative values (percentages of the total CPU processing capacity), but it is not clear how these could be mapped to time slots, i.e. points in time when a partition receives the CPU and durations for how long it will be able to retain it. The approach uses an "averaging window"⁷ during which a partition is entitled to receive its percentage of CPU capacity, but it also allows special "critical partitions" to exceed their allocations during an averaging window, as long as they later repay the "debt" which they accumulate in this way. In summary, this approach seems to be more complex than the PikeOS partition scheduler, which would imply a more complex trusted code base. It appears to be targeted mainly towards soft real-time applications: while it is able to guarantee a deterministic average distribution of CPU resources to its partitions, the exact time slot during which a particular partition has access to the CPU is not very well defined.

⁷ The size of this window, according to [5], is typically in the range of 100 milliseconds.

Abandonment of the Mapping Database

The method of creating address spaces recursively on top of other address spaces as described by Liedtke in [16] is an intriguingly elegant mechanism. However, the concept calls for a recursive unmap operation, which in turn necessitates a mapping database to keep track of the mapping history of all pages. This mapping database is the source of many problems, most of which have already been introduced (i.e. potentially unlimited kernel memory consumption, difficulty of attributing the memory to individual kernel resource partitions, general code complexity). An additional problem which is relevant especially for real-time systems is the difficulty of estimating the worst-case execution time needed for the recursive termination of task trees. The sum of all these problems provided a strong motivation to question the need for a mapping database in general, and an analysis of the practical use cases in the context of PikeOS revealed that recursive unmap operations normally do not occur:

The address spaces in which guest operating systems exist are constructed and destroyed by so-called "head tasks" (see Figure 3). The only time when these head tasks unmap pages is when they destroy their child's address space. In this special case, recursive unmapping is not needed, since all address spaces that have been constructed on top of the one being destroyed are also destroyed anyway. Technically, a head task could revoke mappings without destroying the child's address space, but the PikeOS head tasks can be trusted to not do this. Since they are the parents of all user-level processes and since a task must fully trust its parent anyway, this is not a new requirement.

Figure 3. Simplified mappings in PikeOS

The range of guest operating systems that PikeOS can host in its partitions goes from relatively "simple", typically real-time and/or safety-critical systems, up to "complex" operating systems with virtual memory support (e.g. Linux). The former usually exist in a single, static address space. They do not require any mapping functionality themselves. In fact, these subsystems could do just as well with memory protection only. The latter implement their own memory management. These create tasks to which they map pages, but those tasks do not create any subtasks themselves⁸ to which they could map pages. Thus, all requests to revoke a mapping are either made from within a task itself or from its direct parent. These non-recursive unmap operations, however, can be implemented without a mapping database. Therefore, it was possible to remove it without substitution.

⁸ This is usually enforced by not giving the tasks the "ability" (see below) to access the microkernel interface. And even without such an enforcement, the effects of then-possible violations would still remain confined to their partitions.

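Because every legitimate unmap request thus comes either from the task itself or from its direct parent, revoking a mapping reduces to clearing entries in the page tables of one known address space; no record of where a page may have been mapped onward has to be consulted. The sketch below illustrates this under that assumption; the types and the pte_clear() primitive are invented for the example and do not reflect the actual PikeOS data structures.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096u

    struct address_space;

    struct task {
        struct task          *parent;
        struct address_space *space;
    };

    struct address_space {
        struct task *owner;       /* the task living in this address space */
        /* page-table root etc. omitted */
    };

    /* Architecture-specific removal of one page-table entry; declared here
     * only to keep the sketch self-contained. */
    extern void pte_clear(struct address_space *space, uintptr_t vaddr);

    /* Unmap npages pages starting at vaddr from 'target'.  The caller must be
     * the owning task itself or its direct parent; since mappings are never
     * handed on recursively, no mapping database is needed to track further
     * mappings derived from these pages. */
    static bool unmap(struct task *caller, struct address_space *target,
                      uintptr_t vaddr, size_t npages)
    {
        if (caller != target->owner && caller != target->owner->parent)
            return false;                      /* neither self nor direct parent */

        for (size_t i = 0; i < npages; i++)
            pte_clear(target, vaddr + i * PAGE_SIZE);
        return true;                           /* TLB invalidation omitted */
    }
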
Recent research has shown formal verification of an L4 mapping database to be feasible ([23]). Thus, the code complexity argument made above against the mapping database may have lost some of its weight in the meantime. Nevertheless, the fact remains that validation/verification of a mapping database is a major task and, since its functionality is not required in the context of the PikeOS system, its removal still remains justified.

There exists at least one other L4 variant which, according to [19], also works without a mapping database [21]. Like PikeOS, this L4 variant explicitly addresses real-time, embedded systems, which suggests a similar motivation for this modification (although it has not been discussed in any paper we are aware of).

Kernel Resource Partitioning

To facilitate the accounting of kernel resources and thereby to avoid the possibility of denial-of-service attacks as described in [18], the concept of kernel resource partitions was introduced into the PikeOS microkernel. The approach is straightforward: the global kernel memory pool of L4 was split into a configurable number of pools, which we refer to as kernel resource partitions. Every task is assigned to one of these partitions at creation time. Whenever the kernel allocates memory on behalf of a user thread, that memory is taken from the corresponding kernel resource partition. If the memory resources of a partition are exhausted, no further allocation for this partition is possible. Other resource partitions, however, are not affected. Thus, denial-of-service attacks are only possible among threads which belong to the same resource partition. The configuration of the resource partitions is done by a specific system call. The initial task, σ0, has the "ability" (see below) to make this call. Since all other tasks are ultimately children of σ0, and since σ0 can be trusted to not pass this ability on to any of them, it is effectively the only task that is able to use this system call.

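A minimal sketch of this scheme, with invented names (kres_partition, kmem_alloc()) and a trivial bump allocator standing in for the real kernel allocator: each task carries a reference to its statically sized pool, and exhaustion of one pool cannot disturb allocations charged to any other.

    #include <stddef.h>
    #include <stdint.h>

    /* One kernel resource partition: a kernel memory pool whose size is fixed
     * at boot time by the system configuration. */
    struct kres_partition {
        uint8_t *base;            /* start of this partition's kernel memory */
        size_t   size;            /* total size, configured at boot          */
        size_t   used;            /* bytes handed out so far                 */
    };

    struct task {
        struct kres_partition *kres;   /* assigned once, at task creation */
    };

    /* Allocate kernel memory on behalf of 'task'.  Only the pool of the task's
     * own resource partition can ever be exhausted by this call. */
    static void *kmem_alloc(struct task *task, size_t bytes)
    {
        struct kres_partition *p = task->kres;

        bytes = (bytes + 15u) & ~(size_t)15u;   /* keep allocations aligned    */
        if (bytes > p->size - p->used)
            return NULL;                        /* this partition is exhausted */

        void *mem = p->base + p->used;
        p->used += bytes;
        return mem;
    }

With pool sizes fixed at boot time, the worst-case kernel memory available to every partition is known by construction.
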
Clearly, this approach is a simplistic solution to the problem of kernel resource allocation: it is quite inflexible, since all kernel memory resources have to be statically defined at boot time. Nevertheless, for the PikeOS system, it serves the purpose of enabling safe isolation between partitions while keeping the trusted code base small. Its inflexibility is not as big a problem in practice as it may seem: for real-time systems, it is usually not difficult to determine their worst-case kernel memory requirements beforehand and –unlike their time requirements– the discrepancy between worst case and average case is generally not as big. For general-purpose systems like Linux, kernel memory consumption is definitely not easy to predict. However, these systems can implement strategies for graceful degradation: the PikeOS Linux server, for example, flushes its clients' mappings when its kernel resource partition runs out of memory, thus freeing up the kernel memory used for their page tables. This leads to a performance degradation akin to TLB thrashing, but, at least, the system remains stable.

The problem of managing kernel memory requirements has been a topic of active research in the L4 community for several years, and it still is. Several approaches have been proposed (e.g. [8, 6]), but apparently, the final solution is yet to be determined.
These approaches aim to solve the far more challenging problem of dynamically changing kernel memory requirements, while PikeOS can live with a static, per-partition allocation.

Mapping System Call

To enable fast and simple creation of entire address spaces, the PikeOS kernel provides a special mapping system call. This system call is restricted to either the caller's address space or its child's address space. The receiver of such a mapping does not have to explicitly agree to receive the mapping. At first glance, this looks like a violation of Liedtke's principle that any transactions between address spaces must be explicitly agreed upon by all parties [16]. However, since the source of the mapping in this case is either the task itself (i.e. a trusted party), or its parent, which could just as well create a thread in the target address space under its own control that would then accept the mappings, this is not really a security risk. Although not strictly necessary, this facility greatly simplifies the process of constructing address spaces at the user-level side, while the added complexity at the kernel side is almost negligible.

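A hypothetical sketch of such a call's permission check (the names sys_map() and install_mapping() are invented; flags, page counts and error codes are simplified away): the kernel only has to verify that the destination is the caller itself or one of its direct children before updating the destination's page tables.

    #include <stdbool.h>
    #include <stdint.h>

    struct task {
        struct task *parent;
    };

    /* Architecture-specific page-table update; declared for the sketch only. */
    extern void install_mapping(struct task *dst, uintptr_t dst_vaddr,
                                uintptr_t src_vaddr, unsigned flags);

    /* Map a page of the caller's address space either into the caller itself
     * or into a direct child.  No accept phase is required in the receiver,
     * because the source is the task itself or its parent, which the task has
     * to trust anyway. */
    static bool sys_map(struct task *caller, struct task *dst,
                        uintptr_t src_vaddr, uintptr_t dst_vaddr, unsigned flags)
    {
        if (dst != caller && dst->parent != caller)
            return false;             /* neither own nor child address space */

        install_mapping(dst, dst_vaddr, src_vaddr, flags);
        return true;
    }
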
access privileges of MINIX 3 ([9]). The L4::Pistachio
kernel variant supports the notion of ”privileged threads”,
Events which are entitled (by virtue of being members of an elite
group of tasks) to make certain system calls. This also
bears some similarity with PikeOS’s abilities, however the
Many of the services that where implemented on top selection of accessible system calls is pre-set and the right
of the PikeOS microkernel have a requirement for a ba- to access them can not be passed to tasks outside of the
sic, asynchronous communication primitive (e.g. a count- elite group.
ing semaphore). The L4 API only supports synchronous To control IPC communication, the PikeOS kernel assigns
IPC. A counting semaphore can easily be implemented in communication rights to each task. A thread is only
user space based on the IPC call, so, according to the prin- allowed to send an IPC or to signal an event to another
ciple of minimality, this is how it should be done. However, thread if its task has the right of communication with the
this would require a user thread managing the semaphore destination task. By default, every task is only allowed to
object. The resulting memory overhead9 was considered communicate with its parent, but a parent is able to both
high enough to justify a slight violation of the minimality grant the communication right between any of its children
principle by introducing a new concept, the event, into the (i.e. siblings) and to grant communication rights with
kernel interface. arbitrary, unrelated tasks.
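As an illustration, abilities can be pictured as a bitmask checked on every system call entry. The bit names and helper functions below are invented for the sketch and do not correspond to the actual PikeOS ability groups.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical ability bits: each bit enables one group of system calls. */
    enum {
        ABILITY_THREAD_CTRL = 1u << 0,   /* create/manipulate threads       */
        ABILITY_MAP         = 1u << 1,   /* mapping system call             */
        ABILITY_KRES_CONFIG = 1u << 2,   /* configure kernel resource pools */
    };

    struct task {
        uint32_t abilities;              /* fixed when the task is created  */
    };

    /* Checked on every system call entry.  On failure the kernel returns an
     * error or redirects an exception message to the caller's exception
     * handler, which allows the call to be virtualised. */
    static bool ability_check(const struct task *t, uint32_t required)
    {
        return (t->abilities & required) == required;
    }

    /* A parent can only pass on abilities it holds itself, so a child's set
     * is always a subset of its parent's. */
    static uint32_t abilities_for_child(const struct task *parent,
                                        uint32_t requested)
    {
        return parent->abilities & requested;
    }
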
The concept of abilities –although developed independently– is similar to the fine-grained kernel access privileges of MINIX 3 ([9]). The L4::Pistachio kernel variant supports the notion of "privileged threads", which are entitled (by virtue of being members of an elite group of tasks) to make certain system calls. This also bears some similarity with PikeOS's abilities; however, the selection of accessible system calls is pre-set and the right to access them can not be passed to tasks outside of the elite group.

To control IPC communication, the PikeOS kernel assigns communication rights to each task. A thread is only allowed to send an IPC or to signal an event to another thread if its task has the right of communication with the destination task. By default, every task is only allowed to communicate with its parent, but a parent is able both to grant communication rights between any of its children (i.e. siblings) and to grant communication rights with arbitrary, unrelated tasks.

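The send-side check can be sketched in the same spirit; the per-task rights table and the function names are again invented, and the actual kernel representation will certainly differ.

    #include <stdbool.h>

    #define MAX_TASKS 64   /* illustrative system-wide task limit */

    struct task {
        unsigned      id;
        struct task  *parent;
        /* One flag per potential peer.  At creation time only the parent's
         * flag is set; the parent may later set further flags to allow
         * communication with siblings or with unrelated tasks. */
        unsigned char may_talk_to[MAX_TASKS];
    };

    /* Applied to synchronous IPC sends and to event signalling alike. */
    static bool comm_allowed(const struct task *sender, const struct task *receiver)
    {
        return sender->may_talk_to[receiver->id] != 0;
    }

    /* Only the parent of 'child' may extend the child's communication rights. */
    static bool grant_comm_right(struct task *caller, struct task *child,
                                 const struct task *peer)
    {
        if (child->parent != caller)
            return false;
        child->may_talk_to[peer->id] = 1;
        return true;
    }
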
4 Conclusion

The PikeOS microkernel is an early spin-off from the line of L4 kernel development, which has been adapted to the specific needs of safety-critical real-time embedded systems. It features partitioning of both temporal as well as spatial resources. Compared to other L4 variants that have been developed in parallel, it focuses on real-time computing and minimising trusted code, sacrificing flexibility in some cases.

It currently runs on various PowerPC, MIPS and IA-32 machines. A number of different real-time as well as non-real-time operating system interfaces have been ported to run in the microkernel's partitions. Among them are Linux, POSIX PSE51, ITRON, OSEK OS, an Ada runtime system, a JVM and a Soft-PLC runtime system. Some performance comparisons between the PikeOS Linux server and native Linux on x86 platforms have been published in [14]. The average performance impact of 22% is slightly higher than that of other microkernel-based Linux implementations; however, optimizing the performance of Linux has always been a secondary concern for this system.

5 Outlook

Current development of PikeOS goes in several directions:

• Single-threaded kernel: The PikeOS microkernel –like Fiasco ([11])– was designed to be fully preemptive. However, a fully preemptive kernel is always more complex, opening up many possibilities for, e.g., race conditions which are hard to detect. It also requires more resources, and the transient time spent in the kernel is very short anyway, so a preemptive kernel may not really be worth the effort. Thus, in hindsight, this idea should be reconsidered.

• Multiprocessor: With the advent of multicore CPUs, embedded multiprocessor systems are becoming feasible. The PikeOS microkernel was not designed for multiprocessors, so this must be remedied. Besides being a necessary step to support future processor generations, this also opens up the interesting new prospect of finally solving the conflict between time-driven and event-driven threads by binding them to separate processor cores.

• Certification: An effort to certify an appliance using PikeOS to DO-178B ([20]) is currently underway.

References

[1] ARINC. Avionics Application Software Standard Interface. Technical Report ARINC Specification 653, Aeronautical Radio, Inc., 1997.
[2] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization, 2003.
[3] G. Bollella. Slotted priorities: supporting real-time computing within general-purpose operating systems. PhD thesis, University of North Carolina at Chapel Hill, 1997. Advisor: Kevin Jeffay.
[4] J. Bonwick and Sun Microsystems. The slab allocator: An object-caching kernel memory allocator. In USENIX Summer, pages 87–98, 1997.
[5] D. Dodge, A. Dank, S. Marineau-Mes, P. van der Veen, C. Burgess, T. Fletcher, and B. Stecher. Process scheduler employing adaptive partitioning of process threads. Canadian patent application CA000002538503A1, March 2006.
[6] D. Elkaduwe, P. Derrin, and K. Elphinstone. Kernel data - first class citizens of the system. Technical Report PA005847, National ICT Australia and University of New South Wales, December 2005.
[7] G. Fohler. Flexible Reliable Timing - Real-Time vs. Reliability. Keynote address, 10th European Workshop on Dependable Computing, 1999.
[8] A. Haeberlen. User-level Management of Kernel Memory. In Proc. of the 8th Asia-Pacific Computer Systems Architecture Conference, Aizu-Wakamatsu City, Japan, September 2003.
[9] J. N. Herder, H. Bos, B. Gras, P. Homburg, and A. S. Tanenbaum. Modular System Programming in MINIX 3. ;login:, 31(2):19–28, April 2006.
[10] M. Hohmuth. Linux-Emulation auf einem Mikrokern. Diploma thesis, TU Dresden, August 1996.
[11] M. Hohmuth. The Fiasco kernel: Requirements definition, 1998.
[12] R. Kaiser. Scheduling Virtual Machines in Real-time Embedded Systems. In S. A. Brandt, editor, OSPERT 2006 Workshop on Operating Systems for Embedded Real-Time Applications, pages 7–15, July 2006.
[13] R. Kaiser and R. Fuchsen. Verfahren zur Verteilung von Rechenzeit in einem Rechnersystem. German patent DE102004054571A1, November 2004.
[14] R. Kaiser, S. Wagner, and A. Zuepke. Safe and Cooperative Coexistence of a SoftPLC and Linux. 8th Real-Time Linux Workshop, Lanzhou, China, September 2006.
[15] J. Liedtke. Clans & chiefs. In Architektur von Rechensystemen, 12. GI/ITG-Fachtagung, pages 294–305, London, UK, 1992. Springer-Verlag.
[16] J. Liedtke. On µ-Kernel Construction. In SOSP, pages 237–250, 1995.
[17] J. Liedtke. L4 Reference Manual - 486, Pentium, Pentium Pro, 1996.
[18] J. Liedtke, N. Islam, and T. Jaeger. Preventing Denial-of-Service Attacks on a µ-Kernel for WebOSes, 1999.
[19] NICTA – National ICT Australia. Embedded - ERTOS. Online: http://ertos.nicta.com.au/research/l4/embedded.pml, 2006.
[20] RTCA. Software Considerations in Airborne Systems and Equipment Certification. Guideline DO-178A, Radio Technical Commission for Aeronautics, One McPherson Square, 1425 K Street N.W., Suite 500, Washington DC 20005, USA, March 1985.
[21] S. Ruocco. Real-Time Programming and L4 Microkernels. National ICT Australia, 2006.
[22] J. Sugerman, G. Venkitachalam, and B.-H. Lim. Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor. In Proceedings of the General Track: 2002 USENIX Annual Technical Conference, pages 1–14. USENIX Association, 2001.
[23] H. Tuch, G. Klein, and G. Heiser. OS verification — now! In M. Seltzer, editor, Proc. 10th Workshop on Hot Topics in Operating Systems (HotOS X), 2005.
