QNX Neutrino RTOS System Architecture
© 1996-2014, QNX Software Systems Limited
About This Guide
This guide also covers:
System services
Shared objects
Device drivers
Image, RAM, Power-Safe, QNX 4, DOS, CD-ROM, Flash, NFS, CIFS, Ext2, and other filesystems (p. 177)
Persistent Publish/Subscribe (PPS)
Network subsystem
TCP/IP implementation
Fault recovery
Glossary
For information about programming, see Get Programming with the QNX Neutrino
RTOS and the QNX Neutrino Programmer's Guide.
Typographical conventions
Throughout this manual, we use certain typographical conventions to distinguish
technical terms. In general, the conventions we use conform to those found in IEEE
POSIX publications.
The following table summarizes our conventions:
Reference                    Example
Code examples
Command options              -lR
Commands                     make
Environment variables        PATH
Filenames                    /dev/null
Function names               exit()
Keyboard chords
Keyboard input               Username
Keyboard keys                Enter
Program output               login:
Variable names               stdin
Parameters                   parm1
User-interface components    Navigator
Window title                 Options
Cautions tell you about commands or procedures that may have unwanted
or undesirable side effects.
Note to Windows users
In our documentation, we use a forward slash (/) as a delimiter in all pathnames,
including those pointing to Windows files. We also generally follow POSIX/UNIX
filesystem conventions.
Technical support
Technical assistance is available for all supported products.
To obtain technical support for any QNX product, visit the Support area on our website
(www.qnx.com). You'll find a wide range of support options, including community
forums.
Chapter 1
The Philosophy of the QNX Neutrino RTOS
The primary goal of the QNX Neutrino RTOS is to deliver the open systems POSIX API
in a robust, scalable form suitable for a wide range of systems, from tiny,
resource-constrained embedded systems to high-end distributed computing
environments. The OS supports several processor families, including x86 and ARM.
For mission-critical applications, a robust architecture is also fundamental, so the OS
makes flexible and complete use of MMU hardware.
Of course, simply setting out these goals doesn't guarantee results. We invite you to
read through this System Architecture guide to get a feel for our implementation
approach and the design trade-offs chosen to achieve these goals. When you reach
the end of this guide, we think you'll agree that QNX Neutrino is the first OS product
of its kind to truly deliver open systems standards, wide scalability, and high reliability.
Product scaling
Since you can readily scale a microkernel OS simply by including or omitting the
particular processes that provide the functionality required, you can use a single
microkernel OS for a much wider range of purposes than you can a realtime executive.
Product development often takes the form of creating a product line, with successive
models providing greater functionality. Rather than be forced to change operating
systems for each version of the product, developers using a microkernel OS can easily
scale the system as needed, by adding filesystems, networking, graphical user
interfaces, and other technologies.
Some of the advantages to this scalable approach include:
portable application code (between product-line members)
common tools used to develop the entire product line
portable skill sets of development staff
reduced time-to-market
Apart from any bandwagon motive for adopting industry standards, there are several
specific advantages to applying the POSIX standard to the embedded realtime
marketplace:
Multiple OS sources
Hardware manufacturers are loath to choose a single-sourced hardware
component because of the risks implied if that source discontinues
production. For the same reason, manufacturers shouldn't be tied to a single-sourced OS.
Microkernel architecture
Buzzwords often fall in and out of fashion. Vendors tend to enthusiastically apply the
buzzwords of the day to their products, whether the terms actually fit or not.
The term microkernel has become fashionable. Although many new operating systems
are said to be microkernels (or even nanokernels), the term may not mean very
much without a clear definition.
Let's try to define the term. A microkernel OS is structured as a tiny kernel that provides
the minimal services used by a team of optional cooperating processes, which in turn
provide the higher-level OS functionality. The microkernel itself lacks filesystems and
many other services normally expected of an OS; those services are provided by optional
processes.
The real goal in designing a microkernel OS is not simply to make it small. A
microkernel OS embodies a fundamental change in the approach to delivering OS
functionality. Modularity is the key; size is but a side effect. To call any kernel a
microkernel simply because it happens to be small would miss the point entirely.
Since the IPC services provided by the microkernel are used to glue the OS itself
together, the performance and flexibility of those services govern the performance of
the resulting OS. With the exception of those IPC services, a microkernel is roughly
comparable to a realtime executive, both in terms of the services provided and in their
realtime performance.
The microkernel differs from an executive in how the IPC services are used to extend
the functionality of the kernel with additional, service-providing processes. Since the
OS is implemented as a team of cooperating processes managed by the microkernel,
user-written processes can serve both as applications and as processes that extend
the underlying OS functionality for industry-specific applications. The OS itself becomes
open and easily extensible. Moreover, user-written extensions to the OS won't affect
the fundamental reliability of the core OS.
A difficulty for many realtime executives implementing the POSIX 1003.1 standard
is that their runtime environment is typically a single-process, multiple-threaded model,
with unprotected memory between threads. Such an environment is only a subset of
the multi-process model that POSIX assumes; it cannot support the fork() function.
In contrast, QNX Neutrino fully utilizes an MMU to deliver the complete POSIX process
model in a protected environment.
As the following diagrams show, a true microkernel offers complete memory protection,
not only for user applications, but also for OS components (device drivers, filesystems,
etc.):
[Figure: Conventional OS vs. microkernel OS. In a conventional OS, the filesystem, TCP/IP stack, and device drivers run in kernel space alongside the kernel; in a microkernel OS, they run as memory-protected, user-level processes, just like the applications.]
[Figure: The QNX Neutrino microkernel as a software bus. The process manager, the filesystem managers (Power-Safe, UDF, HFS, Flash, NFS, CIFS), the GUI manager, character manager, network manager, mqueue manager, and applications all plug into the microkernel as separate processes.]
A true kernel
The kernel is the heart of any operating system. In some systems, the kernel
comprises so many functions that for all intents and purposes it is the entire operating
system!
But our microkernel is truly a kernel. First of all, like the kernel of a realtime executive,
it's very small. Secondly, it's dedicated to only a few fundamental services:
thread services via POSIX thread-creation primitives
signal services via POSIX signal primitives
message-passing services: the microkernel handles the routing of all messages between all threads throughout the entire system
synchronization services via POSIX thread-synchronization primitives
scheduling services: the microkernel schedules threads for execution using the various POSIX realtime scheduling policies
timer services: the microkernel provides the rich set of POSIX timer services
process management services: the microkernel and the process manager together form a unit (called procnto). The process manager portion is responsible for managing processes, memory, and the pathname space.
Unlike threads, the microkernel itself is never scheduled for execution. The processor
executes code in the microkernel only as the result of an explicit kernel call, an
exception, or in response to a hardware interrupt.
System processes
All OS services, except those provided by the mandatory microkernel/process manager
module (procnto), are handled via standard processes.
A richly configured system could include the following:
filesystem managers
character device managers
native network manager
TCP/IP
Device drivers
Device drivers allow the OS and application programs to make use of the underlying
hardware in a generic way (e.g., a disk drive, a network interface).
While most OSs require device drivers to be tightly bound into the OS itself, device
drivers for QNX Neutrino can be started and stopped as standard processes. As a
result, adding device drivers doesn't affect any other part of the OS; drivers can be
developed and debugged like any other application.
Interprocess communication
When several threads run concurrently, as in typical realtime multitasking environments,
the OS must provide mechanisms to allow them to communicate with each other.
Interprocess communication (IPC) is the key to designing an application as a set of
cooperating processes in which each process handles one well-defined part of the
whole.
The OS provides a simple but powerful set of IPC capabilities that greatly simplify the
job of developing applications made up of cooperating processes. For more information,
see the Interprocess Communication (IPC) (p. 67) chapter.
Single-computer model
QNX Neutrino is designed from the ground up as a network-wide operating system.
In some ways, a native QNX Neutrino network feels more like a mainframe computer
than a set of individual micros. Users are simply aware of a large set of resources
available for use by any application. But unlike a mainframe, QNX Neutrino provides
a highly responsive environment, since the appropriate amount of computing power
can be made available at each node to meet the needs of each user.
In a mission-critical environment, for example, applications that control realtime I/O
devices may require more performance than other, less critical, applications, such as
a web browser. The network is responsive enough to support both types of applications
at the same time: the OS lets you focus computing power on the devices in your hard
realtime system where and when it's needed, without sacrificing concurrent connectivity
to the desktop. Moreover, critical aspects of realtime computing, such as priority
inheritance, function seamlessly across a QNX Neutrino network, regardless of the
physical media employed (switch fabric, serial, etc.).
Flexible networking
QNX Neutrino networks can be put together using various hardware and
industry-standard protocols. Since these are completely transparent to application
programs and users, new network architectures can be introduced at any time without
disturbing the OS.
Each node in the network is assigned a unique name that becomes its identifier. This
name is the only visible means to determine whether the OS is running as a network
or as a standalone operating system.
Chapter 2
The QNX Neutrino Microkernel
The microkernel implements the core POSIX features used in embedded realtime
systems, along with the fundamental QNX Neutrino message-passing services.
The POSIX features that aren't implemented in the procnto microkernel (file and
device I/O, for example) are provided by optional processes and shared libraries.
To determine the release version of the kernel on your system, use the uname
-a command. For more information, see its entry in the Utilities Reference.
Successive microkernels from QNX Software Systems have seen a reduction in the
code required to implement a given kernel call. The object definitions at the lowest
layer in the kernel code have become more specific, allowing greater code reuse (such
as folding various forms of POSIX signals, realtime signals, and QNX Neutrino pulses
into common data structures and code to manipulate those structures).
At its lowest level, the microkernel contains a few fundamental objects and the highly
tuned routines that manipulate them. The OS is built from this foundation.
[Figure: The microkernel's interface and the fundamental objects it's built on: thread, sched, connection, synch, signal, dispatch, message, channel, timer, vector, clock, interrupt, and pulse.]
System services
The microkernel has kernel calls to support the following:
threads
message passing
signals
clocks
timers
interrupt handlers
semaphores
mutual exclusion locks (mutexes)
condition variables (condvars)
barriers
The entire OS is built upon these calls. The OS is fully preemptible, even while passing
messages between processes; it resumes the message pass where it left off before
preemption.
The minimal complexity of the microkernel helps place an upper bound on the longest
nonpreemptible code path through the kernel, while the small code size makes
addressing complex multiprocessor issues a tractable problem. Services were chosen
for inclusion in the microkernel on the basis of having a short execution path.
Operations requiring significant work (e.g., process loading) were assigned to external
processes/threads, where the effort to enter the context of that thread would be
insignificant compared to the work done within the thread to service the request.
Rigorous application of this rule to dividing the functionality between the kernel and
external processes destroys the myth that a microkernel OS must incur higher runtime
overhead than a monolithic kernel OS. Given the work done between context switches
(implicit in a message pass), and the very quick context-switch times that result from
the simplified kernel, the time spent performing context switches becomes lost in
the noise of the work done to service the requests communicated by the message
passing between the processes that make up the OS.
The following diagram shows the preemption details for the non-SMP kernel (x86
implementation).
[Figure: Preemption details for the non-SMP kernel (x86). Interrupts are off only briefly during kernel entry (SYSCALL) and exit (SYSEXIT), each lasting microseconds. Kernel operations, which may include a message pass, run with interrupts on and full preemption for microseconds to milliseconds, followed by a short lockdown (microseconds) with interrupts on but no preemption.]
POSIX call                                        Microkernel call      Description
pthread_create()                                  ThreadCreate()        Create a new thread
pthread_exit()                                    ThreadDestroy()       Destroy a thread
pthread_detach()                                  ThreadDetach()        Detach a thread so it doesn't need to be joined
pthread_join()                                    ThreadJoin()          Join a thread, waiting for its exit status
pthread_cancel()                                  ThreadCancel()        Cancel a thread at the next cancellation point
N/A                                               ThreadCtl()           Change a thread's QNX Neutrino-specific characteristics
pthread_mutex_init()                              SyncTypeCreate()      Create a mutex
pthread_mutex_destroy()                           SyncDestroy()         Destroy a mutex
pthread_mutex_lock()                              SyncMutexLock()       Lock a mutex
pthread_mutex_trylock()                           SyncMutexLock()       Conditionally lock a mutex
pthread_mutex_unlock()                            SyncMutexUnlock()     Unlock a mutex
pthread_cond_init()                               SyncTypeCreate()      Create a condition variable
pthread_cond_destroy()                            SyncDestroy()         Destroy a condition variable
pthread_cond_wait()                               SyncCondvarWait()     Wait on a condition variable
pthread_cond_signal()                             SyncCondvarSignal()   Signal a condition variable
pthread_cond_broadcast()                          SyncCondvarSignal()   Broadcast a condition variable
pthread_getschedparam()                           SchedGet()            Get a thread's scheduling parameters
pthread_setschedparam(), pthread_setschedprio()   SchedSet()            Set a thread's scheduling parameters
pthread_sigmask()                                 SignalProcmask()      Examine or set a thread's signal mask
pthread_kill()                                    SignalKill()          Send a signal to a thread
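For example, a minimal POSIX thread-creation call, which the C library maps onto the ThreadCreate() kernel call listed above (the thread function and its behavior are purely illustrative):

#include <pthread.h>
#include <stdio.h>

/* A trivial thread function; the name and body are illustrative only. */
static void *worker(void *arg)
{
    printf("worker thread running\n");
    return NULL;
}

int main(void)
{
    pthread_t tid;

    /* pthread_create() is implemented on top of the ThreadCreate() kernel call. */
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return 1;

    pthread_join(tid, NULL);   /* maps onto ThreadJoin() */
    return 0;
}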
The OS can be configured to provide a mix of threads and processes (as defined by
POSIX). Each process is MMU-protected from the others, and each process may
contain one or more threads that share the process's address space.
For information about processes and threads from the programming point of view, see
the Processes and Threads chapter of Get Programming with the QNX Neutrino RTOS,
and the Programming Overview and Processes chapters of the QNX Neutrino
Programmer's Guide.
Thread attributes
Although threads within a process share everything within the process's address space,
each thread still has some private data. In some cases, this private data is protected
within the kernel (e.g., the tid or thread ID), while other private data resides unprotected
in the process's address space (e.g., each thread has a stack for its own use). Some
of the more noteworthy thread-private resources are:
tid
Each thread is identified by an integer thread ID, starting at 1. The tid is
unique within the thread's process.
Priority
Each thread has a priority that helps determine when it runs. A thread inherits
its initial priority from its parent, but the priority can change, depending on
the scheduling policy, explicit changes that the thread makes, or messages
sent to the thread.
In the QNX Neutrino RTOS, processes don't have priorities; their
threads do.
For more information, see Thread scheduling (p. 38), later in this chapter.
Name
Starting with the QNX Neutrino Core OS 6.3.2, you can assign a name to a
thread; see the entries for pthread_getname_np() and pthread_setname_np()
in the QNX Neutrino C Library Reference. Utilities such as dumper and
pidin support thread names. Thread names are a QNX Neutrino extension.
Register set
Each thread has its own instruction pointer (IP), stack pointer (SP), and
other processor-specific register context.
Stack
Each thread executes on its own stack, stored within the address space of
its process.
Signal mask
Each thread has its own signal mask.
Thread local storage
A thread has a system-defined data area called thread local storage (TLS).
The TLS is used to store per-thread information (such as tid, pid, stack
base, errno, and thread-specific key/data bindings). The TLS doesn't need
to be accessed directly by a user application. A thread can have user-defined
data associated with a thread-specific data key.
Cancellation handlers
Callback functions that are executed when the thread terminates.
Thread-specific data, implemented in the pthread library and stored in the TLS,
provides a mechanism for associating a process global integer key with a unique
per-thread data value. To use thread-specific data, you first create a new key and then
bind a unique data value to the key (per thread). The data value may, for example, be
an integer or a pointer to a dynamically allocated data structure. Subsequently, the
key can return the bound data value per thread.
A typical application of thread-specific data is for a thread-safe function that needs
to maintain a context for each calling thread.
[Figure: Thread-specific data: each (tid, key) pair maps to a per-thread data value.]
Function                  Description
pthread_key_create()      Create a thread-specific data key
pthread_key_delete()      Delete a thread-specific data key
pthread_setspecific()     Bind a data value to a key for the calling thread
pthread_getspecific()     Return the data value bound to a key for the calling thread
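A minimal sketch of the pattern described above, using a hypothetical key to give each calling thread its own context block:

#include <pthread.h>
#include <stdlib.h>

static pthread_key_t ctx_key;            /* process-global key (hypothetical) */
static pthread_once_t once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    /* free() is run on each thread's value when that thread exits. */
    pthread_key_create(&ctx_key, free);
}

/* A thread-safe function that keeps a separate context for each calling thread. */
void *get_context(void)
{
    void *ctx;

    pthread_once(&once, make_key);
    ctx = pthread_getspecific(ctx_key);
    if (ctx == NULL) {                   /* first call from this thread */
        ctx = calloc(1, 64);
        pthread_setspecific(ctx_key, ctx);
    }
    return ctx;
}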
Figure 8: Possible thread states: CONDVAR, DEAD, INTERRUPT, JOIN, MUTEX, NANOSLEEP, NET_REPLY, NET_SEND, READY, RECEIVE, REPLY, RUNNING, SEM, SEND, SIGSUSPEND, SIGWAITINFO, STACK, STOPPED, WAITCTX, WAITPAGE, and WAITTHREAD. Note that a thread can move from any state (except DEAD) to READY.
Thread scheduling
Part of the kernel's job is to determine which thread runs and when.
First, let's look at when the kernel makes its scheduling decisions.
The execution of a running thread is temporarily suspended whenever the microkernel
is entered as the result of a kernel call, exception, or hardware interrupt. A scheduling
decision is made whenever the execution state of any thread changes; it doesn't
matter which processes the threads might reside within. Threads are scheduled globally
across all processes.
Normally, the execution of the suspended thread will resume, but the thread scheduler
will perform a context switch from one thread to another whenever the running thread:
is blocked
is preempted
yields
When is a thread blocked?
The running thread is blocked when it must wait for some event to occur
(response to an IPC request, wait on a mutex, etc.). The blocked thread is
removed from the running array and the highest-priority ready thread is then
run. When the blocked thread is subsequently unblocked, it's placed on the
end of the ready queue for that priority level.
When is a thread preempted?
The running thread is preempted when a higher-priority thread is placed on
the ready queue (it becomes READY, as the result of its block condition
being resolved). The preempted thread is put at the beginning of the ready
queue for that priority and the higher-priority thread runs.
When is a thread yielded?
The running thread voluntarily yields the processor (sched_yield()) and is
placed on the end of the ready queue for that priority. The highest-priority
thread then runs (which may still be the thread that just yielded).
Scheduling priority
Every thread is assigned a priority. The thread scheduler selects the next thread to
run by looking at the priority assigned to every thread that is READY (i.e., capable of
using the CPU). The thread with the highest priority is selected to run.
The following diagram shows the ready queue for five threads (B-F) that are READY.
Thread A is currently running. All other threads (G-Z) are BLOCKED. Threads A, B, and
C are at the highest priority, so they'll share the processor based on the running thread's
scheduling policy.
[Figure: The ready queue. Thread A (priority 10) is running; threads B and C are READY at priority 10, thread D at priority 5, and thread F at priority 0 (idle). Blocked threads (G-Z) aren't on the ready queue. Priorities range from 0 (idle) to 255.]
In QNX Neutrino 6.6 or later, you can append an s or S to this option if you want
out-of-range priority requests by default to saturate at the maximum allowed value
instead of resulting in an error. When you're setting a priority, you can wrap it in one of
these (non-POSIX) macros to specify how to handle out-of-range priority requests (see
the sketch after this list):
SCHED_PRIO_LIMIT_ERROR(priority): indicate an error
SCHED_PRIO_LIMIT_SATURATE(priority): saturate at the maximum allowed priority
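For instance, a request for an out-of-range priority can be made to saturate rather than fail. This is only a sketch; it assumes the macro simply wraps the priority value you pass to the standard scheduling calls, and the priority 200 is arbitrary:

#include <pthread.h>
#include <sched.h>

void boost_self(void)
{
    struct sched_param param;
    int policy;

    pthread_getschedparam(pthread_self(), &policy, &param);

    /* Ask for priority 200; if that's above the allowed maximum,
       saturate at the maximum instead of getting an error. */
    param.sched_priority = SCHED_PRIO_LIMIT_SATURATE(200);
    pthread_setschedparam(pthread_self(), policy, &param);
}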
Here's a summary of the ranges:
Priority level            Owner
0                         Idle thread
1 through priority - 1    Unprivileged or privileged
priority through 255      Privileged
Scheduling policies
To meet the needs of various applications, the QNX Neutrino RTOS provides these
scheduling algorithms:
FIFO scheduling
round-robin scheduling
sporadic scheduling
Each thread in the system may run using any method. The methods are effective on
a per-thread basis, not on a global basis for all threads and processes on a node.
Remember that the FIFO and round-robin scheduling policies apply only when two or
more threads that share the same priority are READY (i.e., the threads are directly
competing with each other). The sporadic method, however, employs a budget for
a thread's execution. In all cases, if a higher-priority thread becomes READY, it
immediately preempts all lower-priority threads.
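For example, a thread can be created with an explicit policy and priority rather than inheriting them from its parent (a minimal sketch; the priority value 10 is arbitrary):

#include <pthread.h>
#include <sched.h>

int start_rr_thread(void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    struct sched_param param = { .sched_priority = 10 };
    pthread_t tid;

    pthread_attr_init(&attr);
    /* Use the attribute's policy/priority instead of inheriting the parent's. */
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_RR);   /* round-robin scheduling */
    pthread_attr_setschedparam(&attr, &param);

    return pthread_create(&tid, &attr, fn, arg);
}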
In the following diagram, three threads of equal priority are READY. If Thread A blocks,
Thread B will run.
FIFO scheduling
In FIFO scheduling, a thread selected to run continues executing until it:
voluntarily relinquishes control (e.g., it blocks)
is preempted by a higher-priority thread
Round-robin scheduling
In round-robin scheduling, a thread selected to run continues executing until it:
voluntarily relinquishes control
is preempted by a higher-priority thread
consumes its timeslice
As the following diagram shows, Thread A ran until it consumed its timeslice; the next
READY thread (Thread B) now runs:
Sporadic scheduling
The sporadic scheduling policy is generally used to provide a capped limit on the
execution time of a thread within a given period of time.
This behavior is essential when Rate Monotonic Analysis (RMA) is being performed
on a system that services both periodic and aperiodic events. Essentially, this algorithm
allows a thread to service aperiodic events without jeopardizing the hard deadlines of
other threads or processes in the system.
As in FIFO scheduling, a thread using sporadic scheduling continues executing until
it blocks or is preempted by a higher-priority thread. And as in adaptive scheduling,
a thread using sporadic scheduling will drop in priority, but with sporadic scheduling
you have much more precise control over the thread's behavior.
Under sporadic scheduling, a thread's priority can oscillate dynamically between a
foreground or normal priority and a background or low priority. Using the following
parameters, you can control the conditions of this sporadic shift:
Initial budget (C)
The amount of time a thread is allowed to execute at its normal priority (N)
before being dropped to its low priority (L).
Low priority (L)
The priority level to which the thread will drop. The thread executes at this
lower priority (L) while in the background, and runs at normal priority (N)
while in the foreground.
Replenishment period (T)
The period of time during which a thread is allowed to consume its execution
budget. To schedule replenishment operations, the POSIX implementation
also uses this value as the offset from the time the thread becomes READY.
Max number of pending replenishments
This value limits the number of replenishment operations that can take
place, thereby bounding the amount of system overhead consumed by the
sporadic scheduling policy.
In a poorly configured system, a thread's execution budget may
become eroded because of too much blocking, i.e., it won't receive
enough replenishments.
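These parameters map onto the POSIX sporadic-server fields of struct sched_param. A sketch, assuming the standard POSIX field names; the priorities, budget, and period values are arbitrary:

#include <pthread.h>
#include <sched.h>

void make_attr_sporadic(pthread_attr_t *attr)
{
    struct sched_param param = { 0 };

    param.sched_priority        = 20;             /* normal (foreground) priority N */
    param.sched_ss_low_priority = 10;             /* low (background) priority L */
    param.sched_ss_init_budget.tv_sec  = 0;       /* initial budget C: 10 ms */
    param.sched_ss_init_budget.tv_nsec = 10000000;
    param.sched_ss_repl_period.tv_sec  = 0;       /* replenishment period T: 40 ms */
    param.sched_ss_repl_period.tv_nsec = 40000000;
    param.sched_ss_max_repl = 5;                  /* max pending replenishments */

    pthread_attr_init(attr);
    pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(attr, SCHED_SPORADIC);
    pthread_attr_setschedparam(attr, &param);
}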
As the following diagram shows, the sporadic scheduling policy establishes a thread's
initial execution budget (C), which is consumed by the thread as it runs and is
replenished periodically (for the amount T). When a thread blocks, the amount of the
execution budget that's been consumed (R) is arranged to be replenished at some
later time (e.g., at 40 msec) after the thread first became ready to run.
[Figure: A thread's budget under sporadic scheduling. The initial budget (C) is consumed while the thread runs at its normal priority (N) and is replenished every period T (here, every 40 msec); the amount consumed before blocking (R) is replenished at a later point. While its budget is exhausted, the thread drops to its low priority (L), where it may or may not run.]
IPC issues
Since all the threads in a process have unhindered access to the shared data space,
wouldn't this execution model trivially solve all of our IPC problems? Can't we just
communicate the data through shared memory and dispense with any other execution
models and IPC mechanisms?
If only it were that simple!
One issue is that the access of individual threads to common data must be
synchronized. Having one thread read inconsistent data because another thread is
part way through modifying it is a recipe for disaster. For example, if one thread is
updating a linked list, no other threads can be allowed to traverse or modify the list
until the first thread has finished. A code passage that must execute serially (i.e.,
by only one thread at a time) in this manner is termed a critical section. The program
would fail (intermittently, depending on how frequently a collision occurred) with
irreparably damaged links unless some synchronization mechanism ensured serial
access.
Mutexes, semaphores, and condvars are examples of synchronization tools that can
be used to address this problem. These tools are described later in this section.
Although synchronization services can be used to allow threads to cooperate, shared
memory per se can't address a number of IPC issues. For example, although threads
can communicate through the common data space, this works only if all the threads
communicating are within a single process. What if our application needs to
communicate a query to a database server? We need to pass the details of our query
to the database server, but the thread we need to communicate with lies within a
database server process and the address space of that server isn't addressable to us.
The OS takes care of the network-distributed IPC issue because the one
interface, message passing, operates in both the local and network-remote cases,
and can be used to access all OS services. Since messages can be exactly sized, and
since most messages tend to be quite tiny (e.g., the error status on a write request,
or a tiny read request), the data moved around the network can be far less with message
Synchronization services
The QNX Neutrino RTOS provides the POSIX-standard thread-level synchronization
primitives, some of which are useful even between threads in different processes.
The synchronization services include at least the following:
Synchronization service    Supported between processes    Supported across a QNX Neutrino LAN
Mutexes                    Yes                            No
Condvars                   No                             No
Barriers                   No                             No
Sleepon locks              No                             No
Reader/writer locks        No                             No
Semaphores                 Yes                            Yes (named only)
FIFO scheduling            Yes                            No
Send/Receive/Reply         Yes                            Yes
Atomic operations          Yes                            No
The above synchronization primitives are implemented directly by the kernel, except
for:
barriers, sleepon locks, and reader/writer locks (which are built from mutexes and
condvars)
atomic operations (which are either implemented directly by the processor or
emulated in the kernel)
You should allocate mutexes, condvars, barriers, reader/writer locks, and
semaphores, as well as objects you plan to use atomic operations on, only in
normal memory mappings. On certain processors, atomic operations and calls
such as pthread_mutex_lock() will cause a fault if the object is allocated in
uncached memory.
You can also modify the attributes of the mutex (using pthread_mutexattr_settype())
to allow a mutex to be recursively locked by the same thread. This can be useful to
allow a thread to call a routine that might attempt to lock a mutex that the thread
already happens to have locked.
In this code sample, the mutex is acquired before the condition is tested. This ensures
that only this thread has access to the arbitrary condition being examined. While the
condition is true, the code sample will block on the wait call until some other thread
performs a signal or broadcast on the condvar.
The while loop is required for two reasons. First of all, POSIX cannot guarantee that
false wakeups will not occur (e.g., multiprocessor systems). Second, when another
thread has made a modification to the condition, we need to retest to ensure that the
modification matches our criteria. The associated mutex is unlocked atomically by
pthread_cond_wait() when the waiting thread is blocked to allow another thread to
enter the critical section.
A thread that performs a signal will unblock the highest-priority thread queued on the
condvar, while a broadcast will unblock all threads queued on the condvar. The
associated mutex is locked atomically by the highest-priority unblocked thread; the
thread must then unlock the mutex after proceeding through the critical section.
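The code sample referred to above isn't reproduced here; the pattern it describes looks roughly like this (the mutex, condvar, and condition flag are hypothetical):

#include <pthread.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
volatile int    data_ready;

void consumer(void)
{
    pthread_mutex_lock(&mutex);
    while (!data_ready) {
        /* Atomically releases the mutex and blocks; reacquires it on wakeup. */
        pthread_cond_wait(&cond, &mutex);
    }
    /* ... critical section: consume the data ... */
    data_ready = 0;
    pthread_mutex_unlock(&mutex);
}

void producer(void)
{
    pthread_mutex_lock(&mutex);
    data_ready = 1;
    pthread_cond_signal(&cond);   /* wake the highest-priority waiter */
    pthread_mutex_unlock(&mutex);
}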
Barriers
A barrier is a synchronization mechanism that lets you corral several cooperating
threads (e.g., in a matrix computation), forcing them to wait at a specific point until
all have finished before any one thread can continue.
Unlike the pthread_join() function, where you'd wait for the threads to terminate, in
the case of a barrier you're waiting for the threads to rendezvous at a certain point.
When the specified number of threads arrive at the barrier, we unblock all of them so
they can continue to run.
You first create a barrier with pthread_barrier_init():
#include <pthread.h>
int
pthread_barrier_init (pthread_barrier_t *barrier,
const pthread_barrierattr_t *attr,
unsigned int count);
This creates a barrier object at the passed address (a pointer to the barrier object is
in barrier), with the attributes as specified by attr. The count member holds the number
of threads that must call pthread_barrier_wait().
Once the barrier is created, each thread will call pthread_barrier_wait() to indicate
that it has completed:
#include <pthread.h>
int pthread_barrier_wait (pthread_barrier_t *barrier);
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>
#include <sys/neutrino.h>

pthread_barrier_t barrier; // the barrier synchronization object

void *
thread1 (void *not_used)
{
    time_t now;

    time (&now);
    printf ("thread1 starting at %s", ctime (&now));

    // do the computation
    // let's just do a sleep here...
    sleep (20);
    pthread_barrier_wait (&barrier);
    // after this point, all three threads have completed.
    time (&now);
    printf ("barrier in thread1() done at %s", ctime (&now));
    return NULL;
}

void *
thread2 (void *not_used)
{
    time_t now;

    time (&now);
    printf ("thread2 starting at %s", ctime (&now));

    // do the computation
    // let's just do a sleep here...
    sleep (40);
    pthread_barrier_wait (&barrier);
    // after this point, all three threads have completed.
    time (&now);
    printf ("barrier in thread2() done at %s", ctime (&now));
    return NULL;
}

int main () // ignore arguments
{
    time_t now;

    // create a barrier object with a count of 3
    pthread_barrier_init (&barrier, NULL, 3);

    // start up two threads, thread1 and thread2
    pthread_create (NULL, NULL, thread1, NULL);
    pthread_create (NULL, NULL, thread2, NULL);

    // at this point, thread1 and thread2 are running

    // now wait for completion
    time (&now);
    printf ("main() waiting for barrier at %s", ctime (&now));
    pthread_barrier_wait (&barrier);

    // after this point, all three threads have completed.
    time (&now);
    printf ("barrier in main() done at %s", ctime (&now));
    pthread_exit( NULL );
    return (EXIT_SUCCESS);
}
The main thread created the barrier object and initialized it with a count of the total
number of threads that must be synchronized to the barrier before the threads may
carry on. In the example above, we used a count of 3: one for the main() thread, one
for thread1(), and one for thread2().
Then we start thread1() and thread2(). To simplify this example, we have the threads
sleep to cause a delay, as if computations were occurring. To synchronize, the main thread blocks on the barrier as well; all three threads are released only after thread1() and thread2() have also called pthread_barrier_wait().
Function                            Description
pthread_barrierattr_getpshared()    Get the value of a barrier's process-shared attribute
pthread_barrierattr_destroy()       Destroy a barrier's attributes object
pthread_barrierattr_init()          Initialize a barrier's attributes object
pthread_barrierattr_setpshared()    Set the value of a barrier's process-shared attribute
pthread_barrier_destroy()           Destroy a barrier
pthread_barrier_init()              Initialize a barrier
pthread_barrier_wait()              Synchronize participating threads at the barrier
Sleepon locks
Sleepon locks are very similar to condvars, with a few subtle differences.
Like condvars, sleepon locks (pthread_sleepon_lock()) can be used to block until a
condition becomes true (like a memory location changing value). But unlike condvars,
which must be allocated for each condition to be checked, sleepon locks multiplex
their functionality over a single mutex and dynamically allocated condvar, regardless
of the number of conditions being checked. The maximum number of condvars ends
up being equal to the maximum number of blocked threads. These locks are patterned
after the sleepon locks commonly used within the UNIX kernel.
Reader/writer locks
More formally known as Multiple readers, single writer locks, these locks are used
when the access pattern for a data structure consists of many threads reading the
data, and (at most) one thread writing the data. These locks are more expensive than
mutexes, but can be useful for this data access pattern.
This lock works by allowing all the threads that request a read-access lock
(pthread_rwlock_rdlock()) to succeed in their request. But when a thread wishing to
write asks for the lock (pthread_rwlock_wrlock()), the request is denied until all the
current reading threads release their reading locks (pthread_rwlock_unlock()).
Multiple writing threads can queue (in priority order) waiting for their chance to write
the protected data structure, and all the blocked writer-threads will get to run before
reading threads are allowed access again. The priorities of the reading threads are not
considered.
There are also calls (pthread_rwlock_tryrdlock() and pthread_rwlock_trywrlock()) to
allow a thread to test the attempt to achieve the requested lock, without blocking.
These calls return with a successful lock or a status indicating that the lock couldn't
be granted immediately.
Reader/writer locks aren't implemented directly within the kernel, but are instead built
from the mutex and condvar services provided by the kernel.
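A minimal sketch of this access pattern (the protected table and its lock are hypothetical):

#include <pthread.h>

static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;
static int table[64];

int reader(int i)
{
    int value;

    pthread_rwlock_rdlock(&table_lock);    /* many readers may hold this at once */
    value = table[i];
    pthread_rwlock_unlock(&table_lock);
    return value;
}

void writer(int i, int value)
{
    pthread_rwlock_wrlock(&table_lock);    /* waits until all readers release */
    table[i] = value;
    pthread_rwlock_unlock(&table_lock);
}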
Semaphores
Semaphores are another common form of synchronization that allows threads to post
and wait on a semaphore to control when threads wake or sleep.
The post (sem_post()) operation increments the semaphore; the wait (sem_wait())
operation decrements it.
If you wait on a semaphore that is positive, you will not block. Waiting on a nonpositive
semaphore will block until some other thread executes a post. It is valid to post one
or more times before a wait. This use will allow one or more threads to execute the
wait without blocking.
A significant difference between semaphores and other synchronization primitives is
that semaphores are async safe and can be manipulated by signal handlers. If the
desired effect is to have a signal handler wake a thread, semaphores are the right
choice.
Note that in general, mutexes are much faster than semaphores, which always
require a kernel entry. Semaphores don't affect a thread's effective priority; if
you need priority inheritance, use a mutex. For more information, see Mutexes:
mutual exclusion locks (p. 48), earlier in this chapter.
Another useful property of semaphores is that they were defined to operate between
processes. Although our mutexes work between processes, the POSIX thread standard
considers this an optional capability and as such may not be portable across systems.
For synchronization between threads in a single process, mutexes will be more efficient
than semaphores.
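A minimal sketch of the post/wait pattern between two threads in one process (the queue handling itself is omitted):

#include <semaphore.h>

static sem_t items;           /* counts items available to the consumer */

void init(void)
{
    sem_init(&items, 0, 0);   /* 0 = not shared between processes, initial value 0 */
}

void producer_put(void)
{
    /* ... add an item to a queue ... */
    sem_post(&items);         /* increment; wakes a waiting consumer if any */
}

void consumer_get(void)
{
    sem_wait(&items);         /* decrement; blocks while the count is zero */
    /* ... remove an item from the queue ... */
}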
As a useful variation, a named semaphore service is also available. It lets you use
semaphores between processes on different machines connected by a network.
Note that named semaphores are slower than the unnamed
variety.
The OS also provides atomic operations for:
adding a value
subtracting a value
clearing bits
setting bits
toggling (complementing) bits
These atomic operations are available by including the C header file <atomic.h>.
Although you can use these atomic operations just about anywhere, you'll find them
particularly useful in these two cases:
between an ISR and a thread
between two threads (SMP or single-processor)
Since an ISR can preempt a thread at any given point, the only way that the thread
would be able to protect itself would be to disable interrupts. Since you should avoid
disabling interrupts in a realtime system, we recommend that you use the atomic
operations provided with QNX Neutrino.
On an SMP system, multiple threads can and do run concurrently. Again, we run into
the same situation as with interrupts aboveyou should use the atomic operations
where applicable to eliminate the need to disable and reenable interrupts.
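For example, a counter shared between an ISR and a thread can be updated without disabling interrupts. This sketch assumes the <atomic.h> functions named below and a hypothetical pair of shared variables:

#include <atomic.h>

volatile unsigned event_count;          /* shared between an ISR and a thread */
volatile unsigned status_flags;

void on_event(void)                     /* could be called from an ISR */
{
    atomic_add(&event_count, 1);        /* add a value atomically */
    atomic_set(&status_flags, 0x01);    /* set bits atomically */
}

void thread_side(void)
{
    atomic_clr(&status_flags, 0x01);    /* clear bits atomically */
    atomic_toggle(&status_flags, 0x02); /* complement bits atomically */
}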
Microkernel call    POSIX call                                                        Description
SyncTypeCreate()    pthread_mutex_init(), pthread_cond_init(), sem_init()            Create a mutex, condvar, or semaphore
SyncDestroy()       pthread_mutex_destroy(), pthread_cond_destroy(), sem_destroy()   Destroy a synchronization object
SyncCondvarWait()   pthread_cond_wait(), pthread_cond_timedwait()                    Block on a condvar
SyncMutexLock()     pthread_mutex_lock(), pthread_mutex_trylock()                    Lock a mutex
SyncMutexUnlock()   pthread_mutex_unlock()                                           Unlock a mutex
SyncSemPost()       sem_post()                                                       Post a semaphore
SyncSemWait()       sem_wait(), sem_trywait()                                        Wait on a semaphore
The ClockTime() kernel call allows you to get or set the system clock specified by an
ID (CLOCK_REALTIME), which maintains the system time. Once set, the system time
increments by some number of nanoseconds based on the resolution of the system
clock. This resolution can be queried or changed using the ClockPeriod() call.
Within the system page, an in-memory data structure, there's a 64-bit field (nsec)
that holds the number of nanoseconds since the system was booted. The nsec field
is always monotonically increasing and is never affected by setting the current time
of day via ClockTime() or ClockAdjust().
The ClockCycles() function returns the current value of a free-running 64-bit cycle
counter. This is implemented on each processor as a high-performance mechanism
for timing short intervals. For example, on Intel x86 processors, an opcode that reads
the processor's time-stamp counter is used. On a Pentium processor, this counter
increments on each clock cycle. A 100 MHz Pentium would have a cycle time of
1/100,000,000 seconds (10 nanoseconds). Other CPU architectures have similar
instructions.
On processors that don't implement such an instruction in hardware, the kernel will
emulate one. This will provide a lower time resolution than if the instruction is provided
(838.095345 nanoseconds on an IBM PC-compatible system).
In all cases, the SYSPAGE_ENTRY(qtime)->cycles_per_sec field gives the
number of ClockCycles() increments in one second.
The ClockPeriod() function allows a thread to set the system timer to some multiple
of nanoseconds; the OS kernel will do the best it can to satisfy the precision of the
request with the hardware available.
The interval selected is always rounded down to an integral multiple of the precision of the
underlying hardware timer. Of course, setting it to an extremely low value can result
in a significant portion of CPU performance being consumed servicing timer interrupts.
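For example, the clock's resolution and the current time can be queried through the POSIX covers for these kernel calls:

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec res, now;

    clock_getres(CLOCK_REALTIME, &res);    /* backed by ClockPeriod() */
    clock_gettime(CLOCK_REALTIME, &now);   /* backed by ClockTime() */

    printf("clock resolution: %ld ns\n", res.tv_nsec);
    printf("time of day: %lld.%09ld\n", (long long)now.tv_sec, now.tv_nsec);
    return 0;
}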
Microkernel call   POSIX call                                        Description
ClockTime()        clock_gettime(), clock_settime()                  Get or set the time of day
ClockAdjust()      N/A                                               Apply gradual corrections to the system clock
ClockCycles()      N/A                                               Read a 64-bit free-running high-precision counter
ClockPeriod()      clock_getres()                                    Get or set the period of the system clock
ClockId()          clock_getcpuclockid(), pthread_getcpuclockid()    Return an integer that's passed to ClockTime() as a clockid_t
The kernel can run in a tickless mode in order to reduce power consumption, but this
is a bit of a misnomer. The system still has clock ticks, and everything runs as normal
unless the system is idle. Only when the system goes completely idle does the kernel
turn off clock ticks, and in reality what it does is slow down the clock so that the next
tick interrupt occurs just after the next active timer is to fire, so that the timer will
fire immediately. To enable tickless operation, specify the -Z option for the startup-*
code.
Time correction
In order to facilitate applying time corrections without having the system experience
abrupt steps in time (or even having time jump backwards), the ClockAdjust() call
provides the option to specify an interval over which the time correction is to be applied.
This has the effect of speeding or retarding time over a specified interval until the
system has synchronized to the indicated current time. This service can be used to
implement network-coordinated time averaging between multiple nodes on a network.
Timers
The QNX Neutrino RTOS directly provides the full set of POSIX timer functionality.
Since these timers are quick to create and manipulate, they're an inexpensive resource
in the kernel.
The POSIX timer model is quite rich, providing the ability to have the timer expire on:
an absolute date
a relative date (i.e., n nanoseconds from now)
cyclical (i.e., every n nanoseconds)
The cyclical mode is very significant, because the most common use of timers tends
to be as a periodic source of events to kick a thread into life to do some processing
and then go back to sleep until the next event. If the thread had to re-program the
timer for every event, there would be the danger that time would slip unless the thread
was programming an absolute date. Worse, if the thread doesn't get to run on the timer
Our solution is a form of timeout request atomic to the service request itself. One
approach might have been to provide an optional timeout parameter on every available
service request, but this would overly complicate service requests with a passed
parameter that would often go unused.
QNX Neutrino provides a TimerTimeout() kernel call that allows an application to
specify a list of blocking states for which to start a specified timeout. Later, when the
application makes a request of the kernel, the kernel will atomically enable the
previously configured timeout if the application is about to block on one of the specified
states.
Since the OS has a very small number of blocking states, this mechanism works very
concisely. At the conclusion of either the service request or the timeout, the timer will
be disabled and control will be given back to the application.
TimerTimeout(...);
...
blocking_call();   // timer atomically armed within the kernel if this call blocks
...
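Fleshed out slightly, a one-second timeout can be armed for the send- and reply-blocked states of a subsequent MsgSend(). This is a sketch: the connection ID and message buffers are hypothetical, and it assumes flags that name the SEND and REPLY blocking states:

#include <stdint.h>
#include <sys/neutrino.h>

int send_with_timeout(int coid, void *msg, int msize, void *reply, int rsize)
{
    uint64_t timeout = 1000000000ULL;   /* 1 second, in nanoseconds */

    /* Arm a timeout that takes effect only if the following kernel call
       blocks in the SEND- or REPLY-blocked state. */
    TimerTimeout(CLOCK_MONOTONIC, _NTO_TIMEOUT_SEND | _NTO_TIMEOUT_REPLY,
                 NULL, &timeout, NULL);

    /* If the server doesn't receive and reply within 1 s, this returns -1
       with errno set to ETIMEDOUT. */
    return MsgSend(coid, msg, msize, reply, rsize);
}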
Microkernel call   POSIX call                                          Description
TimerAlarm()       alarm()                                             Set a process alarm
TimerCreate()      timer_create()                                      Create an interval timer
TimerDestroy()     timer_delete()                                      Destroy an interval timer
TimerInfo()        timer_gettime()                                     Get the time remaining on an interval timer
TimerInfo()        timer_getoverrun()                                  Get the number of overruns on an interval timer
TimerSettime()     timer_settime()                                     Start an interval timer
TimerTimeout()     pthread_cond_timedwait(), pthread_mutex_trylock()   Arm a kernel timeout for a blocking state
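For example, a cyclical timer that fires every 50 ms can be built on the timer_create()/timer_settime() covers shown above; with a NULL sigevent, each expiry is delivered to the process as a SIGALRM signal:

#include <time.h>

int start_periodic_timer(void)
{
    timer_t id;
    struct itimerspec its;

    /* NULL sigevent: expiry is delivered as SIGALRM to the process. */
    if (timer_create(CLOCK_REALTIME, NULL, &id) == -1)
        return -1;

    its.it_value.tv_sec     = 0;            /* first expiry in 50 ms */
    its.it_value.tv_nsec    = 50000000;
    its.it_interval.tv_sec  = 0;            /* then every 50 ms (cyclical mode) */
    its.it_interval.tv_nsec = 50000000;

    return timer_settime(id, 0, &its, NULL);
}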
For more information, see the Clocks, Timers, and Getting a Kick Every So Often
chapter of Get Programming with the QNX Neutrino RTOS.
Interrupt handling
No matter how much we wish it were so, computers are not infinitely fast. In a realtime
system, it's absolutely crucial that CPU cycles aren't unnecessarily spent. It's also
crucial to minimize the time from the occurrence of an external event to the actual
execution of code within the thread responsible for reacting to that event. This time
is referred to as latency.
The two forms of latency that most concern us are interrupt latency and scheduling
latency.
Latency times can vary significantly, depending on the speed of the processor
and other factors. For more information, visit our website (www.qnx.com).
Interrupt latency
Interrupt latency is the time from the assertion of a hardware interrupt until the first
instruction of the device driver's interrupt handler is executed.
The OS leaves interrupts fully enabled almost all the time, so that interrupt latency
is typically insignificant. But certain critical sections of code do require that interrupts
be temporarily disabled. The maximum such disable time usually defines the worst-case
interrupt latency; in QNX Neutrino this is very small.
The following diagrams illustrate the case where a hardware interrupt is processed by
an established interrupt handler. The interrupt handler either will simply return, or it
will return and cause an event to be delivered.
[Figure: Interrupt handler simply terminates. The interrupt occurs; after the interrupt latency (Til) the handler runs for Tint and returns (Tiret); the interrupted process then continues execution.]
Scheduling latency
In some cases, the low-level hardware interrupt handler must schedule a higher-level
thread to run. In this scenario, the interrupt handler will return and indicate that an
event is to be delivered. This introduces a second form of latency, scheduling latency, which must be accounted for.
Scheduling latency is the time between the last instruction of the user's interrupt
handler and the execution of the first instruction of a driver thread. This usually means
the time it takes to save the context of the currently executing thread and restore the
context of the required driver thread. Although larger than interrupt latency, this time
is also kept small in a QNX Neutrino system.
[Figure: Interrupt handler terminates, returning an event. After the interrupt latency (Til) the handler runs for Tint and finishes by triggering a sigevent; after the scheduling latency (Tsl) the driver thread runs.]
Nested interrupts
The QNX Neutrino RTOS fully supports nested interrupts.
The previous scenarios describe the simplestand most commonsituation where
only one interrupt occurs. Worst-case timing considerations for unmasked interrupts
must take into account the time for all interrupts currently being processed, because
a higher priority, unmasked interrupt will preempt an existing interrupt.
In the following diagram, Thread A is running. Interrupt IRQx causes interrupt handler
Intx to run, which is preempted by IRQy and its handler Inty. Inty returns an event
causing Thread B to run; Intx returns an event causing Thread C to run.
Interrupt calls
The interrupt-handling API includes the following kernel calls:
Function                  Description
InterruptAttach()         Attach a local function (an ISR) to an interrupt vector
InterruptAttachEvent()    Attach an event to an interrupt vector
InterruptDetach()         Detach from an interrupt, using the ID returned by InterruptAttach() or InterruptAttachEvent()
InterruptWait()           Wait for an interrupt
InterruptEnable()         Enable hardware interrupts
InterruptDisable()        Disable hardware interrupts
InterruptMask()           Mask a hardware interrupt
InterruptUnmask()         Unmask a hardware interrupt
InterruptLock()           Guard a critical section between an ISR and a thread
InterruptUnlock()         Remove the guard from a critical section
Using this API, a suitably privileged user-level thread can call InterruptAttach() or
InterruptAttachEvent(), passing a hardware interrupt number and the address of a
function in the thread's address space to be called when the interrupt occurs. QNX
Neutrino allows multiple ISRs to be attached to each hardware interrupt
number; unmasked interrupts can be serviced during the execution of running interrupt
handlers.
The startup code is responsible for making sure that all interrupt sources
are masked during system initialization. When the first call to
InterruptAttach() or InterruptAttachEvent() is done for an interrupt vector,
the kernel unmasks it. Similarly, when the last InterruptDetach() is done
for an interrupt vector, the kernel remasks the level.
For more information on InterruptLock() and InterruptUnlock(), see Critical
sections (p. 120) in the chapter on Multicore Processing in this guide.
It isn't safe to use floating-point operations in Interrupt Service Routines.
The following code sample shows how to attach an ISR to the hardware timer interrupt
on the PC (which the OS also uses for the system clock). Since the kernel's timer ISR
is already dealing with clearing the source of the interrupt, this ISR can simply
increment a counter variable in the thread's data space and return to the kernel:
#include <stdio.h>
#include <sys/neutrino.h>
#include <sys/syspage.h>

struct sigevent event;
volatile unsigned counter;

const struct sigevent *handler( void *area, int id ) {
    // Wake up the thread every 100th interrupt
    if ( ++counter == 100 ) {
        counter = 0;
        return( &event );
    }
    else
        return( NULL );
}

int main() {
    int i;
    int id;

    // Request I/O privileges
    ThreadCtl( _NTO_TCTL_IO, 0 );

    // Initialize event structure
    event.sigev_notify = SIGEV_INTR;

    // Attach ISR vector
    id=InterruptAttach( SYSPAGE_ENTRY(qtime)->intr, &handler,
                        NULL, 0, 0 );

    for( i = 0; i < 10; ++i ) {
        // Wait for ISR to wake us up
        InterruptWait( 0, NULL );
        printf( "100 events\n" );
    }

    // Disconnect the ISR handler
    InterruptDetach(id);
    return 0;
}
With this approach, appropriately privileged user-level threads can dynamically attach
(and detach) interrupt handlers to (and from) hardware interrupt vectors at run time.
These threads can be debugged using regular source-level debug tools; the ISR itself
can be debugged by calling it at the thread level and source-level stepping through it
or by using the InterruptAttachEvent() call.
When the hardware interrupt occurs, the processor will enter the interrupt redirector
in the microkernel. This code pushes the registers for the context of the currently
running thread into the appropriate thread table entry and sets the processor context
such that the ISR has access to the code and data that are part of the thread the ISR
is contained within. This allows the ISR to use the buffers and code in the user-level
thread to resolve the interrupt and, if higher-level work by the thread is required, to
queue an event to the thread the ISR is part of, which can then work on the data the
ISR has placed into thread-owned buffers.
Since it runs with the memory-mapping of the thread containing it, the ISR can directly
manipulate devices mapped into the thread's address space, or directly perform I/O
instructions. As a result, device drivers that manipulate hardware don't need to be
linked into the kernel.
The interrupt redirector code in the microkernel will call each ISR attached to that
hardware interrupt. If the value returned indicates that a process is to be passed an
event of some sort, the kernel will queue the event. When the last ISR has been called
for that vector, the kernel interrupt handler will finish manipulating the interrupt
control hardware and then return from interrupt.
This interrupt return won't necessarily be into the context of the thread that was
interrupted. If the queued event caused a higher-priority thread to become READY,
the microkernel will then interrupt-return into the context of the now-READY thread
instead.
This approach provides a well-bounded interval from the occurrence of the interrupt
to the execution of the first instruction of the user-level ISR (measured as interrupt
latency), and from the last instruction of the ISR to the first instruction of the thread
readied by the ISR (measured as thread or process scheduling latency).
The worst-case interrupt latency is well-bounded, because the OS disables interrupts
only for a couple opcodes in a few critical regions. Those intervals when interrupts are
disabled have deterministic runtimes, because they're not data dependent.
The microkernel's interrupt redirector executes only a few instructions before calling
the user's ISR. As a result, process preemption for hardware interrupts or kernel calls
is equally quick and exercises essentially the same code path.
While the ISR is executing, it has full hardware access (since it's part of a privileged
thread), but can't issue other kernel calls. The ISR is intended to respond to the
hardware interrupt in as few microseconds as possible, do the minimum amount of
work to satisfy the interrupt (read the byte from the UART, etc.), and if necessary,
cause a thread to be scheduled at some user-specified priority to do further work.
Worst-case interrupt latency is directly computable for a given hardware priority from
the kernel-imposed interrupt latency and the maximum ISR runtime for each interrupt
higher in hardware priority than the ISR in question. Since hardware interrupt priorities
can be reassigned, the most important interrupt in the system can be made the highest
priority.
Note also that by using the InterruptAttachEvent() call, no user ISR is run. Instead,
a user-specified event is generated on each and every interrupt; the event will typically
cause a waiting thread to be scheduled to run and do the work. The interrupt is
automatically masked when the event is generated and then explicitly unmasked by
the thread that handles the device at the appropriate time.
Both InterruptMask() and InterruptUnmask() are counting functions. For
example, if InterruptMask() is called ten times, then InterruptUnmask() must
also be called ten times.
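A minimal sketch of this event-driven style (the interrupt number is hypothetical and error handling is omitted):

#include <sys/neutrino.h>

#define MY_IRQ  5   /* hypothetical interrupt number */

void handle_interrupts(void)
{
    struct sigevent event;
    int id;

    ThreadCtl(_NTO_TCTL_IO, 0);          /* request I/O privileges */

    event.sigev_notify = SIGEV_INTR;     /* wake us via InterruptWait() */
    id = InterruptAttachEvent(MY_IRQ, &event, 0);

    for (;;) {
        InterruptWait(0, NULL);          /* block until the event is delivered */
        /* ... service the device; the interrupt is masked at this point ... */
        InterruptUnmask(MY_IRQ, id);     /* unmask so the device can interrupt again */
    }
}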
Thus the priority of the work generated by hardware interrupts can be performed at
OS-scheduled priorities rather than hardware-defined priorities. Since the interrupt
source won't re-interrupt until serviced, the effect of interrupts on the runtime of
critical code regions for hard-deadline scheduling can be controlled.
In addition to hardware interrupts, various events within the microkernel can also
be hooked by user processes and threads. When one of these events occurs, the
kernel can upcall into the indicated function in the user thread to perform some
specific processing for this event. For example, whenever the idle thread in the system
is called, a user thread can have the kernel upcall into the thread so that
hardware-specific low-power modes can be readily implemented.
Microkernel call        Description
InterruptHookIdle()     Attach a function that the kernel calls whenever the system goes idle (e.g., to implement low-power modes)
InterruptHookTrace()    Attach a function that receives trace events from the instrumented kernel
For more information about interrupts, see the Interrupts chapter of Get Programming
with the QNX Neutrino RTOS, and the Writing an Interrupt Handler chapter of the
QNX Neutrino Programmer's Guide.
Chapter 3
Interprocess Communication (IPC)
Interprocess Communication plays a fundamental role in the transformation of the
microkernel from an embedded realtime kernel into a full-scale POSIX operating
system. As various service-providing processes are added to the microkernel, IPC is
the glue that connects those components into a cohesive whole.
Although message passing is the primary form of IPC in the QNX Neutrino RTOS,
several other forms are available as well. Unless otherwise noted, those other forms
of IPC are built over our native message passing. The strategy is to create a simple,
robust IPC service that can be tuned for performance through a simplified code path
in the microkernel; more feature-cluttered IPC services can then be implemented
from these.
Benchmarks comparing higher-level IPC services (like pipes and FIFOs implemented
over our messaging) with their monolithic kernel counterparts show comparable
performance.
QNX Neutrino offers at least the following forms of IPC:
Service:                Implemented in:
Message-passing         Kernel
Signals                 Kernel
POSIX message queues    External process
Shared memory           Process manager
Pipes                   External process
FIFOs                   External process
The designer can select these services on the basis of bandwidth requirements, the
need for queuing, network transparency, etc. The trade-off can be complex, but the
flexibility is useful.
As part of the engineering effort that went into defining the microkernel, the focus on
message passing as the fundamental IPC primitive was deliberate. As a form of IPC,
message passing (as implemented in MsgSend(), MsgReceive(), and MsgReply()), is
synchronous and copies data. Let's explore these two attributes in more detail.
[Figure: State changes for a client thread in a send-receive-reply transaction: the client's MsgSend() blocks it (REPLY-blocked once the server has done its MsgReceive()); the server's MsgReply() or MsgError() makes the client READY again.]
[Figure: State changes for a server thread: MsgReceive() blocks the server until a client's send arrives and makes it READY.]
MsgReply() vs MsgError()
The MsgReply() function returns a status and zero or more bytes of data to the client.
MsgError(), by contrast, unblocks the client without transferring any data: the client's
MsgSend() returns -1, with errno set to the error value supplied by the server.
Message copying
Since our messaging services copy a message directly from the address space of one
thread to another without intermediate buffering, the message-delivery performance
approaches the memory bandwidth of the underlying hardware.
The kernel attaches no special meaning to the content of a message; the data in a
message has meaning only as mutually defined by sender and receiver. However,
well-defined message types are also provided so that user-written processes or
threads can augment or substitute for system-supplied services.
The messaging primitives support multipart transfers, so that a message delivered
from the address space of one thread to another needn't pre-exist in a single, contiguous
buffer. Instead, both the sending and receiving threads can specify a vector table that
indicates where the sending and receiving message fragments reside in memory. Note
that the size of the various parts can be different for the sender and receiver.
Multipart transfers allow messages that have a header block separate from the data
block to be sent without performance-consuming copying of the data to create a
contiguous message. In addition, if the underlying data structure is a ring buffer,
specifying a three-part message will allow a header and two disjoint ranges within the
ring buffer to be sent as a single atomic message. A hardware equivalent of this concept
would be that of a scatter/gather DMA facility.
[Figure: A multipart transfer: an IOV may have any number of parts, each giving the address and length (0 to 4 GB) of one fragment of the message data.]
[Figure: A five-part IOV describing a 16-byte header plus four data fragments at separate addresses and lengths.]
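As a concrete illustration, here's a sketch of roughly how the C library's write() might be composed over a two-part message (a reconstruction for illustration only; the actual library source differs in its error handling):

#include <unistd.h>
#include <sys/iomsg.h>
#include <sys/neutrino.h>

ssize_t write(int fd, const void *buf, size_t nbytes)
{
    io_write_t msg;
    iov_t      iov[2];

    /* Fill in the fixed header that the filesystem manager expects */
    msg.i.type = _IO_WRITE;
    msg.i.combine_len = sizeof(msg.i);
    msg.i.xtype = _IO_XTYPE_NONE;
    msg.i.nbytes = nbytes;
    msg.i.zero = 0;

    /* Part 1: the header on the stack; part 2: the caller's data, in place */
    SETIOV(iov + 0, &msg.i, sizeof(msg.i));
    SETIOV(iov + 1, buf, nbytes);

    /* A file descriptor is a connection ID, so the message goes to the
       resource manager that services this fd */
    return MsgSendv(fd, iov, 2, NULL, 0);
}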
This code essentially builds a message structure on the stack, populates it with various
constants and passed parameters from the calling thread, and sends it to the filesystem
manager associated with fd. The reply indicates the success or failure of the operation.
This implementation doesn't prevent the kernel from detecting large message
transfers and choosing to implement page flipping for those cases. Since
most messages passed are quite tiny, copying messages is often faster than
manipulating MMU page tables. For bulk data transfer, shared memory between
processes (with message-passing or the other synchronization primitives for
notification) is also a viable option.
Simple messages
For simple single-part messages, the OS provides functions that take a pointer directly
to a buffer without the need for an IOV (input/output vector). In this case, the number
of parts is replaced by the size of the message directly pointed to.
In the case of the message-send primitive, which takes a send and a reply buffer, this
introduces four variations:
Function        Send message    Reply message
MsgSend()       Simple          Simple
MsgSendsv()     Simple          IOV
MsgSendvs()     IOV             Simple
MsgSendv()      IOV             IOV
The other messaging primitives that take a direct message simply drop the trailing v
in their names:
IOV                     Simple direct
MsgReceivev()           MsgReceive()
MsgReceivePulsev()      MsgReceivePulse()
MsgReplyv()             MsgReply()
MsgReadv()              MsgRead()
MsgWritev()             MsgWrite()
Function            Description
ChannelCreate()     Create a channel to receive messages on.
ChannelDestroy()    Destroy a channel.
ConnectAttach()     Create a connection to send messages on.
ConnectDetach()     Detach a connection.
[Figure: A client's connections map to a server's channel; several connections (from one or more clients) can attach to the same channel.]
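A minimal sketch of this pattern follows (error handling is omitted, and the means by which the client learns the server's pid and chid is left open; the message contents are arbitrary):

#include <sys/types.h>
#include <sys/neutrino.h>

/* Server side: create a channel and loop receiving and replying. */
void server(void)
{
    char msg[256];
    int  chid = ChannelCreate(0);

    for (;;) {
        int rcvid = MsgReceive(chid, msg, sizeof(msg), NULL);
        if (rcvid > 0) {
            MsgReply(rcvid, 0, "ok", 3);   /* unblock the client */
        }
    }
}

/* Client side: attach to the server's channel and send. The server's pid
   and chid must be learned out of band (or via name_open(); see below). */
void client(pid_t server_pid, int server_chid)
{
    char reply[256];
    int  coid = ConnectAttach(0, server_pid, server_chid, _NTO_SIDE_CHANNEL, 0);

    if (coid != -1) {
        MsgSend(coid, "hello", 6, reply, sizeof(reply));
        ConnectDetach(coid);
    }
}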
This loop allows the thread to receive messages from any thread that had a connection
to the channel.
The server can also use name_attach() to create a channel and associate a
name with it. The sender process can then use name_open() to locate that
name and create a connection to it.
Pulses
In addition to the synchronous Send/Receive/Reply services, the OS also supports
fixed-size, nonblocking messages. These are referred to as pulses and carry a small
payload (four bytes of data plus a single byte code).
Pulses pack a relatively small payload: eight bits of code and 32 bits of data. Pulses
are often used as a notification mechanism within interrupt handlers. They also allow
servers to signal clients without blocking on them.
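A minimal sketch of sending and receiving a pulse (the pulse code, the value 42, and the coid/chid arguments are assumptions for illustration):

#include <sys/neutrino.h>

#define MY_PULSE_CODE (_PULSE_CODE_MINAVAIL + 1)   /* arbitrary code for this sketch */

/* Sender: fire-and-forget notification on an existing connection */
void notify_client(int coid)
{
    /* -1 = use the sender's priority; 42 is the 32-bit payload */
    MsgSendPulse(coid, -1, MY_PULSE_CODE, 42);
}

/* Receiver: pulses arrive through the normal receive path */
void wait_for_pulse(int chid)
{
    struct _pulse pulse;

    if (MsgReceivePulse(chid, &pulse, sizeof(pulse), NULL) == 0) {
        /* pulse.code is the 8-bit code; pulse.value.sival_int is the 32-bit value */
    }
}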
[Figure: A pulse's payload: an 8-bit code and a 32-bit value.]
Message-passing API
The message-passing API consists of the following functions:
Function             Description
MsgSend()            Send a message and block until a reply arrives.
MsgReceive()         Wait for a message.
MsgReceivePulse()    Wait for a pulse.
MsgReply()           Reply to a message.
MsgError()           Unblock the sender with an error status; no data is transferred.
MsgRead()            Read additional data from a received message.
MsgWrite()           Write additional data to a reply message.
MsgInfo()            Obtain information about a received message.
MsgSendPulse()       Send a pulse.
MsgDeliverEvent()    Deliver an event on behalf of a client.
MsgKeyData()         Key a message to allow security checks.
For information about messages from the programming point of view, see the Message
Passing chapter of Get Programming with the QNX Neutrino RTOS.
Events
A significant advance in the kernel design for QNX Neutrino is the event-handling
subsystem. POSIX and its realtime extensions define a number of asynchronous
notification methods (e.g., UNIX signals that don't queue or pass data, POSIX realtime
signals that may queue and pass data, etc.).
The kernel also defines additional, QNX Neutrino-specific notification techniques such
as pulses. Implementing all of these event mechanisms could have consumed
significant code space, so our implementation strategy was to build all of these
notification methods over a single, rich, event subsystem.
A benefit of this approach is that capabilities exclusive to one notification technique
can become available to others. For example, an application can apply the same
queueing services of POSIX realtime signals to UNIX signals. This can simplify the
robust implementation of signal handlers within applications.
The events encountered by an executing thread can come from any of three sources:
a MsgDeliverEvent() kernel call invoked by a thread
an interrupt handler
the expiry of a timer
The event itself can be any of a number of different types: QNX Neutrino pulses,
interrupts, various forms of signals, and forced unblock events. Unblock is a
means by which a thread can be released from a deliberately blocked state without
any explicit event actually being delivered.
Given this multiplicity of event types, and applications needing the ability to request
whichever asynchronous notification technique best suits their needs, it would be
awkward to require that server processes (the higher-level threads from the previous
section) carry code to support all these options.
Instead, the client thread can give a data structure, or cookie, to the server to hang
on to until later. When the server needs to notify the client thread, it will invoke
MsgDeliverEvent() and the microkernel will set the event type encoded within the
cookie upon the client thread.
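A minimal sketch of this handshake (the message layout, the message type value, and the self_coid connection back to the client's own channel are assumptions for illustration):

#include <stdint.h>
#include <sys/neutrino.h>
#include <sys/siginfo.h>

/* Hypothetical message carrying the event "cookie" to the server */
struct notify_req {
    uint16_t        type;
    struct sigevent event;
};

/* Client: ask to be notified later with a pulse on its own channel */
void register_for_notification(int coid, int self_coid)
{
    struct notify_req req;

    req.type = 1;   /* assumed message type */
    SIGEV_PULSE_INIT(&req.event, self_coid, SIGEV_PULSE_PRIO_INHERIT,
                     _PULSE_CODE_MINAVAIL, 0);
    MsgSend(coid, &req, sizeof(req), NULL, 0);
}

/* Server: save the rcvid and event at receive time, reply right away,
   then deliver the event when the data is actually ready. */
void data_ready(int saved_rcvid, const struct sigevent *saved_event)
{
    MsgDeliverEvent(saved_rcvid, saved_event);
}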
[Figure: The client includes a sigevent in its MsgSend(); the server replies with MsgReply() and later calls MsgDeliverEvent() to deliver the event back to the client.]
I/O notification
The ionotify() function is a means by which a client thread can request asynchronous
event delivery.
Many of the POSIX asynchronous services (e.g., mq_notify() and the client-side of
the select()) are built on top of ionotify(). When performing I/O on a file descriptor
(fd), the thread may choose to wait for an I/O event to complete (for the write() case),
or for data to arrive (for the read() case). Rather than have the thread block on the
resource manager process that's servicing the read/write request, ionotify() can allow
the client thread to post an event to the resource manager that the client thread would
like to receive when the indicated I/O condition occurs. Waiting in this manner allows
the thread to continue executing and responding to event sources other than just the
single I/O request.
The select() call is implemented using I/O notification and allows a thread to block
and wait for a mix of I/O events on multiple fd's while continuing to respond to other
forms of IPC.
Here are the conditions upon which the requested event can be delivered:
_NOTIFY_COND_OUTPUT: there's room in the output buffer for more data.
_NOTIFY_COND_INPUT: a resource-manager-defined amount of data is available
to read.
_NOTIFY_COND_OBAND: resource-manager-defined out-of-band data is available.
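For example, a client might arm input notification on a file descriptor as follows (a sketch; the headers, the pulse event, and the self_coid connection are assumptions, and error handling is omitted):

#include <unistd.h>
#include <sys/iomsg.h>
#include <sys/neutrino.h>
#include <sys/siginfo.h>

/* Arm input notification on fd; a pulse is delivered to self_coid
   when data arrives. */
int arm_input_notification(int fd, int self_coid)
{
    struct sigevent event;

    SIGEV_PULSE_INIT(&event, self_coid, SIGEV_PULSE_PRIO_INHERIT,
                     _PULSE_CODE_MINAVAIL, 0);
    return ionotify(fd, _NOTIFY_ACTION_POLLARM, _NOTIFY_COND_INPUT, &event);
}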
Signals
The OS supports the 32 standard POSIX signals (as in UNIX) as well as the POSIX
realtime signals, both numbered from a kernel-implemented set of 64 signals with
uniform functionality. While the POSIX standard defines realtime signals as differing
from UNIX-style signals (in that they may contain four bytes of data and a byte code
and may be queued for delivery), this functionality can be explicitly selected or
deselected on a per-signal basis, allowing this converged implementation to still comply
with the standard.
Incidentally, the UNIX-style signals can select POSIX realtime signal queuing, if the
application wants it. The QNX Neutrino RTOS also extends the signal-delivery
mechanisms of POSIX by allowing signals to be targeted at specific threads, rather
than simply at the process containing the threads. Since signals are an asynchronous
event, they're also implemented with the event-delivery mechanisms.
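For example, a sender can queue a value-carrying realtime signal to a process, or direct a signal at one specific thread in the current process (a sketch; target_pid and worker_tid are assumed to be valid):

#include <pthread.h>
#include <signal.h>

void signal_examples(pid_t target_pid, pthread_t worker_tid)
{
    union sigval sv;

    sv.sival_int = 42;
    sigqueue(target_pid, SIGRTMIN, sv);   /* queued, carries 32 bits of data */

    pthread_kill(worker_tid, SIGUSR1);    /* delivered to that thread only */
}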
Microkernel call     POSIX call                          Description
SignalKill()         kill(), sigqueue()                  Set a signal on a process group, process, or thread.
SignalAction()       sigaction()                         Define the action to take on receipt of a signal.
SignalProcmask()     sigprocmask(), pthread_sigmask()    Change the signal blocked mask of a thread.
SignalSuspend()      sigsuspend(), pause()               Block until a signal invokes a signal handler.
SignalWaitinfo()     sigwaitinfo()                       Wait for a signal and return information about it.
[Figure: Signals queued to this thread. Each thread has a 64-entry signal vector and a signal-blocked mask; signals (e.g., 33 and 47) set while blocked are held on the thread's signal queue.]
[Figure: Signals delivered to this thread once it unblocks them.]
Special signals
As mentioned earlier, the OS defines a total of 64 signals.
Their range is as follows:
Signal range    Description
1 ... 40        40 POSIX signals (including traditional UNIX signals)
41 ... 56       16 POSIX realtime signals (SIGRTMIN to SIGRTMAX)
57 ... 64       Eight special-purpose QNX Neutrino signals
The eight special signals cannot be ignored or caught. An attempt to call the signal()
or sigaction() functions or the SignalAction() kernel call to change them will fail with
an error of EINVAL.
In addition, these signals are always blocked and have signal queuing enabled. An
attempt to unblock these signals via the sigprocmask() function or SignalProcmask()
kernel call will be quietly ignored.
A regular signal can be programmed to this behavior using the following standard
signal calls. The special signals save the programmer from writing this code and protect
the signal from accidental changes to this behavior.
sigset_t set;
struct sigaction action;

sigemptyset(&set);
sigaddset(&set, signo);
sigprocmask(SIG_BLOCK, &set, NULL);   /* keep the signal blocked */

sigemptyset(&action.sa_mask);
action.sa_handler = SIG_DFL;
action.sa_flags = SA_SIGINFO;         /* enable queuing for this signal */
sigaction(signo, &action, NULL);
This configuration makes these signals suitable for synchronous notification using the
sigwaitinfo() function or SignalWaitinfo() kernel call. The following code will block
until the eighth special signal is received:
sigset_t set;
siginfo_t info;

sigemptyset(&set);
sigaddset(&set, SIGRTMAX + 8);        /* the eighth special signal */
sigwaitinfo(&set, &info);
printf("Received signal %d with code %d and value %d\n",
       info.si_signo,
       info.si_code,
       info.si_value.sival_int);
Since the signals are always blocked, the program cannot be interrupted or killed if
the special signal is delivered outside of the sigwaitinfo() function. Since signal queuing
is always enabled, signals won't be lost; they'll be queued for the next sigwaitinfo()
call.
These signals were designed to solve a common IPC requirement where a server wishes
to notify a client that it has information available for the client. The server will use
MsgDeliverEvent() to deliver the signal event to the client.
Summary of signals
This table describes what each signal means.
Signal      Description
SIGABRT     Abnormal termination signal, such as issued by the abort() function.
SIGALRM     Timeout signal, such as issued by the alarm() function.
SIGBUS      Memory access error (bus error).
SIGCONT     Continue executing, if stopped.
SIGDEADLK   Mutex deadlock occurred; the kernel delivers a SIGDEADLK to all threads that
            are waiting on the mutex without a timeout. Note that SIGDEADLK and SIGEMT
            refer to the same signal. Some utilities (e.g., gdb, ksh, slay, and kill)
            know about SIGEMT, but not SIGDEADLK.
SIGEMT      EMT instruction (same signal as SIGDEADLK).
SIGFPE      Erroneous arithmetic operation (e.g., division by zero, or an operation
            resulting in overflow).
SIGHUP      Death of session leader, or hangup detected on controlling terminal.
SIGILL      Detection of an invalid hardware instruction.
SIGINT      Interactive attention signal (Break).
SIGIOT      IOT instruction (same signal as SIGABRT).
SIGKILL     Termination signal; can't be caught or ignored.
SIGPIPE     Attempt to write on a pipe with no readers.
SIGPWR      Power failure or restart.
SIGQUIT     Interactive termination signal.
SIGSEGV     Detection of an invalid memory reference.
SIGSTOP     Stop the process; can't be caught or ignored.
SIGSYS      Bad argument to a system call.
SIGTERM     Termination signal.
SIGTRAP     Unsupported software interrupt (trace/breakpoint trap).
SIGTSTP     Stop signal generated from the keyboard.
SIGTTIN     Background read attempted from controlling terminal.
SIGTTOU     Background write attempted to controlling terminal.
SIGURG      Urgent condition present on socket.
SIGUSR1     Reserved as application-defined signal 1.
SIGUSR2     Reserved as application-defined signal 2.
SIGWINCH    Window size changed.
POSIX message queues
Unlike our inherent message-passing primitives, the POSIX message queues reside
outside the kernel.
File-like interface
Message queues resemble files, at least as far as their interface is concerned.
You open a message queue with mq_open(), close it with mq_close(), and destroy it
with mq_unlink(). And to put data into (write) and take it out of (read) a message
queue, you use mq_send() and mq_receive().
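A minimal sketch of this file-like interface (the queue name and attributes are arbitrary, and error handling is mostly omitted):

#include <fcntl.h>
#include <mqueue.h>

int queue_demo(void)
{
    struct mq_attr attr = { .mq_maxmsg = 16, .mq_msgsize = 128 };
    char           buf[128];

    mqd_t mq = mq_open("/data", O_RDWR | O_CREAT, 0660, &attr);
    if (mq == (mqd_t)-1) return -1;

    mq_send(mq, "hello", 6, 0);                 /* write a message, priority 0 */
    mq_receive(mq, buf, sizeof(buf), NULL);     /* read it back */

    mq_close(mq);
    mq_unlink("/data");                         /* destroy the queue name */
    return 0;
}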
A queue named:    Appears in the pathname space as:
/data             /dev/mqueue/data
/acme/data        /dev/mqueue/acme/data
/qnx/data         /dev/mqueue/qnx/data
You can display all message queues in the system using the ls command as follows:
ls -Rl /dev/mqueue
Message-queue functions
POSIX message queues are managed via the following functions:
Function        Description
mq_open()       Open a message queue.
mq_close()      Close a message queue.
mq_unlink()     Remove a message queue.
mq_send()       Add a message to the message queue.
mq_receive()    Receive a message from the message queue.
mq_notify()     Tell the calling process that a message is available on a message queue.
mq_setattr()    Set message-queue attributes.
mq_getattr()    Get message-queue attributes.
Shared memory
Shared memory offers the highest bandwidth IPC available.
Once a shared-memory object is created, processes with access to the object can use
pointers to directly read and write into it. This means that access to shared memory
is in itself unsynchronized. If a process is updating an area of shared memory, care
must be taken to prevent another process from reading or updating the same area.
Even in the simple case of a read, the other process may get information that is in
flux and inconsistent.
To solve these problems, shared memory is often used in conjunction with one of the
synchronization primitives to make updates atomic between processes. If the granularity
of updates is small, then the synchronization primitives themselves will limit the
inherently high bandwidth of using shared memory. Shared memory is therefore most
efficient when used for updating large amounts of data as a block.
Both semaphores and mutexes are suitable synchronization primitives for use with
shared memory. Semaphores were introduced with the POSIX realtime standard for
interprocess synchronization. Mutexes were introduced with the POSIX threads standard
for thread synchronization. Mutexes may also be used between threads in different
processes. POSIX considers this an optional capability; we support it. In general,
mutexes are more efficient than semaphores.
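For example, a mutex marked process-shared can be placed directly inside the shared-memory region it protects (a sketch; shared_region is assumed to point into memory already mapped with mmap(), and error handling is omitted):

#include <pthread.h>

struct shared_data {
    pthread_mutex_t lock;
    int             counter;
};

void init_shared_lock(struct shared_data *shared_region)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&shared_region->lock, &attr);
}

void update(struct shared_data *shared_region)
{
    pthread_mutex_lock(&shared_region->lock);
    shared_region->counter++;               /* the protected update */
    pthread_mutex_unlock(&shared_region->lock);
}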
Function                        Description                                                       Classification
shm_open()                      Open (or create) a shared-memory object.                          POSIX
close()                         Close a shared-memory object.                                     POSIX
mmap()                          Map a shared-memory object into a process's address space.        POSIX
munmap()                        Unmap a shared-memory object from a process's address space.      POSIX
munmap_flags()                  Unmap, with control over what happens when the memory is          QNX Neutrino
                                next mapped.
mprotect()                      Change protections on a shared-memory region.                     POSIX
msync()                         Synchronize memory with physical storage.                         POSIX
shm_ctl(), shm_ctl_special()    Give special attributes to a shared-memory object.                QNX Neutrino
shm_unlink()                    Remove a shared-memory region.                                    POSIX
POSIX shared memory is implemented in the QNX Neutrino RTOS via the process
manager (procnto). The above calls are implemented as messages to procnto (see
the Process Manager (p. 125) chapter in this book).
The shm_open() function takes the same arguments as open() and returns a file
descriptor to the object. As with a regular file, this function lets you create a new
shared-memory object or open an existing shared-memory object.
You must open the file descriptor for reading; if you want to write in the memory
object, you also need write access, unless you specify a private (MAP_PRIVATE)
mapping.
When a new shared-memory object is created, the size of the object is set to zero. To
set the size, you use ftruncate() (the very same function used to set the size of a file)
or shm_ctl().
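A minimal sketch of creating and sizing a shared-memory object, then mapping it in (the object name and size are arbitrary):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

void *create_shared_region(size_t size)
{
    int fd = shm_open("/demo_region", O_RDWR | O_CREAT, 0660);
    if (fd == -1) return NULL;

    ftruncate(fd, size);                       /* new objects start at size 0 */

    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                 /* the mapping stays valid after close */
    return (addr == MAP_FAILED) ? NULL : addr;
}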
mmap()
Once you have a file descriptor to a shared-memory object, you use the mmap() function
to map the object, or part of it, into your process's address space.
The mmap() function is the cornerstone of memory management within QNX Neutrino
and deserves a detailed discussion of its capabilities.
You can also use mmap() to map files and typed memory objects into your
process's address space.
[Figure: Arguments to mmap(): len bytes of the shared-memory object, starting at offset, are mapped into the process's address space at addr.]
Protection type    Description
PROT_EXEC          Memory may be executed.
PROT_NOCACHE       Memory should not be cached.
PROT_NONE          No access allowed.
PROT_READ          Memory may be read.
PROT_WRITE         Memory may be written.
You should use the PROT_NOCACHE manifest when you're using a shared-memory
region to gain access to dual-ported memory that may be modified by hardware (e.g.,
a video frame buffer or a memory-mapped network or communications board). Without
this manifest, the processor may return stale data from a previously cached read.
The mapping_flags determine how the memory is mapped. These flags are broken
down into two parts; the first part is a type and must be specified as one of the
following:
Map type       Description
MAP_SHARED     The mapping may be shared by many processes; changes to the mapped region
               are written back to the underlying object.
MAP_PRIVATE    The mapping is private to the calling process; modified pages are privately
               copied (copy-on-write) and aren't written back to the object.
The MAP_SHARED type is the one to use for setting up shared memory between
processes; MAP_PRIVATE has more specialized uses.
You can OR a number of flags into the above type to further define the mapping. These
are described in detail in the mmap() entry in the QNX Neutrino C Library Reference.
A few of the more interesting flags are:
MAP_ANON
Map anonymous memory that isn't associated with any file descriptor; you
must set the fd parameter to NOFD. The mmap() function allocates the
memory, and by default, fills the allocated memory with zeros; see
Initializing allocated memory (p. 96).
You commonly use MAP_ANON with MAP_PRIVATE, but you can use it with
MAP_SHARED to create a shared memory area for forked applications. You
can use MAP_ANON as the basis for a page-level memory allocator.
MAP_FIXED
Map the object to the address specified by where_i_want_it. If a
shared-memory region contains pointers within it, then you may need to
force the region at the same address in all processes that map it. This can
be avoided by using offsets within the region in place of direct pointers.
MAP_PHYS
This flag indicates that you wish to deal with physical memory. The fd
parameter should be set to NOFD. When used without MAP_ANON, the
offset_within_shared_memory specifies the exact physical address to map
(e.g., for video frame buffers). If used with MAP_ANON, then physically
contiguous memory is allocated (e.g., for a DMA buffer).
You can use MAP_NOX64K and MAP_BELOW16M to further define the
MAP_ANON allocated memory and address limitations present in some forms
of DMA.
MAP_NOX64K
Used with MAP_PHYS | MAP_ANON. The allocated memory area will not
cross a 64-KB boundary. This is required for the old 16-bit PC DMA.
MAP_BELOW16M
Used with MAP_PHYS | MAP_ANON. The allocated memory area will reside
in physical memory below 16 MB. This is necessary when using DMA with
ISA bus devices.
MAP_NOINIT
Relax the POSIX requirement to zero the allocated memory; see Initializing
allocated memory (p. 96), below.
Using the mapping flags described above, a process can easily share memory with
other processes:
/* Map in a shared memory region */
fd = shm_open("/datapoints", O_RDWR, 0);
addr = mmap(0, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
You can unmap all or part of a shared-memory object from your address space using
munmap(). This primitive isn't restricted to unmapping shared memory; it can be
used to unmap any region of memory within your process. When used in conjunction
with the MAP_ANON flag to mmap(), you can easily implement a private page-level
allocator/deallocator.
You can change the protections on a mapped region of memory using mprotect(). Like
munmap(), mprotect() isn't restricted to shared-memory regions; it can change the
protection on any region of memory within your process.
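For instance, a sketch of making a previously mapped region read-only (addr and len are assumed to describe an existing, page-aligned mapping):

#include <sys/mman.h>

int make_read_only(void *addr, size_t len)
{
    return mprotect(addr, len, PROT_READ);
}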
Initializing allocated memory
Avoiding initializing the memory requires the cooperation of the process doing the
unmapping and the one doing the mapping: the unmapper must indicate, via
munmap_flags(), that the memory doesn't need to be zeroed the next time it's mapped,
and the mapper must specify MAP_NOINIT in its call to mmap().
The munmap_flags() function is a non-POSIX function that's similar to munmap()
but lets you control what happens when the memory is next mapped:
int munmap_flags( void *addr, size_t len,
unsigned flags );
Don't clear memory when it's freed (the default). When memory is freed for
later reuse, the contents of that memory remain untouched; whatever the
application that owned the memory left behind is left intact until the next
time that memory is allocated by another process. At that point, before the
memory is handed to the next process, it's zeroed.
Typed memory
Typed memory is POSIX functionality defined in the 1003.1 specification. It's part of
the advanced realtime extensions, and the manifests are located in the <sys/mman.h>
header file.
Typed memory adds the following functions to the C library:
posix_typed_mem_open()
Open a typed memory object. This function returns a file descriptor, which
you can then pass to mmap() to establish a memory mapping of the typed
memory object.
posix_typed_mem_get_info()
Get information (currently the amount of available memory) about a typed
memory object.
POSIX typed memory provides an interface to open memory objects (which are defined
in an OS-specific fashion) and perform mapping operations on them. It's useful in
providing an abstraction between BSP- or board-specific address layouts and device
drivers or user code.
Implementation-defined behavior
POSIX specifies that typed memory pools (or objects) are created and defined in an
implementation-specific fashion.
This section describes the following for QNX Neutrino:
Seeding of typed memory regions (p. 99)
Naming of typed memory regions (p. 100)
Pathname space and typed memory (p. 101)
mmap() allocation flags and typed memory objects (p. 101)
Permissions and typed memory objects (p. 102)
Object length and offset definitions (p. 102)
Interaction with other POSIX APIs (p. 102)
Here's a sample configuration of typed memory regions (name and address range):
Name                   Address range
/memory                0, 0xFFFFFFFF
/memory/ram            0, 0x1FFFFFF
/memory/ram/sysram     0x1000, 0x1FFFFFF
/memory/isa/ram/dma    0x1000, 0xFFFFFF
/memory/ram/dma        0x1000, 0x1FFFFFF
The name you pass to posix_typed_mem_open() follows the above naming convention.
POSIX allows an implementation to define what happens when the name doesn't start
with a leading slash (/). The resolution rules on opening are as follows:
1. If the name starts with a leading /, an exact match is done.
2. The name may contain intermediate / characters. These are considered as path
component separators. If multiple path components are specified, they're matched
from the bottom up (the opposite of the way filenames are resolved).
3. If the name doesn't start with a leading /, a tail match is done on the pathname
components specified.
Here are some examples of how posix_typed_mem_open() resolves names, using the
above sample configuration:
This name:     Resolves to:           See:
/memory        /memory                Rule 1
/memory/ram    /memory/ram            Rule 2
/sysram        Fails                  Rule 1
sysram         /memory/ram/sysram     Rule 3
The memory is allocated and not available for other allocations, but if you fork the
process, the child processes can access it as well. The memory is released when
the last mapping to it is removed.
Note that like somebody doing mem_offset() and then a MAP_PHYS to gain access
to previously allocated memory, somebody else could open the typed memory object
with POSIX_TYPED_MEM_MAP_ALLOCATABLE (or with no flags) and gain access
to the same physical memory that way.
POSIX_TYPED_MEM_ALLOCATE_CONTIG is like MAP_ANON | MAP_SHARED, in
that it causes a contiguous allocation.
The POSIX_TYPED_MEM_MAP_ALLOCATABLE case is used to create a
mapping to an object without allocation or deallocation. This is equivalent to a
shared mapping to physical memory.
rlimits
The POSIX setrlimit() APIs provide the ability to set limits on the virtual and
physical memory that a process can consume. Since typed memory operations
may operate on normal RAM (sysram) and will create mappings in the
process's address space, they need to be taken into account when doing the
rlimit accounting. In particular, the following rules apply:
Any mapping created by mmap() for typed memory objects is counted in
the process's RLIMIT_VMEM or RLIMIT_AS limit.
Typed memory never counts against RLIMIT_DATA.
POSIX file-descriptor functions
You can use the file descriptor that posix_typed_mem_open() returns with
selected POSIX fd-based calls, as follows:
fstat(fd,..), which fills in the stat structure as it does for a shared
memory object, except that the size field doesn't hold the size of the
typed memory object.
close(fd) closes the file descriptor.
dup() and dup2() duplicate the file handle.
posix_mem_offset() behaves as documented in the POSIX specification.
Practical examples
Here are some examples of how you could use typed memory.
Allocating contiguous memory from system RAM
Here's a code snippet that allocates contiguous memory from system RAM:
int fd = posix_typed_mem_open( "/memory/ram/sysram", O_RDWR,
POSIX_TYPED_MEM_ALLOCATE_CONTIG);
void *vaddr = mmap( NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE, fd, 0);
where phys_addr is the physical address of the SRAM, size is the SRAM size, and
mem_id is the ID of the parent (typically memory, which is returned by as_default()).
Alternatively, you may want to use the packet memory as direct shared, physical
buffers. In this case, applications would use it as follows:
int fd = posix_typed_mem_open( "packet_memory", O_RDWR,
POSIX_TYPED_MEM_MAP_ALLOCATABLE);
void *vaddr = mmap( NULL, size, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, offset);
where dma_addr is the start of the DMA-safe RAM, and size is the size of the DMA-safe
region.
This code creates an asinfo entry for dma, which is a child of ram. Drivers can then
use it to allocate DMA-safe buffers:
int fd = posix_typed_mem_open( "ram/dma", O_RDWR,
POSIX_TYPED_MEM_ALLOCATE_CONTIG);
void *vaddr = mmap( NULL, size, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
Pipes
A pipe is an unnamed file that serves as an I/O channel between two or more
cooperating processes: one process writes into the pipe, the other reads from
the pipe.
The pipe manager takes care of buffering the data. The buffer size is defined
as PIPE_BUF in the <limits.h> file; the pathconf() function returns the value
of this limit at runtime. A pipe is removed once both of its ends have closed.
Pipes are normally used when two processes want to run in parallel, with
data moving from one process to the other in a single direction. (If
bidirectional communication is required, messages should be used instead.)
A typical application for a pipe is connecting the output of one program to
the input of another program. This connection is often made by the shell.
For example:
ls | more
directs the standard output from the ls utility through a pipe to the standard
input of the more utility.
If you want to:                      Use the:
Create pipes from within programs    pipe() function
Create pipes from the shell          pipe symbol ("|")
FIFOs
FIFOs are essentially the same as pipes, except that FIFOs are named
permanent files that are stored in filesystem directories.
If you want to:                      Use the:
Create FIFOs from within programs    mkfifo() function
Create FIFOs from the shell          mkfifo utility
Chapter 4
The Instrumented Microkernel
An instrumented version of the microkernel (procnto-instr) is equipped with a
sophisticated tracing and profiling mechanism that lets you monitor your system's
execution in real time. The procnto-instr module works on both single-CPU and SMP
systems.
The procnto-instr module adds very little overhead and gives exceptionally good
performance; it's typically about 98% as fast as the noninstrumented kernel (when
it isn't logging). The additional amount of code (about 30 KB on an x86 system) in
the instrumented kernel is a relatively small price to pay for the added power and
flexibility of this useful tool. Depending on the footprint requirements of your final
system, you may choose to use this special kernel as a development/prototyping tool
or as the actual kernel in your final product.
The instrumented module is nonintrusive; you don't have to modify a program's source
code in order to monitor how that program interacts with the kernel. You can trace as
many or as few interactions (e.g., kernel calls, state changes, and other system
activities) as you want between the kernel and any running thread or process in your
system. You can even monitor interrupts. In this context, all such activities are known
as events.
For more details, see the System Analysis Toolkit User's Guide.
Instrumentation at a glance
Here are the essential tasks involved in kernel instrumentation:
1. The instrumented microkernel (procnto-instr) emits trace events as a result
of various system activities. These events are automatically copied to a set of buffers
grouped into a circular linked list.
2. As soon as the number of events inside a buffer reaches the high-water mark, the
kernel notifies a data-capture utility.
3. The data-capture utility then writes the trace events from the buffer to an output
device (e.g., a serial port, an event file, etc.).
4. A data-interpretation facility then interprets the events and presents this data to
the user.
[Figure: Instrumentation at a glance: process and thread activity drives the instrumented microkernel, which emits trace events into event buffers; a data-capture utility writes them out for a data filter/interpretation stage.]
Event control
Given the large number of activities occurring in a live system, the number of events
that the kernel emits can be overwhelming (in terms of the amount of data, the
processing requirements, and the resources needed to store it). But you can easily
control the amount of data emitted.
Specifically, you can:
control the initial conditions that trigger event emissions
apply predefined kernel filters to dynamically control emissions
implement your own event handlers for even more filtering.
Once the data has been collected by the data-capture utility (tracelogger), it can
then be analyzed. You can analyze the data in real time or offline after the relevant
events have been gathered. The System Analysis tool within the IDE presents this data
graphically so you can see what's going on in your system.
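An application can also start and stop event emission around a region of interest from within its own code (a sketch; doing so typically requires appropriate privileges):

#include <sys/trace.h>

void trace_region_of_interest(void)
{
    TraceEvent(_NTO_TRACE_START);        /* begin emitting events */

    /* ... the activity you want captured ... */

    TraceEvent(_NTO_TRACE_STOP);         /* stop emitting events */
    TraceEvent(_NTO_TRACE_FLUSHBUFFER);  /* force buffered events out */
}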
Modes of emission
Apart from applying the various filters to control the event stream, you can also specify
one of two modes the kernel can use to emit events:
fast mode
Emits only the most pertinent information (e.g., only two kernel call
arguments) about an event.
wide mode
Generates more information (e.g., all kernel call arguments) for the same
event.
The trade-off here is one of speed vs knowledge: fast mode delivers less data, while
wide mode packs much more information for each event. Either way, you can easily
tune your system, because these modes work on a per-event basis.
As an example of the difference between the fast and wide emission modes, let's look
at the kinds of information we might see for a MsgSendv() call entry:
Fast mode data           Number of bytes
Connection ID            4 bytes
Message data             4 bytes
Total emitted: 8 bytes

Wide mode data           Number of bytes
Connection ID            4 bytes
# of parts to send       4 bytes
# of parts to receive    4 bytes
Message data             4 bytes
Message data             4 bytes
Message data             4 bytes
Total emitted: 24 bytes
Ring buffer
Rather than always emit events to an external device, the kernel can keep all of the
trace events in an internal circular buffer.
This buffer can be programmatically dumped to an external device on demand when
a certain triggering condition is met, making this a very powerful tool for identifying
elusive bugs that crop up under certain runtime conditions.
Data interpretation
The data of an event includes a high-precision timestamp as well as the ID number
of the CPU on which the event was generated. This information helps you easily
diagnose difficult timing problems, which are more likely to occur on multiprocessor
systems.
The event format also includes the CPU platform (e.g., x86, ARM, etc.) and endian
type, which facilitates remote analysis (whether in real time or offline). Using a data
interpreter, you can view the data output in various ways, such as:
a timestamp-based linear presentation of the entire system
a running view of only the active threads/processes
a state-based view of events per process/thread.
The linear output from the data interpreter might look something like this:
TRACEPRINTER version 0.94
-- HEADER FILE INFORMATION --
TRACE_FILE_NAME:: /dev/shmem/tracebuffer
TRACE_DATE:: Fri Jun 8 13:14:40 2001
TRACE_VER_MAJOR:: 0
TRACE_VER_MINOR:: 96
TRACE_LITTLE_ENDIAN:: TRUE
TRACE_ENCODING:: 16 byte events
TRACE_BOOT_DATE:: Fri Jun 8 04:31:05 2001
TRACE_CYCLES_PER_SEC:: 400181900
TRACE_CPU_NUM:: 4
TRACE_SYSNAME:: QNX
TRACE_NODENAME:: x86quad.gp.qa
TRACE_SYS_RELEASE:: 6.1.0
TRACE_SYS_VERSION:: 2001/06/04-14:07:56
TRACE_MACHINE:: x86pc
TRACE_SYSPAGE_LEN:: 2440
-- KERNEL EVENTS --
t:0x1310da15 CPU:01 CONTROL :TIME msb:0x0000000f, lsb(offset):0x1310d81c
t:0x1310e89d CPU:01 PROCESS :PROCCREATE_NAME
ppid:0
pid:1
name:./procnto-smp-instr
t:0x1310eee4 CPU:00 THREAD :THCREATE
pid:1 tid:1
t:0x1310f052 CPU:00 THREAD :THRUNNING
pid:1 tid:1
t:0x1310f144 CPU:01 THREAD :THCREATE
pid:1 tid:2
t:0x1310f201 CPU:01 THREAD :THREADY
pid:1 tid:2
t:0x1310f32f CPU:02 THREAD :THCREATE
pid:1 tid:3
t:0x1310f3ec CPU:02 THREAD :THREADY
pid:1 tid:3
t:0x1310f52d CPU:03 THREAD :THCREATE
pid:1 tid:4
t:0x1310f5ea CPU:03 THREAD :THRUNNING
pid:1 tid:4
t:0x1310f731 CPU:02 THREAD :THCREATE
pid:1 tid:5
.
.
.
To help you fine-tune your interpretation of the event data stream, we provide a library
(traceparser) so you can write your own custom event interpreters.
Proactive tracing
While the instrumented kernel provides an excellent unobtrusive method for
instrumenting and monitoring processes, threads, and the state of your system in
general, you can also have your applications proactively influence the event-collection
process.
Using the TraceEvent() library call, applications themselves can inject custom events
into the trace stream. This facility is especially useful when building large, tightly
coupled, multicomponent systems.
For example, the following simple call would inject the integer values of eventcode,
first, and second into the event stream:
TraceEvent(_NTO_TRACE_INSERTSUSEREVENT, eventcode, first,
second);
You can also inject a string (e.g., My Event) into the event stream, as shown in the
following code:
#include <stdio.h>
#include <unistd.h>
#include <sys/trace.h>

/* Code to associate with emitted events */
#define MYEVENTCODE 12

int main(int argc, char **argv) {
    printf("My pid is %d \n", getpid());

    /* Inject two integer events (26, 1975) */
    TraceEvent(_NTO_TRACE_INSERTSUSEREVENT, MYEVENTCODE,
               26, 1975);

    /* Inject a string event (My Event) */
    TraceEvent(_NTO_TRACE_INSERTUSRSTREVENT, MYEVENTCODE,
               "My Event");

    return 0;
}
The output, as gathered by the traceprinter data interpreter, would then look
something like this:
.
.
.
t:0x38ea737e CPU:00 USREVENT:EVENT:12, d0:26 d1:1975
.
.
.
t:0x38ea7cb0 CPU:00 USREVENT:EVENT:12 STR:"My Event"
Note that 12 was specified as the trace user eventcode for these events.
Chapter 5
Multicore Processing
"Two heads are better than one," goes the old saying, and the same is true for computer
systems, where two (or more) processors can greatly improve performance.
Multiprocessing systems can be in these forms:
Discrete or traditional
A system that has separate physical processors hooked up in multiprocessing
mode over a board-level bus.
Multicore
A chip that has one physical processor with multiple CPUs interconnected
over a chip-level bus.
Multicore processors deliver greater computing power through concurrency,
offer greater system density, and run at lower clock speeds than uniprocessor
chips. Multicore processors also reduce thermal dissipation, power
consumption, and board area (and hence the cost of the system).
Multiprocessing includes several operating modes:
Asymmetric multiprocessing (AMP) (p. 116)
A separate OS, or a separate instantiation of the same OS, runs on each
CPU.
Symmetric multiprocessing (SMP) (p. 117)
A single instantiation of an OS manages all CPUs simultaneously, and
applications can float to any of them.
Bound multiprocessing (BMP) (p. 122)
A single instantiation of an OS manages all CPUs simultaneously, but each
application is locked to a specific CPU.
To determine how many processors there are on your system, look at the
num_cpu entry of the system page. For more information, see Structure of
the system page in the Customizing Image Startup Programs chapter of
Building Embedded Systems.
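For example, a process can read the count directly from the system page (a minimal sketch):

#include <stdio.h>
#include <sys/syspage.h>

int main(void)
{
    /* The system page is mapped read-only into every process */
    printf("This system has %d CPU(s)\n", _syspage_ptr->num_cpu);
    return 0;
}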
an SMP system, because its thread would be scheduled on the available processors
beside other servers and client processes.
As a testament to this microkernel approach, the SMP-enabled QNX Neutrino
kernel/process manager adds only a few kilobytes of additional code. The SMP versions
are designed for these main processor families:
ARM (procnto-smp)
x86 (procnto-smp)
The x86 version can boot on any system that conforms to the Intel MultiProcessor
Specification (MP Spec) with up to 32 Pentium (or better) processors. QNX Neutrino
also supports Intel's Hyper-Threading Technology found in P4 and Xeon processors.
The procnto-smp manager will also function on a single non-SMP system. With the
cost of building a dual-processor Pentium motherboard very nearly the same as that
for a single-processor motherboard, it's possible to deliver cost-effective solutions that
can be scaled in the field by the simple addition of a second CPU. The fact that the
OS itself is only a few kilobytes larger also allows SMP to be seriously considered for
small CPU-intensive embedded systems, not just high-end servers.
Critical sections
To control access to data structures that are shared between them, threads and
processes use the standard POSIX primitives of mutexes, condvars, and semaphores.
These work without change in an SMP system.
Many realtime systems also need to protect access to shared data structures between
an interrupt handler and the thread that owns the handler. The traditional POSIX
primitives used between threads aren't available for use by an interrupt handler. There
are two solutions here:
One is to remove all work from the interrupt handler and do all the work at thread
time instead. Given our fast thread scheduling, this is a very viable solution.
In a uniprocessor system running the QNX Neutrino RTOS, an interrupt handler
may preempt a thread, but a thread will never preempt an interrupt handler. This
allows the thread to protect itself from the interrupt handler by disabling and
enabling interrupts for very brief periods of time.
The thread on a non-SMP system protects itself with code of the form:
InterruptDisable()
// critical section
InterruptEnable()
Or:
InterruptMask(intr)
// critical section
InterruptUnmask(intr)
Unfortunately, this code will fail on an SMP system since the thread may be running
on one processor while the interrupt handler is concurrently running on another
processor!
One solution would be to lock the thread to a particular processor (see Bound
Multiprocessing (BMP) (p. 122), later in this chapter).
[Table: Feature comparison of SMP, BMP, and AMP. Recoverable entries: intercore messaging is fast (OS primitives) under SMP and BMP, but slower and application-managed under AMP; load balancing is listed for SMP and BMP.]
Chapter 6
Process Manager
The process manager is capable of creating multiple POSIX processes (each of which
may contain multiple POSIX threads).
In the QNX Neutrino RTOS, the microkernel is paired with the Process Manager in a
single module (procnto). This module is required for all runtime systems. Its main
areas of responsibility include:
process management: manages process creation, destruction, and process attributes
such as user ID (uid) and group ID (gid)
memory management: manages a range of memory-protection capabilities, shared
libraries, and interprocess POSIX shared-memory primitives
pathname management: manages the pathname space into which resource
managers may attach
User processes can access microkernel functions directly via kernel calls and process
manager functions by sending messages to procnto. Note that a user process sends
a message by invoking the MsgSend*() kernel call.
It's important to note that threads executing within procnto invoke the microkernel
in exactly the same way as threads in other processes. The fact that the process
manager code and the microkernel share the same process address space doesn't
imply a special or private interface. All threads in the system share the same
consistent kernel interface and all perform a privilege switch when invoking the
microkernel.
Process management
The first responsibility of procnto is to dynamically create new processes. These
processes will then depend on procnto's other responsibilities of memory management
and pathname management.
Process management consists of both process creation and destruction as well as the
management of process attributes such as process IDs, process groups, user IDs, etc.
Process primitives
The process primitives include:
Function         Origin
posix_spawn()    POSIX
spawn()          QNX Neutrino
fork()           POSIX
vfork()          UNIX BSD extension
exec*()          POSIX
posix_spawn()
The posix_spawn() function creates a child process by directly specifying an executable
to load.
To those familiar with UNIX systems, the posix_spawn() call is modeled after a fork()
followed by an exec*(). However, it operates much more efficiently in that there's no
need to duplicate address spaces as in a fork(), only to destroy and replace it when
the exec*() is called.
In a UNIX system, one of the main advantages of using the fork()-then-exec*() method
of creating a child process is the flexibility in changing the default environment
inherited by the new child process. This is done in the forked child just before the
exec*(). For example, the following simple shell command would close and reopen
the standard output before exec*()'ing:
ls >file
You can do the same with posix_spawn(); it gives you control over the following classes
of environment inheritance, which are often adjusted when creating a new child
process:
file descriptors
process user and group IDs
signal mask
ignored signals
adaptive partitioning (scheduler) attributes
There's also a companion function, posix_spawnp(), that doesn't require the absolute
path to the program to spawn, but instead searches for the executable using the caller's
PATH.
Using the posix_spawn() functions is the preferred way to create a new child process.
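A minimal sketch (the program and arguments are arbitrary; posix_spawnp() searches the PATH for the executable):

#include <spawn.h>
#include <stdio.h>

extern char **environ;

int run_ls(void)
{
    pid_t pid;
    char *const argv[] = { "ls", "-l", NULL };

    int rc = posix_spawnp(&pid, "ls", NULL, NULL, argv, environ);
    if (rc != 0) {
        printf("posix_spawnp failed: %d\n", rc);
        return -1;
    }
    return 0;
}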
spawn()
The QNX Neutrino spawn() function is similar to posix_spawn().
The spawn() function gives you control over the following:
file descriptors
process group ID
signal mask
ignored signals
the node to create the process on
scheduling policy
scheduling parameters (priority)
maximum stack size
runmask (for SMP systems)
The basic forms of the spawn() function are:
spawn()
Spawn with the explicitly specified path.
spawnp()
Search the current PATH and invoke spawn() with the first matching
executable.
There's also a set of convenience functions that are built on top of spawn() and
spawnp() as follows:
spawnl()
Spawn with the command line provided as inline arguments.
spawnle()
spawnl() with explicitly passed environment variables.
spawnlp()
spawnp() that follows the command search path.
spawnlpe()
spawnlp() with explicitly passed environment variables.
spawnv()
Spawn with the command line pointed to by an array of pointers.
spawnve()
spawnv() with explicitly passed environment variables.
spawnvp()
spawnv() that follows the command search path.
spawnvpe()
spawnvp() with explicitly passed environment variables.
When a process is spawn()'ed, the child process inherits the following attributes of its
parent:
process group ID (unless SPAWN_SETGROUP is set in inherit.flags)
session membership
real user ID and real group ID
supplementary group IDs
priority and scheduling policy
current working directory and root directory
file creation mask
signal mask (unless SPAWN_SETSIGMASK is set in inherit.flags)
signal actions specified as SIG_DFL
signal actions specified as SIG_IGN (except the ones modified by inherit.sigdefault
when SPAWN_SETSIGDEF is set in inherit.flags)
The child process has several differences from the parent process:
Signals set to be caught by the parent process are set to the default action
(SIG_DFL).
The child process's tms_utime, tms_stime, tms_cutime, and tms_cstime are tracked
separately from the parent's.
The number of seconds left until a SIGALRM signal would be generated is set to
zero for the child process.
The set of pending signals for the child process is empty.
File locks set by the parent aren't inherited.
Per-process timers created by the parent aren't inherited.
Memory locks and mappings set by the parent aren't inherited.
If the child process is spawned on a remote node, the process group ID and the session
membership aren't set; the child process is put into a new session and a new process
group.
The child process can access the parent process's environment by using the environ
global variable (found in <unistd.h>).
For more information, see the spawn() function in the QNX Neutrino C Library
Reference.
fork()
The fork() function creates a new child process by sharing the same code as the calling
process and duplicating the calling process's data to give the child process an exact
copy. Most process resources are inherited.
The following resources are explicitly not inherited:
process ID
parent process ID
file locks
pending signals and alarms
timers
The fork() function is typically used for one of two reasons:
to create a new instance of the current execution environment
to create a new process running a different program
When creating a new thread, common data is placed in an explicitly created shared
memory region. Prior to the POSIX thread standard, this was the only way to accomplish
this. With POSIX threads, this use of fork() is better accomplished by creating threads
within a single process using pthread_create().
When creating a new process running a different program, the call to fork() is soon
followed by a call to one of the exec*() functions. This too is better accomplished by
a single call to the posix_spawn() function or the QNX Neutrino spawn() function,
which combine both operations with far greater efficiency.
Since QNX Neutrino provides better POSIX solutions than using fork(), its use is
probably best suited for porting existing code and for writing portable code that must
run on a UNIX system that doesn't support the POSIX pthread_create() or posix_spawn()
API.
vfork()
The vfork() function (which should be called only from a single-threaded process) is
useful when the purpose of fork() would have been to create a new system context for
a call to one of the exec*() functions.
The vfork() function differs from fork() in that the child doesn't get a copy of the calling
process's data. Instead, it borrows the calling process's memory and thread of control
until a call to one of the exec*() functions is made. The calling process is suspended
while the child is using its resources.
The vfork() child can't return from the procedure that called vfork(), since the eventual
return from the parent vfork() would then return to a stack frame that no longer existed.
exec*()
The exec*() family of functions replaces the current process with a new process, loaded
from an executable file. Since the calling process is replaced, there can be no
successful return.
The following exec*() functions are defined:
execl()
Exec with the command line provided as inline arguments.
execle()
execl() with explicitly passed environment variables.
execlp()
execl() that follows the command search path.
execlpe()
execlp() with explicitly passed environment variables.
execv()
execl() with the command line pointed to by an array of pointers.
execve()
execv() with explicitly passed environment variables.
Process loading
Processes loaded from a filesystem using the exec*(), posix_spawn() or spawn() calls
are in ELF (Executable and Linking Format).
If the filesystem is on a block-oriented device, the code and data are loaded into main
memory. By default, the memory pages containing the binaries are demand-loaded,
but you can use the procnto -m option to change this; for more information, see
Locking memory (p. 136), later in this chapter.
If the filesystem is memory mapped (e.g., ROM/flash image), the code needn't be
loaded into RAM, but may be executed in place. This approach makes all RAM available
for data and stack, leaving the code in ROM or flash. In all cases, if the same process
is loaded more than once, its code will be shared.
Memory management
While some realtime kernels or executives provide support for memory protection in
the development environment, few provide protected memory support for the runtime
configuration, citing penalties in memory and performance as reasons. But with memory
protection becoming common on many embedded processors, the benefits of memory
protection far outweigh the very small penalties in performance for enabling it.
The key advantage gained by adding memory protection to embedded applications,
especially for mission-critical systems, is improved robustness.
With memory protection, if one of the processes executing in a multitasking environment
attempts to access memory that hasn't been explicitly declared or allocated for the
type of access attempted, the MMU hardware can notify the OS, which can then abort
the thread (at the failing/offending instruction).
This protects process address spaces from each other, preventing coding errors in
a thread in one process from damaging memory used by threads in other processes
or even in the OS. This protection is useful both for development and for the installed
runtime system, because it makes postmortem analysis possible.
During development, common coding errors (e.g., stray pointers and indexing beyond
array bounds) can result in one process/thread accidentally overwriting the data space
of another process. If the overwriting touches memory that isn't referenced again until
much later, you can spend hours of debugging (often using in-circuit emulators and
logic analyzers) in an attempt to find the guilty party.
With an MMU enabled, the OS can abort the process the instant the memory-access
violation occurs, providing immediate feedback to the programmer instead of
mysteriously crashing the system some time later. The OS can then provide the location
of the errant instruction in the failed process, or position a symbolic debugger directly
on this instruction.
[Figure: Virtual address translation: a page directory and page tables (indexed 0 to 1023) map a process's 4-GB linear address space onto physical memory.]
employ a hardware watchdog timer to detect if the software or hardware has lost
its mind, but this approach lacks the finesse of an MMU-assisted watchdog.
Hardware watchdog timers are usually implemented as a retriggerable monostable
timer attached to the processor reset line. If the system software doesn't strobe the
hardware timer regularly, the timer will expire and force a processor reset. Typically,
some component of the system software will check for system integrity and strobe the
timer hardware to indicate the system is sane.
Although this approach enables recovery from a lockup related to a software or hardware
glitch, it results in a complete system restart and perhaps significant downtime
while this restart occurs.
Software watchdog
When an intermittent software error occurs in a memory-protected system, the OS can
catch the event and pass control to a user-written thread instead of the memory dump
facilities. This thread can make an intelligent decision about how best to recover from
the failure, instead of forcing a full reset as the hardware watchdog timer would do.
The software watchdog could:
Abort the process that failed due to a memory access violation and simply restart
that process without shutting down the rest of the system.
Abort the failed process and any related processes, initialize the hardware to a
safe state, and then restart the related processes in a coordinated manner.
If the failure is very critical, perform a coordinated shutdown of the entire system
and sound an audible alarm.
The important distinction here is that we retain intelligent, programmed control of the
embedded system, even though various processes and threads within the control
software may have failed for various reasons. A hardware watchdog timer is still of use
to recover from hardware latch-ups, but for software failures we now have much
better control.
While performing some variation of these recovery strategies, the system can also
collect information about the nature of the software failure. For example, if the
embedded system contains or has access to some mass storage (flash memory, hard
drive, a network link to another computer with disk storage), the software watchdog
can generate a chronologically archived sequence of dump files. These dump files
could then be used for postmortem diagnostics.
Embedded control systems often employ these partial restart approaches to surviving
intermittent software failures without the operators experiencing any system downtime
or even being aware of these quick-recovery software failures. Since the dump files
are available, the developers of the software can detect and correct software problems
without having to deal with the emergencies that result when critical systems fail at
inconvenient times. If we compare this to the hardware watchdog timer approach and
the prolonged interruptions in service that result, it's obvious what our preference is!
Postmortem dump-file analysis is especially important for mission-critical embedded
systems. Whenever a critical system fails in the field, significant effort should be made
to identify the cause of the failure so that a fix can be engineered and applied to
other systems before they experience similar failures.
Dump files give programmers the information they need to fix the problem; without
them, programmers may have little more to go on than a customer's cryptic complaint
that the system crashed.
Quality control
By dividing embedded software into a team of cooperating, memory-protected processes
(containing threads), we can readily treat these processes as components to be used
again in new projects. Because of the explicitly defined (and hardware-enforced)
interfaces, these processes can be integrated into applications with confidence that
they won't disrupt the system's overall reliability. In addition, because the exact binary
image (not just the source code) of the process is being reused, we can better control
changes and instabilities that might have resulted from recompilation of source code,
relinking, new versions of development tools, header files, library routines, etc.
Since the binary image of the process is reused (with its behavior perhaps modified
by command-line options), the confidence we have in that binary module from acquired
experience in the field more easily carries over to new applications than if the binary
image of the process were changed.
As much as we strive to produce error-free code for the systems we deploy, the reality
of software-intensive embedded systems is that programming errors will end up in
released products. Rather than pretend these bugs don't exist (until the customer calls
to report them), we should adopt a mission-critical mindset. Systems should be
designed to be tolerant of, and able to recover from, software faults. Making use of
the memory protection delivered by integrated MMUs in the embedded systems we
build is a good step in that direction.
Full-protection model
Our full-protection model relocates all code in the image into a new virtual space,
enabling the MMU hardware and setting up the initial page-table mappings. This
allows procnto to start in a correct, MMU-enabled environment. The process manager
will then take over this environment, changing the mapping tables as needed by the
processes it starts.
The downside is that the cost of messaging between processes will increase due to the increased complexity of obtaining addressability between two completely private address spaces.
Figure: Full-protection VM. Private memory space starts at 0 on x86 and ARM processors; each user process (User process 1, 2, and 3) and the system process procnto has its own private address space from 0 to 3.5G, with the region from 3.5G to 4G used by the system.
Locking memory
The QNX Neutrino RTOS supports POSIX memory locking, so that a process can avoid the latency of fetching a page of memory by locking the memory so that the page is memory-resident (i.e., it remains in physical memory).
The levels of locking are as follows:
Unlocked
Unlocked memory can be paged in and out. Memory is allocated when it's
mapped, but page table entries aren't created. The first attempt to access
the memory fails, and the thread stays in the WAITPAGE state while the
memory manager initializes the memory and creates the page table entries.
Failure to initialize the page results in the receipt of a SIGBUS signal.
Locked
Locked memory may not be paged in or out. Page faults can still occur on
access or reference, to maintain usage and modification statistics. Pages
that you think are PROT_WRITE are still actually PROT_READ. This is so
that, on the first write, the kernel may be alerted that a MAP_PRIVATE page
now is different from the shared backing store, and must be privatized.
To lock and unlock a portion of a thread's memory, call mlock() and
munlock(); to lock and unlock all of a thread's memory, call mlockall() and
munlockall(). The memory remains locked until the process unlocks it, exits,
or calls an exec*() function. If the process calls fork(), a posix_spawn*()
function, or a spawn*() function, the memory locks are released in the child
process.
More than one process can lock the same (or overlapping) region; the memory
remains locked until all the processes have unlocked it. Memory locks don't
stack; if a process locks the same region more than once, unlocking it once
undoes all of the process's locks on that region.
To lock all memory for all applications, specify the -ml option for procnto.
Thus all pages are at least initialized (if still set only to PROT_READ).
Superlocked
(A QNX Neutrino extension) No faulting is allowed at all; all memory must
be initialized and privatized, and the permissions set, as soon as the memory
is mapped. Superlocking covers the thread's whole address space.
To superlock memory, obtain I/O privileges by:
1. Enabling the PROCMGR_AID_IO ability. For more information, see
procmgr_ability().
2. Calling ThreadCtl(), specifying the _NTO_TCTL_IO flag:
ThreadCtl( _NTO_TCTL_IO, 0 );
To superlock all memory for all applications, specify the -mL option for
procnto.
For MAP_LAZY mappings, memory isn't allocated or mapped until the memory is first referenced for any of the above types. Once it's been referenced, it obeys the above rules; it's a programmer error to touch a MAP_LAZY area in a critical region (where interrupts are disabled or in an ISR) that hasn't already been referenced.
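Returning to the locking calls described above, here's a minimal sketch of locking a small buffer so that its pages stay resident (the buffer size is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;

    /* Allocate a page-aligned, anonymous region. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, NOFD, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Lock the pages so that they remain memory-resident. */
    if (mlock(buf, len) == -1) {
        perror("mlock");
        return EXIT_FAILURE;
    }

    memset(buf, 0, len);   /* use the locked memory */

    munlock(buf, len);     /* locks are also released on exit or exec*() */
    munmap(buf, len);
    return EXIT_SUCCESS;
}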
Defragmenting physical memory
Fragmentation of in-use memory matters mainly on architectures that support large page sizes, where keeping memory contiguous lets the OS use larger page mappings than might otherwise occur. (Some architectures don't support large page sizes; on these architectures, fragmentation of in-use memory is irrelevant.)
If free memory is fragmented, it prevents an application from allocating contiguous
memory, which in turn might lead to complete failure of the application.
To defragment free memory, the memory manager swaps memory that's in use for
memory that's free, in such a way that the free memory blocks coalesce into larger
blocks that are sufficient to satisfy a request for contiguous memory.
When an application allocates memory, it's provided by the operating system in
quantums, 4-KB blocks of memory that exist on 4-KB boundaries. The operating
system programs the MMU so that the application can reference the physical block
of memory through a virtual address; during operation, the MMU translates a virtual
address into a physical address.
For example, a request for 16 KB of memory is satisfied by allocating four 4-KB
quantums. The operating system sets aside the four physical blocks for the application
and configures the MMU to ensure that the application can reference them through
a 16-KB contiguous virtual address. However, these blocks might not be physically
contiguous; the operating system can arrange the MMU configuration (the virtual to
physical mapping) so that non-contiguous physical addresses are accessed through
contiguous virtual addresses.
The task of defragmentation consists of changing existing memory allocations and
mappings to use different underlying physical pages. By swapping around the underlying
physical quantums, the OS can consolidate the fragmented free blocks into contiguous
runs. However, it's careful to avoid moving certain types of memory where the
virtual-to-physical mapping can't safely be changed:
Memory allocated by the kernel and addressed through the one-to-one mapping
area can't be moved, because the one-to-one mapping area defines the mapping
of virtual to physical addresses, and the OS can't change the physical address
without also changing the virtual address.
Memory that's locked by the application (see mlock() and mlockall()) can't be
moved: by locking the memory, the application is indicating that moving the memory
isn't acceptable.
An application that runs with I/O privileges (see the _NTO_TCTL_IO flag for
ThreadCtl()) has all pages locked by default, because device drivers often require
physical addresses.
Pages of memory that have mutex objects on them aren't currently moved. While
it's possible to move these pages, mutex objects are registered with the kernel
through their physical addresses, so moving a page with a mutex on it would require
rehashing the mutex object in the kernel.
There are other times when memory can't be moved; see Automatically marking
memory as unmovable, below.
Defragmentation is done, if necessary, when an application allocates a piece of
contiguous memory. The application does this through the mmap() call, providing
MAP_PHYS | MAP_ANON flags. If it isn't possible to satisfy a MAP_PHYS allocation
with contiguous memory, what happens depends on whether defragmentation is
disabled or enabled:
If it's disabled, mmap() fails.
If it's enabled, the memory manager runs a memory-defragmentation algorithm
that attempts to rearrange memory mappings across the system in order to allow
the MAP_PHYS allocation to be satisfied.
During the memory defragmentation, the thread calling mmap() is blocked.
Compaction can take a significant amount of time (particularly on systems
with large amounts of memory), but other system activities are mostly
unaffected.
Since other system tasks are running simultaneously, the defragmentation
algorithm takes into account that memory mappings can change while the
algorithm is running.
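For example, a driver that needs a physically contiguous buffer (say, for DMA) might request it roughly like this; the size and the PROT_NOCACHE flag here are only illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64 * 1024;

    /* Ask for anonymous memory that's physically contiguous. If the request
       can't be satisfied and defragmentation is disabled, mmap() fails; if
       defragmentation is enabled, the memory manager may rearrange mappings
       to satisfy it. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_NOCACHE,
                     MAP_PHYS | MAP_ANON, NOFD, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_PHYS|MAP_ANON)");
        return EXIT_FAILURE;
    }

    /* ... use the buffer (e.g., give its physical address to a device) ... */

    munmap(buf, len);
    return EXIT_SUCCESS;
}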
Automatically marking memory as unmovable
already unmovable, so this option is irrelevant. It's also relevant only if the memory
defragmentation feature is enabled.
This option is disabled by default. If you find an application that behaves poorly, you
can enable automatic marking as a workaround until the application is corrected.
Pathname management
I/O resources aren't built into the microkernel, but are instead provided by resource
manager processes that may be started dynamically at runtime. The procnto manager
allows resource managers, through a standard API, to adopt a subset of the pathname
space as a domain of authority to administer.
As other resource managers adopt their respective domains of authority, procnto
becomes responsible for maintaining a pathname tree to track the processes that own
portions of the pathname space. An adopted pathname is sometimes referred to as a
prefix because it prefixes any pathnames that lie beneath it; prefixes can be arranged
in a hierarchy called a prefix tree. The adopted pathname is also called a mountpoint, because that's where a server mounts into the pathname space.
This approach to pathname space management is what allows QNX Neutrino to preserve
the POSIX semantics for device and file access, while making the presence of those
services optional for small embedded systems.
At startup, procnto populates the pathname space with the following pathname
prefixes:
Prefix         Description
/proc/boot/    Some of the files from the boot image, presented as a flat filesystem
/proc/pid      The running processes, each represented by its process ID (PID)
/dev/zero      A device that always returns zero; used for allocating zero-filled pages via mmap()
/dev/mem       A device that represents all physical memory
Resolving pathnames
When a process opens a file, the POSIX-compliant open() library routine first sends
the pathname to procnto, where the pathname is compared against the prefix tree
to determine which resource managers should be sent the open() message.
The prefix tree may contain identical or partially overlapping regions of authority: multiple servers can register the same prefix. If the regions are identical,
the order of resolution can be specified (see Ordering mountpoints (p. 143)). If the
regions are overlapping, the responses from the path manager are ordered with the
longest prefixes first; for prefixes of equal length, the same specified order of resolution
applies as for identical regions.
For example, suppose we have these prefixes registered:
Prefix       Description
/            QNX 4 disk-based filesystem (fs-qnx4.so)
/dev/ser1    Serial device manager (devc-ser*)
/dev/ser2    Serial device manager (devc-ser*)
/dev/hd0     Raw block device for the first hard disk (devb-eide.so)
The filesystem manager has registered a prefix for a mounted QNX 4 filesystem (i.e.,
/). The block device driver has registered a prefix for a block special file that represents
an entire physical hard drive (i.e., /dev/hd0). The serial device manager has registered
two prefixes for the two PC serial ports.
The following table illustrates the longest-match rule for pathname resolution:
This pathname:       matches:      and resolves to:
/dev/ser1            /dev/ser1     devc-ser*
/dev/ser2            /dev/ser2     devc-ser*
/dev/ser             /             fs-qnx4.so
/dev/hd0             /dev/hd0      devb-eide.so
/usr/jhsmith/test    /             fs-qnx4.so
Ordering mountpoints
Generally the order of resolving a filename is the order in which you mounted the
filesystems at the same mountpoint (i.e., new mounts go on top of or in front of any
existing ones). You can specify the order of resolution when you mount the filesystem.
For example, you can use:
the before and after keywords for block I/O (devb-*) drivers, in the blk options
the -Z b and -Z a options to fs-cifs, fs-nfs2, and fs-nfs3
You can also use the -o option to mount with these keywords:
before
Mount the filesystem so that it's resolved before any other filesystems
mounted at the same pathname (in other words, it's placed in front of any
existing mount). When you access a file, the system looks on this filesystem
first.
after
Mount the filesystem so that it's resolved after any other filesystems mounted
at the same pathname (in other words, it's placed behind any existing
mounts). When you access a file, the system looks on this filesystem last,
and only if the file wasn't found on any other filesystems.
If you specify the appropriate before option, the filesystem floats in front of any other filesystems mounted at the same mountpoint, except those that you later mount with before. If you specify after, the filesystem goes behind any other filesystems mounted at the same mountpoint, except those that are already mounted with after.
So, the search order for these filesystems is:
1. those mounted with before
2. those mounted with no flags
3. those mounted with after
with each list searched in order of mount requests. The first server to claim the name gets it. You would typically use after to have a filesystem wait at the back and pick up anything that no one else is handling, and before to make sure a filesystem looks at filenames first.
Single-device mountpoints
Consider an example involving three servers:
Server A
A QNX 4 filesystem. Its mountpoint is /. It contains the files bin/true
and bin/false.
Server B
A flash filesystem. Its mountpoint is /bin. It contains the files ls and
echo.
Server C
A single device that generates numbers. Its mountpoint is /dev/random.
At this point, the process manager's internal mount table would look like this:
Mountpoint     Server
/              Server A (QNX 4 filesystem)
/bin           Server B (flash filesystem)
/dev/random    Server C (device)
Of course, each Server name is actually an abbreviation for the nd,pid,chid for that
particular server channel.
Now suppose a client wants to send a message to Server C. The client's code might
look like this:
int fd;
fd = open("/dev/random", ...);
read(fd, ...);
close(fd);
In this case, the C library will ask the process manager for the servers that could
potentially handle the path /dev/random. The process manager would return a list
of servers:
Server C (most likely; longest path match)
Server A (least likely; shortest path match)
From this information, the library will then contact each server in turn and send it an
open message, including the component of the path that the server should validate:
1. Server C receives a null path, since the request came in on the same path as the
mountpoint.
2. Server A receives the path dev/random, since its mountpoint was /.
As soon as one server positively acknowledges the request, the library won't contact
the remaining servers. This means Server A is contacted only if Server C denies the
request.
This process is fairly straightforward with single device entries, where the first server
is generally the server that will handle the request. Where it becomes interesting is in
the case of unioned filesystem mountpoints.
with a regular open, then the normal resolution procedure takes place and only one
server is accessed.
Symbolic prefixes
We've discussed prefixes that map to a resource manager. A second form of prefix,
known as a symbolic prefix, is a simple string substitution for a matched prefix.
You create symbolic prefixes using the POSIX ln (link) command, which is typically used to create hard or symbolic links on a filesystem (the -s option creates a symbolic link). If you also specify the -P option, then a symbolic link is created in the in-memory prefix space of procnto.
Command                               Description
ln -s existing_file symbolic_link     Create a filesystem symbolic link
ln -Ps existing_file symbolic_link    Create a prefix tree symbolic link
Note that a prefix tree symbolic link will always take precedence over a filesystem
symbolic link.
For example, assume you're running on a machine that doesn't have a local filesystem.
However, there's a filesystem on another node (say neutron) that you wish to access
as /bin. You accomplish this using the following symbolic prefix:
ln -Ps /net/neutron/bin /bin
This will cause /bin to be mapped into /net/neutron/bin. For example, /bin/ls
will be replaced with the following:
/net/neutron/bin/ls
This new pathname will again be applied against the prefix tree, but this time the
prefix matched will be /net, which will point to lsm-qnet. The lsm-qnet resource
manager will then resolve the neutron component, and redirect further resolution
requests to the node called neutron. On node neutron, the rest of the pathname
(i.e. /bin/ls) will be resolved against the prefix space on that node. This will resolve
to the filesystem manager on node neutron, where the open() request will be directed.
With just a few characters, this symbolic prefix has allowed us to access a remote
filesystem as though it were local.
It's not necessary to run a local filesystem process to perform the redirection. A diskless
workstation's prefix tree might look something like this:
/              lsm-qnet.so
/dev/console   devc-con
/dev/ser1      devc-ser...
/dev/ser2      devc-ser...
With this prefix tree, local devices such as /dev/ser1 or /dev/console will be
routed to the local character device manager, while requests for other pathnames will
be routed to the remote filesystem.
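Symbolic prefixes are also handy for giving devices stable, descriptive names. For example (the choice of port is arbitrary), you could map a modem name onto a serial port:
ln -Ps /dev/ser1 /dev/modem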
Any request to open /dev/modem will be replaced with /dev/ser1. This mapping
would allow the modem to be changed to a different serial port simply by changing
the symbolic prefix and without affecting any applications.
Relative pathnames
Pathnames need not start with a slash; in such cases, the path is considered relative to the current working directory.
The OS maintains the current working directory as a character string. Relative
pathnames are always converted to full network pathnames by prepending the current
working directory string to the relative pathname.
Note that different behaviors result when your current working directory starts with a
slash versus starting with a network root.
Network root
If the current working directory begins with a network root in the form
/net/node_name, it's said to be specific and locked to the pathname space of the
specified node. If you don't specify a network root, the default one is prepended.
For example, this command:
cd /net/percy
is an example of the first (specific) form, and would lock future relative pathname
evaluation to be on node percy, no matter what your default network root happens
to be. Subsequently entering cd dev would put you in /net/percy/dev.
On the other hand, this command:
cd /
would be of the second form, where the default network root would affect the relative
pathname resolution. For example, if your default network root were /net/florence,
then entering cd dev would put you in /net/florence/dev. Since the current
working directory doesn't start with a node override, the default network root is
prepended to create a fully specified network pathname.
To run a command with a specific network root, use the on command, specifying the
-f option:
on -f /net/percy command
This runs the given command with /net/percy as the network root; that is, it searches for the command (and any files with relative paths specified as arguments) on /net/percy and runs the command on /net/percy. In contrast, this:
on -n /net/percy command
searches for the given command (and any files with relative paths) on your local node and runs the command on /net/percy.
In a program, you can specify a network root when you call chroot().
This really isn't as complicated as it may seem. Most of the time, you don't specify a
network root, and everything you do will simply work within your namespace (defined
by your default network root). Most users will log in, accept the normal default network
root (i.e., the namespace of their own node), and work within that environment.
A note about cd
In some traditional UNIX systems, the cd (change directory) command modifies the
pathname given to it if that pathname contains symbolic links. As a result, the
pathname of the new current working directory may differ from the one given to cd.
In QNX Neutrino, however, cd doesn't modify the pathname, aside from collapsing .. references. For example:
cd /usr/home/dan/test/../doc
would result in a current working directory (which you can display with pwd) of /usr/home/dan/doc, even if some of the elements in the pathname were symbolic links.
For more information about symbolic links and .. references, see QNX 4 filesystem
in the Working with Filesystems chapter of the QNX Neutrino User's Guide.
File descriptor namespace
Figure: The file descriptor namespace is a sparse array; each FD, together with the server connection ID (SCOID), maps to an open control block (OCB) in the I/O manager (server) process.
Figure: File descriptors in Process A and Process B referring to open control blocks for /tmp/file in a server process.
Several file descriptors in one or more processes can refer to the same OCB. This is
accomplished by two means:
A process may use the dup(), dup2(), or fcntl() functions to create a duplicate file
descriptor that refers to the same OCB.
When a new process is created via vfork(), fork(), posix_spawn(), or spawn(), all
open file descriptors are by default inherited by the new process; these inherited
descriptors refer to the same OCBs as the corresponding file descriptors in the
parent process.
When several FDs refer to the same OCB, then any change in the state of the OCB is
immediately seen by all processes that have file descriptors linked to the same OCB.
For example, if one process uses the lseek() function to change the position of the
seek point, then reading or writing takes place from the new position no matter which
linked file descriptor is used.
The following diagram shows two processes in which one opens a file twice, then does
a dup() to get a third FD. The process then creates a child that inherits all open files.
Figure: A parent process with three file descriptors (0, 1, and 2): two from separate open() calls on /tmp/file and one from dup(). A child process created by the parent inherits all three FDs, which refer to the same open control blocks in the server process as the parent's descriptors.
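The same sharing rules are easy to see in code; here's a small sketch (the pathname is only an example):

#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Two separate opens of the same file create two OCBs... */
    int fd0 = open("/tmp/file", O_RDWR | O_CREAT, 0644);
    int fd1 = open("/tmp/file", O_RDWR);

    /* ...while dup() creates a third FD that shares fd0's OCB. */
    int fd2 = dup(fd0);

    if (fork() == 0) {
        /* The child inherits fd0, fd1, and fd2; they refer to the same OCBs
           as the parent's descriptors, so this seek is also seen by the
           parent through fd0 and fd2. */
        lseek(fd2, 100, SEEK_SET);
        _exit(0);
    }
    wait(NULL);

    /* In the parent, lseek(fd0, 0, SEEK_CUR) would now report 100. */
    close(fd2);
    close(fd1);
    close(fd0);
    return 0;
}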
Chapter 7
Dynamic Linking
In a typical system, a number of programs will be running. Each program relies on a
number of functions, some of which will be standard C library functions, like printf(),
malloc(), write(), etc.
If every program uses the standard C library, it follows that each program would normally
have a unique copy of this particular library present within it. Unfortunately, this
results in wasted resources. Since the C library is common, it makes more sense to
have each program reference the common instance of that library, instead of having
each program contain a copy of the library. This approach yields several advantages,
not the least of which is the savings in terms of total system memory required.
Statically linked
The term statically linked means that the program and the particular library that it's linked against are combined together by the linker at link time.
This means that the binding between the program and the particular library is fixed and known at link time, well in advance of the program's ever running. It also means that we can't change this binding, unless we relink the program with a new version of the library.
You might consider linking a program statically in cases where you aren't sure whether the correct version of a library will be available at runtime, or if you're testing a new version of a library that you don't yet want to install as shared.
Programs that are linked statically are linked against archives of objects (libraries)
that typically have the extension of .a. An example of such a collection of objects is
the standard C library, libc.a.
Dynamically linked
The term dynamically linked means that the program and the particular library it references aren't combined together by the linker at link time.
Instead, the linker places information into the executable that tells the loader which shared object module the code is in and which runtime linker should be used to find and bind the references. This means that the binding between the program and the shared object is done at runtime: before the program starts, the appropriate shared objects are found and bound.
This type of program is called a partially bound executable, because it isn't fully resolved: the linker, at link time, didn't cause all the referenced symbols in the program to be associated with specific code from the library. Instead, the linker simply said: "This program calls some functions within a particular shared object, so I'll just make a note of which shared object these functions are in, and continue on." Effectively, this defers the binding until runtime.
Programs that are linked dynamically are linked against shared objects that have the
extension .so. An example of such an object is the shared object version of the
standard C library, libc.so.
You use a command-line option to the compiler driver qcc to tell the tool chain whether
you're linking statically or dynamically. This command-line option then determines
the extension used (either .a or .so).
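For example (the file names are arbitrary, and the exact options depend on your tool-chain version), a command such as:
qcc -o hello hello.c
produces a dynamically linked program that references shared objects such as libc.so, while something like:
qcc -static -o hello hello.c
links the program against the static archives (e.g., libc.a).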
Figure 37 contrasts the two views of an ELF object file: in the linking view, the ELF header is followed by Section 1 through Section n; in the execution view, it's followed by the program's segments (Segment 1, Segment 2, and so on).
Figure 37: Object file format: linking view and execution view.
ELF without COFF
Most implementations of ELF loaders are derived from COFF (Common
Object File Format) loaders; they use the linking view of the ELF objects at
load time. This is inefficient because the program loader must load the
executable using sections. A typical program could contain a large number
of sections, each of which would have to be located in the program and
loaded into memory separately.
QNX Neutrino, however, doesn't rely at all on the COFF technique of loading
sections. When developing our ELF implementation, we worked directly from
the ELF spec and kept efficiency paramount. The ELF loader uses the
execution view of the program. By using the execution view, the task of
the loader is greatly simplified: all it has to do is copy to memory the load
segments (usually two) of the program or library. As a result, process creation
and library loading operations are much faster.
Figure: A process's address space, showing the program's text, data, and heap above the process base address, thread stacks (each with a guard page), and shared memory regions.
Runtime linker
The runtime linker is invoked when a program that was linked against a shared object
is started or when a program requests that a shared object be dynamically loaded. The
runtime linker is contained within the C runtime library.
The runtime linker performs several tasks when loading a shared library (.so file):
1. If the requested shared library isn't already loaded in memory, the runtime linker
loads it:
If the shared library name is fully qualified (i.e., begins with a slash), it's loaded
directly from the specified location. If it can't be found there, no further searches
are performed.
If it's not a fully qualified pathname, the runtime linker searches for it as follows:
1. If the executable's dynamic section contains a DT_RPATH tag, then the path
specified by DT_RPATH is searched.
2. If the shared library isn't found, the runtime linker searches for it in the
directories specified by LD_LIBRARY_PATH only if the program isn't marked
as setuid.
3. If the shared library still isn't found, the runtime linker searches the default library search path, as specified by the LD_LIBRARY_PATH environment variable passed to procnto (i.e., the CS_LIBPATH configuration string). If none has been specified, the default library path is set to the image filesystem's path.
2. Once the requested shared library is found, it's loaded into memory. For ELF shared
libraries, this is a very efficient operation: the runtime linker simply needs to use
the mmap() call twice to map the two load segments into memory.
3. The shared library is then added to the internal list of all libraries that the process
has loaded. The runtime linker maintains this list.
4. The runtime linker then decodes the dynamic section of the shared object.
This dynamic section provides information to the linker about other libraries that this
library was linked against. It also gives information about the relocations that need to
be applied and the external symbols that need to be resolved. The runtime linker will
first load any other required shared libraries (which may themselves reference other
shared libraries). It will then process the relocations for each library. Some of these
relocations are local to the library, while others require the runtime linker to resolve
a global symbol. In the latter case, the runtime linker will search through the list of
libraries for this symbol. In ELF files, hash tables are used for the symbol lookup, so
they're very fast. The order in which libraries are searched for symbols is very important,
as we'll see in the section on Symbol name resolution (p. 160) below.
Once all relocations have been applied, any initialization functions that have been
registered in the shared library's init section are called. This is used in some
implementations of C++ to call global constructors.
The program can also determine the symbol associated with a given address by using
the dladdr() call. Finally, when the process no longer needs the shared library, it can
call dlclose() to unload the library from memory.
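For example, a program might load a shared object on demand roughly as follows (the library and symbol names are made up for illustration):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Ask the runtime linker to load the shared object now. */
    void *handle = dlopen("libexample.so", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    /* Resolve a symbol in the library and call it. */
    int (*fn)(int) = (int (*)(int))dlsym(handle, "example_fn");
    if (fn != NULL) {
        printf("example_fn(42) = %d\n", fn(42));
    }

    /* Unload the library when it's no longer needed. */
    dlclose(handle);
    return 0;
}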
Chapter 8
Resource Managers
To give the QNX Neutrino RTOS a great degree of flexibility, to minimize the runtime
memory requirements of the final system, and to cope with the wide variety of devices
that may be found in a custom embedded system, the OS allows user-written processes
to act as resource managers that can be started and stopped dynamically.
Resource managers are typically responsible for presenting an interface to various
types of devices. This may involve managing actual hardware devices (like serial ports,
parallel ports, network cards, and disk drives) or virtual devices (like /dev/null, a
network filesystem, and pseudo-ttys).
In other operating systems, this functionality is traditionally associated with device
drivers. But unlike device drivers, resource managers don't require any special
arrangements with the kernel. In fact, a resource manager looks just like any other
user-level program.
and the resource manager would respond with the appropriate statistics.
You could also use command-line utilities for a robot-arm driver. The driver could
register the name, /dev/robot/arm/angle, and interpret any writes to this
device as the angle to set the robot arm to. To test the driver from the command
line, you'd type:
echo 87 >/dev/robot/arm/angle
The echo utility opens /dev/robot/arm/angle and writes the string (87) to
it. The driver handles the write by setting the robot arm to 87 degrees. Note that
this was accomplished without writing a special tester program.
Another example would be names such as /dev/robot/registers/r1, r2,....
Reading from these names returns the contents of the corresponding registers;
writing to these names sets the corresponding registers to the given values.
Even if all of your other IPC is done via some non-POSIX API, it's still worth having one thread written as a resource manager that responds to reads and writes, so that you can do the kinds of things shown above.
Communication via native IPC
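For example, when a client program opens a serial port with a call such as:
fd = open("/dev/ser1", O_RDWR);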
the client's C library will construct an io_open message, which it then sends to the
devc-ser* resource manager via IPC.
Some time later, when the client program executes:
read (fd, buf, BUFSIZ);
the client's C library constructs an io_read message, which is then sent to the
resource manager.
A key point is that all communications between the client program and the resource
manager are done through native IPC messaging. This allows for a number of unique
features:
A well-defined interface to application programs. In a development environment,
this allows a very clean division of labor for the implementation of the client side
and the resource manager side.
A simple interface to the resource manager. Since all interactions with the resource
manager go through native IPC, and there are no special back door hooks or
arrangements with the OS, the writer of a resource manager can focus on the task
at hand, rather than worry about all the special considerations needed in other
operating systems.
Free network transparency. Since the underlying native IPC messaging mechanism
is inherently network-distributed without any additional effort required by the client
or server (resource manager), programs can seamlessly access resources on other
nodes in the network without even being aware that they're going over a network.
All QNX Neutrino device drivers and filesystems are implemented as resource
managers. This means that everything that a native QNX Neutrino device
driver or filesystem can do, a user-written resource manager can do as well.
Consider FTP filesystems, for instance. Here a resource manager would take over a
portion of the pathname space (e.g., /ftp) and allow users to cd into FTP sites to
get files. For example, cd /ftp/rtfm.mit.edu/pub would connect to the FTP
site rtfm.mit.edu and change directory to /pub. After that point, the user could
open, edit, or copy files.
Application-specific filesystems would be another example of a user-written resource
manager. Given an application that makes extensive use of disk-based files, a custom
tailored filesystem can be written that works with that application and delivers superior
performance.
The possibilities for custom resource managers are limited only by the application
developer's imagination.
Message types
Architecturally, there are two categories of messages that a resource manager will
receive:
connect messages
I/O messages
A connect message is issued by the client to perform an operation based on a pathname
(e.g., an io_open message). This may involve performing operations such as
permission checks (does the client have the correct permission to open this device?)
and setting up a context for that request.
An I/O message is one that relies upon this context (created between the client and
the resource manager) to perform subsequent processing of I/O messages (e.g.,
io_read).
There are good reasons for this design. It would be inefficient to pass the full pathname
for each and every read() request, for example. The io_open handler can also perform
tasks that we want done only once (e.g., permission checks), rather than with each
I/O message. Also, when the read() has read 4096 bytes from a disk file, there may
be another 20 megabytes still waiting to be read. Therefore, the read() function would
need to have some context information telling it the position within the file it's reading
from.
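Consider, for example, a client that opens a device, duplicates the file descriptor twice, and later closes all three descriptors (a sketch; the device name is only illustrative):

int fd1, fd2, fd3;

fd1 = open("/dev/ser1", O_RDWR);
fd2 = dup(fd1);
fd3 = dup(fd1);
 ...
close(fd3);
close(fd2);
close(fd1);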
The client would generate an io_open message for the first open(), and then two
io_dup messages for the two dup() calls. Then, when the client executed the close()
calls, three io_close messages would be generated.
Since the dup() functions generate duplicates of the file descriptors, new context
information should not be allocated for each one. When the io_close messages
arrive, because no new context has been allocated for each dup(), no release of the
memory by each io_close message should occur either! (If it did, the first close
would wipe out the context.)
The resource manager shared library provides default handlers that keep track of the
open(), dup(), and close() messages and perform work only for the last close (i.e., the
third io_close message in the example above).
Dispatch functions
The OS provides a set of dispatch_* functions that:
allow a common blocking point for managers and clients that need to support
multiple message types (e.g., a resource manager could handle its own private
message range).
provide a flexible interface for message types that isn't tied to the resource manager
(for clean handling of private messages and pulse codes)
decouple the blocking and handler code from threads. You can implement the
resource manager event loop in your main code. This decoupling also makes for
easier debugging, because you can put a breakpoint between the block function
and the handler function.
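To give a feel for how these pieces fit together, here's a rough sketch of a minimal, single-threaded resource manager built on the dispatch layer and the iofunc default handlers; the pathname /dev/sample and the message-buffer sizes are only examples:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/iofunc.h>
#include <sys/dispatch.h>

static resmgr_connect_funcs_t connect_funcs;
static resmgr_io_funcs_t      io_funcs;
static iofunc_attr_t          attr;

int main(void)
{
    dispatch_t         *dpp;
    dispatch_context_t *ctp;
    resmgr_attr_t       resmgr_attr;
    int                 id;

    /* Create the dispatch structure: the common blocking point. */
    if ((dpp = dispatch_create()) == NULL) {
        perror("dispatch_create");
        return EXIT_FAILURE;
    }

    memset(&resmgr_attr, 0, sizeof resmgr_attr);
    resmgr_attr.nparts_max   = 1;
    resmgr_attr.msg_max_size = 2048;

    /* Start with the library's default connect and I/O handlers. */
    iofunc_func_init(_RESMGR_CONNECT_NFUNCS, &connect_funcs,
                     _RESMGR_IO_NFUNCS, &io_funcs);
    iofunc_attr_init(&attr, S_IFNAM | 0666, NULL, NULL);

    /* Adopt a portion of the pathname space. */
    id = resmgr_attach(dpp, &resmgr_attr, "/dev/sample", _FTYPE_ANY,
                       0, &connect_funcs, &io_funcs, &attr);
    if (id == -1) {
        perror("resmgr_attach");
        return EXIT_FAILURE;
    }

    if ((ctp = dispatch_context_alloc(dpp)) == NULL) {
        perror("dispatch_context_alloc");
        return EXIT_FAILURE;
    }

    /* The event loop: block for a message, then call the right handler. */
    while (1) {
        if ((ctp = dispatch_block(ctp)) == NULL) {
            perror("dispatch_block");
            return EXIT_FAILURE;
        }
        dispatch_handler(ctp);
    }
}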
For more information, see the Resource Managers chapter of Get Programming with
the QNX Neutrino RTOS, and the Writing a Resource Manager guide.
Combine messages
In order to conserve network bandwidth and to provide support for atomic operations,
the OS supports combine messages. A combine message is constructed by the client's
C library and consists of a number of I/O and/or connect messages packaged together
into one.
For example, the function readblock() allows a thread to atomically perform an lseek()
and read() operation. This is done in the client library by combining the io_lseek
and io_read messages into one. When the resource manager shared library receives
the message, it will process both the io_lseek and io_read messages, effectively
making that readblock() function behave atomically.
Combine messages are also useful for the stat() function. A stat() call can be
implemented in the client's library as an open(), fstat(), and close(). Instead of
generating three separate messages (one for each of the component functions), the
library puts them together into one contiguous combine message. This boosts
performance, especially over a networked connection, and also simplifies the resource
manager, which doesn't need a connect function to handle stat().
The resource manager shared library takes care of the issues associated with breaking
out the individual components of the combine message and passing them to the various
handler functions supplied. Again, this minimizes the effort associated with writing a
resource manager.
Figure: A resource manager process and its clients. Client processes A, B, and C each have an OCB (OCB A, B, and C) obtained by opening /dev/path1 or /dev/path2 on the resource manager's channel. Each OCB points to an attribute structure (one for /dev/path1, one for /dev/path2), and an optional mount structure (one per mountpoint) describes the mounted resource. The resource manager's threads handle the requests through the resmgr library.
An attribute superstructure (iofunc_attr_t *) consists of the default members followed by any extensions.
Figure 41: Encapsulating the default data structures used by resource managers.
The library contains iofunc_*() default handlers for these client functions:
chmod()
chown()
close()
devctl()
fpathconf()
fseek()
fstat()
lock()
lseek()
mmap()
open()
pathconf()
stat()
utime()
Summary
By supporting pathname space mapping, by having a well-defined interface to resource
managers, and by providing a set of libraries for common resource manager functions,
the QNX Neutrino RTOS offers the developer unprecedented flexibility and simplicity in developing drivers for new hardware, a critical feature for many embedded systems.
For more details on developing a resource manager, see the Resource Managers chapter
of Get Programming with the QNX Neutrino RTOS, and the Writing a Resource Manager
guide.
Chapter 9
Filesystems
The QNX Neutrino RTOS provides a rich variety of filesystems. Like most
service-providing processes in the OS, these filesystems execute outside the kernel;
applications use them by communicating via messages generated by the shared-library
implementation of the POSIX API.
Most of these filesystems are resource managers as described in this book. Each
filesystem adopts a portion of the pathname space (called a mountpoint) and provides
filesystem services through the standard POSIX API (open(), close(), read(), write(),
lseek(), etc.). Filesystem resource managers take over a mountpoint and manage the
directory structure below it. They also check the individual pathname components for
permissions and for access authorizations.
This implementation means that:
Filesystems may be started and stopped dynamically.
Multiple filesystems may run concurrently.
Applications are presented with a single unified pathname space and interface,
regardless of the configuration and number of underlying filesystems.
A filesystem running on one node is transparently accessible from any other node.
Filesystem classes
The many filesystems available can be categorized into the following classes:
Image (p. 185)
A special filesystem that presents the modules in the image and is always
present. Note that the procnto process automatically provides an image
filesystem and a RAM filesystem.
Block
Traditional filesystems that operate on block devices like hard disks and
CD-ROM drives. This includes the Power-Safe filesystem (p. 192), QNX 4
(p. 191), DOS (p. 201), and CD-ROM (p. 204) filesystems.
Flash
Nonblock-oriented filesystems designed explicitly for the characteristics of
flash memory devices. For NOR devices, use the FFS3 (p. 205) filesystem;
for NAND, use ETFS (p. 187).
Network
Filesystems that provide network file access to the filesystems on remote
host computers. This includes the NFS (p. 210) and CIFS (p. 211) (SMB)
filesystems.
Virtual (p. 216)
QNX Neutrino provides an Inflator virtual filesystem, a resource manager
that sits in front of other filesystems and uncompresses files that were
previously compressed (using the deflate utility).
Figure: QNX Neutrino filesystem layering. The block filesystems (fs-qnx4.so, fs-dos.so, fs-cd.so) ride on io-blk.so and the CAM drivers (cam-disk.so, cam-cdrom.so) inside devb-* driver processes; the network filesystems (fs-cifs, fs-nfs2) use io-pkt and devn-*.so drivers; devf-* provides the flash filesystems, and procnto provides the image and RAM filesystems.
io-blk
Most of the filesystem shared libraries ride on top of the Block I/O module.
The io-blk.so module also acts as a resource manager and exports a block-special
file for each physical device. For a system with two hard disks the default files would
be:
/dev/hd0
First hard disk.
/dev/hd1
Second hard disk.
These files represent each raw disk and may be accessed using all the normal POSIX
file primitives (open(), close(), read(), write(), lseek(), etc.). Although the io-blk
module can support a 64-bit offset on seek, the driver interface is 32-bit, allowing
access to 2-terabyte disks.
Partitions
The QNX Neutrino RTOS complies with the de facto industry standard for partitioning
a disk.
This allows a number of filesystems to share the same physical disk. Each partition
is also represented as a block-special file, with the partition type appended to the
filename of the disk it's located on. In the above two-disk example, if the first disk
had a QNX 4 partition and a DOS partition, while the second disk had only a QNX 4
partition, then the default files would be:
/dev/hd0
First hard disk
/dev/hd0t6
DOS partition on first hard disk
/dev/hd0t79
QNX 4 partition on first hard disk
/dev/hd1
Second hard disk
/dev/hd1t79
QNX 4 partition on second hard disk
The following table shows some of the recognized partition types:
Partition type   Filesystem
7                OS/2 HPFS or Windows NT (NTFS)
11               DOS 32-bit FAT (FAT32)
12               Same as type 11, but using LBA
14               Same as the 16-bit FAT type, but using LBA
15               Extended partition using LBA
77               QNX 4
78               QNX 4
79               QNX 4
99               UNIX
131              Linux (Ext2)
175              Apple Macintosh HFS or HFS Plus
177              Power-Safe
178              Power-Safe
179              Power-Safe
Buffer cache
The io-blk shared library implements a buffer cache that all filesystems inherit. The
buffer cache attempts to store frequently accessed filesystem blocks in order to
minimize the number of times a system has to perform a physical I/O to the disk.
Read operations are synchronous; write operations are usually asynchronous. When
an application writes to a file, the data enters the cache, and the filesystem manager
immediately replies to the client process to indicate that the data has been written.
The data is then written to the disk.
Critical filesystem blocks such as bitmap blocks, directory blocks, extent blocks, and
inode blocks are written immediately and synchronously to disk.
Applications can modify write behavior on a file-by-file basis. For example, a database
application can cause all writes for a given file to be performed synchronously. This
would ensure a high level of file integrity in the face of potential hardware or power
problems that might otherwise leave a database in an inconsistent state.
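One way to do this is to open the file with the O_SYNC flag (a sketch; the pathname is only an example):

fd = open("/data/records.db", O_WRONLY | O_CREAT | O_SYNC, 0644);

Each subsequent write() on fd then returns only after the data has been written through to the disk; an application can also call fsync() at selected points to flush a particular file on demand.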
Filesystem limitations
POSIX defines the set of services a filesystem must provide. However, not all filesystems
are capable of delivering all those services.
The following table summarizes the capabilities of the various filesystems:

Filesystem   Access  Modification  Status change  Filename    Permissions  Directories  Hard   Soft   Decompression
             date    date          date           length (a)                            links  links  on read
Image        No      No            No             255         Yes          No           No     No     No
RAM          Yes     Yes           Yes            255         Yes          No           No     No     No
ETFS         Yes     Yes           Yes            91          Yes          Yes          No     Yes    No
QNX 4        Yes     Yes           Yes            48          Yes          Yes          Yes    Yes    No
Power-Safe   Yes     Yes           Yes            510         Yes          Yes          Yes    Yes    No
DOS          Yes     Yes           No             8.3         No           Yes          No     No     No
NTFS         Yes     Yes           No             255         No           Yes          No     No     Yes
CD-ROM       Yes     Yes           Yes            207 (b)     Yes          Yes          No     Yes    No
UDF          Yes     Yes           Yes            254         Yes          Yes          No     No     No
HFS          Yes     Yes           Yes            255 (c)     Yes          Yes          No     No     No
FFS3         No      Yes           Yes            255         Yes          Yes          No     Yes    Yes
NFS          Yes     Yes           Yes            -           Yes          Yes          Yes    Yes    No
CIFS         No      Yes           No             -           No           Yes          No     No     No
Ext2         Yes     Yes           Yes            255         Yes          Yes          Yes    Yes    No

(a) Our internal representation for file names is UTF-8, which uses a variable number of bytes per character. Many on-disk formats instead use UCS2, which is a fixed number (2 bytes). Thus a length limit in characters may be 1, 2, or 3 times that number in bytes, as we convert from on-disk to OS representation. The lengths for the QNX 4, Power-Safe, and EXT2 filesystems are in bytes; those for UDF, CD/Joliet, and DOS/VFAT are in characters.
(b) 103 characters with Joliet extensions; 255 with Rock Ridge extensions.
(c) 31 on HFS.
Image filesystem
Every QNX Neutrino system image provides a simple read-only filesystem that presents
the set of files built into the OS image.
Since this image may include both executables and data files, this filesystem is
sufficient for many embedded systems. If additional filesystems are required, they
would be placed as modules within the image where they can be started as needed.
RAM filesystem
Every QNX Neutrino system also provides a simple RAM-based filesystem that allows
read/write files to be placed under /dev/shmem.
Note that /dev/shmem isn't actually a filesystem. It's a window onto the
shared memory names that happens to have some filesystem-like
characteristics.
This RAM filesystem finds the most use in tiny embedded systems where persistent
storage across reboots isn't required, yet where a small, fast, temporary-storage
filesystem with limited features is called for.
The filesystem comes for free with procnto and doesn't require any setup. You can
simply create files under /dev/shmem and grow them to any size (depending on RAM
resources).
Although the RAM filesystem itself doesn't support hard or soft links or directories,
you can create a link to it by using process-manager links. For example, you could
create a link to a RAM-based /tmp directory:
ln -sP /dev/shmem /tmp
This tells procnto to create a process manager link to /dev/shmem known as /tmp.
Application programs can then open files under /tmp as if it were a normal filesystem.
In order to minimize the size of the RAM filesystem code inside the process
manager, this filesystem specifically doesn't include big filesystem features
such as file locking and directory creation.
ETFS
Figure: ETFS is composed entirely of transactions; each transaction consists of a header (FID, offset, size, sequence, CRCs, ECCs, and other fields) followed by its data. The FID, offset, size, and sequence fields are required and device-independent; the CRCs, ECCs, and other fields are optional and device-dependent.
Inside a transaction
Each transaction consists of a header followed by data. The header contains the
following:
187
Filesystems
FID
A unique file ID that identifies which file the transaction belongs to.
Offset
The offset of the data portion within the file.
Size
The size of the data portion.
Sequence
A monotonically increasing number (to enable time ordering).
CRCs
Data integrity checks (for NAND, NOR, SRAM).
ECCs
Error correction (for NAND).
Other
Reserved for future expansion.
ETFS supports several classes of devices:

Device         CRC   ECC   Wear-leveling (erase)   Wear-leveling (read)   Cluster size
NAND 512+16    Yes   Yes   Yes                     Yes                    1K
NAND 2048+64   Yes   Yes   Yes                     Yes                    2K
RAM            No    No    No                      No                     1K
SRAM           Yes   No    No                      No                     1K
NOR            Yes   No    Yes                     No                     1K
Although ETFS can support NOR flash, we recommend instead the FFS3 (p.
205) filesystem (devf-*), which is designed explicitly for NOR flash devices.
Reliability features
ETFS is designed to survive across a power failure, even during an active flash write
or block erase. The following features contribute to its reliability:
dynamic wear-leveling
static wear-leveling
CRC error detection
ECC error correction
read degradation monitoring with automatic refresh
transaction rollback
atomic file operations
automatic file defragmentation.
Dynamic wear-leveling
Flash memory allows a limited number of erase cycles on a flash block before the
block will fail. This number can be as low as 100,000. ETFS tracks the number of
erases on each block. When selecting a block to use, ETFS attempts to spread the
erase cycles evenly over the device, dramatically increasing its life. The difference
can be extreme: from usage scenarios of failure within a few days without wear-leveling
to over 40 years with wear-leveling.
Static wear-leveling
Filesystems often consist of a large number of static files that are read but not written.
These files will occupy flash blocks that have no reason to be erased. If the majority
of the files in flash are static, this will cause the remaining blocks containing dynamic
data to wear at a dramatically increased rate.
ETFS notices these under-worked static blocks and forces them into service by copying
their data to an over-worked block. This solves two problems: It gives the over-worked
block a rest, since it now contains static data, and it forces the under-worked static
block into the dynamic pool of blocks.
during normal usage. An ECC error is a warning signal that the flash block the error
occurred in may be getting weak, i.e. losing charge.
ETFS will mark the weak block for a refresh operation, which copies the data to a new
flash block and erases the weak block. The erase recharges the flash block.
Transaction rollback
When ETFS starts, it processes all transactions and rolls back (discards) the last partial
or damaged transaction. The rollback code is designed to handle a power failure during
a rollback operation, thus allowing the system to recover from multiple nested faults.
The validity of a transaction is protected by CRC codes on each transaction.
QNX 4 filesystem
The QNX 4 filesystem (fs-qnx4.so) is a high-performance filesystem that shares
the same on-disk structure as in the QNX 4 RTOS.
The QNX 4 filesystem implements an extremely robust design, utilizing an extent-based,
bitmap allocation scheme with fingerprint control structures to safeguard against data
loss and to provide easy recovery. Features include:
extent-based POSIX filesystem
robustness: all sensitive filesystem info is written through to disk
on-disk signatures and special key information to allow fast data recovery in the
event of disk damage
505-character filenames
multi-threaded design
client-driven priority
same disk format as the filesystem under QNX 4
In QNX Neutrino 6.2.1 and later, the 48-character filename limit has increased
to 505 characters via a backwards-compatible extension. The same on-disk
format is retained, but new systems will see the longer name, while old ones
will see a truncated 48-character name.
For more information, see QNX 4 filesystem in the Working with Filesystems chapter
of the QNX Neutrino User's Guide.
Power-Safe filesystem
The Power-Safe filesystem is a reliable disk filesystem that can withstand power failures without losing or corrupting data. It was designed for, and is intended for, traditional rotating hard-disk-drive media.
This filesystem is supported by the fs-qnx6.so shared object.
Copy-on-write filesystem
To address the problems associated with existing disk filesystems, the Power-Safe
filesystem never overwrites live data; it does all updates using copy-on-write (COW),
assembling a new view of the filesystem in unused blocks on the disk. The new view
of the filesystem becomes live only when all the updates are safely written on the
disk. Everything is COW: both metadata and user data are protected.
To see how this works, let's consider how the data is stored. A Power-Safe filesystem
is divided into logical blocks, the size of which you can specify when you use mkqnx6fs
to format the filesystem. Each inode includes 16 pointers to blocks. If the file is
smaller than 16 blocks, the inode points to the data blocks directly. If the file is any
bigger, those 16 blocks become pointers to more blocks, and so on.
The final block pointers to the real data are all in the leaves and are all at the same level. In some other filesystems, such as EXT2, a file always has some direct blocks, some indirect ones, and some double indirect, so you go to different levels to get to different parts of the file. With the Power-Safe filesystem, all the user data for a file is at the same level.
Figure: An inode whose pointers lead through indirect blocks of pointers to the user data; all the data blocks are at the same level in the tree.
If you change some data, it's written in one or more unused blocks, and the original data remains unchanged. The list of indirect block pointers must be modified to refer to the newly used blocks, but again the filesystem copies the existing block of pointers and modifies the copy. The filesystem then updates the inode (once again by modifying a copy) to refer to the new block of indirect pointers. When the operation is complete, the original data and the pointers to it remain intact, but there's a new set of blocks, indirect pointers, and inode for the modified data:
Figure: After a modification, a new inode, a new set of indirect block pointers, and new data blocks exist alongside the original, unchanged ones.
A superblock is a global root block that contains the inodes for the system bitmap and
inodes files. A Power-Safe filesystem maintains two superblocks:
a stable superblock that reflects the original version of all the blocks
a working superblock that reflects the modified data
The working superblock can include pointers to blocks in the stable superblock. These
blocks contain data that hasn't yet been modified. The inodes and bitmap for the
working superblock grow from it.
Figure: The stable superblock and the working superblock, each referring to its own set of blocks; the working superblock may also point to unmodified blocks belonging to the stable view.
Performance
The Copy on Write (COW) method has some drawbacks:
Each change to user data can cause up to a dozen blocks to be copied and modified,
because the filesystem never modifies the inode and indirect block pointers in
place; it has to copy the blocks to a new location and modify the copies. Thus,
write operations are longer.
When taking a snapshot, the filesystem must force all blocks fully to disk before
it commits the superblock.
However:
There's no constraint on the order in which the blocks (aside from the superblock)
can be written.
The new blocks can be allocated from any free, contiguous space.
The performance of the filesystem depends on how much buffer cache is available,
and on the frequency of the snapshots. Snapshots occur periodically (every 10 seconds,
or as specified by the snapshot option to fs-qnx6.so), and also when you call sync()
for the entire filesystem, or fsync() for a single file.
Synchronization is at the filesystem level, not at that of individual files, so
fsync() is potentially an expensive operation; the Power-Safe filesystem ignores
the O_SYNC flag.
You can also turn snapshots off if you're doing some long operation, and the
intermediate states aren't useful to you. For example, suppose you're copying a very
large file into a Power-Safe filesystem. The cp utility is really just a sequence of basic
operations:
an open(O_CREAT|O_TRUNC) to make the file
a bunch of write() operations to copy the data
a close(), chmod(), and chown() to copy the metadata
If the file is big enough so that copying it spans snapshots, you have on-disk views
that include the file not existing, the file existing at a variety of sizes, and finally the
complete file copied and its IDs and permissions set:
Figure: A timeline of the copy: open(), a series of write() calls, then close(), chmod(), and chown(), with snapshots occurring at points along the way.
Each snapshot is a valid point-in-time view of the filesystem (i.e., if you've copied 50
MB, the size is 50 MB, and all data up to 50 MB is also correctly copied and available).
If there's a power failure, the filesystem is restored to the most recent snapshot. But
the filesystem has no concept that the sequence of open(), write(), and close()
operations is really one higher-level operation, cp. If you want the higher-level
semantics, disable the snapshots around the cp, and then the middle snapshots won't
happen, and if a power failure occurs, the file will either be complete, or not there at
all.
For information about using this filesystem, see Power-Safe filesystem in the Working
with Filesystems chapter of the QNX Neutrino User's Guide.
Encryption
You can encrypt all or part of the contents of a Power-Safe filesystem by dividing it
into encryption domains.
A domain can contain any number of files or directories. After a domain has been
assigned to a directory, all files created within that directory are encrypted and inherit
that same domain. By default, assigning a domain key to a directory with existing files
or directories doesn't introduce encryption to those files. You can create files and
assign them to a domain, but you must do so before adding any data to them.
During operation, files that are assigned to a domain are encrypted, and the files'
contents are available only when the associated domain is unlocked. When a domain
is unlocked, all the files and directories under that domain are unlocked as well, and
therefore accessible (as per basic file permissions). When a domain is locked, any
access to files belonging to that domain is denied.
Locking and unlocking operations apply to an entire domain, not to specific
files or directories.
The encryption scheme uses several types of keys: file keys, domain keys, a master key, and a system key. A file key is used to encrypt file data, and is itself encrypted by a domain key. Keys are managed by the filesystem and are hidden from the user.
Encryption types
The Power-Safe filesystem supports the following types of encryption:
Constant               Description
FS_CRYPTO_TYPE_NONE    No encryption
FS_CRYPTO_TYPE_XTS     Encryption using the XTS cipher mode
FS_CRYPTO_TYPE_CBC     Encryption using the CBC cipher mode
Interface usage
To manage encryption from the command line, use fsencrypt; its -c option lets you specify the command to run. From your code, use the fscrypto library; you need to include both the <fs_crypto_api.h> and <sys/fs_crypto.h> header files. Many of the APIs return EOK on success and also have a reply argument that provides more information.
API                                              fsencrypt command
fs_crypto_check()                                check
fs_crypto_domain_add()                           create
fs_crypto_domain_key_change()                    change-key
fs_crypto_domain_key_check()                     check-key
fs_crypto_domain_key_size()
fs_crypto_domain_lock()                          lock
fs_crypto_domain_query()                         query, query-all
fs_crypto_domain_remove()                        destroy
fs_crypto_domain_unlock()                        unlock
fs_crypto_enable(), fs_crypto_enable_option()    enable
fs_crypto_file_get_domain()                      get
fs_crypto_file_set_domain()                      set
fs_crypto_key_gen()                              -K or -k option
fs_crypto_set_logging()                          -l and -v options
The library also includes some functions that you can use to move existing files and
directories into an encryption domain. You first tag files and directories that you want
to move, and then you start the migration, which the filesystem does in the background.
These functions include:
API                            fsencrypt command
fs_crypto_migrate_control()    migrate-start, migrate-stop, migrate-delay, migrate-units (controls encryption migration for the filesystem)
fs_crypto_migrate_path()       migrate-path
fs_crypto_migrate_status()     migrate-status
fs_crypto_migrate_tag()        migrate-tag, tag
Examples
Here are some examples of the way you can use fsencrypt to manage filesystem
encryption:
Determine if encryption is supported or enabled:
$ fsencrypt -vc check -p /
ENCRYPTION_CHECK(Path:'/') FAILED: (18) - 'No support'
Determine if a file is encrypted. Files with the domain number of 0 aren't encrypted.
A nonzero value (1 to 100) means the file is assigned to a domain. Access to the
file's original contents is determined by the status of the domain. The example
below shows that the named file is assigned to domain 10:
$ fsencrypt -vcget -p /accounts/1000/secure/testfile
GET_DOMAIN(Path:'/accounts/1000/secure/testfile') = 10 SUCCESS
Unused domains are ones that haven't yet been created. In the example below,
domain 11 hasn't yet been created, and domain 10 is currently unlocked:
$ fsencrypt -vcquery -p/ -d11
QUERY_DOMAIN(Path:'/', Domain:11) NOTICE: Domain is UNUSED
$ fsencrypt -vcquery -p/ -d10
QUERY_DOMAIN(Path:'/', Domain:10) NOTICE: Domain is UNLOCKED
DOS Filesystem
The DOS Filesystem, fs-dos.so, provides transparent access to DOS disks, so you
can treat DOS filesystems as though they were POSIX filesystems. This transparency
allows processes to operate on DOS files without any special knowledge or work on
their part.
The structure of the DOS filesystem on disk is old and inefficient, and lacks many
desirable features. Its only major virtue is its portability to DOS and Windows
environments. You should choose this filesystem only if you need to transport DOS
files to other machines that require it. If DOS file portability isn't an issue, consider
using the Power-Safe or QNX 4 filesystem on its own; if it is an issue, consider using
one of them in conjunction with the DOS filesystem.
If there's no DOS equivalent to a POSIX feature, fs-dos.so will either return an
error or apply a reasonable default. For example, an attempt to create a hard link with link() will result in
the appropriate errno being returned. On the other hand, if there's an attempt to read
the POSIX times on a file, fs-dos.so will treat any of the unsupported times the
same as the last write time.
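For example, a short sketch like the following (the mountpoint /fs/dos is an assumption) would see the hard-link case above fail and report the errno that fs-dos.so returns:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* /fs/dos is a hypothetical mountpoint for a DOS filesystem. */
    if (link("/fs/dos/report.txt", "/fs/dos/report2.txt") == -1) {
        /* DOS has no equivalent for hard links, so fs-dos.so returns an error. */
        printf("link() failed: %s\n", strerror(errno));
    }
    return 0;
}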
DOS version support
The fs-dos.so program supports both floppies and hard disk partitions
from DOS version 2.1 to Windows 98 with long filenames.
DOS text files
DOS terminates each line in a text file with two characters (CR/LF), while
POSIX (and most other) systems terminate each line with a single character
(LF). Note that fs-dos.so makes no attempt to translate text files being
read. Most utilities and programs aren't affected by this difference.
Note also that some very old DOS programs may use a Ctrl-Z (^Z) character as a file
terminator. This character is also passed through without modification.
QNX-to-DOS filename mapping
In DOS, a filename can't contain any of the following characters:
/ \ [ ] : * | + = ; , ?
An attempt to create a file that contains one of these invalid characters will
return an error. DOS (8.3 format) also expects all alphabetical characters
to be uppercase, so fs-dos.so maps these characters to uppercase when
creating a filename on disk. But it maps a filename to lowercase by default
when returning a filename to a QNX Neutrino application, so that QNX
Neutrino users and programs can always see and type lowercase (this behavior can
be changed via the sfn=sfn_mode option).
Handling filenames
You can specify how you want fs-dos.so to handle long filenames (via
the lfn=lfn_mode option):
Ignore them: display/create only 8.3 filenames.
Show them: if filenames are longer than 8.3 or if mixed case is used.
Always create both short and long filenames.
If you use the ignore option, you can specify whether or not to silently
truncate filename characters beyond the 8.3 limit.
International filenames
The DOS filesystem supports DOS code pages (international character
sets) for locale filenames. Short 8.3 names are stored using a particular
character set (typically the most common extended characters for a locale
are encoded in the 8th-bit character range). All the common American as
well as Western and Eastern European code pages (437, 850, 852, 866,
1250, 1251, 1252) are supported. If you produce software that must access
a variety of DOS/Windows hard disks, or operate in non-US-English countries,
this feature offers important portability: filenames will be created with both
a Unicode and a locale name and are accessible via either name.
The DOS filesystem supports international text in filenames only.
No attempt is made to be aware of data contents, with the sole
exception of Windows shortcut (.LNK) files, which will be parsed
and translated into symbolic links if you've specified that option
(lnk=lnk_mode).
File permissions
DOS doesn't support all the permission bits specified by POSIX. It has a
READ_ONLY bit in place of separate READ and WRITE bits; it doesn't have
an EXECUTE bit. When a DOS file is created, the DOS READ_ONLY bit is
set if all the POSIX WRITE bits are off. When a DOS file is accessed, the
POSIX READ bit is always assumed to be set for user, group, and other.
Since you can't execute a file that doesn't have EXECUTE permission,
fs-dos.so has an option (exe=exec_mode) that lets you specify how to
handle the POSIX EXECUTE bit for executables.
File ownership
Although the DOS file structure doesn't support user IDs and group IDs,
fs-dos.so (by default) doesn't return an error code if an attempt is made
to change them. An error isn't returned because a number of utilities attempt
to do this and failure would result in unexpected errors. The approach taken
is that you can change anything to anything, since it isn't written to disk anyway.
The posix= options let you set stricter POSIX checks and enable POSIX
emulation. For example, in POSIX mode, an error of EINVAL is flagged for
attempts to do any of the following:
Set the user ID or group ID to something other than the default (root).
Remove an r (read) permission.
Set an s (set ID on execution) permission.
If you set the posix option to emulate (the default) or strict, you get the
following benefits:
The . and .. directory entries are created in the root directory.
The directory size is calculated.
The number of links in a directory is calculated, based on its
subdirectories.
CD-ROM filesystem
The CD-ROM filesystem provides transparent access to CD-ROM media, so you can
treat CD-ROM filesystems as though they were POSIX filesystems. This transparency
allows processes to operate on CD-ROM files without any special knowledge or work
on their part.
The fs-cd.so manager implements the ISO 9660 standard as well as a number of
extensions, including Rock Ridge (RRIP), Joliet (Microsoft), and multisession (Kodak
Photo CD, enhanced audio).
We've deprecated fs-cd.so in favor of fs-udf.so, which now supports
ISO-9660 filesystems in addition to UDF. For information about UDF, see
Universal Disk Format (UDF) filesystem (p. 213), later in this chapter.
FFS3 filesystem
The FFS3 filesystem drivers implement a POSIX-like filesystem on NOR flash memory
devices. The drivers are standalone executables that contain both the flash filesystem
code and the flash device code. There are versions of the FFS3 filesystem driver for
different embedded systems hardware as well as PCMCIA memory cards.
The naming convention for the drivers is devf-system, where system describes the
embedded system.
To find out what flash devices we currently support, refer to the following sources:
the boards and mtd-flash directories under
bsp_working_dir/src/hardware/flash
QNX Neutrino RTOS docs (devf-* entries in the Utilities Reference)
the QNX Software Systems website (www.qnx.com)
Customization
Along with the prebuilt flash filesystem drivers, including the generic driver
(devf-generic), we provide the libraries and source code that you'll need to build
custom flash filesystem drivers for different embedded systems. For information on
how to do this, see the Customizing the Flash Filesystem chapter in Building Embedded
Systems.
Organization
The FFS3 filesystem drivers support one or more logical flash drives. Each logical
drive is called a socket, which consists of a contiguous and homogeneous region of
flash memory. For example, in a system containing two different types of flash device
at different addresses, where one flash device is used for the boot image and the other
for the flash filesystem, each flash device would appear in a different socket.
Each socket may be divided into one or more partitions. Two types of partitions are
supported: raw partitions and flash filesystem partitions.
Raw partitions
A raw partition in the socket is any partition that doesn't contain a flash filesystem.
The driver doesn't recognize any filesystem types other than the flash filesystem. A
raw partition may contain an image filesystem or some application-specific data.
Any partitions on the flash that aren't flash filesystem partitions are made accessible
through a raw mountpoint (see below). Note that the flash filesystem
partitions are available as raw partitions as well.
Filesystem partitions
A flash filesystem partition contains the POSIX-like flash filesystem, which uses a
QNX Software Systems proprietary format to store the filesystem data on the flash
devices. This format isn't compatible with either the Microsoft FFS2 or PCMCIA FTL
specification.
The filesystem allows files and directories to be freely created and deleted. It recovers
space from deleted files using a reclaim mechanism similar to garbage collection.
Mountpoints
When you start the flash filesystem driver, it will by default mount any partitions it
finds in the socket.
Note that you can specify the mountpoint using mkefs or flashctl (e.g., /flash).
Mountpoint      Description
/dev/fsX        Raw mountpoint for socket X
/dev/fsXpY      Raw mountpoint for partition Y on socket X
/fsXpY          Filesystem mountpoint for partition Y on socket X
/fsXpY/.cmp     Filesystem mountpoint for partition Y on socket X, with transparent decompression
Features
The FFS3 filesystem supports many advanced features, such as POSIX compatibility,
multiple threads, background reclaim, fault recovery, transparent decompression,
endian-awareness, wear-leveling, and error-handling.
POSIX
The filesystem supports the standard POSIX functionality (including long filenames,
access privileges, random writes, truncation, and symbolic links) with the following
exceptions:
You can't create hard links.
Access times aren't supported (but file modification times and attribute change
times are).
These design compromises allow this filesystem to remain small and simple, yet include
most features normally found with block device filesystems.
Background reclaim
The FFS3 filesystem stores files and directories as a linked list of extents, which are
marked for deletion as they're deleted or updated. Blocks to be reclaimed are chosen
using a simple algorithm that finds the block with the most space to be reclaimed
while keeping the wear of each individual block roughly even. This wear-leveling
increases the MTBF (mean time between failures) of the flash devices, thus increasing
their longevity.
The background reclaim process is performed when there isn't enough free space. The
reclaim process first copies the contents of the reclaim block to an empty spare block,
which then replaces the reclaim block. The reclaim block is then erased. Unlike rotating
media with a mechanical head, proximity of data isn't a factor with a flash filesystem,
so data can be scattered on the media without loss of performance.
Fault recovery
The filesystem has been designed to minimize corruption due to accidental
loss-of-power faults. Updates to extent headers and erase block headers are always
executed in carefully scheduled sequences. These sequences allow the recovery of
the filesystem's integrity in the case of data corruption.
Note that properly designed flash hardware is essential for effective fault-recovery
systems. In particular, special reset circuitry must be in place to hold the system in
reset before power levels drop below critical. Otherwise, spurious or random bus
activity can form write/erase commands and corrupt the flash beyond recovery.
Rename operations are guaranteed atomic, even through loss-of-power faults. This
means, for example, that if you lost power while giving an image or executable a new
name, you would still be able to access the file via its old name upon recovery.
When the FFS3 filesystem driver is started, it scans the state of every extent header
on the media (in order to validate its integrity) and takes appropriate action, ranging
from a simple block reclamation to the erasure of dangling extent links. This process
is merged with the filesystem's normal mount procedure in order to achieve optimal
bootstrap timings.
Compression/decompression
For fast and efficient compression/decompression, you can use the deflate and
inflator utilities, which rely on popular deflate/inflate algorithms.
The deflate algorithm combines two algorithms. The first takes care of removing data
duplication in files; the second algorithm handles data sequences that appear the
most often by giving them shorter symbols. Those two algorithms provide excellent
lossless compression of data and executable files. The inflate algorithm simply reverses
what the deflate algorithm does.
The deflate utility is intended for use with the filter attribute for mkefs. You
can also use it to precompress files intended for a flash filesystem.
The inflator resource manager sits in front of the other filesystems that were
previously compressed using the deflate utility. It can almost double the effective
size of the flash memory.
Compressed files can be manipulated with standard utilities such as cp or ftp; the
ls utility can display their compressed and uncompressed sizes if used with
the proper mountpoint. These features make the management of a compressed flash
filesystem seamless to a systems designer.
Flash errors
As flash hardware wears out, its write state-machine may find that it can't write or
erase a particular bit cell. When this happens, the error status is propagated to the
flash driver so it can take proper action (i.e., mark the bad area and try to write/erase
in another place).
This error-handling mechanism is transparent. Note that after enough failed
writes and erases, the flash will eventually become read-only. Fortunately, this
situation shouldn't happen before several years of flash operation. Check your flash
specification and analyze your application's data flow to flash in order to calculate its
potential longevity or MTBF.
Endian awareness
The FFS3 filesystem is endian-aware, making it portable across different platforms.
The optimal approach is to use the mkefs utility to select the target's endian-ness.
Utilities
The filesystem supports all the standard POSIX utilities such as ls, mkdir, rm, ln,
mv, and cp.
There are also some QNX Neutrino utilities for managing the flash:
flashctl
Erase, format, and mount flash partitions.
deflate
Compress files for flash filesystems.
mkefs
Create flash filesystem image files.
System calls
The filesystem supports all the standard POSIX I/O functions such as open(), close(),
read(), and write(). Special functions such as erasing are supported using the devctl()
function.
NFS filesystem
The Network File System (NFS) allows a client workstation to perform transparent file
access over a network, operating on files that reside on a server that may be running
a different operating system. Client file access calls are converted to
NFS protocol requests, and are sent to the server over the network. The server receives
the request, performs the actual filesystem operation, and sends a response back to
the client.
The Network File System operates in a stateless fashion by using remote procedure
calls (RPC) and TCP/IP for its transport. Therefore, to use fs-nfs2 or fs-nfs3,
you'll also need to run the TCP/IP client for QNX Neutrino.
Any POSIX limitations in the remote server filesystem will be passed through to the
client. For example, the length of filenames may vary across servers from different
operating systems. NFS (versions 2 and 3) limits filenames to 255 characters; mountd
(versions 1 and 3) limits pathnames to 1024 characters.
Although NFS (version 2) is older than POSIX, it was designed to emulate UNIX
filesystem semantics and happens to be relatively close to POSIX. If possible,
you should use fs-nfs3 instead of fs-nfs2.
CIFS filesystem
Formerly known as SMB, the Common Internet File System (CIFS) allows a client
workstation to perform transparent file access over a network to a Windows 98 or NT
system, or a UNIX system running an SMB server. Client file access calls are converted
to CIFS protocol requests and are sent to the server over the network. The server
receives the request, performs the actual filesystem operation, and sends a response
back to the client.
The CIFS protocol makes no attempt to conform to
POSIX.
The fs-cifs manager uses TCP/IP for its transport. Therefore, to use fs-cifs
(SMBfsys in QNX 4), you'll also need to run the TCP/IP client for the QNX Neutrino
RTOS.
Windows NT filesystem
The NT filesystem is used on Microsoft Windows NT and later.
The fs-nt.so shared object provides read-only access to NTFS disks on a QNX
Neutrino system.
Chapter 10
PPS
The QNX Neutrino Persistent Publish/Subscribe (PPS) service is a small, extensible
publish/subscribe service that offers persistence across reboots. It's designed to provide
a simple and easy-to-use solution for both publish/subscribe and persistence in
embedded systems, answering a need for building loosely connected systems using
asynchronous publications and notifications.
With PPS, publishing is asynchronous: the subscriber need not be waiting for the
publisher. In fact, the publisher and subscriber rarely know each other; their only
connection is an object which has a meaning and purpose for both publisher and
subscriber.
The PPS design is in many ways similar to that of process control systems, where the
objects are control values updated by hardware or software. Subscribers can be alarm
handling code, displays, and so on. Since there is a single instance of an object,
persistence is a natural property that can be applied to it.
Persistence
PPS maintains its objects in memory while it's running.
It will, as required:
save its objects to persistent storage, either on demand while it's running, or at
shutdown
restore its objects on startup, either immediately, or on first access (deferred
loading)
The underlying persistent storage used by PPS relies on a reliable filesystem, such
as:
disk: Power-Safe filesystem
NAND flash: ETFS filesystem
NOR flash: FFS3 filesystem
other: a customer-generated filesystem
When PPS starts up, it immediately builds the directory hierarchy from the encoded
filenames on the persistent filesystem. It defers loading the objects in the directories
until first access to one of the files. This access could be an open() call on a PPS
object, or a readdir() call on the PPS directory.
On shutdown, PPS always saves any modified objects to a persistent filesystem. You
can also force PPS to save an object at any time by calling fsync() on the object. When
PPS saves to a persistent filesystem, it saves all objects to a single directory.
You can set PPS object and attribute qualifiers to have PPS not save specific
objects or attributes.
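As a sketch of forcing an immediate save, a publisher might do something like the following (the object path under /fs/pps is illustrative):

#include <fcntl.h>
#include <unistd.h>

/* Ask PPS to write one object to its backing filesystem right away,
 * rather than waiting for shutdown. */
void save_object_now(void)
{
    int fd = open("/fs/pps/media/player_state", O_WRONLY);
    if (fd != -1) {
        fsync(fd);      /* force this object to persistent storage */
        close(fd);
    }
}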
PPS objects
PPS uses an object-based system; that is, a system with objects whose properties a
publisher can modify. Clients that subscribe to an object receive updates when that
object changes, i.e., when the publisher has modified it.
PPS objects exist as files with attributes in a special PPS filesystem. By default, PPS
objects appear under /fs/pps. You can:
Create directories and populate them with PPS objects by creating files in the
directories.
Use the open(), then the read() and write() functions to query and change PPS
objects.
Use standard utilities as simple debugging tools.
PPS directories can include special objects, such as .all and .notify, which
applications can open to facilitate subscription behavior.
When PPS creates, deletes, or truncates an object (a file or a directory), it places a
notification string into the queue of any subscriber or publisher that has opened either
that object or the .all special object for the directory containing the modified object. The
file can be opened in either full or delta mode.
PPS supports pathname open options, and objects and attribute qualifiers. PPS uses
pathname open options to apply open options on the file descriptor used to open an
object. Object and attribute qualifiers set specific actions to take with an object or
attribute; for example, make an object non-persistent, or delete an attribute.
Pathname open options
PPS objects support an extended syntax on the pathnames used to open
them. Open options are added as suffixes to the pathname, following a
question mark (?). That is, the PPS service uses any data that follows a
question mark in a pathname to apply open options on the file descriptor
used to access the object. Multiple options are separated by question marks.
Object and attribute qualifiers
You can set qualifiers to read() and write() calls by starting a line containing
an object or attribute name with an opening square bracket, followed by a
list of single-letter or single-numeral qualifiers and terminated by a closing
square bracket.
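For illustration, the sketch below opens an object with an open option appended after a question mark and writes an attribute line carrying a qualifier. The specific option (?delta) and qualifier ([n], assumed here to mark a non-persistent attribute) are examples of the syntax only; check the PPS documentation for the exact names your version supports.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

void pps_syntax_example(void)
{
    /* Open option: everything after '?' applies only to this file descriptor. */
    int fd = open("/fs/pps/services/status?delta", O_RDWR);
    if (fd != -1) {
        /* Attribute qualifier: the leading "[n]" is a qualifier list
         * applied to the "progress" attribute (assumed: non-persistent). */
        const char *line = "[n]progress::42\n";
        write(fd, line, strlen(line));
        close(fd);
    }
}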
Publishing
To publish to a PPS object, a publisher simply calls open() for the object file with
O_WRONLY to publish only, or O_RDWR to publish and subscribe. The publisher can
then call write() to modify the object's attributes. This operation is non-blocking.
PPS supports multiple publishers that publish to the same PPS object. This capability
is required because different publishers may have access to data that applies to
different attributes for the same object.
In a multimedia system, for instance, the renderer may be the source of a time::value
attribute, while the HMI may be the source of a duration::value attribute. A
publisher that changes only the time attribute will update only that attribute when
it writes to the object. It will leave the other attributes unchanged.
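A minimal publisher sketch might look like this; the object path is illustrative, and the time attribute follows the multimedia example above (PPS attributes are written as attrname::value lines):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Publish a new playback position to a PPS object. */
void publish_time(int seconds)
{
    char line[64];
    int fd = open("/fs/pps/media/player", O_WRONLY);

    if (fd == -1)
        return;

    /* Writing one attribute line updates only that attribute;
     * the object's other attributes are left unchanged. */
    snprintf(line, sizeof(line), "time::%d\n", seconds);
    write(fd, line, strlen(line));
    close(fd);
}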
Subscribing
PPS clients can subscribe to multiple objects, and PPS objects can have multiple
subscribers. When a publisher changes an object, all clients subscribed to that object
are informed of the change.
To subscribe to an object, a client simply calls open() for the object with O_RDONLY
to subscribe only, or O_RDWR to publish and subscribe. The subscriber can then query
the object with a read() call.
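As a sketch, a subscriber could read the object like this (the object path is illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Subscribe to a PPS object and print its current contents. */
void read_object_once(void)
{
    char buf[1024];
    ssize_t n;
    int fd = open("/fs/pps/media/player", O_RDONLY);

    if (fd == -1)
        return;

    /* In full mode (the default), one read() returns a consistent
     * view of the whole object: its name plus its attribute lines. */
    n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("%s", buf);
    }
    close(fd);
}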
A subscriber can open an object in full mode, in delta mode, or in full and delta modes
at the same time. The figure below illustrates the different information sent to
subscribers who open a PPS object in full mode and in delta mode.
Figure: a PPS object sending full-mode updates to one subscriber and a stream of deltas to another subscriber that opened the object in delta mode.
Full mode
In full mode (the default), the subscriber always receives a single, consistent
version of the entire object as it exists at the moment when it is requested.
If a publisher changes an object several times before a subscriber asks for
it, the subscriber receives only the state of the object at the time of asking.
If the object changes again, the subscriber is notified again of the change.
Thus, in full mode, the subscriber may miss multiple changes to an
object: changes that occur before the subscriber asks for it.
Delta mode
In delta mode, a subscriber receives only the changes (but all the changes)
to an object's attributes. On the first read, since a subscriber knows nothing
about the state of an object, PPS assumes everything has changed. Therefore,
a subscriber's first read in delta mode returns all attributes for an object,
while subsequent reads return only the changes since that subscriber's
previous read. Thus, in delta mode, the subscriber always receives all changes
to an object.
PPS uses directories as a natural grouping mechanism to simplify and make more
efficient the task of subscribing to multiple objects. Subscribers can open multiple
objects, either by calling open() then select() on the objects, or, more easily, by opening
the special .all object which merges all objects in its directory.
PPS provides a mechanism to associate a set of file descriptors with a notification
group. This mechanism allows you to read only the PPS special notification object to
receive notification of changes to all objects associated with a notification group.
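As a rough sketch of the directory-grouping approach described above, a subscriber could watch every object in a directory through its .all object; the directory path and the ?wait open option (assumed here to make read() block until something changes) are illustrative:

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Watch every object in a PPS directory via its .all special object. */
void watch_directory(void)
{
    char buf[2048];
    ssize_t n;
    int fd = open("/fs/pps/services/.all?wait", O_RDONLY);

    if (fd == -1)
        return;

    while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
        buf[n] = '\0';
        printf("%s", buf);    /* each read reports the object that changed */
    }
    close(fd);
}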
Chapter 11
Character I/O
A key requirement of any realtime operating system is high-performance character
I/O.
Character devices can be described as devices to which I/O consists of a sequence of
bytes transferred serially, as opposed to block-oriented devices (e.g., disk drives).
As in the POSIX and UNIX tradition, these character devices are located in the OS
pathname space under the /dev directory. For example, a serial port to which a modem
or terminal could be connected might appear in the system as:
/dev/ser1
Typical character devices found on PC hardware include:
serial ports
parallel ports
text-mode consoles
pseudo terminals (ptys)
Programs access character devices using the standard open(), close(), read(), and
write() API functions. Additional functions are available for manipulating other aspects
of the character device, such as baud rate, parity, flow control, etc.
Since it's common to run multiple character devices, they have been designed as a
family of drivers and a library called io-char to maximize code reuse.
Figure: the io-char library is shared by the serial, parallel, console, and pty drivers.
Each driver is a separate QNX Neutrino process and can run at different priorities according to the nature of the
hardware being controlled and the client's requested service.
Once a single character device is running, the memory cost of adding additional devices
is minimal, since only the code to implement the new driver structure would be new.
Driver/io-char communication
The io-char library manages the flow of data between an application and the device
driver. Data flows between io-char and the driver through a set of memory queues
associated with each character device.
Three queues are used for each device. Each queue is implemented using a first-in,
first-out (FIFO) mechanism.
Figure: application processes talk to io-char, which exchanges data with the console, serial, and parallel drivers through the raw input ("in"), output ("out"), and canonical ("canon") queues; the drivers in turn service the system console, serial communication ports, and parallel printer.
The canonical queue is managed entirely by io-char and is used while processing
input data in edited mode. The size of this queue determines the maximum edited
input line that can be processed for a particular device.
The sizes of these queues are configurable using command-line options. Default values
are usually more than adequate to handle most hardware configurations, but you can
tune these to reduce overall system memory requirements, to accommodate unusual
hardware situations, or to handle unique protocol requirements.
Device drivers simply add received data to the raw input queue or consume and transmit
data from the output queue. The io-char module decides when (and if) output
transmission is to be suspended, how (and if) received data is echoed, etc.
Device control
Low-level device control is implemented using the devctl() call.
The POSIX terminal control functions are layered on top of devctl() as follows:
tcgetattr()
Get terminal attributes.
tcsetattr()
Set terminal attributes.
tcgetpgrp()
Get ID of process group leader for a terminal.
tcsetpgrp()
Set ID of process group leader for a terminal.
tcsendbreak()
Send a break condition.
tcflow()
Suspend or restart data transmission/reception.
QNX Neutrino extensions
The QNX Neutrino extensions to the terminal control API are as follows:
tcdropline()
Initiate a disconnect. For a serial device, this will pulse the DTR line.
tcinject()
Inject characters into the canonical buffer.
The io-char module acts directly on a common set of devctl() commands supported
by most drivers. Applications send device-specific devctl() commands through io-char
to the drivers.
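For instance, a sketch that reconfigures a serial port using these layered POSIX calls might look like this:

#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

/* Switch /dev/ser1 to 115200 baud, 8 data bits, no parity, raw input. */
int configure_serial(void)
{
    struct termios tio;
    int fd = open("/dev/ser1", O_RDWR);

    if (fd == -1)
        return -1;

    tcgetattr(fd, &tio);                  /* layered on devctl() by io-char */
    cfsetispeed(&tio, B115200);
    cfsetospeed(&tio, B115200);
    tio.c_cflag &= ~(CSIZE | PARENB);
    tio.c_cflag |= CS8;
    tio.c_lflag &= ~(ICANON | ECHO);      /* raw, unedited input */
    tcsetattr(fd, TCSANOW, &tio);
    return fd;
}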
Input modes
Each device can be in a raw or edited input mode.
Figure: in raw mode, io-char returns the read of n bytes as soon as any one of the MIN, TIME, TIMEOUT, or FORWARD conditions is satisfied.
TIMEOUT
The qualifier TIMEOUT is useful when an application has knowledge of how
long it should wait for data before timing out. The timeout is specified in
1/10ths of a second.
Any protocol that knows the character count for a frame of data it expects
to receive can use TIMEOUT. This in combination with the baud rate allows
a reasonable guess to be made when data should be available. It acts as a
deadman timer to detect dropped characters. It can also be used in
interactive programs with user input to time out a read if no response is
available within a given time.
TIMEOUT is a QNX Neutrino extension and is not part of the POSIX standard.
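From code, the raw-mode qualifiers are typically exercised through QNX Neutrino's readcond() function, whose min, time, and timeout arguments correspond to the conditions above; a minimal sketch (the fd is assumed to be an open serial device):

#include <unistd.h>    /* readcond() is declared here on QNX Neutrino */

/* Read up to 64 bytes, returning once 10 bytes have arrived, or after
 * an overall timeout of 5/10ths of a second (the TIMEOUT condition). */
int read_chunk(int fd, char *buf)
{
    return readcond(fd, buf, 64, 10, 0, 5);
}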
FORWARD
The qualifier FORWARD is useful when a protocol is delimited by a special
framing character. For example, the PPP protocol used for TCP/IP over a
serial link starts and ends its packets with a framing character. When used
in conjunction with TIMEOUT, the FORWARD character can greatly improve
the efficiency of a protocol implementation. The protocol process will receive
complete frames, rather than character by character. In the case of a dropped
framing character, TIMEOUT or TIME can be used to quickly recover.
This greatly minimizes the amount of IPC work for the OS and results in a
much lower processor utilization for a given TCP/IP data rate. It's interesting
to note that PPP doesn't contain a character count for its frames. Without
the data-forwarding character, an implementation might be forced to read
the data one character at a time.
FORWARD is a QNX Neutrino extension and is not part of the POSIX
standard.
The ability to push the processing for application notification into the
service-providing components of the OS reduces the frequency with which
user-level processing must occur. This minimizes the IPC work to be done
in the system and frees CPU cycles for application processing. In addition,
if the application implementing the protocol is executing on a different
network node than the communications port, the number of network
transactions is also minimized.
For intelligent, multiport serial cards, the data-forwarding character
recognition can also be implemented within the intelligent serial card itself,
thereby significantly reducing the number of times the card must interrupt
the host processor for interrupt servicing.
In edited mode, io-char provides line editing on input, using characters such as:
ERASE
Erase the character to the left of the cursor.
DEL
Erase the character at the current cursor position.
KILL
Erase the entire input line.
UP
Erase the current line and recall a previous line.
DOWN
Erase the current line and recall the next line.
INS
Toggle between insert mode and typeover mode (every new line starts in
insert mode).
Line-editing characters vary from terminal to terminal. The console always starts out
with a full set of editing keys defined.
If a terminal is connected via a serial channel, you need to define the editing characters
that apply to that particular terminal. To do this, you can use the stty utility. For
example, if you have an ANSI terminal connected to a serial port (called /dev/ser1),
you would use the following command to extract the appropriate editing keys from the
terminfo database and apply them to /dev/ser1:
stty term=ansi </dev/ser1
Console devices
System consoles (with VGA-compatible graphics chips in text mode) are managed by
the devc-con or devc-con-hid driver. The video display card/screen and the system
keyboard are collectively referred to as the physical console.
The devc-con or devc-con-hid driver permits multiple sessions to be run
concurrently on a physical console by means of virtual consoles. The devc-con
console driver process typically manages more than one set of I/O queues to io-char,
which are made available to user processes as a set of character devices with names
like /dev/con1, /dev/con2, etc. From the application's point of view, there really
are multiple consoles available to be used.
Of course, there's only one physical console (screen and keyboard), so only one of
these virtual consoles is actually displayed at any one time. The keyboard is attached
to whichever virtual console is currently visible.
Terminal emulation
The console drivers emulate an ANSI terminal.
Serial devices
Serial communication channels are managed by the devc-ser* family of driver
processes. These drivers can manage more than one physical channel and provide
character devices with names such as /dev/ser1, /dev/ser2, etc.
When devc-ser* is started, command-line arguments can specify which, and how
many, serial ports are installed. On a PC-compatible system, this will typically be the
two standard serial ports often referred to as com1 and com2. The devc-ser* driver
directly supports most nonintelligent multiport serial cards.
QNX Neutrino includes various serial drivers (e.g., devc-ser8250). For details, see
the devc-ser* entries in the Utilities Reference.
The devc-ser* drivers support hardware flow control (except under edited mode)
provided that the hardware supports it. Loss of carrier on a modem can be programmed
to deliver a SIGHUP signal to an application process (as defined by POSIX).
Parallel devices
Parallel printer ports are managed by the devc-par driver. When devc-par is started,
command-line arguments can specify which parallel port is installed.
The devc-par driver is an output-only driver, so it has no raw input or canonical
input queues. The size of the output buffer can be configured with a command-line
argument. If configured to a large size, this creates the effect of a software print buffer.
Pseudo terminal devices (ptys)
Pseudo terminals are managed by the devc-pty driver. A pty is a pair of character
devices: one side is used by an application as if it were a real serial device, while
the other side is driven by another process.
Figure: an application process communicates with another process through a pseudo-tty provided by devc-pty, much as an application communicates over a serial line through devc-ser*.
Chapter 12
Networking Architecture
As with other service-providing processes in the QNX Neutrino RTOS, the networking
services execute outside the kernel. Developers are presented with a single unified
interface, regardless of the configuration and number of networks involved.
This architecture allows:
network drivers to be started and stopped dynamically
Qnet and other protocols to run together in any combination
Our native network subsystem consists of the network manager executable (io-pkt-v4,
io-pkt-v4-hc, or io-pkt-v6-hc), plus one or more shared library modules. These
modules can include protocols (e.g. lsm-qnet.so) and drivers (e.g.
devnp-speedo.so).
Figure: the io-pkt architecture. Applications and stack utilities reach the stack through libsocket and libc via the stack's resource manager (message-passing API); inside io-pkt, the stack connects by function calls to protocol modules (.so), BPF packet filtering, the Ethernet and IP input paths, the 802.11 framework, and network drivers and Wi-Fi drivers (.so).
Threading model
The default mode of operation is for io-pkt to create one thread per CPU.
The io-pkt stack is fully multithreaded at layer 2. However, only one thread may
acquire the stack context for upper-layer packet processing. If multiple interrupt
sources require servicing at the same time, these may be serviced by multiple threads.
Only one thread will be servicing a particular interrupt source at any point in time.
Typically an interrupt on a network device indicates that there are packets to be
received. The same thread that handles the receive processing may later transmit the
received packets out another interface. Examples of this are layer-2 bridging and the
ipflow fastforwarding of IP packets.
The stack uses a thread pool to service events that are generated from other parts of
the system. These events may be:
timeouts
ISR events
other things generated by the stack or protocol modules
You can use a command-line option to the driver to control the priority at which the
thread is run to receive packets. Client connection requests are handled in a floating
priority mode (i.e., the thread priority matches that of the client application thread
accessing the stack resource manager).
Once a thread receives an event, it examines the event type to see if it's a hardware
event, stack event, or other event:
If the event is a hardware event, the hardware is serviced and, for a receive packet,
the thread determines whether bridging or fast-forwarding is required. If so, the
thread performs the appropriate lookup to determine which interface the packet
should be queued for, and then takes care of transmitting it, after which it goes
back to check and see if the hardware needs to be serviced again.
If the packet is meant for the local stack, the thread queues the packet on the
stack queue. The thread then goes back and continues checking and servicing
hardware events until there are no more events.
Once a thread has completed servicing the hardware, it checks to see if there's
currently a stack thread running to service stack events that may have been
generated as a result of its actions. If there's no stack thread running, the thread
becomes the stack thread and loops, processing stack events until there are none
remaining. It then returns to the wait for event state in the thread pool.
This capability of having a thread change directly from being a hardware-servicing
thread to being the stack thread eliminates context switching and greatly improves
the receive performance for locally terminated IP flows.
Protocol module
The networking protocol module is responsible for implementing the details of a
particular protocol (e.g., Qnet).
Each protocol component is packaged as a shared object (e.g., lsm-qnet.so). One
or more protocol components may run concurrently.
For example, the following line from a buildfile shows io-pkt-v4 loading the Qnet
protocol via its -p protocol command-line option:
io-pkt-v4 -dne2000 -pqnet
Qnet is the QNX Neutrino native networking protocol. Its main purpose is to extend
the OS's powerful message-passing IPC transparently over a network of microkernels.
Qnet also provides Quality of Service policies to help ensure reliable network
transactions.
For more information on the Qnet and TCP/IP protocols, see the following chapters in
this book:
Native Networking (Qnet) (p. 243)
TCP/IP Networking (p. 257)
Driver module
The network driver module is responsible for managing the details of a particular
network adaptor (e.g., an NE-2000 compatible Ethernet controller). Each driver is
packaged as a shared object and installs into the io-pkt* component.
Once io-pkt* is running, you can dynamically load drivers at the command line
using the mount command.
For example, the following commands start io-pkt-v6-hc and then mount the
driver for the Broadcom 57xx chip set adapter:
io-pkt-v6-hc &
mount -T io-pkt devnp-bge.so
All network device drivers are shared objects whose names are of the form
devnp-driver.so.
The io-pkt* manager can also load legacy io-net drivers. The names of
these drivers start with devn-.
Once the shared object is loaded, io-pkt* will then initialize it. The driver and
io-pkt* are then effectively bound together: the driver will call into io-pkt* (for
example when packets arrive from the interface) and io-pkt* will call into the driver
(for example when packets need to be sent from an application to the interface).
To unload a legacy io-net driver, you can use the umount command. For example:
umount /dev/io-net/en0
To unload a new-style driver or a legacy io-net driver, use the ifconfig destroy
command:
ifconfig bge0 destroy
For more information on network device drivers, see their individual utility pages
(devn-*, devnp-*) in the Utilities Reference.
Chapter 13
Native Networking (Qnet)
In the Interprocess Communication (IPC) chapter earlier in this manual, we described
message passing in the context of a single node. But the true power of the QNX
Neutrino RTOS lies in its ability to take the message-passing paradigm and extend it
transparently over a network of microkernels. This chapter describes QNX Neutrino
native networking (via the Qnet protocol).
Now consider the case of a simple network with two machines: one contains the client
process, the other contains the server process.
Figure 50: A simple network where the client and server reside on separate machines.
The code required for client-server communication is identical to the code in the
single-node case, but with one important exception: the pathname. The pathname will
contain a prefix that specifies the node that the service (/dev/ser1) resides on. As
we'll see later, this prefix will be translated into a node descriptor for the lower-level
ConnectAttach() kernel call that will take place. Each node in the network is assigned
a node descriptor, which serves as the only visible means to determine whether the
OS is running as a network or standalone.
For more information on node descriptors, see the Transparent Distributed Processing
with Qnet chapter of the QNX Neutrino Programmer's Guide.
So with Qnet running, you can now open pathnames (files or managers) on other remote
Qnet nodes, just as you open files locally on your own node. This means you can access
regular files or manager processes on other Qnet nodes as if they were executing on
your local node.
Recall our open() example above. If you wanted to open a serial device on node1
instead of on your local machine, you simply specify the path:
fd = open("/net/node1/dev/ser1",O_RDWR...); /*Open a serial device on node1*/
For client-server communications, how does the client know what node descriptor to
use for the server?
The client uses the filesystem's pathname space to look up the server's address. In
the single-machine case, the result of that lookup will be a node descriptor, a process
ID, and a channel ID. In the networked case, the results are the same; the only
difference will be the value of the node descriptor.
If the node descriptor is:     Then the server is:
0 (or ND_LOCAL_NODE)           Local (on the same node)
Nonzero                        Remote
Figure: a client on one node uses the process managers and Qnet on both nodes to resolve the pathname and then exchange messages with the serial driver on node lab2.
Network naming
As mentioned earlier, the pathname prefix /net is the most common name that
lsm-qnet.so uses.
In resolving names in a network-wide pathname space, the following terms come into
play:
node name
A character string that identifies the node you're talking to. Note that a node
name can't contain slashes or dots. In the example above, we used lab2
as one of our node names. The default is fetched via confstr() with the
_CS_HOSTNAME parameter.
node domain
A character string that's tacked onto the node name by lsm-qnet.so.
Together the node name and node domain must form a string that's unique
for all nodes that are talking to each other. The default is fetched via confstr()
with the _CS_DOMAIN parameter.
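For illustration, a program can query these defaults itself with confstr(); a minimal sketch:

#include <stdio.h>
#include <unistd.h>

/* Print the defaults Qnet would use for the node name and node domain. */
int main(void)
{
    char name[256], domain[256];

    confstr(_CS_HOSTNAME, name, sizeof(name));
    confstr(_CS_DOMAIN, domain, sizeof(domain));
    printf("node name:   %s\n", name);
    printf("node domain: %s\n", domain);
    return 0;
}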
Resolvers
The following resolvers are built into the network manager:
en_ionet: broadcast requests for name resolution on the LAN (similar to the
TCP/IP ARP protocol). This is the default.
dns: take the node name, add a dot (.) followed by the node domain, and send
the result to the TCP/IP gethostbyname() function.
file: search for accessible nodes, including the relevant network address, in a
static file.
QoS policies
Qnet supports transmission over multiple networks and provides several policies for
specifying how Qnet should select a network interface for transmission.
These Quality of Service policies include:
loadbalance (the default)
Qnet is free to use all available network links, and will share transmission
equally among them.
preferred
Qnet uses one specified link, ignoring all other networks (unless the preferred
one fails).
exclusive
Qnet uses oneand only onelink, ignoring all others, even if the exclusive
link fails.
To fully benefit from Qnet's QoS, you need to have physically separate networks. For
example, consider a network with two nodes and a hub, where each node has two
connections to the hub:
Figures: two nodes, each with two connections to a single hub, contrasted with a configuration that uses physically separate networks.
The QoS parameter always begins with a tilde (~) character. For example, a pathname
such as /net/lab2~exclusive:en0 tells Qnet to lock onto the en0 interface exclusively, even if it fails.
Symbolic links
You can set up symbolic links to the various QoS-qualified pathnames:
ln -sP /net/lab2~preferred:en1 /remote/sql_server
You can't create symbolic links inside /net because Qnet takes over that
namespace.
Abstracting the pathnames by one level of indirection gives you multiple servers
available in a network, all providing the same service. When one server fails, the
abstract pathname can be remapped to point to the pathname of a different server.
For example, if lab2 failed, then a monitoring program could detect this and effectively
issue:
rm /remote/sql_server
ln -sP /net/lab1 /remote/sql_server
This would remove lab2 and reassign the service to lab1. The real advantage here
is that applications can be coded based on the abstract service name rather than
being bound to a specific node name.
Examples
Let's look at a few examples of how you'd use the network manager.
The QNX Neutrino native network manager lsm-qnet.so is actually a shared
object that installs into the executable io-pkt*.
Local networks
If you're using the QNX Neutrino RTOS on a small LAN, you can use just
the default en_ionet resolver. When a node name that's currently unknown
is being resolved, the resolver will broadcast the name request over the LAN,
and the node that has the name will respond with an identification message.
Once the name's been resolved, it's cached for future reference.
Since en_ionet is the default resolver when you start lsm-qnet.so, you
can simply issue commands like:
ls /net/lab2/
If you have a machine called lab2 on your LAN, you'll see the contents
of its root directory.
Remote networks
Qnet uses DNS (Domain Name System) when resolving remote names. To
use lsm-qnet.so with DNS, you specify this resolver on mount's command
line:
For security reasons, you should have a firewall set up on
your network before connecting to the Internet. For more
information, see pf-faq at
ftp://ftp3.usa.openbsd.org/pub/OpenBSD/doc/
in the OpenBSD documentation.
In this example, Qnet will use both its native en_ionet resolver (indicated
by the first mount= command) and DNS for resolving remote names.
Note that we've specified several types of domain names
(mount=.com:.net:.edu) as mountpoints, simply to ensure better remote
name resolution.
Chapter 14
TCP/IP Networking
As the Internet has grown to become more and more visible in our daily lives, the
protocol it's based on, IP (Internet Protocol), has become increasingly important.
The IP protocol and tools that go with it are ubiquitous, making IP the de facto choice
for many private networks.
IP is used for everything from simple tasks (e.g., remote login) to more complicated
tasks (e.g., delivering realtime stock quotes). Most businesses are turning to the World
Wide Web, which commonly rides on IP, for communication with their customers,
advertising, and other business connectivity. The QNX Neutrino RTOS is well-suited
for a variety of roles in this global network, from embedded devices connected to the
Internet, to the routers that are used to implement the Internet itself.
Given these and many other user requirements, we've made our TCP/IP stack (included
in io-pkt*) relatively light on resources, while using the common BSD API.
We provide the following stack configurations:
NetBSD TCP/IP stack
Based on the latest RFCs, including UDP, IP, and TCP. Also supports
forwarding, broadcast and multicast, hardware checksum support, routing
sockets, Unix domain sockets, multilink PPP, PPPoE, supernetting (CIDR),
NAT/IP filtering, ARP, ICMP, and IGMP, as well as CIFS, DHCP, AutoIP,
DNS, NFS (v2 and v3 server/client), NTP, RIP, RIPv2, and an embedded
web server.
To create applications for this stack, you use the industry-standard BSD
socket API. This stack also includes optimized forwarding code for additional
performance and efficient packet routing when the stack is functioning as
a network gateway.
Enhanced NetBSD stack with IPsec and IPv6
Includes all the features in the standard stack, plus the functionality targeted
at the new generation of mobile and secure communications. This stack
provides full IPv6 and IPsec (both IPv4 and IPv6) support through KAME
extensions, as well as support for VPNs over IPsec tunnels.
This dual-mode stack supports IPv4 and IPv6 simultaneously and includes
IPv6 support for autoconfiguration, which allows device configuration in
plug-and-play network environments. IPv6 support includes IPv6-aware
utilities and RIP/RIPng to support dynamic routing. An Advanced Socket
API is also provided to supplement the standard socket API to take advantage
of IPv6 extended-development capabilities.
IPsec support allows secure communication between hosts or networks,
providing data confidentiality via strong encryption algorithms and data
authentication features. IPsec support also includes the IKE (ISAKMP/Oakley)
key management protocol for establishing secure host associations.
The QNX Neutrino TCP/IP suite is also modular. For example, it provides NFS as
separate modules. With this kind of modularity, together with small-sized modules,
embedded systems developers can more easily and quickly build small TCP/IP-capable
systems.
Figure: user applications and services such as fs-nfs2, fs-cifs, ftpd, telnetd, inetd, routed, snmpd, syslogd, ntpd, pppd, and pppoed sit on top of io-pkt, which loads protocol modules (e.g., lsm-pf-*.so) and drivers (devn-*.so, devnp-*.so); pppd uses devc-ser* for serial links.
Socket API
The BSD Socket API was the obvious choice for the QNX Neutrino RTOS. The Socket
API is the standard API for TCP/IP programming in the UNIX world. In the Windows
world, the Winsock API is based on and shares a lot with the BSD Socket API. This
makes conversion between the two fairly easy.
All the routines that application programmers would expect are available, including
(but not limited to):
accept()
bind()
bindresvport()
connect()
dn_comp()
dn_expand()
endprotoent()
endservent()
gethostbyaddr()
gethostbyname()
getpeername()
getprotobyname()
getprotobynumber()
getprotoent()
getservbyname()
getservent()
getsockname()
getsockopt()
herror()
hstrerror()
htonl()
htons()
h_errlist()
h_errno()
h_nerr()
inet_addr()
inet_aton()
inet_lnaof()
inet_makeaddr()
inet_netof()
inet_network()
inet_ntoa()
ioctl()
listen()
ntohl()
ntohs()
recv()
recvfrom()
res_init()
res_mkquery()
res_query()
res_querydomain()
res_search()
res_send()
select()
send()
sendto()
setprotoent()
setservent()
setsockopt()
shutdown()
socket()
For more information, see the QNX Neutrino C Library Reference.
The common daemons and utilities from the Internet will easily port or just compile
in this environment. This makes it easy to leverage what already exists for your
applications.
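As a small illustration of the standard API in use, here's a sketch of a TCP client; the host name and port are placeholders:

#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect to a TCP service and send one line, using only standard BSD calls. */
int main(void)
{
    struct addrinfo hints, *res;
    const char *msg = "hello\n";
    int s;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo("server.example.com", "7", &hints, &res) != 0)
        return 1;

    s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (s != -1) {
        if (connect(s, res->ai_addr, res->ai_addrlen) == 0)
            send(s, msg, strlen(msg), 0);
        close(s);
    }
    freeaddrinfo(res);
    return 0;
}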
Database routines
The database routines listed below have been modified to better suit embedded
systems.
/etc/resolv.conf
You can use configuration strings (via the confstr() function) to override the
data usually contained in the /etc/resolv.conf file. You can also use
the RESCONF environment variable to do this. Either method lets you use a
nameserver without /etc/resolv.conf. This affects gethostbyname()
and other resolver routines.
/etc/protocols
Multiple stacks
The QNX Neutrino network manager (io-pkt) lets you load multiple protocol shared
objects. You can even run multiple, independent instances of the network manager
(io-pkt*) itself. As with all QNX Neutrino system components, each io-pkt*
naturally benefits from complete memory protection thanks to our microkernel
architecture.
NTP
NTP (Network Time Protocol) allows you to keep the time of day for the devices in
your network synchronized with the Internet standard time servers. The QNX Neutrino
NTP daemon supports both server and client modes.
In server mode, a daemon on the local network synchronizes with the standard time
servers. It will then broadcast or multicast what it learned to the clients on the local
network, or wait for client requests. The client NTP systems will then be synchronized
with the server NTP system. The NTP suite implements NTP v4 while maintaining
compatibility with v3, v2, and v1.
AutoIP
Developed from the Zeroconf IETF draft, lsm-autoip.so is an io-pkt* module
that automatically configures the IPv4 address of your interface, without the need for
a server (as DHCP requires), by negotiating with its peers on the network. This module can
also coexist with DHCP (dhcp.client), allowing your interface to be assigned both
a link-local IP address and a DHCP-assigned IP address at the same time.
/etc/autoconnect
Our autoconnect feature automatically sets up a connection to your ISP whenever a
TCP/IP application is started. For example, suppose you want to start a dialup
connection to the Internet. When your Web browser is started, it will pause and the
/etc/autoconnect script will automatically dial your ISP. The browser will resume
when the PPP session is established.
For more information, see the entry for /etc/autoconnect in the Utilities Reference.
Chapter 15
High Availability
The term High Availability (HA) is commonly used in telecommunications and other
industries to describe a system's ability to remain up and running without interruption
for extended periods of time.
The celebrated "five nines" availability metric refers to the percentage of uptime a
system can sustain in a year: 99.999% uptime amounts to about five minutes of
downtime per year.
Obviously, an effective HA solution involves various hardware and software components
that conspire to form a stable, working system. Assuming reliable hardware components
with sufficient redundancy, how can an OS best remain stable and responsive when
a particular component or application program fails? And in cases where redundant
hardware may not be an option (e.g., consumer appliances), how can the OS itself
support HA?
An OS for HA
If you had to design an HA-capable OS from the ground up, would you start with a
single executable environment? In this simple, high-performance design, all OS
components, device drivers, applications, the works, would all run without memory
protection in kernel mode.
On second thought, maybe such an OS wouldn't be suited for HA, simply because if
a single software component were to fail, the entire system would crash. And if you
wanted to add a software component or otherwise modify the HA system, you'd have
to take the system out of service to do so. In other words, the conventional realtime
executive architecture wasn't built with HA in mind.
Suppose, then, that you base your HA-enabled OS on a separation of kernel space and
user space, so that all applications would run in user mode and enjoy memory
protection. You'd even be able to upgrade an application without incurring any
downtime.
So far so good, but what would happen if a device driver, filesystem manager, or other
essential OS component were to crash? Or what if you needed to add a new driver to
a live system? You'd have to rebuild and restart the kernel. Based on such a monolithic
kernel architecture, your HA system wouldn't be as available as it should be.
Inherent HA
A true microkernel that provides full memory protection is inherently the most stable
OS architecture.
Very little code is running in kernel mode that could cause the kernel itself to fail.
And individual processes, whether applications or OS services, can be started and
stopped dynamically, without jeopardizing system uptime.
QNX Neutrino inherently provides several key features that are well-suited for HA
systems:
System stability through full memory protection for all OS and user processes.
Dynamic loading and unloading of system components (device drivers, filesystem
managers, etc.).
Separation of all software components for simpler development and maintenance.
While any claims regarding "five nines" availability on the part of an OS must be
viewed only in the context of the entire hardware/software HA system, one can always
ask whether an OS truly has the appropriate underlying architecture capable of
supporting HA.
HA-specific modules
Apart from its inherently robust architecture, QNX Neutrino also provides several
components to help developers simplify the task of building and maintaining effective
HA systems:
HA client-side library (p. 273): cover functions that allow for automatic and
transparent recovery mechanisms for failed server connections.
HA Manager (p. 275): a smart watchdog that can perform multistage recovery
whenever system services or processes fail.
Client library
The High Availability client-side library provides a drop-in enhancement solution for
many standard C Library I/O operations.
The HA library's cover functions allow for automatic and transparent recovery
mechanisms for failed connections that can be recovered from in an HA scenario.
Note that the HA library is both thread-safe and cancellation-safe.
The main principle of the client library is to provide drop-in replacements for all the
message-delivery functions (i.e., MsgSend*). A client can select which particular
connections it would like to make highly available, thereby allowing all other
connections to operate as ordinary connections (i.e., in a non-HA environment).
Normally, when a server that the client is talking to fails, or if there's a transient
network fault, the MsgSend* functions return an error indicating that the connection
ID (or file descriptor) is stale or invalid (e.g., EBADF). But in an HA-aware scenario,
these transient faults are recovered from almost immediately, thus making the services
available again.
Recovery example
The following example demonstrates a simple recovery scenario, where a client opens
a file across a network filesystem.
If the NFS server were to die, the HA Manager would restart it and remount the
filesystem. Normally, any clients that previously had files open across the old
connection would now have a stale connection handle. But if the client uses the
ha_attach functions, it can recover from the lost connection.
The ha_attach functions allow the client to provide a custom recovery function that's
automatically invoked by the cover-function library. This recovery function could simply
reopen the connection (thereby getting a connection to the new server), or it could
perform a more complex recovery (e.g., adjusting the file position offsets and
reconstructing its state with respect to the connection). This mechanism thus lets you
develop arbitrarily complex recovery scenarios, while the cover-function library takes
care of the details (detecting a failure, invoking recovery functions, and retransmitting
state information).
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <ha/cover.h>

/* TESTFILE and Handle are filled in here so the example is complete:
   TESTFILE names a file on the NFS-mounted filesystem (adjust the path
   for your system), and Handle carries the recovery count and current
   file offset that recover_conn() uses below. */
#define TESTFILE "/nfs/home/test/testfile"

typedef struct handle {
    int nr;           /* number of successful recoveries */
    int curr_offset;  /* current file offset */
} Handle;
int recover_conn(int oldfd, void *hdl)
{
    int newfd;
    Handle *thdl;

    thdl = (Handle *)hdl;
    newfd = ha_reopen(oldfd, TESTFILE, O_RDONLY);
    if (newfd >= 0) {
        // adjust file offset to previously known point
        lseek(newfd, thdl->curr_offset, SEEK_SET);
        // increment our count of successful recoveries
        (thdl->nr)++;
    }
    return newfd;
}
int main(int argc, char *argv[])
{
    int status;
    int fd;
    int fd2;
    Handle hdl;
    char buf[80];

    hdl.nr = 0;
    hdl.curr_offset = 0;

    // open a connection
    // recovery will use "recover_conn", and "hdl" will
    // be passed to it as a parameter
    fd = ha_open(TESTFILE, O_RDONLY, recover_conn, (void *)&hdl, 0);
    if (fd < 0) {
        printf("could not open file\n");
        exit(-1);
    }

    status = read(fd, buf, 15);
    if (status < 0) {
        printf("error: %s\n", strerror(errno));
        exit(-1);
    }
    else {
        hdl.curr_offset += status;
    }

    fd2 = ha_dup(fd);

    // fs-nfs3 fails and is restarted; the network mounts
    // are reinstated at this point.
    // Our previous "fd" to the file is stale.
    sleep(18);

    // this read will fail because the connection is stale,
    // and will recover via recover_conn
    status = read(fd, buf, 15);
    if (status < 0) {
        printf("error: %s\n", strerror(errno));
        exit(-1);
    }
    else {
        hdl.curr_offset += status;
    }

    printf("total recoveries, %d\n", hdl.nr);
    ha_close(fd);
    ha_close(fd2);
    exit(0);
}
Since the cover-function library takes over the lowest-level MsgSend*() calls, most standard
library functions (read(), write(), printf(), scanf(), etc.) are also automatically HA-aware.
The library also provides an ha_dup() function, which is semantically equivalent to the
standard dup() function in the context of HA-aware connections. You can replace
recovery functions during the lifetime of a connection, which greatly simplifies the
task of developing highly customized recovery mechanisms.
HAM hierarchy
The High Availability Manager consists of three main components.
Entities (p. 276)
Conditions (p. 277)
Actions (p. 278)
Entities
Entities are the fundamental units of observation/monitoring in the system.
Essentially, an entity is a process (pid). As processes, all entities are uniquely
identifiable by their pids. Associated with each entity is a symbolic name that can be
used to refer to that specific entity. Again, the names associated with entities are
unique across the system. Since a HAM is currently associated with a single node, the
uniqueness rules apply per node. As we'll see later, this uniqueness requirement is very similar
to the naming scheme used in a hierarchical filesystem.
There are three fundamental entity types:
Self-attached entities (HA-aware components): processes that choose to send
heartbeats to the HAM, which will then monitor them for failure. Self-attached
entities can, on their own, decide at exactly what point in their lifespan they want
to be monitored, what conditions they want acted upon, and when they want to
stop the monitoring. In other words, this is a situation where a process says, "Do
the following if I die." (A sketch of this appears after this list.)
Externally attached entities (HA-unaware components): generic processes (including
legacy components) in the system that are being monitored. These could be arbitrary
daemons or service providers whose health is deemed important. This method is useful
for the case where Process A says, "Tell me when Process B dies," but Process B
needn't know about this at all.
Global entity: a placeholder for matching any entity. The global entity can be
used to associate actions that will be triggered when an interesting event is detected
with respect to any entity on the system. The term "global" refers to the set of
entities being monitored in the system, and allows a process to say things like,
"When any process dies or misses a heartbeat, do the following." The global entity
is never added or removed, but only referred to. Conditions can be added to or
removed from the global entity, of course, and actions can be added to or removed
from any of the conditions.
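For example, an HA-aware server might attach itself and then heartbeat once per cycle of its main loop. This is only a rough sketch: the entity name and timing values are made up, and the argument lists of ham_attach_self(), ham_heartbeat(), and ham_detach_self() shown here are simplified assumptions, so check the HAM API reference for the exact prototypes.

#include <unistd.h>
#include <ha/ham.h>

int main(void)
{
    ham_entity_t *ehdl;
    int i;

    ham_connect(0);     /* connect to the HAM */

    /* Attach ourselves as a self-attached entity. The arguments are
       assumed to be: entity name, heartbeat period (in nanoseconds),
       and the number of missed beats that trigger the low- and
       high-severity missed-heartbeat conditions. */
    ehdl = ham_attach_self("my-server", 1000000000ULL, 2, 5, 0);
    if (ehdl == NULL)
        return 1;

    for (i = 0; i < 100; i++) {
        /* ... one cycle of the server's real work ... */
        ham_heartbeat();    /* tell the HAM we're still alive */
        sleep(1);
    }

    /* Stop being monitored before exiting normally. */
    ham_detach_self(ehdl, 0);
    ham_disconnect(0);
    return 0;
}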
Conditions
Conditions are associated with entities; a condition represents the entity's state.
The conditions include:
CONDDEATH (the entity has died)
CONDABNORMALDEATH (the entity has died abnormally, e.g., by crashing)
CONDDETACH (the entity has detached)
CONDATTACH (an entity has attached)
CONDBEATMISSEDHIGH (the entity missed a heartbeat; high severity)
CONDBEATMISSEDLOW (the entity missed a heartbeat; low severity)
CONDRESTART (the entity was restarted)
CONDRAISE (an externally detected condition has been raised)
CONDSTATE (the entity has reported a state transition)
CONDANY (matches any condition)
For the conditions listed above (except CONDSTATE, CONDRAISE, and CONDANY),
the HAM is the publisher: it automatically detects and/or triggers the conditions. For
the CONDSTATE and CONDRAISE conditions, external detectors publish the conditions
to the HAM.
For all conditions, subscribers can associate with lists of actions that will be performed
in sequence when the condition is triggered. Both the CONDSTATE and CONDRAISE
conditions provide filtering capabilities, so subscribers can selectively associate actions
with individual conditions based on the information published.
Any condition can be associated as a wild card with any entity, so a process can
associate actions with any condition in a specific entity, or even in any entity. Note
that conditions are also associated with symbolic names, which also need to be unique
within an entity.
Actions
Actions are associated with conditions. Actions are executed when the appropriate
conditions are true with respect to a specific entity.
The HAM API includes several functions for different kinds of actions:
ham_action_restart()
ham_action_execute()
ham_action_notify_pulse()
ham_action_notify_signal()
ham_action_notify_pulse_node()
ham_action_notify_signal_node() (the node name specified for the recipient of the signal can be the fully qualified node name)
ham_action_waitfor()
ham_action_heartbeat_healthy()
ham_action_log()
Actions are also associated with symbolic names, which are unique within a specific
condition.
What happens if an action itself fails? You can specify an alternate list of actions to
be performed to recover from that failure. These alternate actions are associated with
the primary actions through several ham_action_fail* functions:
ham_action_fail_execute()
ham_action_fail_notify_pulse()
ham_action_fail_notify_signal()
ham_action_fail_notify_pulse_node()
ham_action_fail_notify_signal_node()
ham_action_fail_waitfor()
ham_action_fail_log()
There are currently two different ways of publishing information to the HAM; both of
these are designed to be general enough to permit clients to build more complex
information exchange mechanisms:
publishing state transitions
publishing other conditions.
State transitions
An entity can report its state transitions to the HAM, which maintains every entity's
current state (as reported by the entity). The HAM doesn't interpret the meaning of
the state value itself, nor does it try to validate the state transitions, but it can generate
events based on transitions from one state to another.
Components can publish transitions that they want the external world to know about.
These states needn't necessarily represent a specific state the application uses
internally for decision making.
To notify the HAM of a state transition, components can use the
ham_entity_condition_state() function. Since the HAM is interested only in the next
state in the transition, this is the only information that's transmitted to the HAM. The
HAM then triggers a condition state-change event internally, which other components
can subscribe to using the ham_condition_state() API call (see below (p. 281)).
Other conditions
In addition to the above, components on the system can also publish autonomously
detected conditions by using the ham_entity_condition_raise() API call. The component
raising the condition can also specify a type, class, and severity of its choice, to allow
subscribers further granularity in filtering out specific conditions to subscribe to. As
a result of this call, the HAM triggers a condition-raise event internally, which other
components can subscribe to using the ham_condition_raise() API call (see below (p.
281)).
HAM as a filesystem
Effectively, HAM's internal state is like a hierarchical filesystem, where entities are
like directories, conditions associated with those entities are like subdirectories, and
actions inside those conditions are like leaf nodes of this tree structure.
HAM also presents this state as a read-only filesystem under /proc/ham. As a result,
arbitrary processes can also view the current state (e.g., you can do ls /proc/ham).
The /proc/ham filesystem presents a lot of information about the current state of
the system's entities. It also provides useful statistics on heartbeats, restarts, and
deaths, giving you a snapshot in time of the system's various entities, conditions, and
actions.
Multistage recovery
HAM can perform a multistage recovery, executing several actions in a certain order.
This technique is useful whenever strict dependencies exist between various actions
in a sequence. In most cases, recovery requires more than a single restart mechanism
in order to properly restore the system's state to what it was before a failure.
For example, suppose you've started fs-nfs3 (the NFS filesystem) and then mounted
a few directories from multiple sources. You can instruct HAM to restart fs-nfs3
upon failure, and also to remount the appropriate directories as required after restarting
the NFS process.
As another example, suppose io-pkt* (the network I/O manager) were to die. We
can tell HAM to restart it and also to load the appropriate network drivers (and maybe
a few more services that essentially depend on network services in order to function).
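Here's a rough sketch of what setting up such a recovery might look like in code. The pathnames, command lines, and pid handling are placeholders, and the argument lists of the ham_*() calls are simplified assumptions; check the HAM API reference for the exact prototypes.

#include <stdlib.h>
#include <sys/types.h>
#include <ha/ham.h>

int main(int argc, char *argv[])
{
    ham_entity_t    *ehdl;
    ham_condition_t *chdl;
    ham_action_t    *ahdl;
    pid_t            nfs_pid;

    if (argc < 2)
        return 1;
    nfs_pid = atoi(argv[1]);    /* pid of the running fs-nfs3 process */

    ham_connect(0);             /* connect to the HAM */

    /* Monitor the already-running fs-nfs3 (node descriptor 0 = local node). */
    ehdl = ham_attach("fs-nfs3", 0, nfs_pid, NULL, 0);

    /* When it dies... */
    chdl = ham_condition(ehdl, CONDDEATH, "death", 0);

    /* ...the actions added to this condition run in sequence: */
    /* 1. restart the filesystem manager, */
    ahdl = ham_action_restart(chdl, "restart", "/sbin/fs-nfs3", 0);
    /* 2. wait (here up to 2000 ms) for the restarted server's
          pathname to reappear, */
    ham_action_waitfor(chdl, "wait", "/dev/testpoint", 2000, 0);
    /* 3. then remount the directories the clients expect. */
    ham_action_execute(chdl, "remount",
        "mount -t nfs server:/export /mnt/export", 0);

    /* If the restart action itself fails, run a fallback command. */
    ham_action_fail_execute(ahdl, "fallback", "/usr/bin/notify-operator", 0);

    ham_disconnect(0);
    return 0;
}

Because the HAM retains the entities, conditions, and actions you've registered, a setup program like this typically runs once at system startup; the HAM then replays the recovery sequence each time the condition is triggered.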
HAM API
The basic mechanism to talk to HAM is to use its API. This API is implemented as a
library that you can link against. The library is thread-safe as well as cancellation-safe.
To control exactly what/how you're monitoring, the HAM API provides a collection of
functions, including:
ham_action_control()
ham_action_execute()
ham_action_fail_execute()
ham_action_fail_log()
ham_action_fail_notify_pulse()
ham_action_fail_notify_pulse_node()
ham_action_fail_notify_signal()
ham_action_fail_notify_signal_node()
ham_action_fail_waitfor()
ham_action_handle()
ham_action_handle_node()
ham_action_handle_free()
ham_action_heartbeat_healthy()
ham_action_log()
ham_action_notify_pulse()
ham_action_notify_pulse_node()
ham_action_notify_signal()
ham_action_notify_signal_node()
ham_action_remove()
ham_action_restart()
ham_action_waitfor()
ham_attach() (attach an entity)
ham_attach_node()
ham_attach_self()
ham_condition()
ham_condition_control()
ham_condition_handle()
ham_condition_handle_node()
ham_condition_handle_free()
ham_condition_raise()
ham_condition_remove()
ham_condition_state()
ham_connect() (connect to a HAM)
ham_connect_nd()
ham_connect_node()
ham_detach()
ham_detach_name()
ham_detach_name_node()
ham_detach_self()
ham_disconnect()
ham_disconnect_nd()
ham_disconnect_node()
ham_entity()
ham_entity_condition_raise() (raise a condition)
ham_entity_condition_state()
ham_entity_control()
ham_entity_handle()
ham_entity_handle_node()
ham_entity_handle_free()
ham_entity_node()
ham_heartbeat()
ham_stop() (stop a HAM)
ham_stop_nd()
ham_stop_node()
ham_verbose()
Chapter 16
Adaptive Partitioning
The QNX Neutrino RTOS supports adaptive partitioning to let you control the allocation
of resources among competing processes.
Figure 55: Static partitions guarantee that processes get the resources specified by
the system designer.
Typically, the main objective of competing resource partitioning systems is to divide
a computer into a set of smaller computers that interact as little as possible; however,
this approach isn't very flexible. Adaptive partitioning takes a much more flexible view.
QNX Neutrino partitions are adaptive because:
you can change configurations at run time
they're typically fixed at one configuration at a time
the partition behavior auto-adapts to conditions at run time. For example:
free time is redistributed to other scheduler partitions
filesystems can bill time to clients with a mechanism that temporarily moves
threads between time partitions
As a result, adaptive partitions are less restrictive and much more powerful. In addition
to being adaptive, time partitions allow you to easily model the fundamentally different
behavior of CPU time when viewed as a resource.
Why adaptive?
To provide realtime performance with guarantees against overloading, QNX Neutrino
introduced adaptive partitioning. Rigid partitions work best in fairly static systems
with little or no dynamic deployment of software. In dynamic systems, static partitions
can be inefficient. For example, the static division of execution time between partitions
can waste CPU time and introduce delays:
If most of the partitions are idle, and one is very busy, the busy partition doesn't
receive any additional execution time, while background threads in the other
partitions waste CPU time.
If an interrupt is scheduled for a partition, it has to wait until the partition runs.
This can cause unacceptable latency, especially if bursts of interrupts occur.
You can introduce adaptive partitioning without changing (or even
recompiling) your application code, although you do have to rebuild your
system's OS image.
In contrast, adaptive partitioning:
behaves as a global hard realtime thread scheduler under normal load, but can
still provide minimal interrupt latencies even under overload conditions
maximizes the usage of the CPU's resources. In the case of the thread scheduler,
it distributes a partition's unused budget among partitions that require extra
resources when the system isn't loaded.
Large systems are often developed by several teams that work somewhat independently of
each other. The design is divided among groups with differing system performance
goals, different schemes for determining priorities, and different approaches to runtime
optimization.
This can be further compounded by product development in different geographic
locations and time zones. Once all of these disparate subsystems are integrated into
a common runtime environment, all parts of the system need to provide adequate
response under all operating scenarios, such as:
normal system loading
peak periods
failure conditions
Given the parallel development paths, system issues invariably arise when integrating
the product. Typically, once a system is running, unforeseen interactions that cause
serious performance degradations are uncovered. When situations such as this arise,
there are usually very few designers or architects who can diagnose and solve these
problems at a system level. Solutions often require considerable modification (frequently
by trial and error) to get right. This extends system integration and delays the time
to market.
Problems of this nature can take a week or more to troubleshoot, and several weeks
to adjust priorities across the system, retest, and refine. If these problems can't be
solved effectively, product scalability is limited.
This is largely due to the fact that there's no effective way to budget CPU use across
these groups. Thread priorities provide a way to ensure that critical tasks run, but don't
provide guaranteed CPU time for important, noncritical tasks, which can be starved
in normal operations. In addition, a common approach to establishing thread priorities
is difficult to scale across a large development team.
Adaptive partitioning using the thread scheduler lets architects maintain a reserve of
resources for emergency purposes, such as a disaster-recovery system, or a
field-debugging shell, and define high-level CPU budgets per subsystem, allowing
development groups to implement their own priority schemes and optimizations within
a given budget. This approach lets design groups develop subsystems independently
and eases the integration effort. The net effect is to improve time-to-market and
facilitate product scaling.
Providing security
Many systems are vulnerable to Denial of Service (DOS) attacks. For example, a
malicious user could bombard a system with requests that need to be processed by
one of the system's processes, starving other critical functions of CPU time.
Figure 56: Without adaptive partitioning, a DOS attack on one process can starve other
critical functions.
Some systems try to overcome this problem by implementing a monitor process that
detects CPU utilization and invokes corrective actions when it deems that a process
is using too much CPU. This approach has a number of drawbacks, including:
Response time is typically slow.
It caps CPU usage even at times when heavy processing is legitimately required.
It isn't infallible or reliable; it depends on appropriate thread priorities to ensure
that the monitor process obtains sufficient CPU time.
The thread scheduler can solve this problem by providing separate
budgets to the system's various functions. This ensures that the system always has
some CPU capacity for important tasks. Threads can change their own priorities, which
can be a security hole, but you can configure the thread scheduler to prevent code
running in a partition from changing its own budget.
Figure: With adaptive partitioning, each partition is guaranteed its budgeted share of the CPU (e.g., 60%, 20%, 15%, and 5%), so a DOS attack on one process can't starve the rest of the system.
Since adaptive partitioning can allocate any unused CPU time to partitions that require
it, it doesn't unnecessarily cap control-plane activity when there's a legitimate need
for increased processing.
Debugging
Adaptive partitioning can even make debugging an embedded system easier (during
development or deployment) by providing an "emergency door" into the system.
Simply create a partition that you can run diagnostic tools in; if you don't need to use
the partition, the thread scheduler allocates its budget among the other partitions.
This provides you with access to the system without compromising its performance.
For more information, see the Testing and Debugging chapter of the Adaptive
Partitioning User's Guide.
Glossary
A20 gate
On x86-based systems, a hardware component that forces the A20 address
line on the bus to zero, regardless of the actual setting of the A20 address
line on the processor. This component is in place to support legacy systems,
but the QNX Neutrino RTOS doesn't require any such hardware. Note that
some processors, such as the 386EX, have the A20 gate hardware built right
into the processor itself; our IPL will disable the A20 gate as soon as
possible after startup.
adaptive
Scheduling policy whereby a thread's priority is decayed by 1. See also FIFO,
round robin, and sporadic.
adaptive partitioning
A method of dividing, in a flexible manner, CPU time, memory, file resources,
or kernel resources with some policy of minimum guaranteed usage.
application ID
A number that identifies all processes that are part of an application. Like
process group IDs, the application ID value is the same as the process ID of
the first process in the application. A new application is created by spawning
with the POSIX_SPAWN_NEWAPP or SPAWN_NEWAPP flag. A process created
without one of those flags inherits the application ID of its parent. A process needs
the PROCMGR_AID_CHILD_NEWAPP ability in order to set those flags.
The SignalKill() kernel call accepts a SIG_APPID flag ORed into the signal
number parameter. This tells it to send the signal to all the processes with
an application ID that matches the pid argument. The DCMD_PROC_INFO
devctl() returns the application ID in a structure field.
asymmetric multiprocessing (AMP)
A multiprocessing system where a separate OS, or a separate instantiation
of the same OS, runs on each CPU.
atomic
Of or relating to atoms. :-)
In operating systems, this refers to the requirement that an operation, or
sequence of operations, be considered indivisible. For example, a thread
may need to move a file position to a given location and read data. These
operations must be performed in an atomic manner; otherwise, another
thread could preempt the original thread and move the file position to a
different location, thus causing the original thread to read data from the
second thread's position.
attributes structure
Structure containing information used on a per-resource basis (as opposed
to the OCB, which is used on a per-open basis).
This structure is also known as a handle. The structure definition is fixed
(iofunc_attr_t), but may be extended. See also mount structure.
bank-switched
A term indicating that a certain memory component (usually the device
holding an image) isn't entirely addressable by the processor. In this case,
a hardware component manifests a small portion (or window) of the device
onto the processor's address bus. Special commands have to be issued to
the hardware to move the window to different locations in the device. See
also linearly mapped.
base layer calls
Convenient set of library calls for writing resource managers. These calls all
start with resmgr_*(). Note that while some base layer calls are unavoidable
(e.g. resmgr_pathname_attach()), we recommend that you use the POSIX
layer calls where possible.
BIOS/ROM Monitor extension signature
A certain sequence of bytes indicating to the BIOS or ROM Monitor that the
device is to be considered an extension to the BIOS or ROM Monitor;
control is to be transferred to the device by the BIOS or ROM Monitor, with
the expectation that the device will perform additional initializations.
On the x86 architecture, the two bytes 0x55 and 0xAA must be present (in
that order) as the first two bytes in the device, with control being transferred
to offset 0x0003.
block-integral
The requirement that data be transferred such that individual structure
components are transferred in their entirety; no partial structure component
transfers are allowed.
In a resource manager, directory data must be returned to a client as
block-integral data. This means that only complete struct dirent
structures can be returned; it's inappropriate to return partial structures,
assuming that the next _IO_READ request will pick up where the previous
one left off.
bootable
An image can be either bootable or nonbootable. A bootable image is one
that contains the startup code that the IPL can transfer control to.
bootfile
The part of an OS image that runs the startup code and the microkernel.
bound multiprocessing (BMP)
A multiprocessing system where a single instantiation of an OS manages all
CPUs simultaneously, but you can lock individual applications or threads to
a specific CPU.
budget
In sporadic scheduling, the amount of time a thread is permitted to execute
at its normal priority before being dropped to its low priority.
buildfile
A text file containing instructions for mkifs specifying the contents and
other details of an image, or for mkefs specifying the contents and other
details of an embedded filesystem image.
canonical mode
Also called edited mode or cooked mode. In this mode the character device
library performs line-editing operations on each received character. Only
when a line is completely entered (typically when a carriage return (CR)
is received) will the line of data be made available to application processes.
Contrast raw mode.
channel
A kernel object used with message passing.
In QNX Neutrino, message passing is directed towards a connection (made
to a channel); threads can receive messages from channels. A thread that
wishes to receive messages creates a channel (using ChannelCreate()), and
then receives messages from that channel (using MsgReceive()). Another
thread that wishes to send a message to the first thread must make a
connection to that channel by attaching to the channel (using
ConnectAttach()) and then sending data (using MsgSend()).
chid
An abbreviation for channel ID.
discrete (or traditional) multiprocessor system
A system that has separate physical processors hooked up in multiprocessing
mode over a board-level bus.
DNS
Domain Name Service: an Internet protocol used to convert ASCII domain
names into IP addresses. In QNX Neutrino native networking, dns is one of
Qnet's built-in resolvers.
dynamic bootfile
An OS image built on the fly. Contrast static bootfile.
dynamic linking
The process whereby you link your modules in such a way that the Process
Manager will link them to the library modules before your program runs. The
word dynamic here means that the association between your program and
the library modules that it uses is done at load time, not at linktime. Contrast
static linking. See also runtime loading.
edge-sensitive
One of two ways in which a PIC (Programmable Interrupt Controller) can be
programmed to respond to interrupts. In edge-sensitive mode, the interrupt
is noticed upon a transition to/from the rising/falling edge of a pulse.
Contrast level-sensitive.
edited mode
See canonical mode.
EOI
End Of Interrupt: a command that the OS sends to the PIC after processing
all Interrupt Service Routines (ISRs) for that particular interrupt source, so
that the PIC can reset the processor's In Service Register. See also PIC and
ISR.
EPROM
Erasable Programmable Read-Only Memory: a memory technology that
allows the device to be programmed (typically with higher-than-operating
voltages, e.g. 12 V), with the characteristic that any bit (or bits) may be
individually programmed from a 1 state to a 0 state. Changing a bit from
a 0 state to a 1 state can be accomplished only by erasing the entire
device, setting all of the bits to a 1 state. Erasing is accomplished by shining
an ultraviolet light through the erase window of the device for a fixed period of time.
handle
A pointer that the resource manager base library binds to the pathname
registered via resmgr_attach(). This handle is typically used to associate
some kind of per-device information. Note that if you use the iofunc_*()
POSIX layer calls, you must use a particular type of handle, in this case
called an attributes structure.
hard thread affinity
A user-specified binding of a thread to a set of processors, done by means
of a runmask. Contrast soft thread affinity.
image
In the context of embedded QNX Neutrino systems, an image can mean
either a structure that contains files (i.e. an OS image) or a structure that
can be used in a read-only, read/write, or read/write/reclaim FFS-2-compatible
filesystem (i.e. a flash filesystem image).
inherit mask
A bitmask that specifies which processors a thread's children can run on.
Contrast runmask.
interrupt
An event (usually caused by hardware) that interrupts whatever the processor
was doing and asks it to do something else. The hardware will generate an
interrupt whenever it has reached some state where software intervention is
required.
interrupt handler
See ISR.
interrupt latency
The amount of elapsed time between the generation of a hardware interrupt
and the first instruction executed by the relevant interrupt service routine.
Also designated as Til. Contrast scheduling latency.
interrupt service routine
See ISR.
interrupt service thread
A thread that is responsible for performing thread-level servicing of an
interrupt.
Since an ISR can call only a very limited number of functions, and since
the amount of time spent in an ISR should be kept to a minimum, generally
the bulk of the interrupt servicing work should be done by a thread. The
thread attaches the interrupt (via InterruptAttach() or InterruptAttachEvent())
and then blocks (via InterruptWait()), waiting for the ISR to tell it to do
something (by returning an event of type SIGEV_INTR). To aid in minimizing
scheduling latency, the interrupt service thread should raise its priority
appropriately.
I/O message
A message that relies on an existing binding between the client and the
resource manager. For example, an _IO_READ message depends on the
client's having previously established an association (or context) with the
resource manager by issuing an open() and getting back a file descriptor.
See also connect message, context, combine message, and message.
I/O privileges
A particular right that, if enabled for a given thread, allows the thread to
perform I/O instructions (such as the x86 assembler in and out
instructions). By default, I/O privileges are disabled, because a program with
them enabled can wreak havoc on a system. To enable I/O privileges, the thread
must be running as root, and call ThreadCtl().
IPC
Interprocess Communication: the ability for two processes (or threads) to
communicate. The QNX Neutrino RTOS offers several forms of IPC, most
notably native messaging (synchronous, client/server relationship), POSIX
message queues and pipes (asynchronous), as well as signals.
IPL
Initial Program Loader: the software component that either takes control
at the processor's reset vector (e.g. location 0xFFFFFFF0 on the x86), or
is a BIOS extension. This component is responsible for setting up the
machine into a usable state, such that the startup program can then perform
further initializations. The IPL is written in assembler and C. See also BIOS
extension signature and startup code.
IRQ
Interrupt Request: a hardware request line asserted by a peripheral to
indicate that it requires servicing by software. The IRQ is handled by the
PIC, which then interrupts the processor, usually causing the processor to
execute an Interrupt Service Routine (ISR).
ISR
Interrupt Service Routine: a routine responsible for servicing hardware
(e.g. reading and/or writing some device ports), for updating some data
structures shared between the ISR and the thread(s) running in the
application, and for signalling the thread that some kind of event has
occurred.
kernel
See microkernel.
level-sensitive
One of two ways in which a PIC (Programmable Interrupt Controller) can be
programmed to respond to interrupts. If the PIC is operating in level-sensitive
mode, the IRQ is considered active whenever the corresponding hardware
line is active. Contrast edge-sensitive.
linearly mapped
A term indicating that a certain memory component is entirely addressable
by the processor. Contrast bank-switched.
message
A parcel of bytes passed from one process to another. The OS attaches no
special meaning to the content of a message; the data in a message has
meaning for the sender of the message and for its receiver, but for no one
else.
Message passing not only allows processes to pass data to each other, but
also provides a means of synchronizing the execution of several processes.
As they send, receive, and reply to messages, processes undergo various
changes of state that affect when, and for how long, they may run.
microkernel
A part of the operating system that provides the minimal services used by
a team of optional cooperating processes, which in turn provide the
higher-level OS functionality. The microkernel itself lacks filesystems and
many other services normally expected of an OS; those services are provided
by optional processes.
mount structure
An optional, well-defined data structure (of type iofunc_mount_t) within
an iofunc_*() structure, which contains information used on a per-mountpoint
basis (generally used only for filesystem resource managers). See also
attributes structure and OCB.
mountpoint
The location in the pathname space where a resource manager has
registered itself. For example, the serial port resource manager registers
mountpoints for each serial device (/dev/ser1, /dev/ser2, etc.), and a
CD-ROM filesystem may register a single mountpoint of /cdrom.
multicore system
A chip that has one physical processor with multiple CPUs interconnected
over a chip-level bus.
mutex
Mutual exclusion lock, a simple synchronization service used to ensure
exclusive access to data shared between threads. It is typically acquired
(pthread_mutex_lock()) and released (pthread_mutex_unlock()) around the
code that accesses the shared data (usually a critical section). See also
critical section.
name resolution
In a QNX Neutrino network, the process by which the Qnet network manager
converts an FQNN to a list of destination addresses that the transport layer
knows how to get to.
name resolver
Program code that attempts to convert an FQNN to a destination address.
nd
An abbreviation for node descriptor, a numerical identifier for a node relative
to the current node. Each node's node descriptor for itself is 0
(ND_LOCAL_NODE).
NDP
Node Discovery Protocol: a proprietary QNX Software Systems protocol for
broadcasting name resolution requests on a QNX Neutrino LAN.
network directory
A directory in the pathname space that's implemented by the Qnet network
manager.
NFS
Network FileSystem: a TCP/IP application that lets you graft remote
filesystems (or portions of them) onto your local namespace. Directories on
the remote systems appear as part of your local filesystem and all the utilities
you use for listing and managing files (e.g. ls, cp, mv) operate on the remote
files exactly as they do on your local files.
NMI
Nonmaskable Interrupt: an interrupt that can't be masked by the processor.
We don't recommend using an NMI!
Node Discovery Protocol
See NDP.
node domain
A character string that the Qnet network manager tacks onto the nodename
to form an FQNN.
nodename
A unique name consisting of a character string that identifies a node on a
network.
nonbootable
A nonbootable OS image is usually provided for larger embedded systems
or for small embedded systems where a separate, configuration-dependent
setup may be required. Think of it as a second filesystem that has some
additional files on it. Since it's nonbootable, it typically won't contain the
OS, startup file, etc. Contrast bootable.
OCB
Open Control Block (or Open Context Block): a block of data established
by a resource manager during its handling of the client's open() function.
This context block is bound by the resource manager to this particular
request, and is then automatically passed to all subsequent I/O functions
generated by the client on the file descriptor returned by the client's open().
package filesystem
A virtual filesystem manager that presents a customized view of a set of files
and directories to a client. The real files are present on some medium;
the package filesystem presents a virtual view of selected files to the client.
partition
A division of CPU time, memory, file resources, or kernel resources with
some policy of minimum guaranteed usage.
pathname prefix
See mountpoint.
to use that device (e.g. the shell and the telnet daemon process, used for
logging in to a system over the Internet).
pulses
In addition to the synchronous Send/Receive/Reply services, QNX Neutrino
also supports fixed-size, nonblocking messages known as pulses. These carry
a small payload (four bytes of data plus a single byte code). A pulse is also
one form of event that can be returned from an ISR or a timer. See
MsgDeliverEvent() for more information.
Qnet
The native network manager in the QNX Neutrino RTOS.
QoS
Quality of Service: a policy (e.g. loadbalance) used to connect nodes
in a network in order to ensure highly dependable transmission. QoS is an
issue that often arises in high-availability (HA) networks as well as realtime
control systems.
RAM
Random Access Memory: a memory technology characterized by the ability
to read and write any location in the device without limitation. Contrast flash
and EPROM.
raw mode
In raw input mode, the character device library performs no editing on
received characters. This reduces the processing done on each character to
a minimum and provides the highest performance interface for reading data.
Also, raw mode is used with devices that typically generate binary data;
you don't want any translations of the raw binary stream between the device
and the application. Contrast canonical mode.
replenishment
In sporadic scheduling, the period of time during which a thread is allowed
to consume its execution budget.
reset vector
The address at which the processor begins executing instructions after the
processor's reset line has been activated. On the x86, for example, this is
the address 0xFFFFFFF0.
resource manager
A user-level server program that accepts messages from other programs and,
optionally, communicates with hardware. QNX Neutrino resource managers
are responsible for presenting an interface to various types of devices,
whether actual (e.g. serial ports, parallel ports, network cards, disk drives)
or virtual (e.g. /dev/null, a network filesystem, and pseudo-ttys).
In other operating systems, this functionality is traditionally associated with
device drivers. But unlike device drivers, QNX Neutrino resource managers
don't require any special arrangements with the kernel. In fact, a resource
manager looks just like any other user-level program. See also device driver.
RMA
Rate Monotonic Analysis: a set of methods used to specify, analyze, and
predict the timing behavior of realtime systems.
round robin
A scheduling policy whereby a thread is given a certain period of time to
run. Should the thread consume CPU for the entire period of its timeslice,
the thread will be placed at the end of the ready queue for its priority, and
the next available thread will be made READY. If a thread is the only thread
READY at its priority level, it will be able to consume CPU again immediately.
See also adaptive, FIFO, and sporadic.
runmask
A bitmask that indicates which processors a thread can run on. Contrast
inherit mask.
runtime loading
The process whereby a program decides while it's actually running that it
wishes to load a particular function from a library. Contrast static linking.
scheduling latency
The amount of time that elapses between the point when one thread makes
another thread READY and when the other thread actually gets some CPU
time. Note that this latency is almost always at the control of the system
designer.
Also designated as Tsl. Contrast interrupt latency.
scoid
An abbreviation for server connection ID.
session
A collection of process groups established for job-control purposes; a session is usually associated with a controlling terminal.
Index
_CS_DOMAIN 248
_CS_HOSTNAME 248
_IO_STAT 166
_NOTIFY_COND_INPUT 82
_NOTIFY_COND_OBAND 82
_NOTIFY_COND_OUTPUT 82
_NTO_CHF_FIXED_PRIORITY 77
_NTO_TCTL_IO 86, 137, 139
_NTO_TCTL_RUNMASK 123
_NTO_TCTL_RUNMASK_GET_AND_SET_INHERIT 123
.longfilenames 184
/ directory 142
/dev 90, 142, 180, 186, 223
hd* 180
mem 142
mq and mqueue 90
ser* 223
shmem 186
zero 142
/net directory 246
/proc 142, 281
boot 142
ham 281
pid 142
/tmp directory 186
A
abort() 86
accept() 261
actions (HA) 278
adaptive partitioning 286, 288, 292, 293
debugging with 292
partitions 288
thread scheduler 293
affinity, processor 119, 122
alarm() 58
anonymous memory 95
Apple Macintosh HFS and HFS Plus 214
as_add_containing() 104
as_add() 103
Asymmetric Multiprocessing (AMP) 115, 116
asynchronous publishing 217
atomic operations 47, 54
attributes structure (resource manager) 172
autoconnect 268
AutoIP 266
B
background priority (sporadic scheduling) 42
barriers 29, 47, 50
and threads 50
bind() 261
bindresvport() 261
block-oriented devices 223
C
canonical input mode 230
cd command 149, 150
See also directories
CD-ROM filesystem 204
ChannelCreate() 75, 77
ChannelDestroy() 75
channels 75, 76
character devices 223
chmod() 174
chown() 174
chroot() 149
CIFS filesystem 211
clock services 56
clock_getcpuclockid() 56
clock_getres() 56
clock_gettime() 56
clock_settime() 56
ClockAdjust() 56, 57
ClockCycles() 56, 57
ClockId() 56
ClockPeriod() 56
ClockTime() 56
close() 92, 103, 174
COFF (Common Object File Format) 158
combine messages 172
conditions (HA entity states) 277
CONDVAR (thread state) 36
condvars 29, 45, 47, 49, 120
example 49
operations 49
SMP 120
confstr() 248
connect messages 169
connect() 261
ConnectAttach() 75
ConnectDetach() 75
consoles 233
physical 233
virtual 233
cooked input mode 230
cooperating processes 105
FIFOs 105
pipes 105
copy-on-write (COW) 193
CPU 119, 122, 286, 293
affinity 119, 122
usage, budgeting 286, 293
CRC 189
critical section 45, 48, 49, 54, 60, 120
defined 45
SMP 120
current working directory 149, 150
D
dates, valid range of 56
DEAD (thread state) 36
deadlock-free systems, rules for 79
debugging, using adaptive partitions for 292
defragmentation of physical memory 138
design goals for QNX Neutrino 28
design goals for the QNX Neutrino RTOS 13
devc-con, devc-con-hid 233
devctl() 174, 227
device control 227
device drivers 22, 64, 163
See also resource managers
no need to link into kernel 64
similarity to standard processes 22
device names, creating 148
directories 149, 150
changing 149
current working directory 149, 150
directories, changing 150
discrete multiprocessors 115
disks 181, 192, 201
corruption, avoiding 192
DOS disks, accessing 201
partitions 181
dladdr() 160
dlclose() 160
dlopen() 160
dlsym() 160
DMA-safe region, defining 104
dn_comp() 261
dn_expand() 261
domains of authority 142
domains, encryption 196
DOS filesystem manager 201
DT_RPATH 159
dumper 33
dup() 103, 151, 170, 171
dup2() 103, 151
dynamic interpreter 158
dynamic linker, See runtime linker
dynamic linking 155
E
edited input mode 230
editing capabilities (io-char) 230
ELF 131, 157
Embedded Transaction Filesystem (ETFS) 187
encryption 196
endprotoent() 261
endservent() 261
entities (HA process) 276
F
fast emitting mode (instrumented kernel) 109
fcntl() 151
FIFO (scheduling method) 40, 41, 47
FIFOs 105
See also pipes
creating 105
removing 105
file descriptors (FDs) 103, 150, 151, 152
duplicating 151
inheritance 151, 152
open control blocks (OCBs) 150
several FDs referring to the same OCB 151
typed memory and 103
files 105, 151, 201, 286
DOS files, operating on 201
FIFOs 105
opened by different processes 151
opened twice by same process 151
pipes 105
space, budgeting (not implemented) 286
filesystems 143, 147, 151, 185, 186, 187, 191, 192, 196,
201, 204, 210, 211, 212, 213, 214, 215, 281
accessing a filesystem on another node 147
Apple Macintosh HFS and HFS Plus 214
CD-ROM 204
CIFS 211
DOS 201
Embedded Transaction (ETFS) 187
HAM 281
Image 185
Linux Ext2 212
NFS 210
NTFS (fs-nt.so) 215
Power-Safe (fs-qnx6) 192, 196
encryption 196
QNX 4 143, 191
RAM 186
seek points 151
Universal Disk Format (UDF) 213
five nines (HA metric) 269
Flash 131
G
gethostbyaddr() 261
gethostbyname() 261
getpeername() 261
getprotobyname() 261
getprotobynumber() 261
getprotoent() 261
getservbyname() 261
getservent() 261
getsockname() 261
getsockopt() 261
global list 160
GNS (Global Name Service) 248
H
h_errlist() 261
h_errno() 261
h_nerr() 261
HA 269, 270, 273
client-side library 273
microkernel architecture inherently suited for 270
recovery example 273
HAM 275, 276, 281
API 281
hierarchy 276
ham_action_control() 281
ham_action_execute() 278, 281
ham_action_fail_execute() 279, 281
ham_action_fail_log() 279, 281
ham_action_fail_notify_pulse_node() 279, 281
ham_action_fail_notify_pulse() 279, 281
ham_action_fail_notify_signal_node() 279, 281
ham_action_fail_notify_signal() 279, 281
ham_action_fail_waitfor() 279, 281
ham_action_handle_free() 281
ham_action_handle_node() 281
ham_action_handle() 281
ham_action_heartbeat_healthy() 278, 281
ham_action_log() 278, 281
ham_action_notify_pulse_node() 278, 281
ham_action_notify_pulse() 278, 281
ham_action_notify_signal_node() 278, 281
ham_action_notify_signal() 278, 281
ham_action_remove() 281
ham_action_restart() 278, 281
ham_action_waitfor() 278, 281
ham_attach_node() 281
ham_attach_self() 281
ham_attach() 281
ham_condition_control() 281
ham_condition_handle_free() 281
ham_condition_handle_node() 281
ham_condition_handle() 281
ham_condition_raise() 280, 281
ham_condition_remove() 281
ham_condition_state() 280, 281
ham_condition() 281
ham_connect_nd() 281
ham_connect_node() 281
ham_connect() 281
ham_detach_name_node() 281
ham_detach_name() 281
ham_detach_self() 281
ham_detach() 281
ham_disconnect_nd() 281
ham_disconnect_node() 281
ham_disconnect() 281
ham_entity_condition_raise() 280, 281
ham_entity_condition_state() 280, 281
ham_entity_control() 281
ham_entity_handle_free() 281
ham_entity_handle_node() 281
ham_entity_handle() 281
ham_entity_node() 281
ham_entity() 281
ham_heartbeat() 281
ham_stop_nd() 281
ham_stop_node() 281
ham_stop() 281
ham_verbose() 281
herror() 261
HFS and HFS Plus 214
high availability, See HA
High Availability Manager, See HAM
hstrerror() 261
htonl() 261
htons() 261
Hyper-Threading 118
I
I/O messages 169
I/O privileges 86, 137, 139
I/O resources 142
i8259 interrupt control hardware 64
idle thread 39, 65
ifconfig 242
Image filesystem 185
inet_addr() 261
inet_aton() 261
inet_lnaof() 261
inet_makeaddr() 261
inet_netof() 261
inet_network() 261
inet_ntoa() 261
inheritance structure 123
initial budget (sporadic scheduling) 42
inodes 192
input mode 228, 230
edited 230
raw 228
input, redirecting 105
instrumentation 107
interrupts can be traced 107
kernel can be used in prototypes or final products 107
works on SMP systems 107
interprocess communication, See IPC
interprocessor interrupts (IPIs) 120
INTERRUPT (thread state) 36
interrupt control hardware (i8259 on a PC) 64
interrupt handlers 29, 60, 120
See also ISR
SMP 120
interrupt latency 60
Interrupt Service Routine, See ISR
InterruptAttach() 62
InterruptAttachEvent() 62
InterruptDetach() 62
InterruptDisable() 62, 120
problem on SMP systems 120
InterruptEnable() 62, 120
problem on SMP systems 120
InterruptHookIdle() 65
InterruptHookTrace() 65
J
JOIN (thread state) 36
K
kernel, See microkernel, Process Manager,
kill() 83
L
latency 60, 61, 64
interrupt 60, 64
scheduling 61, 64
LD_LIBRARY_PATH 159
LD_PRELOAD 160
libraries 160
loading before others 160
link() 201
linking 154, 155, 157
dynamically 155
sections 157
statically 154
Linux Ext2 filesystem 212
listen() 261
loadbalance (QoS policy) 250
lock() 174
locking memory 136
lseek() 174
lsm-pf-*.so 264
lsm-qnet.so 246, 247, 248, 249, 251
tx_retries option 251
M
Macintosh HFS and HFS Plus 214
malloc() 153
MAP_ANON 95, 140
MAP_BELOW16M 96
MAP_FIXED 95
MAP_LAZY 137
MAP_NOINIT 96, 97
MAP_NOX64K 96
MAP_PHYS 95, 140
MAP_PRIVATE 94, 137
MAP_SHARED 94
mem_offset() 140
memory 46, 54, 73, 91, 93, 95, 96, 99, 104, 132, 136,
137, 138, 139, 140
anonymous 95
DMA-safe region, defining 104
initializing 96
locking 136, 139
mapping 93, 140
physical, defragmenting 138
protection, advantage of for embedded systems 132
quantums 139
shared 46, 54, 73
superlocking 137
typed 99
unmovable 139, 140
Memory Management Units (MMUs) 132
memory-resident 136
message copying 71
message passing 24, 29, 68, 78, 243
API 78
as means of synchronization 24
network-wide 243
message queues 89
messages 24, 46, 71
contents of 24
multipart 71
tend to be tiny 46
metadata 193
microkernel 19, 20, 21, 27, 40, 107, 113, 117, 119
See also microkernel
comparable to a realtime executive 19
defined 19
general responsibilities 21
instrumentation 107-113
instrumented 107
locking 119
managing team of cooperating processes 20
modularity as key aspect 19
priority of 40
services 21
SMP 117, 119
version of, determining 27
mkfifo 105
mkfifo() 105
mkqnx6fs 193, 196
encryption 196
mlock() 137, 139
mlockall() 137
mmap_device_memory() 96
mmap() 92, 93, 97, 99, 140, 174
MMU 72, 73, 132
mount structure (resource manager) 172
mountpoints 142, 144, 177
order of resolution 144
mprotect() 92, 96
mq server 89
mq_close() 89, 90
mq_getattr() 90
mq_notify() 82, 90
mq_open() 89, 90
mq_receive() 89, 90
mq_send() 89, 90
mq_setattr() 90
mq_unlink() 89, 90
mqueue resource manager 89
MsgDeliverEvent() 36, 78, 81
MsgError() 69, 70
MsgInfo() 78
MsgKeyData() 78
MsgRead() 78
MsgReadv() 74
MsgReceive() 36, 40, 67, 78
MsgReceivePulse() 78
MsgReceivePulsev() 74
MsgReceivev() 74
MsgReply() 67, 70, 74, 78
MsgReply*() 36
MsgReplyv() 74
MsgSend() 36, 37, 40, 67, 70, 74, 78
non-cancellation point variants 40
MsgSendPulse() 36, 78
MsgSendsv() 74
MsgSendv() 74, 125
MsgSendvs() 74
MsgWrite() 78
msync() 92
multicore processors 115
multiprocessing 115
munlock() 137
munlockall() 137
munmap_flags() 92, 97
munmap() 92, 97
MUTEX (thread state) 36
mutexes 29, 45, 47, 48, 49, 50, 120, 139
attributes 48
not currently moved when defragmenting physical memory
139
priority inheritance 48
recursive 49, 50
SMP 120
N
name resolution 246
network 246
name resolver 249
name_attach() 76
name_open() 76
NAND flash 187
NANOSLEEP (thread state) 36
nanosleep() 36, 58
NAT 264
ND_LOCAL_NODE 246
nested interrupts 61
NET_REPLY (thread state) 36
NET_SEND (thread state) 36
network 25, 148, 149, 167, 243, 246
as homogeneous set of resources 25
flexibility 25
message passing 243
name resolution 246
pathnames 148
root 149
transparency 25, 167
NFS filesystem 210
NMI 65
node descriptor 246
network 246
node domain 248
node name 248
NTFS (fs-nt.so) 215
ntohl() 261
ntohs() 261
NTP 265
O
O_SYNC (ignored by Power-Safe filesystems) 195
object files, sections of 157
on 123
on command 149
open control blocks (OCBs) 150, 151
open resources 151
active information contained in OCBs 151
open() 93, 174
operations, atomic 54
output, redirecting 105
P
pages 132
parallel devices 235
partially bound executable 155
partitions 286, 288
adaptive 288
static 286
thread scheduler 286
partitions (disk) 181
partitions (thread scheduler) 286
pathconf() 174
pathname 148
converting relative to network 148
pathname space 142, 164, 246
mapping 164
pause() 83
performance 46, 60
context-switch 46
realtime 60
Persistent Publish/Subscribe, See PPS
physical memory, defragmenting 138
pidin 33
pipe manager 105
pipe() 105
pipes 105
creating 105
Point-to-Point Protocol (PPP) 259
Point-to-Point Protocol over Ethernet (PPPoE) 267
popen() 105
POSIX 14, 16, 17, 31, 32, 89
defines interface, not implementation 14
message queues 89
profiles 16
realtime extensions 16
standards of interest to embedded systems developers 16
suitable for embedded systems 16, 17
threads 16, 31, 32
library calls not involving kernel calls 31
library calls with corresponding kernel calls 32
UNIX and 14
posix_mem_offset() 103
posix_spawn*() family of functions 126, 137
memory locks 137
posix_typed_mem_get_info() 99
posix_typed_mem_open() 99
Power-Safe (fs-qnx6) filesystem 192, 196
encryption 196
PPP (Point-to-Point Protocol) 259
PPPoE (Point-to-Point Protocol over Ethernet) 267
PPS 217, 218, 219, 220, 221, 222
files 219
filesystem 219
modes 221
notification groups 222
objects 219
options 219
persistence 218
publishing 220
qualifiers 219
subscribing 221
preferred (QoS policy) 250
prefix 142
prefix tree 142
printf() 153
priority 38, 39, 40, 42, 48, 77, 129, 217
background and foreground (sporadic scheduling) 42
inheritance 48, 77, 129, 217
messages 77
mutexes 48
inversion 40, 48
of microkernel 40
range 39
process groups 128, 129
membership, inheriting 128
remote node 129
Process Manager 125
See also microkernel,
capabilities of 125
required when creating multiple POSIX processes 125
processes 20, 22, 28, 31, 105, 126, 131, 136, 137, 151
as container for threads 31
cooperating 105
via pipes and FIFOs 105
I/O privileges, requesting 137
Q
Qnet 241, 243-255, 250, 251
limiting transmission retries 251
redundant 250
QNX 4 filesystem 143, 191
QNX 6 filesystem, See Power-Safe filesystem
QNX Neutrino 13, 18, 19, 22, 25, 28, 29
design goals 13, 28
extensibility 22
flexibility 18
microkernel 19
network as homogeneous set of resources 25
network flexibility 25
network transparency 25
QNX Neutrino (continued)
preemptible even during message pass 29
realtime applications, suitability for 18
services 29
single-computer model 25
QoS (Quality of Service) 249, 250
policies 250
quantums 139
R
raise() 83
RAM 131
RAM "filesystem" 186
RAM disk 181
rate monotonic analysis (RMA) 42
raw input mode 228, 229
conditions for input request 228
FORWARD qualifier 229
MIN qualifier 228
TIME qualifier 228
TIMEOUT qualifier 229
read() 82
readblock() 172
readdir() 147, 178
reader/writer locks 47, 52
READY (thread state) 36
realtime performance 60, 61
interrupt latency and 60
nested interrupts and 61
scheduling latency and 61
RECEIVE (thread state) 36
recv() 261
recvfrom() 261
redirecting 105
input 105
output 105
redundant Qnet 250
relative pathnames, converting to network pathnames 148
remove() 105
replenishment period (sporadic scheduling) 43
REPLY (thread state) 36
res_init() 261
res_mkquery() 261
res_query() 261
res_querydomain() 261
res_search() 261
res_send() 261
resource managers 163, 164, 167, 169, 170, 171, 172, 177
atomic operations 172
attributes structure 172
can be started and stopped dynamically 163
communicate with clients via IPC 167
context for client requests 169
defined 164
don't require any special arrangements with the kernel 163
iofunc_*() shared library 172
message types 169
mount structure 172
shared library 170
similarity to traditional device drivers 163
similarity to user-level servers 164
S
scaling 15
advantages of 15
of applications 15
scatter/gather DMA 71
SCHED_PRIO_LIMIT_ERROR() 39
SCHED_PRIO_LIMIT_SATURATE() 39
sched_yield() 38
SchedGet() 32
SchedSet() 32
scheduling 38, 40, 41, 42, 61, 119
FIFO 40, 41
latency 61
method 41
determining 41
setting 41
round-robin 40, 41
SMP systems 119
sporadic 40, 42
threads 38
seek points 151
segments 157
select() 82, 261
SEM (thread state) 37
sem_destroy() 55
sem_init() 55
sem_post() 55
sem_trywait() 55
sem_wait() 55
semaphores 29, 45, 47, 53, 120
named 53
SMP 120
SEND (thread state) 37
SIGTSTP 86
SIGTTIN 86
SIGTTOU 86
SIGURG 86
SIGUSR1 86
SIGUSR2 86
SIGWAITINFO (thread state) 37
sigwaitinfo() 37, 83
SIGWINCH 86
single-computer model 25
slay 123
sleep() 58
sleepon locks 47, 52
SMP (Symmetric Multiprocessing) 115, 117
snapshot (Power-Safe filesystem) 194
socket() 261
sockets (logical flash drives) 205
software interrupt, See signals
SPAWN_EXPLICIT_CPU 123
SPAWN_SETGROUP 128
SPAWN_SETSIGDEF 128
SPAWN_SETSIGMASK 128
spawn() 123, 126, 127, 128, 137
family of functions 126, 127, 128, 137
memory locks 137
spinlocks 121
SPOF 275
sporadic scheduling 40, 42
STACK (thread state) 37
startup code (startup-*) 57, 118
tickless operation 57
stat() 174
states 36, 37, 136
CONDVAR 36
DEAD 36
INTERRUPT 36
JOIN 36
MUTEX 36
NANOSLEEP 36
NET_REPLY 36
NET_SEND 36
READY 36
RECEIVE 36
REPLY 36
RUNNING 36
SEM 37
SEND 37
SIGSUSPEND 37
SIGWAITINFO 37
STACK 37
STOPPED 37
WAITCTX 37
WAITPAGE 37, 136
WAITTHREAD 37
static linking 154
static partitions 286
STOPPED (thread state) 37
stty 231
subscriber, connection to publisher 217
superblocks 194
superlocking memory 137
symbol names, resolving 160
symbolic links 150
cd command and 150
symbolic prefixes 147
Symmetric Multiprocessing, See SMP
SyncCondvarSignal() 32, 55
SyncCondvarWait() 32, 55
SyncDestroy() 32, 55
synchronization services 47, 55
SyncMutexEvent() 86
SyncMutexLock() 32, 55
SyncMutexUnlock() 32, 55
SyncSemPost() 55
SyncSemWait() 37, 55
SyncTypeCreate() 32, 55
system 22, 56, 292
emergency access to 292
page 56
processes 22
similarity to user-written processes 22
T
tcdropline() 227
tcgetattr() 227
tcgetpgrp() 227
tcinject() 227
TCP/IP 257, 259
resource manager (io-pkt*) 259
stack configurations 257
tcsendbreak() 227
tcsetattr() 227
tcsetpgrp() 227
Technical support 12
terminal emulation 233
textto 201
thread scheduler 286
ThreadCancel() 32
ThreadCreate() 32, 37
ThreadCtl() 32, 86, 123, 137, 139
ThreadDestroy() 32
ThreadDetach() 32
ThreadJoin() 32
threads 29, 31, 33, 35, 38, 40, 41, 42, 45, 46, 50, 86,
119, 125, 171
all share same kernel interface 125
and barriers 50
attributes of 33
cancellation handlers 33
concurrency advantages 46
defined 31
I/O privileges, requesting 86
life cycle 35
migration, reducing 119
names 33
priority 33, 38
priority inversion 40
process must contain one or more 31
register set 33
scheduling 38, 41, 42
FIFO 41
round-robin 41
sporadic 42
threads (continued)
signal mask 33
stack 33
states 35
synchronization 45
tid 33
TLS (thread local storage) 33
tickless operation 57
time_t 56
timeout service 58
timer_create() 58
timer_delete() 58
timer_getoverrun() 58
timer_gettime() 58
timer_settime() 58
TimerAlarm() 58
TimerCreate() 58
TimerDestroy() 58
TimerInfo() 58
timers 29, 57, 58
cyclical mode 58
tickless operation 57
TimerSettime() 58
TimerTimeout() 58
timeslice 41
TLB (translation look-aside buffer) 133, 136
TLS (thread local storage) 34
TraceEvent() 113
transactions 187
transparency of network 25
typed memory 99
Typographical conventions 10
U
UART 65
UDF (Universal Disk Format) filesystem 213
umount 242
uname 27
Universal Disk Format (UDF) filesystem 213
unlink() 105
UNMAP_INIT_OPTIONAL 97
UNMAP_INIT_REQUIRED 97
utime() 174
V
variable page size 136
vfork() 126, 130
virtual addresses 132
virtual consoles 233
W
WAITCTX (thread state) 37
WAITPAGE (thread state) 37, 136
WAITTHREAD (thread state) 37
watchdog 134
wide emitting mode (instrumented kernel) 109
Windows (Microsoft) 215
NTFS (fs-nt.so) 215
Z
zero-copy architecture (io-pkt*) 238