13th IEEE International Conference on Engineering of Complex Computer Systems
POSIX and the Verification Grand Challenge: a roadmap
Leo Freitas, Jim Woodcock
Department of Computer Science
University of York, YO10 5DD, UK
Andrew Butterfield
School of Computer Science and Statistics
Trinity College Dublin, Dublin 2, Ireland
E-mail: {leo, jim}@cs.york.ac.uk, Andrew.Butterfield@cs.tcd.ie
Abstract

We present a research roadmap for the second pilot project in the Verified Software Grand Challenge on formally verified POSIX file stores. The work is inspired by the requirements for NASA's forthcoming Mars Rover missions. The roadmap describes an integrated and comprehensive body of work, including current work, as well as further opportunities for collaboration.
1. Introduction
Recent advances in theory and tool support have inspired industrial and academic researchers to join up in an international Grand Challenge (GC) in Verified Software [22]. With tools backed by mature theory, formal methods are becoming more effective, and their use is easier to justify, not just as an academic or legal requirement, but as a business case. That is, despite the initial extra effort, the gains given by formal methods in terms of reliability, accountability, and precision can save money. Also, as tool maturity rises, the level of expertise required from users decreases, helping to de-skill the process and making it cheaper to industrialise.

Work has started with the creation of a Verified Software Repository (VSR) with two principal aims: (i) assembling a collection of verified software components; and (ii) performing industrial-scale verification experiments with theoretical significance and tool-support impact [11]. A research roadmap for the entire challenge is hosted at SRI's Computer Science Laboratory (see qpq.csl.sri.com).

Related work The first VSR pilot project experiment took place during 2006: the successful mechanisation of a sanitised version of the first ITSEC Level 6 high-integrity smart-card banking application: Mondex [49]. In that experiment, seven groups used different theories and tools to mechanise the specification of Mondex's security properties, the protocol that implements these properties, and to verify the correctness of the protocol [52].

In the next section, we discuss the POSIX pilot project. The roadmap structure is given in Section 3. After that, we present the pillar of our work in Section 4. The POSIX standard is discussed in Section 5, and a brief overview of fault tolerance aspects via IBM CICS is given in Section 6. Section 7 describes current efforts to standardise flash memory devices. Some conclusions are drawn in Section 8.
2. What makes a pilot project?

In [34], Joshi & Holzmann suggest a pilot project for the GC. They characterise, motivate, and justify an interesting verification mini-challenge. Their proposal breaks the GC into smaller pilot projects, where each has the following features: (i) it would be of sufficient complexity that traditional methods, such as testing and code reviews, are inadequate to establish its correctness; (ii) it would be of sufficient simplicity that specification, design, and verification could be completed by a dedicated team in a relatively short time, say 2–3 years; and (iii) it would be of sufficient importance that successful completion would have an impact beyond the verification community, in both academia and industry.

At the Menlo Park workshop at SRI [48], the POSIX file-store interface of the Linux kernel [33] was suggested as a candidate pilot project. The suggestion involved a small subset of POSIX suitable for flash hardware with strict fault-tolerance requirements, to be used by forthcoming NASA missions. Due to the nature of the environment this small subset would run in, two important robustness requirements for fault tolerance were later agreed [12]: (i) no corruption in the presence of unexpected power loss; and (ii) recovery from faults specific to flash hardware (e.g., bad blocks, bit corruption, wear-levelling). For recovery from power loss in particular, the file system is required to be reset-reliable in the following sense: if an operation Op is in progress at the time of a power loss, then on reboot, the file system state will be as if Op either has successfully completed or has never started.
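To make reset-reliability precise enough to verify, it can be phrased as a condition on the state observed at reboot. The following is a minimal sketch of our own, assuming a state space S, an operation Op modelled as a relation on S, a pre-crash state s_pre from which Op started, and a recovery function applied at reboot; none of this notation is taken from the cited requirements documents:

    \forall s_{crash} \bullet
      recover(s_{crash}) \in \{ s_{pre} \} \cup \{\, s' \mid (s_{pre}, s') \in Op \,\}

That is, rebooting after a crash mid-Op yields either the state in which Op began or a state that a completed Op could have produced; no intermediate state is ever observable.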
POSIX file system The choice of the POSIX file-system interface is interesting for various reasons: (i) it is a clean, well-defined, and standard interface that has been stable for many years; (ii) the data structures and algorithms required are well understood; (iii) although a small part of an operating system, it is complex enough in terms of reliability guarantees, such as surviving unexpected power loss, concurrent access, or data corruption; and (iv) modern information technology is massively dependent on reliable and secure information availability. All these reasons go beyond the verification community's interest, as well as the intended initial use on forthcoming NASA Mars Rover missions, as developed by the Jet Propulsion Laboratory (JPL).
An initial subset of POSIX file systems [41] has been chosen for the pilot project. There is no support for: (i) file permissions; (ii) hard or symbolic links; or (iii) entities other than traditional files and directories (e.g., no pipes, sockets, etc). Adding support for (i) is not difficult and may be done later, whereas support for (ii) and (iii) is more difficult and might be beyond the scope of the challenge. Existing flash-memory file systems, such as YAFFS2 [55], do not support these features, since they are not usually needed for the kinds of embedded systems that use flash memory.
3. Roadmap structure

We present in this paper a roadmap for the POSIX pilot project, containing an organised set of activities, interests, documents, goals, and achievements. It was compiled in 2007, and represents the result of painstaking digging through articles, standards, books, and other sources.

Over this period, in the various workshops and conference meetings, three important milestones came to life: (i) a good synergy of interest with various researchers from different technical backgrounds; (ii) enough critical mass of results in the form of models, theories, tools, and experiments; and finally, (iii) requests from different sources wishing to participate, but not knowing where to start, or where to look for information. These led us to write this paper. We have tried to organise our achievements, as well as all the work that remains to be done, in the hope that colleagues would go beyond being merely interested and start to get involved in doing the work.

We have divided this paper into five branches, each tackling different aspects of the mini-challenge. After considerable search through documentation, external guidance, and other sources, we decided to follow a guideline for file systems given in an Intel architecture document [29], which allowed us to divide the challenge into three orthogonally layered aspects: (i) file store functionality; (ii) hardware interaction; and (iii) fault tolerance. This architecture follows POSIX conventions and is particularly useful for flash-hardware devices, where Power Loss Recovery (PLR) is guaranteed. Flash hardware is very relevant to spaceflight missions, since the devices have no rotating parts, but the architecture can also be generalised to other types of hardware, as well as other fault-tolerance concerns.

One important benefit of this orthogonality is that it allows different groups applying varied techniques to collaborate while working in parallel. This makes the file store project collaborative, in contrast to the previous Mondex pilot project, which was essentially competitive [16]. Since POSIX is a considerably bigger, more complex problem, this seems the best way forward. At first, we intend to work on a minimal file system that is enough for NASA's purposes, but the ultimate goal is much wider than this: to achieve conformance certification (see Section 5).

In what follows, we present a roadmap following Intel's architecture. Our presentation of the roadmap does not portray the chronological order in which the documentation was found and the work was built up. Instead, it provides a logical and systematic view of our research. For each section, we present the work done so far, what remains to be done, the known participating and interested parties, where the remaining challenges lie, and so on. Our aim in this paper is to distill our findings and the relevant references in a logical and discernible way to inspire others to join us!

4. Intel's architecture
In 2004, Intel released a document containing a set of API's for file systems, layered at various levels of abstraction [29] and particularly targeted at flash hardware. It contains the API signatures (in C) with the main data structures, error codes, expected functionality, control-flow algorithms, contracts among the various layers, and so on. This architecture clearly sets out a reference guide for file system implementation that carefully considers varied aspects, such as file store functionality, operating system interoperability, real-time issues, hardware interfacing, fault tolerance aspects at various layers, and so on. Unfortunately, to our knowledge, a reference implementation for this architecture is not freely available. Yet, based on the given level of detail, this reference API is most likely to have been derived from some implementation.

Our aim is to link the formal models created through the various verification techniques, so that a reference implementation can be constructed correctly through the refinement calculus [39, 2]. We want to have a verified file store that is correct by construction using refinement, from requirements down to code, shaped according to Intel's API's, and taking into consideration both fault tolerance and flash hardware aspects along the way.
These API's are laid out in eight layers, wrapped up at the top level to look and behave like the standard POSIX API. We divided these layers into four categories, where each category has a set of important associated documents: (i) file operations, involving both the code for a file database as well as the actual file data itself; (ii) virtual blocks and their operations, which map physical blocks into logical blocks of various sizes, providing garbage collection and space management; (iii) device-independent and device-abstraction operations; and (iv) device-specific operations. A sketch of this layering is given below.
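To make the layering concrete, the following C sketch shows how these four categories might be exposed as narrow interfaces stacked on top of one another. All type and function names here are our own invention for illustration; the actual signatures are defined in the Intel reference guide [29]:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical status type: 0 for success, negative for failure. */
    typedef int fs_status;

    /* (iv) Device-specific operations: one instance per driver. */
    typedef struct {
        fs_status (*read)(uint32_t blk, uint32_t off, void *buf, size_t n);
        fs_status (*write)(uint32_t blk, uint32_t off, const void *buf, size_t n);
        fs_status (*erase)(uint32_t blk);
    } device_ops;

    /* (iii) Device-independent abstraction: hides which driver is used. */
    typedef struct {
        const device_ops *dev;      /* bound when the volume is mounted */
        uint32_t block_count;
        uint32_t block_size;
    } flash_interface;

    /* (ii) Virtual blocks: logical-to-physical mapping and reclaim. */
    typedef struct {
        flash_interface *fi;
        fs_status (*alloc_unit)(uint32_t *logical_unit);
        fs_status (*reclaim)(void);  /* free-space garbage collection */
    } virtual_blocks;

    /* (i) File operations: the file database plus the file data itself. */
    typedef struct {
        virtual_blocks *vb;
        fs_status (*create)(const char *path);
        fs_status (*read_at)(const char *path, uint64_t pos, void *buf, size_t n);
        fs_status (*write_at)(const char *path, uint64_t pos, const void *buf, size_t n);
    } file_ops;

Each structure points only at the layer directly beneath it, which is what makes the three concerns of functionality, hardware interaction, and fault tolerance separable.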
Fault tolerance is handled by different techniques at different layers, where the property of concern is power loss recovery. It is maintained through different transaction processing schemes. For instance, redundant virtual (and physical) areas are used at lower layers, closer to hardware. In what follows, we present a brief top-down description of each layer, addressing their functionality, their interactions, and the way fault tolerance is added at each level.
Top-level POSIX API layer This layer represents the entry point of the execution flow for any file system operation. It assembles the file system layer API's below into standard POSIX file system API's [41].

As Intel put it [29, p.12], it is a "concise interface that is POSIX-aware." It is used "to implement all file and directory operations per the POSIX standard", where "variances are documented and called out (...)", since "certain operating systems have already implemented" parts of these artefacts. An implementation of the standard POSIX API's containing all available file system functionality is supposed to wrap up the layered API's described below.

Since different operating systems already have their own file system requirements, and their own interfaces for varied resources, an intermediate OS-dependent layer is needed in order to accommodate the general file system core.

Fault tolerance for PLR is added at this level by using a transaction processing mechanism within the file database in the data object layer below, so that multiple operations per file handle are allowed until a commit or rollback takes place. This is much like what was implemented for the IBM CICS transaction processing system [24, 23], which has some mechanised models available [14, 38, 56, 18], and is further discussed below (see Section 6).
File system layer This is a generic interface into the file system that understands the POSIX conventions, and which is used by individual operating systems to communicate with the file system core. The requirements of this layer are pretty much determined by the POSIX standard specification. The layer handles file system operations through twenty available API's, where calls are directed down to the appropriate data object API. That involves: (i) locating static file information, as well as the file data root; (ii) performing the file system operation over the right data object; (iii) updating both static and dynamic file meta-data, in the case of write-like operations; (iv) returning appropriate results and/or error codes; and so on. It also provides file and directory management in order to keep track of open file handles within the file system that are being used by the operating system.
The layer also provides other user facilities, such as: searching, moving, and renaming of files and directories; file access, permission, and cache control; volume management for partitioning and user quotas; multi-threaded access; a custom set of API's (packed within one API) for non-POSIX functionality; and so on. Together with the twenty API's, there are eighty different return/error codes related to varied aspects of the functionality, such as general file system behaviour (21), specific flash file system behaviour (34), or the interaction with the data object layer (25). Seven complex data structures are defined, and they handle: (i) a file allocation table; (ii) volume format; (iii) space usage information, such as fragmentation size granularity and allowance; (iv) various initialisation options, such as the maximum number of open files or path length; (v) types of file; (vi) file and directory information, such as size, attributes, and time-stamp; and (vii) storage/caching for open file information like seek offset, access mode (e.g., read-only, read-write), and share mode (e.g., share-read, share-write). Together with file access and share modes, other execution modes are also defined for: file search, open, and seek; RAM buffer usage; transaction operation status; and non-POSIX I/O control sub-API commands. A sketch of the open-file bookkeeping of item (vii) appears below.
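As an illustration of item (vii), the per-handle bookkeeping could look like the following C sketch. The field names and enumerations are ours, invented for illustration; the Intel document [29] defines the actual seven structures:

    #include <stdint.h>

    /* Hypothetical access and share modes for an open file handle. */
    typedef enum { ACC_READ_ONLY, ACC_READ_WRITE } access_mode;
    typedef enum { SHARE_NONE, SHARE_READ, SHARE_WRITE } share_mode;

    /* Cached state for one open file, kept by the file system layer. */
    typedef struct {
        uint32_t    handle;       /* handle handed out to the OS          */
        uint64_t    seek_offset;  /* current position for read/write      */
        access_mode access;       /* e.g., read-only or read-write        */
        share_mode  share;        /* policy for concurrent opens          */
        uint8_t     in_txn;       /* non-zero while a transaction is open */
    } open_file_info;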
Real-time OS-wrapping and OS-resources translation layers The real-time OS wrapping layer provides an entry point for operating system specific API's, hence it is not described in the Intel document. The description of the layer allows room for OS-specific implementation needs. Similarly, the OS-resources translation layer provides an access point to required resources managed by the operating system, such as mutex semaphores, background threads, hardware interrupts, error codes, and so forth.

Calls to API's in these layers from within the main file system layers ensure that resources acquired at the different layers are allocated and released properly. For example, when a flash array (i.e., the flash-hardware abstraction within the low-level layer) is no longer needed because the volume is unmounted, the device driver will be released; or when an error is returned by the Intel API's, it is translated into the appropriate operating system error code, and control flow is transferred to the right thread of execution.
These OS-wrapping API’s are also responsible for telling
the file system layer if transaction processing is required or not, and that appropriate permission checks have not failed for the requested operation within the operating system's access control policies.
Data objects layer It provides a common structural organisation for different types of data, such as static and dynamic allocation information (e.g., the file access table), file data and directory pages, etc. In this way, it offers an interface with 23 API's for the file system components to uniformly access and manipulate data. Among other functionalities, it also: provides a uniform way of manipulating these various types of data; joins together logical units of data spread across multiple virtual blocks, which is important for handling fragmented data or larger devices; accesses appropriate logical units irrespective of their type; and so forth. These enable the translation of read/write commands on varied types of data into read/write commands on multiple uniform logical units of data.

Power Loss Recovery is added at this level through partial data write schemes. This allows transacted operations on a file, where appropriate PLR steps are taken whilst writing the data, so that all partial data writes are successful. All layers have initialisation and shutdown API's to detect any power loss issues (or indeed other fault tolerance characteristics needed), and to perform the appropriate recovery operations. Not surprisingly, each layer calls the layer below in order to perform lower-level initialisation/shutdown.

To implement the PLR schemes, a set of extra PLR-specific API's is also provided (i.e., 10 of the 23 API's), where failure error codes provide the guarantee that no change is made to persistent data. The sketch below illustrates this failure-atomicity contract.
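The "no change on failure" guarantee amounts to failure atomicity: either the whole write becomes persistent, or the persistent state is untouched. Below is a minimal C sketch of such a contract using a shadow-unit scheme; the helper names are hypothetical, and nothing here is taken from the Intel API's:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helpers, assumed to be provided by the layer below. */
    int alloc_spare_unit(uint32_t *unit);
    int program_unit(uint32_t unit, const void *data, size_t n);
    int release_unit(uint32_t unit);
    int commit_mapping(uint32_t old_unit, uint32_t new_unit);  /* atomic */

    /* Contract: on any non-zero return, persistent data is unchanged. */
    int plr_write_unit(uint32_t unit, const void *data, size_t n)
    {
        uint32_t shadow;
        if (alloc_spare_unit(&shadow) != 0)     /* nothing written yet   */
            return -1;
        if (program_unit(shadow, data, n) != 0) {
            release_unit(shadow);               /* old unit still intact */
            return -1;
        }
        /* Commit point: one atomic table update. A power loss before it
           leaves the old unit current; after it, the new unit is
           current. Either way, no torn state is observable.             */
        return commit_mapping(unit, shadow);
    }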
Basic allocation layer It breaks up physical erase blocks into logical units of equal size, whilst keeping track of individual units and their status. It also keeps track of the used, free, and dirty (invalid/erasable) space within each erase block. In this way, it manages available space within a given volume, which includes the functionality needed for volume formatting, mounting, and unmounting. One well-known file system part within this layer is the file allocation table, which is built in RAM.

It allocates units on request from the data object layer above, one unit at a time (i.e., one execution thread at a time), and is responsible for PLR of each logical unit. It is indifferent to the type of data requested, since the data object layer above performs requests in logical block units.

The way PLR schemes are implemented at this level depends on the nature of the hardware being used. The specific details of how PLR is performed at the logical (virtual) unit level are similar to the way it takes place at the physical (hardware) block level, where a translation layer between logical and physical blocks/volumes is in place.

A series of logical/physical tables are managed and maintained by this layer, which, together with the reclaim layer, is the most complex in the whole architecture design. For instance, for flash hardware, the logical block table keeps track of the various wear-levelling schemes needed to avoid early malfunction of flash hardware. A sketch of a logical block table entry is given below.

Requests are then passed to the flash interface layer below. By doing so, basic allocation can take place regardless of the kind of device being used. During initialisation, the flash interface layer is called again in order to detect (or build) a physical block table. There is also an interaction with the reclaim layer, which is placed alongside the basic allocation layer. Reclaim (or free-space garbage collection) is performed by associating logical block numbers to each physical erase block within the volume. Basic allocation then divides and allocates particular units of equal and predetermined size for the data object layer above.

During reclamation, a backup mechanism is in place in case of power loss; the recovery process restores physical block information that would otherwise be lost. Error detection and correction (EDAC) schemes are also incorporated, so that the file system can detect and recover from corruption to the file system structure. For that, redundant meta-data is physically stored.

This is certainly the most complex of all layers. It is further divided into seven sub-layers, which we have separated into three categories: (i) interfacing with the data objects layer above, and the flash interface below; (ii) logical-to-physical block mapping and management, which is a quite delicate task, as it can severely affect the file system's performance; and (iii) physical block management, which includes compression and EDAC algorithms. Among these sub-layers there are around 15 API's and 9 complex data structures representing the various mappings. Again, failure error codes ensure no change has been made to persistent data.
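To give a feel for the mapping state this layer maintains, here is a C sketch of one logical block table entry. The fields are our own guess at the minimal bookkeeping described above (mapping, wear, and space accounting); the real data structures in [29] are considerably richer:

    #include <stdint.h>

    /* Hypothetical entry in the logical-to-physical block table. */
    typedef struct {
        uint32_t phys_block;   /* current physical erase block         */
        uint32_t erase_count;  /* wear-levelling: erase cycles so far  */
        uint32_t used_units;   /* units holding valid data             */
        uint32_t dirty_units;  /* invalid units: reclaimable space     */
        uint32_t free_units;   /* units never written since last erase */
        uint8_t  rip;          /* Reclaim-In-Progress flag (see below) */
    } logical_block_entry;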
Reclaim layer It is used for garbage collecting free space
from the dirty (invalid) space generated during file system
operations. It copies data from source (dirty) to destination (free) blocks, so that the source block is reclaimed as
free space for future write operations. This includes preemptive reclaim before large file writes. These processes
are completely shielded from power loss. The employed
PLR mechanisms are also further protected with EDAC algorithms. Moreover, it is here that the wear-levelling algorithms used in flash hardware reside.
An erase block is reclaimed based on the amount of dirty space it contains, which depends on the frequency with which the logical block is deleted or updated. As usage varies widely among applications, it is impossible for the basic allocation layer to optimally foresee what is to be reclaimed. Information on free/dirty space is fed back to the basic allocation layer to maintain the logical block information structure.
During reclamation, each physical erase block of a volume is assigned a unique logical block address that is stored
within the block itself, except for the destination block,
which will store data. Then, upon reclamation of a physical block, its logical block address is copied along with
valid data to the destination block as part of the reclamation
process. Thus, the logical address “follows” the data of the
erase block it is being moved from. When the basic allocation layer accesses the reclaim information, it must perform logical-to-physical block address translation. This is
a known bottleneck of the file system, since it is frequently
performed, and should be quite carefully implemented to
achieve optimal performance.
As reclamation is power-loss recoverable, the process of copying valid data to some other free location before erasing data must be carefully performed. A naïve approach would be to use the basic allocation RAM cache. This is a bad idea for two reasons: (i) erase blocks tend to be large, hence requiring much memory; and (ii) in the event of power loss, all data in transit in RAM would be lost. Instead, this layer sets aside one logical block (i.e., the destination block) to be used as temporary storage during the reclaim process. This way, the reclaim destination block is never used to store data, but only for the reclaim process. When a source block is targeted for reclaim, the destination block is erased and contains no data. Next, only valid data is copied to the destination block at the right offsets. After all valid data is copied, the original source block becomes the next spare destination block. Also, a Reclaim-In-Progress (RIP) write-flag is used per block to differentiate between a block being written to and a block being reclaimed. A table of RIP flags is maintained by the basic allocation layer. The sketch below summarises these steps.
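The copy-then-swap discipline just described can be summarised in a few lines of C. This is a sketch under hypothetical helper names of our own, ignoring EDAC and wear-levelling; it is meant only to show why the spare destination block makes the process power-loss recoverable:

    #include <stdint.h>

    /* Hypothetical helpers assumed from the layers described above. */
    void erase_block(uint32_t blk);
    int  unit_is_valid(uint32_t blk, uint32_t unit);
    void copy_unit(uint32_t src, uint32_t dst, uint32_t unit);
    void set_rip_flag(uint32_t blk, int on);
    void swap_mapping(uint32_t src, uint32_t dst);  /* atomic commit */

    #define UNITS_PER_BLOCK 64u  /* illustrative size only */

    /* Reclaim one source block into the spare destination block. */
    void reclaim(uint32_t src, uint32_t dst)
    {
        set_rip_flag(src, 1);    /* on reboot: redo the reclaim, rather
                                    than mistake it for an ordinary write */
        erase_block(dst);        /* destination starts empty              */
        for (uint32_t u = 0; u < UNITS_PER_BLOCK; u++)
            if (unit_is_valid(src, u))
                copy_unit(src, dst, u);  /* valid data, same offsets      */
        swap_mapping(src, dst);  /* dst becomes current; src becomes the
                                    next spare destination block          */
        set_rip_flag(src, 0);
        /* A power loss at any point leaves either the old block intact
           or the fully copied new block mapped in: no data is lost.      */
    }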
For flash hardware, the reclaim module performs another
fundamental task: wear-levelling. As flash physically degrades on writing, the reclaim layer also ensures that adequate wear-levelling of physical blocks is in place in order
to maximise the flash usage life.
Due to its complexity, more information on the use of the eight reclamation API's is also available. These include five specific state-machine-like algorithms on how to select, idle, relocate, erase, and update different source/destination blocks during reclamation, which includes handling the RIP flag. External OS resources are required, such as mutex semaphores and background daemon threads. Semaphores are used to guarantee that only one reclaim per volume can occur at a time, since there is only one spare destination block per volume to perform reclamation; whereas a background thread is kept idle until the semaphore-protected erase region is available for reclamation.
As both basic allocation and the reclamation process seem to be amongst the most sensitive and complex tasks within the file system core, a great degree of care ought to be taken when implementing them. For instance, as they are central to file store performance, one challenging extension would be to have multi-threaded reclamation algorithms, which also consider wear-levelling of the flash with EDAC of data in the process. In a survey of algorithms and data structures for flash memories [21], various options are explained, yet none exploit concurrency. The same is true for patented industrial wear-levelling algorithms (e.g., www.patentstorm.us/patents/6732221-description.html). This is an exciting area for research with direct industry impact and interest.
Flash interface layer It shields upper layers from platform-specific low-level device drivers. Its five API's are quite simple: read, write, erase, initialise, and shutdown (sketched below). They are used to translate all basic allocation and reclaim requests into simple read, write, or erase calls that are platform/hardware specific. They communicate with the appropriate device driver, hence linking the file system with a particular hardware device. With 31 additional execution (result/error) codes, this layer reports back to the file system information on the managed hardware, such as block size and block count. These return codes are then translated back into appropriate file system error codes.
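The five operations make a natural driver table. A C sketch follows, with signatures of our own invention; the Intel document defines the actual ones:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical flash interface layer: the five entry points every
       platform-specific driver must provide.                           */
    typedef struct {
        int (*initialise)(uint32_t *block_count, uint32_t *block_size);
        int (*read)(uint32_t blk, uint32_t off, void *buf, size_t n);
        int (*write)(uint32_t blk, uint32_t off, const void *buf, size_t n);
        int (*erase)(uint32_t blk);
        int (*shutdown)(void);
    } flash_driver;

A concrete flash_driver instance would be bound when a volume is mounted and released when it is unmounted, matching the resource discipline described here.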
For flash hardware, this layer translates file system volumes into flash arrays, which are an abstract view of the
way the physical device is laid out (see Section 7). For file
system operations, it translates the basic allocation layer operations into physical flash device operations, such as page
programming or physical block erasure [26, 5, 6].
Any system resources obtained during these hardwarerelated operations are released. For instance, when a volume is mounted, the corresponding device driver will be
loaded. That also means that initialisation and shutdown of
hardware, which may include hardware PLR and other fault
tolerance schemes, are performed.
Low-level hardware layer This represents the device
driver directly manipulating the hardware. To ensure integrity, this is the only available access point to hardware.
Depending on the platform and device type, the driver behaviour (and implementation) may vary.
Hardware interaction obviously depends on the device. For flash, there is usually the ability to perform read-while-write operations, where specific quality-of-service timing constraints are in place [26, Ch. 4]. For instance, it is possible for some devices to erase a physical block while executing code from another physical block in the same device.

Such operations are performed either from the hardware itself or via software. For hardware read-while-write, disabled interrupts and scheduling ensure proper completion of the command cycles [26, Ch. 7], whereas software read-while-write operations also work with interrupts and scheduling, but use flags indicating the mode the flash array is in to control operation suspension, and poll for hardware interrupt requests.
Other functionalities encompass: creating virtual flash
devices to allow a single physical device to be broken into
multiple virtual devices, and vice-versa (i.e., multiple physical devices into a single virtual device); interleaving flash devices across a data bus for improved data throughput; striping data across multiple (virtual) devices, hence allowing multiple writes to take place concurrently; physical block locking; etc. These features are particularly useful to assemble large amounts of data, and to considerably increase the device's performance.
The nine low-level API's operate over nine general flash-device structured data types directly linked with the kind of hardware being handled, where 26 error codes specific to flash are used to provide detailed information back to the flash interface layer above. For each API, a detailed control-flow algorithm and finite state machine is given.
As Intel is involved in standardising flash hardware, we expect to formally relate the finite state machines from [26, Ch. 7] with the algorithms and control-flow diagrams provided in the architecture document. For instance, one option would be to use process algebras like CSP [43] to analyse these information/control flow diagrams. One success story of such an approach is an automated technique developed in Brazil and used by Motorola, which goes from a requirements table, through state machine diagrams, down to a CSP process that can be model checked [7].
5. POSIX and the Open Group
POSIX is the Portable Operating System Interface: an open operating system standard interface produced by IEEE that is recognised by both ISO and ANSI. It is widely accepted world-wide, with UNIX and Linux being the best-known implementations (see www.unix.org).
A reference implementation can be either POSIX conforming or POSIX compliant. The former means the implementation adheres in full to the published standard, where various optional subsets may also be included, such as threads or real-time extensions. The latter means the standard is partially adhered to, and documentation must be available showing which features are supported. Another term in use is POSIX-aware, which means that although an implementation does not conform (or comply) to POSIX, it uses some known POSIX ideas or API's.
Certification for conformance is granted by accredited and independent certification authorities, and is managed by IEEE (get.posixcertified.ieee.org) and The Open Group. The latter is the "vendor-neutral and technology-neutral" consortium responsible for "developing a range of services that provides strategy, management, innovation, standards, certification and test development". It was through them that we first got in touch with an invaluable source for POSIX [33].
The certification process is quite lengthy, and various stages are involved. There are also different levels of certification. The whole of POSIX is formed by around 1,800 API's with over 4,000 pages of documentation. The process for the current IEEE 1003.1-2003 version involves a quite impressive test-suite (see get.posixcertified.ieee.org/docs/testsuites.html) with around 23,000 test cases, which need to succeed in under 12 hours of execution. Certification can take up to eight months, yet it usually takes two, and is valid for twelve months. At present, the certification process costs between US$5k–18k, depending on various factors, such as being a new certification or a renewal, being a product family or platform-specific certification, and so on.

During the early phase of the standardisation process, the POSIX API requirements [28] (from July 1995) were formally specified in [42] using the Z notation [47] (in August 1995). That is, an abstract specification capturing these requirements in Z was created from the set of informal requirements, to show how they were sometimes ambiguous or contradictory. This work is the model for the top-level POSIX API wrapping-up layer discussed above. This Z specification served as a guideline for the actual version of the POSIX standard (from February 1998), with a revised version published in 2003 [33]. In fact, the Z specification from 1995 [42] was inspired by an earlier, even more abstract formal definition of UNIX file systems given in [40]. We mechanised [42], finding some interesting results. Due to lack of space, we will not discuss this further; some of the results are available in [20].

This was the starting point of our bibliography archaeology work, and it took place at the end of 2006. In the remainder of this section we present our results on this front.
UNIX filing system In [40], Morgan & Sufrin described an abstract specification of a UNIX filing system, given by a few data structures and operations on files and file storage, using the Z notation [47] (the M&S specification). It is divided into three parts comprising: (i) a file storage database with file creation and data manipulation operations; (ii) file descriptors (or channels) used for random access to file data (i.e., a file access table); and (iii) directories and links. It does not completely model POSIX behaviour, however, omitting aspects such as error codes and file permissions.
This was our starting point for the functional requirements for the subset of POSIX that is within JPL’s interest.
As the POSIX standardisation body chose the Z notation,
and we are quite familiar with Z and its tools, we followed
suit and started the work in Z as well. Nevertheless, this is not necessarily a requirement of the project, and other formalisms, such as B [45], could be used as well.
We took the file database and the directories of the M&S specification and mechanised them using the Z/Eves theorem prover [44]. That meant parsing, typechecking, consistency checking (e.g., making sure that partial functions are applied within their domains), adding automation lemmas, and so on. In this process alone, we found (and fixed) some inconsistencies in the original M&S work. This resulted in two MSc theses [19, 30]. For the file database, we proved a data refinement of the filing operations to a Z hash map, using an extended set of forward simulation laws [51, Ch. 16-18], which include refinement of the API interface [8]. Hash maps were considered, as we have evidence that industrial-scale file systems use such a data structure. This resulted in a series of publications [15, 13, 17].
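The heart of such a data refinement is the standard forward (downward) simulation, relating each abstract operation AOp to its concrete counterpart COp through a retrieve relation R. Our rendering of the usual applicability and correctness obligations is sketched below; the extended laws of [51, 8] additionally handle input/output refinement, which we omit:

    \forall a, c \bullet R(a, c) \land \mathrm{pre}\, AOp(a) \Rightarrow \mathrm{pre}\, COp(c)

    \forall a, c, c' \bullet R(a, c) \land \mathrm{pre}\, AOp(a) \land COp(c, c')
        \Rightarrow \exists a' \bullet AOp(a, a') \land R(a', c')

Discharging these two obligations for each filing operation, with the abstract state being the M&S file database and the concrete state the hash map, is what the Z/Eves mechanisation amounts to.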
Since a Z hash map is a result in itself, and since the Verified Software Repository is as much about verified components as verified experiments, we did some extra work on exporting (informally, but systematically) some of the Z hash map invariants and operation pre- and postconditions as JML [3] annotations for Java HashMaps [15]. These annotations can be used to perform further checks on the Java code, such as static analysis with ESC/Java [9], functional correctness verification with JACK [4], loop invariant detection with LOOP [25], and so on. This turned out to be helpful in prototyping a flash file store for Java. The exercise was also useful in foreseeing a possible link between Z and JML, as well as extending the JML mathematical type system with the Z mathematical toolkit.
And there is still plenty of work to do. We would like to refine both the file seeking and directory structure down to code in a similar fashion. There is ongoing work on describing the directory structure part of M&S using Event-B and the RODIN tools (www.event-b.org). We also plan to refine it to a B+-tree specification [10] written in VDM [31], which we think is another quite interesting exercise. This B+-tree specification shows how to go from an abstract VDM B+-tree down to annotated Pascal code using refinement; the data structure itself is sketched below.
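As a reminder of the data structure involved, a C sketch of a B+-tree node follows. The layout is the textbook one with field names of our own; it is not taken from [10] or from any of the formal models:

    #include <stdint.h>

    #define ORDER 8  /* illustrative branching factor */

    /* Textbook B+-tree node: internal nodes route searches through
       keys; leaves hold the records and are chained for in-order
       scans, which is what makes the structure attractive for
       directory listings.                                           */
    typedef struct bptree_node {
        int      is_leaf;
        int      nkeys;
        uint64_t keys[ORDER - 1];
        union {
            struct bptree_node *children[ORDER];     /* internal   */
            struct {
                void               *records[ORDER - 1];
                struct bptree_node *next_leaf;       /* leaf chain */
            } leaf;
        } u;
    } bptree_node;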
Recently, we worked on theoretical results on how to trade theorems among different theories [53], and in this case, different logics: from the three-valued logic of VDM to the semi-classical two-valued logic of Z. With that and some other minor adjustments, we believe it is possible to use Z tools to mechanise this specification and generate code annotations for the B+-tree, in a similar way to what we did for the Z hash map mentioned above, in order to allow code-level verification. This time, we want to generate annotations for C# using the Spec# tools (research.microsoft.com/specsharp). During a recent visit to Microsoft Research (Redmond) to familiarise ourselves with Spec# and its tools, it turned out that such a B+-tree could be of interest for the Vienna Hypervisor: the hardware abstraction layer written in C that Vienna, Microsoft's next operating system, will use for virtualisation.
Other interesting aspects of the problem yet to be tackled are: (i) the introduction of fault tolerance aspects to the file store database; (ii) a reference implementation for the relevant Intel API layers, derived through refinement from the requirements, where formal annotations are added and analysed in the source code; (iii) test-case generation and model-based testing, as advocated in [50]; and so on.
6. IBM CICS and fault tolerance
We mechanised the Z specifications of two modules of
the IBM CICS transaction processing system. They were
the file control [24] and task control [23] API's [14, 38, 56, 18]. They are important because they enabled us to inject fault tolerance for files, as well as the conditions under which API's can be called. There is considerable work to be done in porting the results from IBM CICS into Intel's API's. Due to lack of space, we do not discuss this further.
7. Open NAND Flash Interface (ONFi)
Flash memory devices are often used as the data-storage medium of choice. Of particular interest are file stores based on the relatively recent NAND flash memory technology, which has a recent standard [26]. NAND flash is now very popular in portable devices, such as MP3 players and datakeys. Flash memory is seen as ideal for these purposes as it has good physical handling properties: it is non-volatile, shock-resistant, and capable of operating under a wide range of pressures and temperatures. For spacecraft, it is even more valuable, since it has no moving/rotating parts.
There are two types of flash memory: (i) NOR flash memory, which can be programmed (written) at byte level but must be erased at block level, and which is relatively slow but suits random access; and (ii) NAND flash memory, which has higher speed, but where programming must be done at the page level, making it a sequential access device. The former suits non-volatile core memory, whilst the latter is better suited for implementing file systems.
A key issue in developing these devices is the way the hardware is physically laid out, as well as its available features, which usually vary by manufacturer. This difficulty is then inherited by those designing equipment relying on flash memory, which was a limiting factor in the adoption of the technology.
In order to alleviate this situation, leaders in the flash
manufacturing industry formed a standardisation organisation: the Open NAND Flash Interface consortium (ONFi).
Their aim is to develop a common standard to which most
manufacturers would then adhere. The current ONFi standard is then the obvious choice as our modelling target.
Flash device data modelling A flash memory device is decomposed hierarchically into targets, logical units, blocks, pages, and data units. Each of these is important for modelling different perspectives, as they capture boundaries that define if and when operations can be performed. The hierarchy is sketched below.

The basic data unit in a flash memory is either a byte (8 bits) or a word (16 bits), depending on the type of device. A page is an array of data units, consisting of a main page, plus some spare locations used for EDAC. A page is the basic unit for programming (writing). A block is a collection of pages, and is the smallest unit to which an erase operation can be applied. A logical unit (LUN) is the smallest sub-entity within a device that is capable of operating independently; LUNs are comparable to the logical blocks of Intel's basic allocation layer. A LUN comprises a collection of blocks, along with a status register and at least one page register in RAM. A target, within a device, is the smallest unit that can communicate independently off-chip. It is made of one or more logical units.
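A C rendering of this hierarchy, purely illustrative (all sizes and field names are ours; the ONFi standard [26] parameterises them per device):

    #include <stdint.h>

    #define PAGE_DATA_BYTES  2048u
    #define PAGE_SPARE_BYTES   64u   /* spare area used for EDAC      */
    #define PAGES_PER_BLOCK    64u
    #define BLOCKS_PER_LUN   1024u

    typedef struct {
        uint8_t data[PAGE_DATA_BYTES];  /* page: unit of programming  */
        uint8_t spare[PAGE_SPARE_BYTES];
    } page;

    typedef struct {
        page pages[PAGES_PER_BLOCK];    /* block: unit of erasure     */
    } block;

    typedef struct {
        block   blocks[BLOCKS_PER_LUN]; /* LUN: independent operation */
        uint8_t status_register;
        page    page_register;          /* RAM buffer for programming */
    } lun;

    typedef struct {
        lun     *luns;                  /* target: talks off-chip     */
        uint32_t lun_count;
    } target;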
An erased data unit has all its bits set to logical 1's, and programming a data unit involves changing some of them to 0's. This means that overwriting already written data is typically not possible without an intervening erase operation. Erase operations need to be kept as infrequent as possible, since flash memory physically degrades when erased. The sketch below makes this bit-level behaviour concrete.
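At the bit level, programming can only clear bits, which the following C sketch captures. This is the standard NAND abstraction rather than code from any of the cited models:

    #include <stdint.h>

    #define ERASED 0xFFu  /* an erased byte is all 1's */

    /* Programming can only move bits from 1 to 0: the stored value is
       the bitwise AND of the old contents and the new data, so writing
       over old data generally corrupts it rather than replacing it.   */
    uint8_t program_byte(uint8_t stored, uint8_t data)
    {
        return stored & data;
    }

    /* Only an erase (block-granular on real devices) restores the 1's. */
    uint8_t erase_byte(void)
    {
        return ERASED;
    }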
Such formal models of flash devices capture the relevant
aspects of behaviour (i.e., functional, performance/timing,
reliability, etc), in a way that allows them to be tied into
formal descriptions of the surrounding hardware and associated software operations.
Flash device finite state machines The internal control of flash devices relies (conceptually at least) on a collection of communicating finite state machines, whose interactions support the sequencing and interleaving of operations. These state machines are quite similar to those presented in the low-level layer in Section 4.
Flash device chip-set operations With the data structure
for such devices modelled, we need to provide support for
key operations, such as programming, reading, and erasing,
as well as other fault-tolerance features like PLR. These are
specified as a command set table with 22 operations, where
13 are optional.
This layered modelling of ONFi devices is quite interesting, as consistency checks can be performed at different
levels of abstraction, as shown in recent publications [5, 6].
This result can be further exploited by industry to explore
different operation configurations.
For instance, each operation can be modelled as atomic, so that the flash performs one operation at a time. Still, current flash devices are not atomic: they require the appropriate sequencing of inputs and outputs to complete any operation, as well as waiting for certain tasks (typically data transfer) to complete. This provides an opportunity to optimise performance by interleaving operations and using caching techniques. Most of these are part of the optional operations within the ONFi standard. All these options are described to some level of detail in the ONFi standard, and an obvious modelling goal is the formal description of all three levels, as well as how they relate.

No matter the level, a key issue that arises is that certain operations may fail, with various degrees of observability. The probability of failure is initially very low, but rises over time, as measured by the number of operations performed. This requires wear-levelling algorithms at the hardware level to minimise the failure rate. It also requires us to model failure properly, with particular emphasis on the fact that such failures have a persistent and lasting effect.

Our initial model of ONFi [5, 6] focuses on the first level, viewing operations as atomic. A key concern was to describe formally the state in which devices are shipped, as memory faults will already be present as bad blocks, and there is a scheme in place to mark such blocks.

There is no other published work on the formal modelling of NAND flash devices, to the best of our knowledge, but there has been a considerable body of work on formal models of file systems, and on the technical, usage, and reliability aspects of NAND flash devices. We know that work with the ONFi standard is also being undertaken at the University of Minho (in Portugal), led by José Nuno Oliveira, but no publication is yet available.

A wide range of material has been published regarding the implementation of file systems on NAND flash memory, most of which utilises some form of log-structuring [32, 54, 55, 36]. Of interest to a potential space application are techniques that use NAND flash to implement low-power file caches for mobile devices [37, 35]. A key feature is the need to cope with the accumulation of errors over time, a phenomenon which is well understood [1, 46].
Continued standardisation The ONFi standard is evolving continuously, with version 2 due out in early 2008. The ONFi consortium has another standard [27] that looks at devices capable of supporting, at the hardware level, an access mode based on logical block addresses, which are always 40 bits in length, accessing blocks of 512 bytes, regardless of the underlying block size or number of address bits of the real device (see the sketch below).
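Under that scheme, address arithmetic becomes uniform across devices. A small C sketch of the idea, with names of our own; the Block Abstracted NAND standard [27] defines the actual command formats:

    #include <stdint.h>

    #define BA_BLOCK_BYTES 512u            /* fixed 512-byte blocks    */
    #define BA_LBA_BITS    40u             /* LBAs are always 40 bits  */
    #define BA_LBA_MASK    ((UINT64_C(1) << BA_LBA_BITS) - 1)

    /* Byte offset addressed by a block-abstracted LBA, independent of
       the real device's page/block geometry, which the controller
       hides from the host.                                             */
    uint64_t ba_byte_offset(uint64_t lba)
    {
        return (lba & BA_LBA_MASK) * BA_BLOCK_BYTES;
    }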
A separate committee, entitled Non-Volatile Memory Host Controller Interface (NVMHCI), is another standardisation initiative on flash device drivers, led by Intel, which works alongside ONFi. Its first standard draft should be ready by early 2008. It corresponds to the functionality described in the flash interface layer API.

The University of York has become the first academic member of ONFi, and we are currently discussing NVMHCI membership as well. Membership entitles early
access to non-public draft documents, as well as participation in various moderated discussion lists. Once we better understand the underlying models, we intend to present
our findings to the standardisation committee through these
discussion lists. In a recent Grand Challenge workshop at
ICFEM in Florida (in November 2007), members of ONFi
participated and gave interesting insights on the issues they
face. We also had the chance to show some inconsistencies
in the current public standard and ask advice on how to proceed, as well as clarifying some hidden assumptions within
the standard.
8. Conclusions

In the quest to formally specify a POSIX file store, we divided the work suggested in [34] following the orthogonal architecture provided in [29], which enables separation and later combination of concerns, such as functional requirements, fault-tolerance imperatives, and various hardware devices. This is crucially important in allowing scientists with different interests and backgrounds to collaborate in completing the challenge.

We report on the various references we found, and how they all fit together to make a cohesive and comprehensive body of work. We defined a roadmap reporting achieved goals, summarising important information, and offering suggestions for collaboration as "micro-challenges" within the pilot project. Within the roadmap, we identified opportunities and presented results that contribute to all three branches of the Verified Software Grand Challenge: theories, tools, and experiments.

Our efforts are firstly aimed at a particular user (NASA's JPL), hence we concentrate on an initial small subset of POSIX file store functionality of interest to them. This is possible because we follow Intel's architecture mentioned above, hence we have a modular project development strategy. The results of this work are collected and available on-line at the VSR repository at SourceForge [11].

Collaboration and exploitation In due time, this effort could lead to a formally verified POSIX-compliant file store that is widely used in other mainstream industries. In this process, undoubtedly new tools will be created, mature tools will be improved, theories will be extended, and other experiments will benefit from the results.

Future work Throughout the document we have pointed out various opportunities for collaboration and ambitious "micro-challenges" within the POSIX verification Grand Challenge pilot project. We are working on an extended version of this roadmap, as well as a commented bibliography paper providing a reference manual for the various documents and sources available.
References

[1] S. Aritome et al. Reliability issues of flash memory cells (invited paper). Proc. of the IEEE, 81(5):776–788, May 1993.
[2] R.-J. Back and J. von Wright. Refinement Calculus: A Systematic Introduction. Graduate Texts in Computer Science. Springer-Verlag, 1998.
[3] L. Burdy, Y. Cheon, D. R. Cok, M. D. Ernst, J. R. Kiniry, G. T. Leavens, K. R. M. Leino, and E. Poll. An Overview of JML Tools and Applications. In Eighth International Workshop on Formal Methods for Industrial Critical Systems (FMICS), Electronic Notes in Theoretical Computer Science, pages 73–89. University of Nijmegen, Elsevier, March 2003.
[4] L. Burdy, A. Requet, and J. L. Lanet. Java Applet Correctness: a Developer-Oriented Approach. In Proceedings of Formal Methods Europe, Pisa, number 2805 in Lecture Notes in Computer Science, pages 422–439. Formal Methods Europe, Springer-Verlag, 2003.
[5] A. Butterfield, L. Freitas, and J. Woodcock. Mechanising a Formal Model of Flash Memory. Science of Computer Programming, 2008. Under review.
[6] A. Butterfield and J. Woodcock. Formalising flash memory: first steps. In 12th ICECCS, pages 251–260, Auckland, Jul. 2007. IEEE.
[7] G. Cabral and A. Sampaio. Formal specification generation from requirement documents. In Brazilian Symposium on Formal Methods (SBMF), 2006.
[8] D. Cooper, S. Stepney, and J. Woodcock. Derivation of Z Refinement Proof Rules: Forwards and backwards rules incorporating input/output refinement. Technical Report YCS-2002-347, University of York, 2002.
[9] D. Detlefs, K. R. M. Leino, G. Nelson, and J. B. Saxe. Extended Static Checking. Technical Report 159, COMPAQ Systems Research Center (SRC), www.research.digital.com/SRC/, 1998.
[10] E. Fielding. The specification of abstract mappings and their implementations as B+-trees. Technical Report PRG-18, Oxford University, 1980.
[11] L. Freitas et al. Verified Software Repository @ SourceForge. http://vsr.sourceforge.net/gc6index.html, 2006.
[12] L. Freitas et al. Workshop on the VSR Grand Challenge: POSIX file stores. Dublin, 2006.
[13] L. Freitas, Z. Fu, and J. Woodcock. POSIX file store in Z/Eves: an experiment in the verified software repository. In 12th ICECCS, Auckland, New Zealand, Jul. 2007. IEEE.
[14] L. Freitas, K. Mokos, and J. Woodcock. Verifying the CICS File Control API with Z/Eves: an Experiment in the Verified Software Repository. In 12th ICECCS, pages 290–298, Auckland, New Zealand, Jul. 2007. IEEE.
[15] L. Freitas and J. Woodcock. Proving Theorems about JML Classes. In Formal Methods and Hybrid Real-time Systems, volume 4700 of LNCS, pages 255–279. Springer, 2007.
[16] L. Freitas and J. Woodcock. Mechanising Mondex with Z/Eves. Formal Aspects of Computing Journal, 20(1), January 2008.
[17] L. Freitas, J. Woodcock, and Z. Fu. POSIX file store in Z/Eves: an experiment in the verified software repository. Science of Computer Programming, 2008. Under review.
[18] L. Freitas, J. Woodcock, and Y. Zhang. Verifying the CICS File Control API with Z/Eves: an experiment in the verified software repository. Science of Computer Programming, 2008. Submitted for revision.
[19] Z. Fu. A refinement of the UNIX Filing System using Z/Eves. Master's thesis, University of York, Oct. 2006.
[20] Z. Fu. POSIX 1003.21 Standard in Z/Eves. Master's thesis, University of York, Sep. 2007.
[21] E. Gal and S. Toledo. Algorithms and Data Structures for Flash Memories. ACM Computing Surveys, 37(2):138–163, 2005.
[22] T. Hoare. The verifying compiler: A grand challenge for computing research. Journal of the ACM, 50(1):63–69, 2003.
[23] I. Houston. The CICS Application Programming Interface: Task Control. Technical Report TR12.307, IBM UK, Hursley Park, 1991.
[24] I. S. C. Houston and J. B. Wordsworth. A Z Specification of Part of the CICS File Control API. Technical Report TR12.272, IBM UK, Hursley Park, 1990.
[25] M. Huisman. Reasoning about Java Programs in Higher-Order Logic using PVS and Isabelle. PhD thesis, Universiteit Nijmegen, 2001.
[26] Hynix Semiconductor et al. Open NAND Flash Interface Specification. Technical Report Revision 1.0, ONFI, www.onfi.org, Dec. 2006.
[27] Hynix Semiconductor et al. Open NAND Flash Interface Specification: Block Abstracted NAND. Technical Report Revision 1.0, ONFI, www.onfi.org, 18th July 2007.
[28] IEEE POSIX Working Group. Interface Requirements for Realtime Distributed Systems Communication. Technical Report IEEE P1003.21, IEEE, Jul. 1995.
[29] Intel Flash File System Core Reference Guide, version 1. Technical Report 304436001, Intel Corporation, Oct. 2004.
[30] V. S. Jegannathan. Specification and Refinement of a Naming System in Z, for the UNIX File System. Master's thesis, University of York, Sep. 2007.
[31] C. B. Jones. Systematic Software Development Using VDM. Prentice-Hall, 2nd edition, April 1990.
[32] H.-J. Kim and S.-G. Lee. A new flash memory management for flash storage system. In COMPSAC, page 284. IEEE Computer Society, 1999.
[33] A. Josey, editor. The Single UNIX Specification Version 3. Open Group, 2004. ISBN: 193162447X.
[34] R. Joshi and G. J. Holzmann. A Mini-Challenge: Build a Verifiable Filesystem. In Verified Software: Theories, Tools, Experiments (VSTTE), Zurich, Switzerland, 2005. IFIP Working Conference.
[35] T. Kgil and T. Mudge. FlashCache: a NAND flash memory file cache for low power web servers. In CASES '06: Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pages 103–112, New York, NY, USA, 2006. ACM Press.
[36] S.-H. Lim and K.-H. Park. An efficient NAND flash file system for flash memory storage. IEEE Transactions on Computers, 55(7):906–912, July 2006.
[37] B. Marsh, F. Douglis, and P. Krishnan. Flash memory file caching for mobile computers. In T. N. Mudge and B. D. Shriver, editors, Proceedings of the 27th Annual Hawaii International Conference on System Sciences, Vol. I: Architecture, HICSS'94 (Maui, Hawaii, January 4-7, 1994), volume 1, pages 451–460, Los Alamitos-Washington-Brussels-Tokyo, 1994. IEEE Computer Society Press.
[38] K. Mokos. Specifying the IBM CICS File Control API for the Verified Software Repository. Master's thesis, University of York, Sep. 2006.
[39] C. Morgan. Programming from Specifications. Prentice-Hall, 1994.
[40] C. Morgan and B. Sufrin. Specification of the UNIX Filing System. IEEE Transactions on Software Engineering, 10(2):128–142, 1984.
[41] Open Group Technical Standard. Protocols for Interworking: XNFS, Version 3W. Technical Report C702, The Open Group, Feb. 1998. ISBN: 1859121845.
[42] P. Place. POSIX 1003.21—Real Time Distributed Systems Communication. Technical report, Software Engineering Institute @ Carnegie Mellon University, Aug. 1995.
[43] A. W. Roscoe. The Theory and Practice of Concurrency. International Series in Computer Science. Prentice-Hall, 1997.
[44] M. Saaltink. Z/Eves 2.0 User's Guide. ORA Canada, 1999. TR-99-5493-06a.
[45] S. Schneider. The B-Method—an Introduction. Palgrave, 2002.
[46] A. Sikora, F.-P. Pesl, W. Unger, and U. Paschen. Technologies and reliability of modern embedded flash cells. Microelectronics Reliability, 46(12):1980–2005, 2006.
[47] J. M. Spivey. The Z Notation: A Reference Manual. Prentice-Hall, 1998.
[48] SRI. Workshop on the Verification Grand Challenge. www.csl.sri.com/users/shankar/VGC05, Feb. 2005.
[49] S. Stepney et al. An Electronic Purse: Specification, Refinement, and Proof. PRG 126, Oxford University, Jul. 2000.
[50] M. Utting and B. Legeard. Practical Model-Based Testing: A Tools Approach. Morgan-Kaufmann, 1st edition, 2007.
[51] J. Woodcock and J. Davies. Using Z: Specification, Refinement, and Proof. International Series in Computer Science. Prentice-Hall, 1996.
[52] J. Woodcock and L. Freitas. Z/Eves and the Mondex Electronic Purse. In 3rd ICTAC, volume 4281 of LNCS, pages 15–34. Springer, 2006.
[53] J. Woodcock and L. Freitas. Linking VDM and Z. In Proceedings of 13th ICECCS, Belfast, LNCS. Springer, 2008. Under review.
[54] D. Woodhouse. JFFS: The Journalling Flash File System. Ottawa Linux Symposium, Oct. 2001.
[55] YAFFS Direct Interface (YDI) User's Guide. www.aleph1.co.uk/node/349, Jul. 2006.
[56] Y. Zhang. Specifying the IBM CICS Task Control API for the Verified Software Repository. Master's thesis, University of York, Sep. 2007.