A taxonomy of datatypes
Brian Meek
King's College London
b.meek@hazel.cc.kcl.ac.uk
Introduction
This is the second article based on language-independent standardisation work being carried out by
international standards working group ISO/IEC JTC1/SC22 WG11 (Language Bindings). The first,
Programming languages - towards greater commonality [Meek 1994], was an overview, which briefly described
the various projects and the relationships between them. This article goes into one of these projects in greater detail.
Both articles are based on presentations to the DECUS (UK and Ireland) symposia, this one given in May 1994.
A third article, What is a procedure call?, based on a similar presentation in 1993 [Meek 1993], is projected.
What exactly is a datatype?
It is surprisingly difficult to get people to agree on what constitutes a "datatype". Most have a clear idea of what
a datatype is - an idea, anyway - but as with many things in this field - like what a procedure call is, discussed in
the projected third paper - they are often not the same idea.
Data being ubiquitous, the datatype concept crops up in a wide range of contexts, and different contexts - databases, communications, many different programming languages - have their own culture and conventions.
Such a variety of background is almost certain to lead to different perceptions. The taxonomy described here is
based, though informally and fairly loosely, on that used in Draft International Standard 11404, Language
Independent Datatypes (LID); the "language independent" (LI) indicates the attempt to avoid the
presuppositions usually present in the culture and conventions of any particular language - or indeed any other
use of the concept. Interpretations and details are the author's own, and should not be assumed necessarily to
be present in the standard itself.
In constructing any taxonomy it is necessary to decide not only what a datatype is, but what it is not. Despite
the apparent derivation of the word, it is not simply "a type of data", but a concept in its own right. (This is the
main reason for removing the space in "data type".) A thing that has a datatype may not itself be an
item of data (at least of that datatype; it may be a value of some other datatype). An integer variable is not itself
an integer. If that distinction is not clear (and some programmers find it difficult to see), consider that an X channel (for I/O) in
occam is not an X value - it is something that transports X values and only X values. Something "of datatype X"
has properties (depending on its own nature) that in some way relate to the type of data that X values are. This
is a fine distinction, but it emphatically is not nitpicking: it is crucial to any taxonomy, if it is to have any hope of
encompassing generically the many different views that there are.
The next thing is to abandon any representational view of a datatype. Data is an abstraction, capable of
representation in many forms, and "datatype" is more abstract still; it is disastrous to link the concept to
particular representational forms. This applies as much to second-order representational assumptions (e.g. a
complex value as an ordered pair of real values, or an array as a contiguous block of values) as it does to first-order assumptions such as bit patterns. Such assumptions may be useful, even essential, in particular
contexts, but a generic taxonomy cannot afford such restrictions.
Other things that have to go from the definition of datatype are the operations that can be performed on the data
values. Many languages people are startled by this, even horrified. To them, the values and the operations on
them are inseparable. Since data in a program is of limited use if you cannot perform operations on it, this is an
understandable attitude, but it overlooks two things.
One is that you can distinguish between static and dynamic aspects - programming languages separate static
data declaration and dynamic procedural aspects, implicitly if not explicitly. The point of this is that some
functionality related to languages is purely static as far as the data is concerned. For example, a data
transmission channel operates with the data (dynamic) but not on the data, which is unchanged (if the channel
is working correctly!). Operations such as addition and subtraction are an irrelevance here, and can even be a
nuisance, for example when talking of conformity to a standard.
ACM SIGPLAN Notices, Volume 29, No. 9, September 1994
The other overlooked aspect is that it is not always obvious what the operations should be. Languages do not
always agree, and some with "derived datatype" facilities allow default operations to be suppressed or replaced,
and others added. An example is that, for a new datatype called year derived from Integer, add and multiply are suppressed
(meaningless) and subtraction is replaced; the subtraction is normal but the datatype of the result is not year but
a datatype representing number of years.
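The year example can be sketched in Python, purely as an illustration. The names Year and YearSpan are invented for this sketch; they are not taken from DIS 11404 or from any particular language's derived-datatype facility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YearSpan:
    """A number of years (a duration), distinct from a calendar year."""
    count: int


@dataclass(frozen=True)
class Year:
    """A calendar year derived from Integer.

    Addition and multiplication are deliberately not defined (suppressed).
    """
    value: int

    def __sub__(self, other: "Year") -> YearSpan:
        # Ordinary integer subtraction, but the result is not a Year.
        if not isinstance(other, Year):
            return NotImplemented
        return YearSpan(self.value - other.value)


span = Year(1994) - Year(1990)   # a YearSpan, not a Year
```

Attempting Year(1994) + Year(1) raises TypeError, which is one way of expressing that the operation is meaningless for this datatype.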
I am not saying you could not include operations in a taxonomy, but to do it properly needs the full machinery of
object-orientation. In standards work you need to set yourselves a specific achievable target, not make an
open-ended commitment, so DIS 11404 does not attempt it. A taxonomy based on values alone is infinite, but
manageably infinite; allow operations, and it becomes hard to see how or where you would ever stop, quite
apart from (in a standard) specifying testable conformity requirements.
The basis of the taxonomy
The taxonomy follows the common practice of starting with a number of primitive datatypes and then using
these to construct others. There are three main kinds of constructed datatypes: subtypes, generated datatypes,
and aggregates. (In fact aggregates are technically also generated datatypes, but important enough to deserve
separate classification.)
Primitive datatypes
Primitive datatypes are datatypes whose values are regarded as fundamental - not subject to any reduction. They
just "are". Those values are what some languages have called "atomic", or "plain values". Many primitive
datatypes are also generic, in the sense that they have an unlimited number of values, and hence the datatypes
often used in practice are confined to a finite subset of them. The reason that they are used in the taxonomy,
rather than "actual" achievable datatypes, is threefold: it is a convenient way to identify a class of datatypes
which is infinite in extent; language definitions commonly use them, meaning that they simplify the binding
between the LI datatypes and the ones used by specific languages; and it allows for the possibility of actually
supporting them if a language is designed to do so (in the sense, for example, that an integer of any arbitrary
magnitude can be accommodated, even if in practice at some point you will run out of storage or time).
It is important here to distinguish "a language" from any given implementation of that language. A language
definition will usually specify Integer as a datatype, but leave the actual range of integers supported to the
implementation. The most it is likely to specify is that the range must be of contiguous values and cover at
least a specific minimal range; some do not even go that far. However, for the LI datatype taxonomy, each one
of all the possible subsets of the integer domain is a distinct datatype. If you are specifying a LI service in terms
of LI datatypes, you cannot just bind Integer to integer and leave it at that - you won't always be able to be
certain that a service will work. For some services, a mismatch, or certain kinds of mismatch, will not matter,
but for some (or for some uses of the service) it can matter. Programmers have long experienced such
problems when transferring applications, even from one implementation of a language to another - and even
when both conform to the language standard, if the standard has weak conformity requirements in this area.
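A small Python sketch may make the mismatch concrete. The two ranges below are hypothetical, chosen only to stand for two conforming implementations of the same language's Integer; the function name is likewise invented for illustration.

```python
# Hypothetical ranges: two conforming implementations of one language's Integer.
INTEGER_IMPL_A = range(-2**15, 2**15)   # e.g. a 16-bit implementation
INTEGER_IMPL_B = range(-2**31, 2**31)   # e.g. a 32-bit implementation


def transfers_safely(value: int, *datatypes: range) -> bool:
    """True if the value belongs to every one of the given integer subranges."""
    return all(value in datatype for datatype in datatypes)


small_ok = transfers_safely(1994, INTEGER_IMPL_A, INTEGER_IMPL_B)
large_ok = transfers_safely(100_000, INTEGER_IMPL_A, INTEGER_IMPL_B)
```

Both ranges would be called Integer by their language, yet as sets of values they are distinct datatypes, and only values in their intersection can be exchanged without loss.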
The primitive datatypes of the taxonomy are Boolean, State, Enumerated, Character, Ordinal, Date-and-time, Integer, Rational, Scaled, Real, Complex and Void. This is a much longer list than that which most
languages designate as "primitive", sometimes because they classify datatypes differently, sometimes because
they represent some in terms of others (or assume that the programmer will). A simple example of such a
represented datatype is Complex as an ordered pair of real numbers, the "real" and "imaginary" parts.
However, LID eschews this form of "representation" as well as the more obvious bit-level form. A LID datatype
is just a set of values, and using the cartesian form to identify them is only a convenience for some purposes - think of the polar form, for example. Similarly, Rational is not a directly supported datatype of any well-known
language (though Forth goes some way towards it), but its value-space is distinctive, and here again the
integer-pair relationship runs into trouble, e.g. through multiple representations (2/4, 3/6, 34/68 etc) and special
cases (1/0, etc). It is best treated as primitive in its own right.
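Both points can be seen in a short Python sketch using the standard fractions and cmath modules. It is only an illustration of the value-space argument, not of how LID itself defines these datatypes; the variable names are invented here.

```python
import cmath
from fractions import Fraction

# 2/4, 3/6 and 34/68 are three spellings of one Rational value.
half_spellings = [Fraction(2, 4), Fraction(3, 6), Fraction(34, 68)]
all_equal = len(set(half_spellings)) == 1

# The pair (1, 0) is a valid pair of integers but not a rational number.
try:
    Fraction(1, 0)
    zero_denominator_rejected = False
except ZeroDivisionError:
    zero_denominator_rejected = True

# One Complex value identified two ways: cartesian parts, or polar
# (modulus, phase). Neither identification "is" the value.
z = complex(0.0, 1.0)
modulus, phase = cmath.polar(z)
same_value = cmath.isclose(cmath.rect(modulus, phase), z)
```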
Most of these primitive datatypes are generic, the actual specific, usable datatypes being derived by the use of
parameters or qualifiers. Boolean and Void are exceptions. (An earlier paper [Meek 1990] discusses the
multiplicity of ways two-valued datatypes like Boolean can be - and are - handled.)
Subtypes
In the taxonomy subtypes are created by modifying the value-space of a "base" datatype in various ways - specifying a range or size; selecting values; excluding values; extending the value-space; or defining explicitly
how the value-space is constructed from that of a "base" datatype. Any combination of these is possible too.
These are fairly self-explanatory, even though eyebrows might be raised at the idea of extending, where the
subtype ends up with a wider value space than the base datatype! However, in the taxonomy any datatype can
be used as the base, not just the primitive ones, and in that context extension is a useful subtype constructor,
e.g. you can make a new subtype by extending an existing one.
Generated datatypes
The (non-aggregate) generated datatypes in the taxonomy are Pointer, Procedure, and Choice datatypes.
Such datatypes are produced from other datatypes by the methods familiar from languages that include them,
e.g. in the case of a procedure datatype it is constructed from the datatypes of the procedure parameters and of
the returned result (if any). The primitive datatype Void is useful here for subroutine procedure datatypes that
do not return a value. Void is also used in choices, where not making any choice between alternative "proper"
datatypes is an allowed option.
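As a loose analogy only, Python's type annotations can express both constructions. Here None stands in for Void, which is an assumption of this sketch rather than a binding defined by the standard, and the names BinaryIntegerOp, Reporter and MaybeInteger are invented. Python does not enforce these annotations at run time.

```python
from typing import Callable, List, Optional, get_args

# Procedure datatype built from two Integer parameters and an Integer result.
BinaryIntegerOp = Callable[[int, int], int]

# "Subroutine" procedure datatype: the result datatype is Void (None here).
Reporter = Callable[[str], None]

# Choice between Integer and Void: making no choice is an allowed option.
MaybeInteger = Optional[int]

log: List[str] = []


def add(a: int, b: int) -> int:
    return a + b


def report(message: str) -> None:
    log.append(message)


plus: BinaryIntegerOp = add
tell: Reporter = report
nothing_chosen: MaybeInteger = None
```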
Aggregate datatypes
An aggregate datatype is one whose values are made up of a number, in general more than one, of component
values, each of which is a value of another datatype. In many ways this is the most complex and interesting
subclass of datatypes, simply because of the number of combinations and variations that are possible.
It is important for any taxonomy to get clear what qualifies as an aggregate, what you regard as a component of
it, and what it means to talk about a value of such a datatype. Here, the statement that each value of an
aggregate has "in general" more than one component value does not preclude cases where the "aggregate"
value has only one component, or even none at all. (As far as the LID standard is concerned, languages may
allow such "degenerate" cases in their own right, regard them as equivalent to non-aggregate types, or forbid
them altogether.) Next, a component value may itself be of an aggregate datatype. The difference between
aggregate and non-aggregate components is immaterial when considering the properties of the composite;
while "inside" the aggregate the individual components remain as single entities ("closed boxes"). Similarly, the
aggregate as a whole is regarded as having a single value, a closed box, whose datatype is determined by the
datatypes of its components and the structure of the aggregate.
The approach adopted here is to start with the most general form of aggregate datatype, capable of containing
anything, and then to describe additional properties or constraints used to identify various kinds of aggregate
datatype that are encountered computationally. These properties and constraints are not all mutually
orthogonal; they may interact with others, in various ways. The particular mix of properties used for a given
aggregate datatype will depend on the envisaged computational uses of the datatype and its values. This
taxonomy shows a way of building any of the commonly-found forms of aggregate, and how to construct others
if needed, by appropriate mixing and matching of a relatively small number of properties. The taxonomy is in
fact capable of expressing bizarre conceptual datatypes of no practical interest, including unlimited numbers of
components, values not in practice representable, and so on - again sharing this property with many definitions
of programming languages.
The main kinds of aggregate in the taxonomy are Bag, Set, Record, Sequence, Array, and Table. These will
be introduced as they appear in the discussion of the various properties aggregates can have. They are
technically also datatype generators, though we shall call them datatypes for simplicity - the components will be
of one or more base datatypes from which the aggregate datatype values are constructed.
The most general aggregate datatype
The most general aggregate datatype is one whose values are each made up of any number of component
values (including zero or one component values), every such component value being of any datatype at all.
This kind of aggregate datatype is called a Bag. It is completely unstructured, with no internal relationships at
all. It is of limited practical interest but is useful in the taxonomy since all other aggregate datatypes can be
expressed in terms of constraints and properties applied to it.
Distinguishing components by component values
Since a Bag is completely unstructured, it is not in general possible to distinguish between different component
values of a Bag value. This is because the components may have the same datatype and the same value of
that datatype: a Bag value may contain duplicate components. An example can be found in the simple
probability exercise: a bag contains four white balls and eight black balls; what is the probability that the first
two balls removed from the bag will be of the same colour?
Usually, component values of aggregate values are distinguished by their relationships with the structure of the
aggregate datatype as a whole, and by relationships which exist between the components within the aggregate
as a result of their membership of the aggregate (rather than as values of their own datatype).
A Set, in this taxonomy, is a Bag which has the constraint added that there are no duplicate values, i.e. given
two components, either their two datatypes are different, or they have the same datatype but have two different
values of that datatype.
Homogeneity
Homogeneity is another constraint that can be added, either to Bag or Set. This means that the component
values are drawn solely from one datatype, known as the base datatype. If the base datatype is B, the
datatypes resulting are Bag of B and Set of B respectively.
Between the total generality of unconstrained Bag and Bag of B there would seem to be a limitless number of
intermediate cases where component values may be drawn from a range of different datatypes but not all
possible datatypes. In this taxonomy many such possibilities can be accommodated by using Choice.
Size
Returning to the most general Bag, another constraint that can be placed on values is the size of the aggregate
value, i.e. the allowed number of components. In fact this consists of two constraints, one being the lower limit
on the number of components, which must exist, and the other being the upper limit, which may or may not
exist. If these two are the same, every value of the aggregate datatype has the same size; if they are not, then
any size between the lower and upper limits is possible. The size constraint on aggregate values should not be
confused with the number of different possible such values, which depends on the number of possible
combinations of possible values of the components within their own datatypes.
The lower limit on size must be at least zero, the upper limit at least the same as the lower limit. For the
purposes of the taxonomy, it is assumed that both limits are actually achieved; if no upper limit exists, this
means that values of the aggregate datatype of any arbitrary size are valid. Where no upper limit exists, this
means that the aggregate datatype has an unlimited (infinite) set of possible values. In any practical case of
course, only a finite (though perhaps unspecified) number of actual aggregate values, each of a finite (though
perhaps not predetermined) size, will be used.
More complicated situations are possible where the size of each aggregate value may be any one of two or
more possibilities not expressible in terms of lower and upper bounds of a contiguous range: e.g. the size might
be defined as a multiple of three (3 components, 6 components, ...). Again, the taxonomy can be extended to
cover this by use of Choice.
It is at this point that it becomes evident that there may be interaction between the various constraints. For
example, if Set of B is defined where the base datatype B has a finite number n of distinct values, then because
of the constraint "every component value is distinct" in the definition of Set, n is the largest number of
components that any value of datatype Set of B can possibly have.
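For a small base datatype the interaction can be checked by enumeration. The Python sketch below takes B to be a two-valued datatype (an arbitrary choice for illustration) and lists every value of Set of B; it also shows the distinction drawn earlier between the size of a value and the number of values.

```python
from itertools import combinations

# Base datatype B with n = 2 distinct values.
B = (False, True)

# Every value of Set of B: all duplicate-free selections of components from B.
set_of_b = [
    frozenset(selection)
    for size in range(len(B) + 1)
    for selection in combinations(B, size)
]

largest_size = max(len(value) for value in set_of_b)   # equals n
number_of_values = len(set_of_b)                       # equals 2 ** n
```

No such bound arises for Bag of B, since a Bag value may repeat a component value any number of times.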
In general from now on, additional properties or constraints used to identify various kinds of aggregate datatype
may interact with others, in various ways.
The aggregate datatype and its base datatype
It is very important in this taxonomy to distinguish between properties of the aggregate datatype and those that
the base datatype has in its own right. In general, any property of the base datatype does not induce the same
or a similar property in any aggregate datatype whose values are composed of its values, except perhaps where
it interacts with a property of that aggregate datatype. For example, the finiteness of the base datatype in the
above example does induce a constraint on the upper limit of size of values of the aggregate datatype, but only
through interaction with another constraint on the aggregate datatype. It does not induce a similar constraint for
Bag. Similarly, a base datatype may have a specific ordering, but this does not induce a similar ordering of the
values of the aggregate, or of the components within any aggregate value. In fact, as will be seen, either or
both such orderings at the aggregate level will normally be different from any ordering of the base datatype
values, and indeed can validly exist whether a base datatype ordering is defined or not.
Distinguishing components by tagging
It is possible to distinguish the different components by "tagging" each one; for example, a language may
provide a means of defining names (identifiers) for the various components. The term "tag" is used here simply
to avoid any implication that the distinguishing syntax is necessarily an alphanumeric identifier - it could, for
example, be a numerical label. The essence of tagging is that the tags are not themselves
members of a datatype, and have no significance except as a means of referring to the aggregate components
concerned. In particular, they do not imply any ordering of the components within the aggregate. (For example,
numerical labels when used as tags are not members of an arithmetic datatype, and have no arithmetic
properties.) However, the combination of a tag with a value of the relevant aggregate datatype will of course
have a datatype, that of the component selected. In this sense the tag does have an associated datatype, that
of the aggregate component it tags, but on its own it is not a "value" of anything. Of course, the datatype of any
given tagged component can be a Choice datatype, of any required degree of complexity.
Theoretically, tagging can be partial, i.e. only some of the components are given tags. This taxonomy could be
extended to include it, but for current purposes the complications this would entail are not justified by the
benefits.
Record datatypes
A Record datatype in this taxonomy can be described as a Bag in which each component is tagged. However,
tagging imposes so many constraints on what is in the Bag, that the unstructured nature of the unconstrained
Bag only remains through the absence of homogeneity (see above) and of ordering (see below). In this
taxonomy, tagging determines the number of components (i.e. it fixes the size of the aggregate), and it fixes the
datatypes of them. However, as noted earlier, flexibility of component datatype can be achieved through using
Choice datatypes for them, and similarly Choice of Record can be used to achieve size variability, and other
alternatives such as "variant records" which may be needed whether or not the number of components is
variable.
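A rough Python analogue uses dataclasses for Records and a Union for Choice of Record. The Circle and Rectangle datatypes are hypothetical examples, and the analogy is imperfect: Python keeps the fields of a dataclass in declaration order, whereas the components of a Record in this taxonomy are unordered.

```python
from dataclasses import dataclass, fields
from typing import Union


@dataclass(frozen=True)
class Circle:
    # One tagged component; the tag "radius" is a name, not a value.
    radius: float


@dataclass(frozen=True)
class Rectangle:
    # Two tagged components, each with its datatype fixed by the tagging.
    width: float
    height: float


# Choice of Record: a value is one Record or the other, which gives
# variability of size and of the set of tags (a "variant record").
Shape = Union[Circle, Rectangle]


def component_count(value: Shape) -> int:
    """Number of tagged components of the chosen Record alternative."""
    return len(fields(value))


def tags(value: Shape) -> set:
    """The tags of the chosen alternative, returned as an unordered set."""
    return {field.name for field in fields(value)}
```

For example, component_count(Circle(1.0)) is 1 and component_count(Rectangle(2.0, 3.0)) is 2, although both values belong to the same Choice datatype.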
Ordered datatypes
In Bags, Sets and Records, the components of the aggregate are unordered. That means that, given any pair
of components, it is meaningless to ask whether either comes before (precedes) or after (succeeds) the other.
If ordering is a property which is added to the aggregate, this means that an ordering relationship exists
between the various components. This can be done simply by giving mutual ordering properties between the
components (e.g. suitable operations on them); by giving them a position (in some sense) in the overall
aggregate structure, which induces an ordering; or both. Any precedence relationship between components is
independent of any precedence relationship between the values of those components. For example if C1 and
C2 are components of an aggregate and C1 comes before C2 in the aggregate, then this says nothing about
ordering of the values. If the aggregate is inhomogeneous, C1 and C2 may have different datatypes. If it is
homogeneous, C1 and C2 may be members of a datatype which is unordered. If the datatype of C1 and C2 is
an ordered datatype, the value of C1 need not precede the value of C2 by the ordering rules of that datatype.
Ordering of components within aggregates does not result in an ordering of values of the aggregate datatype.
In this taxonomy, the values of any datatype either are totally ordered, or are unordered. Theoretically
datatypes can exist which are partially ordered. Extending the taxonomy to include such a concept is possible
but would add considerable complication with little apparent benefit. For the purposes of this taxonomy it is
unnecessary to pursue this further.
(Few if any actual programming languages seem to support partially ordered datatypes, and in practical
applications needing it, often it is adequate to use total ordering but then either ignore ordering when not
required, or check that it is meaningful in a particular case.)
In this taxonomy, Sequence is an aggregate with a strict and unique ordering, but whose components in
general are not distinguishable from one another in any other way than by this ordering. By "strict and unique"
is meant that every component (except the first) has one and only one immediate predecessor, and (except the
last) one and only one immediate successor. A Sequence can be homogeneous (Sequence of B) or not, and
its size may be fixed or variable in any of the ways discussed earlier.
Starting from the first component and taking successor after successor until the last component is reached,
each component can be related to the Sequence as a whole, in terms of "distance" from one end of the
Sequence or the other. This does not imply that any given component can be accessed in any other way than
by systematic searching for it through the ordering. There may not even be any direct means of identifying
which is "first", though commonly in languages this can be indicated through the lexical ordering of the definition
of aggregate values. Notionally components may be distinguished (keyed or indexed, see below) by implicit
association with ordinal values first, second, third, ..., but in this taxonomy a Sequence does not have any such
general method of picking out an individual component. Ways may be provided (e.g. special operations on the
aggregate as a whole) of finding (say) the first or the last, but not for all. Operations to find the (immediate)
successor or predecessor are characteristic of Sequence in this taxonomy, and those do apply to all
components. In this taxonomy, therefore, Sequence is distinct from Vector, which has a similar structure but is
indexed, as shown below.
The taxonomy can be extended to include cyclic datatypes (where the choice of "first" component is arbitrary
and the successor of the "last" is the "first"), and cases with more complicated topology (directed graphs or
digraphs). Here too, for current purposes the complications this would entail are not justified by the benefits. If
it is really needed in a particular case, it can be done at the operations level, by suitably modifying those for
obtaining the predecessor or successor of a component.
Recursive aggregates
A special situation arises when a datatype is defined recursively, i.e. where the base datatype of the
components is a Choice which includes, as one of the alternatives, the aggregate datatype itself. (There has to
be a Choice, to enable the recursion to terminate.) While in principle this can apply generally, in this taxonomy
only one case appears explicitly, namely Tree identified as a recursively defined Sequence, i.e. Tree of B is
Sequence of (Choice of (B, Tree of B)). In some programming languages this kind of aggregate is called a list,
but the word "list" is avoided here since it is sometimes taken to mean what in this taxonomy is called a Sequence.
Sequence can be regarded as a "flat" form of Tree, where the recursive property is not used; this is clear from
the definition of Tree. In fact, assuming other constraints (such as size constraints) are the same for both, in
general the values of the datatype Tree of B will include all of the values of the datatype Sequence of B.
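The recursive definition, and the inclusion of Sequence of B within Tree of B, can be sketched with two membership tests in Python. Representing a Sequence as a Python tuple is an assumption of this sketch (so B itself must not be a tuple type), and the function names are invented for illustration.

```python
def is_sequence_of(value, base) -> bool:
    """Sequence of B: every component is a value of the base datatype B."""
    return isinstance(value, tuple) and all(
        isinstance(component, base) for component in value
    )


def is_tree_of(value, base) -> bool:
    """Tree of B = Sequence of (Choice of (B, Tree of B))."""
    return isinstance(value, tuple) and all(
        # The Choice: either a value of B, or (recursively) a Tree of B.
        isinstance(component, base) or is_tree_of(component, base)
        for component in value
    )


flat = ("a", "b", "c")
nested = ("a", ("b", "c"))
```

Here flat passes both tests, while nested is a Tree of characters but not a Sequence of characters. The recursion terminates because the B alternative of the Choice is always available.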
Distinguishing components by keying
Tagging was introduced earlier as a means of identifying each individual component of an aggregate, and it was
noted that in this taxonomy the tags are not values of a datatype, and have no significance except as names for
the components they reference.
Keying is similar to tagging except that keys are values of a datatype; i.e., there is a one-to-one
correspondence between values of the key datatype and the components of the aggregate datatype. The key
datatype may or may not be ordered, and this ordering may or may not be used in relation to the keying; that is,
though the datatype the keys are taken from is an ordered datatype, this property need not be used by the
aggregate datatype being keyed.
Keying may either be internal or external to the aggregate datatype. If the keying is internal, this means that the
values of the keys appear themselves explicitly in the aggregate as components in their own right. In general,
this results in the aggregate datatype having more than one dimension, so further discussion of internal keying
will be deferred until the discussion of dimensionality.
If keying is external, the key values are used more like tags but with the added feature that they have properties
of their own. In particular, if the key datatype is ordered, it induces an equivalent ordering on the components of
the aggregate. In the case of internal keying, any ordering of the key datatype will not in general induce an
ordering of the aggregate components, since that will depend on where the keys appear as components
themselves.
In this taxonomy, a keyed aggregate is homogeneous. The base datatype, as usual, can be made a Choice, for
greater flexibility. As with tagging, keying could be partial, but this taxonomy is not extended to include that for
the same reasons as before.
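External keying resembles a Python dict, which serves here only as a convenient stand-in. The keys and quantities below are made up for illustration. Unlike tags, the keys are values of a datatype (character strings), so any ordering of that datatype is available to the aggregate.

```python
# External keying: keys are values of a datatype, not components of the
# aggregate itself. The aggregate is homogeneous (all components Integer).
stock = {"pear": 3, "apple": 5, "fig": 1}

# Each key value picks out exactly one component.
apples = stock["apple"]

# The key datatype is ordered, so it induces an ordering on the components...
in_key_order = [stock[key] for key in sorted(stock)]

# ...which is unrelated to the ordering of the component values themselves.
in_value_order = sorted(stock.values())
```

Sorting by key gives the components as 5, 1, 3 (apple, fig, pear), whereas the ordering of the base datatype would give 1, 3, 5; the two orderings are independent.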
Distinguishing components by indexing
Indexing is a special case of keying where the values of the key datatype are an uninterrupted range of
successive values from a base datatype from which the keys are drawn. The number and values of the keys
are defined in terms of the lower and upper bounds of the range. Indexing is commonly used in languages
since it can provide a bridge between mathematical concepts such as indexed variables, which are useful in
many applications, and an efficient representational method using contiguous storage locations and simple
means of calculating the storage position of an individual indexed component.
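The "simple means of calculating the storage position" is usually the familiar linear formula, sketched below in Python. The function and parameter names are invented for illustration, and the formula belongs to the representational level, not to the taxonomy itself.

```python
def storage_position(base_address: int, index: int,
                     lower_bound: int, upper_bound: int,
                     component_size: int) -> int:
    """Position of the component with the given index in contiguous storage."""
    # The keys are exactly the uninterrupted range lower_bound..upper_bound.
    if not lower_bound <= index <= upper_bound:
        raise IndexError(f"index {index} outside {lower_bound}..{upper_bound}")
    return base_address + (index - lower_bound) * component_size


# A Vector indexed 3..10 of 4-unit components stored from position 1000:
position = storage_position(1000, 5, lower_bound=3, upper_bound=10,
                            component_size=4)
```

The calculation depends only on the bounds and the component size, which is why a consecutive range of keys allows such an efficient representation where an arbitrary set of keys would not.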
A Vector is an aggregate with one index, where indexing is understood in the sense described here. Each
value of the index identifies a unique component of the aggregate datatype, and the ordering of the index
datatype induces an ordering on those components. A Vector becomes a Sequence if the index datatype
values identifying the components are discarded, but the ordering remains, though to be able to extract
individual components some means other than indexing of identifying individual components will be needed.
Some languages merge, or regard as interchangeable, the concepts of Sequence and Vector. Some treat all
Vectors of the same length as indistinguishable, and some place constraints on the index datatype and/or on
the bounds. Examples of constraints are restricting the index datatype to Integer, and fixing the lower bound.
Some languages allow (or assume) that Vectors (and/or Arrays, see below) are of variable bounds or size. In
this taxonomy, this can be allowed for by suitable use of Choice. Variable Vectors could equally be provided
by extending the taxonomy to allow either or both of the actual bounds of any aggregate value to be one of a
range. It could also be done by "padding" shorter aggregate values with meaningless values for the "missing"
components, or by accompanying each aggregate value by other values identifying the meaningful values, but
in this taxonomy these are regarded as representational matters.
As with tagging and keying, indexing could be partial, but this taxonomy is not extended to include that for the
same reasons as before.
To summarise the differences between tagging, keying and indexing, tags individually identify components but
are not values of a datatype, while keys do the same but are values of a datatype. Keys may or may not be
ordered, and can be provided externally or internally. Indexes are external keys forming a consecutive range
from an ordered datatype. Partial use (some but not all) has already been referred to, and the use of a mixture
(e.g. some components are indexed, the rest tagged), though theoretically possible, is left out of this taxonomy
for similar reasons. However, occasion may arise where hybrids occur, for example aggregates which are both
keyed and tagged, so components can be identified in two different ways. That the same aggregate value can
be used either as a Record or as a Vector is not in itself a problem, but in general it will be necessary to
determine which of the two is required for an external mapping.
Aggregates with more than one dimensio n
The discussion so far has been of aggregates either with no structure or where the structure is essentially
one-dimensional. This is a consequence of the property of ordering. A Tree might be regarded as an exception, but
in this taxonomy it is regarded as one-dimensional even if some components have substructure. Indeed, since
the components which are themselves Trees can be regarded as one-dimensional in the same way, the
substructures can be "unpacked" recursively just as they are built recursively, until no substructures remain, in
which case a Sequence of base datatype components has been produced. This process, sometimes called
flattening, is similar to one that can be used in list processing as a simple form of searching, though in that case
the substructures are "visited" but not removed. This is often displayed lexically by use of parentheses, where
e.g. (a, (b, c), (d, (e, f), (g, h))) is a Tree and (a, b, c, d, e, f, g, h), with the parentheses removed, is the flattened
version. Note that this "unpacking" of the Tree components of the original Tree releases their own components
to the parent aggregate value, but does not destroy the ordering of them.
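
Flattening can be sketched as a short recursive function. The sketch assumes, purely for illustration, that a Tree is represented as a nested Python tuple; the function name flatten is hypothetical.

```python
def flatten(tree):
    """Return the base components of a nested tuple, in their original order."""
    flat = []
    for component in tree:
        if isinstance(component, tuple):
            # A component that is itself a Tree: unpack it recursively,
            # releasing its components to the parent in order.
            flat.extend(flatten(component))
        else:
            flat.append(component)
    return tuple(flat)

tree = ("a", ("b", "c"), ("d", ("e", "f"), ("g", "h")))
flat = flatten(tree)
```

Applied to the Tree in the text, the result is the Sequence (a, b, c, d, e, f, g, h), with the ordering of components unchanged.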
In this taxonomy, an aggregate datatype is multidimensional if more than one piece of information is needed to
identify each component. For example, one might need two indexes, or two keys, or one key and then an
ordering, or one key and one index. Each of these would be two-dimensional. An aggregate needing three
indexes, or two indexes and a key, would be three-dimensional, and so on.
Multidimensionality can either be inherent or induced. Multidimensionality is inherent if the aggregate datatype
is defined directly in terms of using more than one piece of information to identify each component. The most
common example of inherent multidimensionality is the Array datatype. Array (lb1:ub1, lb2:ub2) of B, where B
is the base datatype and lb1, ub1, lb2, and ub2 are values of an index datatype, is a two-dimensional aggregate
datatype, and two index values, one for each dimension, identify each component; this can readily be extended
to any required number of dimensions. The index datatype is the same for all dimensions. Aggregate datatypes
are possible which are defined similarly but do not have this constraint, but in this taxonomy such a datatype
is not called an Array. A Vector is identical to an Array with only one dimension.
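
A minimal sketch of a two-dimensional Array with explicit bounds follows. The class name Array2D, its constructor parameters and the choice of int as the index datatype are assumptions made for illustration, not definitions from the standard.

```python
from itertools import product

class Array2D:
    """Two-dimensional Array (lb1:ub1, lb2:ub2) of some base datatype."""

    def __init__(self, bounds1, bounds2, fill):
        self.bounds = (bounds1, bounds2)        # ((lb1, ub1), (lb2, ub2))
        ranges = [range(lb, ub + 1) for lb, ub in self.bounds]
        # Components are stored against pairs of index values; this
        # representation implies no ordering of the dimensions.
        self._cells = {pair: fill for pair in product(*ranges)}

    def __getitem__(self, pair):
        if pair not in self._cells:
            raise IndexError(pair)
        return self._cells[pair]

    def __setitem__(self, pair, value):
        if pair not in self._cells:
            raise IndexError(pair)
        self._cells[pair] = value

grid = Array2D((1, 2), (0, 2), fill=0)   # 2 x 3 components
grid[2, 1] = 42
```

Each component is identified by exactly two index values, one per dimension, as in grid[2, 1].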
With inherent multidimensionality, there is no ordering of dimensions, and hence no general ordering of
components, though the components along each dimension, when the indexes for the other dimensions are
fixed, are ordered (induced, as usual, by the ordering of the index values). The ordering of definition of the
bounds as shown above is purely lexical, and the use of 1 and 2 etc. in the names used for the bounds is purely
a linguistic convenience without any implied ordering.
Multidimensionality is induced by defining an aggregate datatype as an aggregate of components which are
themselves aggregates, but which are then unpacked, so that their own components are thereafter regarded as
components of the larger datatype. Here, unpacked is used in the sense that, though the outermost structure
boundaries have been removed, the "exposed" components retain any properties they had as part of that
(component) aggregate, and make them available in the overall aggregate. For example, an unpacked Record
will retain its keys, a Vector will retain its indexing, and an ordered aggregate will retain the ordering of the
components that came from it. Various kinds of two-dimensional and higher dimensional "tabular" datatypes
can be produced in this way.
We have already used the concept of unpacking in relation to flattening of a Tree, where it can be noted that the
ordering of components within subtrees is retained on unpacking, hence making the flattened Tree ordered and
hence a Sequence.
Note that the definitions make it possible to produce a multidimensional aggregate with the same components
either by direct definition (inherent) or by unpacking aggregates of aggregates. In this taxonomy, the two are
kept distinct, and in particular properties like ordering may be different between the two. The definition of
inherent multidimensionality does not imply an ordering of dimensions or of components, whereas unpacking of
aggregates of (ordered) aggregates preserves the ordering within the (internal) aggregates. Thus, for example,
a Vector of Vectors is one-dimensional (e.g. [V1, V2]) before unpacking, and two-dimensional and ordered (e.g.
[v11, v12, v21, v22]) after unpacking, whereas the equivalent two-dimensional Array is not (totally) ordered. These
distinctions are maintained in this taxonomy because it is logically possible and because languages exist that
make such distinctions. However, some languages may regard them as equivalent, or even define the
multidimensional aggregates as being built up of successive aggregating and unpacking of one-dimensional
aggregates.
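
Unpacking a Vector of Vectors can be sketched as follows. The function name unpack, the use of Python lists as Vectors and the lower bound of 1 are assumptions made for illustration only.

```python
def unpack(vector_of_vectors, lower=1):
    """Unpack a Vector of Vectors into (outer index, inner index, value) triples."""
    components = []
    for i, inner in enumerate(vector_of_vectors, start=lower):
        for j, value in enumerate(inner, start=lower):
            # Each exposed component keeps its inner index j and gains
            # the outer index i; the traversal order is preserved.
            components.append((i, j, value))
    return components

nested = [["v11", "v12"], ["v21", "v22"]]   # [V1, V2]: one-dimensional
unpacked = unpack(nested)
```

After unpacking, each component needs two index values to identify it, and the components still appear in the order v11, v12, v21, v22.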
The definition of unpacking has further consequences when higher dimensionality than two is involved, with
various intermediate stages. For example, a three-dimensional Array will have corresponding structures of the
kind Vector of Vector of Vector, Vector of two-dimensional Array, and two-dimensional Array of Vector.
Each of these will have its own ordering rules.
Finally, a Table is a special form of multidimensional aggregate with internal keying. It is easiest to visualise in
its two-dimensional form, with one or more columns (say) holding keys to identify particular rows. The
complications possible in building Tables, using similar elaborations to those already discussed, can be left for
the reader to imagine!
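
A two-dimensional Table with one key column might be sketched like this. The column names, the sample rows and the function find_row are all invented for illustration and are not taken from the standard.

```python
# Each row is a Record; the "part_no" column holds the key values, so the
# keys are internal, that is, part of the aggregate value itself.
table = [
    {"part_no": "A17", "name": "valve", "stock": 4},
    {"part_no": "B02", "name": "gasket", "stock": 120},
]

def find_row(table, key_column, key):
    """Return the first row whose key column holds the given key, or None."""
    for row in table:
        if row[key_column] == key:
            return row
    return None

gasket = find_row(table, "part_no", "B02")
```

A component is then identified by a key (selecting the row) together with a tag (selecting the column), as in gasket["stock"].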
Subaggregates
As noted, it is possible to obtain an aggregate of lower dimensionality by fixing one or more components in other
dimensions. For example a two-dimensional Array can yield "row" Vectors and "column" Vectors, depending
on which of the two indices is fixed.
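
Extracting row and column Vectors can be sketched as below, assuming for illustration that a two-dimensional Array is held as a Python dict from index pairs to components. The names array, row and column are hypothetical.

```python
array = {
    (1, 1): "a", (1, 2): "b", (1, 3): "c",
    (2, 1): "d", (2, 2): "e", (2, 3): "f",
}

def row(array, i):
    """Fix the first index; the result is ordered by the second index."""
    keys = sorted((k for k in array if k[0] == i), key=lambda k: k[1])
    return [array[k] for k in keys]

def column(array, j):
    """Fix the second index; the result is ordered by the first index."""
    keys = sorted((k for k in array if k[1] == j), key=lambda k: k[0])
    return [array[k] for k in keys]

second_row = row(array, 2)
third_column = column(array, 3)
```

In both cases the ordering of the resulting Vector is induced by the one index that remains free.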
Other methods of producing subaggregates are selecting subsets or subranges of keys or indexes, which can
apply equally to one-dimensional aggregates. Mostly the various possibilities will be clear from the nature of the
original datatype. Note that the properties of the subaggregate may not necessarily be the same as those of the
original. Clearly a reduction of dimensionality changes that property, but it can change others too, for example a
Vector derived from an Array of higher dimensions will have the induced ordering of the indexing whereas the
original was not (totally) ordered. Properties may also be lost, e.g. a subaggregate produced from a Vector by
selecting specific and non-consecutive values of its index will no longer be a Vector, and the residual index
values effectively form a kind of external keying.
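
The loss of the Vector property can be sketched as follows. The functions select and is_consecutive are hypothetical helpers, and a Python list with an assumed lower bound of 1 stands in for the Vector.

```python
def select(vector, lower, wanted):
    """Pick the components at the given index values.

    The result is a dict keyed by the surviving index values.
    """
    return {i: vector[i - lower] for i in wanted}

def is_consecutive(keys):
    """True if the keys form a consecutive range of integers."""
    keys = sorted(keys)
    return all(b - a == 1 for a, b in zip(keys, keys[1:]))

vector = ["p", "q", "r", "s", "t"]          # index values 1..5
picked = select(vector, 1, [1, 3, 5])       # non-consecutive selection
subrange = select(vector, 1, [2, 3, 4])     # consecutive subrange
```

The keys of picked no longer form a consecutive range, so they behave as external keys rather than as an index, whereas subrange could still be treated as a Vector with bounds 2 and 4.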
The properties that are preserved, changed, gained or lost can be deduced from consideration of the properties
of the original, including index or key datatypes as appropriate, and the method of obtaining the subaggregates.
Derived datatypes and datatype generators
The taxonomy allows for new datatypes to be produced from existing ones (copies, or "clones") and for further
datatypes and datatype generators to be derived from the basic primitive and generated ones - Tree (mentioned
above) is one example, and CharacterString and BitString are others. Non-aggregate derived datatypes
include Bit, Modulo, and TimeInterval. Space precludes detailed discussion of them, but details are in the
standard.
Conclusion
The taxonomy has been introduced here to show the kind of way in which the LID and other LI standards can
distil the essence of commonality of concept that underlies the often confusing and conflicting versions that exist
in programming languages and elsewhere [Meek 1994]. Its utility, and the utility of the forthcoming standard, if
properly exploited, is that it will help to bridge the gulf between the different, incompatible approaches of
languages and systems that have so bedevilled users over the years.
References
ISO/IEC DIS 11404, Language Independent Datatypes, 1994
B.L. Meek, Two-valued datatypes, Sigplan Notices of the ACM, Vol 25 No 8, pp 75-79, August 1990
B.L. Meek, What is a procedure call?, DECUS Symposium 1993, submission to Sigplan Notices of the ACM in
preparation
B.L. Meek, Programming languages - towards greater commonality, Sigplan Notices of the ACM, April 1994
(based on DECUS Symposium 1992 presentation)