www.ssoar.info

The Need for Standards: Data Modelling and Exchange [1991]
Thaller, Manfred

Published Version / journal article

Provided in cooperation with:
GESIS - Leibniz-Institut für Sozialwissenschaften

Suggested Citation:
Thaller, M. (2017). The Need for Standards: Data Modelling and Exchange [1991]. Historical Social Research, Supplement, 29, 203-220. https://doi.org/10.12759/hsr.suppl.29.2017.203-220

Terms of use:
This document is made available under a CC BY Licence (Attribution). For more information see:
https://creativecommons.org/licenses/by/4.0

This version is citable under:
https://nbn-resolving.org/urn:nbn:de:0168-ssoar-54042-9
The Need for Standards: Data Modelling
and Exchange [1991]
Manfred Thaller ∗
Abstract: »Über die Notwendigkeit von Standards: Datenmodellierung und Datenaustausch«. When discussing standardization and secondary analysis in computer-driven historical research, we should clearly distinguish between different approaches towards the usage of computers. The properties of four of them – statistical analysis, structured data bases, full text systems and annotation systems – are discussed and compared in some detail. Standards which want to convince users that they should follow just one of these approaches will not succeed. What we need is a discussion of the commonalities between the underlying information models and the identification of properties for which clear conceptual models can be devised. Such clear conceptual models are a prerequisite for technical solutions which can ultimately enable the exchange of data across the different approaches. Such conceptual models, therefore, are what we need as standards.
Keywords: Standardization, information modelling, data bases.
1. Introduction1
If in 2086 the Association for History and Computing commissions a volume to
celebrate its first hundred years, the appointed author will undoubtedly begin research on “Part I: The Formative Years” by plundering the funding proposals written by historians from about the late-1950s. There will be discovered with astonishing regularity the same bold claim: “this project needs special attention because it is
important in its own right and because it will make available a huge corpus of historical material of central importance to the discipline”. A contradictory theme will
emerge when the same author gets round to a thorough review of the conference
proceedings and journals prepared by and for computer-literate scholars in history
and other humanities disciplines, say, before 1991. For there will be found the persistent lament that scholars make too little use of that very machine-readable

∗ Reprint of: Manfred Thaller. 1991. The Need for Standards: Data Modelling and Exchange. In: Modelling Historical Data (= Halbgraue Reihe zur Historischen Fachinformatik A 11), ed. Daniel Greenstein, 1-18. St. Katharinen: Scripta Mercaturae.
1 This paper has been submitted to the editor in the author's usual Germlish. In turning it into proper English, the editor of this volume has considerably improved it, not only stylistically, but also in substance.
information whose compilation was justified in part with the promise of secondary
analysis.2 To scholars working with computers in the 1990s, the contradiction is
easily explained. Analyzing someone else’s dataset is difficult. In fact, it is so
difficult that it is often simpler and less costly to create one’s own dataset from
scratch rather than use one prepared by another researcher from the same
underlying documentary sources. This unnecessarily expensive and somewhat
irrational behaviour has not gone unnoticed by computer-literate historians or by
agents of funding bodies. At least both groups have paid lip-service to the idea that
some sort of standard for encoding machine-readable information is a good thing,
and several independent initiatives have recently been launched to define such
standards. Strangely, however, such initiatives have yet to bear fruit. In this paper, I
will argue that attempts to define suitable standards have so far failed to produce
practical results because they have concentrated narrowly on defining simple input
formats or tag sets3.
Computers promise the historian the means of integrating into a scholarly
investigation more information than was hitherto thought possible. Already they
have opened up whole new areas of research and illuminated new aspects of the
past by enabling historians to exploit sources whose bulk and complexity made
them impenetrable to manual analysis. In future, it is likely that the benefits will be extended when each of us can have access to datasets compiled by colleagues conducting research in related areas or with related documentary sources. Census-based urban studies, for example, will benefit from access to machine-readable data
compiled for similar studies of neighbouring townships or simply by using the
occupational coding schemes already implemented in earlier studies and openly
available through online databases. The computer-literate psephologist will log onto
online databases to extract voting data relevant to a particular investigation,
supplementing them only with those data which are necessary to the study but not
yet available in machine-readable form. The historian struggling to interpret the
cryptic abbreviations which are so common in medieval documents will get help
from machine-readable data dictionaries established to decipher the catalogue of
abbreviations found in a computer-assisted study of related sources. But this
promise can only be kept if machine-readable “information” can be exchanged
easily between individuals who may have radically different research aims – historical sources frequently contain information which may be as profitably
2 Concern was already being voiced back in 1962: cf. Jean Claude Gardin, "The Use of Computers in Anthropology", in Dell Hymes (ed.), The Use of Computers in Anthropology, London 1965, 99.
3 During the past 15 years the author has been connected with the development of Κλειω – software which is specifically tailored to the needs of historians. The purpose of this paper, however, is neither to explain nor promote how Κλειω approaches the problems of data modelling and exchange but to show how software which uses rather different approaches to the same problems might fruitfully coexist. At the same time we cannot very well conceal the fact that Κλειω and the Historical Workstation Project attempt to implement in a practical way the rather more abstract and theoretical propositions about data modelling and exchange outlined in these pages.
exploited by statistical analysis as by text searches – and that means ultimately
between different computer processing environments which insist upon differently
structured data. In this respect, the future relies in part on “standardization”.
But it is not sufficient merely to define standard conventions for structuring
machine-readable information. For if we want to create an infrastructure which
allows historians to draw upon a vast array of machine-readable data, we will also
have to convince the individual researcher to adopt whatever standards are devised
so that the fruits of every computer-aided investigation can be contributed to the
ever-growing pool of machine-readable resources. The problem here is that at
present professional historians are pressed to produce results. Those that use
computers in their research will only accept standards whose implementation
doesn’t prolong the research process or constrain the development and use of
specialist techniques. The political historian interested in determining the
composition and inter-connectivity of ruling groups from autobiographies is hardly
likely to abandon the tried and tested method of selecting a finite amount of data
from the autobiographies and entering them into a tabular format appropriate to
databases and statistical packages. At the same time, the political historian
interested in the same autobiographies because they illuminate the political outlook
of ruling groups is hardly likely to enter the text of the autobiographies into a
statistical package or relational database in order to conduct concordance and
collocate analyses. Moreover, since it is rarely possible from the outset of a
research project to predict precisely what “relevant” information will be
encountered in the documentary evidence, and how that information should be
structured and later analyzed, no scholar will feel comfortable with a standard
which seems to infringe upon his or her liberty to create solutions for situations
which have yet to be encountered. A standard should not, in other words, prescribe
to scholars how they should solve a given set of problems. The approach to
standardization proposed by the Text Encoding Initiative (TEI) and developed
elsewhere in this volume is, I feel, prescriptive. It offers stock solutions to a finite
set of problems associated with computer-aided data management and analysis and
instructs the researcher to use these solutions regardless of the nature of his or her
sources and analytical aims. It is also my view that the user of prescriptive
standards, like the consumer of ready-made clothes, sacrifices a good fit for the sake
of convenience.
The alternative approach to standardization which we develop in this paper does
not launch out from a finite set of model solutions and attempt to adapt these
however uncomfortably to different data management and processing problems. It
is rather more consumer-oriented and based on our belief that historians, like the
political historians mentioned above, structure machine-readable data differently
according to their very different analytical aims. A standard must accommodate
each of these very different aims, but at the same time facilitate automatic
conversion of data structured for very different uses. From this perspective,
standardization needs to be descriptive rather than prescriptive. Its proper domain is
not prescribing to scholars how they should solve a given problem but furnishing
scholars with the tools necessary to describe in an unambiguous way the problems
they encountered when creating their datasets and how they chose to solve them.
When such a standard is available, only then will it be possible to create the conversion facilities which will allow interchange of data modelled for use with
very different processing environments.4
There are currently at least four radically different processing environments
which historians use and which rely upon differently structured data. Statistical
packages on the whole rely upon what I shall refer to as an interpretive data
structure where numerically pre-coded data are presented in a tabular format.
Databases rely on structured data where the researcher doesn’t pre-code information
but is highly selective in what information is taken from the source and entered into
database table(s). Full-text systems record the entire text of the historical source(s)
in question and use a comprehensive tagging scheme to indicate where categories of
information begin and end. This data structure I refer to as pre-edited. The fourth
processing environment is that associated with so-called hypermedia systems and
relies on an “image-bound” data structure. By outlining the central properties of the
data structures appropriate to each of these processing environments it is possible to
determine which properties the data structures share in common and to recommend
how conversion between them – the central aim of a consumer-oriented standard –
may be made possible. The essential properties of the data structures outlined above
are perhaps best described by showing how each would organize information drawn
from the same historical source. For this purpose, I have chosen a hypothetical
report about monetary obligations connected to land holdings which the historian
would want to administer in a database. A fragment of the source might read:
At Michaelmas in the fortieth year of Elizabeth, our gracious Queen, Johnny
Turner, who holds the smitthy behind the church, has paid the seven guineas he
owed to his brother Frederick from the will of his late father, and all his property is
now free of any obligations whatsoever.
In a statistical package the data might look like this:
09071598 John                Turner              456 0147 110 7 0

Month:                  columns: 01 - 02
Day:                    columns: 03 - 04
Year:                   columns: 05 - 08
Christian Name:         columns: 10 - 29
Surname:                columns: 30 - 49
Occupation:             columns: 50 - 52
Amount in sh.:          columns: 54 - 57
Payment to:             columns: 59 - 61
  (1xx == Relative
   11x == Brother)
Type of obligation:     columns: 63 - 63
Obligation remaining?   columns: 65 - 65
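To make the column convention above concrete, the following few lines are a purely illustrative sketch (not part of the original paper) of how such a fixed-column record could be decoded; the column positions are those of the layout just given, while the variable names are invented for the example.

# Illustrative sketch only: decoding a fixed-column ("FORTRAN format") record.
# Column positions follow the layout above; field names are invented.
RECORD = "09071598 " + "John".ljust(20) + "Turner".ljust(20) + "456 0147 110 7 0"

LAYOUT = [                        # (name, first column, last column), counted from 1
    ("month", 1, 2), ("day", 3, 4), ("year", 5, 8),
    ("christian_name", 10, 29), ("surname", 30, 49),
    ("occupation", 50, 52), ("amount_sh", 54, 57), ("payment_to", 59, 61),
    ("obligation_type", 63, 63), ("obligation_remaining", 65, 65),
]

def decode(record):
    """Slice a fixed-column record into named, still numerically coded values."""
    return {name: record[first - 1:last].strip() for name, first, last in LAYOUT}

print(decode(RECORD))             # {'month': '09', 'day': '07', 'year': '1598', ...}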
This data structure is very well known but probably not so widely used as it once
was. It is easily identified by the numerical codes which are used to represent
4 The preference for descriptive versus prescriptive standards is developed with respect to the well-known domain of occupational coding by Daniel Greenstein, "Standard, Meta-Standard: A Framework for Coding Occupational Data", Historical Social Research 16 (1991) 1, 3-22. doi: 10.12759/hsr.16.1991.1.3-22.
information whose meaning needs to be interpreted by the historian before it can be
coded and entered into the database. Let us quickly summarize the properties of this
interpretive data structure:
- The physical characteristics of the source are ignored.
- The syntax by which relationships between individual chunks of information are
expressed in natural language is ignored.
- Translating the semantics of the source into more precise categories depends
upon the researcher’s subjective “understanding” of the source.
- Formalized treatment of the source, e.g. by means of statistical analysis, is
extremely easy: the computer is seen as a tool to analyze relationships between
clearly defined categories of information.
- Exploratory treatment of the source, finding out what an ambiguous term might
mean, is well nigh impossible – once you have decided what a term means it is
hard to find out that you have been wrong.
- Exchanging data between similarly interested research projects is extremely
easy: FORTRAN formats are software independent and where common
numerical coding schemes are employed it is unimportant from what sources or
research projects data are derived.
- Exchanging data between differently interested research projects is sensible only
in rare cases. A machine-readable version of tax lists compiled to analyze the
development of social classes will be useless to anyone interested in the tax lists
to illuminate the movement of capital between economic sectors – numerically
coded categories of information will have been devised along different and
incompatible lines.
- Projects which employ this method of data processing are usually completed on
time. With the application of a little common sense the researcher can code and
enter data quickly and estimate very accurately the amount of time that data
entry is likely to take.
- Data entry is error prone due to mistyping and miscoding, and it is hard to find
errors once they have been made. Datasets may therefore be unreliable.
There is no need here to rehearse the well-developed debate about the strengths and
weaknesses associated with this kind of data processing.5 Most historians today
would avoid such an environment and opt instead for some kind of database
management system. While quite a few database systems are available, let’s take as
an example one which relies upon a more or less self-explanatory “descriptor –
descriptum” or “tag – content” logic. We introduce into it an additional convention
which appends as a suffix to interpreted terms the original spelling in appropriate
cases. The interpreted and original terms are only separated by an at-sign (@). The
data shown below are divided into three hierarchically related "entities":
“payment”, “holder”, and “addressee”.
5 See Mark Overton: "Computer Analysis of an Inconsistent Data Source: the Case of Probate Inventories", Journal of Historical Geography 3 (1977), 317-26.
payment$date=7 Sep 1598@Michaelmas 40 Eliz I/sum=7g/
furtherobligation=none/reason=will
holder$surname=Turner/cname=Johnny/occupation=blacksmith@
holds the smitthy behind the church
addressee$cname=Frederick/relationship=brother
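Purely as an illustration (not taken from the original paper), a few lines suffice to read this "tag – content" notation, including the @ convention that separates an interpreted term from the original spelling; the entity lines are exactly those shown above.

# Illustrative sketch only: reading the "tag - content" notation shown above.
# An entity line looks like  entity$tag=value/tag=value/... ;
# a value may carry the original spelling after an at-sign: interpreted@original.
SAMPLE = [
    "payment$date=7 Sep 1598@Michaelmas 40 Eliz I/sum=7g/furtherobligation=none/reason=will",
    "holder$surname=Turner/cname=Johnny/occupation=blacksmith@holds the smitthy behind the church",
    "addressee$cname=Frederick/relationship=brother",
]

def parse_entity(line):
    """Split an entity line into its name and a dict of (interpreted, original) fields."""
    entity, _, body = line.partition("$")
    fields = {}
    for item in filter(None, body.split("/")):
        tag, _, content = item.partition("=")
        interpreted, _, original = content.partition("@")
        fields[tag] = (interpreted, original or None)
    return entity, fields

for line in SAMPLE:
    print(parse_entity(line))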
In this structured environment selected portions of information are taken from the
original source and entered into the appropriate database “field” more or less
unchanged. The main characteristics of this data structure are:
- The physical characteristics of the source are ignored.
- The syntax by which relationships between individual chunks of information are
expressed in natural language is ignored.
- The semantics of the source are translated into more precise categories with the
aid of various software tools: machine-readable code books or look-up tables
which translate terminology found in database fields into numerical codes;
algorithms which normalize variant forms of, for example, chronological
references; a combination of algorithms and look-up tables for normalizing
categories of information, for example, currency references, before they are
numerically coded, etc.
- Formalized treatment of the source, e.g. by means of statistical analysis, is
possible but will depend almost entirely on the extent to which the software
permits the construction of the look-up tables and algorithms mentioned above.
Generally, however, such databases are used as tools which help to derive
formal categories and to prepare them for analysis often in an interpretive
processing environment (statistical package).
- Exploratory treatment of the source, finding out about what an ambiguous term
might mean, is extremely easy and is one of the principal reasons that historians
will use this processing environment.
- The production of meaningful lists and catalogues from the categories of
information stored as fields in the database is also quite simple. But where such
lists comprise several and / or very lengthy fields it may not be so simple to print
them out neatly onto a page. Text-formatting facilities are normally extremely
limited.
- Exchanging data between similarly interested research projects is possible. As a
rule, what goes into one data base can be written out as an ASCII file and read
into another.6 However, where two projects have made slightly different
decisions about what it means to, for example, “transcribe information almost as
it appears in the original”, data exchange can require an astonishing amount of
sophisticated programming.7
- Exchanging data between differently interested research projects is no more
difficult than between similarly interested ones, but it is not easier either.
6 Cf. in a wider context also Leen Breure, "Historische databasesystemen", in Onno Boonstra et al.: Historische Informatiekunde, Hilversum: Verloren 1990, 123.
7 Cf. Rainer Metz: "Global Data Banks: a Wise Choice or Foolish Mistake", Historical Social Research 16 (1991) 1, 98-102. doi: 10.12759/hsr.16.1991.1.98-102.
- There is always the danger that an unreasonably large amount of information
will be recorded in a maze of machine-readable database tables. Projects
therefore require far more attention to careful project management and pilot
study than those which use interpretive data structures.
- The reliability of the data is very high, which is why this processing environment
was invented in the 1970s in the first place.
Full text processing environments require a pre-edited data structure which, incidentally, is the data structure with which SGML is most commonly associated and which provides the main focus of the TEI. The same pre-edited approach was
employed extensively in the 1970s by Alan Macfarlane in the “Earls Colne Project”
– an attempt to create a machine-readable transcript of the surviving pre-1870
documents from one English village. With each document Macfarlane hoped to
transcribe the whole text, adding additional markup to delimit specific categories of
information.8 Unfortunately Macfarlane’s enormous undertaking did not produce
easily-accessible documentation, thus rendering the surviving database more or less
unapproachable. The software supporting the project was way ahead of its time and
the tagging conventions actually incorporated most of those now supported by
SGML, plus a few (notably those dealing with backward references) which are at
best very difficult to express in SGML. Entered in the pre-edited form, our
example data might look as follows:
(payment (date At (feast Michaelmas feast) in the (year fortieth year) year of
(ruler Elizabeth ruler) date), our gracious Queen, (holder (cname Johnny
cname) (surname Turner surname), who (status holds the smitthy behind the
church status) holder), has paid the (amount seven guineas amount) he owed
to his (addressee (relationship brother relationship) (cname Frederick cname)
addressee) from the (reason will of his late father reason), and all his property
is now (further free of any obligations further) whatsoever. payment)
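How such bracketed markup can be turned into a nested structure may be suggested by the following sketch; it assumes only the "(tag ... tag)" convention visible above and is in no way the software used in the Earls Colne Project, nor an SGML parser.

# Illustrative sketch only: turning the "(tag ... tag)" markup above into a tree.
import re

def parse(text):
    root = {"tag": None, "text": [], "children": []}
    stack = [root]
    for token in re.findall(r"\(\w+|\w+\)|[^\s()]+", text):
        if token.startswith("("):                        # e.g. "(payment" opens a tag
            node = {"tag": token[1:], "text": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif token.endswith(")") and token[:-1] == stack[-1]["tag"]:
            stack.pop()                                  # e.g. "payment)" closes it
        else:
            stack[-1]["text"].append(token)              # ordinary running text
    return root

SAMPLE = ("(payment (date At (feast Michaelmas feast) in the (year fortieth year) "
          "year of (ruler Elizabeth ruler) date) ... (amount seven guineas amount) ... payment)")

tree = parse(SAMPLE)
payment = tree["children"][0]
print([child["tag"] for child in payment["children"]])   # ['date', 'amount']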
The properties of this data structure may be characterized as follows:
- The physical characteristics of the source are ignored.
- The syntax by which relationships between individual chunks of information are expressed in natural language is explicitly used to interpret that information. The amount of explicit markup required in addition to that already present in the form of normal writing conventions depends on the quality of the parsing software being used. The practicality of this approach is also largely dependent on the quality of the parsing software and on the availability of tools which aid in the insertion of complex markup schemes.
- The semantics of the source are translated into more precise categories with machine-readable codebooks and algorithms not unlike those used by structured processing environments.
8 Alan Macfarlane: Reconstructing Historical Communities, Cambridge 1977; Donald E. Ginter et al., "A Review of Optimal Input Methods: Fixed Field, Free Field and the Edited Text", Historical Methods Newsletter 10 (1977), 166-76; Timothy J. King, "The Use of Computers for Storing Records in Historical Research", Historical Methods 14 (1981), 59-64.
- Formalized treatment of the source, e.g. by means of statistical analysis, is
possible but contingent on the quality of the codebooks and algorithms
mentioned above. Generally the pre-edited data structure is best suited to textual
analyses.
- Exploratory treatment of the source, finding out about what an ambiguous term
might mean, is extremely easy.
- Formatting the machine-readable text or any part thereof for printed production
is also quite simple, and this processing environment is frequently used by
projects for which printed output is of primary importance. But as many
researchers have discovered, transforming a dataset which is appropriate for
printed production and textual analysis to one appropriate to more formalized
statistical analysis requires more work after data entry has been completed than
many are prepared to accept.
- Exchanging data between projects is possible but requires sophisticated
software. Predicting the compatibility of the tagging conventions used in
different datasets can be an extremely tricky business. However, with the pre-edited environment a project's research aims have no bearing on the possibilities for data exchange since, in any event, datasets comprise complete machine-readable transcriptions of their underlying sources.
- Researchers who use the pre-edited data structure often have a very clear picture
of how their printed output should look but are less clear in their analytical aims.
Because decisions about analytical aims have far-reaching ramifications for how
text is tagged, project management and pilot investigation take on the greatest
importance.
While the strengths and weaknesses of the three data structures outlined above are fairly well understood, those associated with the image-bound data structures used in HyperCard, HyperMedia, HyperTalk ... in other words, the Hyper World, are far less so. Unfortunately, an illustration of how our sample data would look when prepared for Hyper use is not easily provided in print; it can only be described in the following general terms:
a) The document itself is contained in the data processing system in
scanned, that is bitmapped, form.
b) The categories of information delimited for analytical purposes in the
other data structures discussed above could be represented and a link
established between the delimited categories and the appropriate section
of the bitmapped image from which they were derived.
c) It is possible to move from any of the delimited categories (attributes,
fields, or whatever your preferred terminology) to the original graphical
representation of that part of the source from which the categories were
derived and to do this right on the same screen.
d) It is also possible in theory (though much rarer in practice) to look at a
portion of the scanned document and go from there to the interpretation of
that specific portion as represented in the delimited category (attributes,
fields, ...) which was derived from it.
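Although such a dataset cannot be reproduced on paper, its logical core can be hinted at in a few lines. The sketch below is an invented illustration (file names, coordinates and field names are all hypothetical, and no particular Hyper product is implied): each delimited category simply carries a pointer back to the region of the bitmapped page from which it was transcribed.

# Illustrative sketch only: an "image-bound" record whose interpreted fields
# point back to regions of the scanned page they were derived from.
from dataclasses import dataclass

@dataclass
class Region:
    page_image: str        # file name of the bitmapped page (hypothetical)
    x: int                 # top-left corner of the region, in pixels
    y: int
    width: int
    height: int

@dataclass
class BoundField:
    name: str              # e.g. "surname"
    value: str             # the interpreted reading, e.g. "Turner"
    source: Region         # where on the image the reading comes from

record = [
    BoundField("cname", "Johnny", Region("roll_1598_p17.tif", 310, 122, 180, 40)),
    BoundField("surname", "Turner", Region("roll_1598_p17.tif", 500, 122, 210, 40)),
]

def regions_for(fields, name):
    """Navigate from a delimited category back to its graphical representation."""
    return [f.source for f in fields if f.name == name]

print(regions_for(record, "surname")[0].page_image)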
Ignoring the messianic statements that the Hyper crowd, like the proponents of any
new technological innovation, are prone to make, I would propose the following as
the central properties of the bound image data structure.
- The physical characteristics of the source are maintained.
- The syntax and semantics of the sources are maintained but in varying degrees.
Indeed, bound image systems may simply situate a special data type (one called
image) on top of structured or pre-edited data. The resultant dataset
spectacularly changes the data’s visual appearance but adds few other processing
capabilities.
- Formalized treatment of the source e.g., by means of statistical analysis, is
usually not central to research projects which use a bound-image data structure.
Indeed the visual spectacle of the bound-image environment often obscures its
analytical potential from the researcher’s view. Hopefully this is a transient
phenomenon. Logically, the tools for formalizing information that is bound to an
image are no more difficult to realize than those associated with data structures
which haven’t any such links.
- Exploratory treatment of the source, browsing through the source (frequently up
to the point of getting lost in it), is what such systems are principally used for.
- Data exchange between projects, whatever their research interests, takes on an
entirely different set of problems from those hitherto encountered. Exchanging images is especially difficult because it may be dependent on the hardware and the software used by the different projects. The main problem here, however, seems to be that the necessary expertise is relatively rare. In principle, standards
for image files – like TIFF – are much more oriented toward interchange than
standards for character representation were at a more mature stage in their
development. More problematical is the fact that there seems to be virtually no
notion that image-bound systems may have some analytical interests.
Consequently, techniques for refining HyperCard stacks into statistical cases
remain undeveloped.
These four data structures and their associated processing environments exist with
equal right. It would be extremely hard to convince an econometrician to prepare an
early-twentieth-century list of production statistics using the pre-edited
environment exploited by Macfarlane. On the other hand, if you have just spent a
few million dollars to convert into a bound image dataset an entire archival
collection of documents relating to early Spanish colonial expansion, you might
manage to remain civil when told that standardization entails describing the
categories you used in a specific “variable” of your data set.9
9 The reference here is to a Spanish project to convert the entire contents of the Archivos de las Indias into a bound-image type database. While no recent published descriptions of the project are known to the author, information from Spain suggests that the material will almost certainly be scanned by 1992, in time for the anniversary celebrations of Columbus's discovery of the Americas. Description of the material, the process of actually binding images to structured interpretations of them, will, however, continue for quite a few years yet.
Now, admittedly, relationships between the letters of Christofero Columbus and
Polish steel production in the 1930s are few and far between. But relationships
between the kinds of data being structured differently are anything but artificial. Those
between the mandates of the Castilian crown and the trade between Sevilla and the
Americas in the 16th and the first half of the 17th century are much more apparent.
And, while the Chaunus published their tables as bound volumes10 in the 1950s,
their print quality just below what OCR can handle today, these volumes might quite easily be read after the next advance of OCR technology, having sufficient structural markup in the printed layout to facilitate parsing them into a database
without the addition of much explicit markup. Indeed, while most bound image
projects are currently dedicated to browsing through data rather than to analyzing it
statistically, some have the potential to express in numerical terms things like the
relative placement of various motifs on images thereby opening up whole new
possibilities of formalized analysis. And while we have argued the extremes to
emphasize the ultimate unity of the field, the possibilities of mutual reference
between datasets prepared in a structured way and those prepared by pre-editing
markup are probably much more obvious than the ones between interpretive
numerical datasets and bound images.
All this leads me to suggest that there are two kinds of standards, both of which
are necessary. For the adherent of a specific type of computer usage,
standardization will entail the development of generally agreed rules for that type of
computer usage. This approach to standardization might produce rules governing
how machine-readable information should be structured in statistical, relational or
full-text databases. Such standards would be extremely appealing to the technical
experts who want to define how to proceed in their own specialist areas. But when
you look at the technical landscape as a “consumer” you might decide that each of
the processing environments may at one time or other be appropriate. This is
precisely the case for the consumer who also happens to be an historian. What the
polymath consumer wants from a standard is not, or not only, some instruction
about how to become agreeable to one of the main communities of computer users.
Rather, the polymath interested in our example data, for instance, would want to (a)
find out whether blacksmiths paid off their inherited obligations more quickly than
other occupational groups, (b) get from court records an alphabetical list
(orthographical fancy and all) of all people living in a particular village, (c) analyze
whether any change occurred in the literary context surrounding references to
royalty,11 (d) place a sensible caption under a scanned drawing of a village
smith’s shop which will be the opening screen in an interactive teaching unit, and
10 Pierre et Huguette Chaunu, Séville et l'Atlantique, vols. 1-8 (Paris, 1955-1960).
11 It is significant that textual analyses have recently received the most methodological attention. Virtually the only recent methodological contribution to the Journal of Interdisciplinary History has been the paper by Mark Olsen and Louis-Georges Harvey, "Computers in Intellectual History: Lexical Statistics and the Analysis of Political Discourse", 18 (1988), 449-64.
(e) do whatever seems appropriate with the computer currently being invented in
some laboratory or garage when that computer becomes available.12
There is yet another reason why standards should attempt to unify ongoing
work in different processing environments by facilitating data exchange between
them rather than to emphasize the differences between user communities engaged
with distinctive computer methods. In the corpus of methodological articles
published by computer-using historians since the late 1960s it is difficult to
distinguish very much in the way of consolidated methodological advance. This is
probably because many such articles fail to generalize from descriptive accounts of
“how I used software LMN during my summer vacation” to address the wider
intellectual and methodological implications of their work. By the early 1980s the
field of history and computing was consequently littered with interesting accounts
of different projects but still hadn’t any rigorous intellectual or theoretical
underpinnings. Lacking any definitional boundaries or intellectual rigour, the field
of history and computing was flooded after the introduction of microcomputers with a community of computer users who simply rediscovered for themselves the principles and problems already documented by the late 1960s but never formalized
into an accessible, instructive and coherent literature. The effects were devastating.
No recent number of Historical Methods, the pioneering journal which contained
the most important papers on computer usage in the 1970s and early 1980s,
contained any discussion of the wider implications of HyperText and other
nonlinear systems. The journal has instead more or less abandoned discussion of
computational techniques for statistical methods and other interdisciplinary matters.
Where very occasionally computational issues are addressed, they are addressed in
a manner which assumes that readers are computer novices. These developments
have not happened by chance. Rather, they were made explicit in an editorial in a 1985 number of the journal. There Daniel Scott Smith stated that the introduction of microcomputers and of undergraduate teaching in "computer literacy" would be such important events that Historical Methods would actually modify its editorial
policy and nominate a special editor (Janice L. Reiff) exclusively responsible for
those subjects. By 1987 the journal’s new emphasis was baldly apparent in two
papers on text processing and computer-aided instruction.13 Both papers
undoubtedly have their value but only further detract from any attempt to consolidate or advance methodological discussion. The journal's "emphasis on
microcomputers and teaching” culminated in a 1988 special issue entitled Special
Issue: History, Microcomputers, and Teaching which, after two years, still could not
say very much more about this new and important development than: “The articles
that follow provide a variety of examples of ways in which microcomputers have
been used to do such teaching. It is hoped that future issues of Historical Methods
will contain articles that demonstrate useful and imaginative ways to use these
12 It is extremely important to emphasize that the experience of the last two decades has taught us that any "standard" which is tied to a specific level of technological development is bound to fail.
13 Richard Jensen, "The Hand Writing on the Screen", HM 20 (1987), 35-45; Richard C. Rohrs, "Sources and Strategies for Computer Aided Instruction", HM 20 (1987), 79-83.
machines in both teaching and research.” These resounding words were followed by
the almost complete disappearance of computer-related articles from Historical
Methods. We describe this phenomenon at length to emphasize that when two ways
of using a computer are not understood as being very closely related parts of a
larger field, they have a tendency to grow apart as very distinctive sub-disciplines
and to fragment the discipline. "Standards" for data processing in history, which do
not strive for cross-fertilization between the “subcultures” of computer usage –
whether these are defined by their orientation toward teaching or research, or by
their respective processing environments – "standards" which are directed at a specific part of the wider user community – will be harmful, not productive.
So this second “consumer-oriented” standard is not only a prerequisite for data
exchange but essential if the unity of the discipline is to be preserved. But is it
practicable or just a utopian pipe dream? The difficulties in developing a consumer-oriented standard should not be underestimated. Not the least of these is that technical
experts engaged with one type of data processing tend to think of all other types
simply as degenerate cases which are in principle already catered for in their own
favoured environments. One assumes, however, that while narrow technical
standards may be easier to define, the second variety is the only one which will ever
be accepted by researchers themselves. A second obstacle is that a utopian
consumer standard needs to be built upon a conceptual understanding of the kinds of
data found in and used by specialists within a given academic discipline, in this
case, in history. The specialist’s prescriptive standard is derived from a conceptual
understanding of specific computer-related techniques. The problem is that whereas
the conceptual understanding of specific computer models is already well
developed, historians haven’t yet developed a formal understanding of the datatypes
that they so frequently encounter in their computer-aided research. To flesh out this
abstract distinction between the two different kinds of standards, we will look more
closely at what happens when you try to turn the contents of statistical “variables”
into database "fields" and then into tagged information as might be found in a pre-edited full text system.
In the case of statistical datasets containing interpretive numeric codes, variables
are atomic. That is, between the variables referencing the day, month and year of a
date, for example, there is just the same relationship as that which exists between
the year of birth and occupation of the person described in any one record. So if you
want to copy a “date”, you would have to formulate three independent copy
operations, e.g. COMPUTE commands in SPSS. In most systems using structured
data – typically data base systems – you would use a specific data type “temporal
information”. So by using a “date” with any command for copying or any other
manipulation, you would always use one command to address the three logical
components as one entity even though a date still has a day, a month and a year
component. Practically this could mean, that, going from a statistical package to a
data base, you would have to write an output procedure which turns the content of
three variables into a string, the parts of which are delineated by a separating
character and vice versa. Similarly in most data base systems, “note” or “comment”
fields are just atomic fields which have no implicit technical relationship to any
other field, except that each might happen to be one of several fields which
characterize a particular entity. For example, assume the itemized records in a tax
list show only the amount of taxes due for those individuals who have paid up in
full, and that in a few cases there is a scribbled note which indicates that only a
specific part of the total amount had been paid. In most data base systems the two
amounts of money would be represented in two fields (amount owed and amount
paid, respectively). The relationship between these two fields would be no different
than the relationship between the amount paid and, for example, the Christian name
of the taxpayer. In systems using pre-edited markup, however, you could simply put
both amounts of money into one bracket, indicating that both amounts form the "tax", being subdivided further into "amount owed" and "amount paid", thereby implying that when copying information on "tax" both amounts need to be
taken care of. We shall not continue with bound images in this respect, though they
obviously rely upon additional relationships between individual chunks of data.
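The copying problem for dates can be made concrete in a few lines. The sketch below is illustrative only (the separator and function names are invented): in an interpretive dataset a date is three unrelated variables, so moving it into a single database field means concatenating them with a separating character, and splitting them again on the way back.

# Illustrative sketch only: three atomic variables <-> one delimited database field.
def variables_to_field(day, month, year, sep="."):
    return sep.join(str(part) for part in (day, month, year))

def field_to_variables(field, sep="."):
    day, month, year = (int(part) for part in field.split(sep))
    return day, month, year

assert variables_to_field(7, 9, 1598) == "7.9.1598"
assert field_to_variables("7.9.1598") == (7, 9, 1598)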
What should be clear from these examples is why it is so difficult to devise
standards which do not explicitly take into account the fact that very different
processing environments require very differently structured data. For the statistical
package the notion of a datatype is almost meaningless – or at least trivial. For a
structured database package which can handle the temporal information "Michaelmas 40 Eliz I" or something similar, documenting the precise rules used in normalizing such dates for analytical purposes is positively central to exchange. But where does that leave us? When statistical software does not have a data type
“temporal information”, it does not. So how are we to exchange data between
systems which “know” about temporal information, for example, and those which
do not? With such profoundly different processing environments does the whole
notion of exchange rapidly recede into the realm of fantasy? In our opinion this is
certainly not the case. A proposal for a descriptive standard which starts from the
problems that the consumer wants to tackle and not from a specific technical
solution being propagated would focus on the problem of temporal information as
follows:14
1) There exists the phenomenon “temporal information”. Temporal information
can (in a simplified way) be described by an arbitrary subset of the following
conceptual variables:
- Day of the month.
- Reference point within a notational system (e.g. “Michaelmas”).
- Systematic offset within a calendar system (e.g. “Tuesday after x”).
- Absolute offset from a reference point (e.g. “the day before”).
- Era within a calendar system (e.g. the reign of Elizabeth the first).
- Name of the month.
- Year within a calendar.
- Applicable calendar (e.g. the Islamic one).
- Offset within an era (e.g. year of reign).
14 See the more detailed proposal by the author in "A Draft Proposal for a Standard for the Coding of Machine-Readable Sources", Historical Social Research 11 (1986) 4, 3-46, doi: 10.12759/hsr.11.1986.4.3-46, and elsewhere in this volume.
- Symbolic macro formulated as an expression out of the preceding terms (e.g. "dedication day" to be expanded to "Sunday after Ascension").
2) There exists a finite number of mappings between possible subsets of these
elements. These mappings can be described by:
- A number of algorithms, each converting one representation into another.
- A number of “lexica” or “look-up tables”, describing the meaning of a set of
instances of one of these subsets in terms of other elements.
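By way of illustration only – the lexica below are tiny, invented fragments, and the regnal-year arithmetic is deliberately naive – such lexica and mappings might be realized along the following lines:

# Illustrative sketch only: two tiny "lexica" and the mappings between them.
# The tables are hypothetical fragments; real lexica would need complete feast
# calendars and regnal lists, and regnal-year arithmetic would have to respect
# the actual accession day, which this naive version ignores.
FEASTS = {"Michaelmas": (29, 9)}           # feast name -> (day, month)
REIGNS = {"Eliz I": 1558}                  # ruler -> calendar year containing regnal year 1

def to_calendar(feast, regnal_year, ruler):
    """Map a feast/regnal-year expression onto (day, month, year) -- naively."""
    day, month = FEASTS[feast]
    return day, month, REIGNS[ruler] + regnal_year - 1

FEAST_CODES = {"Michaelmas": 1245}         # nominal codes a statistical package could hold
RULER_CODES = {"Eliz I": 126}

def to_codes(feast, regnal_year, ruler):
    """Map the same expression onto nominally scaled variables, as in the text."""
    return FEAST_CODES[feast], regnal_year, RULER_CODES[ruler]

print(to_calendar("Michaelmas", 40, "Eliz I"))   # naive result; see caveat above
print(to_codes("Michaelmas", 40, "Eliz I"))      # (1245, 40, 126)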
Okay, SPSS will still not understand Michaelmas 40 Eliz I. Or will it? In principle,
if we take the preceding paragraphs seriously, there is no reason why SPSS could
not understand the expression Michaelmas 40 Eliz I so long as there was a complete
lexicon of all feasts on a local calendar and one giving the years during which
different rulers reigned. With such lexica we could mechanically map various terms
into nominally scaled variables thereby converting Michaelmas 40 Eliz I into
something like 1245 40 126. And, while this author would not necessarily like to do
it, one could implement the necessary conversion algorithm as a huge set of
COMPUTEs, RECODEs, IFs and the like, provided a well understood algorithm
existed. A generic description of this approach to standardization might then run as follows:
A “standard” for data collection and exchange, should define a set of basic “data
types” of the kind just discussed. It should further identify various types of software
systems – statistical packages, database administrators, natural language parsing
systems, administrators of nonlinear data – which have clearly defined capabilities
for treating each of the data types that are conceptually defined.15 Now, such a
“standard” does not prescribe for researchers how to organize and enter information
in machine-readable form, but it does allow them to tell a fellow researcher
interested in using the dataset precisely how it was constructed and the decisions
that were made when data were collected, organized and entered into the system.
That kind of standard must not simply consist of paper documentation but of
software which can convert the representation of like datatypes (e.g. temporal
information) into a format suitable to any number of different processing
environments.16 Recent discussions have brought to light seven data types which, if developed conceptually, would be sufficient to describe the entire range of information that historians encounter. They are:
- Full text.

15 This paper intentionally avoids making clear-cut distinctions between types of software systems. In practice the only important distinctions are those which distinguish the data structures used by a given system. A good example of how elusive more precise categorization of software systems can be is found in the discussion of text bases by John J. Hughes, Bits, Bytes & Biblical Studies, Grand Rapids 1987, 497.
16 Format conversion tools might be provided as frontends to data base software. Cf. Martin Gierl, Thomas Grotum and Thomas Werner, Der Schritt von der Quelle zur historischen Datenbank. StanFEP: Ein Arbeitsbuch, St. Katharinen: Scripta Mercaturae 1990 (= Halbgraue Reihe zur Historischen Fachinformatik A 6); Kathrin Homann: StanFEP. Ein Programm zur freien Konvertierung von Daten, St. Katharinen: Scripta Mercaturae 1990 (= Halbgraue Reihe zur Historischen Fachinformatik B 6).

- "Descriptors" or fields coded in a "controlled vocabulary". This may seem the
most obscure category for the statistically minded since it includes information as
diverse as occupations, educational careers, qualifications, degrees and the like.
The point is that, while in statistical systems such terms map onto some system
of numeric values at the nominal level of scaling, in other worlds of software –
e.g. within databases – they are frequently treated by complex networks of
interrelated thesauruses or look-up tables and are thus structurally very similar.
- Names. Once again these are simply instances of nominal variables in statistical
systems. Outside of this processing world, however, they are usually represented
by strings between which algorithmically expressed degrees of proximity exist.
- Numbers. These are actually much more complicated in historical sources than
calendar dates: being non-decimal, expressed in layered systems of units and so on.
- Calendar dates / temporal units.
- Spatial references.
- Images.
A descriptive standard should define conceptual models for each of these abstract
data types which would then inform how information is structured within a specific
processing environment in accordance with the data modelling capabilities of that
environment. A specific implementation would also include a defined set of
importing and exporting procedures with which data could be converted into neutral
formats or formats which are more congenial to other processing environments.
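To hint at what such importing and exporting procedures could look like – this is a sketch under invented assumptions, not a proposal for the standard itself – one conceptual data type can be held in a neutral form and rendered differently for two of the environments discussed above:

# Illustrative sketch only: one conceptual data type ("temporal information")
# held in a neutral form, with exporters for two different processing environments.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TemporalInfo:
    day: Optional[int] = None
    month: Optional[int] = None
    year: Optional[int] = None
    original: Optional[str] = None       # the spelling found in the source

def export_statistical(t):
    """Fixed-column style: month, day, year as atomic numeric variables."""
    return f"{t.month or 0:02d}{t.day or 0:02d}{t.year or 0:04d}"

def export_database(t):
    """'tag=interpreted@original' style for a structured database."""
    interpreted = f"{t.day} {t.month} {t.year}"
    return f"date={interpreted}@{t.original}" if t.original else f"date={interpreted}"

michaelmas = TemporalInfo(day=7, month=9, year=1598, original="Michaelmas 40 Eliz I")
print(export_statistical(michaelmas))    # 09071598, as in the example record above
print(export_database(michaelmas))       # date=7 9 1598@Michaelmas 40 Eliz I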
Our original idea has been to create a standard not by prescribing to historians
how to behave, but by describing their behaviour and their use of several different
processing environments each of which relies upon radically different data
structures. From this perspective, the only generic solution to standardization seems
to be one which attempts some abstract definition of the types of information which
actually occur in historical sources and which are represented rather differently in
different processing environments. These conceptual models then should underlie
all practical recommendations about how data should be organized and entered into
machine-readable form in different processing environments. They are also
essential to the development of conversion facilities which will enable data
organized for one processing environment to be translated into a format suitable for
other rather different ones. The descriptive, consumer-oriented standard is
therefore also a model-driven one. But does its promise justify the enormous effort
which will be required simply to define the conceptual models let alone to create
the conversion facilities? We should remind the reader what is at stake. Without the
conceptual models of different historical data types the aims of standardization (any
notion of standardization) cannot be achieved. To recap these briefly: secondary
analysis; more fruitful cooperation between colleagues at different universities; the
ability to change processing environments in mid-project; the ability to continue ten
or fifteen years from now work begun today on soon-to-be outmoded hardware and
software. There is no question whether we have to undertake the effort; the question
is, how to organize it. And here we might derive some valuable lessons from
previous unsuccessful attempts at defining standards.
Two reasons seem to lie behind past failures, and both can be described as vicious circles. Firstly, there is the problem of secondary analysis. So long as
secondary analysis remains so difficult as to be almost unapproachable, few people
will attempt it. Similarly, so long as only a few people actually use a standard
which includes written documentation and a supported environment, there is no
reason to produce one. And so long as there aren’t any software tools necessary for
effective data exchange and thus for secondary analysis, secondary analysis will
continue to be difficult and unapproachable and few will attempt it. Secondly, there
is the problem of database preparation. So long as databases are not proofread just
as painstakingly as printed monographs, they will not be trusted and used very
much for secondary analysis. As long as databases are not trusted and used, their
creation and more importantly their refinement through painstaking editing will
provide only the smallest bonus for the academic reputation of an historian. As long
as the creation of databases offers few benefits in terms of one’s career, databases
will not be made as free of errors as printed editions are and so will neither be
trusted nor used for secondary analysis.
Work on the consumer-oriented standard outlined above needs to be undertaken
in three concurrent phases.
- We need well understood and rigorously documented conceptual models of the
historical datatypes defined above.
- We need tools to transfer data from one processing environment to another. No
historian will follow any kind of standard which offers no guarantee about
enhanced prospects for data exchange.
- We have to teach users why they might profit from the standards that are
developed – e.g. how they might benefit from secondary analysis. This is not a
problem amongst like-minded members of small user communities. One reason
for SGML’s success is its immediate appeal to publishers who have a clearly
defined purpose in mind: printing books in different editions and different
layouts, without re-entering the text. But with the wider community of computer-using historians such instruction is central if a standard is to be developed and
widely used. As long as secondary analysis, for example, is just a vague idea,
people will pay lip service to it but certainly not invest their time, effort and
limited finances into actually trying it. And without training courses which show
researchers how they can get results out of somebody else’s data, it will remain
hard to convince anybody to prepare their own data for such purposes.
We started by saying that a standard must not be prescriptive. Computer
applications are a growth area in historical research and a standard must never
prevent somebody entering the computer-using community right now and inventing
a new and better solution. The reason why this is usually not emphasized as much
as it should be is that researchers discussing standardization frequently have “guru
status” within their respective communities. And, when you are asked six times a
month “Very interesting, but just precisely what shall I really do in this specific
situation?” you tend to forget that the “obvious answer” that one formulates on the
basis of fifteen or twenty years of experience in a specific subsection of the field
might be anything but obvious to someone who entered data processing from a different
angle. In other words, for reasons of methodological purity we should abstain from
giving recipes. But all of us are asked for recipes some of the time. So what we
need are two completely different types of documents and approaches to
standardization. On the one hand we need standards which describe to someone
expert in a particular processing environment, for example statistically oriented
types of processing, what to do with information which comes in a format specific
to an altogether different processing environment, for example, a HyperCard stack.
That kind of standard can only be built upon a careful analysis of the common core
shared by otherwise distinctive processing environments and that core we have
identified as the fairly atomic conceptual models of specific types of historical
information. But standards available at this level are alone insufficient. For a large
number of “customers”, as opposed to technical experts, a strange situation exists.
They are at once interested in expressing their problems in specifically historical
terms but not in abstractly conceptual ones. Thus the abstract conceptualization of
historical datatypes will be too technical and of little interest to the user who simply
wants to know that defined connections exist between whatever processing
environment he or she chooses and the other processing environments used by the
rest of the computer-using historical community. To resolve this contradiction, I
think that it is the responsibility of experts in the various branches of data
processing to develop standards at the level of model solutions to generic problems
– solutions which are based on their knowledge of exchange possibilities between
different processing environments. Such model solutions will, for example, give
some guidance to users who are interested in analyzing data from a parish register
using software XYZ, or compiling a database of wills with database ABC, and so
on.17 Such model solutions would not express inflexible opinions about how to
analyze a specific source but give some guidance by demonstrating the current state
of the art, and would almost surely be used by quite a few newcomers to computer-aided historical analysis.
Computer usage in history, and probably in most of the humanities, is currently
being discussed, then, at two different levels. We have on the one hand the historian
who has almost exclusively subject-oriented interests and who wants some advice
on how to use computer techniques to obtain a few narrowly defined objectives
without having to fully embrace the complex arguments for and against the
alternative processing environments and the implications of their use in a given
project. A “standard” for this clientele has to consist of specific guidance on how to
accomplish a given task with the help of specific tools. On the other hand there are
17 Such guidance exists for Κλειω in: Thomas Werner and Thomas Grotum: Sämtlich Hab und Gut ... Die Analyse von Besitzstandslisten, St. Katharinen: Scripta Mercaturae 1989 (= Halbgraue Reihe zur Historischen Fachinformatik A 2); Jürgen Nemitz: Die historische Analyse städtischer Wohn- und Gewerbelagen: die Auswertung sozialtopographischer Quellen, St. Katharinen: Scripta Mercaturae 1989 (= Halbgraue Reihe zur Historischen Fachinformatik A 3); Barbara Schuh: "Von vilen und mancherley seltzamen Wunderzaichen": die Analyse von Mirakelbüchern und Wallfahrtsquellen, St. Katharinen: Scripta Mercaturae 1989 (= Halbgraue Reihe zur Historischen Fachinformatik A 4); Peter Becker: Leben, Lieben, Sterben. Die Analyse von Kirchenbüchern, St. Katharinen: Scripta Mercaturae 1989 (= Halbgraue Reihe zur Historischen Fachinformatik A 5); Steffen Wernicke and Martin Hoernes: "Umb die unzucht die ich handelt han ...": Quellen zum Urfehdewesen, St. Katharinen: Scripta Mercaturae 1990 (= Halbgraue Reihe zur Historischen Fachinformatik A 9).
consultants, specialists responsible for data processing at various institutions, proponents of various ways of teaching computing in different disciplines. For them
a standard should provide the tools which would ensure that the very different
solutions they collectively promote remain mutually compatible. For this group, we
have to provide more abstract definitions of what our kind of data processing is all
about.18 Borrowing keywords from the most recent wave of industrial PR, such
definitions have to be independent of “obvious” technical solutions which get
surpassed by the next wave of methodological and software evolution. A “standard”
for expert users has to consist of definitions and tools which allow them to
maximize their current technical knowledge and to benefit the researchers who are
consulting them without precluding future technical developments.
What, then, is standardization about? Nobody needs a standard in those areas of
computer processing in which only a few highly expert users are engaged.
Standards are needed where many people work in slightly different ways and want
to be able to communicate with one another. At least in history standards should not
tell the “silly crowd” how to behave; they should define instead what all the
valuable “schools” of computer usage already in existence have in common. They
should, more than anything else, allow the historian well versed in one usage to
drift painlessly into another without losing the data already entered and without
having to relearn absolutely everything.
18 Such definitions have been demanded for some time not just as a means of developing data processing standards but as a means of qualitatively improving data processing within the humanities. Cf. Jean Claude Gardin, "Current and Future Role of D(ata)B(ase)s in the S(ocial) S(ciences and) H(umanities)", in Thomas F. Moberg (ed.), Data Bases in the Social Sciences and the Humanities 3 (Osprey: Paradigm Press, 1987), 112-6.