
Reporting and Communicating Data Quality

In 2002, I joined the UNIGIS postgraduate diploma course Geographic Information Science at VU University Amsterdam (www.unigis.nl). The aim of this course is to give students an understanding of the technical, geographical and organizational aspects of GIS. The program consists of eight modules and three workshops. Each module is concluded by writing two essays on the subject of the modules. The 16 essays are incorporated in my Academia UNIGIS Essay section.

Reporting and Communicating Data Quality in Desktop GIS
Jan Stafleu

Module: 4
Assignment: 2A
Intake: September 2002
Author: Jan Stafleu
Date: June 14, 2003
Word count: 2623 (excluding references)

1 Introduction

This report evaluates the functionality of desktop GIS for reporting and communicating data quality. The report consists of four sections: (1) introduction; (2) the facilities of IDRISI to report and communicate data quality metadata; (3) a comparison of the facilities of IDRISI and GeoMedia; and (4) an assessment of advanced options for reporting and communicating data quality.

1.1 Data quality characteristics

A data quality report enables the transfer of information on the quality of a particular data file or data set. The data quality report has been defined by bodies such as the AGI in the U.K. and the NCDCDS in the U.S. The NCDCDS describes five sections required in the quality report:

• Lineage
• Positional accuracy
• Attribute accuracy
• Logical consistency
• Completeness

2 Reporting and Communicating Data Quality in IDRISI

This chapter examines the following facilities of IDRISI 2.0 to report and communicate data quality metadata:

• Log Files – providing a basic form of audit information and lineage;
• Macro Files – allowing automated production and reproduction of analyses;
• Macro Modeler – graphic representation of the GIS analysis;
• Metadata Files – documentation files accompanying each data file used in an analysis;
• Statistical functions.

These facilities will be illustrated using an application from Assignment 2 of Module 2 (ACME International Industries – Site location in Anytown region). Part of the cartographic model of this application is given in Figure 1.
Module 4 - Taa 2A - Reporting and Communicating Data Quality d20030614.doc

[Figure 1 diagram: railway_station and bus_station are each passed through POINTRAS to give the railway station raster and the bus station raster; OVERLAY (pt = r + b) combines them into public transport; BUFFER (1000 m) produces accessible area.]

Figure 1: Example of a cartographic model. Two vector files are rasterized (POINTRAS) and combined (OVERLAY) to create a public transport file. The BUFFER operation is then used to calculate accessible areas within 1000 m of a public transport facility.

2.1 Log Files

A log recording all commands executed and any warning or error messages produced is automatically initialized each time IDRISI is launched. Log files are ASCII files that may be viewed and edited using the IDRISI function EDIT or any word processing program. The syntax of command parameters in the log file is essentially identical to the macro command structure. Because of this, log files can be used as a starting point for creating macros.

Example

Figure 2 shows a log file for the manual implementation of the cartographic model of Figure 1. If the log is stored with the results of the analysis, it can act as a basic lineage of the implementation.

Figure 2: Log file created when the GIS analysis from Figure 1 is performed manually.

2.2 Macro Files

Macro files are used to automate the execution of multiple analytical steps (commands) in IDRISI. The file contains the commands and all the parameters necessary to run them. Nearly every IDRISI command may be run in this mode. The information normally entered interactively in the dialog box is instead typed into the macro file along with the module name. Commands are executed in the order they appear in the file.
Macro files may be created using the IDRISI function EDIT or another word processing program that supports ASCII. Each IDRISI command has its own syntax in a macro file. Macro files can be executed repeatedly and thus provide a means to reproduce results. This capability is a great advantage over the log file discussed above. However, like log files, macro files are not easy to read, and they require additional information to serve as a true lineage.

Example

The log file of Figure 2 is used to create a macro file (Figure 3).

Figure 3: Macro file allowing an automated GIS analysis.

2.3 Macro Modeler

Version 2.0 of IDRISI offers the macro modeler, a graphical modeling environment for building and executing multi-step models. Facilities are included for batch processing (running many inputs through the same model to produce many outputs) and for dynamic modeling (using the output of one iteration of a model as an input into the next iteration). The macro modeler is also a rather sophisticated form of lineage: it is a graphical representation of the GIS analysis performed.

Example

The example from the previous sections was modeled using the Macro Modeler (Figure 4). The model shows input files, such as raster images (purple rectangles) and vector layers (green rectangles). These are linked with IDRISI commands (parallelograms) that in turn link to output data files. Note that the graphical model is much like the original cartographic model described in Figure 1. The model can be run repeatedly (as is the case with the text-based macro file). All data files can be viewed by a simple mouse-click on the symbol in the model. Even a summary of the metadata of each data file is available.
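To illustrate how a scripted model can both reproduce an analysis and document its lineage, the cartographic model of Figure 1 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not IDRISI macro syntax: the module names POINTRAS, OVERLAY and BUFFER are used only as labels, and the grid size, the 100 m cell size and the station locations are hypothetical.

```python
import math

CELL_SIZE_M = 100.0  # assumed cell size in metres (hypothetical)


def pointras(points, rows, cols):
    """Rasterize point features: each cell holds the number of points in it."""
    grid = [[0] * cols for _ in range(rows)]
    for row, col in points:
        grid[row][col] += 1
    return grid


def overlay_add(first, second):
    """Cell-by-cell addition of two rasters of equal size (pt = r + b)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(first, second)]


def buffer(grid, distance_m, cell_size_m=CELL_SIZE_M):
    """Return 1 for cells whose centre lies within distance_m of a non-zero cell."""
    targets = [(r, c) for r, row in enumerate(grid)
               for c, value in enumerate(row) if value > 0]
    return [[1 if any(math.hypot(r - tr, c - tc) * cell_size_m <= distance_m
                      for tr, tc in targets) else 0
             for c in range(len(row))]
            for r, row in enumerate(grid)]


def run_model(railway_points, bus_points, rows=30, cols=30):
    """Run the three steps in order and record a plain-text lineage."""
    lineage = []
    rail = pointras(railway_points, rows, cols)
    lineage.append("POINTRAS railway_station -> railway station raster")
    bus = pointras(bus_points, rows, cols)
    lineage.append("POINTRAS bus_station -> bus station raster")
    transport = overlay_add(rail, bus)
    lineage.append("OVERLAY pt = r + b -> public transport")
    accessible = buffer(transport, 1000.0)
    lineage.append("BUFFER 1000 m -> accessible area")
    return accessible, lineage


# Hypothetical station locations given as (row, column) cell indices.
accessible, lineage = run_model([(5, 5)], [(20, 20)])
print("\n".join(lineage))
```

Because the lineage list is written by the same code that performs each step, the record cannot drift away from the analysis, which is the property that makes the macro modeler useful as documentation.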
Figure 4: The Macro Modeler as a means for both documenting and automating GIS analyses. Note similarities with the cartographic model in Figure 1.

2.4 Metadata Files

IDRISI has functionality for metadata creation in documentation files. Each file type – raster images, vectors and attributes – has its own style of documentation file. These files have data fields that record details of each data set. The metadata consists of a series of lines (properties) containing vital information about the corresponding image file (Figure 5). Some properties are calculated automatically (e.g. number of columns and rows; minimum and maximum X and Y coordinates), whereas other properties must be entered manually by the user (e.g. display minimum and maximum).

Two properties are of particular interest for data quality assessment: positional error and value error (i.e., attribute error). The user must enter both properties manually. Positional error indicates how close a feature’s actual position is to its mapped position in the image. Unfortunately, the field is for documentation purposes only and is not currently used analytically by any of IDRISI’s operations (Eastman, 2001a).

The value error field is very important and should be filled out whenever possible. It records the error in the data values that appear in image cells. This field is analytical for some commands and is intended to be incorporated into more commands in the future (Eastman, 2001a). An example of PCLASS using value error is given in section 4.2.

Figure 5: Metadata of the public transport raster file. See text for discussion.
Figure 6: Metadata of the public transport raster file – continued from Figure 5. See text for discussion. Text in the Lineage field was created automatically.

Some of the information stored (consistency, completeness, lineage, comments) conforms to the US NCDCDS guidelines. However, in the current version of IDRISI these fields are only text based and, with the exception of the lineage field, data is entered manually by the user (Figure 6). The lineage field is automatically filled with the command line used to create the file, in a format similar to that of the macro file.

2.5 Statistical functions

Statistical functions, in particular HISTO, can be used to show the distribution of data. Using HISTO, anomalies (e.g. gross errors introduced by typos in data entry) can be easily spotted.

2.6 Summary

The data quality metadata and lineage that can be handled by IDRISI are summarized in the table below, identifying the method by which data quality information is recorded (manually or automatically) and reported (by text, statistics or graphics).

Type                          Recording method                                     Reporting method
Log file                      Automatically                                        Text
Macro file                    Manually (*)                                         Text
Macro modeler                 Manually                                             Graphics
Documentation – properties    Part automatically; part manually                    Statistics
Documentation – notes         Lineage in part automatically; otherwise manually    Text
HISTO                         Automatically                                        Statistics and plots

(*) Note that macro files can be run automatically, but have to be created manually. There is no “recording” functionality as is the case in MS Word or MS Excel macros.
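The kind of check that section 2.5 attributes to HISTO can be illustrated with a short Python sketch. It is not the HISTO module itself: the function names, the bin width, the gap rule and the sample values are all hypothetical, chosen only to show how a gross error such as a data-entry typo ends up in a bin far away from the rest of the distribution.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; keys are the lower bounds of occupied bins."""
    counts = Counter(int(v // bin_width) for v in values)
    return {b * bin_width: counts[b] for b in sorted(counts)}


def isolated_bins(hist, bin_width, min_gap_bins=3):
    """Flag occupied bins separated from every neighbour by empty bins.

    A bin is flagged when at least min_gap_bins empty bins lie between it
    and its nearest occupied neighbour (an assumed, arbitrary rule).
    """
    starts = sorted(hist)
    flagged = []
    for i, start in enumerate(starts):
        gaps = []
        if i > 0:
            gaps.append((start - starts[i - 1]) / bin_width - 1)
        if i < len(starts) - 1:
            gaps.append((starts[i + 1] - start) / bin_width - 1)
        if gaps and min(gaps) >= min_gap_bins:
            flagged.append(start)
    return flagged


# Hypothetical attribute values; 150 stands in for a typo of 15.
values = [12, 14, 15, 15, 17, 18, 21, 22, 150, 16]
hist = histogram(values, bin_width=5)
for start, count in hist.items():
    print(f"{start:>4} {'#' * count}")
print("suspect bins:", isolated_bins(hist, bin_width=5))
```

A flagged bin is only a prompt for inspection; whether the value is really an error still has to be judged against the source data.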
3 Comparison of the facilities of IDRISI and GeoMedia

In this chapter we will compare and contrast IDRISI's facilities for reporting and communicating data quality with those provided by Intergraph’s GeoMedia Professional, version 5.0.

3.1 Main differences

IDRISI and GeoMedia have entirely different approaches to desktop GIS. IDRISI is based on raster images and has a large collection of commands that operate on these images to create new images. The raster images are stored in separate, independent files. Lineage, such as provided by the macro modeler, is needed to understand how the files are connected.

GeoMedia, on the other hand, is vector-based and data is stored in databases. These databases can either be read-write or read-only, the latter being connections to external databases. Instead of commands producing new images, GeoMedia has queries. Queries are based on any combination of databases and other queries. A GIS analysis consists of a set or sequence of queries. In the example from Chapter 2, we would have two databases for the railway and bus stations, a query that combines the two datasets, and a query that would calculate a buffer around each public transport station.

Another important difference is that in GeoMedia, changes in the databases will propagate directly through all queries that are based on that database. If, for example, a new bus station opens, we would see the effect of the new bus station directly in the buffer query. This difference has both advantages and disadvantages. The main advantage is that we do not need to rerun an entire analysis after data changes (as is the case in IDRISI). The disadvantage is that analyses performed in the past cannot be reproduced, because any changes since the original analysis will influence the results of the analysis today.
In order to guarantee reproducibility, we would have to import and store all data. Another solution to this problem would be to time stamp the original data.

3.2 Recording metadata

Since we do not have to rerun analyses, we might conclude that we do not need lineage documentation like IDRISI’s macro modeler. However, it is unclear how queries in GeoMedia relate to one another. Each query by itself is documented (Figure 7), but the documentation is restricted to the data sources of the query itself. If these data sources are themselves queries, we have to inspect the definition of these queries to find out what data sources they use, and so on. Alternatively, the user must have the discipline to record the lineage of the query in the rather small Description field (Figure 7). This Description field is also the only place in GeoMedia where we can record some metadata.

Figure 7: Documentation of an attribute query in GeoMedia.

3.3 Logging and macros

GeoMedia does not have simple logging or macro language tools. Instead, it offers full integration with programming languages such as Visual Basic, PowerBuilder and C++ (comparable with IDRISI’s API-gate). With these programming languages, we should be able to build additional functionality that can create and manage lineage.

3.4 Logical consistency

GeoMedia offers several tools to check and correct logical consistency. These tools are important for vector files, where topological errors can arise. GeoMedia includes a full set of production tools to help the user capture clean, accurate data the first time with minimal editing.

3.5 Other Products

GeoMedia’s manufacturer, Intergraph, offers other products that handle metadata and lineage. SMMS (Spatial Meta-Data Management System) allows users to create, edit, view and publish FGDC-compliant spatial metadata.
The system can be integrated with GeoMedia. GIDM (Geospatial Intelligence Data Management) is a data management tool that includes temporality (time versioning) and rich metadata describing currency, accuracy, and lineage of data. However, these products are not part of the standard desktop GIS and are not considered here.

4 Other options for reporting and communicating data quality

The previous chapters focused on the data quality aspects of lineage, metadata, reproducibility, and logical consistency. In this chapter, we will discuss some more advanced options for quantification and visualization of error and uncertainty in spatial data.

4.1 Uncertainty in class identification

Chrisman (1987) and Aspinall and Pearson (1994) describe how the accuracy of class identification in thematic maps can be quantified using a confusion matrix. The confusion matrix compares the classes identified on a map with classes identified in the field. IDRISI offers functionality to select sample points in a random or systematic fashion. These sample points can then be visited in the field, and the measured values can be tabulated against the values in the map (IDRISI modules SAMPLE and ERRMAT; Eastman, 2001a).

4.2 Error propagation

Uncertainty in any one data layer will propagate through an analysis and combine with other sources of error. Uncertainty in a data layer and its propagation are therefore important data quality characteristics. We have already seen that IDRISI records measures of both positional error and value error (attribute error) for each data set. The value error is analytical for some (but not all!) commands. An example of the PCLASS command using value error is given in Exercise 3-2 of the IDRISI Tutorial (Eastman, 2001b).
In this exercise, Eastman (2001b) simulates flooding of a coastal area due to a sea level rise of 0.48 m with a standard deviation (uncertainty) of 0.08 m. Uncertainty in the elevation data is estimated at 0.30 m. The combined (propagated) error is calculated from these two uncertainties and results in a value error of 0.31 m. A traditional GIS analysis would neglect the uncertainties and simply subtract 0.48 m from all heights in the dataset. PCLASS, on the other hand, uses the combined value error to calculate the probability that land will be below sea level (Figure 8).

The same procedure can be used to visualize uncertainties in the location of boundaries (positional error) on a thematic map as described by Hunter and Goodchild (1996). PCLASS, however, is one of the few operations that use the value error.

[Figure 8 image. Labels: Sea; Land; Probability that area is below sea level.]

Figure 8: New coastline calculated after a sea level rise. Left: traditional analysis, showing the new coastline as a sharp boundary. Right: analysis using uncertainties in both sea level rise and elevation measures. Areas appearing black have an estimated probability of being inundated of 0, while those that are green approach a probability of 1. There is a range of colors in between where the probability values are less certain. Modified after Eastman (2001b).

4.3 Monte Carlo modeling

Monte Carlo simulation is used to visualize the propagation of error throughout the analysis by introducing random error in the original data sets. The analysis is run twice, first without introducing error, and then a second time using data with simulated error. An overlay operation can then be used to calculate the difference between the two. IDRISI offers tooling for Monte Carlo simulations (Eastman, 2001a).
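The arithmetic behind the combined value error in section 4.2, and a Monte Carlo check in the spirit of this section, can be sketched in Python. The sketch assumes that the two errors are independent and normally distributed, so that they combine as a root sum of squares; this is an illustration of the idea and is not claimed to be how PCLASS or IDRISI's Monte Carlo tooling is implemented. The function names and the test elevation of 0.60 m are hypothetical.

```python
import math
import random


def combined_error(*standard_deviations):
    """Root-sum-of-squares combination of independent standard deviations."""
    return math.sqrt(sum(sd * sd for sd in standard_deviations))


def prob_below(elevation, threshold, sd):
    """Probability that the true value lies below threshold (normal error)."""
    z = (threshold - elevation) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def monte_carlo_prob_below(elevation, rise, rise_sd, elev_sd,
                           runs=20000, seed=42):
    """Estimate the same probability by simulating both error sources."""
    rng = random.Random(seed)
    flooded = 0
    for _ in range(runs):
        simulated_rise = rng.gauss(rise, rise_sd)
        simulated_elevation = rng.gauss(elevation, elev_sd)
        if simulated_elevation < simulated_rise:
            flooded += 1
    return flooded / runs


sigma = combined_error(0.08, 0.30)      # values quoted from Eastman (2001b)
print(f"combined value error: {sigma:.2f} m")

cell_elevation = 0.60                   # hypothetical cell elevation in metres
analytic = prob_below(cell_elevation, 0.48, sigma)
simulated = monte_carlo_prob_below(cell_elevation, 0.48, 0.08, 0.30)
print(f"analytic: {analytic:.3f}  simulated: {simulated:.3f}")
```

The two estimates should agree to within sampling noise. The simulation route becomes the more useful one when an operation, such as a buffer, has no simple closed-form expression for how error propagates.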
An application of Monte Carlo modeling would be to simulate the propagation of error through the buffer operation, as described by Veregin (1994, 1996).

4.4 Raster overlay

Hunter and Goodchild (1996) propose using raster images, such as aerial photographs, as a background to vector polygons to visualize data quality. This is an easy task in desktop GIS.

4.5 Animation and sound

The methods mentioned above are all static displays of error. Fisher (1992) and Hunter and Goodchild (1996) propose sophisticated dynamic displays, where animations and even sound help the user to assess the effect of error. For example, if areas of a certain soil type are known to be heterogeneous, we can model this by showing inclusions of other soil types within the otherwise homogeneous area. Monte Carlo modeling can be used to create a random distribution of the inclusions. However, this static display would give the false impression that we also know the location of the inclusions, which is not the case. In an animated display, the inclusions would constantly change location, thus avoiding misinterpretation. Unfortunately, animations and sounds are currently not available in desktop GIS.

5 References

Aspinall, R.J. and Pearson, D.M., 1994. Describing and managing data quality for categorical maps in GIS. In: Proceedings of the Conference on GIS Research UK 1994, Leicester, 11-13 April, p.161-168.
Chrisman, N.R., 1987. The accuracy of map overlays: a reassessment. Landscape and Urban Planning, 14, p.427-439.
Eastman, J.R., 2001a. IDRISI Guide to GIS and Image Processing (2 volumes). Clark Labs, Worcester, USA.
Eastman, J.R., 2001b. IDRISI Tutorial. Clark Labs, Worcester, USA.
Fisher, P.F., 1992. Animation and sound for the visualization of uncertain spatial information. Presented at the AGI Workshop on Visualisation, p.181-185.
Hunter, G.J. and Goodchild, M.F., 1996. Communicating uncertainty in spatial databases. Transactions in GIS, 1, p.13-24.
Veregin, H., 1994. Integration of simulation modeling and error propagation for the buffer operation in GIS. Photogrammetric Engineering and Remote Sensing, 60, p.427-435.
Veregin, H., 1996. Error propagation through the buffer operation for probability surfaces. Photogrammetric Engineering and Remote Sensing, 62, p.419-428.