Reporting and Communicating Data Quality in Desktop GIS
Jan Stafleu
Module: 4
Assignment: 2A
Intake: September 2002
Author: Jan Stafleu
Date: June 14, 2003
Word count: 2623 (excluding references)
1 Introduction
This report evaluates the functionality of desktop GIS for reporting and
communicating data quality. The report consists of four sections: (1) introduction; (2)
the facilities of IDRISI to report and communicate data quality metadata; (3) a
comparison of the facilities of IDRISI and GeoMedia; and (4) an assessment of
advanced options for reporting and communicating data quality.
1.1 Data quality characteristics
A data quality report conveys information on the quality of a particular data file or
data set. Its content has been defined by bodies such as the AGI in the U.K. and the
NCDCDS in the U.S. The NCDCDS describes five sections required in the quality
report:
Lineage
Positional accuracy
Attribute accuracy
Logical consistency
Completeness
2 Reporting and Communicating Data Quality in IDRISI
This chapter examines the following facilities of IDRISI 2.0 to report and
communicate data quality metadata:
Log Files – providing a basic form of audit information and lineage;
Macro Files – allowing automated production and reproduction of analyses;
Macro Modeler – graphic representation of the GIS analysis;
Metadata Files – documentation files accompanying each data file used in an
analysis;
Statistical functions.
These facilities will be illustrated using an application from Assignment 2 of Module
2 (ACME International Industries – Site location in Anytown region). Part of the
cartographic model of this application is given in Figure 1.
Module 4 - Taa 2A - Reporting and Communicating Data Quality d20030614.doc
[Figure 1 diagram: vector files railway_station and bus_station → POINTRAS → railway station raster and bus station raster → OVERLAY (pt = r + b) → public transport → BUFFER 1000 m → accessible area]
Figure 1: Example of a cartographic model. Two vector files are rasterized (POINTRAS) and
combined (OVERLAY) to create a public transport file. The BUFFER operation is then used to
calculate accessible areas within 1000 m of a public transport facility.
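The cartographic model of Figure 1 can be illustrated outside IDRISI with a minimal Python sketch. The functions below are rough stand-ins for POINTRAS, OVERLAY and BUFFER and are not IDRISI's implementations; the grid size, the cell size of 100 m and the station coordinates are assumptions made only for illustration.

```python
import math

CELL_SIZE = 100.0    # metres per cell (assumed)
ROWS, COLS = 30, 30  # grid dimensions (assumed)


def rasterize_points(points, rows=ROWS, cols=COLS, cell=CELL_SIZE):
    """Stand-in for POINTRAS: set each cell that contains a point to 1."""
    grid = [[0] * cols for _ in range(rows)]
    for x, y in points:
        col, row = int(x // cell), int(y // cell)
        if 0 <= row < rows and 0 <= col < cols:
            grid[row][col] = 1
    return grid


def overlay_add(a, b):
    """Stand-in for OVERLAY with addition: pt = r + b, cell by cell."""
    return [[va + vb for va, vb in zip(ra, rb)] for ra, rb in zip(a, b)]


def buffer_cells(grid, distance, cell=CELL_SIZE):
    """Stand-in for BUFFER: 1 where a cell centre lies within `distance`
    of the centre of any non-zero cell, otherwise 0."""
    targets = [(r, c) for r, row in enumerate(grid)
               for c, v in enumerate(row) if v > 0]
    out = [[0] * len(grid[0]) for _ in grid]
    for r in range(len(grid)):
        for c in range(len(grid[0])):
            if any(math.hypot(r - tr, c - tc) * cell <= distance
                   for tr, tc in targets):
                out[r][c] = 1
    return out


# Hypothetical station coordinates in metres.
railway_station = [(550.0, 1250.0)]
bus_station = [(2050.0, 2450.0)]

public_transport = overlay_add(rasterize_points(railway_station),
                               rasterize_points(bus_station))
accessible_area = buffer_cells(public_transport, 1000.0)
```

Each intermediate grid corresponds to one box in the cartographic model, which is why the order of the function calls can itself be read as a simple lineage.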
2.1 Log Files
A log file recording all commands executed and any warning or error messages
produced is started automatically each time IDRISI is launched. Log files are ASCII
files that may be viewed and edited using the IDRISI function EDIT or any other
program that can edit ASCII text.
The syntax of command parameters in the log file is essentially identical to the macro
command structure. Because of this, log files can be used as a starting point for
creating macros.
Example
Figure 2 shows a log file for the manual implementation of the cartographic model of
Figure 1. If the log is stored with the results of the analysis, it can act as a basic
lineage of the implementation.
Figure 2: Log file created when the GIS analysis from Figure 1 is performed manually.
2.2 Macro Files
Macro files are used to automate the execution of multiple analytical steps
(commands) in IDRISI. The file contains the commands and all the parameters
necessary to run them in the order they should be run. Nearly every IDRISI command
may be run in this mode. The information normally entered interactively in the dialog
box is instead typed into the macro file along with the module name. Commands are
executed in the order they appear in the file.
Macro files may be created using the IDRISI function EDIT or another word
processing program that supports ASCII. Each IDRISI command has its own syntax
in a macro file. Macro files can be executed repeatedly and thus provide a means to
reproduce results, which is a great advantage over the log file discussed above.
However, like log files, macro files are not easy to read, and they would need
additional explanatory information to serve as a true lineage.
Example
The log file of Figure 2 is used to create a macro file (Figure 3).
Figure 3: Macro file allowing an automated GIS analysis.
2.3 Macro Modeler
Version 2.0 of IDRISI offers the macro modeler, a graphical modeling environment
for building and executing multi-step models. Facilities are included for batch
processing (running many inputs through the same model to produce many outputs)
and for dynamic modeling (using the output of one iteration of a model as an input
into the next iteration). The macro modeler is also a rather sophisticated form of
lineage: it is a graphical representation of the GIS analysis performed.
Example
The example from the previous sections was modeled using the Macro Modeler
(Figure 4). The model shows input files, such as raster images (purple rectangles) and
vector layers (green rectangles). These are linked with IDRISI commands
(parallelograms) that in turn link to output data files. Note that the graphical model is
much like the original cartographic model described in Figure 1.
The model can be run repeatedly (as is the case with the text-based macro file). All
data files can be viewed by a simple mouse-click on the symbol in the model. Even a
summary of the metadata of each data file is available.
Figure 4: The Macro Modeler as a means for both documenting and automating GIS analyses.
Note similarities with the cartographic model in Figure 1.
2.4 Metadata Files
IDRISI has functionality for metadata creation in documentation files. Each file type
– raster images, vectors and attributes – has its own style of documentation file. These
files contain data fields that record details of each data set.
The metadata consists of a series of lines (properties) containing vital information
about the corresponding image file (Figure 5). Some properties are calculated
automatically (e.g. number of columns and rows; minimum and maximum X and Y
coordinates), whereas other properties must be entered manually by the user (e.g.
display minimum and maximum). Two properties are of particular interest for data
quality assessment; these are positional error and value error (i.e., attribute error).
The user must enter both properties manually.
Positional error indicates how close a feature’s actual position is to its mapped
position in the image. Unfortunately, the field is for documentation purposes only and
is not currently used analytically by any of IDRISI’s operations (Eastman, 2001a).
The value error field is very important and should be filled out whenever possible. It
records the error in the data values that appear in image cells. This field is analytical
for some commands and is intended to be incorporated into more commands in the
future (Eastman, 2001a). An example of PCLASS using value error is given in
Section 4.2.
Figure 5: Metadata of the public transport raster file. See text for discussion.
Figure 6: Metadata of the public transport raster file – continued from Figure 5. See text for
discussion. Text in the Lineage field was created automatically.
Some of the information stored (consistency, completeness, lineage, comments)
conforms to the US NCDCDS guidelines. However, in the current version of IDRISI
these fields are only text based and, with the exception of the lineage field, data is
entered manually by the user (Figure 6). The lineage field is automatically filled with
the command line used to create the file, in a format similar to that of the macro file.
2.5 Statistical functions
Statistical functions, in particular HISTO, can be used to show the distribution of data.
Using HISTO, anomalies (e.g. gross errors introduced by typos in data entry) can be
easily spotted.
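The idea of spotting gross errors in a histogram can be sketched in a few lines of Python. This is not the HISTO module itself; the elevation values, the bin width and the simulated typo (1200 instead of 12) are invented for illustration.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each key is the lower bound of a bin."""
    return Counter((v // bin_width) * bin_width for v in values)


def print_histogram(counts):
    """Print one text bar per occupied bin, in ascending order."""
    for lower in sorted(counts):
        print(f"{lower:>6} | {'#' * counts[lower]}")


# Hypothetical elevations in metres; 1200 simulates a typo for 12.
elevations = [10, 12, 11, 14, 13, 12, 15, 11, 1200, 13, 12]

counts = histogram(elevations, bin_width=10)
print_histogram(counts)

# Bins holding a single value are candidates for closer inspection.
isolated = [lower for lower, n in counts.items() if n == 1]
```

A bin that lies far from the rest of the distribution and holds only one value does not prove an error, but it tells the user where to look first.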
2.6 Summary
The data quality metadata and lineage that IDRISI can handle are summarized in the
table below, which identifies the method by which data quality information is
recorded (manually or automatically) and reported (by text, statistics or graphics).
Type                        Recording method                                   Reporting method
Log file                    Automatically                                      Text
Macro file                  Manually (*)                                       Text
Macro modeler               Manually                                           Graphics
Documentation – properties  Part automatically; part manually                  Statistics
Documentation – notes       Lineage in part automatically; otherwise manually  Text
HISTO                       Automatically                                      Statistics and plots
(*) Note that macro files can be run automatically, but have to be created manually.
There is no “recording” functionality as is the case in MS Word or MS Excel macros.
3 Comparison of the facilities of IDRISI and GeoMedia
In this chapter we will compare and contrast IDRISI's facilities for reporting and
communicating data quality with those provided by Intergraph’s GeoMedia
Professional, version 5.0.
3.1 Main differences
IDRISI and GeoMedia take entirely different approaches to desktop GIS. IDRISI is
based on raster images and has a large collection of commands that operate on
these images to create new images. The raster images are stored in separate,
independent files. Lineage, such as provided by the macro modeler, is needed to
understand how the files are connected. GeoMedia, on the other hand, is vector-based
and data is stored in databases. These databases can either be read-write or read-only,
the latter being connections to external databases. Instead of commands producing
new images, GeoMedia has queries. Queries are based on any combination of
databases and other queries. A GIS analysis consists of a set or sequence of queries.
In the example from Chapter 2, we would have two databases for the railway and bus
stations, a query that combines the two datasets, and a query that would calculate a
buffer around each public transport station.
Another important difference is that in GeoMedia, changes in the databases will
propagate directly through all queries that are based on that database. If, for example,
a new bus station opens, we would see the effect of the new bus station directly in the
buffer query. This difference has both advantages and disadvantages. The main
advantage is that we do not need to rerun an entire analysis after data changes (as is
the case in IDRISI). The disadvantage is that analyses performed in the past cannot
be reproduced, because any changes made to the data since the original analysis will
influence the results of the analysis today. In order to guarantee reproducibility, we
would have to import and store all data. Another solution to this problem would be to
time stamp the original data.
3.2 Recording metadata
Since we do not have to rerun analyses, we might conclude that we do not need
lineage documentation like IDRISI’s macro modeler. However, it is unclear how
queries in GeoMedia relate to one another. Each query by itself is documented (Figure
7), but the documentation is restricted to the data sources of the query itself. If these
data sources are themselves queries, we have to inspect the definition of these queries
to find out what data sources they use, and so on. Alternatively, the user must have
the discipline to record the lineage of the query in the rather small Description field
(Figure 7). This Description field is also the only place in GeoMedia where we can
record some metadata.
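The manual inspection described above, following each query back to its own data sources, amounts to a recursive walk. The sketch below shows that walk in plain Python. It does not use GeoMedia's API; the dictionary of query definitions is a hypothetical stand-in for what a user would read from the query documentation, and it assumes that queries do not refer to each other in a cycle.

```python
# Hypothetical query definitions: each query maps to its direct data sources.
# Names without an entry of their own are treated as base tables.
QUERIES = {
    "public_transport": ["railway_station", "bus_station"],
    "accessible_area": ["public_transport"],
}


def trace_lineage(name, queries, depth=0):
    """Return indented lines listing a query and, recursively, its sources."""
    lines = ["  " * depth + name]
    for source in queries.get(name, []):
        lines.extend(trace_lineage(source, queries, depth + 1))
    return lines


def base_sources(name, queries):
    """Return the set of base tables on which a query ultimately depends."""
    if name not in queries:
        return {name}
    result = set()
    for source in queries[name]:
        result |= base_sources(source, queries)
    return result


lineage_report = "\n".join(trace_lineage("accessible_area", QUERIES))
print(lineage_report)
```

A report of this kind could be pasted into the Description field, so that the lineage travels with the query instead of having to be reconstructed each time.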
Figure 7: Documentation of an attribute query in GeoMedia.
3.3 Logging and macros
GeoMedia does not have simple logging or macro language tools. Instead, it offers
full integration with programming languages such as Visual Basic, PowerBuilder and
C++ (comparable with IDRISI’s API-gate). With these programming languages, we
should be able to build additional functionality that can create and manage lineage.
3.4 Logical consistency
GeoMedia offers several tools to check and correct logical consistency. These tools
are important for vector files, where topological errors can arise. GeoMedia
includes a full set of production tools to help the user capture clean, accurate data the
first time with minimal editing.
3.5 Other Products
GeoMedia’s manufacturer, Intergraph, offers other products that handle metadata and
lineage. SMMS (Spatial Meta-Data Management System) allows users to create, edit,
view and publish FGDC-compliant spatial metadata. The system can be integrated
with GeoMedia. GIDM (Geospatial Intelligence Data Management) is a data
management tool that includes temporality (time versioning) and rich metadata
describing currency, accuracy, and lineage of data. However, these products are not
part of the standard desktop GIS and are not considered here.
4 Other options for reporting and communicating data quality
The previous chapters focused on the data quality aspects of lineage, metadata,
reproducibility, and logical consistency. In this chapter, we will discuss some more
advanced options for quantification and visualization of error and uncertainty in
spatial data.
4.1 Uncertainty in class identification
Chrisman (1987) and Aspinall and Pearson (1994) describe how the accuracy of class
identification in thematic maps can be quantified using a confusion matrix. The
confusion matrix compares the classes identified on a map with classes identified in
the field. IDRISI offers functionality to select sample points in a random or systematic
fashion. These sample points can then be visited in the field, and the observed classes
can be tabulated against the classes in the map (IDRISI modules SAMPLE and
ERRMAT; Eastman, 2001a).
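A confusion matrix of this kind is straightforward to compute. The Python sketch below is not the ERRMAT module; the class names and the eight sample points are invented, and overall accuracy is only one of several summary measures that can be derived from the matrix.

```python
from collections import Counter


def confusion_matrix(map_classes, field_classes):
    """Count (mapped class, field class) pairs over all sample points."""
    return Counter(zip(map_classes, field_classes))


def overall_accuracy(matrix):
    """Proportion of sample points where map and field agree."""
    total = sum(matrix.values())
    correct = sum(n for (mapped, observed), n in matrix.items()
                  if mapped == observed)
    return correct / total


# Hypothetical sample: class on the map and class observed in the field.
mapped = ["forest", "forest", "urban", "water",
          "urban", "forest", "water", "urban"]
observed = ["forest", "urban", "urban", "water",
            "urban", "forest", "forest", "urban"]

matrix = confusion_matrix(mapped, observed)
accuracy = overall_accuracy(matrix)
```

The off-diagonal entries, such as points mapped as forest but found to be urban, show which classes are confused with each other, which a single accuracy figure cannot.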
4.2 Error propagation
Uncertainty in any one data layer will propagate through an analysis and combine
with other sources of error. Uncertainty in a data layer and its propagation are
therefore important data quality characteristics.
We have already seen that IDRISI records measures of both positional error and value
error (attribute error) for each data set. The value error is analytical for some (but not
all!) commands. An example of the PCLASS command using value error is given in
Exercise 3-2 of the IDRISI Tutorial (Eastman, 2001b). In this Exercise, Eastman
(2001b) simulates flooding of a coastal area due to a sea level rise of 0.48 m with a
standard deviation (uncertainty) of 0.08 m. Uncertainty in the elevation data is
estimated at 0.30 m. The combined (propagated) error is calculated from these two
uncertainties and results in a value error of 0.31 m. A traditional GIS analysis would
neglect the uncertainties and simply subtract 0.48 m from all heights in the dataset.
PCLASS on the other hand uses the combined value error to calculate the probability
that land will be below sea level (Figure 8).
The same procedure can be used to visualize uncertainties in the location of
boundaries (positional error) on a thematic map as described by Hunter and
Goodchild (1996).
PCLASS, however, is one of the few operations that use the value error.
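The arithmetic of the flooding example can be reproduced with a short Python sketch. It assumes that the two uncertainties are independent and are combined as the root of the sum of squares, and that the combined error is normally distributed. The source does not describe how PCLASS works internally, so the probability function below is an illustration of the principle and not a description of that module.

```python
import math


def combined_error(*sigmas):
    """Root-sum-of-squares combination of independent standard errors."""
    return math.sqrt(sum(s * s for s in sigmas))


def probability_below(elevation, threshold, sigma):
    """P(true elevation < threshold), assuming a normally distributed
    error with standard deviation sigma around the recorded elevation."""
    z = (threshold - elevation) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Values from the exercise: 0.08 m (sea level rise) and 0.30 m (elevation).
sigma = combined_error(0.08, 0.30)   # about 0.31 m

# Probability of inundation for a few recorded elevations (in metres),
# given a sea level rise of 0.48 m.
for height in (0.0, 0.48, 1.0, 2.0):
    p = probability_below(height, 0.48, sigma)
    print(f"elevation {height:4.2f} m: probability {p:.3f}")
```

A cell recorded at exactly 0.48 m receives a probability of 0.5, which is the zone where a traditional analysis would draw a sharp coastline.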
[Figure 8 labels: Sea; Land; Probability that area is below sea level]
Figure 8: New coastline calculated after a sea level rise. Left: traditional analysis, showing the
new coastline as a sharp boundary. Right: analysis using uncertainties in both sea level rise and
elevation measures. Areas appearing black have an estimated probability of being inundated of 0,
while those that are green approach a probability of 1. There is a range of colors in between
where the probability values are less certain. Modified after Eastman (2001b).
4.3 Monte Carlo modeling
Monte Carlo simulation is used to visualize the propagation of error throughout the
analysis by introducing random error in the original data sets. The analysis is run
twice, first without introducing error, and then a second time using data with
simulated error. An overlay operation can then be used to calculate the difference
between the two. IDRISI offers tooling for Monte Carlo simulations (Eastman,
2001a).
An application of Monte Carlo modeling would be to simulate the propagation of
error through the buffer operation, as described by Veregin (1994, 1996).
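The two-run procedure described above can be sketched in Python as follows. The threshold analysis is a simple stand-in for a real GIS operation such as BUFFER, the elevation grid is invented, and the error is assumed to be normally distributed with a standard deviation of 0.30 m. This is not IDRISI's Monte Carlo tooling.

```python
import random


def analysis(grid, threshold=0.48):
    """Stand-in for a GIS analysis: 1 where a cell lies below the threshold."""
    return [[1 if v < threshold else 0 for v in row] for row in grid]


def add_random_error(grid, sigma, rng):
    """Add normally distributed error (mean 0, std dev sigma) to each cell."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in grid]


def difference(a, b):
    """Overlay-style comparison: 1 where the two results disagree."""
    return [[int(va != vb) for va, vb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


rng = random.Random(42)  # fixed seed so the run can be repeated

# Hypothetical elevation grid in metres.
elevation = [[0.1, 0.40, 0.5, 0.9],
             [0.2, 0.45, 0.6, 1.5],
             [0.3, 0.50, 0.8, 2.0]]

baseline = analysis(elevation)                       # run without error
simulated = analysis(add_random_error(elevation, sigma=0.30, rng=rng))
changed = difference(baseline, simulated)            # cells that flipped
```

Repeating the simulated run many times and counting how often each cell changes would give a more stable picture than a single pair of runs, at the cost of more computation.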
4.4 Raster overlay
Hunter and Goodchild (1996) propose to use raster images, such as aerial
photographs, as a background to vector polygons to visualize data quality. This is an
easy task in desktop GIS.
4.5 Animation and sound
The methods mentioned above are all static displays of error. Fisher (1992) and
Hunter and Goodchild (1996) propose sophisticated dynamic displays, where
animations and even sound help the user to assess the effect of error. For example, if
areas of a certain soil type are known to be heterogeneous, we can model this by
showing inclusions of other soil types within the otherwise homogeneous area. Monte
Carlo modeling can be used to create a random distribution of the inclusions.
However, this static display would give the false impression that we also know the
location of the inclusions, which is not the case. In an animated display, the inclusions
would constantly change location, thus avoiding misinterpretation. Unfortunately,
animations and sounds are currently not available in desktop GIS.
5 References
Aspinall, R.J. and Pearson, D.M., 1994. Describing and managing data quality for
categorical maps in GIS. In: Proceedings of the Conference on GIS Research UK
1994, Leicester, 11-13 April, p.161-168.
Chrisman, N.R., 1987. The accuracy of map overlays: a reassessment. Landscape and
Urban Planning, 14, p.427-439.
Eastman, J.R., 2001a. IDRISI Guide to GIS and Image Processing (2 volumes). Clark
Labs, Worcester, USA.
Eastman, J.R., 2001b. IDRISI Tutorial. Clark Labs, Worcester, USA.
Fisher, P.F., 1992. Animation and sound for the visualization of uncertain spatial
information. Presented at the AGI Workshop on Visualisation, p.181-185.
Hunter, G.J. and Goodchild, M.F., 1996. Communicating uncertainty in spatial
databases. Transactions in GIS, 1, p.13-24.
Veregin, H., 1994. Integration of simulation modeling and error propagation for the
buffer operation in GIS. Photogrammetric Engineering and Remote Sensing, 60,
p.427-435.
Veregin, H., 1996. Error propagation through the buffer operation for probability
surfaces. Photogrammetric Engineering and Remote Sensing, 62, p.419-428.