GII-07
Spatial Analysis and Modeling
Authors
Module 1 - Ann Blyth, Dave Cake
Module 2 - Ann Blyth, Ian Laing, Dave Cake
Module 3 - Martin Andresen
Module 4 - Dr. Gennady Gienko and Michael Govorov
Module 5 - Ian Laing, Dave Cake
Reviewed by
doc. dr. Antanas Dumbrauskas (Lithuanian University of Agriculture)
Study material for the civil servants' distance learning programme “GII Distance Learning”, which was approved by the legal act “Re: Approval of Study Programmes for Civil Servants” (No 27V-72; 17 April 2007) of the State Service Department under the Ministry of the Interior, has been prepared on the basis of the EU co-financed project “Qualification Improvement of the Lithuanian Geographical Information Managers” (No BPD2004-ESF-2.2.0.-02-05/0143).
Beneficiary of the project: National Land Service under the Ministry of Agriculture.
Contractor of the project: HNIT-BALTIC UAB, Vilnius Gediminas Technical University.
© Exclusive author copyright belongs to the National Land Service under the Ministry of Agriculture.
Table of Contents
1 Data Exploration
1.1 Data Attributes
1.2 Querying and Selecting Vector Data
1.3 Querying and Selecting Raster Data
1.4 Summarizing and Interpreting Data
References
2 Basic Spatial Analysis
2.1 Vector Analysis Methods
2.2 Raster Analysis Methods
2.3 Generalizing Data
References
3 Spatial Statistics
3.1 Classical Correlation, Spatial Autocorrelation, and the Location Quotient
3.2 Modifiable Areal Unit Problem and the Ecological Fallacy
3.3 Pattern Analysis
3.4 Edge Effects
3.5 Density Estimation and Hot Spot Mapping
3.6 Local Spatial Statistics
References
4 Geostatistics
4.1 Introduction to Geostatistics
4.2 Classification of Geostatistics Methods
4.3 Interpolation, Estimation and Smoothing Methods
4.4 Kriging Prediction
4.5 Model Validation
4.6 Conclusion and Comparison of Geostatistical Methods
References
5 Spatial Modelling
5.1 Modelling Spatial Problems
5.2 Classification of Spatial Models
5.3 Model Types
5.4 Model Builder
5.5 Model Examples
References
1 Data Exploration
What makes Geographic Information Systems (GIS) unique is the ability to link data to spatial
locations and query and summarize these data based on specific analysis requirements.
Functionally, GIS provides a sophisticated tool for reporting the results of a database. These
reports may be for an entire dataset (or table) or for a portion of the dataset (e.g., based on the
results of a query or data summary). These ‘data reports’ can take the form of tabular summaries,
graphs and maps. This module will teach you about the different types of data and how to select and query the information within a database. In addition, the types of reports that can be generated are detailed. Data exploration is examined under four topics: data attributes; querying and selecting vector data; querying and selecting raster data; and summarizing and interpreting data.
It should be noted that many of the subtopics discussed detail specific technical concepts. To make these concepts easier to understand, examples based on the ArcMap (version 9.2) user interface have been provided. While these examples are based on ArcMap, conceptually they are applicable to other GIS user interfaces.
GIS has been characterized as a set of tools for collecting, storing, analyzing and displaying
geographic data. Much of the effort in GIS focuses on tasks relating to the ability to represent and
describe real world objects in a digital environment. Topics in this field relate to the abstraction of
features as points, lines and polygons in a GIS database, the understanding and use of coordinate
systems and map projections for describing locations on the earth’s surface, and the physical file
formats used to represent features in a spatial database. Much of the capability of GIS software
applications emphasizes these areas of interest. Several courses in this training program focus on
these topics, including:
GII-01 Elements of GIS
GII-05 Geographic DBMS
GII-06 Geodesy and Cartography
While these GIS functions relate primarily to a spatial inventory of features, the analysis functions of a
GIS seek to help in the understanding of the patterns and processes which lie beneath the features
represented in a spatial database. It is these analytical capabilities which separate GIS from
related applications such as computer aided design (CAD) and automated cartography. Spatial
analysis might help researchers understand a process or distribution of features, or it might help an
organization make better decisions based on a more thorough understanding of the data.
GII-01, being the introductory GIS course, covers most GIS areas including spatial analysis in a
generalized manner. As a result, some of its content will overlap with this course (GII-07)
somewhat, particularly in Module 2 of this course. The intention is to have this course cover such
material in more detail than was presented in GII-01. GII-04 will address highly specialized
analysis techniques such as network analysis, terrain modeling and hydrological modeling as part
of its discussion of specific application areas in the field of GIS. This course will concentrate on the
major generalized analysis techniques. As a result, GII-04 and GII-07, while perhaps sharing some
basic concepts, should not have significant overlap.
The goal of this course is to provide an introduction to the techniques used in the analysis of spatial
data. There is a broad range of analytical tools in a typical GIS application, ranging from simple
actions such as a measurement of distance to sophisticated models which seek to predict spatial
outcomes based on existing or predicted conditions. Given this vast range of techniques for spatial
analysis, most can only be covered to a limited depth.
Module 1 looks at the most basic of analytical tools, those to simply visualize or examine existing
spatial data. We will discuss simple query and selection procedures, and look at elementary ways
of summarizing existing data using basic statistics, summaries and thematic maps.
Module 2 examines what many would consider the mainstream analysis tools. These are the
geoprocessing tools and techniques which are used constantly in any GIS workplace, such as
overlay, buffer and clip operations. Tools for both raster and vector analysis will be examined in
this module.
Module 3 will discuss the statistical methods applied to spatial data. These have their basis in classical statistics but accommodate the limitations and characteristics of spatial data. Such
tools are very effective for describing and understanding large volumes of spatial data. We will
look at topics such as spatial autocorrelation, pattern analysis and density estimation methods.
Module 4 introduces geostatistics, covering the classification of geostatistical methods, interpolation and smoothing techniques, kriging prediction and model validation.
Module 5 examines the nature and use of models in spatial analysis. This module will discuss the many different types of models and examine several example models in detail.
Data are stored in four basic formats - three of these involve the storage of spatial information:
vector; raster; and triangular irregular network (TIN). In addition, you can utilise tabular data which
can subsequently be related to spatial datasets.
Vector
Vector data is constructed using points. The location of these points is defined by up to three
coordinates: x, y and z. At minimum, an x and y coordinate pair is required to specify a point’s
location. The z coordinate can represent an additional value related to a point, for example,
elevation. The values of the coordinates themselves are a function of the coordinate system the
data is stored in (e.g., latitude and longitude coordinates, or Universal Transverse Mercator [UTM]).
Elevation values (a common z coordinate) can also be stored in a variety of units (e.g., metres or
feet). A vector data model uses points (with their associated x and y coordinates) to construct
spatial features in the form of points, lines and polygons (areas) (Figure 1). Points have 0
dimension, possessing only the property of location. Lines are one-dimensional, having a length
property. The simplest line is defined by two points: a start point and an end point. The shape of a more complex line is defined by additional points (vertices) between the start and end points – curved lines can be smoothed by increasing the number of vertices. Polygons (or area features) are
closed lines and are two-dimensional. These features have the properties of area and perimeter.
An area may exist alone or be connected to other features through shared boundaries. Topology
defines the spatial relationships between connecting or adjacent vector features. For example, how
polygons share common boundaries and how lines snap to one another (e.g., road intersections).
Figure 1. Point, line and polygon features constructed from (x, y) coordinate pairs
Discrete features (individually discernible features), such as sampling locations (usually stored as points), roads (usually stored as lines) and parcel boundaries (usually stored as polygons), are examples of data typically represented by a vector data model.
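As an illustration of the vector model, the short Python sketch below uses the open-source shapely library (an assumption made purely for illustration; it is not part of ArcMap) to build a point, a line and a polygon from (x, y) coordinate pairs and report their dimension-related properties.

# Illustrative sketch of the vector data model using the open-source
# shapely library (an assumption for this example; ArcMap is not required).
from shapely.geometry import Point, LineString, Polygon

# A point is defined by a single (x, y) coordinate pair.
sample_site = Point(585000.0, 6060000.0)

# A line is defined by an ordered series of points; intermediate
# vertices let it approximate curves.
road = LineString([(585000, 6060000), (585200, 6060150), (585500, 6060150)])

# A polygon is a closed line (the first and last vertices coincide).
parcel = Polygon([(585000, 6060000), (585300, 6060000),
                  (585300, 6060200), (585000, 6060200)])

print(sample_site.x, sample_site.y)  # location only (0-dimensional)
print(road.length)                   # length property (1-dimensional)
print(parcel.area, parcel.length)    # area and perimeter (2-dimensional)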
Raster
In a raster data model, data are represented by a surface divided into a grid of regularly sized cells.
Typically, a raster is a grid that has an origin (usually upper left-hand corner) and the location of
each pixel is defined by this origin and an offset. A data grid is usually a rectangular shape and its
size is defined by a number of rows and columns; the grid extent is calculated by multiplying the
size of the grid (number of columns by number of rows) by the size of a pixel (expressed in a metric
system). Although a wide variety of raster shapes are possible (e.g., triangles, hexagons) generally
a series of rectangles, or more often, squares, called grid cells, are used. Each cell stores an
attribute or value related to a portion of the earth’s surface. A raster data model is ideally suited for
storing information related to continuous features (features that are not spatially discrete), for example temperature data or elevation. Images such as air photos and satellite images are also
stored in raster format (Figure 2). Data are stored in a simple data structure based on rows and
columns linked to fixed cell locations.
Linked to each cell is a value corresponding to the attribute being displayed (e.g., a precipitation
dataset would have a rainfall amount linked to each cell or a land use coverage would have
attributes such as urban, agriculture and forest linked to the cell). The values associated with each
cell can be positive or negative, integer or floating-point. In some cases values of NODATA can be
used to represent the absence of data (Figure 3).
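The following minimal NumPy sketch (an illustrative assumption, not ArcMap syntax) shows how a raster can be held as a grid of rows and columns, with a NODATA value marking cells where no measurement exists.

import numpy as np

NODATA = -9999  # hypothetical NODATA value for this example

# A 4 x 5 precipitation raster: rows and columns of cell values,
# with the origin conventionally at the upper-left corner.
precip = np.array([
    [12.0, 14.5, 15.0, NODATA, 16.2],
    [11.8, 13.9, 14.7, 15.1, 15.9],
    [NODATA, 13.0, 14.1, 14.8, 15.2],
    [10.9, 12.4, 13.5, 14.0, 14.6],
])

valid = precip != NODATA        # mask out cells with no data
print(precip.shape)             # (rows, columns)
print(precip[valid].mean())     # mean rainfall over valid cells only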
Table 1. Field data types*

Field Name | Range/Length | Application
Text | Up to 64,000 characters | Suitable for the storage of names, classes or other text-based attributes
Date | mm/dd/yyyy hh:mm:ss AM/PM | Dates and times
Short integer | -32,768 to 32,767 | Numeric attributes without fractional values (i.e., whole numbers) within the specified range; useful for the storage of coded values
Long integer | -2,147,483,648 to 2,147,483,647 | Numeric attributes without fractional values within the allowed range
Single precision floating point number (Float) | Approximately -3.4E38 to 1.2E38 | Numeric attributes with fractional values within the allowed range
Double precision floating point number (Double) | Approximately -2.2E308 to 1.8E308 | Numeric attributes with fractional values within the allowed range
BLOB | Varies | Images or other multi-media files
GUID | 36 characters | Customized applications requiring global identifiers

* The field names and ranges used are derived from the ArcMap field types but the examples would be applicable to other GIS software packages.
Numeric fields may be stored in a variety of numeric data types, for example: short integer; long
integer; single precision floating point number, double precision floating point number. Each type
varies in the size and method used to store numeric data values as indicated in Table 1. Text fields
are used to store alphanumeric symbols (e.g., vegetation names). The date data type can store
dates, times or dates and times. Images, programming code and assorted multi-media files are
stored as binary large object (BLOB) data types. GlobalID and GUID data types store registry style
strings that uniquely identify a feature or table row within a geodatabase and across geodatabases.
Attribute data can also be defined by measurement scale. These attribute types include categorical
data (e.g., nominal, ordinal data types) and numeric data (e.g., interval, ratio and cyclic data). Each
of the different data types are defined below:
Nominal data describes different categories of data (e.g., land use types or vegetation).
Nominal data can be numerical (e.g., phone numbers or numerical values assigned to
classes of land use) but there is no implied ranking between the classes.
Ordinal data implies a ranking between classes, for example, traffic volumes may be
assigned the classes High, Moderate or Low. However, while ranked, ordinal data are
qualitative in nature (i.e., they do not have an associated numerical value).
Interval data are quantitative, having known measurements or intervals between the
values. Examples of interval data include: elevation data, precipitation levels, or
temperatures.
Ratio data are also quantitative in nature. They are similar to interval data but they are
based on a meaningful, or absolute, zero value. Density values would be an example of
ratio data as densities can range from 0 to infinity.
The data type used to store the information associated with a given attribute is a function of the
measurement scale required to represent the data. For example, nominal or ordinal data would be
stored using a character data type, whereas interval, ratio or cyclic data would be stored as an
integer or decimal (float). As mentioned above, nominal data can be represented by numerical
values which would typically be assigned an integer data type. However, in this instance the
numbers are being used as codes that would reference a look-up table for the full values.
There are four types of relationships possible between records within tables in a relational
database:
one-to-one – One record in the first table is related to one (and only one) in the second
table. For example, the attribute data for a series of water quality sampling stations could be
stored in a related table. There would be the same number of stations as records in the
attribute table with a relationship established through a station identifier attribute common to
both tables.
one-to-many – One record in a table is related to many records in another table. For
example, several houses may be located on the same street. The spatial feature
representing the street would have a unique identifier (e.g., STREET_ID) and the house
records would be related to the street via this identifier.
many-to-one – Many records in a table may be related to one record in another table. For
example, a land use dataset might have hundreds of land use polygons with twenty possible
land use classes stored in a separate look-up table. In this example the spatial data would
store a numeric code ranging from one to twenty to represent the polygon’s land use. A
relation between the two tables can be established based on this code. Using a many-to-one relationship in this instance is beneficial because it reduces the time required to enter
the land use attributes – rather than having to type out the full land use class name each
time the operator only has to enter the numeric code value. This both speeds up data entry
and reduces the potential for errors (e.g., spelling mistakes) in the data.
many-to-many – Many records in a table may be related to many records in another table.
For example, many vegetable types might be grown on a single farm and these types of
vegetables may be grown on more than one farm.
Attribute tables can be linked to the spatial layers through the join and relate functions. These are
described below:
Join
A join involves appending the fields from one table to another through a relationship based on an
attribute (e.g., an identifier) common to both tables. A join is usually used to attach additional
attributes to a spatial data layer. Data can be joined in either a layer or a table view. While the
names of the fields used to establish the join do not have to be identical, the data type does have
to be the same. For example, strings are joined to strings and numeric fields to other numeric
values. The join function is suitable when you are linking tables with one-to-one and many-to-one
relationships. Joins are inappropriate with one-to-many relationships because only the first
occurrence of a matching value is assigned to a record in the target table.
If you are in an edit session, be aware that only the columns from the source table can be changed; the data in the appended columns cannot be edited. If new fields are added, they are added to the target table with no effect on the joined tables. The appended columns can be referenced when calculating values in the columns of the target table.
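Outside the ArcMap interface, the same idea can be sketched with the pandas library: a many-to-one join attaches full land use class names from a look-up table to a polygon attribute table that stores only a numeric code (the table and field names here are hypothetical).

import pandas as pd

# Attribute table of land use polygons storing only a numeric code.
polygons = pd.DataFrame({
    "POLY_ID": [1, 2, 3, 4],
    "LU_CODE": [3, 1, 3, 2],
})

# Look-up table holding one record per land use class.
lookup = pd.DataFrame({
    "LU_CODE": [1, 2, 3],
    "LU_NAME": ["Agriculture", "Forest", "Urban"],
})

# Many-to-one join: every polygon record picks up the matching class name.
joined = polygons.merge(lookup, on="LU_CODE", how="left")
print(joined)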
Relate
As mentioned above, joins are used to establish one-to-one or many-to-one relationships between layers and tables. The relate function can be used to establish one-to-many or many-to-many relationships because relates are bidirectional. Related tables are connected but remain physically separate – data is accessed when you work with the layer’s attributes. The properties of a relate (e.g., the fields and tables involved) are stored separately, and when creating a relate you will be prompted to enter a relate name. The advantage of relates is that multiple files can be related; however, they can increase the time required to access and process the data.
The number of features selected will be displayed in the lower left-hand corner of
ArcMap. The features associated with the selected attributes will also be
simultaneously highlighted in the map window.
Boolean Operators
Boolean (or logical) operators are used to set the conditions of the criteria, from which an
evaluation of True or False is derived. Boolean operators include:
o (=) equal to
o (>) greater than
o (<) less than
o (>=) greater than or equal to
o (<=) less than or equal to
o (<>) not equal to
Boolean Connectors
Suppose you wanted to select features that satisfy two or more criteria, for example, how many 2-lane gravel roads were built in the year 2002? In this case, individual queries can be combined to answer a more complex question. Boolean connectors (AND, OR, NOT, XOR) are used to combine these multi-part questions (Figure 8).
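As a sketch of how such a multi-part query can be evaluated outside the ArcMap dialog, the pandas example below (with hypothetical field names) combines three criteria with the AND connector to find 2-lane gravel roads built in 2002.

import pandas as pd

roads = pd.DataFrame({
    "ROAD_ID":  [101, 102, 103, 104],
    "SURFACE":  ["Gravel", "Paved", "Gravel", "Gravel"],
    "LANES":    [2, 4, 2, 1],
    "YR_BUILT": [2002, 2002, 1998, 2002],
})

# AND connector: every criterion must evaluate to True for a record
# to be selected.
selected = roads[(roads["SURFACE"] == "Gravel") &
                 (roads["LANES"] == 2) &
                 (roads["YR_BUILT"] == 2002)]
print(len(selected), "feature(s) selected")
print(selected)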
Steps for selecting features by cursor: The simplest way to query features is to click on one
or, in order to capture several at a time, drag a box around an area of interest using a ‘select
features’ pointer tool in ArcMap. To do this:
Ensure the target layer is checked in the Selection Tab at the bottom of the table of
contents in ArcMap
From the Tools toolbar, click on the Select Features button
In the map window, click on an individual feature or drag a box around a group of
features you wish to select
The features will simultaneously be selected and highlighted in the map window and
attribute table
Steps for selecting by location: More powerful than simply selecting features is a query that
examines the spatial relationships between features in different layers (e.g., where are the roads
that are crossed by streams?)
From the Menu Bar in ArcMap, click Selection and choose Select By Location
In the Select By Location window, click the drop-down arrow and choose a selection
method (see Section 5.2.3 for more information on selection methods) (Figure 10).
Check on the layer(s) you wish to select features from (the target layer).
From the drop-down menu, choose the type of spatial relationship you are looking for:
o Intersect – returns any feature that geometrically shares a common part with
the source feature
o Are within a distance of – returns any feature within a specified buffer distance
of the source feature
o Completely contain – returns any feature that wholly contains a feature in the
source layer
o Are completely within – returns any feature wholly contained by the feature(s)
in the source layer
o Have their centre in – returns any feature with its centroid falling within the
geometry of the source feature
o Share a line segment with – returns any feature that has at least two adjacent
vertices in common with the source feature
o Touch the boundary of – returns any feature that touches the boundary of
features in the source layer
o Are identical to – returns any feature that is exactly equal (has identical
vertices) to a feature in the source layer
o Are crossed by the outline of – returns any feature that has at least one edge,
vertex, or endpoint in common with a feature in the source layer
o Contain - returns any feature that contains a feature in the source layer
o Are contained by - returns any feature contained by the feature(s) in the
source layer
From the bottom drop-down menu, choose the layer you want to relate to (the source layer)
Click Apply to run the query
As with the Select by Attributes, the selected features will be highlighted in the map
window. Additionally, attributes associated with the selected features will be
simultaneously selected and highlighted in the attribute table.
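A select-by-location query can also be expressed in code. The GeoPandas sketch below (an illustrative assumption using hypothetical in-memory layers) selects hotel points that fall within a buffered distance of a single source point, roughly equivalent to the ‘Are within a distance of’ relationship.

import geopandas as gpd
from shapely.geometry import Point

# Hypothetical in-memory layers; in practice these would be read from
# feature classes or shapefiles with gpd.read_file().
hotels = gpd.GeoDataFrame(
    {"NAME": ["Hotel A", "Hotel B", "Hotel C"]},
    geometry=[Point(0, 0), Point(5, 5), Point(40, 40)],
)
cathedral = Point(1, 1)

# "Are within a distance of": keep hotels whose geometry lies inside a
# 10-unit buffer around the source feature.
selected = hotels[hotels.geometry.within(cathedral.buffer(10))]
print(selected["NAME"].tolist())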
Figure: selection methods such as ‘Add to Current Selection’ and ‘Select from Current Selection’ apply a new query to add to or subtract from an existing selection set
Note that field names are delimited differently depending upon the
source format of the feature class.
When querying a feature class from a file geodatabase, shapefile or
coverage, enclose field names in double quotes (e.g., “AREA”)
When querying a personal geodatabase feature class, enclose field
names in square brackets (e.g., [AREA])
When querying an SDE feature class, fields are not enclosed at all (e.g.,
AREA)
o Click an operator button to add it to the SQL expression (e.g. =)
o Click Get Unique Values to view the values for the selected field
o Double-click on a value to add it to the SQL expression (or manually type the
specific value you are looking for)
Note that depending upon the data type of the field you are querying,
the syntax is slightly different.
When querying a text field, values to find are enclosed in single quotes
(e.g., [FEATURE] = ‘Bridge’)
When querying a numeric field, values are not enclosed in any
characters (e.g., [AREA] > 12.0)
o Click OK
Steps for exporting selected features to a new layer:
Click the Browse button and navigate to the desired geodatabase or folder
Choose the type of featureclass you wish to export to from the Save as type drop-
down menu (e.g. geodatabase featureclass or shapefile)
Type a name for the new layer to be created and click Save and then OK
The newly created featureclass can then be added to ArcMap for further analysis
Steps for copying features from one layer to another existing layer:
Start Editing in ArcMap by clicking Editor and choosing Start Editing – ensure the
correct layer is set as the target layer
Create a selection of features from the source layer
Click the Copy button on the toolbar
Click the Paste button on the toolbar
The selected feature from the source layer will be copied to the target layer
Click Editor and choose Save Edits and then Stop Editing
Double-click on a value to add it to the SQL expression (or manually type the specific
value you are looking for - in single quotations)
Click Apply to run the query
Extract by Attribute
Extracting raster cells by an attribute is accomplished by executing a ‘where’ clause query which
results in a new raster. A ‘where’ clause is a conditional statement that establishes the desired
criteria of the query. Consider a wildlife biologist modelling goat habitat, where the animals are
only found on steep slopes. We can extract all cells from a slope raster with a value greater than
or equal to 20% to easily identify potential goat habitat. In this case the where clause would be
‘[SLOPE] >= 20’ (the square brackets [ ] surrounding the slope raster name are the normal syntax for rasters in ArcMap).
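The NumPy sketch below (illustrative only, not the ArcMap raster calculator syntax) performs the equivalent of the ‘[SLOPE] >= 20’ where clause, writing NODATA to every cell that fails the test.

import numpy as np

NODATA = -9999.0

slope = np.array([[ 5.0, 12.0, 22.0],
                  [18.0, 25.0, 31.0],
                  [ 9.0, 21.0, 40.0]])

# Extract by attribute: cells meeting the condition keep their value,
# all other cells are set to NODATA.
goat_habitat = np.where(slope >= 20.0, slope, NODATA)
print(goat_habitat)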
Select by Location
Selecting by location in the raster world allows the user to extract cells based on location and
investigate spatial relationships between disparate layers. Suppose a city planner wanted to
summarise the area of forested land cover in an urban area, but only within certain municipal
parks. GIS can aid in this investigation by using a park polygon layer to extract forested areas from
a city-wide land use raster. Restricting the data to the parkland would facilitate summarising the
data according to the requirements of the city planner.
There are a number of ways to extract raster cells by their location, the majority of which use a
specified geometric shape to exclude or include individual or groups of cells in a raster dataset.
Extract by Mask – allows the user to extract raster data with a polygon feature class. The polygon
is applied as a mask and only the raster cells falling inside the mask are processed.
Extract by Shapes
Extract by Points – uses a list of coordinate values (x, y values representing points) to output
only the cells of a raster situated at these point locations.
Extract by Circle – uses the centre coordinate and radius of a circle to output the cells of a
raster situated inside (or outside) of the circle.
Extract by Polygon/Rectangle – uses a list of coordinate values defining an area to output
the cells of a raster situated either inside or outside of the area.
1.3.2 Reclassification
Raster reclassification (also known as recoding or transforming) allows you to simplify or aggregate
data within a raster dataset. For example, if you have a dataset with 10 tree species values and
you want to group all tree species into a single class, the reclassification function will allow you to
do this.
An example of grouping cells with common characteristics to create a simplified raster (e.g.,
combining continuous slope values into fewer slope classes [0-10% = 1, 11-20% = 2, etc.]) is
provided in Figure 17.
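A minimal sketch of such a reclassification using NumPy (illustrative values): continuous slope percentages are grouped into coded classes, with 0-10% = 1, 11-20% = 2 and greater than 20% = 3.

import numpy as np

slope_pct = np.array([[ 3.2,  8.7, 14.1],
                      [11.5, 19.9, 25.0],
                      [ 0.0, 33.3, 10.0]])

# Reclassification with class breaks at 10% and 20%: np.digitize with
# right=True maps values <= 10 to 0, 10 < v <= 20 to 1 and > 20 to 2;
# adding 1 gives class codes 1 to 3.
classes = np.digitize(slope_pct, bins=[10.0, 20.0], right=True) + 1
print(classes)   # 1 = 0-10%, 2 = 11-20%, 3 = >20%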
To summarize data in ArcMap open an attribute table view and then right-click on the name of the
field you want to summarize. Select (by left-clicking) on the summarize function. This will present
the summarize dialog box (Figure 18). The name of the field you selected to summarize will appear
in the Select field to summarize step. If you want to select a different field just pull down the list and
select one of the field names. The second step in the dialog box prompts you to select the
summary statistics to be included in the output table. In the land use example we want to sum the
total area of each land use class so you would tick (by left-clicking) on the Sum check box. The
final step is to specify the name of the output table. If you have any records selected (e.g., the
results of a query) the summarize function gives you the option of summarizing all the records or
only the selected records. When you have finished making the selection, click on the OK button
and then click Yes when prompted to add the new table.
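The same kind of summary can be sketched with pandas (hypothetical field names): the land use attribute table is grouped by class and the area of each class is summed.

import pandas as pd

landuse = pd.DataFrame({
    "LU_NAME": ["Urban", "Forest", "Urban", "Agriculture", "Forest"],
    "AREA_HA": [12.5, 40.0, 7.2, 55.3, 18.9],
})

# Summarize: one output record per unique land use class, with the
# total area (Sum) of that class.
summary = landuse.groupby("LU_NAME", as_index=False)["AREA_HA"].sum()
print(summary)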
1.4.2 Statistics
A GIS will allow you to calculate statistics describing the contents of numeric fields. You can view a
count of the number of records, the minimum and maximum values for the field, the sum of the
values, a mean and the standard deviation. In addition, many interfaces will display a histogram
providing a frequency distribution for the values in the field. The dialog box will allow you to generate statistics for other fields in the table by pulling down the field drop-down list and selecting another field name.
To calculate statistics for a field in ArcMap open an attribute table view and then right-click on the
name of the field you want to describe. Selecting Statistics from the list will display a dialog box
(Figure 19). Note that statistics can only be generated for numeric type fields.
Figure 19. Statistics Results Example
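As a sketch of the same statistics outside ArcMap, pandas can report the count, minimum, maximum, sum, mean and standard deviation of a numeric field (the field name is hypothetical).

import pandas as pd

area = pd.Series([12.5, 40.0, 7.2, 55.3, 18.9], name="AREA_HA")

print("Count:", area.count())
print("Minimum:", area.min())
print("Maximum:", area.max())
print("Sum:", area.sum())
print("Mean:", area.mean())
print("Standard deviation:", area.std())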
1.4.3 Graphs
Graphs allow you to examine and summarize the data in a format that is often easier to understand
than tabular data because they allow you to visualize the data in different ways. In map layouts
they can be used to show additional information related to information on the map or present the
same information in a different way. Using our land use example, we could generate a map where
each land use class is displayed in a unique colour and then include a graph that provides a
summary of the total area of each land use class. The map window and the graph would complement one another, giving the reader more details concerning the information being
displayed. There are a number of different types of graphs (both two- and three-dimensional) to
choose from. Each graph type has display properties that can be adjusted based on the type of
data you are displaying and the way you want to present the data. Various types of graphs
available are described below, with some simple examples of common graph types:
Polar – A polar graph is similar to a line graph but it displays angular data (in degrees or radians) on a circular grid. Polar graphs are useful for displaying the results of mathematical formulas.
Area – An area graph is similar to a line graph in that one or more lines are displayed on an x, y grid, and it is also useful for showing trends in values. The shading in area graphs can more effectively emphasize differences in quantities.
Scatter – A scatter plot uses symbols (e.g., crosses) to plot x, y (and potentially z) values
based on the attributes in the dataset. Scatter plots allow multiple variables to be displayed
effectively and may allow associations between variables to be examined more effectively.
For example, if we wanted to examine the potential association between annual income
rates and life expectancy these two variables could be represented using a scatter plot.
Bubble – Bubble graphs are similar to scatter plots but they allow you to plot three variables
in two dimensions. Rather than using uniformly sized symbols a bubble plot uses symbols
(or bubbles) that are proportionally sized to portray the values associated with the third
variable. Using the scatter graph example of comparing income to life expectancy we could
compare the third value of weight to examine potential correlations between the variables.
Bar and Column – Bar or column charts are also referred to as
histograms. They group data into equal intervals (represented
as classes) and use either bars or columns to depict the number
or frequency of values in each class. These types of graphs are
useful for showing trends in values (e.g., monthly temperature
or precipitation values).
When creating a graph you should determine the variable(s) you want to graph and then select a
graph type that will effectively display the data. To create a graph in ArcMap and add it to a layout
select the Graph option from the Tools menu and click on Create. This will display the first page of
the Graph Wizard where you will be prompted to select a graph type (Figure 20). The example that follows shows how to generate a column graph depicting land cover classes:
1. Under the Graph type drop-down menu, choose the type of graph you want to create
2. Under the Layer/Table drop-down menu, choose the layer or table you want to graph
3. Choose the data field you wish to graph from the Value field drop-down menu and specify
an x field if desired
4. Click the check box if you wish to create a legend to accompany the graph
5. Change the bar style and colour if desired
6. Click Next
7. Select the radio button next to all features or selected features
8. Enter a title for the graph
9. Click the Tabs under Axis properties to adjust the visibility and title names for each axis
10. Click Finish
Self-Study Questions
1. What are the advantages of the raster data model over the vector data model?
2. What data type (e.g., text, long integer, etc.) would be most appropriate to store each of the
following attributes of a land parcel feature class?
3. A land use attribute (e.g., agriculture, forest, urban, etc.) is an example of which level of
measurement (nominal, ordinal, interval, ratio or cyclic)?
5. Describe a query which would select all LandParcels with a LandUse classification of
Agricultural and an AreaHa (area measured in hectares) larger than 10 hectares.
2 Basic Spatial Analysis
Tasks performed by a GIS analyst will typically involve making use of several of these analytical
tools. For example, a simple analytical problem might be to determine the amount of agricultural
activity within 500m of streams, perhaps as a means of quantifying riparian disturbance, or water
quality degradation. To answer such a question, an analyst must buffer the streams by 500m,
overlay the buffers with agriculture land use polygons, and then quantify the resulting intersection.
This module seeks to enumerate and define many of the common analysis tools which are
available in most commercial GIS applications, which can then be combined to resolve specific
analytical problems.
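The riparian example above can be sketched in a few lines of GeoPandas (an illustrative assumption, with hypothetical in-memory layers standing in for real stream and land use data): buffer the streams by 500 m, intersect the buffers with the agricultural polygons, and sum the resulting area.

import geopandas as gpd
from shapely.geometry import LineString, Polygon

# Hypothetical layers; real data would come from gpd.read_file().
streams = gpd.GeoDataFrame(geometry=[LineString([(0, 0), (0, 5000)])])
agriculture = gpd.GeoDataFrame(
    {"LU_NAME": ["Agriculture"]},
    geometry=[Polygon([(200, 0), (1200, 0), (1200, 5000), (200, 5000)])],
)

# 1. Buffer the streams by 500 m.
buffers = gpd.GeoDataFrame(geometry=streams.buffer(500))

# 2. Overlay (intersect) the buffers with the agricultural polygons.
riparian_ag = gpd.overlay(agriculture, buffers, how="intersection")

# 3. Quantify the resulting intersection.
print("Agricultural area within 500 m of streams:", riparian_ag.area.sum())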
For organisational reasons, these functions are divided into the following topics: vector analysis methods; raster analysis methods; and generalizing data.
Clip
Working much like a cookie-cutter, clip allows you to intersect two feature layers to extract a portion
of a dataset (the input layer) based on the spatial extent of another dataset (the clip layer). The clip
function creates a new data layer (output) consisting of the features of the input layer that fall within
the extent of the clip layer (Figure 1).
The input layer to be clipped may contain points, lines or polygons; however, because the element
of area is required, the clip layer must be a polygon. The field names and attributes of the features
in the output layer’s table are maintained (i.e., they are identical to those of the input table). One potential exception to this rule is the area, length and perimeter fields, which, depending on the software and/or data format being used, may or may not be automatically recalculated. The values of any features intersected by the clip boundary may require updating to reflect the change in area.
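A clip can be sketched with GeoPandas as follows (illustrative only; the layers are hypothetical in-memory geometries and gpd.clip assumes a reasonably recent GeoPandas version). The input roads are cut to the extent of the clip polygon and only the portions falling inside it are kept.

import geopandas as gpd
from shapely.geometry import LineString, Polygon

roads = gpd.GeoDataFrame(
    {"ROAD_ID": [1, 2]},
    geometry=[LineString([(0, 5), (20, 5)]), LineString([(0, 30), (20, 30)])],
)
study_area = gpd.GeoDataFrame(
    geometry=[Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])]
)

# Clip: the output keeps the input attributes, but geometry is limited
# to the extent of the clip polygon (road 2 falls entirely outside).
clipped = gpd.clip(roads, study_area)
print(clipped)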
Split
Split is used to divide an input layer into two or more independent layers, based on geographically
corresponding features in a split layer. Similar to the clip function, the input layer may consist of
point, line or polygon features, however, the split layer must be a polygon to define the areal extent
of the analysis. The features in the input layer are broken up along the boundaries of the split layer
features as illustrated in Figure 2.
Select
The Select tool may be used to create a new layer containing features extracted from an input
layer. This is achieved through the execution of a user-defined query expression to select a subset
of the data; these selected features are permanently extracted to a new output layer (Figure 3).
Erase
Erase creates a new output layer by discarding features from the input layer that fall within the area
extent of the erase (overlay) layer (Figure 4). The input layer can be points, lines, or polygons;
however, because the element of area is required, the erase layer must be a polygon. Features in
the output layer will be of the same geometry type as the features in the input layer. Examples of
how the erase function could be used include:
In a map layout, erase can be used to develop a mask to allow only those features falling
within a given area (e.g., a study area boundary) to be displayed.
In a suitability analysis, erase could be used to apply suitability rules. For example, if
potential sites have to have a 200 metre setback from wetlands then wetland features can
be buffered by 200 metres and the buffer polygon used as the erase layer to remove
potential sites falling within this zone from consideration.
2.1.2 Overlay
Central to GIS analysis is the integration of data to reveal the relationship(s) between two or more
data sources. Overlay is one method of integrating information as it combines the spatial and
attribute data from two or more input layers to create a new output layer. The spatial form of the
new layer is shaped by the geometric intersection of the input and overlay features. Generally, the
overlay of two or more layers results in a more complex output layer, with more polygons,
intersections and/or line segments than what is present in the input layers.
Each feature in the newly created output layer contains a combination of attributes from the input
layers. Overlay functions, when associated with geometrical (or ‘physical’) overlays of data layers,
are implemented by certain mathematical operations – both arithmetic and logical. Arithmetical
operations commonly used, but not limited to, include addition, subtraction, division and
multiplication. Logical operations are aimed at finding where specified conditions occur and use
logical operands such as: AND; OR; >; and <.
As discussed later in this module, methods for overlaying vector data differ from raster-based methods. Vector-based methods do not generalize the data but, because of their relatively intensive processing requirements, may be more appropriate for smaller or sparser datasets. Raster analysis methods generalize the data based on the largest cell size found among the input layers. Raster-based grid calculations are, however, often faster and easier.
Four basic rules for combining the attributes of several layers can be applied to overlay analyses, and these are presented in Figure 5 below. While these rules are presented here with the vector analysis methods, they are equally relevant to a discussion of raster overlay methods.
1. Enumeration Rule: Each attribute is preserved in the output layer and all unique combinations are recognized.
For example, a soils layer, vegetation layer and precipitation layer are overlaid yielding a derivative coverage
with a unique polygon for each possible combination.
2. Dominance Rule: One value wins – the dominant (e.g., highest) value is the only one value assigned. For
example, an overlay based on a series of sensitivity layers would assign the highest sensitivity value to each
derivative polygon.
3. Contributory Rule: Each attribute value contributes to the result - each source layer contributes to the result.
For example, environmental sensitivity could be calculated based on the sum of a set of input layers: wildlife
habitat sensitivity; riparian sensitivity; slope; and proximity to human disturbance.
4. Interaction Rule: A pair of values contributes to the result (i.e., the decision made at each step may differ). A short sketch of the dominance and contributory rules applied to raster layers follows.
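For raster inputs, the dominance and contributory rules reduce to simple cell-by-cell arithmetic. The NumPy sketch below (illustrative values) assigns each output cell either the maximum sensitivity of the inputs (dominance) or their sum (contributory).

import numpy as np

# Hypothetical sensitivity layers coded 1 (low) to 3 (high).
wildlife = np.array([[1, 2], [3, 1]])
riparian = np.array([[2, 2], [1, 1]])
slope    = np.array([[1, 3], [2, 1]])

layers = np.stack([wildlife, riparian, slope])

dominance    = layers.max(axis=0)   # highest value wins in each cell
contributory = layers.sum(axis=0)   # every layer contributes to each cell

print(dominance)
print(contributory)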
point-in-polygon – The point features, which maintain their spatial location and attribute
integrity in the output layer, are also assigned the attributes of the polygon they fall within.
This type of overlay might allow an association between meteorological stations (the met
station point layer) and vegetation types (the forest polygon layer) to be identified (Figure 6).
The output layer would be the meteorological station point file with the addition of a
vegetation type attribute.
line-in-polygon – The line features, which maintain their spatial location and attribute
integrity in the resulting output layer, are assigned the attributes of the polygon they fall
within. This type of overlay would allow you to determine the vegetation types (derived from
the forest polygon layer) associated with each line of a road layer (the road line layer)
(Figure 7). Because a line may overlap multiple polygon types, the output layer (the road line
map) will generally contain more line segments than the input layer.
polygon-on-polygon – The polygon geometries from the input and overlay layers combine
to create a new set of polygons where each new polygon maintains the attributes from both
input layers (Figure 8). This type of overlay might be used to find the association between
slope and avalanche chutes. Polygon-on-polygon is the most common of the vector overlay
methods.
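A point-in-polygon overlay such as the meteorological station example above can be sketched with a GeoPandas spatial join (illustrative only; the ‘predicate’ keyword assumes a recent GeoPandas version).

import geopandas as gpd
from shapely.geometry import Point, Polygon

stations = gpd.GeoDataFrame(
    {"STATION": ["MET-1", "MET-2"]},
    geometry=[Point(2, 2), Point(8, 8)],
)
forest = gpd.GeoDataFrame(
    {"VEG_TYPE": ["Pine", "Birch"]},
    geometry=[Polygon([(0, 0), (5, 0), (5, 5), (0, 5)]),
              Polygon([(5, 5), (10, 5), (10, 10), (5, 10)])],
)

# Point-in-polygon overlay: each station keeps its location and gains
# the vegetation type of the polygon it falls within.
stations_veg = gpd.sjoin(stations, forest, how="left", predicate="within")
print(stations_veg[["STATION", "VEG_TYPE"]])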
The main polygon overlay tools include: Identity; Intersect; Symmetrical Difference; Union; and Update. Several of these are described below.
Identity
Identity is an overlay function that produces an output layer with the same area extent as the input
layer (Figure 9). All input features and attributes are maintained. The operation also preserves the
geometry and attributes of the overlay (or identity) layer that fall within the input layer’s extent. The
input layer may contain points, lines or polygons, but the identity features (overlay) must be
polygons. An example of when the identity function could be used would be if you wanted to
identify those roads located beneath 1,000 metres elevation. The input layer would be the roads
layer, the identity layer a polygonal coverage representing all areas lower than 1,000 metres. The
resulting output layer would be populated with an attribute identifying those roads below 1,000
metres. We are then also able to conclude that the roads without an elevation attribute are above
the 1,000 metre contour.
Intersect
Intersect creates an output layer by preserving only those features (or portions of features) that are
common to both inputs (Figure 10). All features in the output layer contain attribute data from both
of the input layers. The inputs may contain different geometry types (points, lines or polygons), but
normally the overlay input is a polygon layer. The output geometry can only be of the lowest
common geometry of the input layers - a point input combined with a polygon overlay will produce
a point output with the points containing attributes from the polygon they fall within. An example of
when the intersect function could be used would be if you wanted to identify only those roads
located beneath 1,000 metres elevation. The input layers would be the roads layer and a polygonal
coverage representing all areas beneath 1,000 metres. The resulting output layer would be a
subset of roads – only the features located below the 1,000 metre contour.
Union combines and maintains all features and attributes from both input and overlay layers
(Figure 12). Both input layers must have polygon geometry in order to execute the operation. Using
our chemical plant example, a union would allow you to identify those residences falling both inside
and outside the zone of influence of the plant.
One of the typical errors arising from overlaying polygon layers involves the generation of slivers
in the output layers. Slivers are very small polygons created along shared boundaries during the
overlay of inputs (Figure 13). In an overlay analysis, this problem can be the result of overlaying
two files of different scales. Slivers can also be due to digitizing errors, non-precise geo-
referencing, or data export. The Eliminate function may be used to help operators remove slivers
resulting from polygon overlay operations.
2.1.3 Proximity
Proximity is a spatial relationship concept which corresponds to the geographic “nearness” of
features. This allows us to select features located within a certain distance of other specified
features, or to create new features by expanding a feature’s extent. For example, we may wish to
find all hotels within 10 kilometres of the Vilnius Cathedral located in the city centre. Buffers and
other proximity-based analyses help us answer these types of questions.
Buffer
Building on the notion of proximity, buffering creates a zone of inclusion or exclusion by creating an
area around existing point, line and polygon features based on a specified distance. The GIS
software extends a line in all directions around the features until a solid polygon has been formed
and, finally, a new layer containing the buffer results is created. Point buffers are circular in shape,
while the form of line and polygon buffers is defined by the geometries of the input features (Figure
14).
Figure 14. Types of Buffers
With line buffers, the buffer zone can be created on both sides of the line, or on only the left or right side. In the case of polygon buffers, users may choose how the buffer zone is created, depending on where the area of interest lies.
A newly-created buffer feature may be used in several ways. It may form a new feature, such as a
riparian zone, or a road polygon based on the stream or road centreline. It might also form the
basis of a proximity analysis, by then overlaying the buffer with another input layer to find features
which intersect the buffer.
The size of buffers can be defined by a variable distance, thus allowing individual features to have different buffer widths; normally, each particular buffer width would be based on a distance value stored in an associated attribute table. For example, using a hydrology network to create varying riparian
zones representing restricted logging areas, larger rivers might be buffered by a greater width than
lower order streams and creeks. Related to varying buffer size, is the notion of creating multiple
buffers for features. It is possible to create multiple rings around features to delineate zones
depicting some hierarchy or changing level of influence. This might be useful in defining zones of
varying noise level around machinery in a manufacturing plant. When creating buffers, the option
also exists to dissolve the resultant adjacent polygon buffers (i.e., merging overlapping areas)
(Figure 15). We will discuss dissolving in more detail in section 5.3.1.
Figure 15. Buffer Variables: multiple zones (or rings); varying buffer size defined by distance; dissolved (top) and un-dissolved (bottom) buffers
To avoid potential errors, it is important to know the units of measurement for a particular dataset
(e.g., metres versus kilometres). Because buffering will operate on a selection set, ensuring the
desired features are selected prior to initiating the buffer operation may also be of importance.
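Variable-width and dissolved buffers can be sketched as follows (illustrative only; a hypothetical BUFFER_M field supplies a per-feature distance, and shapely’s unary_union merges the overlapping buffers).

import geopandas as gpd
from shapely.geometry import LineString
from shapely.ops import unary_union

streams = gpd.GeoDataFrame(
    {"NAME": ["River", "Creek"], "BUFFER_M": [100.0, 30.0]},
    geometry=[LineString([(0, 0), (0, 1000)]),
              LineString([(50, 0), (50, 1000)])],
)

# Variable buffer width: each feature is buffered by its own distance
# taken from the BUFFER_M attribute.
buffers = streams.apply(lambda row: row.geometry.buffer(row["BUFFER_M"]),
                        axis=1)

# Dissolve: merge overlapping buffer polygons into a single geometry.
dissolved = unary_union(list(buffers))
print(dissolved.area)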
Near
Near calculates the distance between each point in an input layer and the nearest point or location
along a line in another layer (also called the Near Feature layer). The resultant distance values are
recorded in the input layer’s attribute table. A fire department might use this tool to determine the
closest water hydrant (near feature layer) for each school (input layer) in a certain district.
Point Distance
Point distance determines the distances between each point feature in an input layer to all points in
another layer, within a specified search radius. The results are recorded in an output table,
containing fields for the feature’s unique identifier and distance values. Using our hydrant example,
the fire department might wish to expand their search to determine the distances separating each
school from all water hydrants within a specified search radius.
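Both operations can be sketched with SciPy’s k-d tree spatial index (illustrative coordinates): a nearest-neighbour query answers the Near question (closest hydrant per school), and a radius query answers the Point Distance question (all hydrants within a search radius).

import numpy as np
from scipy.spatial import cKDTree

hydrants = np.array([[0.0, 0.0], [100.0, 50.0], [300.0, 300.0]])
schools  = np.array([[10.0, 10.0], [290.0, 310.0]])

tree = cKDTree(hydrants)

# Near: distance to, and index of, the single closest hydrant per school.
near_dist, near_idx = tree.query(schools, k=1)
print(near_dist, near_idx)

# Point Distance: all hydrants within a 150 m search radius of each school.
within_radius = tree.query_ball_point(schools, r=150.0)
print(within_radius)   # list of hydrant indices for each school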
2.1.4 Statistics
Frequency
Frequency produces a table summarizing the unique codes (and their frequency) for a specified set
of fields from an input feature layer or table. This might be useful to a vegetation ecologist
interested in determining the frequency of all the plant types within a particular study area.
Summary Statistics
The summary statistics tool calculates one or more of the following statistics on numeric fields in an
attribute table: sum, mean, maximum, minimum, range, standard deviation, first and last. The
resulting output table can be saved in a variety of formats, including dBASE or a personal geodatabase table.
Analysis Extent
It may be necessary to perform an analysis on only a portion of a larger raster dataset. The area of interest may be defined by specified minimum and maximum map coordinates (x, y) which define a
rectangle. Alternatively, some combination of input raster datasets may be used to define the
analysis extent based on multiple inputs. In this case, either the union or intersect of the rasters
defines the area. With a union setting, the analysis extent encompasses the entire area of all input
rasters. Using the intersect option results in an analysis extent equal to only the area of overlap
between all input raster datasets (i.e. the minimum of the inputs).
Masks
Setting a mask allows you to conduct analyses on a selected set of cells, and hence, is another
method to define the extent of analysis. Only those cells that are identified by the mask will be
considered when running the analysis. Two methods exist for setting analysis masks:
1. By attribute: Selecting rows in the attribute table of a raster allows you to restrict analysis to
particular raster values. For example, a wildlife ecologist may wish to examine only those
areas above a certain elevation to assess bear denning habitat.
2. By area: An existing feature layer (point, line or polygon) or raster dataset defines the spatial
extent of analysis. Only those cells falling within the mask extent will be processed during
analysis. For instance, the same ecologist might wish to limit the denning habitat analysis to
areas within a park boundary and therefore a polygon, defining the extent of the park, would
be used as a mask.
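The effect of an analysis mask can be sketched with a NumPy boolean array (illustrative values): only cells flagged True are considered when a statistic is computed.

import numpy as np

elevation = np.array([[820.0, 940.0, 1010.0],
                      [760.0, 880.0,  990.0],
                      [700.0, 850.0,  970.0]])

# Analysis mask: True marks cells inside the park boundary (by area) or
# meeting an attribute criterion; only these cells are processed.
park_mask = np.array([[True,  True, False],
                      [True,  True, False],
                      [False, True, False]])

print("Mean elevation inside mask:", elevation[park_mask].mean())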
Cell Size
Establishing an output raster cell size is heavily influenced by the resolution or cell size of the input
datasets. As a guideline, the output cell size should be equal to, or larger than, the coarsest cell
size of the input raster datasets (known as the maximum of inputs). This ensures the resolution of
the output raster is consistent with that of the least accurate (the coarsest) input dataset. Using a
cell size finer than that of the input rasters does not ‘improve’ the accuracy of data. Other options
available include:
Minimum of inputs – The output cell size equals the input raster with the smallest cell size.
Take caution with this option because, as mentioned above, refining cell sizes of coarser
inputs does not result in accurate higher resolution output.
As specified – Allows you to specify an output cell size.
Same as Layer – Allows you to specify the output resolution equal to the cell size of an
input raster.
Such terrain tools are discussed in detail in the course GII-04, but are presented here in summary
form for completeness.
In a slope raster, we are able to view and identify the areas of gentle, moderate and steep slope by theming the raster on grouped slope values (e.g., 0-2°, 2-4°, etc.). For instance, we can see that the steeper areas (regions coloured yellow and red) occur along the banks of the Vilnia and Neris rivers. In order to mitigate the potential threat of landslides, a structural engineer might develop and examine this type of slope information.
As with a slope calculation, an output raster is created by determining the aspect for each cell in
a raster DEM or each facet in a TIN, depending on the format of the input dataset. Cells with a
slope equal to zero are assigned an aspect of -1. An aspect layer reveals patterns in terrain
not visible in a simple DEM or even slope map, as evident in Figure 20.
As with any raster dataset, you must consider resolution when creating a slope or aspect raster. As
a general rule, the output cell size should never be smaller than the cell size of the input raster.
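Slope can be sketched from a small DEM with NumPy’s finite-difference gradient (a simplified illustration; GIS packages use specific neighbourhood kernels, and aspect is derived from the same two gradient components).

import numpy as np

cell_size = 10.0  # metres
dem = np.array([[100.0, 101.0, 103.0],
                [102.0, 104.0, 107.0],
                [105.0, 108.0, 112.0]])

# Rates of change of elevation along the rows (y) and columns (x).
dz_dy, dz_dx = np.gradient(dem, cell_size)

# Slope in degrees for each cell.
slope_deg = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
print(slope_deg.round(1))

# Aspect (the direction each cell faces) can be derived from the same two
# gradient components with np.arctan2; the exact formula depends on the
# grid's row orientation and the compass convention, so it is omitted
# from this simplified sketch.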
Contour
The contour tool is used to create a layer showing contours, or lines of equal elevation (also called
isolines) (Figure 21). Contour maps allow you to identify and view regions of equal elevation and,
by examining the spacing and position of isolines, they enable you to infer where steep slopes,
cliffs, river valleys and ridge lines occur. Contour lines are spaced closely together in areas of
steep terrain and further apart in flatter areas. In the areas surrounding water, the peaks of contour
lines indicate the upstream direction of streams and rivers. Important in the creation of contours is
the selection of a contour interval – the distance in elevation between adjacent contour lines. A
higher resolution DEM or TIN will be able to produce a tighter contour interval than a surface with a
coarse cell size.
Hill Shade
The hill shade tool creates a shaded relief raster from an elevation grid or TIN. By employing an
illumination and shadowing of a surface layer, a hill shade can be very effective in representing
relief or terrain, as it gives the impression of a three-dimensional landscape (Figure 22). Four
factors combine to create a hill shade:
1. Azimuth of the Sun – the direction of incoming light measured clockwise in degrees from
north (0° - 360°);
2. Altitude of the Sun – the angle of the illuminating source measured in degrees above the
horizon (0° - 90°);
3. Slope of the Surface – the slope of the cell or facet from the input DEM or TIN respectively;
4. Aspect of the Surface – the aspect of the cell or facet from the input DEM or TIN
respectively.
Each cell in the output hill shade is assigned an illumination value (ranging from 0 (black) to 255 (white)) which, when the cells are viewed together, gives the appearance of 3-D terrain.
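As a rough illustration of how these four factors combine, the following sketch (Python with NumPy) applies a commonly used illumination formula to slope and aspect rasters that are assumed to be already calculated and expressed in radians; the array values and the 315°/45° sun position are invented for the example and are not taken from this module.

    import numpy as np

    def hillshade(slope_rad, aspect_rad, azimuth_deg=315.0, altitude_deg=45.0):
        """Illumination raster (0 = black, 255 = white) from slope and aspect rasters in radians."""
        zenith = np.radians(90.0 - altitude_deg)   # sun angle measured from the vertical
        azimuth = np.radians(azimuth_deg)          # sun direction, clockwise from north
        shade = (np.cos(zenith) * np.cos(slope_rad) +
                 np.sin(zenith) * np.sin(slope_rad) * np.cos(azimuth - aspect_rad))
        return np.clip(shade, 0.0, 1.0) * 255.0

    # Tiny invented slope/aspect rasters (degrees converted to radians)
    slope = np.radians(np.array([[10.0, 30.0], [0.0, 45.0]]))
    aspect = np.radians(np.array([[315.0, 135.0], [0.0, 315.0]]))
    print(hillshade(slope, aspect).round(0))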
Viewshed
The viewshed tool answers the question, ‘what features or regions of a surface layer are visible
from one or more vantage points?’ Two input datasets are required to run a viewshed analysis; the
first being a point layer with one or more viewpoints and the second a DEM or TIN surface
representing a terrain model.
Viewshed analysis works on the notion of ‘line-of-sight’ - a line connecting a viewpoint to a target.
A feature that separates a viewpoint from a target (e.g., a mountain) will render that target invisible.
Conversely, if no feature on the surface is blocking the view from an observation point to a target,
then that target is visible. In the case of a viewshed, the target is, in fact, every cell or facet of the
input surface layer. The output raster summarizes all the possible line-of-sight permutations and
classifies each cell as either ‘visible’ or ‘not visible’ (Figure 23). By creating a viewshed, a logging
company could minimize the visual impact of their forest harvest operations by restricting harvest in
areas that can be seen from a community or highway.
A number of optional parameters can be used to control a viewshed analysis:
SPOT: A z-value (height) which overrides the value normally derived from the surface's z-value.
OFFSETA: A value specifying a vertical distance to be added to the z-value of the
observation point (e.g., 1.5 metres added to a point to represent the average height of a
person standing in that location).
OFFSETB: A value specifying a vertical distance to be added to the z-value of each target
cell in the surface (e.g., this could be used to simulate average vegetation height).
AZIMUTH1: The first of two values (in degrees) used to limit the scan of a viewshed to a
certain direction.
AZIMUTH2: The second of two values (in degrees) used to limit the scan of a viewshed to a
certain direction.
VERT1: A value used to establish the upper vertical angle limit of a viewshed analysis.
VERT2: A value used to establish the lower vertical angle limit of a viewshed analysis.
RADIUS1: a value used to establish the minimum distance in which to begin a viewshed
analysis (i.e., raster cells closer than the value set in RADIUS1 will not be considered for
analysis).
RADIUS2: a value used to establish the maximum distance in which to restrict a viewshed analysis (i.e., raster cells beyond the value set in RADIUS2 will be excluded from analysis).
Generalized raster functions (i.e., those not for specific applications such as terrain) fall into three categories: local, focal (neighbourhood) and zonal operations. Below are some examples of the statistics that can be generated for a raster dataset:
Maximum: determines the highest value among input rasters on a cell-by-cell basis
Minimum: determines the lowest value among input rasters on a cell-by-cell basis
Majority: determines the value that occurs most often among input rasters on a cell-by-
cell basis
Minority: determines the value that occurs least often among input rasters on a cell-by-
cell basis
Sum: calculates the sum of values from input rasters on a cell-by-cell basis
Mean: calculates the mean (average) value from input rasters on a cell-by-cell basis
Median: calculates the median (half of the values are above, half below) value from
input rasters on a cell-by-cell basis
Std. dev: calculates the standard deviation from input rasters on a cell-by-cell basis
Range: determines the range in values (highest to lowest) from input rasters on a cell-
by-cell basis
Variety: determines the number of unique values from input rasters on a cell-by-cell
basis
For example, you might wish to determine the maximum value between multiple input rasters; each
cell in the output raster is derived from the values of the corresponding cell in each input raster.
For instance, you could derive the highest rainfall for each cell location from a series of five input
rasters representing five consecutive years of rainfall data.
In addition to arithmetical functions, Boolean operators (e.g., AND, OR, XOR, NOR) and relational operators (e.g., >, <, =) are often used in association with local functions.
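A minimal sketch of a local (cell-by-cell) operation is shown below in Python with NumPy; the five small 'rainfall' rasters and their values are invented purely for illustration.

    import numpy as np

    # Five hypothetical rainfall rasters with the same extent and cell size, stacked along a new axis
    years = np.stack([
        np.array([[10, 12], [ 8, 15]]),
        np.array([[11,  9], [14, 13]]),
        np.array([[ 7, 16], [12, 10]]),
        np.array([[13, 11], [ 9, 14]]),
        np.array([[12, 10], [11, 16]]),
    ])

    # Each output cell is derived from the corresponding cell in every input raster
    print(years.max(axis=0))          # Maximum: highest rainfall per cell across the five years
    print(years.mean(axis=0))         # Mean
    print(years.sum(axis=0))          # Sum
    print((years > 12).any(axis=0))   # a Boolean/relational local operation: any year above 12?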
Neighbourhood Statistics
Like local operations, neighbourhood functions can use the same summary statistics to generate
output cell values. For example, the Sum statistic could be employed to incorporate the data from
surrounding cells into each output focal cell as shown in Figure 29 below.
Data simplification (or the application of a spatial filter) is another important application of
neighbourhood statistic functions. This type of operation is useful when generalizing raster data
and serves to reduce the level of variation between neighbouring cells in the input layer. A typical
simplification called the moving average method uses a Mean operator to calculate the average
value on a moving 3x3 or 5x5 cell neighbourhood rectangle; the mean value derived from a
neighbourhood is assigned to the focal cell of that neighbourhood in the output raster. Figure 30 outlines the typical statistics and their output when applying a spatial filter to a raster.
Cells with No Data can be processed in one of two ways when executing a local or neighbourhood
function:
1. assign No Data to output cell regardless of the combination of input cell values (i.e., as long
as one input cell in the neighbourhood is No Data, then the output will be No Data); or
2. ignore the No Data cell and complete the calculation without it (i.e., calculate the maximum
value in a neighbourhood, disregarding the No Data cell). Normally, No Data cells are
ignored (i.e., the calculation (total sum) is executed based on neighbouring cells with
values).
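The moving average method and the usual treatment of No Data cells can be sketched in Python with NumPy as follows; NaN stands in for No Data, and the small input grid is invented for illustration.

    import numpy as np

    def mean_filter_3x3(grid):
        """3x3 moving-average (mean) filter; No Data (NaN) cells are ignored in each neighbourhood."""
        rows, cols = grid.shape
        out = np.empty_like(grid, dtype=float)
        for r in range(rows):
            for c in range(cols):
                # neighbourhood window, clipped at the raster edges
                window = grid[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
                out[r, c] = np.nanmean(window)   # NaN values are excluded from the average
        return out

    grid = np.array([[1.0, 2.0, 3.0],
                     [4.0, np.nan, 6.0],
                     [7.0, 8.0, 9.0]])
    print(mean_filter_3x3(grid).round(2))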
2.2.6 Distance
GIS gives us the ability to measure distances between features. At the simplest level distance
statistics can be determined within the map interface through a distance query tool. The user can
click on one location and then click on another and the distance between the two locations will be
calculated and displayed on the screen. Typically these query tools will also allow you to draw a
polyline on the screen and will summarize the distance for the length of the line.
In vector datasets, distance measurements are typically determined between point features, either
located in a single dataset or in multiple datasets. For example, distances between cities can be
calculated and stored as an attribute of each city location. Distances may also be determined from
points in a layer to the nearest point along a line (either from a polyline or the edge of a polygon).
This type of calculation would allow you to determine the distance between an offshore water
quality sampling location and the nearest part of a coastline.
Distances are of interest in and of themselves; however, they can also be fed in as a variable when
developing a model. For example, in a wildlife habitat model, proximity to water is critical to most
wildlife species and therefore distance to the nearest fresh water source (e.g., a stream or lake)
would be an important input to evaluating wildlife habitat capability. It is in these types of analyses
that the functions available within raster data models are most useful. For example, when
developing our wildlife habitat model we can develop a surface that specifies distance to fresh
water sources. Details related to distance functions available in a raster environment are provided
below.
Straight Line
Straight line distance is the physical distance between two points. It is also referred to as Euclidean distance. In a raster dataset, straight line distances are calculated between cells based on the cell
centres (Figure 34).
Consider, for example, two cells whose centres lie at (1, 1) and (3, 3) in cell coordinates (Figure 34). If the cell size was 10 metres, the distance between the two points would be 28.3 metres based on the following calculation:

10 \times \sqrt{(3 - 1)^2 + (3 - 1)^2} = 10 \times \sqrt{2^2 + 2^2} = 10 \times \sqrt{4 + 4} = 10 \times \sqrt{8} = 10 \times 2.8284 = 28.2843
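The same calculation can be checked with a few lines of Python, using the cell size and cell centres from the example above:

    import math

    cell_size = 10.0                      # metres
    (x1, y1), (x2, y2) = (1, 1), (3, 3)   # cell centres, in cell units

    distance = cell_size * math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
    print(round(distance, 4))             # 28.2843 metres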
This type of distance calculation illustrates how distance between raster cells is calculated but
straight line distance functions allow us to develop output rasters that measure distance from every
cell to source cells or points. This type of distance analysis allows us to determine the relationships
between locations. For example, Figure 35 illustrates the results of a raster developed to
illustrate the distances between cities (the yellow points). This type of decision support information
assists regional planners when they are locating facilities (e.g., perhaps a hospital or a recreation
centre can service a number of smaller communities and therefore distance can be used to help
select an optimal location for a new facility).
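A minimal sketch of such a distance surface is given below; it assumes SciPy is available and uses a small invented grid in which the value 1 marks the source ('city') cells.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    cell_size = 10.0                          # metres
    sources = np.zeros((5, 5), dtype=int)
    sources[0, 0] = 1                         # two hypothetical source cells
    sources[4, 3] = 1

    # distance_transform_edt measures the distance to the nearest zero cell, so the mask is
    # inverted: source cells become 0 and therefore have a distance of 0 to themselves
    distance = distance_transform_edt(sources == 0, sampling=cell_size)
    print(distance.round(1))                  # straight-line distance from every cell to its nearest source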
Cost Weighted
Cost weighted distance analyses calculate the cost for traveling across the landscape. This type of
analysis allows you to identify the easiest route between two places based on other considerations
beyond just physical distance. For example, the shortest distance between two communities may involve having to cross a mountain range; while the physical distance of a route that avoids the mountains would be greater, the longer route may well have a lower cost associated with it. Typically,
cost weighted distances involve the consideration of multiple factors, represented by numerous
rasters. The cost represents the sum of the input raster layers. Some of the rasters may represent
attribute values that lower distance costs (i.e., positive features) while others increase the cost.
Costs can be represented either as actual values or as relative numbers, where factors are ranked relative to one another with higher numbers associated with higher costs. Relative cost values are useful in that they allow you to standardize different cost types to a common scale.
A cost raster is generated by assembling the data associated with each of the cost factors in the
analysis. A raster is then developed for each factor detailing a cost for each cell. The local statistics
function may then be used to generate a derivative (output) raster that sums the total cost for each
cell based on the various cost factor rasters.
Shortest Path
A cost raster allows you to determine the accumulated cost of a given route and evaluate the
shortest path (the path with the lowest associated cost). The cost is determined by summing the
costs associated with the links between adjacent cells (horizontal, vertical and diagonal). To
determine the shortest path we must first calculate the costs between all the linked cells. We then
select a source cell (the starting point) from which the cost associated with movement to that cell’s
adjacent cells is determined. The adjacent cell with the lowest cost then represents the next cell
along the route. The cost of progressing to that cell’s neighbours is then added to the total cost.
The shortest path is the result of this iterative calculation.
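A sketch of this iterative accumulated-cost calculation is given below (Python, standard library only). The 4x4 cost raster is invented, and the link cost is taken here as the average of the two cell costs multiplied by the link length, which is one common convention rather than a rule stated in this module.

    import heapq
    import math

    cost = [[1, 1, 5, 5],
            [1, 9, 9, 1],
            [1, 1, 9, 1],
            [5, 1, 1, 1]]       # hypothetical per-cell travel costs

    def accumulated_cost(cost, source):
        """Lowest accumulated cost from the source cell to every other cell (Dijkstra, 8-connected)."""
        rows, cols = len(cost), len(cost[0])
        best = [[math.inf] * cols for _ in range(rows)]
        best[source[0]][source[1]] = 0.0
        heap = [(0.0, source)]
        while heap:
            acc, (r, c) = heapq.heappop(heap)
            if acc > best[r][c]:
                continue
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols:
                        length = math.sqrt(dr * dr + dc * dc)              # 1 for orthogonal, 1.414 for diagonal links
                        link = 0.5 * (cost[r][c] + cost[nr][nc]) * length  # cost of moving along this link
                        if acc + link < best[nr][nc]:
                            best[nr][nc] = acc + link
                            heapq.heappush(heap, (acc + link, (nr, nc)))
        return best

    for row in accumulated_cost(cost, (0, 0)):
        print([round(v, 1) for v in row])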
Dissolve
The dissolve function allows you to simplify a classified dataset into more generalized classes. For example, land use polygons can be aggregated based on land use values (e.g., roads, buildings and paved areas could be grouped into an all-encompassing class called urban).
It can be used to remove boundaries between polygons having identical attributes in a given
field. For example, if you have merged datasets from different sources the resulting product
may have a number of slivers or overlapping values. The dissolve function will allow these
unnecessary boundaries to be removed from the dataset.
Dissolve can also help us summarize data because while merging the polygons we can also
aggregate mathematical data. For example, using the dissolve described above to create an urban
land use class we can easily determine the total area of all urban features from the resulting
dissolved coverage. In addition, other related attributes (e.g., population values, annual incomes,
birth rates) can be aggregated. For example, if annual income is an attribute available for a set of
municipalities in a city, we can determine the total and/or average income for the entire city (or a
group of municipalities within the city) during the dissolve by summing and averaging the income
attribute.
Simplify Line
The process of simplifying a line involves the removal of small bends in the line. This is
accomplished by removing some of the points (or vertices) within the line. The result is a
generalized version of the feature (Figure 37) that maintains the general shape of the original
version. Lines are simplified to improve cartographic display; for example, a detailed line digitized at a scale of 1:20,000 might have so much detail that it appears fuzzy when plotted at a scale of
1:100,000. In addition, more complex lines increase the processing times associated with various
analyses (e.g., buffering) and often simplifying a feature prior to a buffer analysis will not affect the
accuracy of the resultant buffer polygon.
When simplifying a line it is important to specify the degree of simplification to ensure the resultant
feature resembles the original feature. This is done by selecting an analysis tolerance at a scale
appropriate to the source data. For example, when the goal is to improve the cartographic display
of the data, the tolerance should be set equal to, or greater than, the minimum allowable spacing
between graphic elements. Some trial and error may be required when selecting a tolerance
appropriate for all features in a dataset.
Smooth Line
Two methods are commonly used to smooth a line:
Polynomial Approximation with Exponential Kernel (PAEK) calculates the smoothed version
of the line using a parametric continuous averaging technique. The placement of additional
points in the lines is based on a weighted average of the coordinates for all the points in the
original feature. Points closer to the current coordinate are weighted more than points further
away. The resulting line may not contain any of the vertices present in the original feature
with the exception of the start and end points.
Bezier Interpolation fits Bezier curves along each line segment of the original feature and then the curves are connected at the vertices through the application of the Bessel tangent method.
As when simplifying a line, it is important to select an appropriate tolerance for the analysis to
ensure the desired results are achieved. In line smoothing, the tolerance specifies the length of a
moving path along the line used to calculate the new ‘smoothed’ coordinates – a higher tolerance
results in a longer path (i.e., more points are factored into the result) and therefore yields a
smoother line. Again a certain level of trial and error should be applied when selecting a tolerance
level to ensure important characteristics (e.g., bends) are not removed.
Aggregate
Aggregation is a resampling technique that allows you to generate a lower resolution grid (a grid
with larger cell sizes) based on the attributes of an input grid. The values for each output cell can
be based on the mean, median, sum, minimum or maximum of the input cells falling within the
extent of the output cell. Figure 39 provides an illustration of an aggregation analysis where
the resulting output grid is based on the mean of the input dataset.
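A minimal aggregation sketch in Python with NumPy is shown below; it assumes the output cell size is an exact integer multiple of the input cell size, and the 4x4 input grid is invented.

    import numpy as np

    def aggregate(grid, factor, stat=np.mean):
        """Aggregate a raster by an integer factor; each output cell summarizes a factor x factor block."""
        rows, cols = grid.shape
        blocks = grid.reshape(rows // factor, factor, cols // factor, factor)
        return stat(blocks, axis=(1, 3))

    grid = np.arange(16, dtype=float).reshape(4, 4)
    print(aggregate(grid, 2))              # mean of each 2x2 block of input cells
    print(aggregate(grid, 2, np.max))      # maximum of each 2x2 block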
Boundary Clean
Boundary clean can be used to smooth the boundaries between regions. It cleans boundaries on a
relatively large scale by making a series of passes through the data – the first pass involves an
examination of cells outside the region and the second considers cells inside the region. Basically
all regions of less than a three by three block of cells will have their values updated based on the
values of surrounding cells. During the ‘outside pass’, regions outside the current region (one cell
in each direction) are assigned values based on the values of neighbouring higher priority zones.
On the ‘inside pass’, cells that are not completely surrounded by cells of the same value are then
evaluated and may then be replaced with the values of surrounding cells. It should be noted that
thin portions of regions may be replaced; for example, a region representing a linear feature (e.g., a
river that is numerous cells long but only two cells wide) will be removed.
Expand
Expand allows specified regions within a raster dataset to be expanded based on a user-specified
number of cells. Cells with lower priority values (determined by the user) are labelled as
background cells. Cells with a higher priority (foreground cells) are allowed to expand into regions
of low priority. The technique would be useful, for example, to remove No Data values from a raster dataset by allowing surrounding cells with values to expand to 'fill' the No Data locations.
Filtering
Spatial filtering is designed to highlight or suppress specific features in an image based on their
spatial frequency. Spatial frequency is related to the concept of image texture. It refers to the
frequency of the variations in tone that appear in an image. ‘Rough’ textured areas of an image,
where the changes in tone are abrupt over a small area, have high spatial frequencies, while
‘smooth’ areas with little variation in tone over several pixels, have low spatial frequencies.
In practical implementation, filters are applied to the source raster by means of moving windows
(kernels). A common filtering procedure involves moving a 'window' of a few pixels in dimension
(e.g., 3x3, 5x5, etc.) over each pixel in the image, applying a mathematical calculation using the
pixel values under that window, and replacing the central pixel with the new value. The window is
moved along in both the row and column dimensions one pixel at a time and the calculation is
repeated until the entire image has been filtered and a ‘new’ image has been generated. By
varying the calculation performed and the weightings of the individual pixels in the filter window,
filters can be designed to enhance or suppress different types of features. The moving window
process is illustrated in Figure 40. Typically a kernel shape is square or rectangular, but circles and
annuli are also used.
Thematic applications of filtering include: surface slope and aspect calculations using a DEM,
calculations of weighting functions for advanced multi-criteria raster analysis and many others.
The majority filter function generalizes data by replacing cells in a raster dataset based on the value present in the majority of each cell's surrounding cells. Two criteria must be met before a cell's value will be replaced:
there must be a large enough number of surrounding cells (e.g., more than half) with a
common value; and
the cells having a common value must be spatially connected (e.g., contiguous) around the
centre of the filter kernel. This criterion minimizes the potential corruption of spatial patterns
in the data.
Nibble
The nibble function can be applied to edit portions of a raster dataset where the values are known
to be incorrect or missing (e.g., areas of no data). A query or selection set is first applied to select
the cells in the grid that are to be replaced. A mask is applied to specify the extent of the analysis –
selected cells falling within the mask will be the ones replaced. The selected cells are then reassigned the values of their nearest neighbours through a Euclidean allocation (cells are allocated based on closest proximity using Euclidean, or straight-line, distance).
Region Group
A scanning process is applied (similar to a moving window analysis) to assign a unique number to
those cells falling within each region of a raster dataset. The resultant (output) dataset contains
unique values for each unique region (Figure 41). The values assigned cannot be controlled by the
user. The region group function allows you to examine potential spatial patterns in your data by
helping you identify unique regions or zones.
1 1 1 1 1 1 1 1
3 3 4 4 2 2 3 3
3 1 1 1 2 4 4 4
2 2 2 4 5 5 5 6
Thin
The thin function allows you to reduce the number of cells required to represent linear features in a
raster dataset. For example, a paper map scanned at a high resolution would potentially represent
linear features by a region numerous cells wide (e.g., a single-line river on the source hard copy
could appear in the raster as a region 5 cells wide and 200 cells long). Thin would allow you to reclassify the cells in the raster so that the river in our example is represented by a region a single cell wide and 200 cells long.
Self-Test Questions
1. Describe the difference between the Contributory Rule and the Dominance Rule for use in
combining attribute values during overlay.
2. You wish to determine the ratio of agricultural lands lying within 500m of streams to agricultural
lands lying more than 500m from streams. To do this, you have two vector polygonal layers:
polygons delineating agricultural land, and 500m buffer polygons generated from a streams
layer. Is it appropriate to use the Intersect form of a vector overlay to arrive at the necessary
ratio? Why or why not?
3. Describe the difference between a local, focal and zonal operation on raster layers.
4. Why might you use the Smooth function on a linear feature class?
5. Does executing a 3x3 mean filter on a raster grid have the effect of smoothing out extreme
values, or does it enhance extreme values in the raster? Why?
References
2. Chrisman, N. Exploring Geographic Information Systems. John Wiley and Sons, 1997.
3 Spatial Statistics
The most frequent method of reducing large quantities of data into a manageable amount of
information is through the use of statistical analysis. Broadly defined as the collection, analysis,
interpretation or explanation, and presentation of data, statistical analysis can provide valuable
information from vast arrays of data.
Though there are many forms of statistics, two classifications of statistics are classical statistics
and spatial statistics. Classical statistics is the set of methods that most people come across in
government reports, sports numbers, academic research, and the media. These methods, as with
all statistical methods, have restrictions on their applicability, limiting the ways in which classical
statistics may be applied.
One of these restrictions is the independence of observations. This simply means that one
observation is not related to another: the information from one observation gives you no
information about another observation. This restriction is most often violated in the context of
spatial data.
The first law of geography, usually referred to as Tobler’s Law, states that "[e]verything is related to
everything else, but near things are more related than distant things" (Tobler 1970: 236). This
means that the information given regarding one location can be used to know information regarding
another location, with that information’s value decreasing as the two locations become further
apart.
Because of this violated restriction a whole branch of statistics has emerged to deal with this lack of
independence of observations. This module introduces some of these spatial statistical methods,
with application using crime data in Lithuania measured at the municipal level.
The following six topics of spatial statistics are examined in this module:
First, when analyzing data using classical correlation, you are interested in knowing whether two variables are related. Just because two variables are correlated does not mean that there is an actual relationship between them. In a very humorous paper, an economist showed that there is a strong correlation between the national inflation rate and dysentery, so it is important that correlations are interpreted carefully and that the two variables you choose to correlate have a plausible relationship to each other.
Second, when two variables move in the same direction (either increase or decrease) the two
variables are said to be positively correlated. It is a little confusing to think of two variables being
positively correlated when they are both decreasing, but this is just the terminology. When one
variable goes up and the other goes down, the two variables are said to be negatively correlated.
These two types of correlation are shown in Figure 1.
The strength of a correlation is summarized by a correlation coefficient ranging from +1, referring to perfect positive correlation, to -1, referring to negative correlation. If the value is zero, there is no correlation. There are some links to further reading on the Internet, for those interested, listed in the Additional Resources section below.
The biggest difference when making the calculation for spatial autocorrelation, compared to
classical correlation, is that the neighbours of spatial units must be specified in order to calculate
spatial autocorrelation. There are two dominant methods to determine whether or not two spatial
units are neighbours: distance and contiguity. Distance, most often used in point pattern analysis,
simply states that if one point is within some distance of another point they are neighbours. The
distance used depends on the phenomenon being measured and the context of the study.
Contiguity is simply a measure of whether or not two areal spatial units are next to each other. For
example, Lithuania and Poland share a national border, so they are contiguous. What becomes important is the order of that contiguity. The following example provides an explanation.
However, a different measurement of contiguity, Rook's contiguity, does not consider a spatial unit that shares a boundary only at a corner to be contiguous. Most often, Queen's contiguity is used with vector data because most social-economic-political spatial units are not square or rectangular. Also, using Queen's contiguity allows the researcher to avoid violating Tobler's First Law of Geography. It is also important to note that other measures of contiguity are
possible and these can affect the spatial analysis: neighbours may be based on the length of
common shared borders between spatial units or bi-directional weights based on the flow of
individuals into and out of adjacent spatial units. For example, to continue with the national
examples, the international border shared between Lithuania and Latvia is longer than that of
Lithuania and Poland, but Poland has a much larger economy. As such, the measurement of
neighbours (and their respective importance) depends on the context. However, for most analyses
first order Queen’s contiguity is used.
I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}(y_i - \bar{y})(y_j - \bar{y})}{\left[\sum_{i=1}^{n}(y_i - \bar{y})^2\right] \sum_{i \neq j} w_{ij}} ,
where the variables representing n and y are the same as above and wij represents a matrix that
defines the spatial neighbours of the spatial units. The actual formula is rather complicated and not
particularly important to understand at this stage because many software programs, including
ArcGIS, make these calculations for you. It is important to note that there are some similarities between the formula for Moran's I and that for classical correlation. The primary difference is that only one variable is analyzed with spatial autocorrelation. The spatial autocorrelation occurs
between the spatial units, not different variables.
Moran's I ranges from -1 to +1, just as with classical correlation. This is primarily the reason why Moran's I is favoured over Geary's C—this latter measure of spatial autocorrelation does not have the same numerical similarity to classical correlation. For Moran's I, a value of zero means there is no spatial autocorrelation, a value greater than zero means that there is positive spatial autocorrelation, and a value less than zero means that there is negative spatial autocorrelation.
So, when spatial autocorrelation is positive, the most often phenomenon when investigating social-
economic-political spatial units, neighbouring spatial units have similar values. However, when
spatial autocorrelation is negative, neighbouring spatial units have dissimilar values.
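A minimal Moran's I sketch in Python with NumPy is given below; the four attribute values and the small first-order contiguity matrix are invented for illustration, and in practice the weights matrix would be built from the neighbour definition discussed above.

    import numpy as np

    def morans_i(y, w):
        """Global Moran's I for attribute values y and a binary spatial weights matrix w."""
        y = np.asarray(y, dtype=float)
        n = y.size
        z = y - y.mean()                         # deviations from the mean
        numerator = n * (w * np.outer(z, z)).sum()
        denominator = (z ** 2).sum() * w.sum()
        return numerator / denominator

    # Four spatial units in a row; each unit neighbours the unit(s) next to it
    w = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    y = [10, 12, 30, 33]
    print(round(morans_i(y, w), 3))    # positive value: similar values tend to be neighbours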
Given the availability of crime data for Lithuanian municipalities, the location quotient's utility is shown in this context. The location quotient is calculated as:
LQ_{in} = \frac{C_{in}/C_{tn}}{\sum_{n=1}^{N} C_{in} \Big/ \sum_{n=1}^{N} C_{tn}} ,
where Cin is the count of crime i in spatial unit n, Ctn is the count of all crimes in spatial unit n, and
N is the total number of spatial units. In the context of this example, the location quotient is a ratio
of the percentage of a particular type of crime in a municipality of Lithuania relative to the
percentage of that same crime in all of Lithuania. If the location quotient is equal to one, the
municipality has a proportional share of a particular crime; if the location quotient is greater than
one, the municipality has a disproportionately larger share of a particular crime; and if the location
quotient is less than one, the municipality has a disproportionately smaller share of a particular
crime.
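A minimal location quotient sketch (Python with NumPy) is shown below; the crime counts for the three hypothetical municipalities are invented for illustration only.

    import numpy as np

    # Hypothetical counts for three municipalities
    break_and_enter = np.array([50.0, 20.0, 30.0])   # C_in: count of crime i in each spatial unit n
    all_crimes = np.array([500.0, 100.0, 400.0])     # C_tn: count of all crimes in each spatial unit n

    # LQ = (C_in / C_tn) / (sum of C_in over all units / sum of C_tn over all units)
    local_share = break_and_enter / all_crimes
    national_share = break_and_enter.sum() / all_crimes.sum()
    lq = local_share / national_share
    print(lq.round(2))   # > 1: disproportionately large share; < 1: disproportionately small share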
Consider the following example for Lithuanian municipalities. As shown in Figure 3, break and
enter is not a significant problem for most Lithuanian municipalities: the legend, omitted here for
brevity, goes from green (low crime) to yellow (moderate crime) to red (high crime). There are a
few municipalities that have moderately high break and enter rates, but for the most part, it is not a
problem in the majority of Lithuanian municipalities.
The location quotient provides a different picture of break and enters. Clearly there are some
municipalities that are overrepresented (red) with regard to Lithuania as a whole. Though most of
these municipalities are identified with higher break and enter crime rates, now some other
municipalities appear to be overrepresented. Moderate overrepresentation (orange) is present for
many municipalities that are considered to have low crime rates. This does not mean that these municipalities are unsafe (crime is a phenomenon that occurs at all places and at all times), but it
does show that some municipalities are more popular for break and enters than others. This
becomes important information to policymakers because existing police and crime prevention
resources should be more focused on break and enters in these municipalities—this does not
necessarily mean that more of these resources need to be allocated. The application of the
location quotient is applicable to any phenomenon that may specialize in particular places, which is
most of human activity.
The scale effect was just mentioned above regarding the use of municipalities and counties in the
same analysis. Simply put, (spatial) statistical results may change depending on the spatial
resolution of the data. Generally speaking, the larger the area of a set of polygons, the more likely
they are to be of similar value and, consequently, highly correlated. This is because the larger
spatial unit is an average of its constituent parts. Let us assume that there is a high degree of
variability in unemployment across Lithuanian municipalities and that each county contains
municipalities that have both high and low levels of unemployment. Once the municipalities are
aggregated into counties, it is possible that all the counties will appear to have similar levels of
unemployment simply because the variability has been averaged out.
The aggregation or zoning effect occurs when statistical results change because of the
methodology used to form the areal spatial unit. Again, let us consider the case of Lithuanian
counties and municipalities. Each of the current ten counties has a given geographical area. Now
suppose that the municipalities in each of the counties were “shuffled” such that each county now
had a different set of municipalities, but the geographical area within each county is essentially
unchanged—some small changes in the geographical areas may be unavoidable. By maintaining
the same geographical size of each county the scale effect is avoided. However, because the
municipalities in each of the counties are now different, statistical results comparing counties will
also change.
In short, through the alteration of the number and arrangement of the spatial units a completely
different set of results may be generated in an analysis.
The MAUP can only truly be minimized if the spatial units of analysis are specifically the
appropriate units for a given analysis. In other words, the MAUP will disappear once we know the
“true” spatial units. Of course, such a level of knowledge, if ever attained, will be different for every
analysis: meaningful areal units for unemployment will necessarily be different from meaningful
areal units for heart attack risk. The difficulty is: what do we do given that we know it is a problem
and we do not currently have a solution?
The simple answer to this question is to do as many analyses as possible using different spatial
units. If all of the analyses produce the same (or very similar) qualitative and quantitative results,
then there is little to worry about—for example, all correlations would be positive and approximately
the same magnitude. If all of the analyses produce qualitatively similar results (all have positive
correlations), but the magnitude of these results have a range then err on the side of caution and
do not make any overstatements regarding the relationship between two variables. If, in the worst
case scenario, there is absolutely no consistency in the results (correlations are positive and
negative, and of varying magnitudes) then simply do not trust the results. More analysis with
different spatial units, variables, or a longer time series is necessary. Any policy based on such
results will most probably be ill-informed.
Unfortunately, most analyses only have one choice of spatial units to analyze. When this situation
occurs, as it most certainly will, the results must be interpreted with caution noting that they may be
sensitive to the particular spatial units analyzed.
3.2.3. The Modifiable Areal Unit Problem and the Ecological Fallacy
The ecological fallacy is usually grouped together with the modifiable areal unit problem, as it is
here, but it is a different issue that arises, particularly with geographical data. First noted by
Robinson (1950), the ecological fallacy occurs when the results of an analysis at one spatial resolution are inferred to mean something at another spatial resolution. For example, consider a
county with two municipalities. The county has an average monthly income of 1000 LTL. From
this statistic, one can then infer that the average monthly income in each of the two municipalities is
also 1000 LTL. However, in reality, the two average monthly incomes may be 500 and 1500 LTL.
The ecological fallacy then reduces to the fallacy of division: what is true of the whole is also true of
its parts. However, one only needs to view his or her own neighbourhood to notice a wide ranging
variation in monthly incomes such that any average is not representative of anyone. As such, what
is true of the whole is not necessarily true of the part.
What can be done to avoid the ecological fallacy? Unlike the MAUP, the ecological fallacy is easily
avoided through careful and thoughtful statements regarding inference. If your analysis only
assesses counties, then any inference you make can only involve counties, similarly for
municipalities, neighbourhoods and individuals. All too often academic research, government
research, and the media commit the ecological fallacy.
One of the troubles with recognizing the ecological fallacy is that it can also occur without the use of spatial units, which are its most obvious and visible form. Groups of people, defined by any social characteristic, may have an average tendency such as high intelligence. This, however, does not mean that every member of that group of people is highly intelligent.
In the study of patterns, there are three general forms of spatial distributions: uniform, random, and
clustered, all shown in Figure 4. It should be noted that although the patterns represented in Figure 4 are in point form and the discussion regarding pattern analysis presented here is also in regard to points, there are methods of pattern analysis for spatial units represented as polygons.
However, pattern analysis is traditionally in the realm of points so point pattern analysis is dealt
with in this section of the module. Cluster analysis using areal spatial units is undertaken in Topic
5, below.
Lastly, shown in Figure 4c, there is the clustered spatial point pattern distribution. With this spatial
distribution, the density of points varies to a high degree because some areas have the majority of
the points and other areas have none of the points. As such, the prediction of other points can
take place because if you are at a location that has a point, other points are likely to be close by,
and if you are at a location with no points, there are likely no other points close by. Such a
clustering of points may represent, for example, the location of retail outlets in a central business
district or the outbreak of disease related to an environmental toxin with a fixed source.
Two forms of pattern analysis are outlined below. The first, and most common, is nearest
neighbour analysis that uses the distances between points to determine whether or not a spatial
point pattern has a particular distribution. The second, quadrat analysis, places the points within
areal units to assess the form of spatial distribution.
It should be noted that in much of this literature a “point” means any location on a map and an
“event” is a point on the map where something has occurred and is represented by a dot: a case of
disease, a crime, or a traffic accident. Though for the remainder of this topic the term "point" is used, this difference becomes important when discussing edge effects in the following topic.
For example, as shown in Figure 5, there may be two dead-end streets (cul-de-sacs) that are
separated by a park. For simplicity, consider that the residences on either side of the park are not
even able to see each other because of trees, etc., but that the residences are actually quite close.
Using Euclidean distance to measure the distance between two points may place the two
residences on either side of the park as nearest neighbours when the actual network distance
between these two residences (a more realistic measure of distance in this situation) may in fact be
quite long. Of course, these network distances may be used in a nearest neighbour analysis. The
essence of the analysis remains the same, but the calculation of distance becomes more
computationally intense. The standard Euclidean distance is used in the example below.
First, the observed average nearest neighbour distance is calculated:

NND^{*} = \frac{\sum_{i=1}^{n} NND_i}{n} ,

where NND_i is the nearest neighbour distance for point i and n is the number of points. Note that if the spatial point pattern is perfectly clustered, meaning that all points are in one location, then the value of NND* is zero. Second, the expected average nearest neighbour distance for a random spatial distribution needs to be calculated:

NND_R = \frac{1}{2\sqrt{Density}} ,

where Density is the number of points divided by the area under study. These two variables are then used to calculate the standardized nearest neighbour index (R):

R = \frac{NND^{*}}{NND_R}
in order to determine the nature of the spatial point pattern. A property of the standardized nearest
neighbour index, R, is that it has a defined range (0 – 2.149) that allows for the following
classification:
Most often, a descriptive classification is not sufficient for an analysis, so inferential statistical
testing is necessary to determine the nature of the spatial point pattern. In such a test, the null
hypothesis is that the spatial point pattern is random, using the following test statistic:
Z_{NND} = \frac{NND^{*} - NND_R}{\sigma_{NND}}, \quad \text{where} \quad \sigma_{NND} = \frac{0.26136}{\sqrt{n \times Density}} .
Using a one-tailed statistical test and a probability value of 10 percent (p-value = 0.10), the critical value for ZNND is approximately 1.28. Therefore, if ZNND > 1.28 the spatial point pattern is significantly more dispersed than random; if ZNND < -1.28 the spatial point pattern is significantly more clustered than random; and if -1.28 < ZNND < 1.28 the spatial point pattern is not significantly different from a random Poisson spatial process. The following example illustrates an application of nearest neighbour analysis and its statistical test.
Figure 6 shows the locations of five (5) points in a study area that appear to be in a somewhat clustered spatial pattern. The study area is 100 square units (10 x 10).
Point   X   Y   Nearest neighbour   NND
A       2   2   B                   2.24
B       4   3   C                   2
C       4   5   B                   2
D       6   2   E                   1
E       6   1   D                   1
From these numbers, the remaining variables are easily shown to be: NND* = (8.24)/5 = 1.648,
NNDR = 1 / (2*sqrt(Density)) = 1/(2*sqrt(5/100)) = 1/(2*sqrt(0.05)) = 2.236, and R = 1.648 / 2.236 =
0.737. Consistent with the observation above, this spatial point pattern is more clustered than
random. The question now is whether or not that spatial clustering is statistically significant. This
question is answered through the calculation of ZNND = (1.648 – 2.236) / 0.523 = -1.13.
Consequently, the null hypothesis of a spatially random distribution cannot be rejected with this
statistical test. Therefore, though the distribution above appears to be more clustered than
random, this is not the case statistically.
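The worked example above can be reproduced with a short Python script; the coordinates and the 10 x 10 study area are those given in the text.

    import math

    points = {'A': (2, 2), 'B': (4, 3), 'C': (4, 5), 'D': (6, 2), 'E': (6, 1)}
    area = 100.0                                        # 10 x 10 study area

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # nearest neighbour distance for every point
    nnd = [min(dist(p, q) for other, q in points.items() if other != name)
           for name, p in points.items()]

    n = len(points)
    density = n / area
    nnd_obs = sum(nnd) / n                          # NND*: observed average nearest neighbour distance
    nnd_ran = 1.0 / (2.0 * math.sqrt(density))      # NND_R: expected value for a random pattern
    r = nnd_obs / nnd_ran                           # standardized nearest neighbour index
    se = 0.26136 / math.sqrt(n * density)           # standard error of NND
    z = (nnd_obs - nnd_ran) / se
    print(round(nnd_obs, 3), round(nnd_ran, 3), round(r, 3), round(z, 2))   # 1.647 2.236 0.737 -1.13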
Generally speaking, if each of the cells contains the same number of points (no variability), then
the spatial point pattern distribution is considered to be uniform, or dispersed; if the cells contain
very different numbers of points (large variability), then the spatial point pattern distribution is
considered to be clustered; and if there is a moderate amount of variability in the number of points
in each of the square cells, the spatial point pattern is considered to be random. These differences
are shown in Figure 7.
Though similar in form, the statistical test in quadrat analysis is simpler to execute than in nearest
neighbour analysis. The only statistics needed are the mean (or average) cell frequency and the
variance of the cell frequency, both classical descriptive statistics. Using these two variables, the
following statistic is calculated:
VMR = \frac{\text{Variance of the cell frequencies}}{\text{Mean of the cell frequencies}} ,
where VMR is an acronym for variance-mean ratio. If the VMR is equal to one, the spatial point
pattern distribution is considered to be random; if the VMR is greater than one, the spatial point
pattern distribution is considered to be more clustered than random; and if the VMR is less than
one, the spatial point pattern distribution is considered to be more dispersed than random. The
test statistic is a chi-square:

\chi^2 = VMR \times (m - 1) ,
where m is equal to the number of cells and the null hypothesis is a random spatial distribution.
Given that the variance of the number of points in each cell of Figure 7a is zero, as an example,
the statistical tests will be carried out using Figures 7b and 7c. The necessary variables are
provided in Table 3.
Table 3. Quadrat statistics for Figures 7b and 7c
                                   Figure 7b   Figure 7c
Variance of cell frequencies       2.5         21
Mean of cell frequencies           3           3
VMR                                0.833       7
Chi-square statistic               6.67        56
As shown in Table 3, Figure 7b shows definite signs of being a random spatial distribution, but
does show some evidence of being more dispersed than random. The results for Figure 7c,
however, show a clear degree of clustering with a chi-square statistic that easily rejects the null
hypothesis of a random spatial distribution—the relevant degrees of freedom for the critical values
is seven, nine observations minus the two parameters calculated (mean and variance). With a chi-
square statistic of 6.67, the spatial distribution of points in Figure 7b is not significantly different
from a random spatial distribution. As such, each of the examples shown in Figure 7 correspond
statistically with the appearances of their spatial point pattern distribution.
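A short Python sketch of the variance-mean ratio test is given below; the nine cell counts are invented for illustration and are not the exact frequencies behind Figure 7.

    def quadrat_test(counts):
        """Variance-mean ratio (VMR) and chi-square statistic for a list of quadrat cell counts."""
        m = len(counts)
        mean = sum(counts) / m
        variance = sum((c - mean) ** 2 for c in counts) / (m - 1)   # sample variance of the cell frequencies
        vmr = variance / mean
        chi_square = vmr * (m - 1)
        return vmr, chi_square

    # A hypothetical 3x3 quadrat layout: 27 points, heavily concentrated in three cells
    counts = [0, 1, 0, 2, 1, 0, 9, 8, 6]
    vmr, chi2 = quadrat_test(counts)
    print(round(vmr, 2), round(chi2, 1))   # VMR well above 1 suggests a clustered pattern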
A further difficulty that is spatial is in regard to the size of the study area. Consider the spatial point
patterns displayed in Figure 8. Figure 8a shows what is perceived to be a random spatial point
pattern whereas Figure 8b shows what is perceived to be a clustered spatial point pattern. The
difficulty that is present is that these are precisely the same spatial point patterns shown at different
scales: Figure 8a is at a cartographically larger scale (smaller area) than Figure 8b.
Suppose you were investigating the impacts of an environmental toxin that is released into the air
from an industrial plant at the centre of the spatial point pattern. If one were to analyze the effect of
this environmental toxin and a corresponding disease at a scale equivalent to that of Figure 8a
then the industrial plant that produces the environmental toxin would not be viewed as problematic.
However, if the scale of analysis was changed to that of 8b, and consider this scale appropriate for
the problem at hand, then the industrial plant may be shown to be the cause of the corresponding
disease.
The problem is knowing the appropriate scale of analysis. Sometimes this may be relatively easy:
there may be a scientific consensus regarding the problematic concentrations of the environmental
toxin in the atmosphere. The concentrations of the environmental toxin could simply be measured
at various distances from the industrial plant and have the study area circumscribed based on that
information. Not all such problems are resolved so “easily”, however. The most appropriate action
to undertake in such a situation would be similar to that for the modifiable areal unit problem, in
general: undertake the analysis at a number of scales that may be applicable and look for
consistency in the results.
Figure 8. a) Random? b) Clustered?
Though technically all spatial analysis, or any analysis for that matter, does have an edge, edges
pose a problem in particular for point pattern analysis. This is because, as discussed above, the
distances between two events or a randomly selected point and an event are used to calculate
descriptive statistics and used in inferential analysis. As such, distance-based point pattern
analysis methods suffer from the presence of edge effects.
In some cases, the edge does not pose a problem in the analysis. For example, a portion of the
Lithuanian border is on the Baltic Sea. Consequently, we would not expect there to be any events
that are outside of the edge—there may be some people, and corresponding events, located just off the coast of Lithuania, but these events can easily be placed on a land region close to the water for an analysis.
The problems occur with edges when the study area is part of a larger geographical region that the
underlying spatial process operates within. Consequently, any events occurring outside of the
study area but within the larger geographical region may interact with the events within the study
area (Diggle 2003). This is the edge effect. Consider the example shown in Figure 9.
Figure 9 shows the study area, its boundary (edge), and some events outside of the study area
that are related to the events inside of the study area. Such a situation may occur if someone is
able to obtain data for one municipality on the incidence of a disease, but not for any bordering
municipalities.
An excellent example of such a situation would be if there is an industrial plant located within one
municipality (the study area) that creates pollution that travels through the atmosphere. The
pollution from this industrial plant is not thought to impact neighbouring municipalities because of
distance, or neighbouring municipalities may not cooperate with the research project because of
jurisdictional issues, for example. Regardless, the events that are present outside of the study
area really should be included in the analysis of the effects stemming from the pollution of the
industrial plant.
It should be mentioned that this situation is the case in all statistical analyses. The relevant
information is not always available, so important information is not included. However, as with all
statistical analyses, there are methods to deal with these shortcomings stemming from the lack of
available data.
Of these three methods for accounting for the presence of edge effects, the most common method
of dealing with this problem is to create a buffer zone. A buffer zone is an area that is created a
certain distance from the outer edge of the study area. This is shown in Figure 10.
Events falling within the buffer zone are not analyzed directly, but they are included in the distance calculations for events in the remainder of the study area (Bailey and Gatrell 1995). Effectively, what this procedure does is allow the researcher to simulate having those points outside of the study area and then undertake the analysis without concern for imposing bias on the results. As such, the use of the buffer method, as well as the adjustment
method, eliminates the bias from the edge effects with the cost of increased variance from
decreased data (Diggle 2003).
Regarding the size of the buffer zone, there is no specific rule used in order to determine the
distance of the buffer zone from the outside of the study area. Rather, the process of creating a
buffer zone for the analysis may be applied repeatedly as a sensitivity analysis.
One of the applications of kernel density estimation is in the identification of hot spots, usually in
the case of some reported crime. Rather than trying to discern patterns on a map that potentially
has tens of thousands of events—in many cases there are so many crimes (think of personal theft
or automotive theft) that the entire map is covered in point events—kernel density estimation shows
where the areas with the most points, the “hottest”, are located. This spatial statistical method has
common availability as well as visual appeal.
A sensitivity analysis may be particularly important in many applications because there may be no
consensus on the grid cell size, but a range in grid cell sizes that different people have used. The
user can then perform the kernel density estimation using a variety of grid cell sizes within that
range in order to uncover any potential changes in the results. Ratcliffe (1999) has also proposed
a method for determining the grid cell size when no other method is available: draw a rectangle around the study area that is as small as it can be without entering the study area, measure the shorter of the sides of that rectangle (in metres), and then divide that length by one hundred and fifty.
The primary issue is to determine the appropriate radius, or bandwidth, which is the portion of the kernel density calculation that is most sensitive to changes. As with the grid cell size, discussed above, the appropriate bandwidth should be set according to the particular application. Other research should be consulted to determine the appropriate bandwidth, or the appropriate range over which to undertake a sensitivity analysis.
A nice feature of kernel density estimation is that the values generated for each of the grid cells are in a meaningful unit for describing most spatial point distributions—the number of events per square kilometre, for example. Consequently, the values in each grid cell can be compared in a
meaningful way: grid cell 122 has a value of 100 and grid cell 123 has a value of 50, so grid cell
122 is twice as dense as grid cell 123.
Let’s consider a couple of simple, short examples. First, there is the equation for a kernel density.
This particular equation is for a quartic kernel:
\lambda = \sum_{d_i \le r} \frac{3}{\pi r^2}\left(1 - \frac{d_i^2}{r^2}\right)^2 ,

where λ is the intensity value (kernel), π is equal to 3.14..., r is the radius (bandwidth), and di is the distance between the centroid and event i within the bandwidth. Once the bandwidth is chosen, the term 3/(πr²) is just a constant so it can be ignored in these examples. All that is left is (1 - di²/r²), and let us also ignore the squared term on the outside of the brackets, again for simplicity. Set r = 10. Now let us consider two possible scenarios.
First there will be four points within the bandwidth: two of them will be two units from the centroid
and two will be three units from the centroid. In the second scenario there will be two points within
the bandwidth, both nine units from the centroid. The calculations for both examples are below.
In the first example, the value of λ is shown to be 3.74 and in the second example the value of λ is shown to be 0.38—of course, the actual values of λ would be different if the complete formula was used.
Example 1:

\left(1 - \frac{4}{100}\right) + \left(1 - \frac{4}{100}\right) + \left(1 - \frac{9}{100}\right) + \left(1 - \frac{9}{100}\right) = \frac{96}{100} + \frac{96}{100} + \frac{91}{100} + \frac{91}{100} = \frac{374}{100} = 3.74

Example 2:

\left(1 - \frac{81}{100}\right) + \left(1 - \frac{81}{100}\right) = \frac{19}{100} + \frac{19}{100} = \frac{38}{100} = 0.38
The first example has a much higher value of λ for two reasons. First, there are more events within the bandwidth, and second, the events that are within the bandwidth in example 1 are closer to the centroid. Consequently, the value of λ depends on both of these factors, so it is possible for one grid cell value to be higher than another even if the latter grid cell has more events within its bandwidth. To provide a trivial example, if there are no events within the bandwidth, λ = 0.
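The two worked examples can be reproduced in Python, together with the complete quartic kernel formula (constant and outer square included); the bandwidth r = 10 and the event distances are those used above.

    import math

    def quartic_kernel(distances, r):
        """Quartic kernel intensity at a grid cell centroid, for events within bandwidth r."""
        return sum((3.0 / (math.pi * r ** 2)) * (1.0 - d ** 2 / r ** 2) ** 2
                   for d in distances if d <= r)

    def simplified(distances, r):
        """The simplified form used in the worked examples: constant and outer square ignored."""
        return sum(1.0 - d ** 2 / r ** 2 for d in distances if d <= r)

    r = 10.0
    print(round(simplified([2, 2, 3, 3], r), 2))   # 3.74, as in the first example
    print(round(simplified([9, 9], r), 2))         # 0.38, as in the second example
    print(quartic_kernel([2, 2, 3, 3], r))         # the same comparison holds with the full formula
    print(quartic_kernel([9, 9], r))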
The methodological concern is probably the “worst” criticism of the use of kernel density estimates.
There are two types of point data. The first type of point data represents discrete events such as a
criminal occurrence, a traffic accident, and a case of disease. The second type of point data
represents a measurement of a continuous surface such as temperature or pollution levels. Given
that the second type of point data actually represent a continuous surface, it makes sense to
transform the points into a continuous surface for analysis—for budgetary reasons, all locations
cannot be measured to calculate temperatures across a country, for example.
The first type of point data, however, is not a continuous phenomenon. Consequently, there are
many people who state that you cannot meaningfully transform discrete point events into a
continuous surface because the underlying data themselves are not continuous. Those who map
criminal occurrences are particularly guilty of turning discrete events into continuous surfaces.
However, if one considers the continuous surface to be a measure of risk, this criticism becomes
less problematic. Whether or not a crime occurs at one intersection or another two intersections
away may simply be random, so the risk of criminal victimization may be equal at both
intersections. The difficulty arises when there is an intersection (and a small area around it), for
example, that never has any crime but the kernel density estimation procedure provides that
location with a risk of crime. Consequently, as with any statistical analysis, caution must be used
when interpreting kernel density results.
The second limitation is one that is relatively easy to resolve. It arises because many researchers who calculate kernel density estimates use only the event data for that calculation. The resulting map will indeed show where the hotspots of that activity are, but those hotspots may be misleading.
Crime rates, disease rates, traffic accident rates, etc. are calculated because if one wants to have
an accurate measure of risk, one must consider the population at risk of crime, disease, and traffic
accidents. Municipalities that have larger populations do not necessarily have larger volumes of
crime, disease, and traffic accidents. What is more important to know is whether or not there is
more crime, disease, or traffic accidents after controlling for the population at risk, usually the
population of the area under study—traffic volume in the case of traffic accidents. So in Figure 14,
is there really a high volume of some activity or are there simply more people in that area?
In order to address this limitation, there is a kernel density estimation technique called the dual
kernel density. This dual kernel density needs two sets of point data: the event data of interest,
and the population at risk of encountering that event. The difference in the resulting maps can be
astounding. This is shown when considering both Figures 14 and 15.
Suppose a Moran’s I statistic was calculated using Lithuanian municipalities for automotive theft.
The result would be Moran’s I = 0.029. This result shows a very small, and statistically
insignificant, degree of positive spatial autocorrelation. However, if you were to calculate a local
version of Moran’s I, described in detail below, you would find that the local Moran’s I results range
from -4.235 to 2.645, having an average of 0.134—each of the sixty municipalities in Lithuania has
a local Moran’s I statistic. Additionally, four of these local Moran’s I statistics are significantly
different from zero, contrary to the global Moran’s I result.
Because of this known variation within most study areas there has been a relatively recent movement within spatial statistics that focuses on the differences in statistical results across space, rather than simply assuming that these differences do not exist. Indeed, one of the criticisms of quantitative geography has been that it searches for global generalities rather than considering the local. As such, this movement in spatial statistics, or spatial analysis more generally, has been coined local analysis, local modelling, and local spatial statistics.
Of course this is a strong case for the use of local spatial statistics, but there are more reasons why
they are fruitful to an analysis. Local spatial statistics are mappable. As stated above in the
Lithuanian municipality example, each areal spatial unit has a local spatial statistic associated with
it. Consequently, this is simply yet another variable in the data set that can be mapped. Because
this variable can be mapped, it is considered to be GIS-friendly. And lastly, local spatial statistics
can be used to search for hotspots: think of two or more spatial units that have statistically
significant positive local spatial autocorrelation.
The first well-known development in local spatial analysis is in regard to spatial point patterns, the
geographical analysis machine of Openshaw et al. (1987). The geographical analysis machine,
though extended by a number of researchers, is a computationally intense algorithm that we will
not cover in any detail here. However, its general formulation is worthy of note because it is
consistent across most local spatial statistical analyses.
First, there is a method of defining sub-regions within the entire study area; second, there is a
method of describing the spatial point pattern within each of these sub-regions; third, there is a
methodology for identifying those spatial point patterns that are significantly different from the
average; and fourth, there is a method of mapping the results. Though there are, at times,
substantial differences between the geographical analysis machine and more recent local spatial
statistical methodologies, these characteristics are present, in some form, in all local spatial
statistical methods.
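The most widely used local spatial statistic is arguably the local version of Moran's I. A commonly cited form of the statistic for spatial unit i, following Anselin (1995) and written here in the notation of the global statistic (observed values x_i, their mean x̄, and spatial weights w_ij), is:

I_i = [ (x_i − x̄) / ( Σ_{k=1..n} (x_k − x̄)^2 / n ) ] · Σ_{j} w_ij (x_j − x̄)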
where all of the variables listed here are defined above in the section covering spatial
autocorrelation. Note the high degree of similarity between this local version of Moran's I and the
global version of Moran's I; it is this similarity that allows the measure to be classified as a local
indicator of spatial association. Though the formula is complex, as with the global indicator of spatial
association, the local Moran's I is easily computed within the ArcGIS environment using ArcToolbox.
There are a number of steps to undertake in order to perform a local cluster analysis using the local
Moran’s I statistic. First, the local Moran’s I statistic needs to be calculated. Using ArcGIS, this
produces two new variable columns called LMiContig and LMzContig. LMiContig is the local
Moran’s I statistic value and LMzContig is the z-statistic associated with each LMi value—Contig is
in reference to the use of a contiguity spatial weights matrix, discussed above.
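For readers working outside ArcGIS, the calculation can be sketched as follows in Python (assuming NumPy, a hypothetical rate vector, and a row-standardized binary contiguity matrix; this is a sketch of the statistic itself, not of the ArcGIS significance testing).

# Minimal sketch of a local Moran's I calculation with hypothetical data.
import numpy as np

def local_morans_i(values, weights):
    """values: 1-D array of rates; weights: n x n spatial weights matrix."""
    z = values - values.mean()                 # deviations from the mean
    m2 = (z ** 2).sum() / len(values)          # second moment of the deviations
    return (z / m2) * (weights @ z)            # one statistic per spatial unit

# hypothetical rates for five areal units and a binary contiguity matrix
rates = np.array([12.0, 15.0, 3.0, 4.0, 14.0])
w = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 1],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 1, 0, 1, 0]], dtype=float)
w = w / w.sum(axis=1, keepdims=True)           # row standardization

local_i = local_morans_i(rates, w)             # analogous to the LMiContig column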
The first variable, LMiContig, is shown in Figure 16. The degrees of green (light to dark) represent
increasing negative values of the local Moran’s I statistic and the degrees of red (light to dark)
represent the increasing positive values of the local Moran’s I statistic for Lithuanian municipalities.
The white municipalities are spatial units with local Moran’s I statistics that are close to zero. As
should be rather evident, there is a large degree of variation in the local Moran’s I statistics across
Lithuanian municipalities with many of the local Moran’s I statistics being negative, unlike the
statistically insignificant positive global spatial autocorrelation. However, despite this geographical
variation, it is also important to identify the significance of these local Moran’s I statistics as well as
determine the nature of any local clusters.
Figure 16. Local Moran's I, Automotive Theft Rate per 10 000 Population
The significance levels of the local Moran’s I statistics are shown in Figure 17 (light to dark
increasing in significance), with only the areas that are solid black being statistically significant. A
very different picture of the spatial autocorrelation in Lithuanian municipalities should be emerging
for anyone viewing both of these maps and knowing that the global spatial autocorrelation is
statistically insignificant. Though there is a high degree of variation in the local Moran’s I statistics,
many of them are statistically insignificant, mirroring the global Moran’s I statistical result. The next
step is to select those Lithuanian municipalities that have statistically significant local Moran’s I
statistics, determine the nature of those Lithuanian municipalities and visualize them.
The categories represented are as follows: High – High (red), High – Low (pink), and Low – High
(blue)—there is also the Low – Low classification that is not represented on this map.
Municipalities with High – High and Low – Low categories have positive local spatial
autocorrelation (similar local Moran’s I values), and municipalities with High – Low and Low – High
categories have negative local spatial autocorrelation (not similar local Moran’s I values).
The following is a description of each of the four possible categories: municipalities that are
labelled as High – High have statistically significant positive local Moran’s I statistics, have a high
automotive theft rate, and are surrounded by other municipalities that have high automotive theft
rates; municipalities that are labelled as High – Low have statistically significant negative local
Moran’s I statistics, have a high automotive theft rate, and are surrounded by other municipalities
that have low automotive theft rates; municipalities that are labelled as Low – High have statistically
significant negative local Moran’s I statistics, have a low automotive theft rate, and are surrounded
by other municipalities that have high automotive theft rates; and lastly, municipalities that are
labelled as Low – Low have statistically significant positive local Moran’s I statistics, have a low
automotive theft rate, and are surrounded by other municipalities that have low automotive theft
rates.
The three municipalities that have high rates of automotive theft (Klaipeda town municipality and
Palanga town municipality in the West and Kaunas town municipality in the East) are on border
regions in Lithuania. This finding may have significance towards any automotive theft crime
prevention. However, this may simply be a coincidence. Also, despite being a neighbour to
municipalities with high automotive theft rates, Kazlų Rūda municipality has a low automotive theft
rate: why would a municipality close to a small cluster of municipalities with high rates of automotive
theft have such a low rate? This question cannot be answered here, but such questions could not
even have been asked if only global spatial autocorrelation had been used in the analysis.
References
1. Anselin, L. (1995) Local indicators of spatial association – LISA. Geographical Analysis 27: 93 –
115.
2. Bailey, T.C. and A.C. Gatrell (1995) Interactive Spatial Data Analysis. Harlow, England:
Prentice Hall.
3. Brantingham, P.L. and P.J. Brantingham (1995) Location quotients and crime hot spots in the
city. In C.R. Block, M. Dabdoub, S. Fregly (eds.) Crime Analysis through Computer Mapping.
Washington, DC: Police Executive Research Forum, pp. 129 – 149.
4. Brantingham, P.L. and P.J. Brantingham (1998) Mapping crime for analytic purposes: location
quotients, counts and rates. In D. Weisburd and T. McEwen (eds.) Crime Mapping and Crime
Prevention. Monsey, NY: Criminal Justice Press, pp. 263 – 288.
5. Diggle, P.J. (2003) Statistical Analysis of Spatial Point Patterns, Second Edition. London:
Arnold Publishers.
6. Isard, W., I.J. Azis, M.P. Drennan, R.E. Miller, S. Saltzman, and E. Thorbecke (1998) Methods
of Interregional and Regional Analysis. Brookfield, VT: Ashgate Publishing Limited.
7. Miller, M.M., L.J. Gibson, and N.G. Wright (1991) Location quotient: a basic tool for economic
development studies. Economic Development Review 9: 65 – 68.
8. Openshaw, S. (1984) The Modifiable Areal Unit Problem. CATMOG (Concepts and Techniques in
Modern Geography) 38. Norwich: Geo Books.
9. Openshaw, S., M. Charlton, C. Wymer, and A. Craft (1987) Developing a mark 1 geographical
analysis machine for the automated analysis of point data. International Journal of Geographical
Information Systems 1: 335 – 358.
10. Ratcliffe, J. (1999) Hotspot Detective for MapInfo Helpfile Version 1.0. http://jratcliffe.net/ware/index.htm
11. Robinson, W.S. (1950) Ecological correlations and the behaviour of individuals. American
Sociological Review 15: 351 – 357.
12. Tobler, W. R. (1970) A computer model simulation of urban growth in the Detroit region.
Economic Geography 46: 234 – 240.
Additional Resources
Classical Correlation
http://www.statsoft.com/textbook/stathome.html
Introductory Statistics:
http://www.psychstat.missouristate.edu/sbk00.htm
HyperStat Online:
http://davidmlane.com/hyperstat/
http://www.jratcliffe.net/research/maup.htm
http://www.jratcliffe.net/research/ecolfallacy.htm
http://www.ojp.usdoj.gov/nij/maps/ncj209393.html
http://www.icpsr.umich.edu/CRIMESTAT/files/CrimeStatChapter.8.pdf
http://www.icpsr.umich.edu/CRIMESTAT/download.html
4 Geostatistics
Geostatistics is one of the most challenging aspects of spatial analysis. It involves the study of the
interpolation, smoothing, estimation and prediction of surface values based on discrete
measurements. This module provides an introduction to geostatistics, including elements of
exploratory spatial data analysis, structural analysis (the calculation and modeling of the surface
properties of nearby locations), and surface prediction together with the assessment of results.
Methods discussed include Inverse Distance Weighting, trend analysis with global and local
polynomials, spline interpolation, and kriging prediction techniques. The notion of spatial
dependency and auto-correlation is also analyzed. Categorization of geostatistical methods,
recommendations for applying these methods, and validation of results are also discussed.
Module Outline
Precipitation forecasts may be based on the observations of meteorological stations with limited
spatial distribution. Limitations of such an approach include the high cost of fieldwork and the
inaccessibility of some regions. For example, in Lithuania a meteorological network of 16 stations
was set up to monitor precipitation and other climate parameters. Annual precipitation for 2005
ranges from 396 mm to 915 mm (see the table below).
The table below shows the precipitation levels in 2005 and the coordinates of the observation
stations.
Station (VARDAS)   ID   mm    X        Y
Biržai             1    584   546484   6229692
Telšiai            13   620   389573   6206379
Utena              15   701   601070   6152592
Raseiniai          10   484   443915   6138743
Šilutė             12   835   339702   6137606
Dotnuva            2    432   492560   6136744
Nida (Neringa)     8    915   310049   6134547
Ukmergė            14   655   549330   6123740
Kybartai           4    649   420364   6056681
Varėna             16   861   535976   6012783
Lazdijai           7    722   468284   6010907
Klaipėda           5    752   323400   6174934
Šiauliai           11   396   456608   6200104
Panevėžys          9    527   522840   6177229
Kaunas             3    641   499666   6085879
Laukuva            6    642   388910   6166667
Vilnius            17   783   571223   6055189
It may be asked “What is the best way to visualize the spatial variation in the data?” One answer
would be “through geographic mapping”. Mapping here can include intermediate points, with no
observations but predicted values. These locations, with and without data, can be visualized in a
grid covering the study area.
Figure 1 : The geocoded Lithuanian meteorological station network based on the table data and an
interpolated surface
In the prediction of the intermediate locations, interpolation methods such as kriging can be used.
For accuracy, principles of optimal spatial prediction can be applied; for example, weighted
averages of the available observations may be used, with greater weights given to observations
closer to the prediction point.
A unique aspect of geostatistics is the use of regionalized variables, which are variables that fall
between random variables and completely deterministic variables. Regionalized variables describe
phenomena with geographical distribution (e.g. elevation of ground surface or temperature). For
regional data, the locations are known and fixed.
Although there may be spatial continuity in the observed phenomena, it is not always possible to
sample every location. Therefore, unknown values must be estimated from the size, shape,
orientation, and spatial arrangement of the sampled locations. If the spatial pattern changes, the
estimated values also change. For geostatistical data, the sampling and estimating of
regionalized variables create a pattern, which may be mapped as a raster grid or contour.
Therefore, geostatistics is about the analysis of spatially referenced phenomena observed via
discrete measurements. Geostatistics uses spatial coordinates to formulate a continuous model of
the analyzed phenomenon based on interpolation, smoothing, estimation and/or prediction
techniques. These techniques use information on the spatial coordinates and the spatial
distribution of the empirical data.
Thus in correlation analysis, the linear relationship between two random variables is
determined (e.g. between soil pH and crop production; temperature and elevation, etc.).
Geostatistics is primarily concerned with the auto-correlation function, or the correlation
between observations separated by measured distances and directions. Therefore, the
values are: (1) the measurement of a variable at a random point; and (2) the spatial pattern
of distance and directions between the measured variables at each point. Dependence
between these two sets of values may be measured with a variogram (or semi-variogram).
This tool will be discussed later in this module.
All data sets from a given area are implicitly related by their coordinates and are used to
model spatial structure. As there may be a spatial structure to the data, values at sample
points cannot be assumed to be independent. This contrasts with classical statistics, which
assumes independence, at least within the sampling strata. These contrary assumptions
(classical statistics and geostatistics) have major implications for sampling design and
statistical inference.
The key question of geostatistics is “How can we measure spatial dependence?” Important
measurement tools are the semivariogram and auto-covariance functions. The use of these tools
will be discussed later. The variogram is an effective tool for predicting the value of non-measured
points, using distance and direction with measured points, and the screening effect of clusters of
such information.
Figure 3 : Geostatistical predictors are weighted averages of available observations. Observations that are more
strongly correlated with the prediction point (not necessarily nearer to it) should be given more weight in the
predictor. Screening effect: clusters of observations are down-weighted.
3. Splines (piecewise polynomials) are deterministic smoothers based on local or global
neighborhoods.
4. Kriging is a stochastic predictor and can work as a smoother or an exact interpolator depending
on the variogram and the data. Data z(s_i) are observations from random variables Z(s_i) at locations s_i
in a study area D that form a stochastic process {Z(s_i) : s_i ∈ D, i = 1, 2, …}. This process is
composed of at least two components, Z(s_i) = μ(s_i) + ε(s_i) ("trend plus residual"), or of three
components, Z(s_i) = [μ(s_i) + ε(s_i)] + δ(s_i) ("signal plus noise"), where:
μ(s_i) = deterministic trend,
ε(s_i) = stochastic, spatially correlated residual,
δ(s_i) = random noise, uncorrelated.
Geostatistical techniques or methods can be categorized based on the three criteria presented in
Table 2.
Table 2.
Global or local? A global method uses every known or sample point available to estimate an
unknown value; a local method uses a sample of known points to estimate an unknown value.
Exact or inexact? Is there a difference at the sample locations? For exact interpolation there is
none; inexact (approximate) interpolation predicts a value at the sample location that may differ
from the measured value.
Deterministic or stochastic? The methods discussed in this module fall into the categories below.

                Global                       Local
Deterministic   Trend Surface (inexact)      Inverse Distance Weighted (exact), Splines (exact)
Stochastic      Regression (inexact)         Kriging (exact)
Interpolation only works where values are spatially dependent, or spatially auto-correlated, that is,
where nearby locations tend to have similar Z values. Examples of spatially auto-correlated
features are elevation, property value, crime levels, and precipitation. An example of a
non-auto-correlated variable is zeppelins consumed per household. Where values across a landscape
are geographically independent, interpolation does not work, because the value at (x, y) cannot be
used to predict the value at (x+1, y+1).
Figure 5 : One and two-dimensional interpolation: unknown value is interpolated based on values and
distances to neighboring control points.
The Inverse Distance Weighted method follows the principle of the First Law of Geography. IDW
determines cell values using a linearly weighted combination of sample points. The technique
estimates the Z value at an unknown point by giving greater weight to nearer points, thus creating
an inverse relation between weighting and distance. This can be described mathematically by
Shepard's IDW formula (Shepard, 1968):

z_j = ( Σ_{i=1..n} z_i / d_ij^p ) / ( Σ_{i=1..n} 1 / d_ij^p )

where z_j is the value to be estimated at point j, z_i are the known values at the n sample points,
d_ij are the distances between point j and the sample points i, and p is the power parameter.
In the linear IDW interpolation case (power p = 1), with known values z_1 and z_2 and
corresponding distances to point A of d_1A and d_2A, the unknown value at point A is calculated as:

z_A = z_1 + (z_2 − z_1) · d_1A / (d_1A + d_2A) = (z_1 w_1 + z_2 w_2) / (w_1 + w_2)

where the weights are w_1 = 1 / d_1A and w_2 = 1 / d_2A (when normalized, the weights sum to
one, Σ_{i=1..n} w_i = 1).
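A minimal sketch of Shepard's formula in Python (assuming NumPy and hypothetical sample points) may make the role of the power parameter p concrete.

# Minimal sketch of Shepard's inverse distance weighting with hypothetical samples.
import numpy as np

def idw(xy_known, z_known, xy_unknown, p=2.0):
    d = np.sqrt(((xy_known - xy_unknown) ** 2).sum(axis=1))   # distances d_ij
    if np.any(d == 0):                                        # exact interpolator:
        return z_known[d == 0][0]                             # return the sample value itself
    w = 1.0 / d ** p                                          # inverse distance weights
    return (w * z_known).sum() / w.sum()                      # Shepard's formula

# hypothetical samples, and the two-point linear (p = 1) case described above
samples = np.array([[0.0, 0.0], [10.0, 0.0]])
values = np.array([5.0, 9.0])
z_a = idw(samples, values, np.array([4.0, 0.0]), p=1.0)
# z_a is about 6.6, matching z1 + (z2 - z1) * d1A / (d1A + d2A)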
IDW Features: IDW can provide a good preliminary description of an interpolated surface. There
are no assumptions required of the data, but there is no assessment of prediction errors. IDW
works best for dense, evenly spaced sample points. It does not consider trends in the data and
cannot make estimates above the maximum or below the minimum sample values.
Figure 6: The IDW resulting surface will pass through the sample points, where the maximum and minimum
values in the interpolated surface can only occur at known points
The IDW model parameters: The characteristics of the interpolated surface in IDW interpolation
can be controlled by the following parameters:
By value of the power function
By the neighborhood search strategy, which limits the number of input points that can be
used in calculating the value of each interpolated point. The limit on control points can be
defined by:
o A number of closest sample points used
o A search radius used for search neighborhood
o A shape of search neighborhood
o A combination of above strategies
The Power Function: Depending on the site conditions, the distance may be weighted in different
ways. If p = 1, this is a simple linear interpolation between points. In many cases, it has been found
that p = 2 produces better results. In this case, close points are heavily weighted and more distant
points are lightly weighted (points are weighted by 1/d_ij^2). At other sites, p has been set to other
powers and yielded reasonable results. By changing the power, one can adjust the relative
influence of the sample points. Increasing the power means that the output values become more
localized and less averaged. Lowering the power produces a more averaged output, because
sample points farther away become more and more influential, until (at p = 0) all of the sample
points have the same influence.
Figure 7 : IDW weight as a function of distance for powers p = 0, 1, 2, 4 and 6
A search radius, with the shape of the neighborhood defining the search boundaries, defines the
sample size. Some or all of the samples that fall within the radius can be used to calculate the
unknown point value.
Other parameters can be established, placing further restrictions on the selection of locations within
the neighborhood. There can be a fixed search radius, which will use only the samples contained
within it, regardless of number. In such a case, the distance of the radius of the circle used to
search for points around each interpolated location and a minimum/maximum number of points that
must be found can be specified.
Figure 8 : A search radius strategy - a maximum of the 15 closest points within the search radius is used for
interpolation
A search radius also can be variable. A variable search radius will expand until the specified
sample size is found. One can specify the number of points to search for when calculating a value
for each interpolated cell and a maximum distance for the search radius.
The shape and structure of search radius also can be modified. Spatial dependencies (auto-
correlation) may depend only on the distance between two locations, termed isotropy. If there are
no directional influences in the data (isotropy), then a neighborhood can be a circle.
However, it is possible that the same autocorrelation value may occur at different distances when
considering different directions. In this case, there is greater variation in some directions than in
other directions. This directional influence can be seen in the semi-variogram and is called
anisotropy. If there is directional influence in data (anisotropy), then a neighborhood can be an
ellipse with the major axis running in the direction of the change. If the neighborhood is sectored,
then the constraints can be applied to each sector.
Other restrictions on the search neighborhood are physical barriers, such as mountain ridges,
which may prevent the interpolator from using sample points on one side of them. Modifications of
the IDW method can include barriers in the analysis. A barrier can be a polyline dataset used as a
break that limits the search for input sample points.
IDW Summary:
IDW is one of the most commonly used interpolation techniques.
The Inverse Distance Weighted method is:
o a geostatistical method;
o exact (in its classical form);
o local;
o deterministic.
There are different modifications of Shepard's IDW method, none being perfect for every
application. Although the methods differ in their weighting functions (for example, 1/(1 + c·d^2),
e^(−c·d) or e^(−c·d^2) could be used, where c is a constant) and in the number of observations
used, all IDW modifications use the locations and values at sampling locations to interpolate the
variable of interest at unmeasured locations. Each method produces different results, even with the
same data.
IDW accuracy is often judged by the root mean square (RMS) error of the differences between
the measured (and/or control point) values and the interpolated values.
ensuring that the sum of the squared deviations from the trend surface is a minimum. An estimated
trend, i.e. μ(s_i) or Ẑ(x, y), is considered the deterministic component of the stochastic process.
Figure 10 : Each original observation is considered to be the sum of a deterministic polynomial function of the
geographic coordinates plus a random stochastic residual (observation = trend + residual).
The global polynomial for the trend analysis is a linear function of the geographic coordinates.
Here Z(x, y) is the data value at the given location, the a's are coefficients, and the terms are
combinations of the geographic coordinates X and Y. Polynomial trend-surface analysis is basically
a linear regression technique, but it is applied in two and three dimensions instead of just fitting a
line.
Any desired degree of the polynomial can be chosen. In practice, low degrees (first, second and
third order) are used for trend analysis. A flat surface is described by Z(x, y) = a_0. A linear plane,
which is a first-order polynomial, is Z(x, y) = a_0 + a_1 X + a_2 Y. Allowing for one bend, a
second-order (quadratic) polynomial is Z(x, y) = a_0 + a_1 X + a_2 Y + a_3 X^2 + a_4 XY + a_5 Y^2,
and so forth. The unknown a coefficients are found by solving a set of simultaneous linear
equations which include the sums of powers and cross products of the X, Y, and Z values.
In matrix notation, Z = B a, where Z is a column vector of the known values, a is a column vector of
the unknown a coefficients, and B is a square matrix derived from the X and Y values of the known
geographic locations. The least-squares solution is a = B^(-1) Z, where B^(-1) is the inverse of the
matrix B. An example of the solution for computing a first-order (linear) trend surface
Z(x, y) = a_0 + a_1 X + a_2 Y with the least-squares method is given below.
Point X Y Z Value
1 69 76 20.820
2 59 64 10.910
3 75 52 10.380
4 86 73 14.600
5 88 53 10.560
0 69 67 ?
To solve for the three unknown a coefficients with n data points, the following three normal
equations are available:

Σ Z_i = n·a_0 + a_1·Σ X_i + a_2·Σ Y_i
Σ X_i Z_i = a_0·Σ X_i + a_1·Σ X_i^2 + a_2·Σ X_i Y_i
Σ Y_i Z_i = a_0·Σ Y_i + a_1·Σ X_i Y_i + a_2·Σ Y_i^2

(all sums run over i = 1, …, n). In matrix form:

| n       Σ X_i      Σ Y_i     |   | a_0 |   | Σ Z_i     |
| Σ X_i   Σ X_i^2    Σ X_i Y_i | · | a_1 | = | Σ X_i Z_i |
| Σ Y_i   Σ X_i Y_i  Σ Y_i^2   |   | a_2 |   | Σ Y_i Z_i |
The linear function is defined as Z(x, y) = −10.094 + 0.020X + 0.347Y. Once the coefficients have
been estimated, the polynomial function can be evaluated at any point within the study area (e.g.
Z(69, 67) = −10.094 + 0.020·69 + 0.347·67 = 14.535). It is a simple matter to create a grid matrix of
values by substituting the coordinates of the grid nodes into the polynomial and calculating an
estimate of the surface for each node. Because of the least-squares fitting procedure, no other
polynomial equation of the same degree can provide a better approximation of the data.
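The coefficients above can be reproduced with a short least-squares computation. The following minimal sketch (assuming NumPy) rebuilds the design matrix from the five sample points in the table and recovers approximately the same a coefficients.

# Minimal sketch: fit the first-order trend surface Z = a0 + a1*X + a2*Y to the
# five sample points from the table above by least squares.
import numpy as np

x = np.array([69.0, 59.0, 75.0, 86.0, 88.0])
y = np.array([76.0, 64.0, 52.0, 73.0, 53.0])
z = np.array([20.820, 10.910, 10.380, 14.600, 10.560])

B = np.column_stack([np.ones_like(x), x, y])      # design matrix [1, X, Y]
a, *_ = np.linalg.lstsq(B, z, rcond=None)         # a is roughly [-10.094, 0.020, 0.347]

z0 = a @ np.array([1.0, 69.0, 67.0])              # prediction at (69, 67), roughly 14.535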
The global polynomial surface changes gradually and captures coarse-scale patterns in the data.
Global polynomial interpolation creates a slowly varying surface using low-order polynomials that
possibly describe some physical process (e.g., pollution and wind direction). The calculated
surfaces are highly susceptible to outliers (extremely high and low values, especially at the edges).
Global polynomial estimation can be used for fitting a surface to the sample points when the
surface varies slowly over the area of interest. It is also applied to examining and/or removing the
effects of long-range or global trends. Trend polynomial analysis can help identify global trends
in the input dataset. Global polynomial approximation can also be used for regression analysis
between two or more variables.
Local polynomial interpolation fits the polynomial by minimizing the weighted sum of squared
differences

Σ_{i=1..n} w_i (Z(x_i, y_i) − μ_0(x_i, y_i))^2,   with weights   w_i = exp(−3 d_i0 / a),

where μ_0(x_i, y_i) is the value of the polynomial, d_i0 is the distance between the point and the center
of the window, and a is a parameter that can be used to control how fast the weights decay with
distance.
Local polynomial interpolation fits a polynomial of the specified order (e.g. zero, first, second or third)
using only the points within the defined neighborhood. The neighborhoods overlap, and the value
used for each prediction is the value of the fitted polynomial at the center of the neighborhood. A
first-order polynomial looks like μ_0(x_i, y_i) = β_0 + β_1 x_i + β_2 y_i; a second-order polynomial is
μ_0(x_i, y_i) = β_0 + β_1 x_i + β_2 y_i + β_3 x_i^2 + β_4 y_i^2 + β_5 x_i y_i, and so on. The
minimization occurs over the parameters {β_j}.
Figure 13 : Local polynomial interpolation fits many polynomials, each within specified overlapping
neighborhoods. Thus, local polynomial interpolation produces surfaces that account for more local variation.
Global polynomial interpolation is good for creating smooth surfaces and for identifying long-range
trends in the dataset. However, in the earth sciences the variable of interest usually has short-
range variation and a long-range trend. When the dataset exhibits short-range variation, local
polynomial interpolation maps can capture the short-range variation.
Splines Summary:
Radial basis functions are deterministic interpolators that are exact.
There are more parameter decisions than with IDW; therefore, the method is more flexible than IDW.
There is no assessment of prediction errors. Radial basis functions do not allow investigation of
the autocorrelation of the data.
The method provides prediction surfaces that are comparable to the exact form of kriging.
Radial basis functions make no assumptions about the data.
The South African engineer D. G. Krige was the first to formalize a method that uses a
mathematical model of the semivariogram for estimating a surface at grid nodes. The kriging
method, named after its founder, predicts the best linear unbiased estimates of a surface at
specified locations, based on the assumptions that the surface is stationary and the correct form of
the semivariogram has been chosen. The kriging procedures incorporate measures of error and
uncertainty in determining estimates. The calculation of unknown values is similar to IDW and is
based on weights assigned to known values. In kriging, however, the weights are chosen to be
optimal, and the semivariogram is used to calculate them. The semivariogram weights depend on
the known sample distribution, that is, on distance and direction.
Observed values are only one of many possible realizations of a random ("stochastic") process. At
each point s, the observed value Z is one possibility of a random variable Z(s). Only one value is
sampled, and it is only one realization of a process that could produce different values. Each point
has its own random process, of the same form. However, there may be spatial dependence among
points; in this case, the points are not independent. As a whole, they make up a stochastic process
over the whole field R (i.e. the observed values are assumed to result from some random process,
but one that respects certain restrictions, in particular spatial dependence). The set of observed
values Z = {Z(s), ∀ s ∈ R} is called a regionalized variable. This variable is doubly infinite: 1) in the
number of points and 2) in the possible values at each point.
Regionalized variable theory uses a related property called the semivariance to express the
degree of relationship (or autocorrelation) between points on a surface. The semivariance is simply
half of the variance in the values between each point pair that is separated by a known distance.
The variogram is a representation that plots semivariance against separation distance. This
type of representation is discussed in detail below.
The idea of stationarity is used to obtain the necessary replication. Stationarity is an assumption
that is often reasonable for spatial data. There are two types of stationarity. One is called the first-
order or mean stationarity. In geostatistics, it is assumed that the mean is constant between
samples and is independent of location.
The second type of stationarity is called intrinsic stationarity for semivariograms and second-
order stationarity for covariance (http://en.wikipedia.org/wiki/Covariance). The intrinsic stationarity
for semivariograms is based on the assumption that the variance of the value’s difference is the
same between any two points that are the same distance and direction apart no matter which two
points you choose. Second-order stationarity is the assumption that the spatial auto-covariance is
the same between any two points that are at the same distance and direction apart no matter
which two points you choose. The auto-covariance is dependent on the distance between any two
values and not on their locations.
However, in reality, first-order stationarity is often not plausible. The observed mean value is often
different in different regions, or shows an obvious trend. Second-order stationarity is also often not
plausible; for example, the covariance is often observed to increase without bound as the area
increases. One solution is to work with the differences between values, not the values themselves,
within a "small" region, where the differences between values can be assumed to be the same over
the whole area. In addition, if the
trend is subtracted, the residuals may comply with first-order stationarity. These and other solutions
(e.g. using empirical variogram mean differences), as shown below, are used for kriging
prediction.
4.4.1. Variography
Each pair of observation points has a variance. The variance here is the squared difference
between the values of the variable at the two locations. The semi-variance is this variance divided
by two. The semivariance is defined as:

γ(s_i, s_j) = (1/2) [Z(s_i) − Z(s_j)]^2

where γ(s_i, s_j) is the semivariance and Z(s_i) and Z(s_j) are the values at the two locations s_i and
s_j, which are separated by a known distance. The formula calculates half the squared difference
between the values of the paired locations. Semivariances can be used as a measure of spatial
dependence or autocorrelation.
The semivariances can be summarized in a variogram. A variogram is obtained from the data. The
variogram is a point "cloud" that plots the semivariance between the values of the same variable at
two locations for all n(n − 1)/2 point pairs, where n is the number of observed points. The
semivariances are plotted against distance in the variogram "cloud". Along the abscissa (x-axis)
the variogram plots the distance separating the two locations; along the ordinate (y-axis) it plots the
semivariance that is used to quantify autocorrelation.
The semivariance generally increases with distance, and variograms are described by nugget, sill,
and range parameters. The sill is the maximum semivariance, or the height that the semivariogram
reaches; it represents variability in the absence of spatial dependence. It is often composed of two
parts: a discontinuity at the origin, called the nugget effect, and the partial sill, which together
produce the sill. The nugget is the semivariance as the separation approaches zero; it represents
variability at a point that cannot be explained by spatial structure. The nugget can be divided into
measurement error and micro-scale variation, and since either component can be zero, the nugget
effect can consist wholly of one or the other. The range is the separation between point pairs at
which the sill is reached, or the distance beyond which there is no evidence of spatial dependence.
As Z( s i ) and Z( s j ) get farther apart, they become less similar, and so the difference in their values,
Z( s i ) - Z( s j ) , will become larger. The anatomy of a typical semivariogram is represented in the
following figure:
Figure 15 : The anatomy of a typical semivariogram and the semivariogram view in ArcGIS Geostatistical
Analyst
There are n(n − 1)/2 point pairs that are used to calculate and build the variogram. This quickly
becomes a large number and can be unmanageable to plot; for example, with 200 points there are
19,900 point pairs. To reduce the number of points in the empirical semivariogram, the pairs of
locations are grouped based on their distance from one another into lag bins, i.e. by separations h.
For example, compute the average semivariance for all pairs of points that are greater than 100
meters but less than 200 meters apart. This is repeated for all samples that are h distance apart,
and the average squared difference is obtained. Therefore, the empirical semivariogram is a graph
of the averaged semivariogram values on the y-axis and the h distance (or lag) on the x-axis.
Figure 16 : The empirical semivariogram view in ArcGIS Geostatistical Analyst. For each bin, only the average
distance and semivariance for all the pairs in that bin are plotted as a single point on the empirical
semivariogram cloud graph.
Note that binning relies on the intrinsic stationarity assumption that allows replication: mean values
are replaced with mean differences, which are the same over the whole field, at least within some
"small" lag separation h. Thus, "averaging" is applied in the semivariance formula above, and the
empirical semivariogram can be estimated for distances that are multiples of h as:

γ̂(h) = 1 / (2·m(h)) · Σ_{k=1..m(h)} [Z(s_i) − Z(s_j)]^2

that is, the semivariance is equal to the average of the squared differences, where h is the regularly
spaced point distance (separation or lag distance); Z(s_i) and Z(s_j) are the values at the two
locations s_i and s_j of the k-th pair; and m(h) is the number of point pairs separated by the vector h,
i.e. the number of pairs within a bin. In practice, there have to be enough known points to define the
set of vectors in each "bin".
Bins are commonly formed by dividing the sample area into a grid of cells or sectors that are used
to calculate the empirical semivariogram. The size of the cell is called lag size and the number of
cells is called number of lags. Narrow intervals mean more resolution, but fewer point pairs for
each sample.
Figure 17 : A measure of the similarity between a variable's values (the covariance between the two values)
for a distance h apart is obtained. Variograms with lags (bin sizes) of 10 000 and 50 000 meters respectively
produce different outputs (right-hand side of the figure)
Table 4 : Binning the empirical semivariogram γ̂(h) = 1/(2·m(h)) · Σ_{k=1..m(h)} [Z(s_i) − Z(s_j)]^2 with lag h = 1 meter:

Lag distances (m)   Distances        Average distance   Semivariances γ(s_i, s_j)   Average γ̂(h)
From 1 to 2         2, 2             2                  12.5, 50                    31.25
From 2 to 3         2.236, 3, 3      2.745              12.5, 12.5, 0               8.33
From 3 to 4         3.606, 3.606     3.606              50, 12.5                    31.25
From 4 to 5         4.472, 4.123     4.298              50, 112.5                   81.25
More than 5         5.657            5.657              112.5                       112.5
Question: How does one find a lag size? A rule of thumb is to multiply the lag size by the number of
lags; the product of these numbers should be about half the largest distance among all points.
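A binning computation of this kind can be sketched as follows in Python (assuming NumPy and hypothetical sample points); each pair contributes half its squared difference, and the pairs are averaged within each lag bin.

# Minimal sketch of an omnidirectional empirical semivariogram with lag bins,
# using hypothetical sample points.
import numpy as np

def empirical_semivariogram(xy, z, lag_size, n_lags):
    n = len(z)
    sums = np.zeros(n_lags)
    counts = np.zeros(n_lags, dtype=int)
    for i in range(n):
        for j in range(i + 1, n):                       # n(n-1)/2 point pairs
            h = np.linalg.norm(xy[i] - xy[j])           # separation distance
            k = int(h // lag_size)                      # which lag bin the pair falls into
            if k < n_lags:
                sums[k] += 0.5 * (z[i] - z[j]) ** 2     # semivariance of the pair
                counts[k] += 1
    with np.errstate(divide="ignore", invalid="ignore"):
        gamma = sums / counts                           # average semivariance per bin
    lags = (np.arange(n_lags) + 0.5) * lag_size         # bin mid-distances
    return lags, gamma, counts

# hypothetical points and values
xy = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [3.0, 3.0], [5.0, 1.0]])
z = np.array([10.0, 15.0, 5.0, 20.0, 25.0])
lags, gamma, counts = empirical_semivariogram(xy, z, lag_size=1.0, n_lags=6)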
So far, the values in the semivariogram cloud are put into bins based only on the distance.
This variogram is called omnidirectional and isotropic (Greek "iso" + "tropic" = English
"same" + "trend"). So far, the direction between a pair of locations at lag h was not specified
when constructing the variogram, but variation may depend on direction, not just distance.
There are two types of directional components that can affect the predictions in output
surface:
A global trend is an overriding process that affects all measurements in a deterministic
manner. The global trend can be represented by a mathematical formula (e.g., a polynomial)
and removed from the analysis of the measured points but added back before predictions
are made. This process is referred to as de-trending.
Anisotropy (Greek “an” + “tropic” = English “not-” + “trend”) is a characteristic of a random
process that shows higher autocorrelation in one direction than another.
Anisotropy arises due to directionality of a process, for example, sand content in a narrow
flood plain has much greater spatial dependence along the axis parallel to the river;
secondary mineralization is changing near an intrusive dyke; population density is different in
a hilly terrain with long, linear valleys.
Here, then, the notion of anisotropy and the corresponding directional empirical variogram can
be brought in. A directional variogram describes the spatial variation among points separated by a
space lag h. The difference from the omnidirectional variogram is that h is a vector rather than a
scalar. The number of directions may differ. A directional variogram is estimated using the same
equation as the omnidirectional one. Note that the nugget must logically be isotropic (it is variation
at a point).
Figure 18 : Point pairs are grouped based on common separation distances and directions (bandwidth
sectors)
The indicators of anisotropy are that the semivariogram changes notably, with ranges and/or sills
considerably different for the respective directions. There are two types of anisotropy indicators:
1. Sill is the same, but different ranges in different directions; this is geometric, or also called
affine, anisotropy.
2. Range is the same, but sill varies with direction, this is zonal anisotropy.
Figure 19 : Zonal anisotropy: the variogram (range) changes notably depending on the search direction
(north-west and north-east search directions respectively)
Therefore, if directional differences in the spatial dependences are detected, it can be accounted
for in the semivariogram or covariance models. This, in turn, has an effect on the geostatistical
prediction method.
Another measure that is used to estimate the strength of spatial correlation as a function of
distance is the auto-covariance (and autocorrelation). The spatial auto-covariance is computed
within the same variable, using pairs of observations. Each pair of observations (s_i, s_j) has an
auto-covariance (Z(s_i) − Z̄)(Z(s_j) − Z̄), showing how the two values jointly differ from the
variable's mean Z̄. Spatial autocorrelation can also be analyzed using covariance functions and
correlograms. The auto-covariance function is C(s_i, s_j) = cov(Z(s_i), Z(s_j)), where cov is the
covariance. See more about the measures of spread at
http://www.spatialanalysisonline.com/output/html/Measuresofspread.html.
The covariance depends on the separation between points. The individual covariances have to be
summarized as an auto-covariance function of spatial separation. Once this is done, the
covariance between any two locations in space can be predicted.
Figure 20 : The anatomy of auto-covariance function and the auto-covariance function’s view in ArcGIS
Geostatistical Analyst
The idea of correlation applied to a single variable is called auto-correlation (the prefix auto- means
"self" and refers to a single variable). In such cases, the correlation is controlled by some other
dimension, such as space, if the variable is collected at points in space, or time, if the variable is
collected as a time series. A measure of how much the variable is correlated with itself, considering
the other factor (time or space), is the auto-correlation.
r(h) = [ Σ_{k=1..m(h)} (Z(s_i) − Z̄)(Z(s_j) − Z̄) ] / [ Σ_{k=1..n} (Z(s_k) − Z̄)^2 ]
The top part of this expression is like the covariance, but at a lag of h , and the bottom is like the
covariance at a lag of 0. These two components are the autocovariance at h and 0 lags. The set of
values { r (h ) } can then be plotted against the lag h , to see how the pattern of correlation varies
with lag. This plot is known as a correlogram, and provides a valuable insight into the behavior of
the autocorrelation at different lags or “distances”.
There are mathematical relationships between the semivariogram and the covariance functions,
which appear as γ(s_i, s_j) = sill − C(s_i, s_j). If the regionalized variable is stationary, the
semivariance for a distance d is equal to the difference between the variance and the
auto-covariance for the same distance:
γ(d) = C(0) − C(d)
If the regionalized variable is not only stationary, but also is standardized to have a mean of zero and
variance of 1, the semivariogram is a mirror image of the autocorrelation function:
Figure 23: Relationship between semivariance and autocorrelation for a stationary regionalized variable
So far, the models of empirical semivariogram and covariance clouds, which provide information on
the spatial autocorrelation of datasets, have been discussed. However, empirical semivariogram
and covariance models do not provide information for all possible directions and distances. The
main application of geostatistics is the prediction (optimal interpolation) of attribute values at
unsampled locations (kriging).
To provide semivariance values for all possible distances and directions, a continuous theoretical
model is fitted to the empirical semivariogram; this fitted model is then used in the kriging
equations. Abstractly, this is similar to regression analysis, where a continuous line or curve of
some type is fitted. However, only certain functions (authorized models) can be used to fit
semivariogram and covariance clouds. Any variogram function must be able to model the
following: monotonically increasing values, possibly with a fluctuation (hole);
constant or asymptotic maximum (sill), non-negative intercept (nugget) and anisotropy. Theoretical
variograms must obey mathematical constraints so that the resulting kriging equations are solvable
(e.g., positive definite between-sample covariance matrices).
The following functions are authorized to model the empirical semivariogram: circular, spherical,
tetraspherical, pentaspherical, exponential, Gaussian, rational quadratic, hole effect, K-Bessel, J-
Bessel, etc (see the descriptions of some mathematical models at
http://webhelp.esri.com/arcgisdesktop/9.2/index.cfm?topicname=how_kriging_works). Any linear
combination of authorized models is also authorized.
The selected model influences the prediction of the unknown values, particularly when the shape of
the curve near the origin differs significantly. The steeper the curve near the origin, the more
influence the closest neighbors have on the prediction, and the less smooth the output surface will be.
Figure 24 : Comparison of variogram models; a spherical model (blue curve), derived from the intersection of
random spheres of a given size, is used to fit the empirical variogram.
There are no exact rules for choosing the "best" variogram model or function. Each model is
designed to fit different types of phenomena more accurately. For example, a Gaussian model
might be expected for a phenomenon that physically must be very continuous, such as the surface
of a ground-water table. A model that looks appropriate can be chosen based on an expert's
examination of the empirical semivariogram or covariance functions, with validation and
cross-validation statistics used as a guide (these will be discussed later in the module).
Let us look at an example of a theoretical variogram. The theoretical variogram is needed because
the empirical semivariogram values cannot be used directly in the matrix calculations of the kriging
system (discussed in the next section); raw empirical variogram values can introduce negative
standard errors for the predictions. Instead, an authorized fitted model has to be used when
determining semivariogram values for various distances. These authorized models are designed in
such a way that they cannot introduce negative standard errors for the predictions.
The empirical variogram in the previous example was calculated as follows:

Lag distances (m)   Empirical semivariance γ̂(h)
From 1 to 2         31.25
From 2 to 3         8.33
From 3 to 4         31.25
From 4 to 5         81.25
More than 5         112.5
For example, a simple linear authorized model may be used, γ(h) = c·h, where c is a constant that
defines the slope of the theoretical variogram line. Based on a regression adjustment to the
averaged γ(s_i, s_j) values from the table above, c is calculated to be 13.5, so the theoretical model
is γ(h) = 13.5·h.
Based on this model, the theoretical semivariances between known points γ(s_i, s_j) are calculated
and presented in the following table:
variance is zero. In the worst-case scenario, when there is no trust in a data value, the variance of
that value is one on a normalized scale.
Conceptually, as shown later on, the variance in kriging plays the role of a weighting function.
For example, suppose there is a single data value with zero variance. One can be relatively assured
that missing values physically close to the known location will be well approximated by the known
value. For points further away there is less certainty about the unknown values; the uncertainty
increases with distance from the known value.
The variance for each known data value can be set to the uncertainty of that value. The estimation
of each unknown value is concerned more with the relative changes in uncertainty than with their
absolute values.
Each known data value has a variance function associated with it that is used to determine the
variance of data values around the known location. The shape of the function is empirical and can
take one of four forms: linear, spherical, exponential, or Gaussian. These curves have their minimum
value (usually 0) at the known data location and their maximum value (usually 1) at some specified
distance (the range) away from that point. All locations outside the range are considered to be
unaffected by the known data value. Just as one value can be set for the uncertainty of all known
data values, a single range can be used throughout the calculations.
The variance and range allow for a variance discontinuity at the known data value, commonly
referred to as the nugget. This causes a step increase in variance just away from the known data
value. In the best-case scenario this value would be set to 0, producing no nugget effect.
The variogram is estimated from the data in two steps:
1. First, the empirical variogram is estimated with a particular lag size and anisotropy
parameters;
2. Then, a theoretical variogram model is fitted to it using a particular authorized function.
Figure 25 : Kriging methods depend on mathematical and statistical models. Kriging methods rely on the
notion of autocorrelation.
The general formula for kriging interpolators is formed as a weighted sum of the data:

ẑ(s_0) = Σ_{i=1..n} λ_i Z(s_i)

where Z(s_i) is the measured value at the i-th location, λ_i is an unknown weight for the measured
value at the i-th location, ẑ(s_0) is the prediction at location s_0, and n is the number of measured
values. Each point ẑ(s_0) is predicted as the weighted average of the values at all sample points Z(s_i).
Kriging is similar to IDW in that it weights the surrounding measured values to derive a prediction
for an unmeasured location. The general formula for both interpolators is formed as a weighted
sum of the data. However, in IDW, the weight depends solely on the distance to the prediction
location. Kriging weights come from a semivariogram that was developed by looking at the spatial
nature of the data - the spatial autocorrelation is quantified by using semivariances. Thus, kriging is
based on the theory of random processes, with covariances depending only on separation (i.e. a
variogram model). Nevertheless, in Kriging the weights are based not only on the distance between
the measured points and the prediction location, but also on the overall spatial arrangement among
the measured points.
In addition, the weights used in kriging involve not only the semivariances between the points to be
established and the known points (IDW method uses distances between the points to be
established and the known points), but also those between the known points.
Various kriging techniques are based on certain assumptions. Thus simple kriging is linear with a
known trend; ordinary kriging is linear with an unknown flat trend; universal kriging is linear with an
unknown polynomial trend; co-kriging is linear and multivariate and can have different types of
trends; Trans-Gaussian kriging is linear after transformation with a flat trend; indicator and
disjunctive kriging is non-linear and works with threshold or binary data; block kriging is linear and
works with average value over some small area (block) rather than at a point.
First, let us examine ordinary kriging. In ordinary kriging, the weight λ_i depends on a fitted model
to the measured points, the distance to the prediction location, and the spatial relationships among
the measured values around the prediction location. Predictions are made as linear combinations
of known data values (a weighted average). The ordinary kriging prediction is unbiased. The
known points are predicted exactly; they are assumed to be without error, even if there is a nugget
effect in the variogram model.
In ordinary kriging it is also true that points closer to the prediction point receive larger weights.
Clusters of points "reduce to" single equivalent points (i.e. over-sampling in a small area cannot
bias the result). Closer sample points "mask" more distant ones in the same direction. Error
estimates are based only on the sample configuration, not on the data values.
The theory of regionalized variables leads to an “optimal” interpolation method, in the sense that
the prediction variance is minimized. In ordinary kriging, prediction error should be as small as
possible. Ordinary kriging, as a “best linear unbiased predictor”, has to satisfy certain criteria for
optimality. However, it is only “optimal” with respect to the chosen fitted model!
In the ordinary kriging model, the value at location s_i, Z(s_i) = μ + ε(s_i), is the sum of a regional
mean μ and a spatially-correlated random component ε(s_i). The regional mean μ is estimated
from the sample, but not as the simple average, because there is spatial dependence; it is implicit
in the ordinary kriging system. Therefore, ordinary kriging predicts at points, with an unknown mean
(which must be estimated) and no trend (or a flat trend).
When making predictions at several locations with ordinary kriging, it is expected that some of the
predicted values will be above the actual values and some below. Nevertheless, on average, the
difference between the predictions and the actual values should be zero (the first-order stationarity
condition). This is referred to as "making the prediction unbiased", and it is the main constraint of
ordinary kriging. Formally, this constraint is satisfied by requiring the weights λ_i assigned to the
sample points to sum to one. The unbiasedness condition is:

Σ_{i=1..n} λ_i = 1
The variance at any point is finite and the same at all locations in the field, and the covariance
structure depends only on the separation between point pairs (the second-order stationarity
condition). The unbiasedness constraint is combined with the "optimal" interpolation requirement
that the prediction variance be as small as possible:

E[ (z(s_0) − Σ_{i=1..n} λ_i Z(s_i))^2 ] → min

that is, the expected squared difference between the true value z(s_0) and the predictor
Σ λ_i Z(s_i) at the unknown location s_0 is minimized. The solution to this minimization, constrained
by unbiasedness, gives the ordinary kriging equations:
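In a standard matrix form (consistent with the matrix A, the vector b, and the Lagrange multiplier m referred to below), the system can be written as:

A · λ = b, where

A = | γ(s_1, s_1)  γ(s_1, s_2)  ...  γ(s_1, s_n)  1 |
    | γ(s_2, s_1)  γ(s_2, s_2)  ...  γ(s_2, s_n)  1 |
    |     ...          ...      ...      ...     ...|
    | γ(s_n, s_1)  γ(s_n, s_2)  ...  γ(s_n, s_n)  1 |
    |     1            1        ...      1        0 |

λ = [λ_1, λ_2, ..., λ_n, m]^T    b = [γ(s_1, s_0), γ(s_2, s_0), ..., γ(s_n, s_0), 1]^T

The last row and column of A and the last element of b enforce the unbiasedness constraint, and m is the Lagrange multiplier.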
The semivariance values γ_ij in the matrix A and the vector b are taken from the mathematical
expression of the semivariogram (the fitted model). A fourth unknown, the Lagrange multiplier m, is
also introduced to assure that the minimum possible estimation error is obtained. The value of m
depends on the covariance structure of the sample points.
The term "ordinary" implies that there is no trend or strata; the regional mean must be estimated
from the sample. One of the main issues concerning ordinary kriging is whether the assumption of a
constant mean is reasonable. Sometimes there are good scientific reasons to reject this
assumption.
Ordinary kriging can use either semivariograms or covariances to express autocorrelation, it can
use transformations and remove trends, and it can allow for measurement error. The ordinary
kriging prediction error, or kriging variance σ̂², at a point depends on the semivariances between
the prediction point and the sample points (the vector b) and on the weights (including the
Lagrange multiplier) computed in the ordinary kriging system. The ordinary kriging variance at the
point is computed as σ̂²(s_0) = b^T λ.
The variance measure σ̂² is the important difference between the kriging method and other
interpolation methods, e.g. IDW. The value of σ̂² can be used for each predicted point to estimate
the reliability of the interpolation.
Ordinary kriging can be summarized by the following computational steps when predicting the
value at each unsampled point:
1. Compute the distances between all pairs of sample points, and then compute the respective
semivariances for the matrix A using the fitted variogram model and these distances.
2. Compute the distances between the unknown prediction point and each known sample point,
and then compute the respective semivariances for the vector b using the theoretical variogram
model and these distances.
Steps 1 and 2 of the kriging system use modeled (or fitted, or theoretical) semivariances. Different
fitted models would give different kriging weights to the sample points, and these would give
different prediction results.
3. Invert the matrix A.
4. Compute the weights (and the Lagrange multiplier) as λ = A^(-1) b.
5. Predict the value at the point as the weighted average of the sample points,
ẑ(s_0) = Σ_{i=1..n} λ_i Z(s_i).
6. Compute the prediction error σ̂² as the scalar product of b and the λ vector.
Note: The variogram model γ(h) used in these equations is estimated only once, using information
about the spatial structure over the whole study area; therefore the semivariances between sample
points γ(s_i, s_j) are computed only once for any point configuration. However, the semivariances
between the sample points and the prediction point, γ(s_i, s_0), must be computed separately for
each point to be predicted.
Locations s_i = (x_i, y_i):
(2, 2) known
(4, 6) known
(8, 5) known
(5, 5) unknown

An exponential fitted model, γ(h) = c·(1 − e^(−h/a)), is used with the following variogram
parameters: sill c = 10, effective range a = 1.5, and nugget c_0 = 0. Based on this model, the
semivariances between the known points, γ(s_i, s_j), and between the known points and the
prediction point, γ(s_i, s_0), are calculated and presented in the following table, together with the
distances:
The ordinary kriging equations based on this exponential fitted model can then be written as the following matrix system $A\lambda = b$:

  |  0     9.49   9.89   1 |   | λ1 |   | 9.41 |
  |  9.49  0      9.36   1 | × | λ2 | = | 6.10 |
  |  9.89  9.36   0      1 |   | λ3 |   | 8.65 |
  |  1     1      1      0 |   | ψ  |   | 1    |

Solving the system gives the weights λ1 = 0.1989525, λ2 = 0.5321374 and λ3 = 0.2689101 (which sum to 1) and the Lagrange multiplier ψ = 1.6990883.
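The worked example above can be reproduced with a few lines of NumPy. The sketch below (an illustration, not part of the original material) assembles the augmented semivariance matrix A and vector b from the stated exponential model and solves the system; the resulting weights and Lagrange multiplier should match the values quoted above.

```python
import numpy as np

# Known sample locations and the prediction location from the worked example
samples = np.array([[2.0, 2.0], [4.0, 6.0], [8.0, 5.0]])
s0 = np.array([5.0, 5.0])

# Exponential variogram model with sill c = 10, range parameter a = 1.5, nugget 0
def gamma(h, c=10.0, a=1.5):
    return c * (1.0 - np.exp(-h / a))

n = len(samples)

# Matrix A: semivariances between all pairs of sample points,
# augmented with a row/column of ones for the unbiasedness constraint
A = np.zeros((n + 1, n + 1))
for i in range(n):
    for j in range(n):
        A[i, j] = gamma(np.linalg.norm(samples[i] - samples[j]))
A[n, :n] = 1.0
A[:n, n] = 1.0

# Vector b: semivariances between each sample point and the prediction point
b = np.ones(n + 1)
b[:n] = [gamma(np.linalg.norm(s - s0)) for s in samples]

# Solve the ordinary kriging system; the last element is the Lagrange multiplier
sol = np.linalg.solve(A, b)
weights, psi = sol[:n], sol[n]
print("weights:", weights)           # approximately [0.199, 0.532, 0.269]
print("Lagrange multiplier:", psi)   # approximately 1.699

# Kriging variance at s0, computed from the augmented vectors as b^T * lambda
kriging_variance = b @ sol
print("kriging variance:", kriging_variance)
```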
Now let us look briefly at other kriging techniques. There may be situations where the regional mean is known; sometimes it makes sense to assume a physically-based model that gives a known trend. In the simple kriging model, the value at a location is written as $z(s_i) = \mu + \varepsilon(s_i)$, the sum of a known constant $\mu$ and a spatially-correlated random component $\varepsilon(s_i)$. For simple kriging, because $\mu$ is known exactly, $\varepsilon(s_i)$ is also known exactly at the sampled locations; if $\varepsilon(s_i)$ is known, it is easier to estimate the autocorrelation than if it had to be estimated. Simple kriging therefore uses the residuals (the differences between the model and the observations), assuming that the trend in the residuals is known to be zero.
There is no need for a Lagrange multiplier in the simple kriging system. The simple kriging estimate is computed without the constraint that the weights sum to 1, that is, $\sum_{i=1}^{n} \lambda_i$ need not equal 1. However, any bias introduced by the weights must be compensated with respect to the (known) mean when predicting at a point. The equations of the simple kriging system have the same structure as the ordinary kriging system, but without the unbiasedness row and column, and with the known mean used directly in the prediction.
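As an illustration only (not taken from the original text), the following sketch solves a simple kriging system written in covariance form, assuming the covariance C(h) = c·e^(-h/a) that corresponds to the exponential variogram used above; the sample values and the known mean are hypothetical.

```python
import numpy as np

def cov(h, c=10.0, a=1.5):
    # Covariance corresponding to the exponential variogram: C(h) = sill - gamma(h)
    return c * np.exp(-h / a)

def simple_kriging(samples, values, s0, mean):
    """Predict at s0 with a known mean; no Lagrange multiplier is needed."""
    n = len(samples)
    C = np.array([[cov(np.linalg.norm(samples[i] - samples[j])) for j in range(n)]
                  for i in range(n)])
    c0 = np.array([cov(np.linalg.norm(s - s0)) for s in samples])
    w = np.linalg.solve(C, c0)                      # weights need not sum to 1
    residuals = values - mean                       # work with residuals around the known mean
    return mean + w @ residuals, cov(0.0) - w @ c0  # prediction and kriging variance

# Reusing the three sample points of the worked example with hypothetical values and mean
samples = np.array([[2.0, 2.0], [4.0, 6.0], [8.0, 5.0]])
values = np.array([12.0, 15.0, 14.0])   # hypothetical measurements
pred, var = simple_kriging(samples, values, np.array([5.0, 5.0]), mean=13.5)
print(pred, var)
```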
The theory of regionalized variables can incorporate cases of first-order non-stationarity (i.e. where
a significant trend surface to the geographic coordinates exists or strata have significantly different
means). The intrinsic hypothesis only needs local first-order stationarity, so ordinary kriging can be
applied in local neighborhoods and would work in these cases. However, even then, useful
information about spatial structure is discarded. Accounting for a global trend would improve
predictions and allow one a better understanding of the processes that form the spatial variation.
Specially designed kriging methods can model both a global trend (or stratification) and a local spatial-dependence structure at the same time. One such method is universal kriging, a procedure that includes a global trend as a function of the geographic coordinates within the kriging system. Another is regression kriging, also called "kriging after de-trending", which models the trend (in geographic or feature space) and its residuals separately.
Universal kriging is a mixed interpolator that models a global trend as a function of the geographic coordinates in the kriging system. In universal kriging, the value of the variable z at location $s_i$ is modeled as the sum of a regional non-stationary trend $m(s_i)$ and a spatially-correlated random component $\varepsilon(s_i)$: $z(s_i) = m(s_i) + \varepsilon(s_i)$.
The sample base functions for linear drift (a first-order polynomial) are $f_0(s_i) = 1$, $f_1(s_i) = x$ and $f_2(s_i) = y$, where x is the abscissa and y the ordinate of a point. For quadratic drift, the second-order terms $f_3(s_i) = x^2$, $f_4(s_i) = xy$ and $f_5(s_i) = y^2$ are also included. Note that $f_0(s_i) = 1$ estimates the global mean, as in ordinary kriging.
The unbiasedness condition for universal kriging is expressed with respect to the trend as well as the overall mean (as in ordinary kriging): $\sum_{i=1}^{n} \lambda_i f_j(s_i) = f_j(s_0)$ for $j = 0, \dots, p$, and $\sum_{i=1}^{n} \lambda_i = 1$. The expected value of each of these functions at each point must be that predicted by the function; the first of these conditions corresponds to the overall mean, as in ordinary kriging.
The semivariances $\gamma(s_i, s_j)$ for universal kriging are based on the residuals, not on the original data, because the random part of the spatial structure applies only to these residuals. The fitted variogram model is obtained in three steps:
1. Calculate the best-fit trend surface that will be used in universal kriging.
2. Subtract the trend surface from the observed values to obtain the residuals.
3. Compute the empirical variogram of the residuals and fit an authorized model to it.
The variogram model parameters for the residuals will usually be very different from those of the original variogram model. The residual variogram often has a lower sill and a shorter range, because the global trend accounts for part of the variation and for the long-range structure.
Global universal kriging uses all sample points when predicting each point; this gives the same results as regression kriging. Such a global variant is appropriate if there is a regional trend. Universal kriging can also be local, when the sample points are restricted to neighbours around the prediction point. The local variant of universal kriging allows the trend surface to vary over the study area, since it is re-computed at each point. As with simple kriging, if the trend is known, a "simple" variant of universal kriging can be used.
In regression kriging, the residuals $\varepsilon(s_i)$ are obtained by subtracting the p-order polynomial trend from the original data, and the $\varepsilon(s_i)$ are assumed to be random with a mean of 0. Conceptually, the autocorrelation is now modeled from these random errors $\varepsilon(s_i)$. Regression kriging can be considered a type of polynomial regression; however, instead of assuming that the errors $\varepsilon(s_i)$ are independent, they are modeled as autocorrelated. Universal and regression kriging allow for measurement error as well.
It may be the case that the observed data are binary (with values of 0 or 1), or a continuous variable may be reclassified into a binary variable by choosing some threshold: values above the threshold become 1 and values below it become 0. For example, a surface can be classified as land (1) and water body (0). The indicator kriging model is $I(s_i) = \mu + \varepsilon(s_i)$, where $\mu$ is an unknown constant and $I(s_i)$ is a binary variable.
Using these binary variables, indicator kriging proceeds exactly like ordinary kriging. The interpolated values will lie between 0 and 1, and predictions from indicator kriging can be interpreted as probabilities of the variable being 1, that is, of belonging to the class indicated by 1.
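As a small illustration (not part of the original text), the indicator transform of a continuous variable can be produced with a simple threshold comparison; the values and threshold below are hypothetical.

```python
import numpy as np

# Hypothetical continuous measurements and a chosen threshold
values = np.array([87.0, 95.5, 102.3, 110.8, 99.1])
threshold = 100.0

# Indicator transform: 1 where the value exceeds the threshold, 0 otherwise.
# Ordinary kriging of these indicators yields values in [0, 1] that can be
# read as the probability of exceeding the threshold at unsampled locations.
indicators = (values > threshold).astype(float)
print(indicators)
```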
The extension of the theory of kriging of regionalized variables to several variables, which have a
multivariate spatial cross-correlation as well as the individual univariate spatial auto-correlation, is
called co-regionalization. Co-kriging is a method of using supplementary information on co-
regionalized variables (co-variables) to improve the prediction of a target variable.
The idea of co-regionalization is that the process that drives one variable is the same as, or at least related to, the process that drives the other variables. For example, the distribution of heavy metals in soil can relate to pollution, or the distribution of water pH can relate to pollution and elevation. All the variables involved have to be regionalized variables and, in addition, if they are also related in geographic space, they are co-regionalized.
For example, the target variable may have relatively few observations because additional measurements are expensive. When more values of a second variable are available and that second variable is correlated with the target variable, it becomes the co-variable. The co-variable is, for example, easy and cheap to measure, so there are many observations of it. Typically, there are more observations of the co-variable (i.e. the target variable was not measured at some points).
The theory of co-regionalization therefore involves two types of variograms: the direct variogram, a single variogram for each regionalized variable, and the cross variogram, for a pair of regionalized variables. The empirical cross variogram is estimated for distances that are multiples of h as
$\hat{\gamma}_{1,2}(h) = \frac{1}{2m(h)} \sum_{k=1}^{m(h)} \big[Z_1(s_i) - Z_1(s_j)\big]\big[Z_2(s_i) - Z_2(s_j)\big]$,
where $Z_1(s_i)$ is the target or main variable of interest and $Z_2(s_i)$ is the co-variable (co-regionalized variable). Cross-variograms can depict either a positive or a negative spatial correlation.
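A minimal sketch of the empirical cross-variogram for a single lag bin might look as follows; the coordinates, variable values, lag and tolerance are all hypothetical, and a production implementation would normally bin many lags at once.

```python
import numpy as np

def cross_variogram(coords, z1, z2, lag, tol):
    """Empirical cross-variogram for one lag bin (a minimal sketch).

    Averages 0.5 * (z1_i - z1_j) * (z2_i - z2_j) over all point pairs whose
    separation falls within lag +/- tol."""
    n = len(coords)
    products, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            h = np.linalg.norm(coords[i] - coords[j])
            if abs(h - lag) <= tol:
                products += (z1[i] - z1[j]) * (z2[i] - z2[j])
                count += 1
    return 0.5 * products / count if count else np.nan

# Hypothetical co-located measurements of a target variable and a co-variable
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(50, 2))
z1 = rng.normal(10, 2, 50)              # target variable (e.g. heavy metal concentration)
z2 = z1 * 0.5 + rng.normal(0, 1, 50)    # correlated co-variable (e.g. pH)

for lag in (10, 20, 30, 40):
    print(lag, cross_variogram(coords, z1, z2, lag, tol=5.0))
```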
The direct and cross-variograms must be modeled together, with some restrictions to ensure that
the resulting co-kriging system can be solved. Therefore, the co-kriging method uses information
on several variable types. The general models of ordinary co-kriging are:
$z_1(s_i) = \mu_1 + \varepsilon_1(s_i)$
$z_2(s_i) = \mu_2 + \varepsilon_2(s_i)$
where $\mu_1$ and $\mu_2$ are unknown constants. There are two types of random errors, $\varepsilon_1(s_i)$ and $\varepsilon_2(s_i)$, so there is autocorrelation for each of them and cross-correlation between them. Ordinary co-kriging attempts to predict $z_1(s_0)$, just like ordinary kriging, but it uses the information in the covariate $z_2(s_i)$ in an attempt to make a better prediction.
The target variable is $z_1(s_i)$, and both the autocorrelation for $z_1(s_i)$ and the cross-correlation between $z_1(s_i)$ and the other variable type $z_2(s_i)$ are used to make better predictions. Co-kriging uses information from the co-variable $z_2(s_i)$ to help make predictions, but it requires much more estimation, including estimating the autocorrelation for each variable as well as all cross-correlations.
The ordinary co-kriging interpolator is formed as a weighted sum of the data:
$\hat{z}_1(s_0) = \sum_{i=1}^{n} \lambda_i Z_1(s_i) + \sum_{i=1}^{n} \nu_i Z_2(s_i)$,
where $Z_1(s_i)$ is the measured value of the target variable and $\lambda_i$ is the unknown weight for that measurement; $Z_2(s_i)$ is the measured value of the co-variable and $\nu_i$ is the unknown weight for that measurement; $\hat{z}_1(s_0)$ is the prediction at location $s_0$; and n is the number of measured values.
The other co-kriging methods, including universal co-kriging, simple co-kriging, indicator co-kriging,
are all generalizations of the foregoing methods to the case where multiple datasets are used. Co-
kriging can allow for measurement error in the same situation as for the various kriging methods
(ordinary kriging, simple kriging, and universal kriging).
The kriging methods can also be summarized as steps in optimal spatial prediction modeling:
1. Sample the phenomenon (e.g. long-term average values of water levels), preferably at different resolutions:
2. Explore the data (distribution, outliers, global trend, spatial configuration).
3. Check for a trend (e.g. visualize it with a global polynomial). If a trend exists, fit it with a trend surface model (here it is linear in the north-west direction):
4. Model (fit) the variogram, using one or more authorized functions, to the residuals $\varepsilon(s_i) = z(s_i) - m(s_i)$. The residuals are obtained by subtracting the first-order polynomial (top-left-hand figure) from the original data. Each blue curve on the left-hand variogram represents the fitted spherical model for a particular search direction (e.g. north, north-west, etc.), i.e. the anisotropy models. In the right-hand variogram, the north-west fitted model (blue curve) follows the direction of the major range.
5. Apply one of the kriging systems, together with the fitted variogram model of spatial dependence, to produce kriging predictions based on the variogram and trend model at each prediction point. Predictions are often made at each point of a regular grid (e.g. a raster map). For a local approach, the kriging equations are solved separately for each point $s_0$, using the semivariances around that point in a local neighborhood; this gives a different set of weights $\lambda_i$ for each point to be predicted (left-hand figure). The plot of the fitted line through the scattered known points is given in blue, with the equation given just below the plot (right-hand figure). If the data were not autocorrelated, every prediction would be the same and equal to the mean of the known data, and the blue line would be horizontal. With autocorrelation and a good kriging model, the blue line should be closer to the black dashed line.
6. Calculate the error of each prediction; this is based only on the sample point locations, not on their data values. The error plot is the same as the prediction plot, except that the known values are subtracted from the predicted values (left-hand figure). For the standardized error plot, the known values are subtracted from the predicted values and then divided by the estimated kriging prediction errors $\hat{\sigma}$, from which other derived standardized errors are computed (right-hand figure).
With any interpolation method, one would like to know how good the results will be. The model,
therefore, needs justification. This involves a model validation approach and methodology. Model
validation needs to be applied during the building of the model, including calibration of the model.
The main approaches used to compare model predictions with reality are:
1. Use of independent measures of validity for data and models; some of the measures can
be represented as a surface.
2. Use of a separate validation dataset: divide the data into two samples, a training dataset and a test dataset, and compare the values predicted from the training dataset at specified locations with the values from the test dataset.
3. Cross-validation using calibration datasets.
4. Create many interpolation surfaces and use measures when comparing models.
At the data exploration stage, which usually precedes interpolation, the data have to be examined for normality, outliers, global trend, stationarity, spatial autocorrelation, etc. Some of the kriging and co-kriging methods (e.g. ordinary kriging) require that the data come from a normal distribution; therefore, the data have to be explored for normality. If the data are not normally distributed, a normal score transformation may need to be applied.
Indicators (measures) of whether the data are normally distributed can be obtained from a histogram. For a normal distribution, the histogram has a unimodal bell shape, the mean and median values are very close, and the kurtosis is close to 3. Normality can also be verified with numeric tests (e.g. the Shapiro-Wilk or Anderson-Darling test).
In addition, the Normal QQPlot of the quantiles of the input dataset versus quantiles of the
standard normal distribution can be used. QQ plots are graphs on which quantiles from two
distributions are plotted relative to each other. For two identical distributions, the Normal QQPlot
will be a straight line. Therefore, comparing this line with the distribution of sampled points on the
Normal QQPlot provides an indication of univariate normality. If the distribution of sampled data is
asymmetric (i.e., far from normal), the points will deviate from the Normal QQPlot line.
Figure 26 : The Normal QQPlot against the long-term average values of water levels, and the transformation
options in the ESRI Geostatistical Analyst. See more about transformation at the
http://webhelp.esri.com/arcgisdesktop/9.2/index.cfm?TopicName=Box-
Cox%2C_Arcsine%2C_and_Log_transformations and
http://www.spatialanalysisonline.com/output/html/Datatransformsandbacktransforms.html.
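A quick sketch of these normality checks, assuming a hypothetical sample of water-level values and using SciPy (not part of the original text):

```python
import numpy as np
from scipy import stats

# Hypothetical sample of long-term average water levels
rng = np.random.default_rng(1)
values = rng.normal(100.0, 15.0, size=200)

# Histogram-based indicators: mean vs. median and kurtosis (close to 3 for normal data)
print("mean:", values.mean(), "median:", np.median(values))
print("kurtosis:", stats.kurtosis(values, fisher=False))

# Numeric test of normality: a small p-value suggests rejecting normality
stat, p_value = stats.shapiro(values)
print("Shapiro-Wilk p-value:", p_value)

# Quantiles for a Normal QQ plot: sample quantiles vs. standard normal quantiles
(osm, osr), _ = stats.probplot(values, dist="norm")
```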
One can use the global polynomial interpolation methods for visual inspection of linear or other
trends in data. Cross-section plots of data values along ordinate and abscissa coordinate axes can
also give an idea about global data trends.
Figure 27: The long-term average values of water levels show a linear (first-order) trend (map on the left-hand side). There is no explicit second-order trend (map on the right-hand side)
Data outliers, or abrupt changes in data values, can be caused by real abnormalities in the phenomenon or by measurement errors. They can be investigated using the histogram (points in the tails of the distribution), the semivariogram (pairs of points with high values in the semivariogram cloud, regardless of distance) and the Voronoi diagram (high dissimilarity between neighbors). See more about the Voronoi diagram at http://en.wikipedia.org/wiki/Voronoi_diagram.
The bias or mean error (ME) of the estimated values against the actual values of the validation dataset is
$ME = \frac{1}{n}\sum_{i=1}^{n}\big(z(s_i) - Z(s_i)\big)$.
The root-mean-square error is
$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(z(s_i) - Z(s_i)\big)^2}$.
The kriging prediction error $\hat{\sigma}$ at a point, or kriging variance at the point, is computed from $\hat{\sigma}^2(s_0) = b^T \lambda$.
The average kriging prediction error is
$\frac{1}{n}\sum_{i=1}^{n}\hat{\sigma}(s_i)$.
The mean standardized prediction error is
$\frac{1}{n}\sum_{i=1}^{n}\big(z(s_i) - Z(s_i)\big)/\hat{\sigma}(s_i)$.
The root-mean-square standardized prediction error is
$\sqrt{\frac{1}{n}\sum_{i=1}^{n}\big[\big(z(s_i) - Z(s_i)\big)/\hat{\sigma}(s_i)\big]^2}$.
Figure 29 : Prediction errors for the resulting surface of the long-term average values of water levels
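These summary statistics are straightforward to compute once predicted values, measured values and kriging standard errors are available at the validation points. A minimal sketch with hypothetical numbers:

```python
import numpy as np

def prediction_error_summary(predicted, measured, kriging_se):
    """Summary statistics comparing predictions with measured validation values.

    kriging_se holds the kriging prediction (standard) errors at the same points."""
    diff = predicted - measured
    return {
        "mean_error": diff.mean(),
        "rmse": np.sqrt((diff ** 2).mean()),
        "average_standard_error": kriging_se.mean(),
        "mean_standardized": (diff / kriging_se).mean(),
        "rms_standardized": np.sqrt(((diff / kriging_se) ** 2).mean()),
    }

# Hypothetical validation data
predicted = np.array([101.2, 98.7, 103.4, 96.5])
measured = np.array([100.0, 99.5, 105.0, 95.0])
kriging_se = np.array([2.1, 1.8, 2.5, 2.0])
print(prediction_error_summary(predicted, measured, kriging_se))
```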
When one compares these models, one should look for a model that satisfies the following conditions:
1. The root-mean-squared error should be as small as possible.
2. The average kriging prediction error should be close to the root-mean-squared error.
Thus, the kriging prediction error $\hat{\sigma}$ can be estimated for each predicted point. Therefore, if a stochastic method such as kriging is used, the prediction errors can be quantified and represented as an error surface or map. Deterministic interpolation methods (IDW, splines, etc.) do not provide such errors, so only the RMSE can be used for comparisons between a deterministic method and another interpolation method.
However, the question is how to calculate the RMSE for an exact interpolator. The RMSE for an exact interpolation can be calculated by cross-validation. In cross-validation, each sample point is removed in turn; an interpolation surface is created without that point, and the value predicted at the removed location is compared with the measured value. The process is repeated for each sample point. The cross-validation RMSE is a summary statistic quantifying the error of the prediction surface. See more about cross-validation below.
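A minimal sketch of leave-one-out cross-validation for an exact deterministic interpolator (here IDW, with hypothetical sample points) could look as follows:

```python
import numpy as np

def idw_predict(coords, values, target, power=2.0):
    """Inverse distance weighting prediction at a single target location."""
    d = np.linalg.norm(coords - target, axis=1)
    if np.any(d == 0):                       # exact interpolator: return the measured value
        return values[d == 0][0]
    w = 1.0 / d ** power
    return np.sum(w * values) / np.sum(w)

def loo_cross_validation_rmse(coords, values):
    """Leave-one-out cross-validation RMSE for the IDW surface."""
    errors = []
    for i in range(len(values)):
        mask = np.arange(len(values)) != i   # remove point i
        pred = idw_predict(coords[mask], values[mask], coords[i])
        errors.append(pred - values[i])
    return np.sqrt(np.mean(np.square(errors)))

# Hypothetical sample points and values
rng = np.random.default_rng(2)
coords = rng.uniform(0, 100, size=(30, 2))
values = coords[:, 0] * 0.1 + rng.normal(0, 1, 30)
print("cross-validation RMSE:", loo_cross_validation_rmse(coords, values))
```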
So far, maps of predictions, called prediction maps or interpolated maps, have been created; prediction maps are produced from the interpolated values. An error map (e.g. a kriging prediction error map) shows the prediction error, which is simply the square root of the kriging variance $\hat{\sigma}^2$. An error map quantifies the uncertainty of the prediction: if the data come from a normal distribution, the true value will be within ±2 times the prediction error about 95 percent of the time. Besides making predictions, the variability of the predictions about the true values is estimated. If the average kriging prediction error is greater than the root-mean-squared error, the variability of the predictions is overestimated; if the average kriging prediction error is less than the root-mean-squared error, the variability of the predictions is underestimated.
Figure 30 : The error map of long-term average values of water level estimates. Errors are larger in sparsely
sampled areas
Two other error-based representations can be derived from the error map: quantile and probability maps. The values of a quantile map reflect the upper or lower limits of the true values; quantile maps represent surfaces of values that the predictions exceed (or do not exceed) with the specified probability. For example, if the quantile probability is set to 0.5, the output map shows the predicted median value at each location. If the quantile probability is set to 0.75, the output map shows, at each location, the value for which there is a 75% chance
that an unknown value is below the surface value, and a 25% chance that the unknown value is
above the surface value.
Figure 31: Quantile maps of the long-term average water level estimates for the specified quantile probabilities of 0.5 and 0.75
Probability maps show the probability that the true value is above (or below) some specified threshold value. For example, if the threshold is set to 100, the map shows the surface of probabilities that the values exceed the 100-measurement mark.
Figure 32: Probability map of the long-term average water level estimates for the specified threshold value of 100
In order to use probability and quantile maps confidently, the data have to come from a full
multivariate normal distribution.
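Assuming normally distributed prediction errors, probability and quantile surfaces can be derived directly from the prediction and error maps. The following sketch uses hypothetical grid values:

```python
import numpy as np
from scipy import stats

# Hypothetical kriging outputs for a small grid: predictions and prediction errors
prediction = np.array([[ 98.0, 101.5], [104.2,  96.8]])
std_error  = np.array([[  2.0,   2.5], [  1.8,   3.0]])

# Probability map: P(true value > threshold), assuming normally distributed errors
threshold = 100.0
prob_exceed = 1.0 - stats.norm.cdf((threshold - prediction) / std_error)

# Quantile map: value not exceeded with probability 0.75 at each cell
q75 = prediction + stats.norm.ppf(0.75) * std_error

print(prob_exceed)
print(q75)
```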
A few graphical plots can be used to validate the results of a kriging prediction. A scatter plot of predicted values (the blue fitted line in the plot given below) versus sample values (dots) is one such plot. One might expect the fitted line to lie on the diagonal (the black dashed line); however, the slope is usually less than one. It is a property of kriging to under-predict large values and over-predict small values (that is, to smooth the surface), as shown in the following figure.
Figure 33 : A scatter plot of predicted and sample values. The fitted line through the scatter of points is shown
in blue with the equation given just below the plot.
If the sample data were not autocorrelated (i.e. were independent in space), every prediction would be the same and equal to the mean of the measured data; in such a case the blue line would be horizontal. With autocorrelation and an appropriate kriging model, the blue line should be closer to the diagonal black dashed line. The tighter the scatter about the diagonal line, the better.
In addition, the error QQ-plot can be used to show the distribution of the prediction error against
the corresponding standard normal distribution. The error plot is the same as the prediction plot,
except the measured values are subtracted from the predicted values.
For the standardized error QQ-plot, the sample values are subtracted from the predicted values
and divided by the estimated kriging errors.
If the standardized errors of the predictions from their measured values are normally distributed,
the points should be close to the dashed line that represents the normal distribution (in the plot
given below). If the errors are normally distributed, it confirms the appropriateness of using the
methods that rely on normality (e.g., ordinary kriging).
Figure 34 : QQ-plot of the prediction standardized error. The long-term average values of water levels are close
to the normal distribution
When a separate validation dataset is used, validation creates a model for only a subset of the data (the training dataset); therefore, it does not directly check the final model, which should include the entire available dataset. Rather, validation checks whether the training-dataset model is valid, for example the choice of semivariogram model, lag size, and search neighborhood. The training-dataset model is then used for the whole dataset.
For independent validation, the mean standardized prediction error of the residuals, relative to the kriging variance, can also be computed.
Other than that, the types of graphs and summary statistics used to compare predictions to true
values are similar for both validation and cross-validation.
Figure 35 : Cross-validation removes each sample data location (red dot) one at a time and produces a model
Figure 36 : Cross-validation results: cross-validation compares the measured and predicted values for all
points
With enough points, the effect of the removed point on the model, which was estimated using that
point, is likely to be minor.
There are two issues to consider when comparing the results from different methods and/or models: one is optimality and the other is validity (the correct variability). It is important to get the correct variability: in kriging, the assessment of the predictions depends on the kriging prediction errors $\hat{\sigma}_i$.
For example, the root-mean-squared error may be smaller for a particular model. Therefore, this
model may be considered as the "optimal" model. However, when comparing to another model, the
root-mean-squared error may be closer to the average estimated prediction error. If the average
kriging prediction error is close to the root-mean-squared prediction error, this is a more valid
model, because only the estimated kriging prediction error assesses uncertainty of the prediction
independently of the actual data values (only variogram model is required). If the average kriging
error is greater than the root-mean-squared prediction error, the variability of predictions is over-
estimated; if the average kriging error is less than the root-mean-squared prediction error, the
variability in predictions is under-estimated.
The purpose of sampling design is to establish the structure of spatial dependence (e.g.
semivariogram) with a minimal number of samples that will produce optimal sample spacing on a
map. Moreover, kriging methods include solutions to accomplish the task of sampling cost
minimization.
As mentioned above, in kriging the estimation error is based only on the sample configuration and the chosen model of spatial dependence, not on the actual data values. Therefore, if the spatial structure (variogram model) is known, maximum or average prediction errors can be determined before any sampling is carried out. Based on these prediction errors, sampling decisions can be made within a cost-benefit framework before fieldwork is undertaken.
Optimal point configuration can be established based on the following optimization criteria of
known kriging systems as some numerical measure of the quality of the sampling design:
1. Minimize the maximum kriging prediction error in the study area
2. Minimize the average kriging prediction error over the entire area
3. Maximize the information in a sample variogram in order to allow reliable variogram
estimation.
The key concepts that were introduced in this module relate to spatial dependence or spatial
correlation that is described in general as “the value of a variable at a point in space or time is
related to its value at nearby points”. Knowing the values of sample points allows one to predict
(with some degree of certainty) the value at any given chosen point.
Here the concept of correlation between variables is applied to correlation within a single variable, using distance or time to model the spatial relation; the term auto-correlation is used to describe such correlation. The main question was how to describe this auto-correlation. Spatial structure is the nature of the spatial relation: how far, and in what directions, is there spatial dependence, and how does the dependence vary with distance and direction between points?
A function of the distance and direction separating two locations is used to quantify autocorrelation.
Spatial structure can be described by range, direction, and strength. The type of interpolation
method that can be used will depend on many factors. A common approach is to try different
interpolation methods and compare the results to determine the best interpolation method for a
given situation. Still, real-world knowledge of the subject matter can affect what interpolation
method to use.
The quality of a sample point set can also affect the choice of an interpolation method. The support of a sample, that is, the physical dimensions it represents, determines the coarser or finer resolution of the prediction result.
Some features of the interpolation methods covered in this module are summarized in the following table:
1. Define autocorrelation.
2. Explain the difference between global and local interpolation methods.
3. What is an exact interpolation method?
4. Explain semivariance as a measure of spatial dependency.
5. Define the elements of nugget, range, and sill in a semivariogram.
6. What is the purpose of fitting a semivariogram with a mathematical model?
7. How does universal kriging differ from ordinary kriging?
8. Describe how a cross-validation analysis is performed.
Recommended Readings:
References
1. Johnston K., Ver Hoef J. M., Krivoruchko K., and Lucas N. (2003) ArcGIS® 9 Using ArcGIS®
Geostatistical Analyst, ESRI.
2. Berke, O. (1999) Estimation and Prediction in the Spatial Linear Model. Water, Air, and Soil
Pollution 110, 215-237.
3. Kang-Tsung Chang (2006) Introduction to Geographic Information Systems, Third Edition, The
McGraw Hill.
4. Bailey, T. & Gatrell, A. (1995) Interactive Spatial Data Analysis. Addison Wesley Longman,
Harlow.
5. Cressie, N. (1993) Statistics for Spatial Data (Revised Edition). Wiley, New York.
Terms used
Geostatistics
Regionalized variables
Auto-correlation
Spatial dependence
Inverse distance weighting
Trend
Global and local polynomial
Splines
Kriging
Regression
Power function
Neighborhood search strategy
Omnidirectional and isotropy
Anisotropy
Semivariance
Variogram
Empirical semivariogram
Lag bins
Auto-covariance
Theoretical, authorized or fitted variogram
Ordinary, simple, universal, regression kriging and co-kriging
Unbiased
Validation
Cross-validation
RMSE
ME
Kriging prediction error
Average kriging prediction error
Root-mean-square standardized prediction error
Error, quantile and probability maps
5 Spatial Modelling
In its simplest application, a Geographic Information System (GIS) can be used to organize and
edit datasets and display the results in the form of a map which can be viewed in digital format or
output to hardcopy. However, the ability of a GIS to generate models simulating real-world events
is perhaps its most significant feature. In reality, any map is a spatial model, as it provides a representation of the real world; however, a GIS can also be used to analyse multiple variables and then interpret these variables to simulate or predict potential events. This capability allows us to
interpolate information for large areas of the landscape. For example, it would be impractical (i.e.,
due to budgetary constraints) to conduct a wildlife habitat field survey for an entire country,
however, a GIS can be used to model wildlife habitat suitability for large areas.
The following module details some of the basic spatial modelling functions available:
Step 1: Identify the Problem – In order to establish a model that addresses a specific problem, it is
necessary to clearly define the problem and goal(s) of the model. Inherent in this step is
establishing parameters by asking the following types of questions:
For example, as part of an environmental impact assessment, a wildlife biologist might wish to
quantify the amount of wildlife habitat potentially impacted within the zone of influence of a
proposed development site. A model, or potentially a series of models, integrating variables
related to wildlife habitat could be used to solve the problem. When identifying the problem the first
step in this example would be to identify the species that are potentially present in the project area
and then select the species that habitat models will be generated for. For example, it may be
determined that half-a-dozen indicator species can be used to represent the various types of
wildlife present (e.g., a predator species, a prey species, a bird, a raptor, a furbearer and an
ungulate). These species then become the phenomena being modelled. In wildlife habitat models
the spatial extent of the study area is typically represented by the maximum home range extent of
the species or by the watersheds or ecosystems surrounding the proposed project area. These
types of boundaries can be used to help define the spatial extents of the model. For the purpose of
quantifying the amount of habitat that might be lost it would be effective to assess the wildlife
habitat available in multiple time periods. This would be done by modelling the present conditions
(baseline) and then comparing those results against a second set of model results depicting the
habitat available after the project was developed.
Step 2: Break the Problem Down – After you have established the problem and the goals, breaking
the problem down into its constituent parts helps create more manageable steps. This involves
identifying the objectives required to reach your goal, the phenomena involved, and the interactions
between these phenomena. Establishing the phenomena allows the modeller to identify and
assemble the datasets required for processing. Often, a flowchart can be useful in visualizing and
understanding the spatial and attribute relationships between the constituents.
In our wildlife habitat assessment example, the modeller would assess the habitat requirements
associated with each species being modelled (e.g., elevation, slope, proximity to water, vegetation)
and then compile a catalogue of relevant datasets and define how their associated attribute and
spatial relationships contribute to creating potential habitat for each species. For example, if we
know a species exists below 500 metres elevation, on slopes less than 10%, within 250 metres to a
source of freshwater and prefers coniferous forest habitat, we can use the data from a digital
elevation model (DEM) to identify areas meeting the elevation and slope requirements. A hydrology
layer would be required to model proximity to water and a vegetation or land cover dataset could
be used to identify coniferous forest habitats. The intersection of these four data layers allows us to
identify potential habitats for our species of interest.
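The intersection of these layers is easy to express as a raster overlay. The sketch below assumes four small co-registered rasters held as NumPy arrays and applies the four criteria from the example above:

```python
import numpy as np

# Hypothetical co-registered raster layers (2D numpy arrays of equal shape)
elevation = np.array([[450., 520.], [480., 610.]])       # metres, e.g. from a DEM
slope = np.array([[8., 12.], [5., 9.]])                   # percent, derived from the DEM
dist_to_water = np.array([[120., 300.], [200., 80.]])     # metres, from a hydrology layer
is_coniferous = np.array([[1, 1], [0, 1]], dtype=bool)    # from a land cover dataset

# Intersection of the four criteria from the example:
# below 500 m, slope < 10 %, within 250 m of freshwater, coniferous forest
habitat = (elevation < 500) & (slope < 10) & (dist_to_water <= 250) & is_coniferous
print(habitat.astype(int))  # 1 = potential habitat, 0 = not suitable
```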
Step 3: Develop and Calibrate the Model – Identifying the tools and mathematical operations
required for analysis is addressed at this stage. Using the data examined in Step 2, the modeller
builds the model from these tools and operations. Subsequent repeated running of the model
allows the modeller to calibrate the model. Calibration involves comparing the results of the model
with its input data and subsequently adjusting the parameters or mathematical operations to arrive
at more accurate results.
Using a GIS, the modeller would calibrate the model by repeatedly running the model and
comparing the results to the data. Adjustments to parameters and operations would be used to
refine the model results.
In our wildlife habitat map example, the initial results would illustrate potential habitat based on the
variables considered; we may then want to refine the results based on other criteria. For example,
perhaps ideal habitat (with a high rating) is within 100 metres of a source of freshwater and habitat
falling between 100 and 250 metres of freshwater would be assigned a moderate habitat rating.
Step 4: Validate the Model Results – Validation is an evaluation of the model’s capacity for
accurately predicting the real world phenomena. This involves comparing the results to field data
and/or running the model using a different set of data representing conditions that are unlike those
used in the calibration phase. If another set of data is not available, a suitable alternative is to split the one dataset into two subsets: one for developing and calibrating the model, the other for the validation process.
Error Detection
Visual Inspection – one method for identifying errors is simply by looking at the output to see
if the results seem logical and consistent and that the output data reflects the input.
Documentation – metadata for input datasets can be used to ensure you are using the data
appropriately, based on scale, accuracy, attribute values.
Validation Rules – topological and attribute domain ranges are two useful validation rules
that allow you to check the data against the design.
Consistency of Results – collecting data again and repeating the conversion and processing
steps can be used to compare the results of a model.
Ground Truthing – verifying the results of a model in the field is a dependable method for
validating data.
Statistics – correlating the model results with a closely associated variable is a statistical
approach to data verification.
In our example, the wildlife biologist could validate the habitat suitability model results by
comparing areas having high predicted wildlife habitat ratings with known wildlife habitat locations
(e.g., denning sites).
Step 5: Implement the Model Results – Once the model has been validated, the modeller can then
implement the model results.
By using the validated results of the model, our wildlife biologist can determine how much habitat is currently present and then overlay the proposed development footprint to determine the amount of habitat lost once the proposed development is in operation. In addition, the model results could be used to identify high-priority areas for subsequent field surveying efforts.
Limitations of Modelling
It is critical to understand that all models have limitations. Factors that may limit the use or
implementation of a model include:
insufficient data – for example data at the required level of detail may not be available
for all, or part of, the study area
lack of user understanding of the data – for example, many models represent a
probability analysis rather than an absolute interpretation of the data
inappropriate modelling – in this case the model would be generating incorrect results
either based on a faulty assumption in the model design or as a result of an error in
executing the model
5.2.1 Purpose
Descriptive Model
A descriptive model describes the current conditions of a real world environment (natural or
anthropogenic). A simple thematic map showing land use can be considered a descriptive model
in that it represents an environmental characteristic that actually occurs on the ground. Other
examples include digital elevation models, land use coverage, meteorological maps, and
vegetation cover maps.
Explanatory Model
As indicated by their name, explanatory models attempt to explain or account for the occurrence of
existing phenomena by identifying the factors involved and assessing their relative influence.
Examples of explanatory models include soil erosion models, ozone depletion, and algae blooms.
Predictive Model
A predictive (or prescriptive) model predicts, by using the factors identified in an explanatory model, where occurrences of a particular environmental phenomenon might be found. For example, a habitat capability map, derived from vegetation and slope raster datasets, might be used to predict the capacity of an area to support habitation by a certain animal species. Forecasting sea level rise based on changes in climate (e.g., mean temperature values) would be another example of a predictive model.
Normative Model
Normative models attempt to affirm how phenomena (especially networks) ought to operate in the real world; they recommend optimal solutions for given situations. Examples include food aid distribution, traffic volume levels, and route planning (e.g., travelling salesperson problems).
5.2.2 Methodology
Stochastic
A stochastic (or probabilistic) model is represented by a mathematical equation where at least one
of its variables or parameters is assumed to contain some level of randomness. Because of this
randomness, some degree of error or uncertainty is accepted and expressed as a measure of
probability along with the predictions of the model. An example might be an evaluation of the
probability of the occurrence of landslides due to forest harvest areas. Another example of a
stochastic model would be the application of a kriging analysis to a series of water quality sampling
measurements. This would result in an interpolated surface depicting the potential concentration of
a given contaminant (a predictive map) and the generation of a standard level of error for each
predicted value.
Deterministic
Contrary to the stochastic method, a deterministic model does not consider randomness to be a
part of any of the variables or parameters present in the mathematical equation.
Deductive
Deductive models derive specific conclusions by using general premises established in scientific
theory or physical laws - where the variables and their interactions are well understood. An
archaeological potential model that uses physical constraints (e.g., slope, aspect, proximity to
water) to predict the location of sites is an example of deductive logic.
Vector-Based Binary Model – a vector-based binary model uses overlay operations (e.g.,
intersect, union) to combine the shapes and attributes of the contributing data.
The most common application of binary raster data is site selection analysis, where multiple criteria are evaluated to determine the most appropriate location for a certain facility or specific land use. Two major methods are used to model a siting analysis: selecting one site by evaluating a set of pre-selected sites against stringent selection criteria, or evaluating all potential sites.
Suppose a mining company wanted to select potential gold mining sites in a valley using the following criteria:
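The original list of criteria accompanies a figure that is not reproduced here. As a purely illustrative sketch, the overlay logic of such a binary model can be expressed with NumPy, assuming two hypothetical criteria (gentle slope and proximity to an existing road, factors that the Model Builder discussion below also mentions):

```python
import numpy as np

# Hypothetical co-registered rasters for the site selection example
slope = np.array([[4., 12., 7.], [9., 3., 15.]])                      # percent slope
dist_to_road = np.array([[800., 200., 1500.], [400., 300., 900.]])    # metres

# Hypothetical thresholds (the document's actual criteria list is not reproduced):
# gentle slopes and reasonable road access
suitable = (slope < 10) & (dist_to_road <= 1000)

# Each cell is 1 (meets all criteria) or 0 (fails at least one criterion)
print(suitable.astype(int))
```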
Ranking
Also referred to as an index model, a ranking model calculates an index value specific to each unit
area, with the end result being a ranked map. Similar to binary models, ranking models evaluate
multiple criteria through the use of overlay operations and raster arithmetic. The difference lies in
the fact that while binary models are given a yes/no value (1 or 0), ranking models apply an index
value to each unit area.
Commonly, a weighted linear combination method is used to determine the index value for each
unit area. This method is completed in three steps:
1. Relative Importance – all criteria are evaluated against each other to determine the
relative importance of each criterion weighted on a scale of 0.0 to 1.0 (0 – 100%).
2. Standardize Criteria – the relative importance values are standardized between
criteria. This step facilitates the comparison of multiple variables.
3. Calculate Index Value – an index value is calculated for each unit area by adding the
weighted criterion values and then dividing by the total of the weights.
Ranking models are used predominantly in creating suitability or vulnerability maps which are
discussed further in the section on model examples (Section 1.5).
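A minimal sketch of the weighted linear combination described above, with hypothetical standardized criterion rasters and weights:

```python
import numpy as np

# Hypothetical standardized criterion rasters (values 0.0-1.0) for each unit area
criteria = {
    "habitat":   np.array([[0.9, 0.4], [0.7, 0.2]]),
    "slope":     np.array([[0.6, 0.8], [0.3, 0.5]]),
    "proximity": np.array([[0.5, 0.9], [0.8, 0.1]]),
}
# Hypothetical relative-importance weights for each criterion
weights = {"habitat": 0.5, "slope": 0.3, "proximity": 0.2}

# Weighted linear combination: sum of weighted criteria divided by the total weight
total_weight = sum(weights.values())
index = sum(weights[name] * layer for name, layer in criteria.items()) / total_weight
print(index)  # ranked index value per unit area
```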
process(es), based on the layer and tool layout and order of steps as diagrammed by the modeller
in the display window.
After the desired data and tools have been added and suitably connected in the display window,
the model can be saved. The saved model configuration can serve as a template and used with
different model parameters/data inputs.
The steps used in the gold mine site selection example above could be inserted into Model Builder
and run as an automated and comprehensive process. The slope layer and road buffer processes
could be designed in Model Builder and then connected to a union tool, ultimately yielding a layer
with appropriate locations for the mine.
It is extremely important when implementing a model using the Model Builder interface to double
check the processing steps associated with each model element and to review the output data at
each step to ensure the model is creating the expected results. As with any automated process it
can often be difficult to determine if a mistake exists within the model and therefore the quality
assurance and quality control procedures are of great importance.
Discipline knowledge
o Wildlife experts
o GIS and data integration
Habitat Fragmentation
The mapping of habitat fragmentation is another tool to help manage environmental change.
Habitat fragmentation maps integrate wildlife habitat map layers with maps depicting the extent of
human disturbance to illustrate the size of available habitat patches and areas of continuous
habitat. The aggregate maps help us quantify the amount of viable habitat and the potential loss
resulting from existing and/or proposed development activities. They help us identify: the total area
and average patch size of habitat; potential increases in the amount of edge habitat (this can be a
positive or negative factor depending on the species of interest); the potential decrease in the
amount of interior habitat; patches of habitat that have the potential to become isolated; and the
potential increase in smaller habitat patches.
The results can be used to help identify the locations of major habitat reservoirs and habitat
refuges that are essential to the continued success of the species. A habitat reservoir is a large
area (the size of which is dependent on the species of interest) of habitat that has sufficient size
and ecological integrity to support a range of native species including species that need interior
habitats. A habitat refuge is a small patch of habitat that provides food, shelter and other needs for
wildlife. It may include human-modified ecosystems. Refuges generally are not large enough to
maintain the genetic diversity of a population but may act as important ‘stepping stones’ to habitat
reservoirs for species and for maintaining ecological functions.
Constraints Mapping
Constraints mapping is a type of suitability map. The analysis identifies opportunities and
restrictions (constraints) to project construction. The process involves assembling a variety of
different spatial data layers (e.g., terrain and topography, wildlife habitat, slope, protected areas,
soils, heritage resource information), assigning the data in each layer a sensitivity rating (based on
scientific and local knowledge), and combining the layers to develop a single map that integrates
all of the source data layers. The resultant derivative map product identifies and synthesizes the
complex relationships between different environmental datasets, while also considering location
and operational limitations posed by project design. The constraints map serves as a visual
decision-support tool that delineates areas determined to be of environmental and cultural
importance based on the occurrence of sensitive landscape features. The constraints designation
can be either avoidance of the site (e.g., a ‘no-go’ area) or a graduated level of concern. Ideally the
constraints map helps guide the placement of surface facilities into the least constraining locations,
thereby lowering the environmental and cultural impact of industrial activities.
The type of derivative surface that is created depends on the values and weights applied to the
input data layers. For example, corridor delineations depend on slope and ecological land
classification (ELC) type for defining where ideal animal corridors are present within the landscape,
while accessibility mapping will depend on slope and distance from communities.
Overview
Constraints mapping identifies opportunities and restrictions to project siting by integrating a series
of spatial layers into a single map. Figure 4 below provides an overview of the constraints mapping
process.
Figure 4: Overview of the constraints mapping process - input GIS datasets are geoprocessed into sensitivity layers (e.g. habitat index, industrial activity, lakes), which are then weighted and combined into the constraints map.
The constraints map represents the visual result or output of the constraint model. The
constraints map is created from a mathematical combination of any number of sensitivity layers.
A sensitivity layer represents the sensitivity of a given discipline or subject of interest (e.g.,
wildlife, traditional knowledge or vegetation) relative to the objective of minimizing
environmental and cultural impacts while project siting. Each sensitivity layer is the result of
combining input data from potentially many GIS datasets that depict features on the landscape
such as roads, forest cut blocks or wildlife habitat.
To be incorporated into the model, each feature in a GIS dataset must be assigned a sensitivity
value - a constraint coefficient. The value assigned represents the sensitivity of a particular feature
to the constraints objective. For instance, being interested in minimizing the social and cultural
impacts of siting a project, we are aware of the importance of wetlands to wildlife and ecosystem
health – particularly marshes and swamps. These wetlands are therefore assigned a relatively high
sensitivity value (which recognizes that development may be more constrained where these
features exist). In the constraints model, the possible sensitivity values ranged from zero (0.0) to
one (1.0). A value of zero represents no sensitivity and a value of one represents the highest
sensitivity.
For a given point on the constraints map, the resultant constraint value is calculated as follows:
Constraint Map Unit Value = (wetland and riparian sensitivity value x 0.492) + (key wildlife
habitat sensitivity value x 0.268) + (disturbance sensitivity value x 0.140) + (vegetation
sensitivity value x 0.053) + (slope x 0.047).
The following is an example calculation for a map unit (e.g., a grid cell) where the final constraint
value was determined to be 0.707. The example Sensitivity Value is the value for that sensitivity
layer at a given pixel location. A Weighted Sensitivity is calculated for each map unit in the study
area. Weighted Sensitivity values range from 0.0 (no constraints) to 1.0 (fully constrained). The
constraint value represents the sum of the weighted sensitivity values for a given map unit.
Table 2. Constraints Calculation Example
1. For a sensitivity layer, the sensitivity at a given location on the landscape is determined by
choosing the highest of all sensitivities from the input GIS data. That means that if there
were three features in the same area all with a sensitivity of 1.0, they would be considered
no more important than an area where there was only one feature with a sensitivity of 1.0.
Both areas would be assigned a sensitivity of 1.0.
2. The weighting mechanism used in the final combination of sensitivity layers can dilute the
final constraint value. If, for example, the Wetlands and Riparian sensitivity layer had been
weighted much lower (e.g., 0.10), the other layers would have contributed much more to the
final result and the constraints map would be much different. The relative importance (or
lack thereof) of the wetland and riparian features would not be evident in the final output.
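A small sketch of this combination step, using the layer weights quoted above but hypothetical sensitivity rasters (the highest sensitivity is kept where input features overlap, as described in note 1):

```python
import numpy as np

# Hypothetical sensitivity rasters (0.0 = no sensitivity, 1.0 = highest sensitivity).
# Where several input features overlap, the highest sensitivity is kept (np.maximum).
wetland  = np.maximum.reduce([np.array([[1.0, 0.2]]), np.array([[0.6, 0.0]])])
wildlife = np.array([[0.8, 0.5]])
disturb  = np.array([[0.3, 0.9]])
veg      = np.array([[0.4, 0.6]])
slope    = np.array([[0.2, 0.7]])

# Weighted combination using the layer weights quoted in the text
constraint = (wetland * 0.492 + wildlife * 0.268 + disturb * 0.140
              + veg * 0.053 + slope * 0.047)
print(constraint)  # constraint value per map unit, between 0.0 and 1.0
```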
Application
Constraints mapping can provide information on environmental constraints as per the example
detailed, however, it can also be used to identify opportunities for siting a project. Ideally two
separate analyses would be run in parallel: the constraints analysis examining environmental
sensitivities; and an opportunities analysis examining existing development and terrain conditions
to identify the ‘positive’ factors associated with a given location (e.g., close proximity to existing
access rights-of-way, low slope, minimum number of road-stream crossings). Potentially the results
of the two approaches can either be used independently or integrated into a single map that
identifies optimal locations based on both engineering-based siting requirements and minimizing
environmental impact. The primary goal of the constraints process is to provide information for
more informed decision making, thereby minimizing the effects of industrial development. The
constraints map helps effectively identify potential low-impact development sites and, while field verification of these sites is still required, it allows the effort associated with field programs to be focussed on a much smaller subset of the landscape.
A road density analysis can be conducted with all roads contributing equally to the final density
statistic or, alternatively, a weighting can be applied to roads with a greater width or a higher traffic
volume resulting in a weighted road density measurement.
5.5.4 Universal Soil Loss Equation
The Universal Soil Loss Equation (USLE) is a conventional model of soil erosion that takes into
account climate characteristics, soil properties, topography, surface conditions, and human
activities. It predicts the average soil loss attributed to the runoff from various slopes associated
with agriculture, rangeland, and other managed land systems (e.g., construction sites).
The equation is A = R × K × L × S × C × P, where:
A is the predicted average annual soil loss;
R is the rainfall-runoff erosivity factor;
K is the soil erodibility factor;
L is the slope length factor;
S is the slope steepness factor;
C is the cover-management factor; and
P is the support practice factor.
Each of these factors contributes to a simulation of the conditions that affect the severity of soil erosion at a particular location.
GIS enables the model to incorporate the spatial portion of the equation:
Digital elevation models are used to calculate the slope length (L) and steepness (S) factors for each cell;
land cover/land use maps may be used as a source for the crop management (C) and support practice (P) factors (Figure 9).
Figure 9. Lithuanian Land Use Map.
After importing and pre-processing, an overlay operation is used to reclassify the data into equivalent units, combine the input data using the Universal Soil Loss Equation, and generate a derivative soil erosion potential layer.
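Because the USLE is a cell-by-cell multiplication of factor rasters, the GIS implementation reduces to a raster overlay. A minimal sketch with hypothetical factor values (the combined LS factor standing in for the separate L and S rasters):

```python
import numpy as np

# Hypothetical factor rasters (2D numpy arrays of equal shape), one value per cell
R = np.array([[550., 550.], [600., 600.]])    # rainfall-runoff erosivity
K = np.array([[0.30, 0.25], [0.28, 0.32]])    # soil erodibility
LS = np.array([[1.2, 0.8], [2.1, 1.5]])       # combined slope length and steepness (from a DEM)
C = np.array([[0.10, 0.05], [0.20, 0.15]])    # cover-management (from land cover)
P = np.array([[1.0, 0.8], [1.0, 0.9]])        # support practice (from land use)

# Cell-by-cell multiplication gives the predicted average annual soil loss A
A = R * K * LS * C * P
print(A)  # soil erosion potential layer
```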
References
1. Blyth, C. Ann, and David Cake, Constraints Mapping as a Decision Support Tool for Project
Siting. GeoTec 2005 Conference. Vancouver, BC, February 13-15, 2005.
3. Chrisman, N. Exploring Geographic Information Systems. John Wiley and Sons, 1997.