C. H. Ettema, D. C. Coleman, G. Vellidis, R. Lowrance, and S. Rathbun. 1998. Spatiotemporal distributions of bacterivorous nematodes and soil resources in a restored riparian wetland. Ecology 79:2721-2734.


Supplements

Supplement 1: SAS Programs for geostatistical analysis.
Ecological Archives E079-001-S1.

Description
Author contact information
File list and downloads
File descriptions
References



Description

This document contains supplementary material for the statistical analysis used by Ettema et al. (1998). SAS programs for geostatistical analysis and the accompanying output files for one of the nematodes investigated (Prismatolaimus) are presented and explained.

Author contact information

Dept. of Environmental Sciences
Sub-department of Soil Quality
Wageningen University
POB 8005
6700 EC Wageningen
The Netherlands
E-mail: christien.ettema@bb.benp.wau.nl


File list and downloads

More complete descriptions are provided in the following section.

  1. prisnema.dat is a datafile with abundance data for the bacterivorous nematode Prismatolaimus.
  2. semivari.sas is a SAS program for estimating and modeling pooled semivariance in nonstationary data.*
  3. semivari.lst is the accompanying output file for Prismatolaimus.
  4. glstest.sas is a SAS program for testing the partial slopes of x, y-coordinates using GLS estimation.*
  5. glstest.lst is the accompanying output file for Prismatolaimus.
  6. unikrige.sas is a SAS program for universal kriging.*
  7. prism.nov is the accompanying output file for Prismatolaimus.
  8. E079001.zip is a single file containing all the components of this supplement.

*NOTE: All of the *.sas files were written for SAS version 6.08, and all of them require the SAS matrix language IML. For a short introduction to SAS-IML, click here.


File descriptions

1. prisnema.dat

The datafile prisnema.dat contains four tab-delimited variables: x-coordinate, y-coordinate, sampling date (t=1, 108, 172, and 290), and Prismatolaimus abundance per assembled core (2.22 cm diameter, 15 cm depth). See Ettema et al. (1998) for details on sampling and nematode enumeration. Note that in each SAS program, nematode abundance is converted to a per-meter-squared basis (using factor 2583.4745).
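The conversion factor appears to follow from the cross-sectional area of the core. The following is a minimal Python check, assuming the factor is simply 1 m^2 (10,000 cm^2) divided by the area of a single core of 2.22 cm diameter; the function name is illustrative and does not come from the SAS programs.

```python
import math

def per_m2_factor(core_diameter_cm: float) -> float:
    """Multiplier converting a per-core count to a per-square-meter count."""
    core_area_cm2 = math.pi * (core_diameter_cm / 2.0) ** 2
    return 10_000.0 / core_area_cm2  # 1 m^2 = 10,000 cm^2

factor = per_m2_factor(2.22)
print(round(factor, 2))
```

Under this assumption the result agrees with the factor 2583.4745 used in the SAS programs to within rounding.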


2. semivari.sas

The program semivari.sas contains four main parts following the data step.

The first part (1) is a simple ordinary-least-squares (OLS) regression of nematode abundance against x-coordinates (the y-coordinate variable is left out in the model statement as it is not significant; it may be added in for different data sets!). This OLS regression is necessary because there is a large-scale spatial trend in the data along the x-direction (i.e., the data are non-stationary). Basically, the OLS-residuals can be considered stationary data, with large-scale spatial trends removed.
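The detrending step can be sketched outside SAS as follows. This is a minimal illustration of a one-variable OLS fit and its residuals, not a translation of semivari.sas; the function name and the toy data are hypothetical.

```python
def ols_residuals(x, z):
    """Fit z = b0 + b1*x by ordinary least squares; return (b0, b1, residuals)."""
    n = len(x)
    x_mean = sum(x) / n
    z_mean = sum(z) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxz = sum((xi - x_mean) * (zi - z_mean) for xi, zi in zip(x, z))
    b1 = sxz / sxx
    b0 = z_mean - b1 * x_mean
    resid = [zi - (b0 + b1 * xi) for xi, zi in zip(x, z)]
    return b0, b1, resid

# Hypothetical toy data: abundance declining along x, plus small deviations.
x = [0.0, 10.0, 20.0, 30.0, 40.0]
z = [52.0, 39.0, 31.0, 18.0, 10.0]
b0, b1, resid = ols_residuals(x, z)
```

The residuals sum to zero and are uncorrelated with x, which is the sense in which the large-scale trend along x has been removed.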

In part 2, a main IML program (vgram) is defined in which semivariances are estimated using the OLS-residuals instead of the original data (residuals are read into the 'z' array in part 3). First, for every pair of points, their spatial separation distance is calculated as Euclidean distance (dist). Secondly, their "temporal distance" is calculated (dtime). For pooled spatial semivariance, we are not interested in the semivariance between points of different sampling dates; that's why the program contains a trick to leave these semivariances out of the final calculation:

The statement

dist=dist + 10000 *dtime;

combined with

if dist <= maxdist then append;

(the value of maxdist being set in part 3) throws out all the semivariances of pairs of points with a dtime > 0. Next, the distances are discretized into distance classes (this is needed as samples were taken on an irregular lattice).
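The effect of this trick can be illustrated with a short Python sketch. It assumes that dtime is the absolute difference between the two sampling dates, which the description above does not state explicitly; the function name and the value of maxdist are hypothetical.

```python
import math

def pooled_distance(p, q, penalty=10000.0):
    """Spatial distance between two samples, inflated if the dates differ.

    Each sample is (x, y, t). Mirrors dist = dist + 10000*dtime, assuming
    dtime is the absolute difference between sampling dates.
    """
    dist = math.hypot(p[0] - q[0], p[1] - q[1])
    dtime = abs(p[2] - q[2])
    return dist + penalty * dtime

maxdist = 100.0  # hypothetical cutoff
same_date = pooled_distance((0.0, 0.0, 1), (3.0, 4.0, 1))
cross_date = pooled_distance((0.0, 0.0, 1), (3.0, 4.0, 108))
keep_same = same_date <= maxdist    # pair enters the semivariogram
keep_cross = cross_date <= maxdist  # pair is discarded
```

Any pair from different dates receives a distance far beyond maxdist, so it never passes the filter, while pairs from the same date keep their true spatial distance.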

The first "if" block in part 2 adds the squared differences between pairs of observations to v[n]. It does this only for a limited number of distance (size) classes (size and nclass being defined in part 3). For a different data set, you will need to experiment with the settings of size, nclass, and maxdist. The rule of thumb is to estimate semivariance up to 3/4 of the maximum separation distance in your data, and to have at least 30 pairs of points per distance class.

The 'do i=1 to nclass' loop in part 2 labels each distance class by its midpoint (so that, e.g., 'dist=15' really covers distances of 10-20 m). In addition, the average semivariance (vario) is calculated per distance class i by dividing the sum accumulated in v[i] by twice the number of pairs in the distance class, 2*count[i], which is the 2Nh in equation 1 in Ettema et al. (1998):

Equation 1: $\gamma(h) = \frac{1}{2 N_h} \sum_{i=1}^{N_h} [z(s_i) - z(s_i + h)]^2$

The output of this loop contains 3 columns of data, containing the distance class (dist, i.e. h in Equation 1 above), the average semivariance in that class (vario, being gamma(h)), and the number of pairs of points in each distance class (n, being Nh).
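The estimation described above can be sketched in Python as follows. This is a minimal illustration, not a translation of the vgram program: the half-open class boundaries, the function name, and the toy data are assumptions.

```python
import math

def pooled_semivariogram(samples, size, nclass, maxdist):
    """Return a list of (dist, vario, n) rows, one per non-empty class.

    samples: list of (x, y, t, z) with z the detrended value (OLS residual).
    """
    v = [0.0] * nclass      # summed squared differences per class
    count = [0] * nclass    # number of pairs per class (N_h)
    for i in range(len(samples)):
        xi, yi, ti, zi = samples[i]
        for j in range(i + 1, len(samples)):
            xj, yj, tj, zj = samples[j]
            dist = math.hypot(xi - xj, yi - yj) + 10000.0 * abs(ti - tj)
            if dist > maxdist:
                continue    # too far apart, or from different dates
            k = int(dist // size)
            if k < nclass:
                v[k] += (zi - zj) ** 2
                count[k] += 1
    rows = []
    for k in range(nclass):
        if count[k] > 0:
            midpoint = (k + 0.5) * size   # class label, e.g. 15 for 10-20 m
            rows.append((midpoint, v[k] / (2 * count[k]), count[k]))
    return rows

# Hypothetical toy data: three samples on date 1 and one on date 108.
samples = [(0.0, 0.0, 1, 1.0), (5.0, 0.0, 1, 3.0),
           (15.0, 0.0, 1, 2.0), (0.0, 0.0, 108, 9.0)]
rows = pooled_semivariogram(samples, size=10.0, nclass=2, maxdist=20.0)
```

With these toy values the sample from date 108 contributes to no class, and the three same-date pairs fall into the classes labeled 5 and 15.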

Part 3 of the program reads the data from the data set a (the residuals!) into arrays, sets the values of the parameters size, nclass and maxdist, and invokes the execution of the main program (vgram).

NOTE: if the OLS yields only non-significant slopes for x, y-coordinates, the data can be considered stationary. In that case, some statements in (3) have to be changed:

use a;

changes into

 

AND

read all var{resid} into z;

changes into

read all var{prism} into z; (prism being the original data).

As mentioned earlier, the settings of size, nclass, and maxdist are to be adjusted to the spatial characteristics of your data set.

Part 4 of the program, following simple print and plot statements, fits an exponential model to the semivariogram using SAS's non-linear regression procedure, PROC NLIN. The exponential model is as follows:

Equation 2: $\gamma(h) = C_o + C_e (1 - e^{-\alpha h})$

In the parameters statement, guess values are given for Co, Ce, and alpha. The model is fitted using weighted least squares (WLS) (_weight_ statement), that is, priority is given to fitting the model best at short distances. It is important to try different guess values to find the best-fitting parameters (see below for judging best fit).

3. semivari.lst

The output from semivari.sas, run with prisnema.dat, is semivari.lst. It contains several parts. First, it shows the OLS output, with a significant slope for x. It also shows the estimated semivariance (vario) for 9 distance classes (dist) and the number of observation pairs in each class (n). NOTE: if the slope for x were not significant, you would re-estimate the semivariances using the original data (see above). NOTE: the final, unbiased test for x is done in glstest.sas, described below.

Next, the output shows the semivariogram plot, revealing spatial autocorrelation at short distances, and leveling off (near-independence) at large distances. The plot is followed by the non-linear regression iterations and the message that convergence was attained. The relative structure can be calculated as (Ce/(Co + Ce)), and the range as (3/alpha). The model R^2 can be calculated from the ANOVA table as:

R^2 = (Corrected Total Weighted SS - Residual Weighted SS) / Corrected Total Weighted SS

The best fitting model has the highest R^2 and the lowest Weighted Mean Square Residual.
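The summary quantities named above can be computed with a few lines of Python. The sketch assumes the exponential model has the form gamma(h) = Co + Ce*(1 - exp(-alpha*h)), which is consistent with a range of 3/alpha; all parameter and sum-of-squares values are hypothetical and are not taken from semivari.lst.

```python
import math

def exp_semivariance(h, c0, ce, alpha):
    """Exponential model: nugget c0 plus structural variance ce, for h > 0."""
    return c0 + ce * (1.0 - math.exp(-alpha * h))

def relative_structure(c0, ce):
    return ce / (c0 + ce)

def practical_range(alpha):
    return 3.0 / alpha

def weighted_r2(corrected_total_wss, residual_wss):
    return (corrected_total_wss - residual_wss) / corrected_total_wss

# Hypothetical parameter and sum-of-squares values, not from semivari.lst.
c0, ce, alpha = 2.0, 6.0, 0.1
rng = practical_range(alpha)
# Fraction of the structural variance Ce reached at the range.
frac = (exp_semivariance(rng, c0, ce, alpha) - c0) / ce
r2 = weighted_r2(100.0, 20.0)
```

Under the assumed model form, the semivariance at h = 3/alpha has reached 1 - exp(-3), or about 95%, of the structural variance Ce, which is why 3/alpha serves as the range.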

The remainder of the output shows the predicted values, residuals and residual plot.


4. glstest.sas

Although OLS regression gives unbiased estimates of partial slopes, their standard errors, as estimated by common regression software, are biased in the presence of spatial correlation. Thus, for a thorough evaluation of large-scale trend in the data, the slopes are tested using generalized least squares (GLS). GLS estimates of regression coefficients use information from the spatial correlation matrix (which, in our case, is derived from the model fitted to the residuals-based semivariogram). By giving more weight to isolated sites than to sites in crowded locations, which present redundant information, GLS regression coefficient estimates are more precise than OLS estimates, and estimates of their standard errors are unbiased (Cressie 1993: p. 20-24). GLS estimates can be used to test effects of independent variables, such as x,y-coordinates.

The SAS program glstest.sas has 3 main parts following the data step. In part 1, the subroutine vmat is defined, in which the spatial correlation (not semivariance) matrix v[i,j] is calculated. To understand the calculations, remember the following equations for semivariance, covariance, and correlation:

Spatial semivariance:

Equation 3: $\gamma(r) = C(0) - C(r)$

where C(r) is the spatial covariance function and r is distance.

(NOTE: h is distance class - see Equation 1, above)

Spatial Correlation:

Equation 4: $\rho(r) = C(r) / C(0)$

Combining Equations 3 and 4:

Equation 5: $\rho(r) = 1 - \gamma(r) / C(0)$

In the exponential model,

Equation 6: $\gamma(r) = C_o + C_e (1 - e^{-\alpha r})$ for $r > 0$, with $C(0) = C_o + C_e$

Thus, Eq. (5) can be re-written as:

Equation 7: $\rho(r) = \frac{C_e \, e^{-\alpha r}}{C_o + C_e}$ for $r > 0$, with $\rho(0) = 1$
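A correlation matrix of this kind can be sketched in Python as follows. The sketch assumes the exponential model gives rho(r) = Ce*exp(-alpha*r)/(Co + Ce) for r > 0 and rho(0) = 1; it covers spatial separation only and is not a translation of subroutine vmat. The function name, coordinates, and parameter values are hypothetical.

```python
import math

def correlation_matrix(coords, c0, ce, alpha):
    """Spatial correlation matrix for an exponential model with nugget.

    rho(r) = ce * exp(-alpha * r) / (c0 + ce) for r > 0, and rho(0) = 1.
    """
    n = len(coords)
    v = [[1.0] * n for _ in range(n)]   # diagonal entries stay at one
    for i in range(n):
        for j in range(n):
            if i != j:                  # off-diagonal entries only
                r = math.hypot(coords[i][0] - coords[j][0],
                               coords[i][1] - coords[j][1])
                v[i][j] = ce * math.exp(-alpha * r) / (c0 + ce)
    return v

# Hypothetical coordinates and model parameters.
coords = [(0.0, 0.0), (3.0, 4.0), (0.0, 10.0)]
v = correlation_matrix(coords, c0=2.0, ce=6.0, alpha=0.1)
```

Because of the nugget Co, the off-diagonal correlations never exceed Ce/(Co + Ce), while the diagonal entries are exactly one.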

IMPORTANT: The notation i^=j (in subroutine vmat) means "for i unequal to j " (i.e. so that r > 0). Some SAS environments require a slightly different notation, e.g. i¬=j. If not sure (and if the printout shows a negative sigma2), activate the print v statement and check that the diagonal entries of the v-matrix are all ones. Other tricky notations: Some SAS environments use a `, not a ', to transpose matrices. || is used to combine two matrices (if your screen shows these bars split in the center, it may not work properly!). // is used to place one matrix below the other. Check SAS-IML guides for further info.

The second section (2) contains the main program, gls, which performs the actual GLS estimation, using the subroutine vmat to calculate the correlation matrix. Remember the multiple regression model:

Equation 8: $z = X\beta + \varepsilon$

where

Equation 9: $\mathrm{var}(\varepsilon) = \sigma^2 V$

in which V is the correlation (rho(r)) matrix computed by subroutine vmat. The GLS estimators for beta (beta in the SAS program), var(beta) (varbeta), and sigma-squared (sigma2) are, respectively:

Equation 10: $\hat{\beta} = (X' V^{-1} X)^{-1} X' V^{-1} z$

Equation 11: $\mathrm{var}(\hat{\beta}) = \hat{\sigma}^2 (X' V^{-1} X)^{-1}$

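Equation 12: $\hat{\sigma}^2 = \frac{(z - X\hat{\beta})' V^{-1} (z - X\hat{\beta})}{n - p}$, where n is the number of observations and p is the number of estimated regression coefficients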
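The GLS estimation can be sketched in plain Python as follows. This is a minimal illustration using the standard GLS formulas, not a translation of the gls program. It assumes that an intercept column is added to the design matrix and that sigma-squared is divided by n - p; the helper functions and the toy data are hypothetical.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inverse(a):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [val / p for val in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [rv - f * cv for rv, cv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def gls(xmat, z, v):
    """Return (beta, varbeta, sigma2) for z = X beta + e, var(e) = sigma2 * V.

    xmat holds the explanatory columns only; an intercept column is added here.
    """
    n = len(z)
    x = [[1.0] + list(row) for row in xmat]
    p = len(x[0])
    zc = [[zi] for zi in z]
    vinv = inverse(v)
    xt_vinv = matmul(transpose(x), vinv)
    xvx_inv = inverse(matmul(xt_vinv, x))
    beta = matmul(xvx_inv, matmul(xt_vinv, zc))
    fitted = matmul(x, beta)
    e = [[zc[i][0] - fitted[i][0]] for i in range(n)]
    sigma2 = matmul(matmul(transpose(e), vinv), e)[0][0] / (n - p)
    varbeta = [[sigma2 * val for val in row] for row in xvx_inv]
    return [b[0] for b in beta], varbeta, sigma2

# Hypothetical toy data; with V equal to the identity, GLS reduces to OLS.
x = [0.0, 10.0, 20.0, 30.0, 40.0]
z = [52.0, 39.0, 31.0, 18.0, 10.0]
identity = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
beta, varbeta, sigma2 = gls([[xi] for xi in x], z, identity)
```

Passing the identity matrix for V is a convenient sanity check, because the estimates must then match an ordinary regression. With a spatial correlation matrix in its place, the same function returns the GLS estimates.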

The last section (3) reads data from the dataset wetland into arrays appearing in the IML program, and sets the values of program parameters Co, Ce, alpha, and maxdist. The final statement (run gls) invokes the actual execution of the main program gls and subroutine vmat.

NOTE 1: If both x- and y- coordinates are to be tested, add y to the design matrix xmat:

read all var{x y} into xmat;

NOTE 2: For Co, Ce and alpha fill in the best fitting values from the output of semivari.sas, i.e. from semivari.lst

NOTE 3: The program is set up to test the partial slope of x across the 4 seasons. To test individual seasons, activate the appropriate statements in the data step. For example, to test the first date (t=1), drop the data of all other dates using the statements:

if time=108 then delete;
if time=172 then delete;
if time=290 then delete;

Limiting the test to one sampling date in this way makes the dtime calculation in part (1) irrelevant. However, leave it in place: it does not affect the other calculations, and it reduces the risk of forgetting to put it back later.


5. glstest.lst

The glstest output, glstest.lst, should be read as follows. The column titled beta contains the estimates for beta_0 and beta_1; beta_1 being the slope for the x-coordinate-variable tested. The diagonal of the matrix titled varbeta contains the variance estimates for the betas: var(beta_0) is the top-left value in the matrix, and var(beta_1) is the bottom-right value. The t-statistic for beta_1 is estimated as:

t = beta_1 / sqrt{var(beta_1)}

which here = -1577.945 / sqrt(763592.47) = -1.81

Compare this value to the t_(n-2) percentage point (in this particular test, n=107, and 2 is the number of estimated model parameters, beta_0 and beta_1). Use a two-sided test.
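The arithmetic above can be reproduced with a short Python check, using the beta_1 and var(beta_1) values quoted from glstest.lst; the function name is illustrative.

```python
import math

def t_statistic(beta, var_beta):
    """t-statistic for a coefficient given its estimate and variance."""
    return beta / math.sqrt(var_beta)

t = t_statistic(-1577.945, 763592.47)
df = 107 - 2  # n minus the number of estimated parameters
print(round(t, 2), df)
```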

Note that if y (the y-coordinates) is included as an explanatory variable in the design matrix xmat (see glstest.sas, NOTE 1), beta would show 3 instead of 2 estimates, and varbeta would be a 3x3, instead of a 2x2, matrix. In addition, the outcome of the t-statistic should be compared to the t_(n-3), instead of the t_(n-2), percentage point.


6. unikrige.sas

The SAS program for universal kriging, unikrige.sas, consists of 7 parts. For an explanation of the universal kriging equations, click here.

Part 1 is a simple data input step, reading the measurements z(x,y) at sampled locations. It contains statements to drop all data except those of the sampling date for which kriging is desired (t=1 in this example).

Part 2 creates a new data set (points), the set of spatial points s0 for which you want to predict the value of 'z' (point coordinates are (x0,y0), read into (xp,yp) in part 6).

Parts 3 and 4 define two subroutines (gam1 and gam2), which are called in the main program krige in part 5. gam1 calculates the semivariances between pairs of sampled locations (x,y) and constructs the bgamma matrix (big gamma, see universal kriging equations). gam2 calculates the semivariances between sampled sites and the locations (xp,yp) for which predictions are to be made. It also constructs the sgamma matrix (small gamma, see universal kriging equations), which is concatenated with xmatp. xmatp contains the value(s) of the explanatory variable(s) at the locations s0 (xp,yp) where predictions are to be made (in this program, xmatp contains the x0-coordinates, as a significant trend along the x-direction was found in the GLS output! Note: In ordinary kriging, no explanatory variables are used).

Part 5, the main program krige, starts with initializing all the matrices used in the subroutines gam1 and gam2, and subsequently calls these subroutines. In the lambda= statement the universal kriging equations are solved. Using the resulting lambda, z is predicted (zhatm) for the unsampled sites s0. The 'do' loop takes care that the resulting matrix entries zhatm[i], xp[i] and yp[i] are output as data columns zhat, x0, and y0 into the IML data set pred which was initialized in the create statement right after proc iml in part 3 of the program.
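The kriging step can be sketched in plain Python for a single prediction point. This is a minimal illustration of the standard universal kriging system in semivariogram form, not a translation of unikrige.sas. It assumes an exponential semivariogram with nugget and a trend consisting of an intercept and the x-coordinate. The names big and small only loosely correspond to bgamma and sgamma, and all data values are hypothetical.

```python
import math

def gamma_exp(r, c0, ce, alpha):
    """Exponential semivariance; zero at r = 0, nugget c0 for any r > 0."""
    return 0.0 if r == 0 else c0 + ce * (1.0 - math.exp(-alpha * r))

def solve(a, b):
    """Solve a*u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [rv - f * cv for rv, cv in zip(m[r], m[col])]
    u = [0.0] * n
    for i in reversed(range(n)):
        u[i] = (m[i][n] - sum(m[i][j] * u[j] for j in range(i + 1, n))) / m[i][i]
    return u

def krige_point(sites, z, s0, c0, ce, alpha):
    """Universal kriging prediction at s0 with trend terms 1 and x."""
    n = len(sites)
    trend = [[1.0, s[0]] for s in sites]          # intercept and x-coordinate
    trend0 = [1.0, s0[0]]
    p = len(trend0)
    big = [[0.0] * (n + p) for _ in range(n + p)]
    for i in range(n):
        for j in range(n):
            r = math.hypot(sites[i][0] - sites[j][0], sites[i][1] - sites[j][1])
            big[i][j] = gamma_exp(r, c0, ce, alpha)
        for k in range(p):
            big[i][n + k] = trend[i][k]
            big[n + k][i] = trend[i][k]
    small = [gamma_exp(math.hypot(s[0] - s0[0], s[1] - s0[1]), c0, ce, alpha)
             for s in sites] + trend0
    lam = solve(big, small)[:n]                   # drop Lagrange multipliers
    return sum(w * zi for w, zi in zip(lam, z))

# Hypothetical sampled sites and values.
sites = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (5.0, 3.0)]
z = [5.0, 3.0, 6.0, 2.0, 4.0]
at_sample = krige_point(sites, z, (10.0, 0.0), 2.0, 6.0, 0.1)
linear_z = [2.0 + 0.5 * s[0] for s in sites]
at_new = krige_point(sites, linear_z, (4.0, 7.0), 2.0, 6.0, 0.1)
```

Two properties make useful checks. Predicting at a sampled location returns the observed value, and data that are exactly linear in x are reproduced exactly at any location, because the weights are constrained to honor the trend terms.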

Part 6 reads data from the dataset wetland into arrays appearing in the IML program, and sets the values of program parameters Co, Ce, and alpha (adjust these values according to your model fit! I.e. use the best-fitting parameters from semivari.lst). The final statement (run krige) invokes the actual execution of the main program krige and subroutines gam1 and gam2.

Note that if both x- and y-coordinates are significant explanatory variables, y has to be added to xmat and xmatp:

read all var{x y} into xmat;
read all var{x0 y0} into xmatp;

Finally, in part 7 of the program, the predicted values stored in the IML data set pred are written to a space-delimited ASCII file prism.nov, which can be used in graphics software to draw maps.

Note that unikrige.sas requires considerable memory and computation time. We used the batch facility on our mainframe system to temporarily access more memory and increase available CPU. For a quick check whether the program actually works for you, reduce the points data set to just a few data points. E.g.:

do x0=0 to 48 by 48;
do y0=0 to 108 by 108;

creates a tiny data set with which the program should run without system protests!


References

Cressie, N. A. C. 1993. Statistics for spatial data. John Wiley and Sons, New York, USA.

Ettema, C. H., D. C. Coleman, G. Vellidis, R. Lowrance, and S. Rathbun. 1998. Spatio-temporal distributions of bacterivorous nematodes and soil resources in a restored riparian wetland. Ecology 79(8): 2721-2734.

