Manual SAS GEOESTADISTICA
Manual SAS GEOESTADISTICA
Manual SAS GEOESTADISTICA
This example helps you explore aspects of automated semivariogram fitting with PROC
VARIOGRAM. The test case is a spatial study of arsenic (As) concentration in drinking water.
Arsenic is a toxic pollutant that can occur in drinking water because of human activity or, typically,
due to natural release from the sediments in water aquifers. The World Health Organization has a
standard that allows As concentration up to a maximum of 10 g/lt (micrograms per liter) in drinking
water.
In general, natural release of arsenic into groundwater is very slow. Arsenic concentration in water
might exhibit no significant temporal fluctuations over a period of a few months. For this reason, it is
acceptable to perform a spatial study of arsenic with input from time-aggregated pollutant
concentrations. This example makes use of this assumption for its data set logAsData. The data set
consists of 138 simulated observations from wells across a square area of 500 km 500 km. The
variable logAs in the logAsData data set is the natural logarithm of arsenic concentration. Often, the
natural logarithm of arsenic concentration (logAs) is used as the random variable to facilitate the
analysis because its distribution tends to resemble the normal distribution.
The goal is to explore spatial continuity in the logAs observations. The following statements read
the logAs values from the logAsData data set:
data logAsData;
input East North logAs @@;
label logAs='log(As) Concentration';
datalines;
193.0 296.6 -0.68153 232.6 479.1 0.96279 268.7 312.5 -1.02908
43.6 4.9 0.65010 152.6 54.9 1.87076 449.1 395.8 0.95932
310.9 493.6 -1.66208 287.8 164.9 -0.01779 330.0 8.0 2.06837
225.7 241.7 0.15899 452.3 83.4 -1.21217 156.5 462.5 -0.89031
11.5 84.4 -0.24496 144.4 335.7 0.11950 149.0 431.8 -0.57251
234.3 123.2 -1.33642 37.8 197.8 -0.27624 183.1 173.9 -2.14558
149.3 426.7 -1.06506 434.4 67.5 -1.04657 439.6 237.0 -0.09074
36.4 175.2 -1.21211 370.6 244.0 3.28091 452.0 96.5 -0.77081
247.0 86.8 0.04720 413.6 373.2 1.78235 253.5 291.7 0.56132
129.7 111.9 1.34000 352.7 42.1 0.23621 279.3 82.7 2.12350
382.6 290.7 0.86756 188.2 222.8 -1.23308 382.8 154.5 -0.94094
304.4 309.2 -1.95158 337.5 387.2 -1.31294 490.7 189.8 0.40206
159.0 100.1 -0.22272 245.5 329.2 -0.26082 372.1 379.5 -1.89078
417.8 84.1 -1.25176 173.9 407.6 -0.24240 121.5 107.7 1.54509
453.5 313.6 0.65895 143.5 346.7 -0.87196 157.4 125.5 -1.96165
371.8 353.2 -0.59464 358.9 338.2 -1.07133 8.6 437.8 1.44203
395.9 394.2 -0.24144 149.5 58.9 1.17459 453.5 420.6 -0.63951
182.3 85.0 1.00005 21.0 290.1 0.31016 11.1 352.2 -0.88418
131.2 238.4 -0.57184 104.9 6.3 1.12054 247.3 256.0 0.14019
428.4 383.7 0.92448 327.8 481.1 -2.72543 199.2 92.8 -0.05717
453.9 230.1 0.16571 205.0 250.6 0.07581 459.5 271.6 0.93700
229.5 262.8 1.83590 370.4 228.6 2.96611 330.2 281.9 1.79723
354.8 388.3 -3.18262 406.2 222.7 2.41594 254.4 393.1 2.03221
96.7 85.2 -0.47156 407.2 256.8 0.66747 498.5 273.8 1.03041
417.2 471.4 -1.42766 368.8 424.3 -0.70506 303.0 59.1 1.43070
403.1 264.1 1.64554 21.2 360.8 0.67094 148.2 78.1 2.15323
305.5 310.7 -1.47985 228.5 180.3 -0.68386 161.1 143.3 1.07901
70.5 155.1 0.54652 363.1 282.6 -0.43051 86.0 472.5 -1.18855
175.9 105.3 -2.08112 96.8 426.3 1.56592 475.1 453.1 -1.53776
125.7 485.4 1.40054 277.9 201.6 -0.54565 406.2 125.0 -1.38657
60.0 275.5 -0.59966 431.3 494.6 -0.36860 399.9 399.0 -0.77265
28.8 311.1 0.91693 166.1 348.2 -0.49056 266.6 83.5 0.67277
54.7 356.3 0.49596 433.5 460.3 -1.61309 201.7 167.6 -1.40678
158.1 203.6 -1.32499 67.6 230.4 1.14672 81.9 250.0 0.63378
372.0 50.7 0.72445 26.4 264.6 1.00862 300.1 91.7 -0.74089
303.0 447.4 1.74589 108.4 386.2 1.12847 55.6 191.7 0.95175
36.3 273.2 1.78880 94.5 298.3 -2.43320 366.1 187.3 -0.80526
130.7 389.2 -0.31513 37.2 324.2 0.24489 295.5 211.8 0.41899
58.6 206.2 0.18495 346.3 142.8 -0.92038 484.2 215.9 0.08012
451.4 415.7 0.02773 58.9 86.5 0.17652 212.6 363.9 0.17215
378.7 407.6 0.51516 265.9 305.0 -0.30718 123.2 314.8 -0.90591
26.9 471.7 1.70285 16.5 7.1 0.51736 255.1 472.6 2.02381
111.5 148.4 -0.09658 440.4 375.0 1.23285 406.4 19.5 1.01181
321.2 65.8 -0.02095 466.4 357.1 -0.49272 2.0 484.6 0.50994
200.9 205.1 0.43543 30.3 337.0 1.60882 297.0 12.7 1.79824
158.2 450.7 0.05295 122.8 105.3 1.53936 417.8 329.7 -2.08124
;
First you want to inspect the logAs data for surface trends and the pairwise distribution. You run the
VARIOGRAM procedure with the NOVARIOGRAM option in the COMPUTE statement. You also
request the PLOTS=PAIRS(MID) option, which prompts the pair distance plot to display the actual
distance between pairs, rather than the lag number itself, in the midpoint of the lags. You use the
following statements:
The observations scatter plot in Output 106.1.1 shows a rather uniform distribution of the locations
in the study domain. Reasonably, neighboring values of logAs seem to exhibit some correlation.
There seems to be no definite sign of an overall surface trend in the logAs values. You can consider
that the observations are trend-free, and proceed with estimation of the empirical semivariance.
Based on the discussion in section Preliminary Spatial Data Analysis, the pair distance plot in
Output 106.1.2 suggests that you could consider pairs that are anywhere around up to half the
maximum pairwise distance of about 700 km.
The first few lag classes of the logAs empirical semivariance table are shown in Output 106.1.3.
Output 106.1.3: Partial Output of the Empirical Semivariogram Table for logAs Data
Semivariogram Model Fitting of Log-Arsenic Concentration
The VARIOGRAM Procedure
Dependent Variable: logAs
Empirical Semivariogram
Output 106.1.3 and Output 106.1.4 indicate that the logarithm of arsenic spatial correlation starts
with a small nugget effect around 0.11 and rises to a sill value that is most likely between 1.4 and
1.8. The rise could be of exponential type, although the smooth increase of semivariance close to
the origin could also suggest Gaussian behavior. You suspect that a Matérn form might also work,
since its smoothness parameter can regulate the form to exhibit an intermediate behavior between
the exponential and Gaussian forms.
The combinations include repetitions. For example, you specified the GAU form; hence the GAU-
GAU form is tested, too. The model combinations also include permutations. For example, you
specified the GAU and the EXP forms; hence the GAU-EXP and EXP-GAU models are fitted
separately. According to the section Nested Models, it might seem that the same model is fitted
twice. However, in each of these two cases, each structure starts the fitting process with different
parameter initial values. This can lead GAU-EXP to a different fit than EXP-GAU leads to, as seen
in the fitting summary table in Output 106.1.6. The table shows all the model combinations that
were tested and fitted. By default, the ordering is based on the weighted sum of squares error
criterion, and you can see that the lowest values in the Weighted SSE column are in top slots of the
list.
Another remark in Output 106.1.6 is that despite submitting 12 model combinations for fitting, the
table shows only 10. You can easily see that the combinations MAT-GAU and MAT-MAT are not
among the listed models in the fit summary. This results from the behavior of the VARIOGRAM
procedure in the following situation: A parameter optimization takes place during the fitting process.
In the present case the optimizer keeps increasing the Matérn smoothness parameter in the MAT-
GAU model. At the limit of an infinite parameter, the Matérn form becomes the Gaussian form. For
that reason, when the parameter is driven towards very high values, PROC VARIOGRAM
automatically replaces the Matérn form with the Gaussian. This switch converts the MAT-GAU
model into a GAU-GAU model. However, a GAU-GAU model already exists among the specified
forms; consequently, the duplicate GAU-GAU model is skipped, and the fitted model list is reduced
by one model. A similar explanation justifies the omission of the MAT-MAT model from the fit
summary table.
In our example, the nested Gaussian-Gaussian model is the fitting selection of the procedure based
on the default ranking criteria. Output 106.1.7 displays additional information about the selected
model. In particular, you see the table with general information about the Gaussian-Gaussian
model, the initial values used for its parameters, and information about the optimization process for
the fitting.
Model Information
Parameter Initial
Value
Nugget 0.0903
GauScale1 0.6709
GauRange1 100.0
GauScale2 0.6709
GauRange2 50.023
0
Optimization Information
Optimization Technique Dual Quasi-
Newton
Parameters in Optimization5
Lower Boundaries 5
Upper Boundaries 0
Optimization Information
Starting Values From PROC
The estimated parameter values of the selected Gaussian-Gaussian model are shown in Output
106.1.8.
By default, when you specify more than one model to fit, PROC VARIOGRAM produces a fit plot
that compares the first five classes of the successfully fit candidate models. The model that is
selected according to the specified fitting criteria is shown with a thicker line in the plot.
You can modify the number of displayed equivalence classes with the NCLASSES= suboption of
the PLOTS=FIT option. When you have such comparison plots, PROC VARIOGRAM displays the
representative model from each class of equivalence.
The default fit plot for the current model comparison is shown in Output 106.1.9. The legend informs
you there is one more model in the first class of equivalence, as the fitting summary table indicated
earlier in Output 106.1.6.
Output 106.1.9: Fitted Theoretical and Empirical logAs Concentration Semivariograms for the
Specified Models
In the present example, all fitted models in the first five classes have very similar semivariograms.
The selected Gaussian-Gaussian model seems to have a relatively larger range than the rest of the
displayed models, but you can expect any of these models to exhibit a near-identical behavior in
terms of spatial correlation. As a result, all models in the displayed classes are likely to lead to very
similar output, if you proceed to use any of them for spatial prediction.
In that sense, semivariogram fitting is a partially subjective process, for which there might not exist
only one single correct answer to solve your problem. In the context of the example, on one hand
you might conclude that the selected Gaussian-Gaussian model is exactly sufficient to describe
spatial correlation in the arsenic study. On the other hand, the similar performance of all models
might prompt you to choose instead a more simple non-nested model for prediction like the Matérn
or the Gaussian model.
Regardless of whether you might opt to sacrifice the statistically best fit (depending on your
selected criteria) to simplicity, eventually you are the one to decide which approach serves your
study optimally. The model fitting features of PROC VARIOGRAM offer you significant assistance
so that you can assess your options efficiently.
Example 53.1 Spatial Prediction of Pollutant Concentration
The example in the section Aspects of Semivariogram Model Fitting in Chapter 106: The
VARIOGRAM Procedure, investigates fitting of a theoretical model to describe spatial correlation in
a study of 138 simulated arsenic logarithm concentration (logAs) observations. These observations
form the logAsData data set, which is treated as actual data for illustration in the examples.
In this example, you use the logAsData data set and the semivariogram analysis results to predict
the logAs variable values across space in a specified square region of size 500 km 500 km. Your
goal is to answer scientific questions in your analysis by means of your prediction results. This
application highlights the impact of the correlation model choice on predictions. The example in
section Investigating the Effect of Model Specification on Spatial Prediction examines additional
aspects of this impact.
The World Health Organization (WHO) standard for maximum arsenic concentration in drinking
water is 10 g/lt. Assume that you want to answer the following question: In what percentage of the
study area does the Arsenic concentration exceed the WHO regulatory standard?
First, you read the logAsData data set with the following DATA step:
data logAsData;
input East North logAs @@;
label logAs='log(As) Concentration';
datalines;
193.0 296.6 -0.68153 232.6 479.1 0.96279 268.7 312.5 -1.02908
43.6 4.9 0.65010 152.6 54.9 1.87076 449.1 395.8 0.95932
310.9 493.6 -1.66208 287.8 164.9 -0.01779 330.0 8.0 2.06837
225.7 241.7 0.15899 452.3 83.4 -1.21217 156.5 462.5 -0.89031
11.5 84.4 -0.24496 144.4 335.7 0.11950 149.0 431.8 -0.57251
234.3 123.2 -1.33642 37.8 197.8 -0.27624 183.1 173.9 -2.14558
149.3 426.7 -1.06506 434.4 67.5 -1.04657 439.6 237.0 -0.09074
36.4 175.2 -1.21211 370.6 244.0 3.28091 452.0 96.5 -0.77081
247.0 86.8 0.04720 413.6 373.2 1.78235 253.5 291.7 0.56132
129.7 111.9 1.34000 352.7 42.1 0.23621 279.3 82.7 2.12350
382.6 290.7 0.86756 188.2 222.8 -1.23308 382.8 154.5 -0.94094
304.4 309.2 -1.95158 337.5 387.2 -1.31294 490.7 189.8 0.40206
159.0 100.1 -0.22272 245.5 329.2 -0.26082 372.1 379.5 -1.89078
417.8 84.1 -1.25176 173.9 407.6 -0.24240 121.5 107.7 1.54509
453.5 313.6 0.65895 143.5 346.7 -0.87196 157.4 125.5 -1.96165
371.8 353.2 -0.59464 358.9 338.2 -1.07133 8.6 437.8 1.44203
395.9 394.2 -0.24144 149.5 58.9 1.17459 453.5 420.6 -0.63951
182.3 85.0 1.00005 21.0 290.1 0.31016 11.1 352.2 -0.88418
131.2 238.4 -0.57184 104.9 6.3 1.12054 247.3 256.0 0.14019
428.4 383.7 0.92448 327.8 481.1 -2.72543 199.2 92.8 -0.05717
453.9 230.1 0.16571 205.0 250.6 0.07581 459.5 271.6 0.93700
229.5 262.8 1.83590 370.4 228.6 2.96611 330.2 281.9 1.79723
354.8 388.3 -3.18262 406.2 222.7 2.41594 254.4 393.1 2.03221
96.7 85.2 -0.47156 407.2 256.8 0.66747 498.5 273.8 1.03041
417.2 471.4 -1.42766 368.8 424.3 -0.70506 303.0 59.1 1.43070
403.1 264.1 1.64554 21.2 360.8 0.67094 148.2 78.1 2.15323
305.5 310.7 -1.47985 228.5 180.3 -0.68386 161.1 143.3 1.07901
70.5 155.1 0.54652 363.1 282.6 -0.43051 86.0 472.5 -1.18855
175.9 105.3 -2.08112 96.8 426.3 1.56592 475.1 453.1 -1.53776
125.7 485.4 1.40054 277.9 201.6 -0.54565 406.2 125.0 -1.38657
60.0 275.5 -0.59966 431.3 494.6 -0.36860 399.9 399.0 -0.77265
28.8 311.1 0.91693 166.1 348.2 -0.49056 266.6 83.5 0.67277
54.7 356.3 0.49596 433.5 460.3 -1.61309 201.7 167.6 -1.40678
158.1 203.6 -1.32499 67.6 230.4 1.14672 81.9 250.0 0.63378
372.0 50.7 0.72445 26.4 264.6 1.00862 300.1 91.7 -0.74089
303.0 447.4 1.74589 108.4 386.2 1.12847 55.6 191.7 0.95175
36.3 273.2 1.78880 94.5 298.3 -2.43320 366.1 187.3 -0.80526
130.7 389.2 -0.31513 37.2 324.2 0.24489 295.5 211.8 0.41899
58.6 206.2 0.18495 346.3 142.8 -0.92038 484.2 215.9 0.08012
451.4 415.7 0.02773 58.9 86.5 0.17652 212.6 363.9 0.17215
378.7 407.6 0.51516 265.9 305.0 -0.30718 123.2 314.8 -0.90591
26.9 471.7 1.70285 16.5 7.1 0.51736 255.1 472.6 2.02381
111.5 148.4 -0.09658 440.4 375.0 1.23285 406.4 19.5 1.01181
321.2 65.8 -0.02095 466.4 357.1 -0.49272 2.0 484.6 0.50994
200.9 205.1 0.43543 30.3 337.0 1.60882 297.0 12.7 1.79824
158.2 450.7 0.05295 122.8 105.3 1.53936 417.8 329.7 -2.08124
;
For prediction of the logAs values in the specified area, assume a rectangular grid of nodes with an
equal spacing of 5 km between neighboring nodes in the north and east directions. This produces a
total of prediction locations.
In the KRIGE2D procedure you specify the name of the item store you want to use for prediction
input in the IN= option of the RESTORE statement. You request use of the selected model for
prediction by specifying the STORESELECT option in the MODEL statement.
The INFO option of the RESTORE statement produces a table with information about the selected
fitted model in the item store. To review all models in the input item store, specify the two INFO
option suboptions. In particular, specify the DET suboption to request details about all additional
fitted models that are included in the item store and the ONLY suboption to suppress prediction and
produce only the tables about the item store, as shown in the following statements:
Output 53.1.1: PROC KRIGE2D and Input Item Store General Information
Spatial Prediction of Log-Arsenic Concentration
The second table in Output 53.1.2 itemizes the variables in the item store and displays the sample
mean and standard deviation of their data set of origin. Hence, the values shown in Output 53.1.2
refer to the observations in the logAsData data set.
The table in Output 53.1.3 presents all the correlation models fitted to the arsenic logarithm logAs
empirical semivariance that are saved in the SemivAsStore item store.
Output 53.1.3: Angle and Models Information in the Input Item Store
Item Store Models
For logAs
ClassModel
1Gau-Gau
Gau-Mat
2Exp-Gau
3Exp-Mat
4Mat
5Gau
6Exp
Exp-Exp
Item Store Models
For logAs
ClassModel
Mat-Exp
Gau-Exp
According to Output 53.1.3, the Gaussian-Gaussian model is the selected model for the empirical
semivariance fit based on the specific weighted least squares fit and ranking criteria. In the section
Aspects of Semivariogram Model Fitting in Chapter 106: The VARIOGRAM Procedure, it is noted
that all fitted models in the first five equivalence classes produce very similar semivariograms, and
this is likely to lead to similar results in prediction analysis. For comparison purposes, you choose to
examine the selected model, in addition to the exponential model in the SemivAsStore item store.
As shown in Output 53.1.3, the exponential model is one of the least well-fit models based on the
criteria used for the specific fit. You are interested in comparing the predictions from each one of
these two models, and you examine their impact on your analysis.
The default item store model selection is the model on top of the list in Output 53.1.3. Hence, you
specify the STORESELECT option in the MODEL statement without any suboptions, and it invokes
the Gaussian-Gaussian model from the SemivAsStore item store. You assign the label SELMODEL
to the corresponding MODEL statement.
You also specify a second MODEL statement with the label EXPMODEL to request prediction
based on the exponential correlation form. In this case you specify the STORESELECT(MODEL=)
option in the MODEL statement to request the desired form.
You omit the INFO option from the RESTORE statement. You specify the PRED and the SEMIVAR
options in the PLOTS option of the PROC KRIGE2D statement to produce plots of the predicted
values and the semivariance model, respectively, for each MODEL statement. You request that the
prediction output be saved in the Pred output data set.
When you run these statements, in addition to the input item store information table, PROC
KRIGE2D also produces the number of observations table and general kriging process information,
as shown in Output 53.1.4.
Kriging Information
Prediction Grid Points 10201
Type of Analysis Globa
l
PROC KRIGE2D first uses the Gaussian-Gaussian model. The table in Output 53.1.5 shows the
saved parameter values of the fitted Gaussian-Gaussian model in the SemivAsStore item store.
PROC KRIGE2D uses these parameters for the prediction based on the selected model.
The semivariogram of the Gaussian-Gaussian model with the parameters shown in Output 53.1.5 is
depicted in Output 53.1.6.
The prediction plot for the exponential model is shown in Output 53.1.10. Prediction values and
spatial patterns are similar overall to those of the Gaussian-Gaussian case. Clearly, although both
models predict the same basic characteristics for the arsenic logarithm concentration distribution,
the exponential model suggests a more limited spatial variability in closely neighboring locations.
The lack of a nugget effect in the exponential model justifies this behavior. Also, the exponential
model predictions seem less inclined to deviate farther away from the near-zero mean than the
Gaussian-Gaussian model predictions.The prediction error reaches about the same upper values
for both models, though its low values are slightly smaller in the exponential model.
data AsOverLimit;
set Pred;
OverLimit = (exp(estimate) > 10) * 100;
run;
The second step uses the MEANS procedure to express the selected nodes population, where the
WHO arsenic concentration limit violation occurs, as a percentage of the entire domain area. You
study the results of each correlation model separately by specifying the BY statement in the PROC
MEANS. The BY variable is the Label variable in the AsOverLimit and Pred data sets. You need to
sort the AsOverLimit data prior to using PROC MEANS. You run the following statements:
proc sort data=AsOverLimit;
by Label;
run;
proc means data=AsOverLimit mean;
var OverLimit;
by label;
label Overlimit="Percent above WHO threshold";
run;
The Gaussian-Gaussian model prediction produces the result in Output 53.1.11. The analysis
suggests a minimal occurrence of excessive arsenic concentration in drinking water in about 0.43%
of the study region.
The exponential model predicts that the WHO arsenic concentration threshold is exceeded in about
0.27% of the domain, as shown in Output 53.1.12. Although this is still a minimal occurrence of the
threshold violation across the region, the exponential model estimates the impact to be at about two
thirds of the Gaussian-Gaussian model percentage.
The results in Output 53.1.11 and Output 53.1.12 suggest that it might not be possible to provide a
unique answer about the area percentage that is affected by increased arsenic concentration. You
chose to examine two different correlation models whose performance is relatively similar, and they
provide impact estimates that differ by about 37%.
You might conclude that the answer to the initial question about the percentage value lies in the
neighborhood of the results given by the two correlation models. Further analysis with more models
is necessary to validate this assumption. It is important to note that apart from the continuity model
choice, additional factors contribute to this investigation. Such factors could be the use of local
instead of global kriging, or even going back to the empirical semivariogram computation stage and
repeating the analysis for different possible spatial continuity empirical estimates. A sensible
approach to tackle this analysis would be to investigate the range of the impact suggested by all
candidate correlation models and to proceed by defining the best and worst case scenarios for the
size of the affected area.
Eventually, when it comes to using your findings, it is important to account for the subjective nature
of stochastic analysis and multiple possible answers to your questions. In that sense, some
scientific questions might be more sensible than others to interpret your results correctly. For
instance, you might want to investigate only whether the adversely affected domain percentage is
below 1%, rather than attempting to provide a specific value for it. Then, you might consider the
preceding findings sufficient, despite any fluctuations in the estimated percentage. In a different
scenario, the areas with high pollutant concentration could be populated. Hence, any local health
standard violation is probably unacceptable, and it can be crucial that you provide solid and more
detailed assessment in that case.
The section Risk Analysis with Simulation in Chapter 89: The SIM2D Procedure, investigates a
different aspect of this study and offers additional perspective about spatial analysis.