Manual SAS GEOESTADISTICA

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 21

Example 106.

1 Aspects of Semivariogram Model Fitting

This example helps you explore aspects of automated semivariogram fitting with PROC
VARIOGRAM. The test case is a spatial study of arsenic (As) concentration in drinking water.

Arsenic is a toxic pollutant that can occur in drinking water because of human activity or, typically,
due to natural release from the sediments in water aquifers. The World Health Organization has a
standard that allows As concentration up to a maximum of 10 g/lt (micrograms per liter) in drinking
water.

In general, natural release of arsenic into groundwater is very slow. Arsenic concentration in water
might exhibit no significant temporal fluctuations over a period of a few months. For this reason, it is
acceptable to perform a spatial study of arsenic with input from time-aggregated pollutant
concentrations. This example makes use of this assumption for its data set logAsData. The data set
consists of 138 simulated observations from wells across a square area of 500 km 500 km. The
variable logAs in the logAsData data set is the natural logarithm of arsenic concentration. Often, the
natural logarithm of arsenic concentration (logAs) is used as the random variable to facilitate the
analysis because its distribution tends to resemble the normal distribution.

The goal is to explore spatial continuity in the logAs observations. The following statements read
the logAs values from the logAsData data set:

title 'Semivariogram Model Fitting of Log-Arsenic Concentration';

data logAsData;
input East North logAs @@;
label logAs='log(As) Concentration';
datalines;
193.0 296.6 -0.68153 232.6 479.1 0.96279 268.7 312.5 -1.02908
43.6 4.9 0.65010 152.6 54.9 1.87076 449.1 395.8 0.95932
310.9 493.6 -1.66208 287.8 164.9 -0.01779 330.0 8.0 2.06837
225.7 241.7 0.15899 452.3 83.4 -1.21217 156.5 462.5 -0.89031
11.5 84.4 -0.24496 144.4 335.7 0.11950 149.0 431.8 -0.57251
234.3 123.2 -1.33642 37.8 197.8 -0.27624 183.1 173.9 -2.14558
149.3 426.7 -1.06506 434.4 67.5 -1.04657 439.6 237.0 -0.09074
36.4 175.2 -1.21211 370.6 244.0 3.28091 452.0 96.5 -0.77081
247.0 86.8 0.04720 413.6 373.2 1.78235 253.5 291.7 0.56132
129.7 111.9 1.34000 352.7 42.1 0.23621 279.3 82.7 2.12350
382.6 290.7 0.86756 188.2 222.8 -1.23308 382.8 154.5 -0.94094
304.4 309.2 -1.95158 337.5 387.2 -1.31294 490.7 189.8 0.40206
159.0 100.1 -0.22272 245.5 329.2 -0.26082 372.1 379.5 -1.89078
417.8 84.1 -1.25176 173.9 407.6 -0.24240 121.5 107.7 1.54509
453.5 313.6 0.65895 143.5 346.7 -0.87196 157.4 125.5 -1.96165
371.8 353.2 -0.59464 358.9 338.2 -1.07133 8.6 437.8 1.44203
395.9 394.2 -0.24144 149.5 58.9 1.17459 453.5 420.6 -0.63951
182.3 85.0 1.00005 21.0 290.1 0.31016 11.1 352.2 -0.88418
131.2 238.4 -0.57184 104.9 6.3 1.12054 247.3 256.0 0.14019
428.4 383.7 0.92448 327.8 481.1 -2.72543 199.2 92.8 -0.05717
453.9 230.1 0.16571 205.0 250.6 0.07581 459.5 271.6 0.93700
229.5 262.8 1.83590 370.4 228.6 2.96611 330.2 281.9 1.79723
354.8 388.3 -3.18262 406.2 222.7 2.41594 254.4 393.1 2.03221
96.7 85.2 -0.47156 407.2 256.8 0.66747 498.5 273.8 1.03041
417.2 471.4 -1.42766 368.8 424.3 -0.70506 303.0 59.1 1.43070
403.1 264.1 1.64554 21.2 360.8 0.67094 148.2 78.1 2.15323
305.5 310.7 -1.47985 228.5 180.3 -0.68386 161.1 143.3 1.07901
70.5 155.1 0.54652 363.1 282.6 -0.43051 86.0 472.5 -1.18855
175.9 105.3 -2.08112 96.8 426.3 1.56592 475.1 453.1 -1.53776
125.7 485.4 1.40054 277.9 201.6 -0.54565 406.2 125.0 -1.38657
60.0 275.5 -0.59966 431.3 494.6 -0.36860 399.9 399.0 -0.77265
28.8 311.1 0.91693 166.1 348.2 -0.49056 266.6 83.5 0.67277
54.7 356.3 0.49596 433.5 460.3 -1.61309 201.7 167.6 -1.40678
158.1 203.6 -1.32499 67.6 230.4 1.14672 81.9 250.0 0.63378
372.0 50.7 0.72445 26.4 264.6 1.00862 300.1 91.7 -0.74089
303.0 447.4 1.74589 108.4 386.2 1.12847 55.6 191.7 0.95175
36.3 273.2 1.78880 94.5 298.3 -2.43320 366.1 187.3 -0.80526
130.7 389.2 -0.31513 37.2 324.2 0.24489 295.5 211.8 0.41899
58.6 206.2 0.18495 346.3 142.8 -0.92038 484.2 215.9 0.08012
451.4 415.7 0.02773 58.9 86.5 0.17652 212.6 363.9 0.17215
378.7 407.6 0.51516 265.9 305.0 -0.30718 123.2 314.8 -0.90591
26.9 471.7 1.70285 16.5 7.1 0.51736 255.1 472.6 2.02381
111.5 148.4 -0.09658 440.4 375.0 1.23285 406.4 19.5 1.01181
321.2 65.8 -0.02095 466.4 357.1 -0.49272 2.0 484.6 0.50994
200.9 205.1 0.43543 30.3 337.0 1.60882 297.0 12.7 1.79824
158.2 450.7 0.05295 122.8 105.3 1.53936 417.8 329.7 -2.08124
;

First you want to inspect the logAs data for surface trends and the pairwise distribution. You run the
VARIOGRAM procedure with the NOVARIOGRAM option in the COMPUTE statement. You also
request the PLOTS=PAIRS(MID) option, which prompts the pair distance plot to display the actual
distance between pairs, rather than the lag number itself, in the midpoint of the lags. You use the
following statements:

ods graphics on;


proc variogram data=logAsData plots=pairs(mid);
compute novariogram nhc=50;
coord xc=East yc=North;
var logAs;
run;

The observations scatter plot in Output 106.1.1 shows a rather uniform distribution of the locations
in the study domain. Reasonably, neighboring values of logAs seem to exhibit some correlation.
There seems to be no definite sign of an overall surface trend in the logAs values. You can consider
that the observations are trend-free, and proceed with estimation of the empirical semivariance.

Output 106.1.1: logAs Observation Data Scatter Plot


The observed logAs values go as high as 3.28091, which corresponds to a concentration of 26.6
g/lt. In fact, only three observations exceed the health standard of 10 g/lt (or about 2.3 in the log
scale), and they are situated in relatively neighboring locations to the east of the domain center.

Based on the discussion in section Preliminary Spatial Data Analysis, the pair distance plot in
Output 106.1.2 suggests that you could consider pairs that are anywhere around up to half the
maximum pairwise distance of about 700 km.

Output 106.1.2: Distribution of Pairwise Distances for logAs Data


After some experimentation with values for the LAGDISTANCE= and MAXLAGS= options, you
actually find that a lag distance of 5 km over 40 lags can provide a clear representation of the logAs
semivariance. With respect to Output 106.1.2, this finding indicates that in the current example it is
sufficient to consider pairs separated by a distance of up to 200 km. You run the following
statements to obtain the empirical semivariogram:

proc variogram data=logAsData plots(only)=semivar;


compute lagd=5 maxlag=40;
coord xc=East yc=North;
var logAs;
run;

The first few lag classes of the logAs empirical semivariance table are shown in Output 106.1.3.

Output 106.1.3: Partial Output of the Empirical Semivariogram Table for logAs Data
Semivariogram Model Fitting of Log-Arsenic Concentration
 
The VARIOGRAM Procedure
Dependent Variable: logAs
 
Empirical Semivariogram

Lag Pair Average Semivarianc


Class Coun Distanc e
t e
0 1 1.9 0.111
1 5 4.9 0.145
2 6 9.7 0.286
3 11 14.6 0.545
4 27 20.0 0.900

Output 106.1.3 and Output 106.1.4 indicate that the logarithm of arsenic spatial correlation starts
with a small nugget effect around 0.11 and rises to a sill value that is most likely between 1.4 and
1.8. The rise could be of exponential type, although the smooth increase of semivariance close to
the origin could also suggest Gaussian behavior. You suspect that a Matérn form might also work,
since its smoothness parameter can regulate the form to exhibit an intermediate behavior between
the exponential and Gaussian forms.

Output 106.1.4: Empirical Semivariogram for logAs Data


You can investigate all of the preceding clues with the model fitting features of PROC
VARIOGRAM. The simplest way to fit a model is to specify its form in the MODEL statement. In this
case, you have the added complexity of having more than one possible candidate. For this reason,
you use the FORM=AUTO option that picks the best fit out of a list of candidates. Within this option
you specify the MLIST= suboption to use the exponential, Gaussian, and Matérn forms. You also
specify the NEST= suboption to request fitting of a model with up to two nested structures.
Eventually, you specify the PLOTS=FIT option to produce a plot of the fitted models. The STORE
statement saves the fitting output into an item store you name SemivAsStore for future use. You
apply these specifications with the following statements:

proc variogram data=logAsData plots(only)=fit;


store out=SemivAsStore / label='LogAs Concentration Models';
compute lagd=5 maxlag=40;
coord xc=East yc=North;
model form=auto(mlist=(exp,gau,mat) nest=1 to 2);
var logAs;
run;

ods graphics off;


The table of general information about fitting is shown in Output 106.1.5. The table lets you know
that 12 model combinations are to be tested for weighted least squares fitting, based on the three
forms that you specified.

Output 106.1.5: Semivariogram Model Fitting General Information


Semivariogram Model Fitting of Log-Arsenic Concentration

The VARIOGRAM Procedure


Dependent Variable: logAs
Angle: Omnidirectional
Semivariogram Model Fitting
Model Selection from 12 form
combinations
Output Item StoreWORK.SEMIVASSTORE
Item Store Label LogAs Concentration Models

The combinations include repetitions. For example, you specified the GAU form; hence the GAU-
GAU form is tested, too. The model combinations also include permutations. For example, you
specified the GAU and the EXP forms; hence the GAU-EXP and EXP-GAU models are fitted
separately. According to the section Nested Models, it might seem that the same model is fitted
twice. However, in each of these two cases, each structure starts the fitting process with different
parameter initial values. This can lead GAU-EXP to a different fit than EXP-GAU leads to, as seen
in the fitting summary table in Output 106.1.6. The table shows all the model combinations that
were tested and fitted. By default, the ordering is based on the weighted sum of squares error
criterion, and you can see that the lowest values in the Weighted SSE column are in top slots of the
list.

Output 106.1.6: Semivariogram Model Fitting Summary


Fit Summary
ClassModel Weighte AIC
d
SSE
1Gau- 25.42435 -9.59246
Gau
 Gau-Mat 25.42482 -7.59169
2Exp-Gau 25.97835 -8.70865
3Exp-Mat 26.36846 -6.09754
4Mat 26.37519 -
10.08708
5Gau 26.78629 -
11.45296
6Exp 28.01200 -9.61851
 Exp-Exp 28.01200 -5.61850
 Mat-Exp 28.01200 -3.61850
 Gau-Exp 28.01200 -5.61850
Note the leftmost Class column in Output 106.1.6. As explained in detail in section Classes of
Equivalence, when you fit more than one model, all fitted models that compute the same
semivariance are placed in the same class of equivalence. For example, in this fitting example the
top ranked GAU-GAU and GAU-MAT nested models produce indistinguishable semivariograms; for
that reason they are both placed in the same class 1 of equivalence. The same occurs with the
EXP, GAU-EXP, EXP-EXP, and MAT-EXP models in the bottom of the table. By default, PROC
VARIOGRAM uses the AIC as a secondary classification criterion; hence models in each
equivalence class are already ordered based on their AIC values.

Another remark in Output 106.1.6 is that despite submitting 12 model combinations for fitting, the
table shows only 10. You can easily see that the combinations MAT-GAU and MAT-MAT are not
among the listed models in the fit summary. This results from the behavior of the VARIOGRAM
procedure in the following situation: A parameter optimization takes place during the fitting process.
In the present case the optimizer keeps increasing the Matérn smoothness parameter in the MAT-
GAU model. At the limit of an infinite parameter, the Matérn form becomes the Gaussian form. For
that reason, when the parameter is driven towards very high values, PROC VARIOGRAM
automatically replaces the Matérn form with the Gaussian. This switch converts the MAT-GAU
model into a GAU-GAU model. However, a GAU-GAU model already exists among the specified
forms; consequently, the duplicate GAU-GAU model is skipped, and the fitted model list is reduced
by one model. A similar explanation justifies the omission of the MAT-MAT model from the fit
summary table.

In our example, the nested Gaussian-Gaussian model is the fitting selection of the procedure based
on the default ranking criteria. Output 106.1.7 displays additional information about the selected
model. In particular, you see the table with general information about the Gaussian-Gaussian
model, the initial values used for its parameters, and information about the optimization process for
the fitting.

Output 106.1.7: Fitting and Optimization Information for Gaussian-Gaussian Model


Semivariogram Model Fitting
Name Gaussian-Gaussian
Label Gau-Gau

Model Information
Parameter Initial
Value
Nugget 0.0903
GauScale1 0.6709
GauRange1 100.0
GauScale2 0.6709
GauRange2 50.023
0

Optimization Information
Optimization Technique Dual Quasi-
Newton
Parameters in Optimization5
Lower Boundaries 5
Upper Boundaries 0
Optimization Information
Starting Values From PROC

The estimated parameter values of the selected Gaussian-Gaussian model are shown in Output
106.1.8.

Output 106.1.8: Parameter Estimates of the Fitting Selected Model


Parameter Estimates
Parameter Estimat Approx D t Valu Appro
e Std F e x
Error Pr > |t|
Nugget 0.08308 0.05097 36 1.63 0.1118
GauScale1 0.3277 0.2077 36 1.58 0.1234
GauRange1 62.3127 19.8488 36 3.14 0.0034
GauScale2 1.2615 0.2070 36 6.10 <.0001
GauRange2 21.4596 3.2722 36 6.56 <.0001

By default, when you specify more than one model to fit, PROC VARIOGRAM produces a fit plot
that compares the first five classes of the successfully fit candidate models. The model that is
selected according to the specified fitting criteria is shown with a thicker line in the plot.

You can modify the number of displayed equivalence classes with the NCLASSES= suboption of
the PLOTS=FIT option. When you have such comparison plots, PROC VARIOGRAM displays the
representative model from each class of equivalence.

The default fit plot for the current model comparison is shown in Output 106.1.9. The legend informs
you there is one more model in the first class of equivalence, as the fitting summary table indicated
earlier in Output 106.1.6.

Output 106.1.9: Fitted Theoretical and Empirical logAs Concentration Semivariograms for the
Specified Models
In the present example, all fitted models in the first five classes have very similar semivariograms.
The selected Gaussian-Gaussian model seems to have a relatively larger range than the rest of the
displayed models, but you can expect any of these models to exhibit a near-identical behavior in
terms of spatial correlation. As a result, all models in the displayed classes are likely to lead to very
similar output, if you proceed to use any of them for spatial prediction.

In that sense, semivariogram fitting is a partially subjective process, for which there might not exist
only one single correct answer to solve your problem. In the context of the example, on one hand
you might conclude that the selected Gaussian-Gaussian model is exactly sufficient to describe
spatial correlation in the arsenic study. On the other hand, the similar performance of all models
might prompt you to choose instead a more simple non-nested model for prediction like the Matérn
or the Gaussian model.

Regardless of whether you might opt to sacrifice the statistically best fit (depending on your
selected criteria) to simplicity, eventually you are the one to decide which approach serves your
study optimally. The model fitting features of PROC VARIOGRAM offer you significant assistance
so that you can assess your options efficiently.
Example 53.1 Spatial Prediction of Pollutant Concentration

The example in the section Aspects of Semivariogram Model Fitting in Chapter 106: The
VARIOGRAM Procedure, investigates fitting of a theoretical model to describe spatial correlation in
a study of 138 simulated arsenic logarithm concentration (logAs) observations. These observations
form the logAsData data set, which is treated as actual data for illustration in the examples.

In this example, you use the logAsData data set and the semivariogram analysis results to predict
the logAs variable values across space in a specified square region of size 500 km 500 km. Your
goal is to answer scientific questions in your analysis by means of your prediction results. This
application highlights the impact of the correlation model choice on predictions. The example in
section Investigating the Effect of Model Specification on Spatial Prediction examines additional
aspects of this impact.

The World Health Organization (WHO) standard for maximum arsenic concentration in drinking
water is 10 g/lt. Assume that you want to answer the following question: In what percentage of the
study area does the Arsenic concentration exceed the WHO regulatory standard?

First, you read the logAsData data set with the following DATA step:

title 'Spatial Prediction of Log-Arsenic Concentration';

data logAsData;
input East North logAs @@;
label logAs='log(As) Concentration';
datalines;
193.0 296.6 -0.68153 232.6 479.1 0.96279 268.7 312.5 -1.02908
43.6 4.9 0.65010 152.6 54.9 1.87076 449.1 395.8 0.95932
310.9 493.6 -1.66208 287.8 164.9 -0.01779 330.0 8.0 2.06837
225.7 241.7 0.15899 452.3 83.4 -1.21217 156.5 462.5 -0.89031
11.5 84.4 -0.24496 144.4 335.7 0.11950 149.0 431.8 -0.57251
234.3 123.2 -1.33642 37.8 197.8 -0.27624 183.1 173.9 -2.14558
149.3 426.7 -1.06506 434.4 67.5 -1.04657 439.6 237.0 -0.09074
36.4 175.2 -1.21211 370.6 244.0 3.28091 452.0 96.5 -0.77081
247.0 86.8 0.04720 413.6 373.2 1.78235 253.5 291.7 0.56132
129.7 111.9 1.34000 352.7 42.1 0.23621 279.3 82.7 2.12350
382.6 290.7 0.86756 188.2 222.8 -1.23308 382.8 154.5 -0.94094
304.4 309.2 -1.95158 337.5 387.2 -1.31294 490.7 189.8 0.40206
159.0 100.1 -0.22272 245.5 329.2 -0.26082 372.1 379.5 -1.89078
417.8 84.1 -1.25176 173.9 407.6 -0.24240 121.5 107.7 1.54509
453.5 313.6 0.65895 143.5 346.7 -0.87196 157.4 125.5 -1.96165
371.8 353.2 -0.59464 358.9 338.2 -1.07133 8.6 437.8 1.44203
395.9 394.2 -0.24144 149.5 58.9 1.17459 453.5 420.6 -0.63951
182.3 85.0 1.00005 21.0 290.1 0.31016 11.1 352.2 -0.88418
131.2 238.4 -0.57184 104.9 6.3 1.12054 247.3 256.0 0.14019
428.4 383.7 0.92448 327.8 481.1 -2.72543 199.2 92.8 -0.05717
453.9 230.1 0.16571 205.0 250.6 0.07581 459.5 271.6 0.93700
229.5 262.8 1.83590 370.4 228.6 2.96611 330.2 281.9 1.79723
354.8 388.3 -3.18262 406.2 222.7 2.41594 254.4 393.1 2.03221
96.7 85.2 -0.47156 407.2 256.8 0.66747 498.5 273.8 1.03041
417.2 471.4 -1.42766 368.8 424.3 -0.70506 303.0 59.1 1.43070
403.1 264.1 1.64554 21.2 360.8 0.67094 148.2 78.1 2.15323
305.5 310.7 -1.47985 228.5 180.3 -0.68386 161.1 143.3 1.07901
70.5 155.1 0.54652 363.1 282.6 -0.43051 86.0 472.5 -1.18855
175.9 105.3 -2.08112 96.8 426.3 1.56592 475.1 453.1 -1.53776
125.7 485.4 1.40054 277.9 201.6 -0.54565 406.2 125.0 -1.38657
60.0 275.5 -0.59966 431.3 494.6 -0.36860 399.9 399.0 -0.77265
28.8 311.1 0.91693 166.1 348.2 -0.49056 266.6 83.5 0.67277
54.7 356.3 0.49596 433.5 460.3 -1.61309 201.7 167.6 -1.40678
158.1 203.6 -1.32499 67.6 230.4 1.14672 81.9 250.0 0.63378
372.0 50.7 0.72445 26.4 264.6 1.00862 300.1 91.7 -0.74089
303.0 447.4 1.74589 108.4 386.2 1.12847 55.6 191.7 0.95175
36.3 273.2 1.78880 94.5 298.3 -2.43320 366.1 187.3 -0.80526
130.7 389.2 -0.31513 37.2 324.2 0.24489 295.5 211.8 0.41899
58.6 206.2 0.18495 346.3 142.8 -0.92038 484.2 215.9 0.08012
451.4 415.7 0.02773 58.9 86.5 0.17652 212.6 363.9 0.17215
378.7 407.6 0.51516 265.9 305.0 -0.30718 123.2 314.8 -0.90591
26.9 471.7 1.70285 16.5 7.1 0.51736 255.1 472.6 2.02381
111.5 148.4 -0.09658 440.4 375.0 1.23285 406.4 19.5 1.01181
321.2 65.8 -0.02095 466.4 357.1 -0.49272 2.0 484.6 0.50994
200.9 205.1 0.43543 30.3 337.0 1.60882 297.0 12.7 1.79824
158.2 450.7 0.05295 122.8 105.3 1.53936 417.8 329.7 -2.08124
;

For prediction of the logAs values in the specified area, assume a rectangular grid of nodes with an
equal spacing of 5 km between neighboring nodes in the north and east directions. This produces a
total of prediction locations.

In the section Aspects of Semivariogram Model Fitting in Chapter 106: The VARIOGRAM


Procedure, you saved the selected fitted model that resulted from the correlation analysis into the
SemivAsStore item store as shown in the following statements:

ods graphics on;


proc variogram data=logAsData plots=none;
store out=SemivAsStore / label='LogAs Concentration Models';
compute lagd=5 maxlag=40;
coord xc=East yc=North;
model form=auto(mlist=(exp,gau,mat) nest=1 to 2);
var logAs;
run;

In the KRIGE2D procedure you specify the name of the item store you want to use for prediction
input in the IN= option of the RESTORE statement. You request use of the selected model for
prediction by specifying the STORESELECT option in the MODEL statement.

The INFO option of the RESTORE statement produces a table with information about the selected
fitted model in the item store. To review all models in the input item store, specify the two INFO
option suboptions. In particular, specify the DET suboption to request details about all additional
fitted models that are included in the item store and the ONLY suboption to suppress prediction and
produce only the tables about the item store, as shown in the following statements:

proc krige2d data=logAsData outest=pred plots=none;


restore in=SemivAsStore / info(det only);
coordinates xc=East yc=North;
predict var=logAs;
model storeselect;
grid x=0 to 500 by 5 y=0 to 500 by 5;
run;
PROC KRIGE2D produces a table with general information about the input item store identity, as
shown in Output 53.1.1.

Output 53.1.1: PROC KRIGE2D and Input Item Store General Information
Spatial Prediction of Log-Arsenic Concentration

The KRIGE2D Procedure


Correlation Model Item Store Information
Input Item Store WORK.SEMIVASSTORE
Item Store Label LogAs Concentration
Models
Data Set Created From WORK.LOGASDATA
By-group Information No By-groups Present
Created By PROC VARIOGRAM
Date Created 11OCT13:15:36:11

The second table in Output 53.1.2 itemizes the variables in the item store and displays the sample
mean and standard deviation of their data set of origin. Hence, the values shown in Output 53.1.2
refer to the observations in the logAsData data set.

Output 53.1.2: Variables in the Input Item Store


Item Store Variables
Variable Mean Std
Deviation
logAs 0.08430 1.527707
9

The table in Output 53.1.3 presents all the correlation models fitted to the arsenic logarithm logAs
empirical semivariance that are saved in the SemivAsStore item store.

Output 53.1.3: Angle and Models Information in the Input Item Store
Item Store Models
For logAs
ClassModel
1Gau-Gau
 Gau-Mat
2Exp-Gau
3Exp-Mat
4Mat
5Gau
6Exp
 Exp-Exp
Item Store Models
For logAs
ClassModel
 Mat-Exp
 Gau-Exp

According to Output 53.1.3, the Gaussian-Gaussian model is the selected model for the empirical
semivariance fit based on the specific weighted least squares fit and ranking criteria. In the section
Aspects of Semivariogram Model Fitting in Chapter 106: The VARIOGRAM Procedure, it is noted
that all fitted models in the first five equivalence classes produce very similar semivariograms, and
this is likely to lead to similar results in prediction analysis. For comparison purposes, you choose to
examine the selected model, in addition to the exponential model in the SemivAsStore item store.
As shown in Output 53.1.3, the exponential model is one of the least well-fit models based on the
criteria used for the specific fit. You are interested in comparing the predictions from each one of
these two models, and you examine their impact on your analysis.

The default item store model selection is the model on top of the list in Output 53.1.3. Hence, you
specify the STORESELECT option in the MODEL statement without any suboptions, and it invokes
the Gaussian-Gaussian model from the SemivAsStore item store. You assign the label SELMODEL
to the corresponding MODEL statement.

You also specify a second MODEL statement with the label EXPMODEL to request prediction
based on the exponential correlation form. In this case you specify the STORESELECT(MODEL=)
option in the MODEL statement to request the desired form.

You omit the INFO option from the RESTORE statement. You specify the PRED and the SEMIVAR
options in the PLOTS option of the PROC KRIGE2D statement to produce plots of the predicted
values and the semivariance model, respectively, for each MODEL statement. You request that the
prediction output be saved in the Pred output data set.

You satisfy the preceding requests by specifying the following statements:

proc krige2d data=logAsData outest=Pred plots(only)=(pred semivar);


restore in=SemivAsStore;
coordinates xc=East yc=North;
predict var=logAs;
SelModel: model storeselect;
ExpModel: model storeselect(model=exp);
grid x=0 to 500 by 5 y=0 to 500 by 5;
run;

When you run these statements, in addition to the input item store information table, PROC
KRIGE2D also produces the number of observations table and general kriging process information,
as shown in Output 53.1.4.

Output 53.1.4: Number of Observations and Kriging Information Tables


Spatial Prediction of Log-Arsenic Concentration
The KRIGE2D Procedure
Dependent Variable: logAs
Number of Observations Read 13
8
Number of Observations Used 13
8

Kriging Information
Prediction Grid Points 10201
Type of Analysis Globa
l

PROC KRIGE2D first uses the Gaussian-Gaussian model. The table in Output 53.1.5 shows the
saved parameter values of the fitted Gaussian-Gaussian model in the SemivAsStore item store.
PROC KRIGE2D uses these parameters for the prediction based on the selected model.

Output 53.1.5: Information about the Gaussian-Gaussian Model


Spatial Prediction of Log-Arsenic Concentration

The KRIGE2D Procedure


Dependent Variable: logAs
Prediction: Pred1, Model: SelModel
Covariance Model Information for SelModel
Nested Structure 1 Type Gaussian
Nested Structure 1 Sill 0.327664
6
Nested Structure 1 Range 62.31272
8
Nested Structure 1 Effective Range 107.9288
1
Nested Structure 2 Type Gaussian
Nested Structure 2 Sill 1.261545
Nested Structure 2 Range 21.45956
3
Nested Structure 2 Effective Range 37.16905
3
Nugget Effect 0.083075
8

The semivariogram of the Gaussian-Gaussian model with the parameters shown in Output 53.1.5 is
depicted in Output 53.1.6.

Output 53.1.6: Gaussian-Gaussian Semivariogram Model Used in Kriging Predictions


Output 53.1.7 is a map of the kriging prediction of the arsenic concentration values logAs in the
specified domain. The prediction error surface shows a naturally increasing error as you move
farther away from the observation locations. Interestingly, kriging predicts a small area of increased
arsenic concentration values located in the central-eastern part of the domain. The WHO threshold
of 10 g/lt for the maximum allowed arsenic concentration in water translates into about 2.3 in the
log scale, and the particular area exhibits values in excess of 3. Due to the suggested violation of
the WHO standard, this particular area is very likely to be the focus of further environmental risk
analysis.

Output 53.1.7: Predicted Arsenic Logarithm Values with Gaussian-Gaussian Covariance


Next, PROC KRIGE2D performs prediction with the exponential model. The model parameters are
also read from the SemivAsStore item store and are shown in Output 53.1.8.

Output 53.1.8: Information about the Exponential Model


Spatial Prediction of Log-Arsenic Concentration

The KRIGE2D Procedure


Dependent Variable: logAs
Prediction: Pred1, Model: ExpModel
Covariance Model Information for
ExpModel
Type Exponential
Sill 1.6779788
Range 24.537294
Effective Range 73.611882
Nugget Effect 0
Output 53.1.9 illustrates the semivariogram of the nested exponential model where its parameter
values are those shown in Output 53.1.8.

Output 53.1.9: Exponential Semivariogram Model Used in Kriging Predictions

The prediction plot for the exponential model is shown in Output 53.1.10. Prediction values and
spatial patterns are similar overall to those of the Gaussian-Gaussian case. Clearly, although both
models predict the same basic characteristics for the arsenic logarithm concentration distribution,
the exponential model suggests a more limited spatial variability in closely neighboring locations.
The lack of a nugget effect in the exponential model justifies this behavior. Also, the exponential
model predictions seem less inclined to deviate farther away from the near-zero mean than the
Gaussian-Gaussian model predictions.The prediction error reaches about the same upper values
for both models, though its low values are slightly smaller in the exponential model.

Output 53.1.10: Predicted Arsenic Logarithm Values with Exponential Covariance


In the following two-step computation, you proceed to compute the percentage of the study area
where the arsenic concentration exceeds the WHO regulatory standard according to your
predictions. First, a DATA step marks the arsenic predicted values in excess of the WHO
concentration threshold of 10 g/lt and saves the outcome into an indicator variable OverLimit. The
DATA step input is the prediction Pred output data set, where the logarithm arsenic prediction is
stored in the estimate variable. The DATA step also transforms the arsenic logarithm values back
into arsenic concentration values to compare them to the threshold value. You use the following
statements:

data AsOverLimit;
set Pred;
OverLimit = (exp(estimate) > 10) * 100;
run;

The second step uses the MEANS procedure to express the selected nodes population, where the
WHO arsenic concentration limit violation occurs, as a percentage of the entire domain area. You
study the results of each correlation model separately by specifying the BY statement in the PROC
MEANS. The BY variable is the Label variable in the AsOverLimit and Pred data sets. You need to
sort the AsOverLimit data prior to using PROC MEANS. You run the following statements:
proc sort data=AsOverLimit;
by Label;
run;
proc means data=AsOverLimit mean;
var OverLimit;
by label;
label Overlimit="Percent above WHO threshold";
run;

ods graphics off;

The Gaussian-Gaussian model prediction produces the result in Output 53.1.11. The analysis
suggests a minimal occurrence of excessive arsenic concentration in drinking water in about 0.43%
of the study region.

Output 53.1.11: Violation of Arsenic Concentration Threshold Using Gaussian-Gaussian


Model
Spatial Prediction of Log-Arsenic Concentration

The MEANS Procedure


Label for the PREDICT/MODEL combination=Pred1.SelModel
Analysis Variable
: OverLimit Percent
above WHO threshold
Mean
0.4313303

The exponential model predicts that the WHO arsenic concentration threshold is exceeded in about
0.27% of the domain, as shown in Output 53.1.12. Although this is still a minimal occurrence of the
threshold violation across the region, the exponential model estimates the impact to be at about two
thirds of the Gaussian-Gaussian model percentage.

Output 53.1.12: Violation of Arsenic Concentration Threshold Using Exponential Model


Spatial Prediction of Log-Arsenic Concentration

The MEANS Procedure


Label for the PREDICT/MODEL combination=Pred1.ExpModel
Analysis Variable
: OverLimit Percent
above WHO threshold
Mean
0.2744829

The results in Output 53.1.11 and Output 53.1.12 suggest that it might not be possible to provide a
unique answer about the area percentage that is affected by increased arsenic concentration. You
chose to examine two different correlation models whose performance is relatively similar, and they
provide impact estimates that differ by about 37%.

You might conclude that the answer to the initial question about the percentage value lies in the
neighborhood of the results given by the two correlation models. Further analysis with more models
is necessary to validate this assumption. It is important to note that apart from the continuity model
choice, additional factors contribute to this investigation. Such factors could be the use of local
instead of global kriging, or even going back to the empirical semivariogram computation stage and
repeating the analysis for different possible spatial continuity empirical estimates. A sensible
approach to tackle this analysis would be to investigate the range of the impact suggested by all
candidate correlation models and to proceed by defining the best and worst case scenarios for the
size of the affected area.

Eventually, when it comes to using your findings, it is important to account for the subjective nature
of stochastic analysis and multiple possible answers to your questions. In that sense, some
scientific questions might be more sensible than others to interpret your results correctly. For
instance, you might want to investigate only whether the adversely affected domain percentage is
below 1%, rather than attempting to provide a specific value for it. Then, you might consider the
preceding findings sufficient, despite any fluctuations in the estimated percentage. In a different
scenario, the areas with high pollutant concentration could be populated. Hence, any local health
standard violation is probably unacceptable, and it can be crucial that you provide solid and more
detailed assessment in that case.

The section Risk Analysis with Simulation in Chapter 89: The SIM2D Procedure, investigates a
different aspect of this study and offers additional perspective about spatial analysis.

Copyright © SAS Institute Inc. All Rights Reserved.

You might also like