
I've been wondering whether a "weighted average" is a valid way to think about the Gaussian Process, specifically in the context of GP regression. The kernel (I'll be referring to the common Radial Basis Function (RBF) kernel) plays an extremely important role: it determines how similar any pair $(x_i, x_j)$ is based on the distance between the two points.

Background

Consider the function and bit of code below:

import numpy as np
import matplotlib.pyplot as plt

dims = 10
x = np.linspace(1, 10, dims)   # 10 evenly spaced inputs, spacing of 1
y = np.random.uniform(low=5, high=10, size=len(x))

[GP Plot]

# code from the blog post linked at the bottom
def kernel(X1, X2, l=1.0, sigma_f=1.0):
    # pairwise squared Euclidean distances between the rows of X1 and X2
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    # RBF / squared-exponential kernel
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)

cov = kernel(x.reshape(-1,1),x.reshape(-1,1))

Let's see what this returns when given X:

cov.round(2)
>>>
array([[1.  , 0.61, 0.14, 0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.61, 1.  , 0.61, 0.14, 0.01, 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.14, 0.61, 1.  , 0.61, 0.14, 0.01, 0.  , 0.  , 0.  , 0.  ],
       [0.01, 0.14, 0.61, 1.  , 0.61, 0.14, 0.01, 0.  , 0.  , 0.  ],
       [0.  , 0.01, 0.14, 0.61, 1.  , 0.61, 0.14, 0.01, 0.  , 0.  ],
       [0.  , 0.  , 0.01, 0.14, 0.61, 1.  , 0.61, 0.14, 0.01, 0.  ],
       [0.  , 0.  , 0.  , 0.01, 0.14, 0.61, 1.  , 0.61, 0.14, 0.01],
       [0.  , 0.  , 0.  , 0.  , 0.01, 0.14, 0.61, 1.  , 0.61, 0.14],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.14, 0.61, 1.  , 0.61],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.14, 0.61, 1.  ]])

As you can see, on a column-by-column (or row-by-row, since the matrix is symmetric) basis, the values do not sum to one. However, each column does have a peak value, and this hints at what I mean by a "weighted average": a given $x_i$ is pulled most strongly by itself, to lesser degrees by nearby neighbors, and not at all by distant neighbors.
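
To make this concrete, a quick check of the row sums (reading the values off the rounded matrix above, the interior rows come out around 2.5 and the edge rows around 1.75):

# rows of the raw kernel matrix don't sum to 1: interior rows sum to
# roughly 2.5, the first and last rows to roughly 1.75
print(cov.sum(axis=1).round(2))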

So far I've only been talking about $X$, without mentioning $Y$. In all the literature I've ingested, $Y$ comes into play when a dot product is taken between $Y$ and some aggregation of kernels.

# mv_np is presumably an alias for NumPy's multivariate normal sampler
mv_np = np.random.multivariate_normal

gp1 = np.dot(cov, y)

plt.scatter(x, y)
for i in range(5):
    samples = mv_np(mean=gp1, cov=cov)
    plt.plot(x, samples)

[GP1 plot]

This plot looks pretty bad, as it grossly overestimates every $y_i$. That makes sense, since the rows of the covariance matrix do not sum to 1. However, if we make that simple change, the plot looks much prettier.

# normalize each row of the kernel matrix to sum to 1
cov_normed = cov / cov.sum(axis=1, keepdims=True)
gp2 = np.dot(cov_normed, y)

plt.scatter(x, y)
for i in range(5):
    samples = mv_np(mean=gp2, cov=cov)
    plt.plot(x, samples)

[GP2 Plot]

In both plots, I've sampled from a multivariate Gaussian with mean vector (kernel $\cdot$ $y$) and covariance (kernel($x$)). The only difference in the latter case is that I've normalized the kernel rows to sum to 1, so that any given element of the mean vector is a weighted average of $Y$ based on the kernel elements.

Question

In one of the resources I've reviewed, the mean vector and covariance matrix are defined via Gaussian conditioning rules. You'll note that the author made the (common) decision to set the mean vector (of the training $X$ data) to 0.

At first, I thought these conditioning tricks were just used to enable predictions of an unknown $y$ given $X_{train}$, $X_{test}$, and $y_{train}$, which of course they accomplish. But in light of what I've captured above, I'm starting to think that this conditioning trick accomplishes something else as well: namely, it somehow normalizes the kernel output so that when the kernel aggregation is projected onto $Y$, a weighted average of $Y$ is returned as the mean vector.

Is this valid? Any thoughts you'd like to add?

Blog post

Edit: After playing around with the blog author's code, I found that the overestimation problem goes away as soon as conditioning is executed. On closer inspection, when we condition on $X_{train}$, the mean vector at such a point is set to $1 \cdot y_{train}$ for the corresponding point and the covariance is set to 0. This forces the multivariate Gaussian to reproduce the training point exactly, while sampling the $Y$ values corresponding to $X_{test}$ stochastically. This lets us visualize our uncertainty around unknown inputs, based on how distant they are from known points, when sampling the posterior.
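
For reference, here is a minimal sketch of what that conditioning step could look like with the kernel function above (my own illustration rather than the blog author's exact code; `x_test`, `x_train`, and `y_train` are hypothetical 1-D arrays, and observation noise is ignored):

# minimal sketch of noise-free GP conditioning, assuming a zero prior mean
def posterior(x_test, x_train, y_train, l=1.0, sigma_f=1.0):
    K = kernel(x_train.reshape(-1, 1), x_train.reshape(-1, 1), l, sigma_f)
    K_s = kernel(x_test.reshape(-1, 1), x_train.reshape(-1, 1), l, sigma_f)
    K_ss = kernel(x_test.reshape(-1, 1), x_test.reshape(-1, 1), l, sigma_f)
    K_inv = np.linalg.inv(K)
    mu = K_s @ K_inv @ y_train          # posterior mean
    cov = K_ss - K_s @ K_inv @ K_s.T    # posterior covariance
    # at a test point that coincides with a training point, mu reproduces
    # y_train exactly and the corresponding variance collapses to 0
    return mu, cov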

In a nutshell, the conditioning step is sort of THE step. From a more intuitive perspective, training and testing/prediction are not separate steps in Gaussian processes the way they are in parametric methods. Conditioning is the mechanism that generates the posterior distribution, so without conditioning there is no model.


1 Answer


Consider a typical GP regression model with zero mean function, covariance function $k$, and i.i.d. Gaussian noise (with variance $\sigma_n^2$). The posterior predictive distribution $p(y_* \mid X_*, X, y)$ describes the outputs $y_*$ at new test points $X_*$, having seen training data $(X, y)$. Its expectation can be written as:

$$\begin{array}{c} E[y_* \mid X_*, X, y] = H y \\ H = K(X_*, X) \big( K(X, X) + \sigma_n^2 I \big)^{-1} \\ \end{array}$$

where $K(A,B)$ denotes a matrix containing the covariance function evaluated between each pair of points in $A$ and $B$.

This says that each expected test output is a linear combination of the training outputs. Each test point has a corresponding row in $H$, containing a weight for each training point. The predicted test output is then obtained as a weighted sum of the training outputs. If you use a 'local' covariance function that decays with distance in input space (e.g. squared exponential/RBF), then the weights in $H$ will also decay with distance to the test point. For more information, check out section 2.6 in Gaussian Processes for Machine Learning (Rasmussen & Williams 2006).
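
To connect this to the code in the question, a rough numpy sketch of $H$ and the posterior mean might look like the following (the noise level `sigma_n` and the test inputs are assumptions of mine, not part of the original post):

# sketch: posterior mean as a weighted sum of training outputs (H @ y)
sigma_n = 0.1                                   # assumed noise standard deviation
X_train = x.reshape(-1, 1)                      # training inputs from the question
X_test = np.linspace(1, 10, 50).reshape(-1, 1)  # hypothetical test inputs

K = kernel(X_train, X_train)                    # K(X, X)
K_s = kernel(X_test, X_train)                   # K(X_*, X)
H = K_s @ np.linalg.inv(K + sigma_n**2 * np.eye(len(X_train)))
mean_test = H @ y                               # E[y_* | X_*, X, y]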

Note that the predictions above are not strictly weighted averages of the training outputs, because $H$ may contain negative values, and the rows may not sum to one. However, you might be interested in comparing GP regression to Nadaraya-Watson kernel regression, which does indeed use weighted averages. Here, the test predictions are also obtained as $\hat{y}_* = H y$. However, $H$ is computed by taking $K(X_*, X)$, then normalizing each row to sum to one. This seems to be similar to what you've done in your plots.
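
For comparison, a Nadaraya-Watson-style predictor can be sketched by normalizing each row of $K(X_*, X)$, reusing the arrays from the snippet above (again just an illustration):

# Nadaraya-Watson: normalize each row of K(X_*, X) to sum to one
W = K_s / K_s.sum(axis=1, keepdims=True)
nw_mean = W @ y   # each prediction is a genuine weighted average of y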

  • Thanks for the info! Is there a short rule of thumb for choosing GP regression vs NW kernel regression? The latter seems more stable to me, but might come with drawbacks I've overlooked.
    – jbuddy_13
    Commented Jan 13, 2021 at 19:23
  • To your point about negative values in GP (vs weighted averages), I wonder if the kernel function the blog author provided is erroneous, as it never produces negative values. Not sure if this is universally true for the RBF kernel.
    – jbuddy_13
    Commented Jan 13, 2021 at 21:05
  • And the idea that negative values are never produced is the crux of my question. Does conditioning somehow alter this property in a beneficial way?
    – jbuddy_13
    Commented Jan 13, 2021 at 21:25
  • @jbuddy_13 $H$ (which can have negative entries) is the effective "weight matrix", not the kernel/covariance matrix (which I've written as $K(X, X)$). Entries of the covariance matrix must always be nonnegative.
    – user20160
    Commented Jan 14, 2021 at 1:17
  • @jbuddy_13 I almost always prefer GP regression over NW kernel regression. For example: 1) NW can be badly biased at the edges of the data, where the weight function falls off the end. 2) GP regression is a probabilistic method that gives uncertainty estimates over predictions, whereas NW only gives point estimates. 3) NW mostly makes sense with 'local' (e.g. RBF) weight functions. GP regression can use various kernels, encoding different kinds of prior beliefs about the function to be estimated.
    – user20160
    Commented Jan 14, 2021 at 1:31
