I've been wondering whether a "weighted average" is a valid way to think about the Gaussian Process, specifically in the context of GP regression. The kernel (I'll be referring to the common Radial Basis Function (RBF) kernel) plays an extremely important role: it determines how similar any pair $(x_i, x_j)$ is based on the distance between the two points.
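Concretely, the RBF kernel used in the code below, with length scale $l$ and signal variance $\sigma_f^2$, is
$$k(x_i, x_j) = \sigma_f^2 \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2l^2}\right),$$
so each entry lies between 0 and $\sigma_f^2$ and shrinks as the points move apart.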
Background
Consider the function and bit of code below:
import numpy as np
import matplotlib.pyplot as plt

dims = 10
x = np.linspace(1, 10, dims)
y = np.random.uniform(low=5, high=10, size=len(x))

# RBF kernel; code from blog linked at bottom
def kernel(X1, X2, l=1.0, sigma_f=1.0):
    # pairwise squared Euclidean distances between the rows of X1 and X2
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)

cov = kernel(x.reshape(-1, 1), x.reshape(-1, 1))
Let's see what this returns when given X:
cov.round(2)
>>>
array([[1. , 0.61, 0.14, 0.01, 0. , 0. , 0. , 0. , 0. , 0. ],
[0.61, 1. , 0.61, 0.14, 0.01, 0. , 0. , 0. , 0. , 0. ],
[0.14, 0.61, 1. , 0.61, 0.14, 0.01, 0. , 0. , 0. , 0. ],
[0.01, 0.14, 0.61, 1. , 0.61, 0.14, 0.01, 0. , 0. , 0. ],
[0. , 0.01, 0.14, 0.61, 1. , 0.61, 0.14, 0.01, 0. , 0. ],
[0. , 0. , 0.01, 0.14, 0.61, 1. , 0.61, 0.14, 0.01, 0. ],
[0. , 0. , 0. , 0.01, 0.14, 0.61, 1. , 0.61, 0.14, 0.01],
[0. , 0. , 0. , 0. , 0.01, 0.14, 0.61, 1. , 0.61, 0.14],
[0. , 0. , 0. , 0. , 0. , 0.01, 0.14, 0.61, 1. , 0.61],
[0. , 0. , 0. , 0. , 0. , 0. , 0.01, 0.14, 0.61, 1. ]])
As you can see, on a column-by-column (or row-by-row, since the matrix is symmetric) basis, the values do not sum to one. However, each row does have a peak value, and this hints at what I mean by a "weighted average": a given $x_i$ is pulled most strongly by itself, to lesser degrees by its nearby neighbors, and not at all by distant neighbors.
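To make that explicit, here's a quick check of the row sums of the cov computed above; reading off the rounded matrix, they come out around 1.8 at the edges and roughly 2.5 in the middle, so none of them is 1:
# row sums of the kernel matrix -- none of them equal 1
cov.sum(axis=1).round(2)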
So far I've only been talking about $X$ and haven't mentioned $Y$. In all the literature I've ingested, $Y$ comes into play when a dot product is taken between $Y$ and some aggregation of kernels.
# mv_np: shorthand for sampling from a multivariate Gaussian
mv_np = np.random.multivariate_normal

# project y through the (un-normalized) kernel rows to get a mean vector
gp1 = np.dot(cov, y)

plt.scatter(x, y)
for i in range(5):
    samples = mv_np(mean=gp1, cov=cov)
    plt.plot(x, samples)
This plot looks pretty bad, as it grossly overestimates every $y_i$. That makes sense, since the rows of the covariance matrix do not sum to 1. However, if we make that simple change, the plot looks much prettier.
# normalize each row to sum to 1 (keepdims makes the division row-wise)
cov_normed = cov / cov.sum(axis=1, keepdims=True)
gp2 = np.dot(cov_normed, y)

plt.scatter(x, y)
for i in range(5):
    samples = mv_np(mean=gp2, cov=cov)
    plt.plot(x, samples)
In both plots, I've sampled from a multivariate Gaussian given the mean vector (kernel $\cdot$ y) and covariance (kernel(x)). The only difference in the latter case is that I've normalized the kernel rows to sum to 1, so that each element of the mean vector is a weighted average of $Y$, with weights given by the kernel entries.
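As a quick sanity check with the variables above: every row of cov_normed now sums to 1, so each element of gp2 is a convex combination of the observed $y$ values and necessarily falls inside their range.
# each row of the normalized kernel sums to 1 ...
print(cov_normed.sum(axis=1).round(6))
# ... so the mean vector gp2 is bounded by the observed y values
print(y.min() <= gp2.min(), gp2.max() <= y.max())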
Question
In one of the resources I've reviewed, the mean vector and covariance matrix are defined via Gaussian conditioning rules. You'll note that the author made the (common) decision to set the prior mean vector (of the training $X$ data) to 0.
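For reference, the conditioning formulas I have in mind are the standard noise-free, zero-prior-mean ones. Writing $K = k(X_{train}, X_{train})$, $K_* = k(X_{train}, X_{test})$, and $K_{**} = k(X_{test}, X_{test})$, the posterior over the test points is
$$\mu_* = K_*^\top K^{-1} y_{train}, \qquad \Sigma_* = K_{**} - K_*^\top K^{-1} K_*.$$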
At first, I thought these conditioning tricks were just used to enable predictions of an unknown $y$ given $X_{train}$, $X_{test}$, and $y_{train}$, which of course they accomplish. But in light of what I've captured above, I'm starting to think this conditioning trick accomplishes something else as well; namely, it somehow normalizes the kernel output so that when the kernel aggregation is projected onto $Y$, the resulting mean vector is a weighted average of $Y$.
Is this valid? Any thoughts you'd like to add?
Edit: After playing around with the blog author's code, I found that the overestimation problem goes away as soon as the conditioning is executed. On closer inspection, when we condition on $X_{train}$, the posterior mean at each training point is set to exactly $1 \cdot y_{train}$ for that point and the posterior covariance there is set to 0. This forces the multivariate Gaussian to reproduce the training points exactly while sampling the $Y$ values corresponding to $X_{test}$ stochastically. This lets us visualize our uncertainty around unknown inputs, based on how distant they are from known points, when sampling the posterior.
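For anyone curious, here's a minimal sketch of that noise-free conditioning, reusing kernel, x, y, and mv_np from above; X_test is a hypothetical dense grid I've added, and the jitter terms are just for numerical stability. The sampled curves are pinned to the training points and only wander in between.
X_train = x.reshape(-1, 1)
X_test = np.linspace(1, 10, 100).reshape(-1, 1)  # hypothetical dense test grid
K = kernel(X_train, X_train) + 1e-8 * np.eye(len(x))  # tiny jitter for stability
K_s = kernel(X_train, X_test)
K_ss = kernel(X_test, X_test)
K_inv = np.linalg.inv(K)
mu_s = K_s.T.dot(K_inv).dot(y)            # posterior mean: ~y at the training inputs
cov_s = K_ss - K_s.T.dot(K_inv).dot(K_s)  # posterior covariance: ~0 at the training inputs
plt.scatter(x, y)
for i in range(5):
    samples = mv_np(mean=mu_s, cov=cov_s + 1e-6 * np.eye(len(X_test)))
    plt.plot(X_test, samples)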
In a nutshell, the conditioning step is sort of THE step. From a more intuitive perspective, training and testing/prediction are not separate steps in Gaussian processes the way they are in parametric methods; conditioning is the mechanism that generates the posterior distribution. So without conditioning, there's no model.