Questions tagged [kernel-trick]
Kernel methods are used in machine learning to generalize linear techniques to nonlinear situations, especially SVMs, PCA, and GPs. Not to be confused with [kernel-smoothing], for kernel density estimation (KDE) and kernel regression.
757 questions
0
votes
0
answers
11
views
When running a Support Vector Machine, how do I formulate the linear transformation that flips the decision hyperplane in the non-augmented dimension?
We know that when running a support vector machine, we actually use the "kernel trick" to compute the decision hyperplane (boundary) as if we do so in the kernel-augmented dimension, but not ...
0
votes
0
answers
92
views
Proof: The Gaussian Kernel as an Inner Product in Infinite-Dimensional Feature Space
Prove that the Gaussian kernel on $\mathbb{R}^d$ for a positive integer $d$:
\begin{equation}
k(x, x') = \exp(-\gamma \|x - x'\|^2) \tag{1}
\end{equation}
for $\gamma > 0$, can be expressed as ...
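For reference, the usual route to this result (sketched here for $d = 1$; the general case follows coordinate-wise) is to factor the kernel and Taylor-expand the cross term:
$$
k(x, x') = e^{-\gamma x^2}\, e^{-\gamma x'^2}\, e^{2\gamma x x'} = \sum_{k=0}^{\infty} \left(e^{-\gamma x^2}\sqrt{\tfrac{(2\gamma)^k}{k!}}\, x^k\right)\left(e^{-\gamma x'^2}\sqrt{\tfrac{(2\gamma)^k}{k!}}\, x'^k\right),
$$
which exhibits $k(x, x') = \langle \phi(x), \phi(x')\rangle_{\ell^2}$ with the infinite-dimensional feature map $\phi_k(x) = e^{-\gamma x^2}\sqrt{(2\gamma)^k/k!}\, x^k$.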
0
votes
0
answers
152
views
Prove that a matrix constructed from a Gaussian RBF is PSD
I have a radial basis function $k(x, y) = \exp(-{(x-y)}^T M {(x-y)})$ where $M$ is a symmetric PSD matrix.
I know that $k(\cdot)$ is a kernel itself: Prove that multiplication with positive ...
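Not a proof, but a quick numerical sanity check of the claim (the point set, dimension, and seed below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random symmetric PSD matrix M, as in the question.
A = rng.standard_normal((3, 3))
M = A @ A.T

# Gram matrix K_ij = exp(-(x_i - x_j)^T M (x_i - x_j)) on random points.
X = rng.standard_normal((50, 3))
diffs = X[:, None, :] - X[None, :, :]                      # shape (n, n, 3)
K = np.exp(-np.einsum('ijk,kl,ijl->ij', diffs, M, diffs))

# If K is PSD, the smallest eigenvalue is >= 0 up to round-off.
print(np.linalg.eigvalsh(K).min())
```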
0
votes
0
answers
7
views
Has natural language processing/generation been attempted with kernel methods?
I am curious as to whether there has been much success in the past with applying kernel methods to perform natural language processing/generation?
Rather than just numerical evidence, I'm particularly ...
3
votes
0
answers
34
views
Convergence of kernel mean embeddings
Let $k(\cdot,\cdot)$ be a bounded kernel and $\mathcal{H}$ its associated RKHS. Define the kernel mean embedding $\mu=\int k(\cdot,x) \, dP_X(x)$ and let $\hat{\mu}=\frac{1}{n}\sum k(x_i,\cdot)$ be ...
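For context, the standard first step (a sketch, assuming $x_1,\dots,x_n \overset{\text{iid}}{\sim} P_X$ and $\sup_x k(x,x) \le B$) is to expand the expected squared RKHS error:
$$
\mathbb{E}\,\|\hat{\mu} - \mu\|_{\mathcal{H}}^2 = \frac{1}{n}\left(\mathbb{E}\,k(X, X) - \mathbb{E}\,k(X, X')\right) \le \frac{2B}{n},
$$
where $X, X'$ are independent draws from $P_X$; Markov's inequality then gives $\|\hat{\mu} - \mu\|_{\mathcal{H}} = O_p(n^{-1/2})$, and high-probability versions follow from McDiarmid-type concentration.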
0
votes
0
answers
19
views
How to get the second stationary point condition corresponding to intercept when using the augmented weight vector and augmented design matrix in SVM?
Below is the formulation I got for the SVM when using the classifier equation $w \cdot x + b = 0$.
I want to know why I am not getting the second stationarity condition, i.e. the summation over $i$ from 1 to $n$ of (...
3
votes
2
answers
131
views
In the context of Kernel regression, why do we define the feature map as equal to the Kernel $\varphi(x)=k(\cdot ,x)$?
I have a notational confusion I am trying to clear up. In the context of Kernel regression the following relationship between the kernel and the feature map is defined:
Consider a positive-definite ...
1
vote
0
answers
40
views
Derivation of dual formulation of support vector regression
I'm trying to derive the dual formulation of epsilon-insensitive support vector regression. I think my derivation is correct, but I can't match it up to a result for the dual that I've seen given in ...
1
vote
0
answers
37
views
What is the best way to use Gaussian Processes to approximate highly non-stationary functions?
Gaussian process regression has trouble approximating functions with "kinks". So, what is the most widely used method to deal with this problem? I have found many proposed methods, including ...
3
votes
0
answers
64
views
What are the "tricks" in machine learning? [closed]
I have come across a few different "tricks" in machine learning methodology, which I list below along with my rudimentary understanding of each.
The Kernel Trick:
This is used in Support Vector ...
0
votes
1
answer
32
views
SVM Kernel to compare histograms as input vectors
In lecture 7 of CS229 by Andrew Ng he mentions at the very end a specific Kernel that allows an SVM to "classify" how similar two histograms are, such as the demographics of 2 countries. He ...
0
votes
0
answers
26
views
Using MMD for Feature Selection with Linear Regression: Valid Approach?
I'm using Maximum Mean Discrepancy (MMD) for feature selection (i.e., to select the features that minimize the dissimilarity between the training and testing datasets). I'm aware that MMD introduces ...
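For reference, a minimal sketch of the biased empirical estimator $\widehat{\mathrm{MMD}}^2 = \overline{K}_{xx} + \overline{K}_{yy} - 2\overline{K}_{xy}$ applied per feature (the data, kernel bandwidth, and per-feature scoring loop below are invented, not the asker's pipeline):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd2_biased(X, Y, gamma=1.0):
    """Biased empirical MMD^2 between samples X and Y under an RBF kernel."""
    Kxx = rbf_kernel(X, X, gamma=gamma)
    Kyy = rbf_kernel(Y, Y, gamma=gamma)
    Kxy = rbf_kernel(X, Y, gamma=gamma)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(200, 5)), rng.normal(size=(100, 5))

# Score each feature by the train/test discrepancy of its marginal; smaller = more similar.
scores = [mmd2_biased(X_train[:, [j]], X_test[:, [j]]) for j in range(X_train.shape[1])]
print(scores)
```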
0
votes
0
answers
43
views
Covariance inversion for Gaussian process
Background
Let $x=f(u_x)\in\mathbb{R}$ and let $y=[f(u_y^1)\cdots f(u_y^{N})]\in\mathbb{R}^N$ for some function $f:u \in \mathbb{R}\mapsto \mathbb{R}$.
Given $y$, $u_x$, $u_{y}^1,\dots, u_{y}^{N}$, I ...
2
votes
0
answers
24
views
How to find K in kernel trick?
How does one go about finding the kernel when using the so-called "kernel trick"? Here is an example from Quora:
Simple example: $x = (x_1, x_2, x_3)$; $y = (y_1, y_2, y_3)$. Then for the function f(...
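The truncated example is presumably the familiar quadratic one; a worked reconstruction (mine, not necessarily the Quora post verbatim): take
$$
f(x) = (x_1x_1,\ x_1x_2,\ x_1x_3,\ x_2x_1,\ x_2x_2,\ x_2x_3,\ x_3x_1,\ x_3x_2,\ x_3x_3) \in \mathbb{R}^9,
$$
so that $\langle f(x), f(y)\rangle = \sum_{i,j} x_i x_j y_i y_j = \big(\sum_i x_i y_i\big)^2 = (x^\top y)^2$. The "trick" is that the right-hand side, $K(x, y) = (x^\top y)^2$, can be evaluated in the original 3-dimensional space without ever forming $f$.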
0
votes
0
answers
9
views
Computing Test Loss in Kernel Ridge Regression
In Kernel Ridge regression we have the standard loss function $$L(\beta) = \|Y-K\beta\|_2^2 + \alpha \beta^T K \beta$$
Here, $K$ is the kernel (gram) matrix.
If I compute $\beta$ on a training set, so ...
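One point this usually hinges on: $\alpha\beta^\top K\beta$ is a training-time regularizer, and the test loss is ordinarily just the data-fit term computed with the cross-kernel $K(X_{\text{test}}, X_{\text{train}})$. A minimal sketch (data, $\gamma$, $\alpha$ invented):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(80, 2)), rng.normal(size=(20, 2))
y_tr = np.sin(X_tr[:, 0]) + 0.1 * rng.normal(size=80)
y_te = np.sin(X_te[:, 0])

alpha, gamma = 1e-2, 0.5
K_tr = rbf_kernel(X_tr, X_tr, gamma=gamma)               # training Gram matrix
beta = np.linalg.solve(K_tr + alpha * np.eye(80), y_tr)  # minimizer of the penalized loss

# Test predictions use the cross-kernel, and the reported loss has no penalty term.
K_cross = rbf_kernel(X_te, X_tr, gamma=gamma)
test_mse = np.mean((y_te - K_cross @ beta) ** 2)
print(test_mse)
```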
0
votes
0
answers
14
views
Estimation of bivariate function with one variable being constricted
Suppose the following classical supervised regression setting,
$$y_{i} = f(x_{i}) + \epsilon_{i}, \quad i=1,\cdots,n,$$
where $\epsilon_{i}$ are i.i.d. zero mean Gaussian noise.
The above regression ...
0
votes
0
answers
23
views
Can I find the explicit feature map that generates exponent of a kernel?
Let's say I have a kernel $K$, and another kernel of the form:
$$
K' = e^K
$$
Now I know how to prove $K'$ is a kernel; I can do it using the Taylor expansion of $e^x$ around $0$,
but let's say if I want ...
2
votes
1
answer
42
views
Does the solution to ridge regression still minimize the cost function when lambda is <= 0?
This was a homework problem where I was asked to find explicit expression that minimises the cost function.
I found the solution as :
$\hat{\theta} = (X^TX + \lambda I)^{-1}X^Ty$
Now the problem ...
0
votes
0
answers
10
views
What is the normalized winning frequency in a kernel self-organizing map (SOM)?
In the k-means based kernel SOM, proposed by MacDonald and Fyfe (2000), the update of the mean is based on a soft learning algorithm
$$m_i(t + 1) = m_i(t) + \Lambda[\varphi(x) - m_i(t)]$$
where $\Lambda$ is the normalized ...
0
votes
0
answers
28
views
Theoretical question: why is RBF the 'best' kernel?
I am trying to understand why the RBF kernel is usually used in many research papers doing kernel tricks. To reduce the scope, we can focus on linear regression (thus effectively, increasing the ...
0
votes
0
answers
23
views
normalized dual activation function for neural tangent kernel
Let $\phi$ be an activation function. In this lecture note, the author assumes that the dual activation function, denoted $\check{\phi}$, is normalized such that $\check{\phi}(1)=1$. How can it be ...
2
votes
0
answers
61
views
How is the weight vector calculated when using the kernel trick for ridge regression?
I'm trying to understand how kernelized ridge regression works, and how we manage to first transform, and subsequently learn on, higher-dimensional features without explicitly having to calculate them.
...
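The standard resolution, sketched: the weight vector is never formed explicitly. Writing $\Phi$ for the matrix with rows $\phi(x_i)^\top$ and $K = \Phi\Phi^\top$, the ridge solution can be rewritten as
$$
w = \Phi^\top \alpha, \qquad \alpha = (K + \lambda I)^{-1} y,
$$
so a prediction only needs kernel evaluations, $f(x) = \phi(x)^\top w = \sum_{i=1}^n \alpha_i\, k(x_i, x)$, and $w$ itself stays implicit in the (possibly infinite-dimensional) feature space.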
0
votes
0
answers
58
views
How to use random kitchen sinks for $\sigma \neq 1$?
The RBF kernel is given by
$$
k(x,y) = \exp\left(-\frac{\| x - y \|_2^2}{2 \sigma^2}\right)
$$
where $\sigma$ is the length-scale parameter. I want to use the random kitchen sinks method to create a ...
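For what it's worth, the usual adjustment is that for this kernel Bochner's theorem gives a spectral density $\mathcal{N}(0, \sigma^{-2} I)$, so the random frequencies are simply standard normals divided by $\sigma$. A minimal sketch (the function and parameter names are mine):

```python
import numpy as np

def rff_features(X, n_features=500, sigma=2.0, seed=0):
    """Random Fourier features approximating k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, n_features)) / sigma        # frequencies ~ N(0, sigma^{-2} I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = rff_features(X, sigma=2.0)
K_approx = Z @ Z.T
K_exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / (2 * 2.0 ** 2))
print(np.abs(K_approx - K_exact).max())                     # shrinks as n_features grows
```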
1
vote
0
answers
30
views
RKHS inclusion relationship of the Erf network's NTK
In the referenced paper, it is stated that for ReLU networks, the Reproducing Kernel Hilbert Space (RKHS) of the Neural Tangent Kernels (NTK) remains unchanged regardless of the model's depth. I am ...
8
votes
2
answers
192
views
Under what kernels and/or conditions does $k(x, x) = k(x, X) k(X, X)^{-1} k(X, x)$?
This question is motivated by a question I'm facing in vector-valued kernel methods (also known as Gaussian Processes and co-kriging).
Suppose I have $N$ data $X := \{x_n\}_{n=1}^N$, where each $x_n ...
3
votes
0
answers
38
views
Exchanging integrals with inner products with kernel mean embeddings
I am doing some reading on kernel mean embeddings. In particular I am reading the survey paper by Muandet et al. On page 27 (Section 3.1) the authors begin a gentle introduction to kernel mean ...
3
votes
1
answer
147
views
Dual form of the least squares solution (ridge regression)
I was reading this introductory material and on the 5th page, it describes the dual form of the least-square solution (with ridge regression) as $$A(aI + A^\top A)^{-1} = (aI + AA^\top)^{-1}A$$ for a $...
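The identity follows from a one-line "push-through" argument (a sketch, assuming $a > 0$ so both inverses exist):
$$
(aI + AA^\top)A = aA + AA^\top A = A(aI + A^\top A),
$$
and multiplying on the left by $(aI + AA^\top)^{-1}$ and on the right by $(aI + A^\top A)^{-1}$ gives $A(aI + A^\top A)^{-1} = (aI + AA^\top)^{-1}A$.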
3
votes
0
answers
130
views
Clarifying the difference between various regression methods called "kernel" or "Bayesian"
I want to understand the pairwise relationship between four types of regression: Bayesian Linear Regression, Gaussian Process Regression, Kernel Regression (Nadaraya-Watson), and Kernel Ridge ...
0
votes
0
answers
27
views
Calculating the Orthogonal Distance to Kernel PCA subspace (with a new data)
I am studying Kernel PCA methods and now I'm trying to calculate orthogonal distances (OD) on the feature space. What I've found is, you can calculate ODs with a kernel trick if you are interested in ...
2
votes
1
answer
41
views
Interpreting the formula for Riemannian metric tensor
In Improving support vector machine classifiers by modifying kernel functions, the authors defined the Riemannian metric tensor for a kernel as follows:
$$
\begin{align}
g(\vec{x}) &= \text{det}|g_{ij}...
1
vote
0
answers
31
views
Why is the concept of RKHS useful in kernel ridge regression?
The way I have seen kernel ridge regression introduced is as follows. Given data $(X,Y)$ you want to fit a function $f$ from a RKHS $\mathcal{H}$ to minimise some empirical loss $\sum_i L(f(x_i), y_i)$...
1
vote
0
answers
133
views
Weighted sum of RBF kernels with different length scales
When applying Gaussian Processes to applied problems, the choice of length-scale parameter for the radial basis function (RBF, i.e. Gaussian) kernel makes a big difference. In practice, I have ...
3
votes
2
answers
156
views
Is it enough to prove that the Kernel matrix is positive semidefinite to know that the function is a kernel?
Is it enough to prove that the Kernel matrix is positive semidefinite to know that the function is a kernel? Or is it also necessary to prove that the matrix is symmetric?
1
vote
0
answers
120
views
Geometric intuition of kernel trick
I would like to understand better the geometry underlying the Kernel trick with the Gaussian Kernel. In particular my question is:
How can the Kernel trick be interpreted geometrically, in particular ...
1
vote
1
answer
60
views
How to project kernel PCA?
I have an $m\times n$ matrix $X$. To apply Kernel PCA to my $X$ matrix I need to pass it through a kernel function, $K = \Phi(X)$.
The problem here is that $K$ gets the size $m \times m$. If I'm doing ...
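A minimal sketch of the standard recipe (data, $\gamma$, and the number of components are invented), including the double centering and the projection of a new point:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                   # m x n data matrix, rows are samples
x_new = rng.normal(size=(1, 4))                 # a new point to project
m, n_components, gamma = X.shape[0], 2, 0.5

K = rbf_kernel(X, X, gamma=gamma)               # the m x m matrix the question calls K = Phi(X)
one = np.full((m, m), 1.0 / m)
K_c = K - one @ K - K @ one + one @ K @ one     # centering in feature space

eigvals, eigvecs = np.linalg.eigh(K_c)
idx = np.argsort(eigvals)[::-1][:n_components]
alphas = eigvecs[:, idx] / np.sqrt(eigvals[idx])  # scale so projections are onto unit-norm components

# Center the new point's kernel row against the training statistics, then project.
k_new = rbf_kernel(x_new, X, gamma=gamma)       # shape (1, m)
k_new_c = k_new - k_new.mean() - K.mean(axis=0) + K.mean()
print(k_new_c @ alphas)                         # coordinates in the kernel PCA subspace
```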
0
votes
1
answer
29
views
Result after applying kernel trick
I understand that when the data is not linearly separable, it has to be transformed into a higher-dimensional space to make it linearly separable. Applying the kernel trick can perform this without even computing ...
1
vote
0
answers
38
views
Is it possible to use the RBF sampler to construct a kernel and use it for prediction at a new data point?
I would like to construct a kernel from very large samples, which makes it impossible to construct the $N \times N$ kernel matrix. I can use the RBF sampler (random Fourier features) to make the dimension more ...
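In broad terms yes: the usual pattern is to fit the sampler once and reuse its transform on new points, after which any linear model in the random-feature space approximates the kernel method. A sketch with scikit-learn's RBFSampler (sizes and hyperparameters invented):

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50_000, 10))          # too many samples for an N x N kernel matrix
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=50_000)
X_new = rng.normal(size=(5, 10))

sampler = RBFSampler(gamma=0.5, n_components=300, random_state=0)
Z_train = sampler.fit_transform(X_train)         # fit the random features on the training data
model = Ridge(alpha=1.0).fit(Z_train, y_train)   # linear model in the approximate feature space

Z_new = sampler.transform(X_new)                 # reuse the same fitted features for new points
print(model.predict(Z_new))
```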
1
vote
0
answers
67
views
Why do we need $a:\mathcal{X} \to \mathbb{R}$ to be positive here?
This is exercise 6.1 from the book Foundations of Machine Learning:
Let $K: \mathcal{X}\times \mathcal{X} \to \mathbb{R}$ be a PDS kernel, and let $a:
\mathcal{X}\to \mathbb{R}$ be a positive ...
2
votes
0
answers
99
views
Rescaling matrix W in Random Fourier Features
I came across this beautiful idea of Random Fourier Features by Rahimi and Recht while working on optimising my GP model using Predictive Entropy Search.
I understand the overall idea of approximating ...
5
votes
1
answer
1k
views
Why does a valid Kernel only have to be positive semi-definite instead of positive definite?
I'm currently concerned with the topic of Gaussian Processes. To compute the covariance matrix of the conditional distribution, we have to compute $(K_{XX})^{-1}$, where $K_{XX}$ is a matrix of a ...
3
votes
0
answers
97
views
Understanding the ridge leverage scores sampling from an arXiv paper
I am trying to read the arXiv paper Distributed Adaptive Sampling for Kernel Matrix Approximation, Calandriello et al. 2017. I found a code implementation where they compute ridge leverage scores ...
1
vote
1
answer
673
views
Prove that 2nd order polynomial kernel is positive semi-definite
I'm trying to prove that the 2nd order polynomial kernel, $K(x_i, x_j) = (x_i^Tx_j + 1)^2$ is a valid kernel which satisfies the following conditions:
K is symmetric, that is, $K(x_i, x_j) = K(x_j, ...
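One direct route, sketched: expand the square and read off an explicit feature map, which yields symmetry and positive semidefiniteness at once. For $x, y \in \mathbb{R}^d$,
$$
(x^\top y + 1)^2 = \sum_{i,j} (x_i x_j)(y_i y_j) + \sum_i (\sqrt{2}\,x_i)(\sqrt{2}\,y_i) + 1 = \langle \phi(x), \phi(y)\rangle,
$$
with $\phi(x) = \big(\{x_i x_j\}_{i,j},\ \{\sqrt{2}\,x_i\}_i,\ 1\big)$. Any Gram matrix is then $K = \Phi\Phi^\top$, so $c^\top K c = \|\Phi^\top c\|^2 \ge 0$ for every $c$.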
1
vote
0
answers
305
views
How to properly implement a Matérn kernel function in R?
This definition is excerpted from Wikipedia: The Matérn covariance between measurements taken at two points separated by d distance units is given by
$$C_\nu(d) = \sigma^2\frac{2^{1-\nu}}{\Gamma(\nu)}\...
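The question asks specifically about R, but as a language-agnostic sketch of the same formula (here in Python with SciPy; the function and parameter names are mine), the main implementation wrinkle is $d = 0$, where the Bessel term has a removable singularity and $C_\nu(0) = \sigma^2$:

```python
import numpy as np
from scipy.special import gamma, kv   # kv: modified Bessel function of the second kind

def matern_cov(d, nu=1.5, sigma2=1.0, rho=1.0):
    """C_nu(d) = sigma^2 * 2^(1-nu)/Gamma(nu) * (sqrt(2 nu) d / rho)^nu * K_nu(sqrt(2 nu) d / rho)."""
    d = np.atleast_1d(np.asarray(d, dtype=float))
    out = np.full(d.shape, sigma2)                # limit as d -> 0 is sigma^2
    nz = d > 0
    s = np.sqrt(2.0 * nu) * d[nz] / rho
    out[nz] = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * s ** nu * kv(nu, s)
    return out

print(matern_cov([0.0, 0.5, 1.0, 2.0], nu=2.5))   # decays smoothly from sigma^2
```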
1
vote
0
answers
15
views
Given a psd matrix $Q$ and a kernel function $f(y_i, y_j)$, how do I find $Y \in \mathbb{R}^{n \times d}$ that best approximates $Q$? [duplicate]
The question is basically the title. I have a matrix $Q$ that I know is positive semi-definite. I now want to find the $Y$ that approximates this matrix under some kernel function $f(y_i, y_j)$. I ...
1
vote
1
answer
314
views
Non-stationary Random Fourier Features
Random Fourier Features (RFFs) were introduced by A. Rahimi and B. Recht in their 2007 publication Random Features for Large-Scale Kernel Machines. RFFs are based on Bochner's theorem, which applies ...
2
votes
0
answers
41
views
Identifiability of models on RKHS
I have just started learning about using reproducing kernel Hilbert spaces for regularisation in machine learning. I am looking for some examples of reproducing kernels that produce identifiable and ...
0
votes
0
answers
65
views
Feature maps of the chi-squared kernel
The additive chi-squared kernel for histograms is defined as
$$K(x,y)= \sum_{i=1}^n \frac{2x_i y_i}{x_i + y_i}$$ Is this kernel positive definite on histograms? And if so, is there a known expression ...
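On the second part: scikit-learn ships a sampled approximation to an explicit feature map for this kernel (the Vedaldi and Zisserman construction). A minimal sketch comparing it to the kernel above (data invented; inputs must be non-negative):

```python
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler

rng = np.random.default_rng(0)
X = rng.random((6, 8))                            # histogram-like rows: non-negative entries
X /= X.sum(axis=1, keepdims=True)

# Exact kernel from the question: K(x, y) = sum_i 2 x_i y_i / (x_i + y_i).
num = 2.0 * X[:, None, :] * X[None, :, :]
den = X[:, None, :] + X[None, :, :]
K_exact = np.divide(num, den, out=np.zeros_like(num), where=den > 0).sum(-1)

# Approximate explicit feature map; the inner product of the features approximates K.
Z = AdditiveChi2Sampler(sample_steps=3).fit_transform(X)
K_approx = Z @ Z.T
print(np.abs(K_exact - K_approx).max())
```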
0
votes
0
answers
85
views
Method of evaluating the feature map of a polynomial kernel feature mapping
I'm attempting to implement an adaptive kernel Kalman filter following this paper https://arxiv.org/abs/2203.08300, but I'm struggling to find a method of evaluating the feature mapping for a ...
3
votes
0
answers
68
views
Is the transformation implied by a positive-type kernel well-defined?
I’ve been trying to get my head around the particularity of the Hilbert space that a positive-type (equiv. positive definite) kernel represents an inner product on, and was hoping for some help in ...
2
votes
0
answers
25
views
In Gaussian Process Regression, what kinds of information can you *not* put in the kernel as opposed to the mean?
For example, suppose you want to learn some structure for the mean and then you also have some kernel. Is it sometimes not possible to put most things in the kernel? For example, consider ...