
Questions tagged [kernel-trick]

Kernel methods are used in machine learning to generalize linear techniques to nonlinear situations, especially SVMs, PCA, and GPs. Not to be confused with [kernel-smoothing], for kernel density estimation (KDE) and kernel regression.

0 votes
0 answers
11 views

When running a Support Vector Machine, how do I formulate the linear transformation that flips the decision hyperplane in the non-augmented dimension?

We know that when running a support vector machine, we actually use the "kernel trick" to compute the decision hyperplane (boundary) as if we do so in the kernel-augmented dimension, but not ...
Wonjae Oh
0 votes
0 answers
92 views

Proof: The Gaussian Kernel as an Inner Product in Infinite-Dimensional Feature Space

Prove that the Gaussian kernel on $\mathbb{R}^d$ for a positive integer $d$: \begin{equation} k(x, x') = \exp(-\gamma \|x - x'\|^2) \tag{1} \end{equation} for $\gamma > 0$, can be expressed as ...
Lifeni • 313
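For reference, one standard route to this result (a sketch, not the asker's own derivation) factors the kernel and Taylor-expands the cross term:

$$
k(x, x') = e^{-\gamma\|x\|^2}\, e^{-\gamma\|x'\|^2}\, e^{2\gamma\, x^\top x'}
= e^{-\gamma\|x\|^2}\, e^{-\gamma\|x'\|^2} \sum_{n=0}^{\infty} \frac{(2\gamma)^n}{n!}\,(x^\top x')^n .
$$

Each $(x^\top x')^n$ is the inner product of the degree-$n$ monomial feature vectors of $x$ and $x'$, so the sum is an inner product of two infinite feature vectors whose coordinates are scaled monomials of all degrees, each multiplied by the factor $e^{-\gamma\|\cdot\|^2}$; hence $k(x, x') = \langle \varphi(x), \varphi(x') \rangle$ for an infinite-dimensional feature map $\varphi$.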
0 votes
0 answers
152 views

Prove that a matrix constructed from a Gaussian RBF is PSD

I have a radial basis function $k(x, y) = \exp(-{(x-y)}^T M {(x-y)})$ where $M$ is a symmetric PSD matrix. I know that $k(\cdot)$ is a kernel itself: Prove that multiplication with positive ...
BiriBora • 101
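A short argument often used here (a sketch, assuming $M$ is symmetric PSD as stated): factor $M$ and reduce to the standard Gaussian kernel.

$$
M = L^\top L \quad \big(\text{e.g. } L = \Lambda^{1/2} U^\top \text{ from } M = U \Lambda U^\top\big)
\;\Rightarrow\;
(x-y)^\top M (x-y) = \|Lx - Ly\|^2 ,
$$

so $k(x, y) = \exp(-\|Lx - Ly\|^2)$ is the ordinary Gaussian kernel evaluated at the transformed points $Lx$ and $Ly$. Composing a kernel with a fixed map $x \mapsto Lx$ preserves positive semidefiniteness, so every Gram matrix built from $k$ is PSD.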
0 votes
0 answers
7 views

Has natural language processing/generation been attempted with kernel methods?

I am curious whether there has been much success in the past with applying kernel methods to natural language processing/generation. Rather than just numerical evidence, I'm particularly ...
N A McMahon
3 votes
0 answers
34 views

Convergence of kernel mean embeddings

Let $k(\cdot,\cdot)$ be a bounded kernel and $\mathcal{H}$ its associated RKHS. Define the kernel mean embedding $\mu=\int k(\cdot,x) \, dP_X(x)$ and let $\hat{\mu}=\frac{1}{n}\sum k(x_i,\cdot)$ be ...
xcesc • 122
0 votes
0 answers
19 views

How to get the second stationarity condition corresponding to the intercept when using the augmented weight vector and augmented design matrix in an SVM?

Below is the formulation I got for the SVM when using the classifier equation $w \cdot x + b = 0$. I want to know why I am not getting the second stationarity condition, i.e. the summation over $i$ from 1 to $n$ of (...
Shri • 23
3 votes
2 answers
131 views

In the context of kernel regression, why do we define the feature map as equal to the kernel, $\varphi(x)=k(\cdot ,x)$?

I have a notational confusion I am trying to clear up. In the context of Kernel regression the following relationship between the kernel and the feature map is defined: Consider a positive-definite ...
Monolite • 1,465
1 vote
0 answers
40 views

Derivation of dual formulation of support vector regression

I'm trying to derive the dual formulation of epsilon-insensitive support vector regression. I think my derivation is correct, but I can't match it up to a result for the dual that I've seen given in ...
oweydd • 225
1 vote
0 answers
37 views

What is the best way to use Gaussian Processes to approximate highly non-stationary functions?

Gaussian process regression has trouble approximating functions with "kinks". So, what is the most widely used method to deal with this problem? I have found many proposed methods, including ...
Dan Zhao
3 votes
0 answers
64 views

What are the "tricks" in machine learning? [closed]

I have come across a few different "tricks" in machine learning methodology, which I list below along with my rudimentary understanding. The Kernel Trick: This is used in Support Vector ...
camhsdoc • 409
0 votes
1 answer
32 views

SVM Kernel to compare histograms as input vectors

In lecture 7 of CS229, Andrew Ng mentions at the very end a specific kernel that allows an SVM to "classify" how similar two histograms are, such as the demographics of two countries. He ...
yyyLLL • 33
0 votes
0 answers
26 views

Using MMD for Feature Selection with Linear Regression: Valid Approach?

I'm using Maximum Mean Discrepancy (MMD) for feature selection (i.e., to select the features that minimize the dissimilarity between the training and testing datasets). I'm aware that MMD introduces ...
Adham Enaya
0 votes
0 answers
43 views

Covariance inversion for Gaussian process

Background Let $x=f(u_x)\in\mathbb{R}$ and let $y=[f(u_y^1)\cdots f(u_y^{N})]\in\mathbb{R}^N$ for some function $f:u \in \mathbb{R}\mapsto \mathbb{R}$. Given $y$, $u_x$, $u_{y}^1,\dots, u_{y}^{N}$, I ...
matteogost
2 votes
0 answers
24 views

How to find $K$ in the kernel trick?

How does one go about finding the kernel when using the so-called "kernel trick"? Here is an example from Quora: Simple Example: x = (x1, x2, x3); y = (y1, y2, y3). Then for the function f(...
Hank • 21
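For reference, a minimal numeric check of the example that Quora answer usually describes — the degree-2 polynomial kernel $K(x,y) = (x^\top y)^2$, whose explicit feature map lists all pairwise products $x_i x_j$. The vectors, values, and function name below are illustrative, not taken from the question.

```python
import numpy as np

def feature_map(x):
    # Explicit feature map of the degree-2 homogeneous polynomial kernel on R^3:
    # all 9 pairwise products x_i * x_j.
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

explicit = feature_map(x) @ feature_map(y)  # inner product in the 9-dim feature space
via_kernel = (x @ y) ** 2                   # same value, computed in the original 3-dim space

print(explicit, via_kernel)                 # both print 1024.0
```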
0 votes
0 answers
9 views

Computing Test Loss in Kernel Ridge Regression

In Kernel Ridge regression we have the standard loss function $$L(\beta) = \|Y-K\beta\|_2^2 + \alpha \beta^T K \beta$$ Here, $K$ is the kernel (gram) matrix. If I compute $\beta$ on a training set, so ...
WeakLearner • 1,531
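A minimal sketch of the usual convention (an assumption on my part, not necessarily what the asker's course uses): fit $\beta = (K_{\text{train}} + \alpha I)^{-1} y_{\text{train}}$, predict with the cross kernel matrix $K(X_{\text{test}}, X_{\text{train}})$, and report the test loss as plain squared error without the penalty term. The data and hyperparameters below are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel matrix between the rows of A and the rows of B.
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(50, 2)), rng.normal(size=(20, 2))
y_tr = np.sin(X_tr[:, 0]) + 0.1 * rng.normal(size=50)
y_te = np.sin(X_te[:, 0])

alpha = 0.1
K_tr = rbf_kernel(X_tr, X_tr)
beta = np.linalg.solve(K_tr + alpha * np.eye(50), y_tr)  # minimiser of the training objective

y_hat = rbf_kernel(X_te, X_tr) @ beta      # predictions at the test points
test_mse = np.mean((y_te - y_hat) ** 2)    # test loss reported without the regularisation term
print(test_mse)
```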
0 votes
0 answers
14 views

Estimation of bivariate function with one variable being constricted

Suppose the following classical supervised regression setting, $$y_{i} = f(x_{i}) + \epsilon_{i}, \quad i=1,\cdots,n,$$ where $\epsilon_{i}$ are i.i.d. zero mean Gaussian noise. The above regression ...
DoubleL • 11
0 votes
0 answers
23 views

Can I find the explicit feature map that generates the exponential of a kernel?

Let's say I have a kernel $K$, and another kernel of the form: $$ K' = e^K $$ Now, I know how to prove $K'$ is a kernel; I can do it using the Taylor expansion of $e^x$ around $0$. But let's say I want ...
aroma • 123
2 votes
1 answer
42 views

Does the solution to ridge regression still minimize the cost function when $\lambda \le 0$?

This was a homework problem where I was asked to find the explicit expression that minimises the cost function. I found the solution to be $\hat{\theta} = (X^TX + \lambda I)^{-1}X^Ty$. Now the problem ...
aroma • 123
0 votes
0 answers
10 views

What is the normalized winning frequency in a kernel self-organizing map (SOM)?

In the k-means-based kernel SOM proposed by MacDonald and Fyfe (2000), the update of the mean is based on a soft learning algorithm, $m_i(t + 1) = m_i(t) + \Lambda[\varphi(x) - m_i(t)]$, where $\Lambda$ is the normalized ...
Anshuman Jayaprakash
0 votes
0 answers
28 views

Theoretical question: why is RBF the 'best' kernel?

I am trying to understand why the RBF kernel is usually used in many research papers doing kernel tricks. To reduce the scope, we can focus on linear regression (thus effectively, increasing the ...
cgo • 9,317
0 votes
0 answers
23 views

normalized dual activation function for neural tangent kernel

Let $\phi$ be an activation function. In this lecture note, The author assumes that the dual activation function, denoted as $\check{\phi}$ is normalized such that $\check{\phi}(1)=1$. How can it be ...
MohammadJavad Vaez
2 votes
0 answers
61 views

How is the weight vector calculated when using the kernel trick for ridge regression?

I'm trying to understand how kernelized ridge regression works, and how we manage to first transform, and subsequently learn on, higher-dimensional features without explicitly having to calculate them. ...
pyrrosk • 33
0 votes
0 answers
58 views

How to use random kitchen sinks for $\sigma \neq 1$?

The RBF kernel is given by $$ k(x,y) = \exp\left(-\frac{\| x - y \|_2^2}{2 \sigma^2}\right) $$ where $\sigma$ is the length-scale parameter. I want to use the random kitchen sinks method to create a ...
user336650
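A sketch of the standard adjustment (my reading of Rahimi & Recht, with illustrative sizes and values): for $k(x,y)=\exp(-\|x-y\|^2/(2\sigma^2))$ the spectral density is $\mathcal{N}(0, \sigma^{-2} I)$, so draw standard normal frequencies and divide them by $\sigma$.

```python
import numpy as np

def rff_features(X, n_features=2000, sigma=2.0, seed=0):
    # Random Fourier features for exp(-||x - y||^2 / (2 sigma^2)):
    # Bochner's theorem gives spectral density N(0, sigma^{-2} I),
    # i.e. standard normal frequencies scaled by 1/sigma.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(size=(d, n_features)) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                     # illustrative data
Z = rff_features(X, sigma=2.0)
K_approx = Z @ Z.T                                # approximate Gram matrix
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-d2 / (2 * 2.0**2))              # exact Gram matrix
print(np.abs(K_approx - K_exact).max())           # small for large n_features
```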
1 vote
0 answers
30 views

RKHS inclusion relationship of the Erf network's NTK

In the referenced paper, it is stated that for ReLU networks, the Reproducing Kernel Hilbert Space (RKHS) of the Neural Tangent Kernels (NTK) remains unchanged regardless of the model's depth. I am ...
user376649
8 votes
2 answers
192 views

Under what kernels and/or conditions does $k(x, x) = k(x, X) k(X, X)^{-1} k(X, x)$?

This question is motivated by a question I'm facing in vector-valued kernel methods (also known as Gaussian Processes and co-kriging). Suppose I have $N$ data $X := \{x_n\}_{n=1}^N$, where each $x_n ...
Rylan Schaeffer
3 votes
0 answers
38 views

Exchanging integrals with inner products with kernel mean embeddings

I am doing some reading on kernel mean embeddings. In particular I am reading the survey paper by Muandet et al. On page 27 (Section 3.1) the authors begin a gentle introduction to kernel mean ...
Nick Bishop
3 votes
1 answer
147 views

Dual form of the least-squares solution (ridge regression)

I was reading this introductory material and on the 5th page, it describes the dual form of the least-square solution (with ridge regression) as $$A(aI + A^\top A)^{-1} = (aI + AA^\top)^{-1}A$$ for a $...
Alemu • 125
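For reference, the identity in that material follows from a one-line "push-through" argument (sketched here, assuming $a > 0$ so that both resolvents are invertible):

$$
A\,(aI + A^\top A) = aA + AA^\top A = (aI + AA^\top)\,A
\;\Longrightarrow\;
(aI + AA^\top)^{-1} A = A\,(aI + A^\top A)^{-1},
$$

obtained by multiplying the first equality on the left by $(aI + AA^\top)^{-1}$ and on the right by $(aI + A^\top A)^{-1}$.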
3 votes
0 answers
130 views

Clarifying the difference between various regression methods called "kernel" or "Bayesian"

I want to understand the pairwise relationship between four types of regression: Bayesian Linear Regression, Gaussian Process Regression, Kernel Regression (Nadaraya-Watson), and Kernel Ridge ...
Tanishq Kumar
0 votes
0 answers
27 views

Calculating the Orthogonal Distance to Kernel PCA subspace (with new data)

I am studying Kernel PCA methods and now I'm trying to calculate orthogonal distances (OD) in the feature space. What I've found is that you can calculate ODs with the kernel trick if you are interested in ...
cccanhakan
2 votes
1 answer
41 views

Interpreting the formula for Riemannian metric tensor

In Improving support vector machine classifiers by modifying kernel functions, the authors define the Riemannian metric tensor for a kernel as follows: $$ \begin{align} g(\vec{x}) &= \text{det}|g_{ij}...
Omar Shehab
1 vote
0 answers
31 views

Why is the concept of RKHS useful in kernel ridge regression?

The way I have seen kernel ridge regression introduced is as follows. Given data $(X,Y)$ you want to fit a function $f$ from a RKHS $\mathcal{H}$ to minimise some empirical loss $\sum_i L(f(x_i), y_i)$...
Danny Duberstein
1 vote
0 answers
133 views

Weighted sum of RBF kernels with different length scales

When applying Gaussian Processes to applied problems, the choice of the length-scale parameter for the radial basis function (RBF, i.e. Gaussian) kernel makes a big difference. In practice, I have ...
Betterthan Kwora
3 votes
2 answers
156 views

Is it enough to prove that the Kernel matrix is positive semidefinite to know that the function is a kernel?

Is it enough to prove that the Kernel matrix is positive semidefinite to know that the function is a kernel? Or is it also necessary to prove that the matrix is symmetric?
winnie • 31
1 vote
0 answers
120 views

Geometric intuition of kernel trick

I would like to better understand the geometry underlying the kernel trick with the Gaussian kernel. In particular, my question is: how can the kernel trick be interpreted geometrically, in particular ...
Thomas • 952
1 vote
1 answer
60 views

How to project kernel PCA?

I have an $m\times n$ matrix $X$. To apply kernel PCA to my $X$ matrix, I need to transform it with a function, $K = \Phi(X)$. The problem here is that $K$ gets the size $m \times m$. If I'm doing ...
euraad • 425
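A minimal sketch of how this is usually handled in practice (using scikit-learn's KernelPCA; the shapes and hyperparameters below are illustrative): the $m \times m$ kernel matrix is only an intermediate object, and both the training data and new points get projected onto a chosen number of components.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # m = 200 samples, n = 10 features (illustrative)
X_new = rng.normal(size=(5, 10))     # unseen points to project later

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
X_proj = kpca.fit_transform(X)       # training projections, shape (200, 2)
X_new_proj = kpca.transform(X_new)   # new points projected onto the same components
print(X_proj.shape, X_new_proj.shape)
```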
0 votes
1 answer
29 views

Result after applying kernel trick

I understand that when the data is not linearly separable, it has to be transformed into a higher-dimensional space to make it linearly separable. Applying the kernel trick can do this without even computing ...
mainak mukherjee
1 vote
0 answers
38 views

Is it possible to use the RBF sampler to construct a kernel and use it for prediction at new data points?

I would like to construct a kernel from a very large sample, which makes it impossible to construct the $N \times N$ kernel matrix. I can use the RBF sampler (random Fourier features) to make the dimension more ...
W Jin • 11
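A sketch of one common workflow (the scikit-learn names are real; the sizes and hyperparameters are illustrative): fit RBFSampler once, train a linear model on the mapped training features, and reuse the same fitted map for new points, so no $N \times N$ kernel matrix is ever formed.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 5))      # illustrative "large" training set
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=len(X_train))
X_new = rng.normal(size=(5, 5))             # new points seen only at prediction time

rff = RBFSampler(gamma=0.5, n_components=500, random_state=0)
Z_train = rff.fit_transform(X_train)        # random feature map fitted once
model = Ridge(alpha=1.0).fit(Z_train, y_train)

Z_new = rff.transform(X_new)                # same map applied to unseen points
print(model.predict(Z_new))
```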
1 vote
0 answers
67 views

Why do we need $a:\mathcal{X} \to \mathbb{R}$ to be positive here?

This is exercise 6.1 from the book Foundations of Machine Learning: Let $K: \mathcal{X}\times \mathcal{X} \to \mathbb{R}$ be a PDS kernel, and let $a: \mathcal{X}\to \mathbb{R}$ be a positive ...
George Giapitzakis
2 votes
0 answers
99 views

Rescaling matrix W in Random Fourier Features

I came across this beautiful idea of Random Fourier Features by Rahimi and Recht while working on optimising my GP model using Predictive Entropy Search. I understand the overall idea of approximating ...
Ann • 43
5 votes
1 answer
1k views

Why does a valid Kernel only have to be positive semi-definite instead of positive definite?

I'm currently concerned with the topic of Gaussian Processes. To compute the covariance matrix of the conditional distribution, we have to compute $(K_{XX})^{-1}$, where $K_{XX}$ is a matrix of a ...
rodeo • 53
3 votes
0 answers
97 views

Understanding the ridge leverage scores sampling from an arXiv paper

I am trying to read the arXiv paper Distributed Adaptive Sampling for Kernel Matrix Approximation (Calandriello et al., 2017). I have a code implementation where they compute ridge leverage scores ...
Emon Hossain
1 vote
1 answer
673 views

Prove that 2nd order polynomial kernel is positive semi-definite

I'm trying to prove that the 2nd order polynomial kernel, $K(x_i, x_j) = (x_i^Tx_j + 1)^2$ is a valid kernel which satisfies the following conditions: K is symmetric, that is, $K(x_i, x_j) = K(x_j, ...
Muhteva • 113
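One way to finish this argument (sketched here; not the only route): expand the square and read off an explicit feature map, which gives both symmetry and positive semidefiniteness at once.

$$
(x^\top y + 1)^2 = \sum_{i,j} x_i x_j\, y_i y_j + 2\sum_i x_i y_i + 1
= \big\langle \varphi(x), \varphi(y) \big\rangle,
\qquad
\varphi(x) = \big(\, x_i x_j \ (\forall i,j),\ \sqrt{2}\, x_i \ (\forall i),\ 1 \,\big).
$$

Any Gram matrix of $K$ is then the Gram matrix of the vectors $\varphi(x_1), \dots, \varphi(x_n)$, which is automatically symmetric and PSD since $c^\top G c = \|\sum_i c_i \varphi(x_i)\|^2 \ge 0$.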
1 vote
0 answers
305 views

How to properly implement a Matérn kernel function in R?

This definition is excerpted from Wikipedia: The Matérn covariance between measurements taken at two points separated by d distance units is given by $$C_\nu(d) = \sigma^2\frac{2^{1-\nu}}{\Gamma(\nu)}\...
Miles N. • 184
1 vote
0 answers
15 views

Given a psd matrix $Q$ and a kernel function $f(y_i, y_j)$, how do I find $Y \in \mathbb{R}^{n \times d}$ that best approximates $Q$? [duplicate]

The question is basically the title. I have a matrix $Q$ that I know is positive semi-definite. I now want to find the $Y$ that approximates this matrix under some kernel function $f(y_i, y_j)$. I ...
Andrew Draganov
1 vote
1 answer
314 views

Non-stationary Random Fourier Features

Random Fourier Features (RFFs) were introduced by A. Rahimi and B. Recht in their 2007 publication Random Features for Large-Scale Kernel Machines. RFFs are based on Bochner's theorem, which applies ...
LoveRKHS
2 votes
0 answers
41 views

Identifiability of models on RKHS

I have just started learning about using reproducing kernel Hilbert spaces for regularisation in machine learning. I am looking for some examples of reproducing kernels that produce identifiable and ...
Codie • 51
0 votes
0 answers
65 views

Feature maps of the chi-squared kernel

The additive chi-squared kernel for histograms is defined as $$K(x,y)= \sum_{i=1}^n \frac{2x_i y_i}{x_i + y_i}$$ Is this kernel positive definite on histograms? And if so, is there a known expression ...
Claudio Moneo
0 votes
0 answers
85 views

Method of evaluating the feature map of a polynomial kernel feature mapping

I'm attempting to implement an adaptive kernel Kalman filter following this paper https://arxiv.org/abs/2203.08300, but I'm struggling to find a method of evaluating the feature mapping for a ...
esatemporis
3 votes
0 answers
68 views

Is the transformation implied by a positive-type kernel well-defined?

I’ve been trying to get my head around the particularity of the Hilbert space that a positive-type (equiv. positive definite) kernel represents an inner product on, and was hoping for some help in ...
demim00nde
2 votes
0 answers
25 views

In Gaussian Process Regression, what kinds of information can you *not* put in the kernel as opposed to the mean?

For example, suppose you want to learn some structure for the mean and then you also have some kernel. Is it sometimes not possible to put most things in the kernel? For example, consider ...
safetyduck
