I am trying to understand why the RBF kernel is the default choice in so many research papers that use the kernel trick. To narrow the scope, we can focus on linear regression, where kernelizing effectively increases the degree of the implicit polynomial features, reducing bias and underfitting.
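To make the "increasing the degree of the polynomial" intuition concrete: expanding $\|x-y\|^2 = \|x\|^2 - 2x^\top y + \|y\|^2$ and Taylor-expanding the cross term shows that the RBF kernel contains polynomial features of every degree,

$$k(x,y)=\exp\!\left(-\frac{\|x-y\|^2}{2\sigma^2}\right)=\exp\!\left(-\frac{\|x\|^2}{2\sigma^2}\right)\exp\!\left(-\frac{\|y\|^2}{2\sigma^2}\right)\sum_{n=0}^{\infty}\frac{(x^\top y)^n}{\sigma^{2n}\,n!},$$

i.e., the implicit feature map is infinite-dimensional. What I am after is a theoretical argument for why this particular kernel is preferred over other rich kernels.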
Can someone point me in the right direction, i.e., to a place where such an explanation is given?
I would appreciate a theoretical (mathematical) discussion, as opposed to a computational comparison of different kernels and their resulting loss values. Even a hand-wavy explanation backed by solid mathematics would be sufficient.