Author: Eiko

Time: 2025-02-24 11:01:21 - 2025-02-24 11:01:21 (UTC)

Equivalent Definitions of RKHS

  1. Given \(k:X\times X\to \mathbb{R}\) positive definite,

    \[\mathcal{H}_K = \overline{\mathrm{span}\{ k(x,\cdot) : x\in X\}}\]

  2. \(\mathcal{H}_K\) is characterized by the following two properties:

    1. \(k(x,\cdot)\in \mathcal{H}_K\)

    2. \(\langle f, k(x,\cdot)\rangle_{\mathcal{H}_K} = f(x)\) for all \(f\in \mathcal{H}_K,\ x\in X\).

  3. An RKHS is a Hilbert space of functions on \(X\), \(\mathcal{H}_K\subseteq\{f:X\to \mathbb{R}\}\), such that every evaluation map \(\delta_x:\mathcal{H}_K\to \mathbb{R}\), \(f\mapsto f(x)\), is continuous (equivalently, bounded).
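
As a quick check of how definitions 1 and 2 fit together (not spelled out in the original notes): take a finite combination \(f=\sum_j \alpha_j k(x_j,\cdot)\) in the span of definition 1. Using \(\langle k(x_j,\cdot), k(x,\cdot)\rangle_{\mathcal{H}_K} = k(x_j,x)\), which is how the inner product on the span is defined, the reproducing property of definition 2 follows directly:

\[\langle f, k(x,\cdot)\rangle_{\mathcal{H}_K} = \sum_j \alpha_j \langle k(x_j,\cdot), k(x,\cdot)\rangle_{\mathcal{H}_K} = \sum_j \alpha_j k(x_j,x) = f(x).\]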

Relation With Sobolev Spaces

Let

\[H^s(\mathbb{R}^d) = \left\{ f\in L^2(\mathbb{R}^d) : \|f\|_{H^s}<\infty \right\}\]

where

\[\|f\|_{H^s}^2 = \sum_{|\alpha|\leq s} \int_{\mathbb{R}^d} |\partial^\alpha f(x)|^2 dx\]

Remarks

  • In one dimension, \(H^1\) already has continuous point evaluations, making it an RKHS (see the worked example after this list). In higher dimensions, however, controlling only first-order derivatives is no longer enough.

    \(H^s(\mathbb{R}^d)\) is an RKHS iff \(s>\frac{d}{2}\) (ref: Sobolev embedding theorem).

  • If we define a more general notion of derivative (weak derivatives, etc.), then every RKHS can be viewed as a kind of generalized Sobolev space.
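
A worked example for the first remark (a standard fact, not derived in these notes): on \(\mathbb{R}\), with the inner product \(\langle f,g\rangle_{H^1} = \int_{\mathbb{R}} (fg + f'g')\,dx\), the space \(H^1(\mathbb{R})\) is the RKHS of the kernel

\[k(x,y) = \tfrac{1}{2}\,e^{-|x-y|},\]

since \(u = k(x,\cdot)\) solves \(-u'' + u = \delta_x\), and integrating by parts gives \(\langle f, k(x,\cdot)\rangle_{H^1} = \int_{\mathbb{R}} f\,(-u''+u)\,dy = f(x)\).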

Kernel Ridge Regression

Given data as a finite subset \(\{(x_i, y_i)\}\subset X\times \mathbb{R}\), we want to find a function \(f:X\to \mathbb{R}\) in the space \(\mathcal{H}_K\) fitting the data, \(f(x_i)\approx y_i\).

Problem. Fix \(k:X^2\to\mathbb{R}\), minimize \[\mathbb{E}(f(x_i)-y_i)^2 + \lambda \|f\|_{\mathcal{H}_K}^2, \quad f\in \mathcal{H}_K,\]

where \(\lambda\ge 0\) is the regularization parameter and \(\mathbb{E}\) denotes the (unnormalized) sum over the data points \((x_i, y_i)\).
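
As a small numerical sketch of this objective (the Gaussian kernel and synthetic data below are illustrative assumptions, and writing \(f=\sum_j \alpha_j k(x_j,\cdot)\) anticipates the representer theorem of the next section): the data term uses \(f(x_i) = (K\alpha)_i\) and the RKHS norm is \(\|f\|_{\mathcal{H}_K}^2 = \alpha^\top K \alpha\).

```python
import numpy as np

def kernel(x, y, gamma=10.0):
    # Gaussian (RBF) kernel -- an illustrative positive-definite kernel.
    return np.exp(-gamma * (x - y) ** 2)

def krr_objective(alpha, X, y, lam):
    """Objective sum_i (f(x_i) - y_i)^2 + lam * ||f||_{H_K}^2
    for f = sum_j alpha_j k(x_j, .)."""
    K = kernel(X[:, None], X[None, :])   # Gram matrix K_ij = k(x_i, x_j)
    residual = K @ alpha - y             # f(x_i) - y_i
    return residual @ residual + lam * (alpha @ K @ alpha)

# Example: synthetic data and a random coefficient vector.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=10)
y = np.sin(2 * np.pi * X)
print(krr_objective(rng.normal(size=10), X, y, lam=1e-2))
```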

Representer Theorem

If \(f^*\) minimizes the above problem for \(\lambda>0\), then \(f^*\) must be a finite linear combination of the kernel functions \(k(x_i,\cdot)\) for \(x_i\) in the data set.

To prove this theorem, we use the orthogonal decomposition \(f = g + h \in \mathcal{U}_D\oplus \mathcal{U}_D^\perp\), where \(\mathcal{U}_D\) is the finite-dimensional subspace spanned by the kernel functions \(k(x_i,\cdot)\) at the data points. Since \(h\perp k(x_i,\cdot)\), the reproducing property gives \(h(x_i) = \langle h, k(x_i,\cdot)\rangle_{\mathcal{H}_K} = 0\), and \(\|f\|_{\mathcal{H}_K}^2 = \|g\|_{\mathcal{H}_K}^2 + \|h\|_{\mathcal{H}_K}^2\) by orthogonality. Substituting into the objective, we get

\[\mathbb{E}(g(x_i)+h(x_i) - y_i)^2 + \lambda \|g\|_{\mathcal{H}_K}^2 + \lambda \|h\|_{\mathcal{H}_K}^2 = \mathbb{E}(g(x_i) - y_i)^2 + \lambda (\|g\|_{\mathcal{H}_K}^2 + \|h\|_{\mathcal{H}_K}^2) \]

We see that minimizing it forces \(h=0\) (unless \(\lambda=0\)), so the minimizer lies in \(\mathcal{U}_D\), as claimed.

Now write \(e_i := k(x_i,\cdot)\), so that \(f(x_i) = \langle f, e_i\rangle\) by the reproducing property, and restrict to \(f = \sum_i f_i e_i \in \mathcal{U}_D\). We are trying to minimize

\[S=\mathbb{E}(\langle f, e_i\rangle - y_i)^2 + \lambda \langle f, f\rangle\]

If we differentiate \(S\) with respect to \(f = \sum f_i e_i\), in an arbitrary direction \(Df\),

\[\begin{align*} DS &= \mathbb{E}2(\langle f, e_i\rangle - y_i)\langle Df, e_i\rangle + 2\lambda \langle Df, f\rangle\\ &= 2\langle Df, \mathbb{E}(\langle f,e_i\rangle - y_i)e_i+\lambda f\rangle\\ &= 2\langle Df, \mathbb{E}(\langle f,e_i\rangle + \lambda f_i- y_i)e_i\rangle\\ \end{align*}\]

and set it to zero for all \(Df\) (using that the \(e_i\) are linearly independent), we get

\[ \langle f, e_i\rangle + \lambda f_i - y_i = 0 \]

In terms of matrices, since \(\langle f, e_i\rangle = \sum_j f_j\, k(x_j, x_i) = (K\underline{f})_i\), this is

\[ (K+\lambda I)\cdot \underline{f} = \underline{y} \]

where \(K_{ij} = k(x_i,x_j)\), \(\underline{f} = (f_i)\), and \(\underline{y} = (y_i)\); the minimizer is then \(f^* = \sum_i f_i\, k(x_i,\cdot)\).
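
A minimal numerical sketch of this closed-form solution (the Gaussian kernel and the synthetic 1-D data are illustrative assumptions, not fixed by the notes above):

```python
import numpy as np

def kernel(x, y, gamma=10.0):
    # Gaussian (RBF) kernel -- an illustrative positive-definite kernel.
    return np.exp(-gamma * (x - y) ** 2)

# Synthetic data (x_i, y_i).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=20)
y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=20)

lam = 1e-2                                  # regularization parameter lambda
K = kernel(X[:, None], X[None, :])          # Gram matrix K_ij = k(x_i, x_j)
f_coef = np.linalg.solve(K + lam * np.eye(len(X)), y)  # solve (K + lambda I) f = y

def f_star(x):
    # Minimizer f*(x) = sum_i f_i k(x_i, x).
    return kernel(X, x) @ f_coef

print(f_star(0.25), np.sin(2 * np.pi * 0.25))  # fitted value vs. noiseless target
```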