Lecturer: Nik at KCL, Computational Statistics
Note taker and remarker: Eiko. I will write these notes in my own style and add some personal comments.
The most important equation is the reproducing property of the kernel; recall that it says
\[ \langle k(x,\cdot), f\rangle_{\mathcal{H}_K} = f(x), \quad\forall f\in \mathcal{H}_K,\ x\in X.\]
Derivative reproducing property: if \(X=\mathbb{R}\) and \(k\) is differentiable, then
\[ f'(x) = \langle f, \partial_x k(x,\cdot) \rangle_{\mathcal{H}_K}.\]
More generally, higher order derivatives can be computed (when \(k\) is sufficiently smooth) as
\[ D^\alpha f(x) = \langle f, \partial_x^\alpha k(x,\cdot) \rangle_{\mathcal{H}_K}.\]
Proof.
Just observe that
\[ \frac{f(x+h)-f(x)}{h} = \left\langle f, \frac{k(x+h,\cdot)-k(x,\cdot)}{h} \right\rangle_{\mathcal{H}_K}, \]
and let \(h\to 0\): when \(k\) is suitably differentiable the difference quotients \(\frac{k(x+h,\cdot)-k(x,\cdot)}{h}\) converge in \(\mathcal{H}_K\) to \(\partial_x k(x,\cdot)\), so by continuity of the inner product we may pass to the limit and obtain \(f'(x) = \langle f, \partial_x k(x,\cdot)\rangle_{\mathcal{H}_K}\).
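A quick numerical sanity check of the derivative reproducing property (my own illustration, not from the lecture): for the Gaussian kernel and a finite expansion \(f = \sum_j \alpha_j k(z_j,\cdot)\), the property gives \(f'(x) = \sum_j \alpha_j \partial_x k(x, z_j)\), which we can compare against a finite difference.

```python
# Sanity check of the derivative reproducing property (illustration only).
# Gaussian kernel k(x, y) = exp(-(x - y)^2 / (2 sigma^2)); for
# f = sum_j alpha_j k(z_j, .), the property gives
# f'(x) = <f, d/dx k(x, .)> = sum_j alpha_j d/dx k(x, z_j).
import numpy as np

sigma = 0.7

def k(x, y):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def dk_dx(x, y):
    # partial derivative of k(x, y) in its first argument
    return -(x - y) / sigma ** 2 * k(x, y)

rng = np.random.default_rng(0)
z = rng.normal(size=5)       # centres z_j defining f
alpha = rng.normal(size=5)   # coefficients alpha_j

def f(x):
    return np.sum(alpha * k(x, z))

x0, h = 0.3, 1e-6
fd = (f(x0 + h) - f(x0 - h)) / (2 * h)   # central finite difference
rp = np.sum(alpha * dk_dx(x0, z))        # via the derivative reproducing property
print(fd, rp)                            # agree up to discretization error
```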
For \(\mathcal{H}_K\), look at the dual space \(\mathcal{H}_K^*=\{\text{continuous linear functionals } \tilde{u}:\mathcal{H}_K\to \mathbb{R}\}\).
Continuous linear functionals on a Hilbert space are, by the Riesz representation theorem, given by the inner product with a fixed element. So since the evaluation maps \(\delta_x\), \(\delta_x'\) are continuous, they can be written as inner products with some functions in \(\mathcal{H}_K\).
Representer theorem (with general functionals): if \(\tilde{u}_i \in \mathcal{H}_K^*\) and
\[f^* \in \mathrm{argmin}_{f\in \mathcal{H}_K} \left( \frac{1}{N} \sum_{i=1}^N (\tilde{u}_i(f) - y_i)^2 + \lambda \|f\|_{\mathcal{H}_K}^2\right),\]
then \(f^*\) can be written as
\[ f^* = \sum_{i=1}^N \alpha_i u_i,\]
where \(u_i \in \mathcal{H}_K\) is the Riesz representer of \(\tilde{u}_i\).
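For the standard case \(\tilde{u}_i = \delta_{x_i}\) (pointwise evaluation, so \(u_i = k(x_i,\cdot)\)) this is kernel ridge regression, and minimizing over the coefficients gives the closed form \(\alpha = (K + N\lambda I)^{-1} y\) with Gram matrix \(K_{ij} = k(x_i, x_j)\). A minimal sketch (my own code; the Gaussian kernel and toy data are illustrative choices):

```python
# Kernel ridge regression: the representer theorem with u_i = delta_{x_i}.
# f* = sum_i alpha_i k(x_i, .), with alpha = (K + N*lambda*I)^{-1} y.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Gram matrix of pairwise kernel values between rows of A and rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit(X, y, lam=1e-2, sigma=1.0):
    N = len(X)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + N * lam * np.eye(N), y)   # alpha

def predict(X_train, alpha, X_new, sigma=1.0):
    # f*(x) = sum_i alpha_i k(x_i, x)
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# usage on noisy sine data
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = fit(X, y)
print(predict(X, alpha, np.array([[0.0], [1.5]])))
```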
We now recall PCA (principal component analysis).
Given data \((x_i\in \mathbb{R}^D)_{i\le N}\in (\mathbb{R}^D)^N = \mathrm{Hom}(\mathbb{R}^N, \mathbb{R}^D)\), we can compute the covariance matrix
\[ C(X,X) = (\mathbb{E}(X^{(i)}-\mathbb{E}X^{(i)})(X^{(j)}-\mathbb{E}X^{(j)}))_{i,j\le D}\]
WLOG we can assume \(\mathbb{E}X = 0\), so \(C = \frac{1}{N} XX^T\). This is a \(D\times D\) symmetric positive semi-definite matrix. By orthogonal diagonalization, we can write for an orthonormal basis \((u_i)_{i\le D}\)
\[ C u_i = \lambda_i u_i, \quad \lambda_1\ge \dots \ge \lambda_D\ge 0 \]
and \(C = \sum \lambda_i u_i \otimes u_i^*\), here note that \(u_i\otimes u_i^* : \mathbb{R}^D\to \mathbb{R}u_i\) just projects onto the one-dimensional subspace spanned by \(u_i\).
We can then project onto the span of the orthonormal vectors with the largest eigenvalues, \(\mathrm{span}(u_1,\dots, u_k)\), which preserves most of the variance and approximately preserves the inner products.
\[ P = \sum_{i=1}^k u_i\otimes u_i^* : \mathbb{R}^D \to \mathrm{span}(u_1,\dots, u_k) \cong \mathbb{R}^k. \]
\(P\) here can be viewed as an approximation of the identity, since \(P_{k=D} = \mathrm{id}\).
Lemma.
For \(P = \sum_{i=1}^k u_i\otimes u_i^*\) as above, among all rank-\(k\) orthogonal projections \(Q\):
\(P\in \mathrm{argmax}_Q \mathrm{Var}(\{Qx_i\}_{1\le i\le N})\), i.e. it picks the maximal variance directions;
\(P\in \mathrm{argmin}_Q \frac{1}{N} \sum_{i=1}^N |x_i - Qx_i|^2\), i.e. it moves the points the least (maximal inner product preservation).
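In matrix terms, this is just an eigendecomposition of the empirical covariance. A minimal sketch (my own code; in practice one would usually compute this via the SVD of the centred data matrix):

```python
# PCA via eigendecomposition of the empirical covariance.
# Data stored as an (N, D) array, so after centring C = X^T X / N.
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                 # WLOG E X = 0
    C = Xc.T @ Xc / len(Xc)                 # D x D covariance matrix
    eigvals, U = np.linalg.eigh(C)          # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:k]     # top-k directions u_1, ..., u_k
    Uk = U[:, idx]                          # D x k, orthonormal columns
    return Xc @ Uk, Uk, eigvals[idx]        # scores, directions, variances

# usage on an anisotropic point cloud
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
scores, Uk, var = pca(X, k=1)
print(var)   # lambda_1, the variance captured by the first direction
```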
Idea: Embed the points \((x_i)_{i=1}^N\) into some high dimensional space, and then do PCA in that space.
Consider a map \(\varphi: X\to \mathbb{R}^D\), or equivalently a map \(\mathbb{R}^X\to \mathbb{R}^D\) by the free-forgetful adjunction. We wish that this map preserves the relevant structure of \(X\): similarity between points \(x,y\) should be captured by the inner product \(\langle \varphi(x), \varphi(y)\rangle\).
Prop. If \(\varphi:X\to (H,\langle\cdot,\cdot\rangle_H)\) is a map into any Hilbert space \(H\), then
\[k(x,y):= \langle \varphi(x), \varphi(y)\rangle_H\]
defines a positive definite kernel on \(X\).
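For example (a standard illustration, not from the lecture), take \(X=\mathbb{R}\) and the explicit feature map
\[ \varphi(x) = (x^2, \sqrt{2}\,x, 1)\in\mathbb{R}^3, \qquad \langle \varphi(x), \varphi(y)\rangle = x^2y^2 + 2xy + 1 = (xy+1)^2, \]
so \(k(x,y) = (xy+1)^2\) is a positive definite kernel on \(\mathbb{R}\).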
If we have an RKHS \(\mathcal{H}_K\) on \(X\), then
\[\varphi(x) = k(x,\cdot) \in \mathcal{H}_K\]
is the canonical feature map associated to \((\mathcal{H}_K, X)\).
Let \(X\) be a set and \(k:X\times X\to \mathbb{R}\) a positive definite kernel; we get the associated RKHS \(\mathcal{H}_K\) and canonical feature map \(\varphi_k:X\to \mathcal{H}_K\).
\(\varphi_k(x), \varphi_k(y)\) are now in \(\mathcal{H}_K\), and we can compute the inner product
\[ \langle \varphi_k(x), \varphi_k(y)\rangle_{\mathcal{H}_K} = k(x,y).\]
The kernel trick says that any algorithm working with points \((x_i)\subset \mathbb{R}^D\) that relies only on the Euclidean structure \((\mathbb{R}^D, \langle\cdot, \cdot\rangle)\) can be generalized to the kernel setting by replacing every inner product \(\langle x_i, x_j\rangle\) with a kernel evaluation \(k(x_i, x_j)\).
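As an illustration, applying the trick to PCA from above gives kernel PCA: the algorithm only ever needs inner products of (centred) feature vectors, so it runs entirely on the Gram matrix \(K_{ij} = k(x_i, x_j)\). A minimal sketch (my own code; the Gaussian kernel and two-cluster toy data are illustrative choices):

```python
# Kernel PCA: PCA in the feature space H_K, using only kernel evaluations.
import numpy as np

def gaussian_gram(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_pca(X, k, sigma=1.0):
    N = len(X)
    K = gaussian_gram(X, sigma)               # Gram matrix K_ij = k(x_i, x_j)
    J = np.ones((N, N)) / N
    Kc = K - J @ K - K @ J + J @ K @ J        # centre the features in H_K
    eigvals, V = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:k]       # top-k eigenpairs
    lam, V = eigvals[idx], V[:, idx]
    return V * np.sqrt(np.maximum(lam, 0.0))  # projections of the x_i onto top-k components

# usage: two clusters that kernel PCA separates along its first component
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(50, 2)) for m in ([0, 0], [3, 3])])
print(kernel_pca(X, k=2)[:3])
```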
Eiko Remark. This seems to let us use the topology of \(X\) and bring some non-linear structure into the game. The topology seems important here: if \(k(x,y)\) were just arbitrary numbers with nothing to do with the topology of \(X\) (equivalently, choosing the discrete topology), we would merely be putting an arbitrary inner product on the huge space \(\mathbb{R}^X\). The point is that using the topology lowers the dimension we need to consider, without restricting us to the original space \(\mathbb{R}^D\) with its Euclidean structure.