Lecturer: Nik at KCL, Computational Statistics
Note taker and remarker: Eiko. I will write these notes in my own style and add some personal comments.
The most important equation is the reproducing property of the kernel; recall that it says
\[ \langle k(x,\cdot), f\rangle_{\mathcal{H}_K} = f(x), \quad\forall f\in \mathcal{H}_K,\ x\in X.\]
Derivative reproducing property: if \(X=\mathbb{R}\) and \(k\) is differentiable, then
\[ f'(x) = \langle f, \partial_x k(x,\cdot) \rangle_{\mathcal{H}_K}.\]
More generally, higher order derivatives can be computed (when \(k\) is sufficiently smooth) as
\[ D^\alpha f(x) = \langle f, \partial_x^\alpha k(x,\cdot) \rangle_{\mathcal{H}_K}.\]
Proof.
Just observe that
\[ \frac{f(x+h)-f(x)}{h} = \left\langle f, \frac{k(x+h,\cdot)-k(x,\cdot)}{h} \right\rangle_{\mathcal{H}_K}, \]
and let \(h\to 0\): when \(k\) is suitably differentiable the difference quotients \(\frac{k(x+h,\cdot)-k(x,\cdot)}{h}\) converge in \(\mathcal{H}_K\) to \(\partial_x k(x,\cdot)\), so by continuity of the inner product we may pass to the limit and obtain \(f'(x) = \langle f, \partial_x k(x,\cdot)\rangle_{\mathcal{H}_K}\).
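A quick numerical sanity check of the derivative reproducing property (my own illustration, not from the lecture): for the Gaussian kernel and a finite expansion \(f = \sum_j \alpha_j k(z_j,\cdot)\), the property gives \(f'(x) = \sum_j \alpha_j \partial_x k(x, z_j)\), which we can compare against a finite difference.

```python
# Sanity check of the derivative reproducing property (illustration only).
# Gaussian kernel k(x, y) = exp(-(x - y)^2 / (2 sigma^2)); for
# f = sum_j alpha_j k(z_j, .), the property gives
# f'(x) = <f, d/dx k(x, .)> = sum_j alpha_j d/dx k(x, z_j).
import numpy as np

sigma = 0.7

def k(x, y):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def dk_dx(x, y):
    # partial derivative of k(x, y) in its first argument
    return -(x - y) / sigma ** 2 * k(x, y)

rng = np.random.default_rng(0)
z = rng.normal(size=5)       # centres z_j defining f
alpha = rng.normal(size=5)   # coefficients alpha_j

def f(x):
    return np.sum(alpha * k(x, z))

x0, h = 0.3, 1e-6
fd = (f(x0 + h) - f(x0 - h)) / (2 * h)   # central finite difference
rp = np.sum(alpha * dk_dx(x0, z))        # via the derivative reproducing property
print(fd, rp)                            # agree up to discretization error
```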
For \(\mathcal{H}_K\), look at the dual space \(\mathcal{H}_K^*=\{\text{continuous linear functionals } \tilde{u}:\mathcal{H}_K\to \mathbb{R}\}\).
Continuous linear functionals on a Hilbert space are, by the Riesz representation theorem, given by the inner product with a fixed element. So since the evaluation maps \(\delta_x\), \(\delta_x'\) are continuous, they can be written as inner products with some functions in \(\mathcal{H}_K\).
Representer theorem (with general functionals): if \(\tilde{u}_i \in \mathcal{H}_K^*\) and
\[f^* \in \mathrm{argmin}_{f\in \mathcal{H}_K} \left( \frac{1}{N} \sum_{i=1}^N (\tilde{u}_i(f) - y_i)^2 + \lambda \|f\|_{\mathcal{H}_K}^2\right),\]
then \(f^*\) can be written as
\[ f^* = \sum_{i=1}^N \alpha_i u_i,\]
where \(u_i \in \mathcal{H}_K\) is the Riesz representer of \(\tilde{u}_i\).
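For the standard case \(\tilde{u}_i = \delta_{x_i}\) (pointwise evaluation, so \(u_i = k(x_i,\cdot)\)) this is kernel ridge regression, and minimizing over the coefficients gives the closed form \(\alpha = (K + N\lambda I)^{-1} y\) with Gram matrix \(K_{ij} = k(x_i, x_j)\). A minimal sketch (my own code; the Gaussian kernel and toy data are illustrative choices):

```python
# Kernel ridge regression: the representer theorem with u_i = delta_{x_i}.
# f* = sum_i alpha_i k(x_i, .), with alpha = (K + N*lambda*I)^{-1} y.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Gram matrix of pairwise kernel values between rows of A and rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit(X, y, lam=1e-2, sigma=1.0):
    N = len(X)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + N * lam * np.eye(N), y)   # alpha

def predict(X_train, alpha, X_new, sigma=1.0):
    # f*(x) = sum_i alpha_i k(x_i, x)
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# usage on noisy sine data
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = fit(X, y)
print(predict(X, alpha, np.array([[0.0], [1.5]])))
```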
We now recall PCA (principal component analysis).
Given data \((x_i\in \mathbb{R}^D)_{i\le N}\in (\mathbb{R}^D)^N = \mathrm{Hom}(\mathbb{R}^N, \mathbb{R}^D)\), we can compute the covariance matrix
\[ C(X,X) = (\mathbb{E}(X^{(i)}-\mathbb{E}X^{(i)})(X^{(j)}-\mathbb{E}X^{(j)}))_{i,j\le D}\]
WLOG we can assume \(\mathbb{E}X = 0\), so \(C = \frac{1}{N} XX^T\). This is a \(D\times D\) symmetric positive semi-definite matrix. By orthogonal diagonalization, we can write for an orthonormal basis \((u_i)_{i\le D}\)
\[ C u_i = \lambda_i u_i, \quad \lambda_1\ge \dots \ge \lambda_D\ge 0 \]
and \(C = \sum \lambda_i u_i \otimes u_i^*\), here note that \(u_i\otimes u_i^* : \mathbb{R}^D\to \mathbb{R}u_i\) just projects onto the one-dimensional subspace spanned by \(u_i\).
We can then project onto the span of the orthonormal vectors with the largest eigenvalues, \(\mathrm{span}(u_1,\dots, u_k)\), which preserves most of the variance and approximately preserves the inner products.
\[ P = \sum_{i=1}^k u_i\otimes u_i^* : \mathbb{R}^D \to \mathrm{span}(u_1,\dots, u_k) \cong \mathbb{R}^k. \]
\(P\) here can be viewed as an approximation of the identity, since \(P_{k=D} = \mathrm{id}\).
Lemma.
For \(P = \sum_{i=1}^k u_i\otimes u_i^*\) as above, among all rank-\(k\) orthogonal projections \(Q\):
\(P\in \mathrm{argmax}_Q \mathrm{Var}(\{Qx_i\}_{1\le i\le N})\), i.e. it picks the maximal variance directions;
\(P\in \mathrm{argmin}_Q \frac{1}{N} \sum_{i=1}^N |x_i - Qx_i|^2\), i.e. it moves the points the least (maximal inner product preservation).
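In matrix terms, this is just an eigendecomposition of the empirical covariance. A minimal sketch (my own code; in practice one would usually compute this via the SVD of the centred data matrix):

```python
# PCA via eigendecomposition of the empirical covariance.
# Data stored as an (N, D) array, so after centring C = X^T X / N.
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                 # WLOG E X = 0
    C = Xc.T @ Xc / len(Xc)                 # D x D covariance matrix
    eigvals, U = np.linalg.eigh(C)          # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:k]     # top-k directions u_1, ..., u_k
    Uk = U[:, idx]                          # D x k, orthonormal columns
    return Xc @ Uk, Uk, eigvals[idx]        # scores, directions, variances

# usage on an anisotropic point cloud
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
scores, Uk, var = pca(X, k=1)
print(var)   # lambda_1, the variance captured by the first direction
```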
Idea: Embed the points \((x_i)_{i=1}^N\) into some high dimensional space, and then do PCA in that space.
Consider a map \(\varphi: X\to \mathbb{R}^D\), or equivalently a map \(\mathbb{R}^X\to \mathbb{R}^D\) by the free-forgetful adjunction. We wish that this map preserves the relevant structure of \(X\): similarity between points \(x,y\) should be captured by the inner product \(\langle \varphi(x), \varphi(y)\rangle\).
Prop. If \(\varphi:X\to (H,\langle\cdot,\cdot\rangle_H)\) is a map into any Hilbert space \(H\), then
\[k(x,y):= \langle \varphi(x), \varphi(y)\rangle_H\]
defines a positive definite kernel on \(X\).
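For example (a standard illustration, not from the lecture), take \(X=\mathbb{R}\) and the explicit feature map
\[ \varphi(x) = (x^2, \sqrt{2}\,x, 1)\in\mathbb{R}^3, \qquad \langle \varphi(x), \varphi(y)\rangle = x^2y^2 + 2xy + 1 = (xy+1)^2, \]
so \(k(x,y) = (xy+1)^2\) is a positive definite kernel on \(\mathbb{R}\).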
If we have an RKHS \(\mathcal{H}_K\) on \(X\), then
\[\varphi(x) = k(x,\cdot) \in \mathcal{H}_K\]
is the canonical feature map associated to \((\mathcal{H}_K, X)\).
Let \(X\) be a set and \(k:X\times X\to \mathbb{R}\) a positive definite kernel; we get the associated RKHS \(\mathcal{H}_K\) and canonical feature map \(\varphi_k:X\to \mathcal{H}_K\).
\(\varphi_k(x), \varphi_k(y)\) are now in \(\mathcal{H}_K\), and we can compute the inner product
\[ \langle \varphi_k(x), \varphi_k(y)\rangle_{\mathcal{H}_K} = k(x,y).\]
The kernel trick says that any algorithm working with points \((x_i)\subset \mathbb{R}^D\) that relies only on the Euclidean structure \((\mathbb{R}^D, \langle\cdot, \cdot\rangle)\) can be generalized to the kernel setting by replacing every inner product \(\langle x_i, x_j\rangle\) with a kernel evaluation \(k(x_i, x_j)\).
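As an illustration, applying the trick to PCA from above gives kernel PCA: the algorithm only ever needs inner products of (centred) feature vectors, so it runs entirely on the Gram matrix \(K_{ij} = k(x_i, x_j)\). A minimal sketch (my own code; the Gaussian kernel and two-cluster toy data are illustrative choices):

```python
# Kernel PCA: PCA in the feature space H_K, using only kernel evaluations.
import numpy as np

def gaussian_gram(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_pca(X, k, sigma=1.0):
    N = len(X)
    K = gaussian_gram(X, sigma)               # Gram matrix K_ij = k(x_i, x_j)
    J = np.ones((N, N)) / N
    Kc = K - J @ K - K @ J + J @ K @ J        # centre the features in H_K
    eigvals, V = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:k]       # top-k eigenpairs
    lam, V = eigvals[idx], V[:, idx]
    return V * np.sqrt(np.maximum(lam, 0.0))  # projections of the x_i onto top-k components

# usage: two clusters that kernel PCA separates along its first component
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(50, 2)) for m in ([0, 0], [3, 3])])
print(kernel_pca(X, k=2)[:3])
```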
Eiko Remark. This seems to let us use the topology of \(X\) and bring some non-linear structure into the game. The topology seems important here: if \(k(x,y)\) were just arbitrary numbers with nothing to do with the topology of \(X\) (equivalently, choosing the discrete topology), we would merely be putting an arbitrary inner product on the huge space \(\mathbb{R}^X\). The point is that using the topology lowers the dimension we need to consider, without restricting us to the original space \(\mathbb{R}^D\) with its Euclidean structure.