Author: Eiko

Time: 2025-03-10 10:56:15 - 2025-03-10 10:56:15 (UTC)

Kernel Methods In Machine Learning - Lecture 4

Last time we saw that, given a positive definite kernel \(k:X\times X\to \mathbb{R}\), we obtain a canonical feature map

\[\varphi_K: X\to \mathcal{H}_K \quad x\mapsto k(x,\cdot). \]

Kernel Mean Embeddings

Given a probability measure, we can average the feature map over it:

\[\mu: \mathcal{P}(X) \to \mathcal{H}_K, \qquad \rho\mapsto \mu_\rho := \int k(x,\cdot)\, \rho(\mathrm{d}x).\]

Observation: for \(x\in X\) we have \(\delta_x\in \mathcal{P}(X)\), and embedding a Dirac measure recovers the feature map (and hence the kernel):

\[\mu_{\delta_x} = \int k(y,\cdot)\,\delta_x(\mathrm{d}y) = k(x,\cdot) = \varphi_K(x).\]

Moreover, since the reproducing property gives \(f(x)=\langle k(x,\cdot),f\rangle_{\mathcal{H}_K}\), we have

\[\begin{align*} \mathbb{E}_{X\sim \rho}[f(X)] &=\int f\,\mathrm{d}\rho \\ &= \int \langle k(x,\cdot),f\rangle_{\mathcal{H}_K} \,\mathrm{d}\rho(x) \\ &= \left\langle \int k(x,\cdot)\,\mathrm{d}\rho(x) ,f \right\rangle_{\mathcal{H}_K} \\ &= \langle \mu_\rho, f\rangle_{\mathcal{H}_K} \\ \end{align*}\]

So the mean embedding lets us evaluate the expectation of any \(f\in\mathcal{H}_K\) as a single inner product.
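This identity can be checked numerically for an empirical measure. A minimal sketch, assuming a Gaussian kernel and taking \(f\) to be a finite combination of kernel sections (all names and parameters here are illustrative):

```python
import numpy as np

def k(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-|x - y|^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
xs = rng.normal(size=(100, 2))  # samples defining the empirical measure rho_n
z1, z2 = np.array([0.5, -0.3]), np.array([-1.0, 0.2])
a, b = 2.0, -0.7                # f = a*k(z1, .) + b*k(z2, .) lies in H_K

# Left side: expectation of f under rho_n, evaluating f pointwise
lhs = np.mean([a * k(z1, x) + b * k(z2, x) for x in xs])

# Right side: <mu_rho, f> with mu_rho = (1/n) sum_i k(x_i, .),
# expanded via the reproducing property <k(x_i, .), k(z, .)> = k(x_i, z)
rhs = a * np.mean([k(x, z1) for x in xs]) + b * np.mean([k(x, z2) for x in xs])
```

The two sides are the same double sum taken in different order, so they agree to machine precision.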

Maximum Mean Discrepancy

For \(\rho,\pi\in \mathcal{P}(X)\) we can define

\[\mathrm{MMD}(\rho,\pi) = \| \mu_\rho - \mu_\pi\|_{\mathcal{H}_K}.\]

This amounts to pulling back the RKHS metric to the space of probability measures on \(X\).

We can compute

\[\begin{align*} \mathrm{MMD}^2(\rho,\pi) &= \|\mu_\rho - \mu_\pi\|^2_{\mathcal{H}_K} \\ &= \langle \mu_\rho - \mu_\pi, \mu_\rho - \mu_\pi\rangle_{\mathcal{H}_K} \\ &= \langle \mu_\rho, \mu_\rho\rangle_{\mathcal{H}_K} + \langle \mu_\pi, \mu_\pi\rangle_{\mathcal{H}_K} - 2\langle \mu_\rho, \mu_\pi\rangle_{\mathcal{H}_K} \\ &= \mathbb{E}_{X\sim \rho,Y\sim \rho}[k(X,Y)] + \mathbb{E}_{X\sim \pi,Y\sim \pi}[k(X,Y)] - 2\mathbb{E}_{X\sim \rho,Y\sim \pi}[k(X,Y)] \end{align*}\]

We can estimate this from samples using the empirical measures

\[\rho \approx \frac{1}{n}\sum_{i=1}^n \delta_{x_i},\quad \pi \approx \frac{1}{m}\sum_{j=1}^m \delta_{y_j}\]

This gives a simple plug-in estimator for the MMD:

\[\mathrm{MMD}^2(\rho,\pi) \approx \frac{1}{n^2}\sum_{i,j} k(x_i,x_j) + \frac{1}{m^2}\sum_{i,j} k(y_i,y_j) - \frac{2}{nm}\sum_{i,j} k(x_i,y_j).\]
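A minimal NumPy sketch of this plug-in estimator (the Gaussian kernel and bandwidth are illustrative choices, not fixed by the lecture):

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Plug-in (V-statistic) estimator of MMD^2 with a Gaussian kernel.
    Biased but simple; matches the formula above term by term."""
    def gram(A, B):
        # pairwise squared distances via broadcasting, then the kernel
        d2 = (np.sum(A**2, axis=1)[:, None]
              + np.sum(B**2, axis=1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-d2 / (2.0 * sigma**2))
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 1))       # samples from rho
Y_same = rng.normal(0.0, 1.0, size=(200, 1))  # pi = rho
Y_diff = rng.normal(2.0, 1.0, size=(200, 1))  # pi shifted by 2
# Same distribution -> estimate near zero; shifted -> clearly positive.
```

Including the diagonal terms makes this a biased (V-statistic) estimate; dropping \(i=j\) in the first two sums gives the usual unbiased U-statistic variant.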

Note that \(\mathrm{MMD}(\rho,\pi)=0\) does not necessarily imply \(\rho=\pi\): the pseudo-metric need not be faithful. For example, with the linear kernel \(k(x,y)=x^\top y\) the embedding \(\mu_\rho\) is just the mean of \(\rho\), so the MMD vanishes whenever two measures share the same mean.

Definition. A kernel \(k\) is called characteristic if its kernel mean embedding is injective.

Theorem. The Gaussian kernel (and most kernels used in applications, such as the \(p\)-kernels) is characteristic.
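To see the difference being characteristic makes, compare a non-characteristic kernel with the Gaussian on two measures with equal means but different variances (a sketch; the sample sizes and bandwidth are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(0.0, 1.0, size=(n, 1))  # N(0, 1)
Y = rng.normal(0.0, 3.0, size=(n, 1))  # N(0, 9): same mean, different variance

def mmd2(X, Y, kernel):
    """Plug-in MMD^2 for a generic kernel returning a Gram matrix."""
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2.0 * kernel(X, Y).mean()

def linear(A, B):
    # k(x, y) = x . y embeds a measure to its mean vector
    return A @ B.T

def gaussian(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

# The linear kernel only sees the mean, so its MMD^2 is ~0 here,
# while the (characteristic) Gaussian kernel separates the two laws.
```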

There is also an inverse ("outbedding") problem: given \(f\in\mathcal{H}_K\), can we find \(\rho\) such that \(\mu_\rho=f\)?

Other Stuff

(A stationary phase computation: we expand the phase \(\Phi\) around a critical point \(t_0\), i.e. \(\Phi'(t_0)=0\), with \(A=\tfrac{1}{2}\Phi''(t_0)\).)

\[\begin{align*} \int \psi(t)e^{is\Phi(t)} \,\mathrm{d}t &= e^{is\Phi(t_0)} \int \psi(t)e^{is(\Phi(t)-\Phi(t_0))} \,\mathrm{d}t \\ &= e^{is\Phi(t_0)} \int \psi(t)e^{is(A(t-t_0)^2+O(|t-t_0|^3))} \,\mathrm{d}t \end{align*}\]