Last time we discussed that, given a positive definite kernel \(k:X\times X\to \mathbb{R}\), we have a canonical feature map
\[\varphi_K: X\to \mathcal{H}_K \quad x\mapsto k(x,\cdot). \]
Given a probability measure on \(X\), we can average the feature map over it; this defines the kernel mean embedding
\[\mu: \mathcal{P}(X) \to \mathcal{H}_K\]
\[\rho\mapsto \mu_\rho := \int k(x,\cdot)\, \rho(\mathrm{d}x)\]
Observation: for \(x\in X\) we have \(\delta_x\in \mathcal{P}(X)\), and
\[\mu_{\delta_x} = \int k(y,\cdot)\,\delta_x(\mathrm{d}y) = k(x,\cdot),\]
so from the mean embedding we can get back the feature map (and hence the kernel).
Moreover, since the reproducing property gives \(f(x)=\langle k(x,\cdot),f\rangle_{\mathcal{H}_K}\) for \(f\in\mathcal{H}_K\), we have
\[\begin{align*} \mathbb{E}_{X\sim \rho}[f(X)] &=\int f\,\mathrm{d}\rho \\ &= \int \langle k(x,\cdot),f\rangle_{\mathcal{H}_K} \,\mathrm{d}\rho(x) \\ &= \left\langle \int k(x,\cdot)\,\mathrm{d}\rho(x) ,f \right\rangle_{\mathcal{H}_K} \\ &= \langle \mu_\rho, f\rangle_{\mathcal{H}_K} \\ \end{align*}\]
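As a concrete instance, take the linear kernel \(k(x,y)=\langle x,y\rangle\) on \(X=\mathbb{R}^d\). Then the mean embedding only records the mean:
\[\mu_\rho = \int \langle x,\cdot\rangle \,\mathrm{d}\rho(x) = \langle \mathbb{E}_{X\sim\rho}[X],\cdot\rangle,\]
and the identity above reduces to \(\mathbb{E}[f(X)] = \langle \mathbb{E}[X], w\rangle\) for the linear function \(f=\langle w,\cdot\rangle\).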
We see that the mean embedding \(\mu_\rho\) can be used to evaluate expectations of functions in \(\mathcal{H}_K\).
For \(\rho,\pi\in \mathcal{P}(X)\) we can define the maximum mean discrepancy (MMD)
\[\mathrm{MMD}(\rho,\pi) = \| \mu_\rho - \mu_\pi\|_{\mathcal{H}_K}.\]
That is, we pull back the metric from the RKHS to the space of probability measures on \(X\).
We can compute
\[\begin{align*} \mathrm{MMD}^2(\rho,\pi) &= \|\mu_\rho - \mu_\pi\|^2_{\mathcal{H}_K} \\ &= \langle \mu_\rho - \mu_\pi, \mu_\rho - \mu_\pi\rangle_{\mathcal{H}_K} \\ &= \langle \mu_\rho, \mu_\rho\rangle_{\mathcal{H}_K} + \langle \mu_\pi, \mu_\pi\rangle_{\mathcal{H}_K} - 2\langle \mu_\rho, \mu_\pi\rangle_{\mathcal{H}_K} \\ &= \mathbb{E}_{X\sim \rho,Y\sim \rho}[k(X,Y)] + \mathbb{E}_{X\sim \pi,Y\sim \pi}[k(X,Y)] - 2\mathbb{E}_{X\sim \rho,Y\sim \pi}[k(X,Y)] \end{align*}\]
We can estimate this from samples via the empirical measures
\[\rho \approx \frac{1}{n}\sum_{i=1}^n \delta_{x_i},\quad \pi \approx \frac{1}{m}\sum_{j=1}^m \delta_{y_j}\]
This gives a simple plug-in estimator of the MMD:
\[\mathrm{MMD}^2(\rho,\pi) \approx \frac{1}{n^2}\sum_{i,j} k(x_i,x_j) + \frac{1}{m^2}\sum_{i,j} k(y_i,y_j) - \frac{2}{nm}\sum_{i,j} k(x_i,y_j).\]
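The plug-in estimator above is straightforward to implement. Here is a minimal sketch in Python (the function names `gaussian_kernel` and `mmd2` are mine, not from the lecture), using the Gaussian kernel and comparing samples from the same versus different distributions:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2 * X @ Y.T)
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    # Plug-in (V-statistic) estimator of MMD^2 from samples X ~ rho, Y ~ pi:
    # mean(Kxx) + mean(Kyy) - 2 * mean(Kxy).
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))        # samples from rho = N(0, 1)
Y_same = rng.normal(0.0, 1.0, size=(500, 1))   # pi = rho
Y_shift = rng.normal(1.0, 1.0, size=(500, 1))  # pi = N(1, 1)

print(mmd2(X, Y_same))   # close to 0
print(mmd2(X, Y_shift))  # clearly positive
```

Note this is the biased V-statistic version (the diagonal terms \(k(x_i,x_i)\) are included); the unbiased U-statistic variant simply drops the diagonal of the within-sample sums.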
Note that \(\mathrm{MMD}(\rho,\pi)=0\) does not necessarily imply \(\rho=\pi\); the mean embedding need not be injective, so the MMD is in general only a pseudometric (it is not necessarily faithful).
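For example, the linear kernel \(k(x,y)=\langle x,y\rangle\) fails to be faithful: its mean embedding is \(\mu_\rho = \langle \mathbb{E}_{X\sim\rho}[X],\cdot\rangle\), so any two distributions with the same mean, e.g.
\[\rho = \mathcal{N}(0,1), \qquad \pi = \mathrm{Unif}[-\sqrt{3},\sqrt{3}],\]
satisfy \(\mathrm{MMD}(\rho,\pi)=0\) even though \(\rho\neq\pi\).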
Definition. A kernel \(k\) is called characteristic if the kernel mean embedding is injective.
Theorem. The Gaussian kernel is characteristic (as are most kernels used in applications, such as the \(p\)-kernels).
We also have the reverse ("outbedding") problem: given \(f\in\mathcal{H}_K\), can we find \(\rho\) such that \(\mu_\rho=f\)?
\[\begin{align*} \int \psi(t)e^{is\Phi(t)} \,\mathrm{d}t &= e^{is\Phi(t_0)} \int \psi(t)e^{is(\Phi(t)-\Phi(t_0))} \,\mathrm{d}t \\ &= e^{is\Phi(t_0)} \int \psi(t)e^{is(A(t-t_0)^2+O(|t-t_0|^3))} \,\mathrm{d}t \end{align*}\]