Author: Eiko

Time: 2025-03-03 10:47:12 - 2025-03-03 10:47:12 (UTC)

Lecturer: Nik at KCL, Computational Statistics

Note taker and remarker: Eiko. I will write these notes in my own style, adding some personal comments.

Kernel Methods - Lecture 3

The most important equation is the reproducing property of the kernel; recall that it says

$$\langle K(x,\cdot), f\rangle_{\mathcal{H}_K} = f(x), \qquad \forall f \in \mathcal{H}_K,\ \forall x \in X.$$

Derivative reproducing property: if $X = \mathbb{R}$ and $k$ is differentiable, then

$$f'(x) = \langle f, \partial_x k(x,\cdot)\rangle_{\mathcal{H}_K}.$$

More generally, the higher order derivatives can be computed as

$$D^\alpha f(x) = \langle f, \partial_x^\alpha k(x,\cdot)\rangle_{\mathcal{H}_K}.$$

Proof.

Just observe that

$$\frac{f(x+h)-f(x)}{h} = \left\langle f,\ \frac{k(x+h,\cdot)-k(x,\cdot)}{h}\right\rangle_{\mathcal{H}_K},$$

and let $h \to 0$: if the difference quotients $\frac{k(x+h,\cdot)-k(x,\cdot)}{h}$ converge in $\mathcal{H}_K$, continuity of the inner product gives $f'(x) = \langle f, \partial_x k(x,\cdot)\rangle_{\mathcal{H}_K}$.
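As a sanity check (not from the lecture), here is a minimal numerical verification of the derivative reproducing property for a Gaussian kernel on $\mathbb{R}$; the bandwidth $\sigma$, centers, and coefficients defining $f \in \mathcal{H}_K$ are illustrative choices of mine.

```python
# Numerical check of f'(x) = <f, d/dx k(x, .)>_{H_K} for a Gaussian kernel on R.
import numpy as np

sigma = 0.7

def k(a, b):
    """Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 sigma^2))."""
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

def dk_dx(x, b):
    """Partial derivative of k(x, b) with respect to the first argument x."""
    return -(x - b) / sigma ** 2 * k(x, b)

# Take f = sum_j alpha_j k(c_j, .), an element of H_K.
centers = np.array([-1.0, 0.3, 2.0])
alphas = np.array([0.5, -1.2, 0.8])

def f(x):
    return np.sum(alphas * k(centers, x))

x0 = 0.9
h = 1e-6
# Left-hand side: finite-difference approximation of f'(x0).
lhs = (f(x0 + h) - f(x0 - h)) / (2 * h)
# Right-hand side: <f, d/dx k(x0, .)> = sum_j alpha_j d/dx k(x0, c_j),
# using the reproducing property on each k(c_j, .).
rhs = np.sum(alphas * dk_dx(x0, centers))
print(lhs, rhs)  # the two numbers agree up to finite-difference error
```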

For $\mathcal{H}_K$, look at the dual space $\mathcal{H}_K^* = \{\text{continuous linear functionals } \tilde u : \mathcal{H}_K \to \mathbb{R}\}$.

Continuous linear functionals on a Hilbert space are, by the Riesz representation theorem, induced by the inner product. So since the evaluation maps $\delta_x$ and the derivative evaluations $\delta_x'$ are continuous, they can be written as inner products with some functions in $\mathcal{H}_K$.

Generalized Representer Theorem

$$f^* \in \operatorname*{argmin}_{f\in\mathcal{H}_K}\left(\frac{1}{N}\sum_{i=1}^N\big(\tilde u_i(f)-y_i\big)^2 + \lambda\,\|f\|_{\mathcal{H}_K}^2\right)$$

If $\tilde u_i \in \mathcal{H}_K^*$ for all $i$, then $f^*$ can be written as

$$f^* = \sum_{i=1}^N \alpha_i u_i,$$

where $u_i \in \mathcal{H}_K$ represents $\tilde u_i$, i.e. $\tilde u_i(f) = \langle f, u_i\rangle_{\mathcal{H}_K}$.
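To make this concrete, here is a minimal sketch of the classical special case $\tilde u_i = \delta_{x_i}$ (point evaluation), where $u_i = k(x_i,\cdot)$ and the minimizer is kernel ridge regression with $\alpha = (K + N\lambda I)^{-1} y$; the kernel, data, and parameter values below are illustrative, not from the lecture.

```python
# Kernel ridge regression: the representer theorem with u_i~ = delta_{x_i},
# so f* = sum_i alpha_i k(x_i, .) with alpha = (K + N lambda I)^{-1} y.
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian kernel on R^d, evaluated on all pairs of rows of a and b."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))             # training inputs x_i
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)  # noisy targets y_i

lam = 1e-2
K = rbf(X, X)                                    # Gram matrix K_ij = k(x_i, x_j)
alpha = np.linalg.solve(K + len(X) * lam * np.eye(len(X)), y)

def f_hat(x_new):
    """Evaluate f* = sum_i alpha_i k(x_i, .) at new points."""
    return rbf(x_new, X) @ alpha

print(f_hat(np.array([[0.0], [1.5]])))  # predictions at two test points
```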

Kernel Embeddings

Use PCA (principal component analysis).

Given data $(x_i \in \mathbb{R}^D)_{i\le N} \in (\mathbb{R}^D)^N = \operatorname{Hom}(\mathbb{R}^N, \mathbb{R}^D)$, we can compute the covariance matrix

$$C(X,X) = \Big(\mathbb{E}\big[(X^{(i)} - \mathbb{E}X^{(i)})(X^{(j)} - \mathbb{E}X^{(j)})\big]\Big)_{i,j\le D}.$$

WLOG we can assume $\mathbb{E}X = 0$, so $C = \frac{1}{N} X X^T$. This is a $D \times D$ symmetric positive semi-definite matrix. By orthogonal diagonalization, we can write, for an orthonormal basis $(u_i)_{i\le D}$,

$$C u_i = \lambda_i u_i, \qquad \lambda_1 \ge \dots \ge \lambda_D \ge 0,$$

and $C = \sum_i \lambda_i u_i u_i^T$; here note that $u_i u_i^T : \mathbb{R}^D \to \mathbb{R}u_i$ just projects onto the one-dimensional subspace spanned by $u_i$.

We can then project onto the space spanned by the orthonormal vectors with the largest eigenvalues, $\operatorname{span}(u_1,\dots,u_k)$, which preserves the majority of the variance and of the inner products:

$$P = \sum_{i=1}^k u_i u_i^T : \mathbb{R}^D \to \operatorname{span}(u_1,\dots,u_k) \cong \mathbb{R}^k.$$

$P$ here can be viewed as an approximation of the identity, since $P = \operatorname{id}$ when $k = D$.
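A minimal numpy sketch of this construction (dimensions and data are illustrative, not from the lecture): form $C = \frac{1}{N}XX^T$ from centered data, diagonalize, and project onto the top-$k$ eigenvectors.

```python
# PCA via eigendecomposition of the covariance matrix (features-by-samples layout).
import numpy as np

rng = np.random.default_rng(1)
D, N, k = 5, 200, 2
X = rng.normal(size=(D, N)) * np.array([3.0, 2.0, 1.0, 0.3, 0.1])[:, None]

X = X - X.mean(axis=1, keepdims=True)   # WLOG E X = 0
C = X @ X.T / N                         # D x D covariance matrix

lams, U = np.linalg.eigh(C)             # ascending eigenvalues
order = np.argsort(lams)[::-1]          # sort so lambda_1 >= ... >= lambda_D
lams, U = lams[order], U[:, order]

P = U[:, :k] @ U[:, :k].T               # P = sum_{i<=k} u_i u_i^T, rank-k projector
Z = U[:, :k].T @ X                      # k-dimensional coordinates of the data

# The reconstruction error (1/N) sum_i |x_i - P x_i|^2 equals the sum of the
# dropped eigenvalues lambda_{k+1} + ... + lambda_D.
err = np.mean(np.sum((X - P @ X) ** 2, axis=0))
print(err, lams[k:].sum())
```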

Lemma.

  • $P \in \operatorname{argmax}_{P'} \operatorname{Var}\big(P'\{x_i\}_{1\le i\le N}\big)$ (over rank-$k$ orthogonal projections $P'$): it picks the maximal-variance directions.

  • $P \in \operatorname{argmin}_{P'} \frac{1}{N}\sum_{i=1}^N |x_i - P'x_i|^2$: least moving, maximal inner product preservation.


Idea: Embed the points $(x_i)_{i=1}^N$ into some high-dimensional space, and then do PCA in that space.

Consider a map $\varphi : X \to \mathbb{R}^D$, or equivalently a linear map $\mathbb{R}X \to \mathbb{R}^D$ (from the free vector space on $X$) by the free-forgetful adjunction. We wish that this map preserves the relevant structure of the data.

Prop. If $\varphi : X \to (H, \langle\cdot,\cdot\rangle_H)$ is a map into any Hilbert space $H$, then

$$k(x,y) := \langle \varphi(x), \varphi(y)\rangle_H$$

defines a positive definite kernel on $X$.
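For example (a concrete instance, not from the lecture): take $H = \mathbb{R}^2$ and $\varphi(x) = (x, x^2)$ for $x \in \mathbb{R}$; then $k(x,y) = \langle \varphi(x), \varphi(y)\rangle = xy + x^2y^2$, and any Gram matrix $(k(x_i,x_j))_{i,j} = \Phi^T\Phi$ with $\Phi = (\varphi(x_1),\dots,\varphi(x_N))$ is automatically positive semi-definite.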

If we have an RKHS $\mathcal{H}_K$ on $X$, then

$$\varphi(x) = k(x,\cdot) \in \mathcal{H}_K$$

is the canonical feature map associated to $(\mathcal{H}_K, X)$.


Let $X$ be a set and $k : X \times X \to \mathbb{R}$ a positive definite kernel; we get the associated RKHS $\mathcal{H}_K$ and the canonical feature map $\varphi_k : X \to \mathcal{H}_K$.

$\varphi_k(x), \varphi_k(y)$ now live in $\mathcal{H}_K$, and we can compute their inner product:

$$\langle \varphi_k(x), \varphi_k(y)\rangle_{\mathcal{H}_K} = k(x,y).$$
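A small illustration of this (the representation below is my own assumption, not from the lecture): store an element $f = \sum_i a_i k(c_i,\cdot)$ of $\mathcal{H}_K$ as its coefficients and centers, compute inner products by $\langle f, g\rangle_{\mathcal{H}_K} = \sum_{i,j} a_i b_j k(c_i, d_j)$, and observe that the canonical feature map recovers the kernel.

```python
# The canonical feature map phi_k(x) = k(x, .) as a (coefficients, centers) pair,
# with <phi_k(x), phi_k(y)>_{H_K} = k(x, y).
import numpy as np

def k(a, b, sigma=1.0):
    """Gaussian kernel on R (illustrative choice)."""
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

def inner(f, g):
    """<f, g>_{H_K} for f = (coeffs_f, centers_f), g = (coeffs_g, centers_g)."""
    (af, cf), (ag, cg) = f, g
    return sum(ai * bj * k(ci, dj) for ai, ci in zip(af, cf) for bj, dj in zip(ag, cg))

def phi(x):
    """Canonical feature map phi_k(x) = k(x, .): one center x with coefficient 1."""
    return ([1.0], [x])

x, y = 0.4, -1.3
print(inner(phi(x), phi(y)), k(x, y))  # the two values coincide
```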

Kernel Trick

The kernel trick says that any algorithm working with points $(x_i) \subset \mathbb{R}^D$ in Euclidean space that relies only on the Euclidean structure $(\mathbb{R}^D, \langle\cdot,\cdot\rangle)$ can be generalized to a kernel setting by replacing all inner products with the kernel function.
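As an example of the trick, here is a minimal kernel-PCA-style sketch of the "embed, then do PCA" idea above: the PCA computation is rewritten so that only the Gram matrix $K_{ij} = k(x_i, x_j)$ is needed, never the feature map $\varphi$ itself. The kernel and data are illustrative, not from the lecture.

```python
# Kernel PCA: PCA in feature space using only the Gram matrix.
import numpy as np

def rbf(a, b, sigma=1.0):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))          # N points in R^3
N, k = len(X), 2

K = rbf(X, X)                          # Gram matrix of the embedded points
H = np.eye(N) - np.ones((N, N)) / N    # centering in feature space
Kc = H @ K @ H

lams, V = np.linalg.eigh(Kc)           # ascending eigenvalues
lams, V = lams[::-1], V[:, ::-1]       # sort descending

# Coordinates of the N points along the top-k principal directions in H_K:
# sample i gets sqrt(lambda_m) * V[i, m] on component m.
Z = V[:, :k] * np.sqrt(np.maximum(lams[:k], 0.0))
print(Z.shape)                         # (N, k)
```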

Eiko Remark. This seems to be able to exploit the topology of $X$ and bring some non-linear structure into the game. The topology seems to be important here: if $k(x,y)$ is given by arbitrary numbers that have nothing to do with the topology of $X$ (equivalently, choosing the discrete topology), this just puts an arbitrary inner product on the huge space $\mathbb{R}X$. The point is that using the topology lowers the dimension we need to consider, without restricting us to the original space $\mathbb{R}^D$ with its Euclidean structure.