References:
Theoretical Statistics by Robert W. Keener
Probability Theory by Eiko
Foundations of Modern Probability by Olav Kallenberg
Let \(\Omega\) be a parameter space, \(\Omega = \Omega_0\cup \Omega_1\) a partition of this space, and \(\theta\in\Omega\) a parameter. Let \(X\sim \mathbb{P}(X|\theta)\) be an observation whose law depends on \(\theta\). \(H_i\) is the hypothesis that \(\theta\in \Omega_i\).
Hypothesis testing aims to tell which of the two competing hypotheses \(H_0\) or \(H_1\) is correct by observing \(X\).
A non-randomized test of \(H_0\) versus \(H_1\) can be specified by a critical region \(S\), so if \(X\in S\) we reject \(H_0\) in favor of \(H_1\).
The power function \(\beta_S:\Omega\to \mathbb{R}\) describes the probability of rejecting \(H_0\) given \(\theta\):
\[\beta_S(\theta) = \mathbb{P}(X\in S|\theta)\]
The significance level \(\alpha_S\) is the worst-case probability of falsely rejecting \(H_0\) when it is true (a Type I error), taken over the null region:
\[\alpha_S = \sup_{\theta\in \Omega_0} \beta_S(\theta).\]
In theory we would want \(\beta_S(\theta) = 1_{\Omega_1}(\theta)\), which would imply \(\alpha_S = 0\), but this is not possible in practice.
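To make these definitions concrete, here is a minimal Monte Carlo sketch for a hypothetical normal location model (this example is not from the references above): \(X\sim N(\theta,1)\), \(\Omega_0 = \{\theta\le 0\}\), \(\Omega_1 = \{\theta>0\}\), and critical region \(S = \{x > c\}\) for an arbitrary cutoff \(c\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: X ~ N(theta, 1), critical region S = {x > c}.
c = 1.645  # arbitrary cutoff, for illustration only

def power(theta, n_sim=100_000):
    """Monte Carlo estimate of beta_S(theta) = P(X in S | theta)."""
    x = rng.normal(loc=theta, scale=1.0, size=n_sim)
    return np.mean(x > c)

# Power function on a grid of parameters.
thetas = np.linspace(-2.0, 3.0, 11)
betas = [power(t) for t in thetas]

# Significance level: worst case over the null region Omega_0 = {theta <= 0},
# approximated here by a grid of null parameters.
alpha_S = max(power(t) for t in np.linspace(-2.0, 0.0, 5))

print("beta_S on grid:", np.round(betas, 3))
print("approximate alpha_S:", round(alpha_S, 3))
```

For this family the power is increasing in \(\theta\), so the supremum over \(\Omega_0\) is attained at the boundary point \(\theta = 0\).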
Sometimes, instead of giving a critical region \(S\) (or equivalently the indicator \(1_S\)), we give a critical function \(\varphi(x)\in[0,1]\), interpreted as the probability of rejecting \(H_0\) when \(X = x\). A non-randomized test is then just the special case \(\varphi = 1_S\).
In this case, the power function is
\[ \beta_\varphi(\theta) = \mathbb{E}(\varphi(X)|\theta) \]
and the significance level is
\[ \alpha_\varphi = \sup_{\theta\in \Omega_0} \beta_\varphi(\theta) = \sup_{\theta\in \Omega_0} \mathbb{E}(\varphi(X)|\theta). \]
The main advantage of randomized tests is that they are closed under (convex) linear combinations.
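Concretely, if \(\varphi_1\) and \(\varphi_2\) are critical functions and \(\lambda\in[0,1]\), then \(\lambda\varphi_1+(1-\lambda)\varphi_2\) is again a critical function, and by linearity of expectation
\[ \beta_{\lambda\varphi_1+(1-\lambda)\varphi_2}(\theta) = \lambda\,\beta_{\varphi_1}(\theta) + (1-\lambda)\,\beta_{\varphi_2}(\theta), \]
so the level of the mixture is at most \(\lambda\alpha_{\varphi_1}+(1-\lambda)\alpha_{\varphi_2}\). Nothing of this kind holds for critical regions, since a mixture of indicator functions is in general not an indicator.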
A hypothesis is simple if \(\Omega_i\) is a singleton.
Assume \(H_0\) and \(H_1\) are both simple; in this case the Neyman-Pearson Lemma describes all reasonable tests. Let \(\mu_1 = \mathbb{P}(X|\theta_1)\) and \(\mu_0 = \mathbb{P}(X|\theta_0)\) be the distributions of \(X\) under \(H_1\) and \(H_0\) respectively.
We have
\[\alpha_\varphi = \mu_0(\varphi) = \int \varphi(x) \mu_0(dx)\] \[\beta_\varphi(\theta_i) = \mu_i(\varphi) = \int \varphi(x) \mu_i(dx).\]
We would like to choose \(\varphi\) so that \(\mu_0(\varphi)\) is as close to \(0\) and \(\mu_1(\varphi)\) as close to \(1\) as possible. Consider maximizing \(\beta_\varphi(\theta_1)\) subject to \(\alpha_\varphi\le \alpha\).
Let \(k\ge 0\) be any constant. Then any \(\varphi^*\) maximizing \(\mu_1(\varphi) - k\mu_0(\varphi)\) also maximizes \(\mu_1(\varphi)\) subject to \(\mu_0(\varphi)\le \alpha\), where \(\alpha = \mu_0(\varphi^*)\).
\[\begin{align*} \varphi^* &\in \mathrm{argmax}_\varphi \left(\mu_1(\varphi) - k \mu_0(\varphi) \right) \\ &\subset \mathrm{argmax}_{\mu_0(\varphi)\le \alpha} \mu_1(\varphi). \end{align*}\]
Moreover, any function \(\varphi^{**}\) maximizing \(\mu_1(\varphi)\) subject to \(\mu_0(\varphi)\le \alpha\) must have \(\mu_0(\varphi^{**}) = \alpha\).
\[ \varphi^{**}\in \mathrm{argmax}_{\mu_0(\varphi)\le \alpha} \mu_1(\varphi) \subset \{ \varphi: \mu_0(\varphi) = \alpha \}.\]
Note that \(\varphi^*\) and \(\alpha\) depend on \(k\) here.
Proof.
Let \(\varphi^*\) be the maximizing function. It suffices to prove that \(\mu_0(\varphi)\le \alpha \Rightarrow \mu_1(\varphi)\le \mu_1(\varphi^*)\).
We have
\[ \mu_1(\varphi) - k(\mu_0(\varphi) - \alpha) \le \mu_1(\varphi^*) - k(\mu_0(\varphi^*) -\alpha)\]
Since \(\mu_0(\varphi) - \alpha \le 0\), we have \(-k(\mu_0(\varphi) - \alpha)\ge 0\), therefore
\[\begin{align*} \mu_1(\varphi) &\le \mu_1(\varphi) - k(\mu_0(\varphi) - \alpha) \\ &\le \mu_1(\varphi^*) - k(\mu_0(\varphi^*) -\alpha) \\ &= \mu_1(\varphi^*). \end{align*}\]
We know that \(\varphi^{**}\) and \(\varphi^*\) are both in \(\mathrm{argmax}_{\mu_0(\varphi)\le \alpha} \mu_1(\varphi)\). Therefore \(\mu_1(\varphi^{**})=\mu_1(\varphi^*)\). The fact that \(\varphi^*\in \mathrm{argmax}(\mu_1(\varphi) - k\mu_0(\varphi))\) implies
\[\begin{align*} \mu_1(\varphi^*) - k\mu_0(\varphi^*) &\ge \mu_1(\varphi^{**}) - k\mu_0(\varphi^{**}) \\ &= \mu_1(\varphi^*) - k\mu_0(\varphi^{**}). \end{align*}\]
Therefore \(\mu_0(\varphi^{**}) \ge \mu_0(\varphi^*) = \alpha\); combined with the constraint \(\mu_0(\varphi^{**})\le \alpha\), this gives \(\mu_0(\varphi^{**}) = \alpha\).
We know that \(\mu_1 - k\mu_0\) is a finite signed measure, so by the Hahn-Jordan decomposition it can be uniquely written as the difference of two mutually singular finite measures
\[ \mu_1 - k \mu_0 = \nu_+ - \nu_- .\]
So maximizing \(\mu_1(\varphi) - k\mu_0(\varphi)\) is equivalent to maximizing \(\nu_+(\varphi) - \nu_-(\varphi)\), from which it is clear that we can pick \(\varphi = 1_{A_+}\), where \(A_+\) is a set on which \(\nu_+\) is concentrated and \(\nu_-(A_+)=0\); we are also free to pick any value in \([0,1]\) on any set of \(|\mu_1 - k\mu_0|\)-measure zero.
If the \(\mu_i\) have densities with respect to a common dominating measure \(\mu\), then the set \(A_+\) is simply \(\left\{x: \frac{\mathrm{d} \mu_1}{\mathrm{d} \mu}(x) > k \frac{\mathrm{d} \mu_0}{\mathrm{d} \mu}(x)\right\}\). This can be seen as a slight generalization of a likelihood ratio test: ignoring the division-by-zero problem, it can be written as \(\left\{\frac{\mathrm{d} \mu_1}{\mathrm{d} \mu_0} > k\right\}\).
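A finite toy example (my own illustration, not taken from the references) makes this concrete: for two probability vectors \(\mu_0,\mu_1\) on a four-point sample space, the sketch below forms the signed measure \(\mu_1 - k\mu_0\), takes its positive and negative parts, and reads off \(A_+\) together with the level and power of \(\varphi = 1_{A_+}\).

```python
import numpy as np

# Hypothetical discrete model on {0, 1, 2, 3} (illustration only).
mu0 = np.array([0.4, 0.3, 0.2, 0.1])   # law of X under H_0
mu1 = np.array([0.1, 0.2, 0.3, 0.4])   # law of X under H_1
k = 1.0

# Signed measure mu1 - k*mu0 and its Hahn-Jordan decomposition.
signed = mu1 - k * mu0
nu_plus = np.clip(signed, 0, None)     # positive part
nu_minus = np.clip(-signed, 0, None)   # negative part

# A_+ is where the density of mu1 exceeds k times that of mu0.
A_plus = signed > 0
phi = A_plus.astype(float)             # the test 1_{A_+}

print("A_+ =", np.nonzero(A_plus)[0])                  # points where we reject H_0
print("level  mu0(phi) =", phi @ mu0)
print("power  mu1(phi) =", phi @ mu1)
print("objective mu1(phi) - k*mu0(phi) =", phi @ mu1 - k * (phi @ mu0))
print("which equals nu_+ of the whole space:", nu_plus.sum())
```

The last two lines illustrate why \(1_{A_+}\) is a maximizer: the objective can never exceed \(\nu_+\) of the whole space, and \(1_{A_+}\) attains that value.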
The Lemma states that, for a simple testing problem, given any level \(\alpha\in [0,1]\) there exists a likelihood ratio test \(\varphi_\alpha\) (meaning \(1_{L>k}\le \varphi_\alpha\le 1_{L\ge k}\) where \(L = \frac{\mathrm{d}\mu_1}{\mathrm{d}\mu_0}\), possibly modified on a set of measure zero) with level exactly \(\alpha\) (i.e. \(\mu_0(\varphi_\alpha)=\alpha\)). Such a \(\varphi_\alpha\) maximizes \(\mu_1(\varphi) - k\mu_0(\varphi)\), and any likelihood ratio test of level \(\alpha\) maximizes the power \(\beta_{\varphi}(\theta_1)\) subject to the significance level constraint \(\alpha_{\varphi} \le \alpha\).
For \(\alpha\in [0,1]\), let \(k\) be a critical value for a likelihood ratio test \(\varphi_\alpha\) in the sense of the Neyman-Pearson Lemma, i.e.
\[\varphi_\alpha = 1_{\left\{x:\frac{\mathrm{d} \mu_1}{\mathrm{d} \mu_0}(x) > k\right\}} \text{ a.e. in } |\mu_1 - k\mu_0|.\]
Then \(\mu_0(\varphi_\alpha) = \alpha\) and \(\mu_1(\varphi_\alpha) = \beta_{\varphi_\alpha}(\theta_1)\).
We have
\[\varphi^{**}\in \mathrm{argmax}_{\mu_0\le \alpha}\mu_1 \Rightarrow \varphi^{**}=\varphi_\alpha \text{ a.e. in } |\mu_1 - k\mu_0|.\]
If \(\mu_0\neq \mu_1\) or \(k\neq 1\), and \(\varphi_\alpha\) is a likelihood ratio test with level \(\alpha\in (0,1)\), then \(\mu_1(\varphi_\alpha) > \alpha\).
\[\mu_0\neq \mu_1\Rightarrow \mu_1(\varphi_\alpha)>\alpha. \]
Proof.
We already proved that \(\mu_0(\varphi^{**})=\mu_0(\varphi_\alpha) = \alpha\), and since both maximize \(\mu_1\) under this constraint,
\[ (\mu_1 - k\mu_0)(\varphi^{**}) = (\mu_1 - k\mu_0)(\varphi_\alpha),\]
i.e. \(\int(\varphi_\alpha - \varphi^{**})\,\mathrm{d}(\mu_1 - k\mu_0) = 0\). By the construction of \(\varphi_\alpha\) (it equals \(1\) where \(\mu_1 - k\mu_0\) is positive and \(0\) where it is negative), the integrand here is nonnegative pointwise, so it must vanish a.e.; this implies \(\varphi^{**}= \varphi_\alpha\) a.e. in \(|\mu_1 - k\mu_0|\).
Consider the constant test \(\varphi_c \equiv \alpha\in (0,1)\). Since \(\varphi_\alpha\in \mathrm{argmax}_{\mu_0\le\alpha}\mu_1\), we know \(\mu_1(\varphi_\alpha)\ge \mu_1(\varphi_c) = \alpha\). If equality holds, then \(\varphi_c\) is also in this set, thus \(\varphi_c = \varphi_\alpha\) a.e. in \(|\mu_1 - k\mu_0|\); but \(\varphi_\alpha\in \{0,1\}\) a.e. in \(|\mu_1 - k\mu_0|\) while \(\varphi_c \equiv \alpha\in(0,1)\), so this can only happen when \(|\mu_1 - k\mu_0| = 0\), i.e. \(\mu_1 = k\mu_0\). Since both are probability measures, this forces \(k=1\) and \(\mu_0=\mu_1\).
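Putting the pieces together, here is a sketch (my own, reusing the toy discrete model from the earlier sketch) of how one can construct a likelihood ratio test with exact level \(\alpha\): choose \(k\) so that \(\mu_0(L > k)\le \alpha\le \mu_0(L\ge k)\), then randomize on \(\{L = k\}\) with the \(\gamma\) that fills the gap.

```python
import numpy as np

def lr_test(mu0, mu1, alpha):
    """Construct (k, gamma) so that the randomized likelihood ratio test
    phi = 1 on {L > k}, gamma on {L = k}, 0 on {L < k}
    has exact level alpha under mu0 (finite sample space sketch)."""
    L = mu1 / mu0                        # likelihood ratio at each sample point
    order = np.argsort(-L)               # points sorted by decreasing L
    cum = np.cumsum(mu0[order])          # mu0-mass of the top-ranked points
    # k is the ratio at the first point whose inclusion would push the mass past alpha.
    idx = min(np.searchsorted(cum, alpha, side="right"), len(L) - 1)
    k = L[order][idx]
    above = mu0[L > k].sum()             # mass rejected with probability 1
    at_k = mu0[L == k].sum()             # mass available for randomization
    gamma = 0.0 if at_k == 0 else (alpha - above) / at_k
    return k, gamma

# Hypothetical discrete model (illustration only).
mu0 = np.array([0.4, 0.3, 0.2, 0.1])
mu1 = np.array([0.1, 0.2, 0.3, 0.4])
k, gamma = lr_test(mu0, mu1, alpha=0.25)

L = mu1 / mu0
phi = np.where(L > k, 1.0, np.where(L == k, gamma, 0.0))
print("k =", k, " gamma =", round(gamma, 3))
print("level mu0(phi) =", round(phi @ mu0, 3))   # exactly 0.25
print("power mu1(phi) =", round(phi @ mu1, 3))
```

Choosing \(k\) as an upper quantile of \(L\) under \(\mu_0\) is exactly what guarantees \(\mu_0(L>k)\le\alpha\le\mu_0(L\ge k)\), so the randomization probability \(\gamma = \frac{\alpha-\mu_0(L>k)}{\mu_0(L=k)}\) lies in \([0,1]\).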
Suppose we are testing
\[\mathbb{P}(X|\theta) = \text{Exponential}(\theta): \quad \theta e^{-\theta x}1_{x\ge 0}\,\mathrm{d}{x},\] with hypotheses \(H_0: \theta = \theta_0\) and \(H_1:\theta=\theta_1\); for simplicity assume \(\theta_1>\theta_0\). The likelihood ratio test is of the form
\[ \frac{\theta_1e^{-\theta_1x}}{\theta_0e^{-\theta_0x}} > k \Leftrightarrow x < \frac{1}{\theta_1-\theta_0}\log\frac{\theta_1}{k\theta_0} = x_k.\]
\[\alpha = \mu_0(\varphi_\alpha) = \int_0^{x_k} \theta_0e^{-\theta_0x}\,\mathrm{d}{x} = 1 - e^{-\theta_0x_k},\]
so
\[ x_k = \frac{1}{\theta_0}\log\frac{1}{1-\alpha}.\]
And the test with level \(\alpha\) is simply given by \(\varphi_\alpha = 1_{x<\frac{1}{\theta_0}\log \frac{1}{1-\alpha}}\). Some magic is happening here: this test is optimal, in that it maximizes \(\mu_1(\varphi)\) among tests of level \(\le \alpha\), yet it does not depend on \(\theta_1\)! (This is an example of a Uniformly Most Powerful test. An interesting question is: when does this happen?)
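A quick numerical check of this example (a sketch using the formulas just derived, with illustrative values of \(\theta_0\), \(\theta_1\) and \(\alpha\)): the cutoff \(x_k\) depends only on \(\theta_0\) and \(\alpha\), while the power \(\mu_1(\varphi_\alpha) = 1 - e^{-\theta_1 x_k}\) does depend on \(\theta_1\) and exceeds \(\alpha\) whenever \(\theta_1 > \theta_0\).

```python
import numpy as np

theta0, alpha = 1.0, 0.05                    # null rate and target level (illustrative values)
x_k = np.log(1.0 / (1.0 - alpha)) / theta0   # cutoff derived above; independent of theta1

# Level under H_0: P(X < x_k | theta0) = 1 - exp(-theta0 * x_k) = alpha.
level = 1.0 - np.exp(-theta0 * x_k)

# Power under various alternatives theta1 > theta0: same test, different power.
for theta1 in [1.5, 2.0, 5.0]:
    power = 1.0 - np.exp(-theta1 * x_k)
    print(f"theta1={theta1}: level={level:.3f}, power={power:.3f}")
```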
Consider a very simple random variable \(X\sim \text{Bernoulli}(p)\), with \(H_0: p=\frac{1}{2}\) and \(H_1: p=\frac{1}{4}\). The likelihood ratio is
\[ L(x) = \begin{cases} \frac{1}{2} & x=1 \\ \frac{3}{2} & x=0. \end{cases}\]
Then clearly there are \(5\) different regions of \(k\) we can take to form different tests \(\varphi = \begin{cases} 1 & L(x) > k \\ \gamma & L(x) = k \\ 0 & L(x) < k \end{cases}\)
\[ \left[0,\frac{1}{2}\right) , \left\{\frac{1}{2}\right\} , \left(\frac{1}{2},\frac{3}{2}\right) , \left\{\frac{3}{2}\right\} , \left(\frac{3}{2},\infty\right) . \]
The corresponding significance levels are
\[ \alpha = \mu_0(\varphi_{k,\gamma}) = \begin{cases} 1 & k \in [0,\frac{1}{2}) \\ 1\cdot \gamma + \frac{1}{2}\cdot (1-\gamma) & k = \frac{1}{2} \\ \frac{1}{2} & k \in (\frac{1}{2},\frac{3}{2}) \\ \frac{1}{2}\cdot \gamma + 0\cdot (1-\gamma) & k = \frac{3}{2} \\ 0 & k \in (\frac{3}{2},\infty) \end{cases}.\]
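This table can be checked mechanically; the short sketch below (my own) evaluates \(\mu_0(\varphi_{k,\gamma})\) for one \(k\) from each of the five regions and an arbitrary \(\gamma\), showing in particular how \(\gamma\) interpolates the achievable levels at \(k=\frac{1}{2}\) and \(k=\frac{3}{2}\).

```python
# Bernoulli example: mu0 = Bernoulli(1/2), mu1 = Bernoulli(1/4).
# Likelihood ratio: L(1) = (1/4)/(1/2) = 1/2,  L(0) = (3/4)/(1/2) = 3/2.
p0 = {1: 0.5, 0: 0.5}
L = {1: 0.5, 0: 1.5}

def phi(x, k, gamma):
    """Randomized LR test: reject w.p. 1 if L(x) > k, w.p. gamma if L(x) == k."""
    if L[x] > k:
        return 1.0
    if L[x] == k:
        return gamma
    return 0.0

def level(k, gamma):
    """Significance level mu0(phi_{k,gamma}) = E[phi(X) | p = 1/2]."""
    return sum(p0[x] * phi(x, k, gamma) for x in (0, 1))

gamma = 0.4  # arbitrary randomization probability, for illustration
for k in [0.25, 0.5, 1.0, 1.5, 2.0]:   # one k from each of the five regions
    print(f"k={k}: alpha = {level(k, gamma)}")
```

In particular, to hit a target level such as \(\alpha = 0.3\) one must take \(k=\frac{3}{2}\) and randomize with \(\gamma = 0.6\).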