Wed Mar 06 2024

False Discovery Rate (FDR) Control

FDR Introduction

The relationship between the true state of a hypothesis (null vs. non-null) and the test outcome (reject vs. not reject) can be summarized in the following table:

|                 | Null is True  | Null is False  |
|-----------------|---------------|----------------|
| Reject Null     | Type I Error  | True Positive  |
| Not Reject Null | True Negative | Type II Error  |

From this perspective, a false discovery looks like a Type I error; the difference lies only in the denominator.

$$P_{I} = \frac{\#FP}{N} = \alpha$$

$$FDR = E\left\{\frac{\#FP}{\#FP + \#TP}\right\} = \frac{\alpha N}{\alpha N + \#TP} = \alpha \cdot \frac{1}{\alpha + \frac{\#TP}{N}}$$

However, consider the following situation: in multiple testing with $\alpha = 0.05$, suppose that among 100 features exactly one is truly significant, while each of the remaining 99 non-significant features has a 5% chance of producing a false positive.

$$FDR = E\left\{\frac{\#FP}{\#FP + \#TP}\right\} = \frac{99 \times 0.05}{1 + 99 \times 0.05} = 83.19\%$$

This example shows that the per-test significance level alone cannot suppress the FDR, especially when the number of true positives is very low. Such situations are not uncommon in biostatistics and other high-dimensional (sparse) data.
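The plug-in calculation above can be checked with a quick simulation. This is a minimal sketch; the trial count and seed are arbitrary choices:

```python
import numpy as np

# Reproduce the 100-feature example: 99 true nulls tested at alpha = 0.05,
# plus 1 truly significant feature assumed to always be detected (#TP = 1).
rng = np.random.default_rng(0)
alpha, n_null, n_trials = 0.05, 99, 50_000

# Each trial draws 99 null p-values ~ U(0, 1) and counts false positives.
fp = (rng.uniform(size=(n_trials, n_null)) <= alpha).sum(axis=1)

mean_fp = fp.mean()            # close to 99 * 0.05 = 4.95
fdr = mean_fp / (mean_fp + 1)  # the plug-in ratio from the text
print(f"E[#FP] = {mean_fp:.2f}, FDR = {fdr:.2%}")
```

The printed FDR lands near the 83.19% computed above, even though every individual test kept its 5% Type I error rate.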

Benjamini-Hochberg Procedure (BH(q))

The most classic method for controlling the FDR is the Benjamini-Hochberg procedure. Its basic idea is: instead of a fixed significance level $\alpha$ (e.g., 0.05), use a threshold that is a function of the rank $i$ of each sorted p-value and a fixed constant $q$, so as to control the overall false discovery rate. The specific operations are as follows: sort the p-values in ascending order as $p_{(1)} \leq \dots \leq p_{(N)}$, find the largest $i$ such that $p_{(i)} \leq \frac{i}{N}q$, and reject the hypotheses corresponding to $p_{(1)}, \dots, p_{(i)}$.

It requires the $p_i$ to be independent. In this case, the expected FDR has an upper bound of $q$:

$$E\{FDR_{BH(q)}\} = \pi_0 q \leq q, \qquad \pi_0 = \frac{N_0}{N}$$
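The step-up rule can be sketched in a few lines of Python. The function name and interface here are my own; only numpy is assumed:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected by BH(q)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)                     # ranks of the sorted p-values
    thresholds = q * np.arange(1, n + 1) / n  # (i / N) * q for rank i
    below = p[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()        # largest rank meeting its bound
        reject[order[: k + 1]] = True         # reject ALL p-values up to rank k
    return reject
```

The step-up nature matters: in `benjamini_hochberg([0.01, 0.012, 0.035, 0.04, 0.6], q=0.05)` the value 0.035 exceeds its own threshold 0.03 yet is still rejected, because rank 4 satisfies $0.04 \leq 0.04$.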

Due to the similarity in definitions, FDR is often compared with FWER. The difference is that FWER is defined over all hypothesis tests, while FDR is defined over the rejected ones. FWER tends to be more conservative, as it attempts to control all errors at once, which may reduce power ($\frac{\#TP}{N_1}$, the ability to discover true significance); the FDR approach is more popular for high-dimensional data because it offers a better balance between controlling error rates and maintaining high statistical power.

The requirement that the $p_i$ be independent can be relaxed. When the $p_i$ are dependent (for example, negatively correlated), the Benjamini-Hochberg procedure remains valid after introducing a correction term:

$$l_N = \sum_{j=1}^{N} \frac{1}{j}, \qquad p_{(i)} \leq \frac{i}{N}q \times \frac{1}{l_N}$$

Proof of the Upper Bound of FDR Expectation under BH(q)

Algebra

$$R(t) = \#\{p_{(i)} \leq t\}, \quad t \in (0, 1]$$

Define $a(t)$ as the number of tests for which $p_i \leq t$ but $H_0$ is actually true. Then:

$$Fdp(t) = \frac{a(t)}{R(t)} \quad \left[=0 \text{ if } R(t)=0\right]$$

$$Q(t) = \frac{Nt}{R(t)} \quad \left[=0 \text{ if } R(t)=0\right]$$

Let $t_q = \sup_t\{Q(t) \leq q\}$; then $Q(t_q) = q$.

$$\because R(p_{(i)}) = i, \quad \therefore Q(p_{(i)}) = \frac{N p_{(i)}}{i}$$

$$\therefore p_{(i)} \leq \frac{i}{N}q \iff Q(p_{(i)}) \leq q \iff p_{(i)} \leq t_q$$

Therefore, the rejection region for $H_{0(i)}$ is $p_{(i)} \leq t_q$, giving:

$$a(t_q) = \#\{p_{(i)} \leq t_q \mid H_{0(i)} \text{ is true}\}$$

Let $A(t) = \frac{a(t)}{t}$.

Since each null p-value is uniform on $(0, 1)$, conditional on $p_i \leq t$ it is uniform on $(0, t)$, so $A(t)$ is a martingale as $t$ decreases from 1 toward 0:

$$\forall s \leq t, \quad E\{A(s) \mid A(t)\} = A(t)$$

And $a(1) = N_0$ is known; thus, by optional stopping at $t_q$:

$$E\{A(t_q)\} = A(1) = \frac{a(1)}{1} = N_0$$

Therefore:

$$Fdp(t_q) = \frac{a(t_q)}{R(t_q)} = \frac{Q(t_q)}{N t_q} \cdot a(t_q) = \frac{q}{N} \cdot A(t_q)$$

$$E\{Fdp(t_q)\} = \frac{N_0}{N} q \leq q$$
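This bound (in fact, the equality $E\{Fdp\} = \pi_0 q$ under independence) can be verified numerically. The sketch below applies the BH(q) cutoffs directly; the choices of $N$, $N_0$, $q$, and the non-null p-value distribution are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n0, q, n_trials = 100, 80, 0.1, 5_000   # 80 true nulls, 20 non-nulls

fdp = np.empty(n_trials)
for t in range(n_trials):
    # Null p-values are U(0, 1); non-null p-values are pushed toward 0.
    p = np.concatenate([rng.uniform(size=n0),
                        rng.beta(0.1, 10.0, size=n - n0)])
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, n + 1) / n
    if below.any():
        k = np.nonzero(below)[0].max()
        rejected = order[: k + 1]
        fdp[t] = np.mean(rejected < n0)    # fraction of rejections that are null
    else:
        fdp[t] = 0.0

print(f"mean Fdp = {fdp.mean():.4f}  (pi0 * q = {n0 / n * q:.4f})")
```

The average realized Fdp comes out near $\pi_0 q = 0.8 \times 0.1 = 0.08$, comfortably below $q = 0.1$.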

Visual and Algebraic Integration

The figure in the book suggests an intuitive proof: not strictly rigorous, but hard to fault.

BHq-proof.png

Since the threshold line has constant slope, we have:

$$slope = \frac{q}{N} = \frac{h}{i_{max}}$$

For independently distributed $p_i \sim U(0, 1)$ among the $N_0$ true nulls, the probability that each such $p_i$ falls below $h$, and is therefore declared positive, is $h$, so:

$$E(\#FP) = N_0 h, \qquad \#TP + \#FP = i_{max}$$

$$\therefore E(Fdp) = \frac{E(\#FP)}{\#TP + \#FP} = \frac{N_0 h}{i_{max}} = \frac{N_0}{N} q$$

Questions

The proof gives a rough feel for why the BH method can control the FDR, but some questions about the details of its criteria remain.

