Bootstrap for Confidence Interval

Obtaining a CI is not an easy, most times an extremely difficult, task. For example, even for the most widely used correlation coefficient, it is already difficult to get the confidence interval. For bivariate normal data, one can get the exact distribution of the sample correlation coefficient $r$ as

\[f(r) = \frac{(n - 2)\, \mathbf{\Gamma}(n - 1) (1 - \rho^2)^{\frac{n - 1}{2}} (1 - r^2)^{\frac{n - 4}{2}}}{\sqrt{2\pi}\, \mathbf{\Gamma}\left(n - \frac{1}{2}\right) (1 - \rho r)^{n - \frac{3}{2}}} \,\mathbf{_2F_1}\left(\frac{1}{2}, \frac{1}{2}; \frac{2n - 1}{2}; \frac{\rho r + 1}{2}\right).\]

With the sampling distribution, theoretically, one can get an exact CI for the correlation coefficient $\rho$. However, because the computational difficulty, many ways have been proposed to approximate it instead (see https://arxiv.org/pdf/1410.8165.pdf). The most widely used method is based on Fisher's z-transformation. For a sample correlation $r$, its Fisher's transformation is \[ Z =\frac{1}{2} \ln \frac{1+r}{1-r}. \] After transformation, $Z$ approximately follows a normal distribution with mean \[\frac{1}{2} \ln \frac{1+\phi}{1-\phi} \] and variance $\phi = 1/(\phi - 3)$.

To get the CI for $\rho$, we first construct a CI for \[\frac{1}{2}\ln\frac{1+\rho}{1-\rho}\] as \[\left[\frac{1}{2}\ln\frac{1+r}{1-r}-\frac{z_{1-\alpha/2}}{\sqrt{n-3}},\frac{1}{2}\ln\frac{1+r}{1-r}+\frac{z_{1-\alpha/2}}{\sqrt{n-3}}\right]\] With $Z_{\alpha}$ denoting the $100\alpha\%$ percentile of the standard normal distribution. Then, we can solve this for the the CI of $\rho$.

Bootstrap provides a useful procedure that can be used  to construct a CI. Bootstrap method is computationally intensive but can be used to get the distribution of a quantity of interest especially when the theoretical distribution of a statistic is complicated or unknown and / or the sample size is insufficient for straightforward statistical inference.

Basic idea

The basic idea of the bootstrap method is to circumvent the difficulty to get the sampling distribution of a statistic. Sometimes, one can get the theoretical distribution of such a statistic through derivation. But the underlying idea can be understood through a simple example.

Suppose we are interested in estimating a parameter called $\theta$, such as average GPA of students at University of Notre Dame. To do so, we can get a sample with sample size 100 and then use the average GPA as an estimate of $\theta$, denoted as $\hat{\theta}$. To make inference, we would need to get the sampling distribution of $\hat{\theta}$, which is the distribution of the statistic for all possible samples from the same population (students at Notre Dame) of a given size (100). Although generally it is not feasible to get all possible samples, we assume we can repeatedly get $R$ samples each with sample size 100. For each sample, we can get the average GPA as $\tilde{\theta}_i, i=1,\ldots,R$. Then, the $R$ average GPA will form a distribution, which can be viewed as an approximation of the sampling distribution. If $R$ is equal to the number of all possible samples, then it would be the exact sampling distribution. This procedure is illustrated in the figure below.

Illustration of the sampling distribution

Bootstrap is analogous to the procedure to get the sampling distribution. First, from a population, we can get a random sample with size $n$. Using the sample, we can obtain an estimate of $\theta$, denoted as $\hat{\theta}$. Second, using the random sample as a "population", we can randomly sample the subjects to form new samples with the same size. In doing so, we allow the subjects in the original sample to appear more than once in the new, bootstrapping samples. Each time we get a bootstrapping sample, we can calculate the statistic we are interested in, denoted by $\tilde{\theta}$. By repeating the procedure for $B$ times, we can get a set of $\tilde{\theta}_j, j=1,\ldots,B$. With the set of values, we can get an empirical distribution, e.g., as a histogram, for $\hat{\theta}$. Inference about the parameter can be conducted by viewing the bootstrap distribution as the "sampling" distribution. This procedure is illustrated in the figure below.

Illustration of the bootstrap distribution

Bootstrap standard error and confidence intervals

The bootstrap method is often used to obtain bootstrap standard error or confidence intervals for statistics of interest. The bootstrap standard error of the parameter estimate $\hat{{\theta}}$ can be calculated as \[s.e.(\hat{\theta}_{p})=\sqrt{\sum_{j=1}^{B}(\tilde{\theta}_{j}-\bar{\tilde{\theta}}_{j})^{2}/(B-1)}\] with \[\bar{\tilde{\theta}}=\sum_{j=1}^{B}\tilde{\theta}_{j}/B.\]

One way to construct a confidence interval is based on the standard error by assuming a normal distribution. Specially, a normal based $1-2\alpha$ CI can be constructed as \[ [\hat{\theta}-z_{1-\alpha}s.e.(\hat{\theta}),\hat{\theta}_{p}+z_{1-\alpha}s.e.(\hat{\theta})]. \]

A widely used CI is called the percentile bootstrap CI that is constructed by  \[ [\tilde{\theta}(\alpha),\tilde{\theta}(1-\alpha)] \] with $\tilde{\theta}(\alpha)$ denoting the $100\alpha$th percentile of the $B$ bootstrap estimates. Practically, this is constructed by ordering all the bootstrap estimates $\tilde{\theta}$ and then select the pair of values that are at the $100\alpha$th percentile and the $100(1-\alpha)$th percentile.

Example for correlation coefficient

As an example, we obtain the bootstrap standard error and confidence interval for the correlation between the verbal test scores at time 1 and time 2. The complete R code is given below.

> usedata('active')
> attach(active)
> ## calculate correlation for the sample
> orig.cor <- cor(hvltt, hvltt2)
> orig.cor
[1] 0.6793641
> ## calculate the sample size
> n<-length(hvltt)
> ## Bootstrap for 1000 times
> B<-1000
> boot.cor.all<-NULL
> for (i in 1:B){
+   index<-sample(1:n, replace=T)
+   boot.hvltt2<-hvltt2[index]
+   boot.hvltt<-hvltt[index]
+   boot.cor<-cor(boot.hvltt2, boot.hvltt)
+   boot.cor.all<-c(boot.cor.all, boot.cor)
+ }
> ## plot the bootstrap distribution
> hist(boot.cor.all,prob=T)
> lines(density(boot.cor.all))
> ## Bootstrap standard error
> sd(boot.cor.all)
[1] 0.0145945
> ## percentile bootstrap CI
> quantile(boot.cor.all, prob=c(0.025, 0.975))
     2.5%     97.5% 
0.6506212 0.7060827 

We now explain the code line by line.