Obtaining a CI is often not easy and at times extremely difficult. For example, even for the most widely used statistic, the correlation coefficient, it is already difficult to get a confidence interval. For bivariate normal data, one can get the exact distribution of the sample correlation coefficient $r$ as

\[f(r) = \frac{(n - 2)\, \Gamma(n - 1) (1 - \rho^2)^{\frac{n - 1}{2}} (1 - r^2)^{\frac{n - 4}{2}}}{\sqrt{2\pi}\, \Gamma\left(n - \frac{1}{2}\right) (1 - \rho r)^{n - \frac{3}{2}}} \, {}_2F_1\left(\frac{1}{2}, \frac{1}{2}; \frac{2n - 1}{2}; \frac{\rho r + 1}{2}\right).\]

With the sampling distribution, theoretically, one can get an exact CI for the correlation coefficient $\rho$. However, because of the computational difficulty, many methods have been proposed to approximate it instead (see https://arxiv.org/pdf/1410.8165.pdf). The most widely used method is based on Fisher's z-transformation. For a sample correlation $r$, its Fisher's transformation is \[ Z =\frac{1}{2} \ln \frac{1+r}{1-r}. \] After transformation, $Z$ approximately follows a normal distribution with mean \[\frac{1}{2} \ln \frac{1+\rho}{1-\rho} \] and variance $1/(n - 3)$.

To get the CI for $\rho$, we first construct a CI for \[\frac{1}{2}\ln\frac{1+\rho}{1-\rho}\] as \[\left[\frac{1}{2}\ln\frac{1+r}{1-r}-\frac{z_{1-\alpha/2}}{\sqrt{n-3}},\;\frac{1}{2}\ln\frac{1+r}{1-r}+\frac{z_{1-\alpha/2}}{\sqrt{n-3}}\right],\] with $z_{\alpha}$ denoting the $100\alpha$th percentile of the standard normal distribution. Then, we can solve the two endpoints for the CI of $\rho$.
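The back-transformation from the $z$ scale to the $\rho$ scale is simply the hyperbolic tangent. As a minimal sketch, the Fisher z CI can be computed in R as below; the values of `r` and `n` are hypothetical, for illustration only.

```r
## Sketch: 95% CI for a correlation via Fisher's z-transformation.
## r and n below are made-up example values.
r <- 0.68   # sample correlation (hypothetical)
n <- 100    # sample size (hypothetical)

z <- 0.5 * log((1 + r) / (1 - r))           # Fisher's z (same as atanh(r))
se <- 1 / sqrt(n - 3)                       # approximate standard error of z
z.ci <- z + c(-1, 1) * qnorm(0.975) * se    # CI on the z scale
rho.ci <- tanh(z.ci)                        # back-transform to the rho scale
rho.ci
```

Note that `tanh()` inverts the transformation, since $\tanh\left(\frac{1}{2}\ln\frac{1+r}{1-r}\right) = r$.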

The bootstrap provides a useful procedure for constructing a CI. The bootstrap method is computationally intensive, but it can be used to get the distribution of a quantity of interest, especially when the theoretical distribution of a statistic is complicated or unknown and/or the sample size is insufficient for straightforward statistical inference.

The basic idea of the bootstrap method is to circumvent the difficulty of deriving the sampling distribution of a statistic. Sometimes one can obtain the theoretical distribution of such a statistic through derivation, but often one cannot. The underlying idea can be understood through a simple example.

Suppose we are interested in estimating a parameter $\theta$, such as the average GPA of students at the University of Notre Dame. To do so, we can get a sample with sample size 100 and then use the average GPA as an estimate of $\theta$, denoted as $\hat{\theta}$. To make inference, we would need to get the sampling distribution of $\hat{\theta}$, which is the distribution of the statistic for *all possible samples from the same population* (students at Notre Dame) of a given size (100). Although generally it is not feasible to get all possible samples, suppose we can repeatedly draw $R$ samples, each with sample size 100. For each sample, we can get the average GPA as $\tilde{\theta}_i, i=1,\ldots,R$. Then, the $R$ average GPAs will form a distribution, which can be viewed as an approximation of the sampling distribution. If $R$ were equal to the number of all possible samples, it would be the exact sampling distribution. This procedure is illustrated in the figure below.
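The repeated-sampling idea above can be sketched in R. The "population" of GPAs here is simulated, so all numbers are assumptions for illustration only.

```r
## Sketch: approximating the sampling distribution of the mean by
## repeatedly sampling from a simulated population of GPAs.
set.seed(1)
population <- rnorm(10000, mean = 3.3, sd = 0.4)  # hypothetical GPA population

R <- 1000                    # number of repeated samples
theta.tilde <- numeric(R)    # to hold the R sample means
for (i in 1:R) {
  s <- sample(population, size = 100)  # a sample of size 100
  theta.tilde[i] <- mean(s)            # the statistic for this sample
}

## the R means approximate the sampling distribution of the mean
hist(theta.tilde, prob = TRUE)
```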

The bootstrap is analogous to this procedure for getting the sampling distribution. First, from a population, we get a random sample with size $n$. Using the sample, we obtain an estimate of $\theta$, denoted as $\hat{\theta}$. Second, using the random sample as a "population", we randomly sample subjects from it to form new samples with the same size. In doing so, we allow the subjects in the original sample to appear more than once in the new, bootstrap samples. Each time we get a bootstrap sample, we calculate the statistic we are interested in, denoted by $\tilde{\theta}$. By repeating the procedure $B$ times, we get a set of $\tilde{\theta}_j, j=1,\ldots,B$. With this set of values, we can get an empirical distribution, e.g., as a histogram, for $\hat{\theta}$. Inference about the parameter can be conducted by viewing the bootstrap distribution as the "sampling" distribution. This procedure is illustrated in the figure below.

The bootstrap method is often used to obtain a bootstrap standard error or confidence intervals for statistics of interest. The bootstrap standard error of the parameter estimate $\hat{\theta}$ can be calculated as \[s.e.(\hat{\theta})=\sqrt{\sum_{j=1}^{B}(\tilde{\theta}_{j}-\bar{\tilde{\theta}})^{2}/(B-1)}\] with \[\bar{\tilde{\theta}}=\sum_{j=1}^{B}\tilde{\theta}_{j}/B.\]
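In R, this formula is exactly what `sd()` computes on a vector of bootstrap estimates. A minimal sketch, using a made-up vector of bootstrap estimates:

```r
## Sketch: the bootstrap standard error formula applied to a vector of
## bootstrap estimates. theta.tilde is a made-up example vector.
theta.tilde <- c(0.66, 0.68, 0.70, 0.67, 0.69)
B <- length(theta.tilde)

theta.bar <- sum(theta.tilde) / B                        # mean of estimates
se <- sqrt(sum((theta.tilde - theta.bar)^2) / (B - 1))   # bootstrap s.e.
se  # identical to sd(theta.tilde)
```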

One way to construct a confidence interval is based on the standard error by assuming a normal distribution. Specifically, a normal-based $1-2\alpha$ CI can be constructed as \[ [\hat{\theta}-z_{1-\alpha}\,s.e.(\hat{\theta}),\;\hat{\theta}+z_{1-\alpha}\,s.e.(\hat{\theta})]. \]
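As a quick sketch, a normal-based 95% CI ($\alpha = 0.025$) can be computed as below; `theta.hat` and `boot.se` are hypothetical values standing in for the original estimate and its bootstrap standard error.

```r
## Sketch: a normal-based 95% bootstrap CI.
## theta.hat and boot.se below are made-up example values.
theta.hat <- 0.68   # original parameter estimate (hypothetical)
boot.se <- 0.015    # bootstrap standard error (hypothetical)

ci <- theta.hat + c(-1, 1) * qnorm(0.975) * boot.se
ci
```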

A widely used CI is the percentile bootstrap CI, which is constructed as \[ [\tilde{\theta}(\alpha),\tilde{\theta}(1-\alpha)] \] with $\tilde{\theta}(\alpha)$ denoting the $100\alpha$th percentile of the $B$ bootstrap estimates. Practically, this is constructed by ordering all the bootstrap estimates $\tilde{\theta}$ and then selecting the pair of values at the $100\alpha$th percentile and the $100(1-\alpha)$th percentile.
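The ordering step can be sketched directly with order statistics; `boot.est` below is a simulated stand-in for a set of bootstrap estimates.

```r
## Sketch: the percentile CI picks order statistics of the bootstrap
## estimates. boot.est is simulated here for illustration.
set.seed(1)
B <- 1000
boot.est <- rnorm(B, mean = 0.68, sd = 0.015)  # stand-in bootstrap estimates

sorted <- sort(boot.est)               # order the estimates
lower <- sorted[ceiling(0.025 * B)]    # value at the 2.5th percentile
upper <- sorted[ceiling(0.975 * B)]    # value at the 97.5th percentile
c(lower, upper)
```

This gives essentially the same interval as `quantile(boot.est, c(0.025, 0.975))`, up to the interpolation rule `quantile()` uses.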

As an example, we obtain the bootstrap standard error and confidence interval for the correlation between the verbal test scores at time 1 and time 2. The complete R code is given below.

```r
> usedata('active')
> attach(active)
>
> ## calculate correlation for the sample
> orig.cor <- cor(hvltt, hvltt2)
> orig.cor
[1] 0.6793641
>
> ## calculate the sample size
> n <- length(hvltt)
>
> ## Bootstrap for 1000 times
> B <- 1000
> boot.cor.all <- NULL
> for (i in 1:B){
+   index <- sample(1:n, replace=T)
+   boot.hvltt2 <- hvltt2[index]
+   boot.hvltt <- hvltt[index]
+   boot.cor <- cor(boot.hvltt2, boot.hvltt)
+   boot.cor.all <- c(boot.cor.all, boot.cor)
+ }
>
> ## plot the bootstrap distribution
> hist(boot.cor.all, prob=T)
> lines(density(boot.cor.all))
>
> ## Bootstrap standard error
> sd(boot.cor.all)
[1] 0.0145945
>
> ## percentile bootstrap CI
> quantile(boot.cor.all, prob=c(0.025, 0.975))
     2.5%     97.5%
0.6506212 0.7060827
```

We now explain the code line by line.

- `usedata('active')` is used to set the data set "active.csv" to be used in this example.
- We use `attach(active)` to attach the data frame so that the variables in the data can be accessed directly.
- The correlation between the two variables hvltt and hvltt2 is calculated using `orig.cor <- cor(hvltt, hvltt2)`, where `cor()` is the R function to calculate a correlation.
- To use the bootstrap, the data have to be sampled with replacement, so that the same datum can appear more than once in a bootstrap sample. `sample()` is a function to sample data randomly from an input vector. To sample with replacement, the option `replace=T` should be supplied. `1:n`, where `n` is the sample size, is a vector that represents the index of each subject in the data. Therefore, `index <- sample(1:n, replace=T)` will generate a set of index values of length `n`. `hvltt[index]` and `hvltt2[index]` then obtain the values of the two variables using the index from above. Therefore, each time, a new set of values for hvltt and hvltt2 is obtained and saved into `boot.hvltt` and `boot.hvltt2`, respectively.
- The correlation based on the newly sampled data is calculated and saved as `boot.cor`.
- `for ()` is a control function for loops. Using this function, the same statements, for example, the procedure for resampling and correlation calculation, can be repeated. The statements to be repeated are put into a pair of braces. `(i in 1:B)` tells R that `i` is the index for the repetition.
- Each repetition yields a bootstrap correlation. We combine all the correlations together using the R function `c()`. After running the `for` loop, the correlations are saved in `boot.cor.all`. Note that we define `boot.cor.all` as `NULL`, meaning there is no value inside it initially. Since `B <- 1000`, there will be 1000 values in `boot.cor.all` after the analysis.
- With the bootstrap correlations in `boot.cor.all`, we can plot the histogram and calculate the bootstrap standard error and percentile CI (using the `quantile()` function).