Statistical power analysis and sample size estimation are an important part of experimental design. Without power analysis, the sample size may be too large or too small. If the sample size is too small, the experiment will lack the precision to provide reliable answers to the questions it is investigating. If the sample size is too large, time and resources will be wasted, often for minimal gain. Power analysis and sample size estimation allow us to decide how large a sample is needed for statistical judgments that are accurate and reliable, and how likely a statistical test is to detect effects of a given size in a particular situation.

The power of a statistical test is the probability that the test will reject a false null hypothesis (i.e., that it will not make a Type II error). Given the null hypothesis $H_0$ and an alternative hypothesis $H_1$, we can define power in the following way. The Type I error is the probability of incorrectly rejecting the null hypothesis. Therefore

\(\text{Type I error} = \Pr(\text{Reject } H_0 | H_0 \text{ is true}).\)

The type II error is the probability of failing to reject the null hypothesis while the alternative hypothesis is correct. That is

\(\text{Type II error} = \Pr(\text{Fail to reject } H_0 | H_1 \text{ is true}).\)

Statistical power is the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. That is, power = 1 - Type II error.

\(\text{Power} = \Pr(\text{Reject } H_0 | H_1 \text{ is true}) = 1 - \text{Type II error}.\)

We can summarize these in the table below.

| | Fail to reject $H_{0}$ | Reject $H_{0}$ |
|---|---|---|
| Null Hypothesis $H_{0}$ is true | Good | Type I error |
| Alternative Hypothesis $H_{1}$ is true | Type II error | Power |
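The three quantities defined above can be estimated by simulation. The following is a minimal sketch in base R, assuming a one-sided z-test with known standard deviation; the sample size, standard deviation, and true mean under $H_1$ are illustrative assumptions, not values prescribed by the text.

```r
# Monte Carlo sketch of Type I error, Type II error, and power
# for a one-sided z-test. All numeric settings are illustrative assumptions.
set.seed(1)
n     <- 25                 # sample size of each simulated study
sigma <- 2                  # known population standard deviation
alpha <- 0.05
crit  <- qnorm(1 - alpha)   # one-sided critical value

# Returns TRUE when H0: mu = 0 is rejected for one simulated sample
reject_h0 <- function(mu) {
  x <- rnorm(n, mean = mu, sd = sigma)
  z <- mean(x) / (sigma / sqrt(n))
  z > crit
}

reps <- 10000
type1_rate <- mean(replicate(reps, reject_h0(mu = 0)))  # H0 is true
power_est  <- mean(replicate(reps, reject_h0(mu = 1)))  # H1 is true
type2_rate <- 1 - power_est

print(c(type1 = type1_rate, type2 = type2_rate, power = power_est))
```

The rejection rate when $H_0$ is true should land near the chosen alpha level, and the rejection rate when $H_1$ is true is the simulated power.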

Statistical power depends on a number of factors, but three nearly always matter: the statistical significance criterion (alpha level), the effect size, and the sample size. In general, power increases with larger sample size, larger effect size, and larger alpha level.

A significance criterion is a statement of how unlikely a result must be, if the null hypothesis is true, to be considered significant. The most commonly used criteria are probabilities of 0.05 (5%, 1 in 20), 0.01 (1%, 1 in 100), and 0.001 (0.1%, 1 in 1000). If the criterion is 0.05, the probability of obtaining the observed effect when the null hypothesis is true must be less than 0.05, and so on. One easy way to increase the power of a test is to carry out a less conservative test by using a larger significance criterion. This increases the chance of obtaining a statistically significant result (rejecting the null hypothesis) when the null hypothesis is false, that is, reduces the risk of a Type II error. But it also increases the risk of obtaining a statistically significant result when the null hypothesis is true; that is, it increases the risk of a Type I error.

The magnitude of the effect of interest in the population can be quantified in terms of an effect size, where there is greater power to detect larger effects. An effect size can be a direct estimate of the quantity of interest, or it can be a standardized measure that also accounts for the variability in the population. For example, in an analysis comparing outcomes in a treated and control population, the difference of outcome means $\mu_1 - \mu_2$ would be a direct measure of the effect size, whereas $(\mu_1 - \mu_2)/\sigma$, where $\sigma$ is the common standard deviation of the outcomes in the treated and control groups, would be a standardized effect size. If constructed appropriately, a standardized effect size, along with the sample size, will completely determine the power. An unstandardized (direct) effect size will rarely be sufficient to determine the power, as it does not contain information about the variability in the measurements.

The sample size determines the amount of sampling error inherent in a test result. Other things being equal, effects are harder to detect in smaller samples. Increasing sample size is often the easiest way to boost the statistical power of a test. However, a large sample size would require more resources to achieve, which might not be possible in practice.
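The direction of each of the three effects can be checked numerically. This is a hedged sketch that assumes a one-sided z-test with known standard deviation, for which power has a closed form; the helper `power_z` and the grids of values are hypothetical choices made for illustration.

```r
# Power of a one-sided z-test as a function of the standardized
# effect size (es), sample size (n), and alpha level.
# Assumption: normal data with known standard deviation.
power_z <- function(es, n, alpha = 0.05) {
  1 - pnorm(qnorm(1 - alpha) - es * sqrt(n))
}

# Vary one factor at a time, holding the others fixed
by_alpha <- sapply(c(0.001, 0.01, 0.05), function(a) power_z(0.5, 25, a))
by_es    <- sapply(c(0.2, 0.5, 0.8),     function(e) power_z(e, 25))
by_n     <- sapply(c(10, 25, 50),        function(n) power_z(0.5, n))

print(round(by_alpha, 3))  # increases with alpha
print(round(by_es, 3))     # increases with effect size
print(round(by_n, 3))      # increases with sample size
```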

Many other factors can influence statistical power. First, increasing the reliability of data can increase power. The precision with which the data are measured influences statistical power. Consequently, power can often be improved by reducing the measurement error in the data. A related concept is to improve the "reliability" of the measure being assessed (as in psychometric reliability).

Second, the design of an experiment or observational study often influences the power. For example, in a two-sample testing situation with a given total sample size \(n\), it is optimal to have equal numbers of observations from the two populations being compared (as long as the variances in the two populations are the same). In regression analysis and Analysis of Variance, there is an extensive theory, and practical strategies, for improving the power based on optimally setting the values of the independent variables in the model.
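The claim about equal allocation can be illustrated with a short calculation. The sketch below assumes a two-sample z-test with equal, known variances and uses a normal approximation that ignores the negligible lower rejection tail; the total sample size and the effect size are assumed values.

```r
# Power of a two-sample comparison for different splits of a fixed total n.
# Assumptions: equal known variances, normal approximation, d = 0.5.
total_n <- 100
d       <- 0.5
n1 <- seq(10, 90, by = 10)
n2 <- total_n - n1

# Standard error of the standardized mean difference
se    <- sqrt(1 / n1 + 1 / n2)
power <- 1 - pnorm(qnorm(0.975) - d / se)

print(data.frame(n1, n2, power = round(power, 3)))
best_n1 <- n1[which.max(power)]   # split with the highest power
```

The standard error is smallest when the two groups are the same size, so the balanced split gives the highest power under these assumptions.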

Third, for longitudinal studies, power increases with the number of measurement occasions. Power may also be related to the measurement intervals.

Fourth, missing data reduce the sample size and thus the power. Furthermore, different missing data patterns can lead to different power.

To ensure a statistical test will have adequate power, we usually must perform special analyses prior to running the experiment, to calculate how large an \(n\) is required. Although there are no formal standards for power, most researchers assess the power using 0.80 as a standard for adequacy. This convention implies a four-to-one trade off between Type II error and Type I error.

We now use a simple example to illustrate how to calculate power and sample size. More complex power analyses can be conducted in a similar way.

Suppose a researcher is interested in whether training can improve mathematical ability. S/he can conduct a study to collect the math test scores of a group of students before and after training. The null hypothesis here is that the change is 0. S/he believes that the change should be 1 unit. Thus, the alternative hypothesis is that the change is 1.

\begin{eqnarray*} H_{0}:\mu & = & \mu_{0}=0 \\ H_{1}:\mu & = & \mu_{1}=1 \end{eqnarray*}

Based on the definition of power, we have

\begin{eqnarray*} \mbox{Power} & = & \Pr(\mbox{reject }H_{0}|\mu=\mu_{1})\\ & = & \Pr(\mbox{change (}d\mbox{) is larger than critical value under }H_{0}|\mu=\mu_{1})\\ & = & \Pr(d>\mu_{0}+c_{\alpha}s/\sqrt{n}|\mu=\mu_{1}) \end{eqnarray*}

where

- $\mu_{0}$ is the population value under the null hypothesis
- $\mu_{1}$ is the population value under the alternative hypothesis
- $s$ is the population standard deviation under the null hypothesis.
- $c_{\alpha}$ is the critical value for a distribution, such as the standard normal distribution.
- $n$ is the sample size.

Clearly, to calculate the power, we need to know $\mu_{0},\mu_{1},s,c_{\alpha}$, the sample size $n$, and the distributions of $d$ under both the null and the alternative hypothesis. Let's assume that $\alpha=.05$ (one-sided, so the critical value is the 95th percentile of the standard normal distribution, 1.645) and that $d$ is normally distributed with the same standard deviation $s$ under both hypotheses. Then the above power is

\begin{eqnarray*} \mbox{Power} & = & \Pr(d>\mu_{0}+c_{.95}s/\sqrt{n}|\mu=\mu_{1})\\ & = & \Pr(d>\mu_{0}+1.645\times s/\sqrt{n}|\mu=\mu_{1})\\ & = & \Pr(\frac{d-\mu_{1}}{s/\sqrt{n}}>-\frac{(\mu_{1}-\mu_{0})}{s/\sqrt{n}}+1.645|\mu=\mu_{1})\\ & = & 1-\Phi\left(-\frac{(\mu_{1}-\mu_{0})}{s/\sqrt{n}}+1.645\right)\\ & = & 1-\Phi\left(-\frac{(\mu_{1}-\mu_{0})}{s}\sqrt{n}+1.645\right) \end{eqnarray*}

Thus, power is related to the sample size $n$, the significance level $\alpha$, and the effect size $(\mu_{1}-\mu_{0})/s$. If we assume $s=2$, then the effect size is .5. With a sample size of 100, the power from the above formula is above .999. In addition, we can solve the equation for the sample size $n$ at a given power. For example, for a power of 0.8, we get $n \approx 24.7$, so a sample size of 25 is needed.
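The closed-form expression can be evaluated directly in base R. The sketch below uses the values from the example ($\mu_0=0$, $\mu_1=1$, $s=2$, one-sided $\alpha=.05$); the variable and function names are illustrative.

```r
# Power and sample size for the one-sided z-test in the example
mu0 <- 0; mu1 <- 1; s <- 2; alpha <- 0.05
es  <- (mu1 - mu0) / s                     # standardized effect size, 0.5

# Power = 1 - Phi(-es * sqrt(n) + c_alpha)
power_at_n <- function(n) 1 - pnorm(-es * sqrt(n) + qnorm(1 - alpha))
power_100  <- power_at_n(100)

# Solve the same equation for n at a target power of 0.8
n_needed  <- ((qnorm(1 - alpha) + qnorm(0.80)) / es)^2
n_rounded <- ceiling(n_needed)             # round up to a whole participant

print(c(power_100 = power_100, n_needed = n_needed, n_rounded = n_rounded))
```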

The R package `webpower` has functions to conduct power analysis for a variety of models. We now show how to use it.

Correlation measures whether and how a pair of variables are related. In correlation analysis, we estimate a sample correlation coefficient, such as the Pearson Product Moment correlation coefficient (\(r\)). Values of the correlation coefficient are always between -1 and +1 and quantify the direction and strength of an association.

The correlation itself can be viewed as an effect size. The correlation coefficient is a standardized metric, and effects reported in the form of \(r\) can be directly compared. According to Cohen (1988), a correlation coefficient of 0.10 (0.1-0.23) is considered to represent a weak or small association; a correlation coefficient of 0.30 (0.24-0.36) is considered a moderate correlation; and a correlation coefficient of 0.50 (0.37 or higher) is considered to represent a strong or large correlation.

We can obtain the sample size for a significant correlation at a given alpha level, or the power for a given sample size, using the function `wp.correlation()` from the R package `webpower`. The function has the form `wp.correlation(n = NULL, r = NULL, power = NULL, p = 0, rho0=0, alpha = 0.05, alternative = c("two.sided", "less", "greater"))`. Intuitively, `n` is the sample size and `r` is the effect size (correlation). If we provide values for `n` and `r` and set `power` to `NULL`, we can calculate the power. On the other hand, if we provide values for `power` and `r` and set `n` to `NULL`, we can calculate the sample size.

A student wants to study the relationship between stress and health. Based on her prior knowledge, she expects the two variables to be correlated with a correlation coefficient of 0.3. If she plans to collect data from 50 participants and measure their stress and health, what is the power for her to obtain a significant correlation using such a sample? Using R, we can easily see that the power is 0.573.

    > library(webpower)
    Loading required package: MASS
    Loading required package: lme4
    Loading required package: Matrix
    Loading required package: lavaan
    This is lavaan 0.5-23.1097
    lavaan is BETA software! Please report any bugs.
    Loading required package: parallel
    > wp.correlation(n=50, r=0.3)
    Power for correlation

         n   r alpha     power
        50 0.3  0.05 0.5728731

    URL: http://psychstat.org/correlation

A power curve is a line plot of statistical power against sample size. In the example above, the power is 0.573 with a sample size of 50. What is the power for a different sample size, say, 100? One can investigate the power at different sample sizes and plot a power curve. To do so, we can specify a set of sample sizes. The power curve can be used for interpolation. For example, to get a power of 0.8, we need a sample size of about 85.

    > library(webpower)
    > example=wp.correlation(n=seq(50,100,10), r=0.3, alternative = "two.sided")
    > example
    Power for correlation

          n   r alpha     power
         50 0.3  0.05 0.5728731
         60 0.3  0.05 0.6541956
         70 0.3  0.05 0.7230482
         80 0.3  0.05 0.7803111
         90 0.3  0.05 0.8272250
        100 0.3  0.05 0.8651692

    URL: http://psychstat.org/correlation
    > plot(example,type='b')

In practice, a power of 0.8 is often desired. Given the power, the sample size can also be calculated as shown in the R output below. In the output, we can see that a sample size of 84 (83.95 rounded up to the next integer) is needed to obtain a power of 0.8.

    > library(webpower)
    > wp.correlation(n=NULL,r=0.3, power=0.8)
    Power for correlation

               n   r alpha power
        83.94932 0.3  0.05   0.8

    URL: http://psychstat.org/correlation
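As a rough cross-check that needs no extra package, the power of a correlation test can be approximated with the Fisher z transformation. This is a swapped-in textbook approximation, not the method `wp.correlation()` uses, so the numbers agree only approximately with the output above; the helper names are hypothetical.

```r
# Fisher z approximation to the power of a two-sided test of rho = 0.
# Assumption: atanh(r) is approximately normal with SE 1 / sqrt(n - 3).
power_r_fisher <- function(n, r, alpha = 0.05) {
  z    <- atanh(r)
  se   <- 1 / sqrt(n - 3)
  crit <- qnorm(1 - alpha / 2)
  pnorm(z / se - crit) + pnorm(-z / se - crit)
}

# Approximate sample size for a target power (ignores the far rejection tail)
n_r_fisher <- function(r, power, alpha = 0.05) {
  ((qnorm(1 - alpha / 2) + qnorm(power)) / atanh(r))^2 + 3
}

p50 <- power_r_fisher(n = 50, r = 0.3)
n80 <- n_r_fisher(r = 0.3, power = 0.8)
print(c(power_n50 = p50, n_for_power_0.8 = n80))
```

Both values land close to, but not exactly on, the `wp.correlation()` results of 0.573 and 84, which is expected for an approximation.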

A t-test is a statistical hypothesis test in which the test statistic follows a Student's t distribution if the null hypothesis is true, and a non-central t distribution if the alternative hypothesis is true. The t-test can assess the statistical significance of the difference between a population mean and a specific value, the difference between two independent population means, and the difference between means of matched pairs (dependent population means).

The effect size for a t-test is defined as

\[d=\frac{|\mu_{1}-\mu_{2}|}{\sigma}\]

where $\mu_{1}$ is the mean of the first group, $\mu_{2}$ is the mean of the second group and $\sigma^{2}$ is the common error variance. In practice, there are many ways to estimate the effect size. One is Cohen's \(d\), which is the sample mean difference divided by the pooled standard deviation. For Cohen's \(d\), an effect size of 0.2 to 0.3 is a small effect, around 0.5 a medium effect, and 0.8 or above a large effect. Note that the definition of small, medium, and large effect sizes is relative.
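Cohen's \(d\) as described above can be estimated from two samples in a few lines. The data below are made up for illustration, and `cohens_d` is a hypothetical helper name.

```r
# Cohen's d: absolute mean difference divided by the pooled standard deviation
cohens_d <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  pooled_sd <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  abs(mean(x) - mean(y)) / pooled_sd
}

# Illustrative (made-up) scores for two small groups
treated <- c(12, 15, 14, 10, 13)
control <- c(11, 12, 10, 9, 13)

d_hat <- cohens_d(treated, control)
print(d_hat)
```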

The power analysis for the t-test can be conducted using the function `wp.t()`.

To test the effectiveness of a training intervention, a researcher plans to recruit a group of students and test them before and after training. Suppose the expected effect size is 0.3. How many participants are needed to achieve a power of 0.8?

    > library(webpower)
    > wp.t(n1=NULL, d=.3, power=0.8, type='paired')
    Paired t-test

               n   d alpha power
        89.14936 0.3  0.05   0.8

    NOTE: n is number of *pairs*
    URL: http://psychstat.org/ttest

For the above example, suppose the researcher would like to recruit two groups of participants, one group receiving training and the other not. What would be the required sample size based on a balanced design (two groups are of the same size)?

    > library(webpower)
    > wp.t(n1=NULL, d=.3, power=0.8, type='two.sample')
    Two-sample t-test

               n   d alpha power
        175.3847 0.3  0.05   0.8

    NOTE: n is number in *each* group
    URL: http://psychstat.org/ttest

For the above example, if one group has a size of 100 and the other a size of 250, what would be the power?

    > library(webpower)
    > wp.t(n1=100, n2=250, d=.3, power=NULL, type='two.sample.2n')
    Unbalanced two-sample t-test

         n1  n2   d alpha     power
        100 250 0.3  0.05 0.7151546

    NOTE: n1 and n2 are number in *each* group
    URL: http://psychstat.org/ttest2n
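The three t-test results can be cross-checked with base R alone. The sketch below uses `power.t.test()` from the `stats` package for the paired and balanced cases, taking `delta = 0.3` with `sd = 1` so that the ratio equals the standardized effect size (for the paired case, `sd` is the standard deviation of the differences). The unbalanced case is computed from the non-central t distribution; this is a standard construction and is assumed, not confirmed, to match what `wp.t()` does internally.

```r
# Paired design: n is the number of pairs
paired <- power.t.test(delta = 0.3, sd = 1, power = 0.8, type = "paired")

# Balanced two-sample design: n is the number in each group
two_sample <- power.t.test(delta = 0.3, sd = 1, power = 0.8,
                           type = "two.sample")

# Unbalanced two-sample design via the non-central t distribution
n1 <- 100; n2 <- 250; d <- 0.3
df    <- n1 + n2 - 2
ncp   <- d * sqrt(n1 * n2 / (n1 + n2))   # non-centrality parameter
tcrit <- qt(0.975, df)                   # two-sided critical value
power_unbal <- pt(tcrit, df, ncp = ncp, lower.tail = FALSE) +
  pt(-tcrit, df, ncp = ncp)

print(c(n_pairs = paired$n, n_per_group = two_sample$n,
        power_unbalanced = power_unbal))
```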

One-way analysis of variance (one-way ANOVA) is a technique used to compare means of two or more groups (e.g., Maxwell et al., 2003). The ANOVA tests the null hypothesis that samples in two or more groups are drawn from populations with the same mean values.

The statistic $f$ can be used as a measure of effect size for one-way ANOVA as in Cohen (1988, p. 275). The $f$ is the ratio between the standard deviation of the effect to be tested $\sigma_{b}$ (or the standard deviation of the group means, or between-group standard deviation) and the common standard deviation within the populations (or the standard deviation within each group, or within-group standard deviation) $\sigma_{w}$ such that

\[f=\frac{\sigma_{b}}{\sigma_{w}}.\]

Given the two quantities $\sigma_{b}$ and $\sigma_{w}$, the effect size can be determined. Cohen defined the size of the effect as small (0.1), medium (0.25), and large (0.4).
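As a small worked illustration, $f$ can be computed from hypothesized group means and a within-group standard deviation. The values below are made up, and the sketch assumes equal group sizes and the population form of the standard deviation of the means (dividing by the number of groups).

```r
# Effect size f for one-way ANOVA from hypothesized group means.
# Assumptions: equal group sizes; illustrative means and within-group SD.
group_means <- c(10, 11, 12, 13)
sigma_w     <- 4                      # common within-group standard deviation

# Between-group SD: population-form SD of the group means (divide by k)
sigma_b <- sqrt(mean((group_means - mean(group_means))^2))
f <- sigma_b / sigma_w

print(c(sigma_b = sigma_b, f = f))
```

With these assumed values, $f$ falls slightly above Cohen's medium benchmark of 0.25.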

The power analysis for one-way ANOVA can be conducted using the function `wp.anova()`.

A student hypothesizes that freshman, sophomore, junior and senior college students have different attitudes towards obtaining arts degrees. Based on his prior knowledge, he expects that the effect size is about 0.25. If he plans to interview 25 students in each of the four groups about their attitudes (100 students in total), what is the power for him to find a significant difference among the four groups?

    > library(webpower)
    > wp.anova(f=0.25,k=4,n=100,alpha=0.05)
    Power for One-way ANOVA

        k   n    f alpha     power
        4 100 0.25  0.05 0.5181755

    NOTE: n is the total sample size (overall)
    URL: http://psychstat.org/anova

One can also calculate the minimum detectable effect to achieve certain power given a sample size. For the above example, we can see that to get a power 0.8 with the sample size 100, the population effect size has to be at least 0.337.

    > library(webpower)
    > wp.anova(f=NULL,k=4,n=100,power=0.8, alpha=0.05)
    Power for One-way ANOVA

        k   n         f alpha power
        4 100 0.3369881  0.05   0.8

    NOTE: n is the total sample size (overall)
    URL: http://psychstat.org/anova
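The same two quantities can be approximated in base R with the non-central F distribution. The sketch assumes the usual non-centrality parameter $N f^{2}$ with $k-1$ and $N-k$ degrees of freedom, where $N$ is the total sample size; the helper `anova_power` is hypothetical and is not taken from `webpower`.

```r
# One-way ANOVA power from the non-central F distribution.
# Assumption: ncp = n_total * f^2, df1 = k - 1, df2 = n_total - k.
anova_power <- function(f, k, n_total, alpha = 0.05) {
  df1  <- k - 1
  df2  <- n_total - k
  crit <- qf(1 - alpha, df1, df2)
  pf(crit, df1, df2, ncp = n_total * f^2, lower.tail = FALSE)
}

# Power for f = 0.25 with 4 groups and 100 participants in total
p_anova <- anova_power(f = 0.25, k = 4, n_total = 100)

# Minimum detectable f for power 0.8, found by root search
f_min <- uniroot(function(f) anova_power(f, 4, 100) - 0.8,
                 interval = c(0.01, 1), tol = 1e-8)$root

print(c(power = p_anova, f_min = f_min))
```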

Linear regression is a statistical technique for examining the relationship between one or more independent variables and one dependent variable. The independent variables are often called predictors or covariates, while the dependent variable is also called the outcome variable or criterion. Although regression is commonly used to test linear relationships between continuous predictors and an outcome, it may also test interactions between predictors and involve categorical predictors by utilizing dummy or contrast coding.

We use the effect size measure \(f^{2}\) proposed by Cohen (1988, p.410) as the measure of the regression effect size. Cohen discussed the effect size in three different cases, which actually can be generalized using the idea of a full model and a reduced model by Maxwell et al. (2003). The \(f^{2}\) is defined as

\[f^{2}=\frac{R_{Full}^{2}-R_{Reduced}^{2}}{1-R_{Full}^{2}},\]

where \(R_{Full}^{2}\) and \(R_{Reduced}^{2}\) are R-squared for the full and reduced models respectively. Suppose we are evaluating the impact of one set of predictors (B) above and beyond a second set of predictors (A). Then \(R_{Full}^{2}\) is variance accounted for by variable set A and variable set B together and \(R_{Reduced}^{2}\) is variance accounted for by variable set A only.

Cohen suggests \(f^{2}\) values of 0.02, 0.15, and 0.35 represent small, medium, and large effect sizes. The power analysis for linear regression can be conducted using the function `wp.regression()`.

A researcher believes that a student's high school GPA and SAT score can explain 50% of the variance of her/his college GPA. If she/he has a sample of 100 students, what is her/his power to find a significant relationship between college GPA and high school GPA and SAT?

In this case, the \(R_{Full}^{2} = 0.5\) for the model with both predictors (p1=2). Since the interest is about both predictors, the reduced model would be a model without any predictors (p2=0). Therefore, \(R_{Reduced}^{2}=0\). Then, the effect size $f^2=1$. Given the sample size, we can see the power is 1.

    > library(webpower)
    > wp.regression(n=100, p1=2, f2=1)
    Power for multiple regression

          n p1 p2 f2 alpha power
        100  2  0  1  0.05     1

    URL: http://psychstat.org/regression

Another researcher believes that, in addition to a student's high school GPA and SAT score, the quality of the recommendation letter is also important for predicting college GPA. Based on a literature review, the quality of the recommendation letter can explain an additional 5% of the variance of college GPA. In order to find a significant relationship between college GPA and the quality of the recommendation letter above and beyond high school GPA and SAT score with a power of 0.8, what is the required sample size?

In this case, \(R_{Full}^{2} = 0.55\) for the model with all three predictors (p1=3). Since the interest is in the recommendation letter, the reduced model is the model with SAT and GPA only (p2=2). Therefore, \(R_{Reduced}^{2}=0.5\). Then, the effect size is $f^2=(0.55-0.5)/(1-0.55)=0.111$. Given the required power of 0.8, the resulting sample size is 75.

    > library(webpower)
    > wp.regression(n=NULL, p1=3, p2=2, f2=0.111, power=0.8)
    Power for multiple regression

               n p1 p2    f2 alpha power
        74.68203  3  2 0.111  0.05   0.8

    URL: http://psychstat.org/regression
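The effect sizes passed to `wp.regression()` in the two examples follow directly from the definition of \(f^{2}\). A minimal sketch of that arithmetic, with a hypothetical helper name:

```r
# f^2 from the R-squared of the full and reduced models
f2_from_r2 <- function(r2_full, r2_reduced = 0) {
  (r2_full - r2_reduced) / (1 - r2_full)
}

# Example 1: both predictors tested against a model with no predictors
f2_all <- f2_from_r2(r2_full = 0.5, r2_reduced = 0)

# Example 2: recommendation letter above and beyond GPA and SAT
f2_letter <- f2_from_r2(r2_full = 0.55, r2_reduced = 0.5)

print(c(f2_all = f2_all, f2_letter = round(f2_letter, 3)))
```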