Null hypothesis testing is a procedure for evaluating the strength of evidence against a null hypothesis. Assuming the null hypothesis is true, we evaluate the probability of obtaining the observed evidence, or evidence more extreme, when the study is conducted on a randomly selected, representative sample. The null hypothesis assumes no difference/relationship/effect in the population from which the sample is drawn. This probability is measured by a $p$ value. If the $p$ value is small enough, we reject the null hypothesis. In the significance testing approach of Ronald Fisher, a null hypothesis is rejected on the basis of data that are significantly unlikely if the null is true. However, the null hypothesis is never accepted or proved. This is analogous to a criminal trial: the defendant is assumed to be innocent (null is not rejected) until proven guilty (null is rejected) beyond a reasonable doubt (to a statistically significant degree).

To conduct a typical null hypothesis test, the following 7 steps can be followed:

- State the research question
- State the null and alternative hypotheses based on the research question
- Select a value for significance level \(\alpha\)
- Collect or locate data
- Calculate the test statistic and the p value
- Make a decision on rejecting or failing to reject the hypothesis
- Answer the research question

A hypothesis test is used to answer a question. Therefore, the first step is to state a research question. For example, in the ACTIVE study, a research question could be "Does memory training improve participants' performance on a memory test?"

Based on the research question, one then forms the null and the alternative hypotheses. For example, to answer the research question in Step 1, we would need to compare the memory test score for two groups of participants, those who receive training and those who do not. Let \(\mu_1\) and \(\mu_2\) be the population means of the two groups.

The **null** hypothesis \(H_0\) should be a statement about parameter(s) and of "no effect" or "no difference":

\[ H_{0}:\;\mu_{1}=\mu_{2}\mbox{ or }\mu_{1}-\mu_{2}=0.\]

The **alternative** hypothesis \(H_1\) or \(H_a\) is the statement we hope or suspect is true. In this example, we hope the training group has a higher score than the control group, therefore, our alternative hypothesis would be

\[ H_{a}:\:\mu_{1}>\mu_{2}\mbox{ or }\mu_{1}-\mu_{2}>0. \]

But note that it is cheating to first look at the data and then frame \(H_a\) to fit what the data show. If we do not have a direction firmly in mind in advance, we must use a two-sided (default) alternative hypothesis such that

\[H_{a}:\:\mu_{1}\neq\mu_{2}\mbox{ or }\mu_{1}-\mu_{2}\neq0.\]

Hypothesis testing evaluates the strength of evidence against the null hypothesis. Given that the null hypothesis is true, we calculate the probability of obtaining the observed evidence or more extreme evidence; this probability is the p value. If the p value is small enough, we reject the null. In practice, 0.05 is commonly considered small, but other values can be used; for example, a group of researchers recently recommended using 0.005 instead (Benjamin et al., 2017). This threshold is called the significance level, often denoted by \(\alpha\), and it should be decided before data analysis. If \(p\leq\alpha\), we reject the null hypothesis; if \(p>\alpha\), we fail to reject the null because the evidence is insufficient to support a conclusion.
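Although the analyses in this chapter are done in R, the decision rule itself is simple enough to sketch in a few lines of Python. The \(\alpha\) and p value below are hypothetical numbers chosen only for illustration:

```python
alpha = 0.05       # significance level, chosen before the analysis
p_value = 0.012    # hypothetical p value returned by a test

# Reject the null only when the p value is at or below the significance level
if p_value <= alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(decision)
```

With these numbers the test rejects the null; with the stricter \(\alpha = 0.005\) suggested by Benjamin et al., the same p value would not lead to rejection.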

In this step, we can conduct an experiment to collect data or we can use existing data. Note that even if the data already exist, we should not form our hypothesis by peeking at the data.

The ACTIVE study has data on memory training, so we use those data as an example. The following code gets the data for the training group and the control group. `hvltt2` has information on all 4 groups (memory=1, reasoning=2, speed=3, control=4). Note that we use `hvltt2[group==1]` to select a subset of data from `hvltt2`. This means we want the data from `hvltt2` when the `group` value is equal to 1. Similarly, we select the data for the control group.

When the null hypothesis is true, the population mean difference is zero (\(\mu_{1}-\mu_{2}=0\)). Based on our data, the observed mean difference for the two groups is \(\bar{x}_{1}-\bar{x}_{2} = 1.54\). To conduct a test, we need to calculate the probability of drawing a random sample with a difference of 1.54 or more extreme when \(H_{0}\) is true. That is,

\[\Pr(\bar{x}_{1}-\bar{x}_{2}\geq1.54|\:\mu_{1}-\mu_{2}=0)=?\]

In obtaining the above probability, we need to know the sampling distribution of \(\bar{x}_{1}-\bar{x}_{2}\), which leads to the $t$ distribution in a $t$ test. We calculate a test statistic

\[t=\frac{\bar{x}_{1}-\bar{x}_{2}}{s}\]

where \(s\) and the distribution of \(t\) need to be determined.

When the two population variances of the two groups are not equal (the two sample sizes may or may not be equal), the \(t\) statistic to test whether the population means are different is calculated as:

\[t=\frac{\bar{x}_{1}-\bar{x}_{2}}{s_{\overline{\Delta}}}\]

where

\[s_{\overline{\Delta}}=\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}.\]

Here, \(s_{1}^{2}\) and \(s_{2}^{2}\) are the unbiased estimators of the variances of the two samples with \(n_{k}\) = number of participants in group \(k\) = 1 or 2. For use in significance testing, the distribution of the test statistic is approximated as an ordinary Student's \(t\) distribution with the degrees of freedom calculated as

\[\mathrm{d.f.}=\frac{(s_{1}^{2}/n_{1}+s_{2}^{2}/n_{2})^{2}}{(s_{1}^{2}/n_{1})^{2}/(n_{1}-1)+(s_{2}^{2}/n_{2})^{2}/(n_{2}-1)}.\]

This is known as the Welch-Satterthwaite equation. The true distribution of the test statistic actually depends (slightly) on the two unknown population variances.
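As a numeric check of the two formulas above, the following Python sketch computes the Welch \(t\) statistic and the Welch-Satterthwaite degrees of freedom from summary statistics. The means, variances, and sample sizes here are made up for illustration; they are not the ACTIVE data:

```python
import math

# Made-up summary statistics for two groups (NOT the ACTIVE data):
# sample mean, unbiased sample variance, and sample size for each group
mean1, var1, n1 = 25.2, 32.0, 600
mean2, var2, n2 = 23.6, 28.0, 700

# Standard error of the mean difference when the variances are unequal
se = math.sqrt(var1 / n1 + var2 / n2)
t = (mean1 - mean2) / se

# Welch-Satterthwaite degrees of freedom
df = (var1 / n1 + var2 / n2) ** 2 / (
    (var1 / n1) ** 2 / (n1 - 1) + (var2 / n2) ** 2 / (n2 - 1)
)

print(round(t, 3), round(df, 1))
```

Note that the degrees of freedom are generally not an integer and fall between \(\min(n_1, n_2) - 1\) and \(n_1 + n_2 - 2\).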

In R, the function `t.test()` can be used to conduct a t test. The following code conducts Welch's t test. Note that `alternative = "greater"` sets the alternative hypothesis. The other options include `"two.sided"` and `"less"`.

```r
> usedata('active')
> attach(active)
> 
> training <- hvltt2[group==1]
> control <- hvltt2[group==4]
> 
> mean(training, na.rm=T)-mean(control, na.rm=T)
[1] 1.538577
> 
> t.test(training, control, alternative = 'greater')

	Welch Two Sample t-test

data:  training and control
t = 4.6022, df = 1272.7, p-value = 2.299e-06
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.9882856       Inf
sample estimates:
mean of x mean of y 
 25.15493  23.61635 
```

When the two groups are assumed to have the same population variance, the \(t\) statistic can be calculated as follows:

\[t=\frac{\bar{x}_{1}-\bar{x}_{2}}{s_{p}\cdot\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}}\]

where

\[s_{p}=\sqrt{\frac{(n_{1}-1)s_{1}^{2}+(n_{2}-1)s_{2}^{2}}{n_{1}+n_{2}-2}}\]

is an estimator of the pooled standard deviation of the two samples. Here \(n_{k}-1\) is the degrees of freedom for group \(k\), and the total sample size minus two (\(n_{1}+n_{2}-2\)) is the total number of degrees of freedom, which is used in significance testing.
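The pooled formulas can be checked numerically in the same spirit. This Python sketch uses made-up summary statistics (not the ACTIVE data) to compute the pooled standard deviation and the pooled \(t\) statistic:

```python
import math

# Made-up summary statistics for two groups (NOT the ACTIVE data)
mean1, var1, n1 = 25.2, 32.0, 600
mean2, var2, n2 = 23.6, 28.0, 700

# Pooled standard deviation: each group's variance is weighted by its
# degrees of freedom (n_k - 1)
sp = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

# Pooled two-sample t statistic, with n1 + n2 - 2 degrees of freedom
t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(round(sp, 4), round(t, 3), df)
```

Because these two sample variances are similar, the pooled \(t\) statistic comes out close to the Welch version; the two tests differ more when the variances and sample sizes are very unequal.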

The pooled two independent sample t test can also be conducted using the `t.test()` function by setting the option `var.equal=T` (or `TRUE`).

```r
> usedata('active')
> attach(active)
> 
> training <- hvltt2[group==1]
> control <- hvltt2[group==4]
> 
> t.test(training, control, alternative = 'greater', var.equal=T)

	Two Sample t-test

data:  training and control
t = 4.602, df = 1273, p-value = 2.301e-06
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.9882598       Inf
sample estimates:
mean of x mean of y 
 25.15493  23.61635 
```

Based on the t test, we have a p-value of about 2e-06. Since the p-value is smaller than the chosen significance level \(\alpha=0.05\), the null hypothesis is rejected.

Using the ACTIVE data, we tested whether the memory training can improve participants' performance on a memory test. Because we rejected the null hypothesis, we may conclude that the memory training statistically significantly increased the memory test performance.

- Hypothesis testing is a confirmatory rather than an exploratory data analysis method. One starts with a hypothesis and then tests whether the collected data support it.
- The logic of hypothesis testing is: assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?
- If the null hypothesis is true but one rejects it, one makes a Type I error. If the alternative hypothesis is true but one fails to reject the null hypothesis, one makes a Type II error. Statistical power is the probability of rejecting the null hypothesis when the alternative hypothesis is true.

|  | Fail to reject \(H_0\) | Reject \(H_0\) |
|---|---|---|
| Null hypothesis \(H_0\) is true | Correct decision | Type I error |
| Alternative hypothesis \(H_1\) is true | Type II error | Power |
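The error rates in the table can be illustrated with a small Monte Carlo simulation. The Python sketch below (an illustration with simulated normal data, not the ACTIVE study) estimates the Type I error rate by testing when the null is true, and the power by testing when the alternative is true:

```python
import math
import random

random.seed(1)

def welch_t(x, y):
    """Welch t statistic for two independent samples."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # unbiased variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

def rejection_rate(mu1, mu2, n=100, reps=2000, crit=1.645):
    """Fraction of simulated studies rejecting H0 with a one-sided test.

    With n = 100 per group the t distribution is close to normal, so the
    normal critical value 1.645 (alpha = 0.05, one-sided) is used as an
    approximation.
    """
    rejects = 0
    for _ in range(reps):
        x = [random.gauss(mu1, 1) for _ in range(n)]
        y = [random.gauss(mu2, 1) for _ in range(n)]
        if welch_t(x, y) > crit:
            rejects += 1
    return rejects / reps

type1 = rejection_rate(0.0, 0.0)  # H0 true: rate should be near alpha = 0.05
power = rejection_rate(0.5, 0.0)  # H1 true (0.5 SD effect): rate is the power
print(type1, power)
```

When the null is true, the rejection rate hovers around \(\alpha\); when the alternative is true with a 0.5 standard deviation effect and 100 participants per group, the test rejects in the large majority of simulated studies.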

- Statistical significance means that the results are unlikely to have occurred by chance, given that the null is true.
- Statistical significance does not imply practical importance. For example, in comparing two groups, the difference can still be statistically significant even if the difference is tiny.

To measure practical importance, an effect size is often recommended. For example, for a mean difference, the commonly used effect size measure is Cohen's d (Cohen, 1988). Cohen's d is defined as the difference between two means divided by a standard deviation.

\[d=\frac{\bar{x}_{1}-\bar{x}_{2}}{s_{p}}\]

Cohen defined \(s_{p}\), the pooled standard deviation, as

\[s_{p}=\sqrt{\frac{(n_{1}-1)s_{x_{1}}^{2}+(n_{2}-1)s_{x_{2}}^{2}}{n_{1}+n_{2}-2}}\]

A Cohen's d around 0.2 is considered small, around 0.5 medium, and 0.8 or above large.

For example, Cohen's d for the memory training example is about 0.28, representing a small effect even though the p-value is small and indicates statistical significance.

```r
> #usedata('active')
> #attach(active)
> 
> #training<-hvltt2[group==1]
> #control<-hvltt2[group==4]
> 
> mean1=mean(training,na.rm=T)
> mean2=mean(control,na.rm=T)
> meandiff=mean1-mean2
> 
> n1=length(training)-sum(is.na(training))
> n2=length(control)-sum(is.na(control))
> 
> v1=var(training,na.rm=T)
> v2=var(control,na.rm=T)
> s=sqrt(((n1-1)*v1+(n2-1)*v2)/(n1+n2-2))
> s
[1] 5.461292
> 
> cohend=meandiff/s
> cohend
[1] 0.281724
```