Null hypothesis testing

Null hypothesis testing is a procedure to evaluate the strength of evidence against a null hypothesis. Assuming the null hypothesis is true, we evaluate the probability of obtaining the observed evidence, or evidence more extreme, when the study is based on a randomly selected representative sample. The null hypothesis assumes no difference, relationship, or effect in the population from which the sample is selected. The probability is measured by a $p$ value. If the $p$ value is small enough, we reject the null. In the significance testing approach of Ronald Fisher, a null hypothesis is rejected on the basis of data that would be very unlikely if the null were true. However, the null hypothesis is never accepted or proved. This is analogous to a criminal trial: the defendant is assumed to be innocent (null is not rejected) until proven guilty (null is rejected) beyond a reasonable doubt (to a statistically significant degree).

To conduct a typical null hypothesis test, the following 7 steps can be followed:

  1. State the research question
  2. State the null and alternative hypotheses based on the research question
  3. Select a value for significance level \(\alpha\)
  4. Collect or locate data
  5. Calculate the test statistic and the p value
  6. Make a decision on rejecting or failing to reject the null hypothesis
  7. Answer the research question

Step 1. State the research question

A hypothesis test is used to answer a question. Therefore, the first step is to state a research question. For example, in the ACTIVE study a research question could be "Does memory training improve participants' performance on a memory test?"

Step 2. State the null and alternative hypotheses

Based on the research question, one then forms the null and the alternative hypotheses. For example, to answer the research question in Step 1, we would need to compare the memory test score for two groups of participants, those who receive training and those who do not. Let \(\mu_1\) and \(\mu_2\) be the population means of the two groups.

The null hypothesis \(H_0\) should be a statement about parameter(s), typically, of "no effect" or "no difference":

\[ H_{0}:\;\mu_{1}=\mu_{2}\mbox{ or }\mu_{1}-\mu_{2}=0.\]

The alternative hypothesis \(H_1\) or \(H_a\) is the statement we hope or suspect is true. In this example, we hope the training group has a higher score than the control group, and, therefore, our alternative hypothesis would be

\[ H_{a}:\:\mu_{1}>\mu_{2}\mbox{ or }\mu_{1}-\mu_{2}>0. \]

But note that it is cheating to first look at the data and then frame \(H_a\) to fit what the data show. If we do not have a direction firmly in mind in advance, we must use the two-sided alternative hypothesis (the default):

\[H_{a}:\:\mu_{1} \neq \mu_{2}\mbox{ or }\mu_{1}-\mu_{2} \neq 0.\]
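As a rough sketch of how the choice of alternative affects the test, the following compares a two-sided and a one-sided test on simulated scores. The data, group names, sample sizes, and seed are made up for illustration and are not the ACTIVE data; the `alternative` argument of scipy.stats.ttest_ind assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Simulated (hypothetical) scores; these are NOT the ACTIVE data.
rng = np.random.default_rng(2024)
training = rng.normal(loc=26.0, scale=5.0, size=200)
control = rng.normal(loc=25.0, scale=5.0, size=200)

# Two-sided alternative: mu_1 != mu_2 (the default)
two_sided = stats.ttest_ind(training, control, alternative="two-sided")

# One-sided alternative: mu_1 > mu_2, chosen before looking at the data
one_sided = stats.ttest_ind(training, control, alternative="greater")

print(f"t = {two_sided.statistic:.3f}")
print(f"two-sided p = {two_sided.pvalue:.4f}")
print(f"one-sided p = {one_sided.pvalue:.4f}")
```

The test statistic is the same in both cases; only the tail area used for the $p$ value changes. When the observed \(t\) is positive, the one-sided $p$ value for \(\mu_1 > \mu_2\) is half of the two-sided one.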

Step 3. Set the significance level \(\alpha\)

Hypothesis testing is a procedure to evaluate the strength of evidence against a null hypothesis. Assuming the null hypothesis is true, we calculate the probability of obtaining the observed evidence or more extreme evidence, which is called the $p$ value. If the $p$ value is small enough, we reject the null. In practice, a value of 0.05 is commonly considered small, but other values can be used. For example, a group of researchers recently recommended using 0.005 instead (Benjamin et al., 2017). This threshold is called the significance level, often denoted by \(\alpha\), and it should be decided before data analysis. If \(p\leq\alpha\), we reject the null hypothesis; if \(p>\alpha\), we fail to reject the null, and the evidence is insufficient to support a conclusion.
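The decision rule can be written down as a small helper. This is only an illustrative sketch; the function name `decide` is hypothetical and not part of any library.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the rule: reject H0 if p <= alpha, otherwise fail to reject."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


# The same p value can lead to different decisions under different alphas,
# which is why alpha should be fixed before the analysis.
print(decide(0.03))               # reject H0
print(decide(0.03, alpha=0.005))  # fail to reject H0
```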

Step 4. Collect or locate data

In this step, we can conduct an experiment to collect data or we can use existing data. Note that even if the data already exist, we should not form our hypotheses by peeking at the data.

The ACTIVE study has data on memory training, so we use it as an example. We need the scores for the training group and the control group. The variable hvltt2 has information on all 4 training groups (memory=1, reasoning=2, speed=3, control=4). The expression hvltt2[group==1] selects a subset of hvltt2: the values for which group is equal to 1. The data for the control group are selected in the same way.
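In Python, this kind of subsetting can be sketched with pandas boolean indexing. The small data frame below is invented so that the example runs on its own; it assumes columns named group and hvltt2 as described above, and its values are not from the ACTIVE study.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "group":  [1, 1, 2, 3, 4, 4, 4],
    "hvltt2": [28, 25, 27, 24, 23, 26, 22],
})

# Keep the hvltt2 scores where group equals 1 (memory training)
training = toy.loc[toy["group"] == 1, "hvltt2"]

# Keep the hvltt2 scores where group equals 4 (control)
control = toy.loc[toy["group"] == 4, "hvltt2"]

print(training.tolist())
print(control.tolist())
print(training.mean() - control.mean())  # observed mean difference
```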

Step 5. Calculate the test statistic and the $p$ value

When the null hypothesis is true, the population mean difference is zero (\(\mu_{1}-\mu_{2}=0\)). Based on our data, the observed mean difference for the two groups is \(\bar{x}_{1}-\bar{x}_{2} = 1.54\). To conduct a test, we need to calculate the probability of drawing a random sample with a difference of 1.54 or more extreme when \(H_{0}\) is true. That is,

\[\Pr(\bar{x}_{1}-\bar{x}_{2}\geq1.54|\:\mu_{1}-\mu_{2}=0)=?\]

In obtaining the above probability, we need to know the sampling distribution of \(\bar{x}_{1}-\bar{x}_{2}\), which leads to the $t$ distribution in a $t$ test. We calculate a test statistic

\[t=\frac{\bar{x}_{1}-\bar{x}_{2}}{s}\]

where \(s\) and the distribution of \(t\) need to be decided.

Welch's t test (unpooled two independent sample t test)

This test is used when the population variances of the two groups are not equal (the two sample sizes may or may not be equal). The \(t\) statistic to test whether the population means are different is calculated as:

\[t=\frac{\bar{x}_{1}-\bar{x}_{2}}{s_{\overline{\Delta}}}\]

where

\[s_{\overline{\Delta}}=\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}.\]

Here, \(s_{1}^{2}\) and \(s_{2}^{2}\) are the unbiased estimators of the variances of the two groups, and \(n_{k}\) is the number of participants in group \(k\) (\(k\) = 1 or 2). For use in significance testing, the distribution of the test statistic is approximated by an ordinary Student's \(t\) distribution with the degrees of freedom calculated as

\[\mathrm{d.f.}=\frac{(s_{1}^{2}/n_{1}+s_{2}^{2}/n_{2})^{2}}{(s_{1}^{2}/n_{1})^{2}/(n_{1}-1)+(s_{2}^{2}/n_{2})^{2}/(n_{2}-1)}.\]

This is known as the Welch-Satterthwaite equation. The true distribution of the test statistic actually depends (slightly) on the two unknown population variances.
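To make the formulas concrete, the following sketch computes the Welch \(t\) statistic, the Welch-Satterthwaite degrees of freedom, and a two-sided $p$ value by hand, and then cross-checks them against scipy.stats.ttest_ind with equal_var=False. The samples are simulated and the seed, means, and sizes are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Simulated (hypothetical) samples with unequal variances and sizes.
rng = np.random.default_rng(7)
x1 = rng.normal(loc=10.0, scale=2.0, size=40)
x2 = rng.normal(loc=9.0, scale=4.0, size=60)

n1, n2 = len(x1), len(x2)
v1, v2 = x1.var(ddof=1), x2.var(ddof=1)  # unbiased sample variances

# Standard error of the mean difference and the t statistic
se = np.sqrt(v1 / n1 + v2 / n2)
t_stat = (x1.mean() - x2.mean()) / se

# Welch-Satterthwaite degrees of freedom
df = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

# Two-sided p value from the Student's t distribution
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Cross-check against SciPy's Welch test
check = stats.ttest_ind(x1, x2, equal_var=False)
print(f"by hand: t = {t_stat:.4f}, df = {df:.2f}, p = {p_value:.4f}")
print(f"scipy:   t = {check.statistic:.4f}, p = {check.pvalue:.4f}")
```

The degrees of freedom are generally not an integer and always fall between \(\min(n_1, n_2)-1\) and \(n_1+n_2-2\).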

To conduct a t-test in Python, you can use the SciPy library, which provides functions for various statistical tests, including the t-test. Specifically, you can use the scipy.stats.ttest_ind() function for independent t-tests (when comparing the means of two independent groups) or scipy.stats.ttest_rel() for paired t-tests (when comparing means from the same group before and after a treatment or condition).

scipy.stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate')

Statsmodels is a powerful library for statistical modeling and tests, and it provides various statistical methods, including t-tests. In Statsmodels, the ttest_ind function in the statsmodels.stats.weightstats module (also available through statsmodels.stats.api) is used for independent t-tests.

statsmodels.stats.weightstats.ttest_ind(
	x1, 
	x2, 
	alternative='two-sided', 
	usevar='pooled', 
	weights=(None, None), 
	value=0
)
If you're looking for an even simpler alternative, pingouin is a statistical library in Python that's built on top of Pandas and Statsmodels and provides a clean, easy-to-use interface for statistical tests. 

pingouin.ttest(x, y, paired=False, alternative='two-sided', correction='auto', r=0.707, confidence=0.95)
We now conduct a t-test to see the gender difference in age for the ACTIVE data.
>>> import pandas as pd
>>> active = pd.read_csv("https://advstats.psychstat.org/data/active.csv")
>>> group_M = active[active['sex']==1]['age']
>>> group_F = active[active['sex']==2]['age']
>>>
>>> ## using scipy
>>> from scipy import stats
>>> tvalue, pvalue = stats.ttest_ind(group_M, group_F, equal_var=False)
>>> print(f"T-statistic: {tvalue:.3f} \nP-value: {pvalue:.3f}")
T-statistic: 1.282
P-value: 0.200
>>>
>>> ## using statsmodels
>>> from statsmodels.stats import weightstats as stests
>>> tvalue, pvalue, df = stests.ttest_ind(group_M, group_F, usevar="unequal")
>>> print(f"T-statistic: {tvalue:.3f} \nP-value: {pvalue:.3f}")
T-statistic: 1.282
P-value: 0.200
>>>
>>> ## using pingouin
>>> import pingouin as pg
>>> result = pg.ttest(group_M, group_F)
>>> print(result.to_string()) ## to_string will print everything
               T         dof alternative     p-val          CI95%   cohen-d   BF10     power
T-test  1.282212  596.924421   two-sided  0.200266  [-0.23, 1.08]  0.077914  0.149  0.259607

Pooled two independent sample t test

This test is used when the two groups have the same population variance. The \(t\) statistic can be calculated as follows:

\[t=\frac{\bar{x}_{1}-\bar{x}_{2}}{s_{p}\cdot\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}}\]

where

\[s_{p}=\sqrt{\frac{(n_{1}-1)s_{1}^{2}+(n_{2}-1)s_{2}^{2}}{n_{1}+n_{2}-2}}\]

is an estimator of the pooled standard deviation of the two samples. \(n_{k}-1\) is the degrees of freedom for each group, and the total sample size minus two (\(n_{1}+n_{2}-2\)) is the total number of degrees of freedom, which is used in significance testing.
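As with the Welch test, the pooled formulas can be checked by hand. The sketch below uses simulated samples (arbitrary seed, means, and sizes, chosen only for illustration) and compares the result with scipy.stats.ttest_ind with equal_var=True.

```python
import numpy as np
from scipy import stats

# Simulated (hypothetical) samples drawn with a common variance.
rng = np.random.default_rng(11)
x1 = rng.normal(loc=10.0, scale=3.0, size=35)
x2 = rng.normal(loc=9.0, scale=3.0, size=50)

n1, n2 = len(x1), len(x2)
v1, v2 = x1.var(ddof=1), x2.var(ddof=1)  # unbiased sample variances

# Pooled standard deviation
s_p = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))

# t statistic, degrees of freedom, and two-sided p value
t_stat = (x1.mean() - x2.mean()) / (s_p * np.sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Cross-check against SciPy's pooled test
check = stats.ttest_ind(x1, x2, equal_var=True)
print(f"by hand: t = {t_stat:.4f}, df = {df}, p = {p_value:.4f}")
print(f"scipy:   t = {check.statistic:.4f}, p = {check.pvalue:.4f}")
```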

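The pooled two independent sample $t$ test can be conducted with the same Python functions by requesting equal variances: equal_var=True in scipy.stats.ttest_ind() and usevar='pooled' in the statsmodels ttest_ind() (both are the defaults in the signatures shown earlier), or correction=False in pingouin.ttest().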


>>> import pandas as pd
>>> active = pd.read_csv("https://advstats.psychstat.org/data/active.csv")
>>> group_M = active[active['sex']==1]['age']
>>> group_F = active[active['sex']==2]['age']
>>>
>>> ## using scipy
>>> from scipy import stats
>>> tvalue, pvalue = stats.ttest_ind(group_M, group_F)
>>> print(f"T-statistic: {tvalue:.3f} \nP-value: {pvalue:.3f}")
T-statistic: 1.315
P-value: 0.189
>>>
>>> ## using statsmodels
>>> from statsmodels.stats import weightstats as stests
>>> tvalue, pvalue, df = stests.ttest_ind(group_M, group_F)
>>> print(f"T-statistic: {tvalue:.3f} \nP-value: {pvalue:.3f}")
T-statistic: 1.315
P-value: 0.189
>>>
>>> ## using pingouin
>>> import pingouin as pg
>>> result = pg.ttest(group_M, group_F, correction=False)
>>> print(result.to_string()) ## to_string will print everything
               T   dof alternative     p-val          CI95%   cohen-d   BF10     power
T-test  1.314572  1573   two-sided  0.188845  [-0.21, 1.07]  0.077914  0.156  0.259607

Step 6. Make a decision

Based on the $t$ test for the memory training example, we have a $p$ value of about 2e-06. Since the $p$ value is smaller than the chosen significance level \(\alpha=0.05\), the null hypothesis is rejected.

Step 7. Answer the research question

Using the ACTIVE data, we tested whether the memory training can improve participants' performance on a memory test. Because we rejected the null hypothesis, we may conclude that the memory training statistically significantly increased the memory test performance.

Remarks on hypothesis testing

  • Hypothesis testing is more a confirmatory than an exploratory data analysis method. Therefore, one starts with a hypothesis and then tests whether the collected data support the hypothesis.
  • The logic of hypothesis testing is: assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?
  • If the null hypothesis is true but one rejects it, one makes a Type I error. If the alternative hypothesis is true but one fails to reject the null hypothesis, one makes a Type II error. Statistical power is the probability of rejecting the null hypothesis when the alternative hypothesis is true.
                                            Fail to reject \(H_0\)    Reject \(H_0\)
    Null hypothesis \(H_0\) is true         Correct decision          Type I error
    Alternative hypothesis \(H_1\) is true  Type II error             Power
  • Statistical significance means that the results are unlikely to have occurred by chance, given that the null is true.
  • Statistical significance does not imply practical importance. For example, in comparing two groups, the difference can still be statistically significant even if the difference is tiny.
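The Type I error rate and power can be illustrated with a small simulation. This is a sketch under stated assumptions: normal populations with unit variance, 30 participants per group, 2000 simulated studies, and a true difference of 0.5 for the power case. The helper name `rejection_rate` and all of these settings are hypothetical choices for illustration.

```python
import numpy as np
from scipy import stats


def rejection_rate(true_diff, n_sims=2000, n_per_group=30, alpha=0.05, seed=123):
    """Proportion of simulated studies in which H0 is rejected."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated study.
    a = rng.normal(loc=true_diff, scale=1.0, size=(n_sims, n_per_group))
    b = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n_per_group))
    # One Welch t test per row
    p = stats.ttest_ind(a, b, axis=1, equal_var=False).pvalue
    return float(np.mean(p <= alpha))


# H0 true (no difference): rejections are Type I errors.
type1_rate = rejection_rate(true_diff=0.0)

# H1 true (difference of 0.5 SD): rejections are correct decisions (power).
power = rejection_rate(true_diff=0.5)

print(f"Estimated Type I error rate: {type1_rate:.3f}")
print(f"Estimated power:             {power:.3f}")
```

With the null true, the estimated rejection rate should land close to \(\alpha\). With a true difference, the rejection rate estimates the power, and one minus that value estimates the Type II error rate.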

Effect size

To measure practical importance, reporting an effect size is often recommended. For example, for a mean difference, a commonly used effect size measure is Cohen's d (Cohen, 1988). Cohen's d is defined as the difference between two means divided by a standard deviation.

\[d=\frac{\bar{x}_{1}-\bar{x}_{2}}{s_{p}} = t \sqrt{1/n_1 + 1/n_2}.\]

Cohen defined \(s_{p}\), the pooled standard deviation, as

\[s_{p}=\sqrt{\frac{(n_{1}-1)s_{x_{1}}^{2}+(n_{2}-1)s_{x_{2}}^{2}}{n_{1}+n_{2}-2}}.\]

A Cohen's d of around 0.2 is considered small, around 0.5 medium, and \(\geq 0.8\) large.
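The difference between statistical significance and practical importance can be sketched with simulated data: with very large samples, even a tiny standardized difference gives a small $p$ value. The helper name `cohens_d`, the sample sizes, and the true difference of 0.05 standard deviations are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def cohens_d(x1, x2):
    """Cohen's d using the pooled standard deviation."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    s_p = np.sqrt(
        ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    )
    return (x1.mean() - x2.mean()) / s_p


# Simulated (hypothetical) data: huge samples, tiny true difference.
rng = np.random.default_rng(5)
g1 = rng.normal(loc=0.05, scale=1.0, size=50_000)
g2 = rng.normal(loc=0.00, scale=1.0, size=50_000)

d = cohens_d(g1, g2)
res = stats.ttest_ind(g1, g2)  # pooled t test (equal_var=True by default)

# d can also be recovered from the pooled t statistic.
d_from_t = res.statistic * np.sqrt(1 / len(g1) + 1 / len(g2))

print(f"Cohen's d = {d:.4f} (from t: {d_from_t:.4f})")
print(f"p value   = {res.pvalue:.2e}")
```

The two ways of computing d agree because the pooled \(t\) statistic is d divided by \(\sqrt{1/n_1+1/n_2}\).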

For example, the Cohen's d for the memory training example is 0.25, representing a small effect even though the p-value is small and indicates statistical significance.


>>> import pandas as pd
>>> active = pd.read_csv("https://advstats.psychstat.org/data/active.csv")
>>> group_M = active[active['sex']==1]['age']
>>> group_F = active[active['sex']==2]['age']
>>> n_M = group_M.count()
>>> n_F = group_F.count()
>>>
>>> from scipy import stats
>>> tvalue, pvalue = stats.ttest_ind(group_M, group_F)
>>>
>>> import numpy as np
>>> tvalue*np.sqrt(1/n_M + 1/n_F)
np.float64(0.07791445105642322)

To cite the book, use: Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.