One-way ANOVA
Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences between group means, developed by R.A. Fisher (Fisher, 1925). In ANOVA, the observed variance in a particular variable, usually an outcome variable, is partitioned into components attributable to different sources of variation: typically the between-group variation and the within-group variation. Put simply, ANOVA provides a statistical test of whether the means of several groups are equal, and therefore generalizes the t-test to more than two groups.
One-way ANOVA
One-way ANOVA typically evaluates whether there is a difference in the means of three or more groups, although it also works for a two-group analysis. In other words, it investigates the effect of a grouping variable on the outcome variable.
In the ACTIVE data, we have 4 groups: one control group and three training groups (memory, reasoning, and speed training). As an example, we test whether the four groups differ in memory performance as measured by the Hopkins Verbal Learning Test, denoted by the `hvltt2` variable in the data set. One-way ANOVA can be used for this purpose.
Specifically for this example, the null and alternative hypotheses are
\[ H_{0}:\:\mu_{1}=\mu_{2}=\mu_{3}=\mu_{4}\]
\[H_{a}:\;H_{0}\,\mbox{is not true or at least two groups have different means.}\]
For the hypothesis testing, an F-test is used. The basic idea is to decompose the total variation of the outcome variable into the variation due to differences between groups and the variation within groups. The variation is measured by sums of squares. For one-way ANOVA, we have \[ SST = SSB + SSW \]
- SST is the total sum of squares.
- SSB is the between-group sum of squares.
- SSW is the within-group sum of squares.
Then the F-statistic is \[ F=\frac{SSB/df_B}{SSW/df_W} = \frac{MSB}{MSW}\]
- $df_B$ is the between-group degrees of freedom, which is equal to the number of groups - 1. For the ACTIVE example, it is $4-1=3$.
- $df_W$ is the within-group degrees of freedom, which is equal to the total sample size - the number of groups. Since the sample size is 1575, $df_W = 1575-4 = 1571$.
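To make the decomposition concrete, the sketch below computes SSB, SSW, and the F-statistic by hand for a small artificial data set (the group values are made up for illustration) and cross-checks the result against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Artificial data: three small groups (values made up for illustration)
groups = [np.array([28.0, 24.0, 26.0, 30.0]),
          np.array([22.0, 25.0, 21.0, 24.0]),
          np.array([27.0, 23.0, 25.0, 29.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group sum of squares: n_j * (group mean - grand mean)^2
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: deviations around each group's own mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total sum of squares; equals SSB + SSW
sst = ((all_values - grand_mean) ** 2).sum()

df_b = len(groups) - 1                # number of groups - 1
df_w = len(all_values) - len(groups)  # total sample size - number of groups
f_stat = (ssb / df_b) / (ssw / df_w)
p_value = stats.f.sf(f_stat, df_b, df_w)

# Cross-check against scipy's built-in one-way ANOVA
f_ref, p_ref = stats.f_oneway(*groups)
print(round(f_stat, 3), round(f_ref, 3))
```

The two F values agree, and `sst` equals `ssb + ssw` up to floating-point error, which illustrates the decomposition above.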
If the null hypothesis is true, the F-statistic follows an F distribution with degrees of freedom $df_B$ and $df_W$ under certain conditions (independent observations, normality, and equal variances within groups). The Python code below conducts the one-way ANOVA for the ACTIVE data.
- ANOVA in Python can be conducted using the function `anova()` from the `pingouin` package. The function constructs the ANOVA source-of-variation table, which includes the sum of squares (SS), degrees of freedom (DF), mean sum of squares (MS), the F value, and the p-value.
- Specifically for this example, we have F = 10.051. Comparing it to an F distribution with degrees of freedom (3, 1571), we obtain the p-value = 1.466e-06. Therefore, one would reject the null hypothesis: memory performance differed significantly across the four groups.
```python
>>> ## read data
>>> import pandas as pd
>>> active = pd.read_csv("https://advstats.psychstat.org/data/active.csv")
>>> #active['group'] = active['group'].astype('category')
>>> active.head()
   site  age  edu  group  booster  sex  ...  hvltt  hvltt2  hvltt3  hvltt4  mmse  id
0     1   76   12      1        1    1  ...     28      28      17      22    27   1
1     1   67   10      1        1    2  ...     24      22      20      27    25   2
2     6   67   13      3        1    2  ...     24      24      28      27    27   3
3     5   72   16      1        1    2  ...     35      34      32      34    30   4
4     4   69   12      4        0    2  ...     35      29      34      34    28   5

[5 rows x 14 columns]
>>>
>>> ## use pingouin for anova
>>> import pingouin as pg
>>> aov = pg.anova(dv='hvltt2', between='group', data=active,
...                detailed=True)
>>> aov.round(3)
   Source         SS    DF       MS       F  p-unc    np2
0   group    858.610     3  286.203  10.051    0.0  0.019
1  Within  44736.225  1571   28.476     NaN    NaN    NaN
>>>
>>> ## one can also conduct one-way ANOVA using statsmodels
>>> import statsmodels.api as sm
>>> from statsmodels.formula.api import ols
>>> from statsmodels.stats.anova import anova_lm
>>>
>>> # Fit the one-way ANOVA model
>>> model = ols('hvltt2 ~ C(group)', data=active).fit()
>>>
>>> # Perform the ANOVA
>>> anova_results = anova_lm(model)
>>>
>>> # Display the results
>>> print(anova_results.round(3))
              df    sum_sq  mean_sq       F  PR(>F)
C(group)     3.0   858.610  286.203  10.051     0.0
Residual  1571.0  44736.225   28.476     NaN     NaN
```
Post-hoc multiple comparison
In the one-way ANOVA, we found that at least two groups differ in memory performance. We can further investigate which pairs of groups are different. In Python, the function `pairwise_tests()` from the `pingouin` package can be used to conduct a t-test for each pair of groups to see whether they are significantly different. The code below shows how to conduct such an analysis.
```python
>>> ## read data
>>> import pandas as pd
>>> active = pd.read_csv("https://advstats.psychstat.org/data/active.csv")
>>> #active['group'] = active['group'].astype('category')
>>> active.head()
   site  age  edu  group  booster  sex  ...  hvltt  hvltt2  hvltt3  hvltt4  mmse  id
0     1   76   12      1        1    1  ...     28      28      17      22    27   1
1     1   67   10      1        1    2  ...     24      22      20      27    25   2
2     6   67   13      3        1    2  ...     24      24      28      27    27   3
3     5   72   16      1        1    2  ...     35      34      32      34    30   4
4     4   69   12      4        0    2  ...     35      29      34      34    28   5

[5 rows x 14 columns]
>>>
>>> ## use pingouin for pairwise comparisons
>>> import pingouin as pg
>>> pairwise = pg.pairwise_tests(dv='hvltt2', between='group', data=active)
>>> print(pairwise.round(3).to_string())
  Contrast  A  B  Paired  Parametric      T      dof alternative  p-unc       BF10  hedges
0    group  1  2   False        True  4.221  776.812   two-sided  0.000    460.778   0.302
1    group  1  3   False        True  3.284  792.934   two-sided  0.001     15.289   0.232
2    group  1  4   False        True  5.042  775.369   two-sided  0.000  1.755e+04   0.361
3    group  2  3   False        True -1.062  786.028   two-sided  0.289      0.138  -0.075
4    group  2  4   False        True  0.936  771.841   two-sided  0.349      0.123   0.067
5    group  3  4   False        True  1.993  778.808   two-sided  0.047      0.555   0.142
```
From the output, we might "naively" conclude that group 1 is statistically different from groups 2, 3, and 4, and that group 3 is different from group 4. However, in doing so, we run into the multiple comparison problem: when conducting a family of hypothesis tests simultaneously, the tests are more likely to incorrectly reject a null hypothesis. To demonstrate this, consider two hypotheses:
\[ H_{01}: \text{group 1 = group 2} \] and \[ H_{02}: \text{group 3 = group 4}. \]
Suppose that for each individual hypothesis, the type I error rate is controlled at 0.05, meaning there is a 5% probability of rejecting the hypothesis even if it is true. Then the type I error rate of rejecting either or both of the two hypotheses is $1 - (1-0.05)\times(1-0.05) = 0.0975$, which is about twice the nominal level 0.05. This indicates that to control the overall type I error rate, one should use a smaller significance level. Such a significance level can be calculated. Suppose we have $k$ simultaneous hypotheses to test and we want to keep the overall type I error rate at 0.05. Let $\alpha^*$ be the significance level to use. Then we have \[ 1-(1-\alpha^*)^k = 0.05. \] Solving this equation leads to \[ \alpha^*= 1 - 0.95^{1/k} \approx 0.05/k. \]

For the one-way ANOVA post-hoc comparison, we conducted a total of 6 hypothesis tests. Therefore, the corrected significance level should be $0.05/6 = 0.008$, and only a p-value smaller than this can be considered significant. In this case, the difference between group 3 and group 4 is no longer statistically significant. Equivalently, one can multiply each obtained p-value by the number of tests and compare the adjusted p-value with 0.05. This way of adjusting the p-values is called the Bonferroni correction (Dunn, 1961). It can be done easily in Python as shown below.
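The arithmetic above is easy to verify numerically. The sketch below computes the familywise error rate for $k$ independent tests, as well as the exact corrected level and its Bonferroni approximation $0.05/k$.

```python
# Familywise type I error rate when running k independent tests,
# each at level alpha
alpha = 0.05

def familywise_rate(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

# Two tests: 1 - 0.95^2 = 0.0975, about twice the nominal 0.05
print(round(familywise_rate(2), 4))  # 0.0975

# Corrected per-test level: exact solution vs. Bonferroni approximation
k = 6  # number of pairwise comparisons among 4 groups
alpha_exact = 1 - (1 - alpha) ** (1 / k)
alpha_bonf = alpha / k
print(round(alpha_exact, 5), round(alpha_bonf, 5))  # 0.00851 0.00833
```

The exact level and the simple $0.05/k$ rule agree to about three decimal places, which is why the approximation is used in practice.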
```python
>>> ## read data
>>> import pandas as pd
>>> active = pd.read_csv("https://advstats.psychstat.org/data/active.csv")
>>>
>>> ## use pingouin for anova
>>> import pingouin as pg
>>> pairwise = pg.pairwise_tests(dv='hvltt2', between='group',
...                              padjust='bonf', data=active)
>>> print(pairwise.round(3).to_string())
  Contrast  A  B  Paired  Parametric      T      dof alternative  p-unc  p-corr p-adjust       BF10  hedges
0    group  1  2   False        True  4.221  776.812   two-sided  0.000   0.000     bonf    460.778   0.302
1    group  1  3   False        True  3.284  792.934   two-sided  0.001   0.006     bonf     15.289   0.232
2    group  1  4   False        True  5.042  775.369   two-sided  0.000   0.000     bonf  1.755e+04   0.361
3    group  2  3   False        True -1.062  786.028   two-sided  0.289   1.000     bonf      0.138  -0.075
4    group  2  4   False        True  0.936  771.841   two-sided  0.349   1.000     bonf      0.123   0.067
5    group  3  4   False        True  1.993  778.808   two-sided  0.047   0.279     bonf      0.555   0.142
```
Other p-value correction methods
The Bonferroni correction does not take into account the actual sizes of the p-values. Other methods can have greater power than the Bonferroni correction. One such method is Holm's step-down method, which obtains the corrected level for rejecting a null hypothesis in the following way:

- Specify $\alpha$, the level at which to control the overall type I error.
- Compute the $p$-values, $p_1, \ldots, p_m$, for the $m$ null hypotheses $H_{01}, \ldots, H_{0m}$.
- Order the $m$ $p$-values so that $p^{(1)} \leq p^{(2)} \leq \ldots \leq p^{(m)}$.
- Define $L = \min\left\{ j: p^{(j)} > \frac{\alpha}{m+1-j} \right\}$.
- Reject all null hypotheses $H_{0j}$ for which $p_j < p^{(L)}$.
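The step-down rule above can be implemented directly. The sketch below applies it to a set of six illustrative p-values (made up, not the chapter's exact output) and cross-checks the reject/retain decisions against the Holm implementation in `statsmodels`.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def holm_reject(pvals, alpha=0.05):
    """Holm's step-down procedure: return a boolean rejection
    decision for each hypothesis, in the original order."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)  # indices that sort the p-values
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        # Compare p^(j) with alpha / (m + 1 - j); here j = rank + 1,
        # so the threshold is alpha / (m - rank). Stop at the first failure.
        if pvals[idx] > alpha / (m - rank):
            break
        reject[idx] = True
    return reject

# Illustrative p-values for six pairwise tests (hypothetical numbers)
p = [0.0001, 0.001, 0.0002, 0.289, 0.349, 0.047]
ours = holm_reject(p)

# Cross-check with statsmodels' Holm correction
ref = multipletests(p, alpha=0.05, method='holm')[0]
print(ours)
print(ref)
```

Both implementations reject the first three hypotheses and retain the rest, matching the step-down rule applied by hand.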
```python
>>> ## read data
>>> import pandas as pd
>>> active = pd.read_csv("https://advstats.psychstat.org/data/active.csv")
>>>
>>> ## use pingouin for anova
>>> import pingouin as pg
>>> pairwise = pg.pairwise_tests(dv='hvltt2', between='group',
...                              padjust='holm', data=active)
>>> print(pairwise.round(3).to_string())
  Contrast  A  B  Paired  Parametric      T      dof alternative  p-unc  p-corr p-adjust       BF10  hedges
0    group  1  2   False        True  4.221  776.812   two-sided  0.000   0.000     holm    460.778   0.302
1    group  1  3   False        True  3.284  792.934   two-sided  0.001   0.004     holm     15.289   0.232
2    group  1  4   False        True  5.042  775.369   two-sided  0.000   0.000     holm  1.755e+04   0.361
3    group  2  3   False        True -1.062  786.028   two-sided  0.289   0.577     holm      0.138  -0.075
4    group  2  4   False        True  0.936  771.841   two-sided  0.349   0.577     holm      0.123   0.067
5    group  3  4   False        True  1.993  778.808   two-sided  0.047   0.140     holm      0.555   0.142
```
Barplot with standard error or confidence interval
With a fairly large sample size, it can be useful to make a barplot to show the group information. The height of each bar is proportional to the mean of its group. In addition, the standard error or the confidence interval for the mean of each group can be added to the bars. Such a plot can be generated using the `bar()` function from the `matplotlib` package. To generate such a plot, we need the following information: the grouping variable, the mean of each group, and the standard error of each mean. For the ACTIVE data, the information is given in the following table.
| group | mean | se   |
|-------|------|------|
| 1     | 26.1 | 0.27 |
| 2     | 24.5 | 0.27 |
| 3     | 24.9 | 0.25 |
| 4     | 24.1 | 0.29 |
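The mean and standard error per group can be computed with a pandas `groupby`. The sketch below uses a small made-up data frame with the same layout as the ACTIVE data (a `group` column and an `hvltt2` column); the numbers are for illustration only.

```python
import pandas as pd

# Small made-up data frame with the same layout as the ACTIVE data
df = pd.DataFrame({
    'group':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'hvltt2': [26, 27, 25, 24, 25, 24, 25, 24, 26, 24, 23, 25],
})

# Mean and standard error of the mean for each group;
# pandas' sem() uses the sample standard deviation / sqrt(n)
summary = df.groupby('group')['hvltt2'].agg(['mean', 'sem']).round(2)
print(summary)
```

Applying the same two lines to the actual ACTIVE data frame reproduces the means and standard errors shown in the table.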
The Python code below calculates the information in the table and generates a bar plot as shown in the output. We now explain the code.

- `groups`: the labels for the different groups (x-axis).
- `means`: the mean value for each group (height of the bars).
- `std_errors`: the standard errors (or any other measure of uncertainty) to plot as vertical error bars.
- `yerr=std_errors`: specifies the error bars along the y-axis.
- `capsize=5`: adds caps at the ends of the error bars for better visibility.
- `color='skyblue'`: specifies the color of the bars.
- `edgecolor='black'`: adds a black border around the bars.
```python
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>>
>>> # Example data (mean values and standard errors for each group)
>>> groups = ['Memory', 'Reasoning', 'Speed', 'Control']
>>> means = [26.1, 24.5, 24.9, 24.1]
>>> std_errors = [0.27, 0.27, 0.25, 0.29]  # Standard error for each group
>>>
>>> # Create the bar plot with error bars
>>> plt.bar(groups, means, yerr=std_errors, capsize=5, color='skyblue', edgecolor='black')
>>>
>>> # Adding labels and title
>>> plt.xlabel('Training Groups')
Text(0.5, 0, 'Training Groups')
>>> plt.ylabel('Mean Value')
Text(0, 0.5, 'Mean Value')
>>> plt.title('Bar Plot with Error Bars')
Text(0.5, 1.0, 'Bar Plot with Error Bars')
>>>
>>> # Display the plot
>>> plt.savefig('bar.svg', format='svg')
```
To cite the book, use:
Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.