3 ANOVA

Analysis of Variance for Comparing More than Two Groups

Author

Biostatistics Teaching Team

Published

June 1, 2026

3.1 Learning Objectives

After this session, students will be able to:

Know	Understand	Do
ANOVA assumptions (normality, homogeneity of variance, independence)	ANOVA tests whether there are significant differences between three or more group means	Perform one-way ANOVA with `aov()`
Interpretation of F-value and p-value in ANOVA	p-value < 0.05 means at least one group differs	Calculate descriptive statistics per group
Differences between post-hoc tests (Tukey, Dunnett, Bonferroni)	Post-hoc tests identify which groups differ after a significant ANOVA	Perform post-hoc test with `glht()`
How to report ANOVA results in scientific format		Visualize ANOVA results with ggplot2

3.2 What is ANOVA?

ANOVA stands for Analysis of Variance. It’s a statistical method used to find out whether there are significant differences between the means of two or more groups. In other words, it helps us answer the question: Are the group averages different from each other, or are any differences just due to random chance?

3.3 When do we use ANOVA?

If you only have 2 groups, you can use a t-test to compare them.
If you have 3 or more groups, ANOVA is the better tool to use.

3.4 Types of ANOVA

There are several types of ANOVA depending on the study design (like one-way, two-way, or repeated measures).
In this course, we will focus on the one-way ANOVA, which is the simplest version. It is called “one-way” because we are looking at the effect of one factor or independent variable.

3.5 Comparing Phenotypes Across Penguin Species

Imagine you’re a wildlife researcher studying penguins in Antarctica. You have access to data from three different species of penguins:

Adelie
Chinstrap
Gentoo

You’re curious: Do these penguin species have different average flipper lengths?
Flipper length can tell us a lot about how penguins swim, hunt, and adapt to their environment.

To explore this question, you use a dataset called penguins, which contains real measurements from 344 penguins. Among the many variables recorded, you focus on:

Species — which kind of penguin it is
Flipper length — measured in millimeters

Your goal is to find out whether at least one species has a significantly different average flipper length compared to the others.

Because you’re comparing more than two groups, this is a perfect case for using ANOVA (Analysis of Variance)!

library(report)
library(palmerpenguins)


Attaching package: 'palmerpenguins'

The following objects are masked from 'package:datasets':

    penguins, penguins_raw

library(ggplot2)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(multcomp)

Loading required package: mvtnorm

Loading required package: survival

Loading required package: TH.data

Loading required package: MASS


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select


Attaching package: 'TH.data'

The following object is masked from 'package:MASS':

    geyser

library(ggstatsplot)

You can cite this package as:
     Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
     Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167

summary(penguins)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NAs    :2       NAs    :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NAs   : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NAs    :2         NAs    :2

# Remove rows with NA values
df <- na.omit(penguins)
head(df)

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           36.7          19.3               193        3450
5 Adelie  Torgersen           39.3          20.6               190        3650
6 Adelie  Torgersen           38.9          17.8               181        3625
# ℹ 2 more variables: sex <fct>, year <int>

write.csv(df, "assets/data/penguin_clean.csv", row.names = FALSE)

summary(df)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :146   Biscoe   :163   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :123   1st Qu.:39.50   1st Qu.:15.60  
 Gentoo   :119   Torgersen: 47   Median :44.50   Median :17.30  
                                 Mean   :43.99   Mean   :17.16  
                                 3rd Qu.:48.60   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172       Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190       1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197       Median :4050                Median :2008  
 Mean   :201       Mean   :4207                Mean   :2008  
 3rd Qu.:213       3rd Qu.:4775                3rd Qu.:2009  
 Max.   :231       Max.   :6300                Max.   :2009

ggplot(df) +
  aes(x = species, y = flipper_length_mm, color = species) +
  geom_jitter() +
  theme(legend.position = "none")

3.6 Hypotheses of ANOVA

In the penguin example, we want to answer the question:

Is the flipper length different between Adelie, Chinstrap, and Gentoo penguins?

When performing ANOVA, we set up two hypotheses:

Null hypothesis (\(H_0\)):
The average flipper length is the same for all three species.
\(H_0\): \(\mu_{Adelie} = \mu_{Chinstrap} = \mu_{Gentoo}\)
Alternative hypothesis (\(H_1\)):
At least one species has a different average flipper length.
\(H_1\): At least one mean is different

Important: The alternative hypothesis does not say that all the means are different. It only says that at least one of them is different.

3.6.1 Hinge Question

ANOVA shows a p-value = 0.03. What is the correct interpretation?

All penguin species have different flipper lengths

At least one penguin species has a different flipper length

There is no difference in flipper length between species

The average flipper length of all species equals zero

Answer: B. ANOVA only tells us that at least one group differs, not that all groups differ.

For example, maybe Gentoo penguins have longer flippers, but Adelie and Chinstrap are still quite similar to each other.

To find out which groups are different, we would need to do additional tests after ANOVA — these are called post-hoc tests.

3.7 Retrieval Practice & Preliminary Analyses

Retrieval Practice: Before continuing to ANOVA, try to recall from the T-test session: - What is the difference between t-test and ANOVA? - When do we use t-test vs ANOVA? - What does the null hypothesis test in ANOVA?

Before running ANOVA, it’s a good idea to first explore and visualize the data. This helps us understand patterns and spot any unusual values.

3.7.1 Make a Boxplot

One of the best ways to see the differences in flipper length between penguin species is to use a boxplot.

ggplot(df) +
  aes(x = species, y = flipper_length_mm, color = species) +
  geom_boxplot()

The boxplot shows the distribution of flipper length for each species:

Gentoo penguins seem to have the longest flippers
Adelie penguins appear to have the shortest flippers
Chinstrap penguins fall somewhere in between

This visual check helps us see whether the groups look different enough to expect significant results from ANOVA.

3.7.2 Descriptive Statistics

In addition to the boxplot, we should also calculate summary statistics for each species, such as:

Mean (average flipper length)
Standard deviation (how spread out the flipper lengths are)

These numbers give us a clearer picture of the central tendency and variability in each group before running the formal test.

length_stats <- df %>%
  group_by(species) %>%
  summarise(
    mean = mean(flipper_length_mm),
    sd = sd(flipper_length_mm),
    n = n(),
    se = sd / sqrt(n),
    df = n - 1
  )
length_stats

# A tibble: 3 × 6
  species    mean    sd     n    se    df
  <fct>     <dbl> <dbl> <int> <dbl> <dbl>
1 Adelie     190.  6.52   146 0.540   145
2 Chinstrap  196.  7.13    68 0.865    67
3 Gentoo     217.  6.59   119 0.604   118

3.8 Running ANOVA in R

So far, we’ve explored the data visually and calculated some summary statistics.
But to formally test whether the flipper lengths are significantly different between the three penguin species, we need to perform an ANOVA.

This will help us answer the original research question:

“Is the flipper length different between Adelie, Chinstrap, and Gentoo penguins?”

ANOVA allows us to make conclusions about the entire population based on the sample data we have.

In the next step, we’ll learn how to run this test in R and interpret the result.

res_aov <- aov(flipper_length_mm ~ species,
  data = df)

summary(res_aov)

             Df Sum Sq Mean Sq F value Pr(>F)    
species       2  50526   25263   567.4 <2e-16 ***
Residuals   330  14693      45                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.8.1 Understanding and Interpreting the Output

When you run summary(res_aov), R will display an ANOVA table. Here’s what to look for:

F value: The test statistic. A higher F value indicates stronger evidence of differences between group means.
Pr(>F): This is the p-value. If it’s less than 0.05, we can conclude that at least one species has a different average flipper length.

The table also includes:

Degrees of freedom (Df): Tells us how many groups were compared and how many observations remain.
Mean Squares (Mean Sq): Measures variation between and within groups.

3.8.2 Interpreting the Results

In our case, the p-value is smaller than 0.05 (p < 2.2e-16), so we reject the null hypothesis.

This means we have enough evidence to say that at least one species has a different average flipper length.

We cannot say that all species are different — only that at least one species differs from the others.

If the p-value had been larger than 0.05:
We would not reject the null hypothesis.
That means we would have no strong evidence to say the species differ in flipper length.

3.8.3 Optional: Reporting the Results

A clean way to summarize the ANOVA results in R is by using the report() function from the {report} package:

# install.packages("report")  # if not already installed
library(report)
report(res_aov)

The ANOVA (formula: flipper_length_mm ~ species) suggests that:

  - The main effect of species is statistically significant and large (F(2, 330)
= 567.41, p < .001; Eta2 = 0.77, 95% CI [0.74, 1.00])

Effect sizes were labelled following Field's (2013) recommendations.

3.9 What Happens After ANOVA?

3.9.1 If the null hypothesis is not rejected (p-value ≥ 0.05):

We do not have enough evidence to say that the group means are different.
In this case, the ANOVA process usually stops here.
While other analyses could still be done, we cannot conclude that any of the groups differ based on the data at hand.

3.9.2 If the null hypothesis is rejected (p-value < 0.05):

We have shown that at least one group is different from the others.
If your only goal was to test whether all group means are equal, you can stop here.

But often, we want to go further and ask:

Which group(s) are different from each other?

This is where ANOVA alone is not enough — it only tells us that some difference exists, but not where that difference lies.

3.10 What’s Next? Post-hoc Tests

To find out which specific groups differ, we need to perform post-hoc tests.

The term post-hoc means “after this” in Latin — so these tests are done after a significant ANOVA result.
They are also called multiple pairwise comparison tests.

These tests compare the groups two by two to identify exactly which group(s) have significantly different means.

3.11 The Issue of Multiple Testing

Once ANOVA tells us that at least one group is different, the next logical step is to figure out which group(s) are different.
To do this, we need to compare the groups two at a time.

In our penguin dataset with 3 species, we would make the following pairwise comparisons:

Chinstrap vs. Adelie
Gentoo vs. Adelie
Gentoo vs. Chinstrap

At first, it might seem fine to run a t-test for each of these comparisons — after all, t-tests are made for comparing two groups.
But this brings up a big problem: multiple testing (also called multiplicity).

3.11.1 Why Multiple Testing is a Problem

Each time we do a hypothesis test (like a t-test), there’s a small chance — usually 5% if α = 0.05 — that we find a significant result just by chance, even if the null hypothesis is true.

So what happens if we run multiple tests?

Let’s say we run 3 pairwise comparisons and set our significance level at 0.05. The probability of getting at least one significant result just by chance becomes:

\[\begin{equation} \begin{split} P(\text{at least 1 sig. result}) & = 1 - P(\text{no sig. results}) \\ & = 1 - (1 - 0.05)^3 \\ & = 0.142625 \end{split} \end{equation}\]

That means there’s a 14.26% chance of finding a false positive — nearly 3 times higher than our intended 5% error rate!

3.11.2 More Groups, More Problems

As the number of groups increases, the number of pairwise comparisons increases rapidly:

4 groups → 6 comparisons
5 groups → 10 comparisons
10 groups → 45 comparisons

With 10 groups, the chance of getting a false positive becomes 90%, and with 14 or more groups, it’s almost guaranteed — over 99%.

3.12 Post-hoc Tests in R and Their Interpretation

Once ANOVA tells us there is a difference, post-hoc tests help us identify which specific groups are different. There are several types of post-hoc tests, and the most common ones are:

Tukey HSD: Compares all possible pairs of groups.
Dunnett: Compares each group to a reference group (e.g., a control group).
Bonferroni correction: Used when you have specific planned comparisons.

3.12.1 Bonferroni Correction

This is the simplest method. You divide your desired significance level (e.g., 0.05) by the number of comparisons.

In our penguin example, we are comparing 3 pairs:

\[\alpha' = \frac{0.05}{3} = 0.0167\]

Then you perform individual t-tests and compare their p-values to 0.0167.
If the p-value is smaller than this adjusted level, the difference is considered significant.

Note: Bonferroni is simple but conservative. It reduces false positives but may miss true differences (false negatives).

3.12.2 Bonferroni in R

We can run the Bonferroni correction using pairwise.t.test():

# Bonferroni correction
pairwise.t.test(df$flipper_length_mm, df$species,
                p.adjust.method = "bonferroni",
                pool.sd = FALSE)


    Pairwise comparisons using t tests with non-pooled SD 

data:  df$flipper_length_mm and df$species 

          Adelie  Chinstrap
Chinstrap 3.9e-07 -        
Gentoo    < 2e-16 < 2e-16  

P value adjustment method: bonferroni

3.12.3 Tukey HSD Test

In our case, we don’t have a reference group and we want to compare all species — so we’ll use the Tukey HSD test.

We can reuse the ANOVA result (res_aov) and run the test like this:

# Tukey HSD test:
post_test <- glht(res_aov,
linfct = mcp(species = "Tukey")
)

summary(post_test)


     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = flipper_length_mm ~ species, data = df)

Linear Hypotheses:
                        Estimate Std. Error t value Pr(>|t|)    
Chinstrap - Adelie == 0   5.7208     0.9796    5.84 2.18e-08 ***
Gentoo - Adelie == 0     27.1326     0.8241   32.92  < 1e-08 ***
Gentoo - Chinstrap == 0  21.4118     1.0143   21.11  < 1e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

After running the Tukey HSD test with summary(post_test), R displays a section titled Linear Hypotheses:. This table shows the pairwise comparisons between groups.

Focus on two key columns:

First column: Lists the comparisons being made
(e.g., Chinstrap - Adelie == 0)
Last column (Pr(>|t|)): Shows the adjusted p-values
These are corrected for multiple testing to ensure the overall error rate stays at 0.05

The null hypothesis in each case is that the two groups have equal means.
If the p-value is less than 0.05, we reject this hypothesis — the two groups are significantly different.

In our case, the Tukey test compares:

Chinstrap vs. Adelie (Chinstrap - Adelie == 0)
Gentoo vs. Adelie (Gentoo - Adelie == 0)
Gentoo vs. Chinstrap (Gentoo - Chinstrap == 0)

All three adjusted p-values are less than 0.05, so we reject the null hypothesis for each comparison.

This means that all three penguin species are significantly different from one another in terms of flipper length.

3.12.3.1 Visualizing the Results

You can use the plot() function to generate a confidence interval plot:

par(mar=c(3,8,3,3)) # Adjust margins for better label spacing
plot(post_test)

We see that the confidence intervals do not cross the zero line, which indicate that all groups are significantly different.

3.12.4 Dunnett’s Test

We’ve seen that as the number of groups increases, so does the number of pairwise comparisons.
This leads to a problem: to keep the overall error rate (e.g. 5%) under control, we have to lower the threshold for each individual test.

But when we do that, we also reduce the statistical power — meaning it’s harder to detect real differences between groups.

3.12.4.1 How Dunnett’s Test Helps

One way to improve power is by reducing the number of comparisons.
This is especially useful in experiments where there is a control group and one or more treatment groups.

Instead of comparing every group with every other group (like Tukey HSD does), Dunnett’s test:

Compares each group only to a reference group
Does not compare the non-reference groups to each other

3.12.4.2 Tukey vs. Dunnett

Test Type	Compares	Statistical Power
Tukey HSD	All groups to each other	Lower
Dunnett	Each group to a reference	Higher

3.12.4.3 Applying Dunnett’s Test

Let’s say we want to use Adelie as our reference group and compare the other two species (Chinstrap and Gentoo) to it.

Here’s how to run Dunnett’s test in R:

# Dunnett's post-hoc test
post_test_dunnett <- glht(res_aov, linfct = mcp(species = "Dunnett"))

summary(post_test_dunnett)


     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Dunnett Contrasts


Fit: aov(formula = flipper_length_mm ~ species, data = df)

Linear Hypotheses:
                        Estimate Std. Error t value Pr(>|t|)    
Chinstrap - Adelie == 0   5.7208     0.9796    5.84 2.51e-08 ***
Gentoo - Adelie == 0     27.1326     0.8241   32.92  < 1e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Note: This is the same as the Tukey test code, except we change "Tukey" to "Dunnett" in the linfct line.

The output works similarly to Tukey’s test — you’ll see the comparisons and adjusted p-values.

But this time, only the comparisons with the reference group (Adelie) are included:

Chinstrap vs. Adelie
Gentoo vs. Adelie

This can be helpful when you only care about how each species compares to the reference, and not how they compare to each other.

In the output of summary(post_test_dunnett), focus on the last column, which shows the adjusted p-values.

In our example, both adjusted p-values are below 0.05, so we reject the null hypothesis for both comparisons:

Chinstrap vs. Adelie: Significant difference
Gentoo vs. Adelie: Significant difference

This means that:

Both Chinstrap and Gentoo penguins have significantly different flipper lengths compared to Adelie penguins.

However, nothing can be said about the comparison between Chinstrap and Gentoo — because Dunnett’s test only compares to the reference group.

par(mar = c(3, 8, 3, 3))  # Adjust margins for better label spacing
plot(post_test_dunnett)

3.13 Visualization of ANOVA and Post-hoc Tests on the Same Plot

To make your results easier to interpret and more visually appealing, it’s often helpful to display the ANOVA and post-hoc test results directly on your boxplots.

ggbetweenstats(
  data = df,
  x = species,
  y = flipper_length_mm,
  type = "parametric", # ANOVA or Kruskal-Wallis
  var.equal = TRUE, # ANOVA or Welch ANOVA
  plot.type = "box",
  pairwise.comparisons = TRUE,
  pairwise.display = "significant",
  centrality.plotting = FALSE,
  bf.message = FALSE
)

In the example plot above, each boxplot by species is shown along with:

The p-value from the ANOVA — usually displayed in the subtitle of the plot (e.g., p = 1.59e-107)
The p-values from post-hoc tests — shown above each pairwise comparison

3.14 Exercise: Choosing the Right Post-hoc Test

After ANOVA shows a significant difference, your task is to choose the correct post-hoc test.

3.14.1 Question 1: Plant Growth Study

A researcher wants to know whether fertilizer type (Fertilizer A, Fertilizer B, Fertilizer Control) affects tomato plant height. ANOVA shows a significant difference (p < 0.001). The researcher had planned beforehand to compare both new fertilizers against the control.

Question: Which post-hoc test is most appropriate? - A) Tukey HSD - B) Dunnett - C) Bonferroni - D) No post-hoc needed

Click to see the answer

Answer: B) Dunnett. Since the researcher only wants to compare each treatment to the control group (not compare Fertilizer A vs Fertilizer B), Dunnett’s test is the right choice — it’s more powerful than Tukey for this scenario.

3.14.2 Question 2: Fish Species Study

An ecologist compares fish body length from 4 different lakes. ANOVA shows a significant difference (p = 0.003). The researcher wants to know which lakes differ from each other.

Question: Which post-hoc test is most appropriate? - A) Tukey HSD - B) Dunnett - C) Bonferroni - D) No post-hoc needed

Click to see the answer

Answer: A) Tukey HSD. Since the researcher wants to compare all pairs of lakes (6 comparisons), Tukey HSD is the right choice — it controls error rate for all pairwise comparisons.

3.14.3 Code Exercise: Bonferroni on Tapak Dara Data

Try using the tapak dara data to test whether plant height (plant-height) differs across locations (label):

# 1. Load data
tapak_dara <- read.csv("assets/data/morfologi_tapak_dara.tsv")

# 2. Perform ANOVA
res_aov_tapak <- aov(`plant-height` ~ label, data = tapak_dara)
summary(res_aov_tapak)

# 3. If significant, run Bonferroni
pairwise.t.test(tapak_dara$`plant-height`, tapak_dara$label,
                p.adjust.method = "bonferroni")

Interpretation: - If ANOVA p-value < 0.05 → there is a difference between locations - If Bonferroni p-value < 0.05 → those location pairs differ significantly

The output shows pairwise comparisons with p-values adjusted using the Bonferroni method.

Tukey vs. Bonferroni: - Tukey HSD: More powerful for all-pairwise comparisons (all pairs) - Bonferroni: More conservative, suitable for planned comparisons (pre-planned comparisons)

3.15 Summary

In this exercise, we explored the key concepts and practical steps involved in performing an ANOVA analysis:

We began by reviewing the goal of ANOVA — to test whether the means of multiple groups are equal — and stated the null and alternative hypotheses.
We demonstrated how to perform ANOVA in R and interpret the output, especially the F value and p-value.
We covered post-hoc tests, including:
- Tukey HSD, used to compare all pairs of groups
- Dunnett’s test, used to compare treatment groups to a reference group
Finally, we showed how to visualize the data and statistical results together on a single plot for easier interpretation.