4  Chi-Square Test

Testing Relationships Between Two Categorical Variables

Author

Biostatistics Teaching Team

Published

June 1, 2026

4.1 Learning Objectives

After this session, students will be able to:

Know:

  • Chi-Square test assumptions (independence, adequate sample size)
  • Interpretation of Chi-Square statistic and p-value
  • How to report Chi-Square results in scientific format

Understand:

  • Chi-Square tests whether there is a relationship between two categorical variables
  • p-value < 0.05 means there is a significant relationship between variables

Do:

  • Perform Chi-Square test with chisq.test()
  • Create contingency tables with table()
  • Visualize results with ggplot2

4.2 Retrieval Practice & Introduction

Retrieval Practice: Before learning about Chi-square, try to recall:

  • What is the difference between categorical and numerical variables?
  • Give an example of categorical variables in biological research!
  • How do you present categorical data in a table?

4.3 What is the Chi-Square Test?

The Chi-Square Test of Independence helps us answer:

  • Is there a relationship between two categorical variables?
  • Or are they independent?

It compares the observed frequencies (what we actually see) to the expected frequencies (what we would expect if the variables were not related).

4.3.1 Hinge Question

You are researching whether soil type (clay, sandy, loam) affects the type of plant growing (flower, vegetable, tree). Which variables are categorical?

  1. Only soil type
  2. Only plant type
  3. Both variables (soil type AND plant type)
  4. No categorical variables

Answer: 3. Both variables are categorical. The Chi-Square test is used to test the relationship between two categorical variables.
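To make this concrete, here is a minimal sketch of how two categorical variables can be cross-tabulated with table(). The soil and plant values are made up for illustration only, and the names soil, plant, and soil_plant_table are hypothetical.

```r
# Hypothetical survey of 12 plots (made-up data, for illustration only)
soil  <- c("clay", "clay", "sandy", "loam", "loam", "sandy",
           "clay", "loam", "sandy", "loam", "clay", "sandy")
plant <- c("flower", "tree", "vegetable", "flower", "vegetable", "tree",
           "flower", "vegetable", "vegetable", "flower", "tree", "tree")

# Both vectors hold category labels, so table() counts each combination
soil_plant_table <- table(Soil = soil, Plant = plant)
soil_plant_table

# Add row and column totals to the table
addmargins(soil_plant_table)
```

Each cell counts how many plots share one soil type and one plant type, which is the kind of table the Chi-Square test takes as input.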

library(palmerpenguins)

Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets':

    penguins, penguins_raw
library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggrepel)
library(ggthemes)

4.4 Example: Penguins on Islands

Let’s take a look at our penguin dataset again. How are the penguins distributed across the three islands?

4.4.1 Observed Frequencies (O)

df <- na.omit(penguins)

# Create observed contingency table
penguin_table <- table(df$species, df$island)
as.data.frame.matrix(penguin_table)
          Biscoe Dream Torgersen
Adelie        44    55        47
Chinstrap      0    68         0
Gentoo       119     0         0

The counts suggest that each species favors particular islands: Chinstrap penguins appear only on Dream, and Gentoo penguins only on Biscoe.

Let’s test this question:

> Is species distribution associated with the islands?

4.4.2 Expected Frequencies (E)

If there’s no association, each species would be spread across the islands in proportion to the island totals.

We can use this formula to calculate the expected frequency for each cell (row \(i\), column \(j\)) of the table.

\(E_{ij} = \frac{(\text{row total}_i) \times (\text{column total}_j)}{\text{grand total}}\)
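As a worked example with the counts from the observed table, take Adelie penguins on Biscoe. The Adelie row total is 44 + 55 + 47 = 146, the Biscoe column total is 44 + 0 + 119 = 163, and the grand total is 333:

\[E_{\text{Adelie, Biscoe}} = \frac{146 \times 163}{333} \approx 71.47\]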

So the Expected Table is:

# Get totals
row_totals <- rowSums(penguin_table)
col_totals <- colSums(penguin_table)
grand_total <- sum(penguin_table)

# Manually compute expected frequency matrix
expected_matrix <- outer(row_totals, col_totals, FUN = function(r, c) (r * c) / grand_total)

# Round to the nearest integer for readability
# (a simplification: chisq.test() works with the unrounded expected counts)
expected_matrix <- round(expected_matrix)

# Assign row and column names
rownames(expected_matrix) <- rownames(penguin_table)
colnames(expected_matrix) <- colnames(penguin_table)

# View expected frequencies
expected_matrix
          Biscoe Dream Torgersen
Adelie        71    54        21
Chinstrap     33    25        10
Gentoo        58    44        17

Let’s compare the observed and expected values side by side:

# Extract observed and expected
observed <- as.numeric(penguin_table)
expected <- as.numeric(expected_matrix)

# Combine into a data frame with labels
species <- rep(rownames(penguin_table), times = ncol(penguin_table))
island <- rep(colnames(penguin_table), each = nrow(penguin_table))

observation_table <- data.frame(
  Species = species,
  Island = island,
  Observed = observed,
  Expected = round(expected, 2)
)

observation_table
    Species    Island Observed Expected
1    Adelie    Biscoe       44       71
2 Chinstrap    Biscoe        0       33
3    Gentoo    Biscoe      119       58
4    Adelie     Dream       55       54
5 Chinstrap     Dream       68       25
6    Gentoo     Dream        0       44
7    Adelie Torgersen       47       21
8 Chinstrap Torgersen        0       10
9    Gentoo Torgersen        0       17

4.5 Calculating the Chi-Square

The Chi-Square test statistic tells us how different the observed data is from what we would expect under the assumption that the two categorical variables are independent.

We use the following formula:

\[\chi^2 = \sum \frac{(O - E)^2}{E}\]

Where:

  • \(O\) = observed frequency (what we actually counted)
  • \(E\) = expected frequency (what we would expect if there were no association)
  • \(\chi^2\) = the total test statistic, measuring the overall difference between observed and expected

4.5.1 How does it work?

We go through each cell of the contingency table and compute:

\[\frac{(O - E)^2}{E}\]

This value will be:

  • Close to 0 if observed and expected are similar
  • Larger when there’s a big difference between the two

Then we sum up all of these values from every cell to get the total Chi-Square statistic.


4.5.2 What does it tell us?

If the resulting \(\chi^2\) value is large enough, the differences between observed and expected are unlikely to be due to random chance alone.
This suggests that the two variables (like species and island) are associated.


4.5.3 Visualize O vs E

To help you better understand where the differences come from, you can make a bar plot comparing observed vs expected frequencies:

ggplot(observation_table, aes(x = Island, fill = Species)) +
  geom_bar(aes(y = Observed), stat = "identity", position = "dodge", alpha = 0.7) +
  geom_point(aes(y = Expected), shape = 4, size = 3,
             position = position_dodge(width = 0.9)) +
  labs(title = "Observed vs Expected Counts",
       y = "Frequency", caption = "X marks expected values") +
  theme_minimal()

4.5.4 Let’s Calculate the Chi-Square value

Let’s go through each cell of the contingency table and compute:

\[\frac{(O - E)^2}{E}\]

At the end, we sum all the component values to get the total:

# Calculate chi-square components: (O - E)^2 / E
component <- (observed - expected)^2 / expected

# Add to the data frame (round if desired)
observation_table$Component <- round(component, 2)

# View the updated table
observation_table
    Species    Island Observed Expected Component
1    Adelie    Biscoe       44       71     10.27
2 Chinstrap    Biscoe        0       33     33.00
3    Gentoo    Biscoe      119       58     64.16
4    Adelie     Dream       55       54      0.02
5 Chinstrap     Dream       68       25     73.96
6    Gentoo     Dream        0       44     44.00
7    Adelie Torgersen       47       21     32.19
8 Chinstrap Torgersen        0       10     10.00
9    Gentoo Torgersen        0       17     17.00
# Total chi-square statistic
cat("Total χ² =", round(sum(component), 2), "\n")
Total χ² = 284.59 
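The whole calculation can also be wrapped in a small function. The sketch below is one possible way to do it, not a built-in: chi_square_manual() and obs are hypothetical names, the counts are copied by hand from the observed table, and the expected counts are kept unrounded.

```r
# Observed counts copied from the contingency table (species x island)
obs <- matrix(c( 44, 55, 47,
                  0, 68,  0,
                119,  0,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Adelie", "Chinstrap", "Gentoo"),
                              c("Biscoe", "Dream", "Torgersen")))

# chi_square_manual() is a hypothetical helper, not a base R function
chi_square_manual <- function(observed) {
  # Expected counts under independence: row total * column total / grand total
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  # Sum of (O - E)^2 / E over every cell
  sum((observed - expected)^2 / expected)
}

round(chi_square_manual(obs), 2)
```

For these counts the result should agree with the total of 284.59 computed step by step.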

4.5.5 Degrees of Freedom

We now have the χ² value, but how do we interpret it? First we need to calculate the degrees of freedom.

The degrees of freedom (df) for a Chi-Square test on a contingency table are calculated as:

\[df = (\text{\#rows} - 1) \times (\text{\#columns} - 1)\]

In our case, there are 3 species (rows) and 3 islands (columns):

\[df = (3 - 1)(3 - 1) = 2 \times 2 = 4\]

We can use the df to find the critical value from a Chi-Square distribution table at a significance level of 0.05.

# Parameters
df_val <- 4
alpha <- 0.05

# Critical value (right-tail threshold)
critical_value <- qchisq(p = 1 - alpha, df = df_val)

# Generate x and y for chi-square density curve
x <- seq(0, critical_value + 10, length.out = 500)
y <- dchisq(x, df = df_val)

# Plot the Chi-Square distribution
plot(x, y, type = "l", lwd = 2, col = "#2171B5",
     ylab = "Density", xlab = expression(chi^2),
     main = bquote("Chi-Square Distribution (df = " ~ .(df_val) ~ ")"))

# Add vertical line at critical value
abline(v = critical_value, col = "#c02728", lwd = 2, lty = 2)

# Shade the rejection region (right tail)
x_shade <- seq(critical_value, max(x), length.out = 100)
y_shade <- dchisq(x_shade, df = df_val)
polygon(c(critical_value, x_shade, max(x_shade)),
        c(0, y_shade, 0),
        col = "#FF666680", border = NA)

# Annotate the plot
text(critical_value + 1.5, max(y)*0.5,
     paste0("Critical value (0.05) = ", round(critical_value, 2)),
     col = "#c02728")

4.5.6 Interpretation

Now we compare the calculated Chi-Square statistic to the critical value from a Chi-Square distribution table at a significance level of 0.05.

If:

  • \(\chi^2 = 284.59\)
  • \(df = 4\)
  • Critical value at \(\alpha = 0.05\) is 9.49

Then:

\[284.59 > 9.49\]

So we reject the null hypothesis that species and island are independent.
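The same decision can be reached through a p-value instead of a critical value. The sketch below assumes the statistic (284.59) and degrees of freedom (4) from this example and uses the base R functions qchisq() and pchisq(); the variable names chi_sq and p_value are illustrative.

```r
chi_sq <- 284.59   # test statistic from the manual calculation
df_val <- 4        # (3 - 1) * (3 - 1)
alpha  <- 0.05

# Critical value: the point with 5% of the distribution to its right
critical_value <- qchisq(1 - alpha, df = df_val)
round(critical_value, 2)   # 9.49

# p-value: probability of a statistic at least this large
# if the two variables were independent (right-tail area)
p_value <- pchisq(chi_sq, df = df_val, lower.tail = FALSE)
p_value

# The two decision rules agree: both are TRUE, so we reject independence
chi_sq > critical_value
p_value < alpha
```

Comparing the statistic with the critical value and comparing the p-value with alpha are two views of the same right-tail area, so they always lead to the same conclusion.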


Conclusion:
- There is a statistically significant association between penguin species and the islands where they are found.
- The species are not spread across the islands in the proportions expected under independence. The test itself does not explain why; ecological or behavioral preferences are one plausible reason.

4.6 Running Chi-Square Test in R

Hint Ladder: If you’re confused about how to calculate Chi-Square manually:

  • Hint 1: Remember the formula: χ² = Σ [(O - E)² / E]
  • Hint 2: Calculate expected frequency: E = (row total × column total) / grand total
  • Hint 3: Use chisq.test() for direct results

R has a built-in function for the Chi-Square test, so we can call chisq.test() directly instead of calculating the statistic manually:

# Run the chi-square test
chi <- chisq.test(penguin_table)

chi

    Pearson's Chi-squared test

data:  penguin_table
X-squared = 284.59, df = 4, p-value < 2.2e-16
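The object returned by chisq.test() also stores the pieces of the calculation, which helps when checking the sample-size assumption. The sketch below rebuilds the table by hand so it runs on its own; obs and chi_check are illustrative names, and the "at least 5 per cell" threshold is a common rule of thumb assumed here rather than something defined in this chapter.

```r
# Observed counts copied from the contingency table (species x island)
obs <- matrix(c( 44, 55, 47,
                  0, 68,  0,
                119,  0,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Adelie", "Chinstrap", "Gentoo"),
                              c("Biscoe", "Dream", "Torgersen")))

chi_check <- chisq.test(obs)

# Unrounded expected counts used inside the test
round(chi_check$expected, 2)

# Rule-of-thumb check (assumed threshold): every expected count is at least 5
all(chi_check$expected >= 5)

# Pearson residuals, (O - E) / sqrt(E): the sign shows whether a cell
# has more (+) or fewer (-) penguins than expected under independence
round(chi_check$residuals, 2)
```

Cells with large positive or negative residuals are the ones that contribute most to the Chi-Square statistic.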

Let us also visualize the result:

# Extract test statistic and df
chi_stat <- chi$statistic
df_val <- chi$parameter

# Create x values for chi-square density
x <- seq(0, chi_stat + 20, length.out = 500)
y <- dchisq(x, df = df_val)

# Plot
plot(x, y, type = "l", lwd = 2, col = "#2171B5",
     ylab = "Density", xlab = expression(chi^2),
     main = paste("Chi-Square Distribution (df =", df_val, ")"))

# Add vertical line at test statistic
abline(v = chi_stat, col = "#c02728", lwd = 2, lty = 2)

# Annotate
text(chi_stat + 2, max(y)*0.8,
     labels = paste0("X² = ", round(chi_stat, 2)),
     col = "#c02728")
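For a written report, the statistic, degrees of freedom, sample size, and p-value are usually given together. The sketch below shows one common layout; the exact format depends on the journal or style guide, and obs, chi_report, p_text, and report are illustrative names.

```r
# Observed counts copied from the contingency table (species x island)
obs <- matrix(c( 44, 55, 47,
                  0, 68,  0,
                119,  0,  0),
              nrow = 3, byrow = TRUE)

chi_report <- chisq.test(obs)

# Very small p-values are conventionally reported as "p < .001"
p_text <- if (chi_report$p.value < 0.001) {
  "p < .001"
} else {
  sprintf("p = %.3f", chi_report$p.value)
}

# Assemble: statistic name(df, N = sample size) = value, p-value
report <- sprintf("χ²(%d, N = %d) = %.2f, %s",
                  as.integer(chi_report$parameter),
                  as.integer(sum(obs)),
                  chi_report$statistic,
                  p_text)
report
```

A sentence built on this string could read: "Species and island were significantly associated, χ²(4, N = 333) = 284.59, p < .001."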