The t-test is a Special Case of OLS (aka my attempt to write a quick post)

statistics
Matt tries to write a quick post, and also the t-test is OLS in disguise.
Author: Matt Bowers

Published: January 19, 2026

Well, for a while now I’ve been thinking it might be nice to be able to fire off a quick post now and then, instead of working on these huge months-long sagas that I edit and re-edit a bazillion times before posting to y’all. So here’s my attempt to quickly write a cute little post. We’re just going to run through the math to convince ourselves that the two-sample t-test is mathematically identical to ordinary least squares regression on a single binary covariate, and therefore the t-test is a special case of OLS. I know there are a bunch of t-test variants, but we’ll focus on the garden-variety two-sample, equal-variance one from your intro stats class.

Two-Sample t-test

We’ll look at the t-test from two perspectives—the classical setup and a linear regression reformulation. In each case we’ll break the approach down into these items: data generating process, estimator, expectation and variance of the estimator, test statistic, and sampling distribution of the test statistic. You can use this kind of breakdown to understand pretty much any classical statistical test. In this case, the point is to clearly show that the classical t-test and the linear regression formulation yield identical tests.

The Classical t-test Approach

The data generating process

You have two populations or processes \(Y_0\) and \(Y_1\), and you want to know whether their true means \(\mu_0\) and \(\mu_1\) are equal. We assume that both processes are Gaussian with equal but unknown variance \(\sigma^2\):

\[ Y_0 \sim N(\mu_0, \sigma^2), \quad Y_1 \sim N(\mu_1, \sigma^2) \]

The estimator

You draw \(n_0\) independent observations from group 0 and \(n_1\) independent observations from group 1, for a total of \(n=n_0+n_1\) observations, and compute the sample means \(\bar{Y}_0\) and \(\bar{Y}_1\). Your estimator for the difference in means is simply:

\[\hat{\delta} = \bar{Y}_1 - \bar{Y}_0\]

Expectation of the estimator

Since \(E[\bar{Y}_0] = \mu_0\) and \(E[\bar{Y}_1] = \mu_1\), we have:

\[E[\hat{\delta}] = E[\bar{Y}_1 - \bar{Y}_0] = \mu_1 - \mu_0\]

So \(\hat{\delta}\) is an unbiased estimator of the true difference in means.

Standard error of the estimator

The sample means are independent, so:

\[Var[\hat{\delta}] = Var[\bar{Y}_1] + Var[\bar{Y}_0] = \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_0} = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_0}\right)\]
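As a quick sanity check on that variance formula, here’s a small simulation sketch. The parameter values and the number of simulated experiments are arbitrary choices made for illustration, not anything special.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration.
rng = np.random.default_rng(0)
n0, n1 = 20, 25
mu0, mu1, sigma = 10.0, 12.0, 2.0
n_sims = 100_000

# Each row is one simulated experiment; take the group mean within each row.
ybar0 = rng.normal(mu0, sigma, size=(n_sims, n0)).mean(axis=1)
ybar1 = rng.normal(mu1, sigma, size=(n_sims, n1)).mean(axis=1)
delta_hat = ybar1 - ybar0

# Compare the simulated moments of delta_hat with the formulas above.
empirical_var = delta_hat.var(ddof=1)
theoretical_var = sigma**2 * (1 / n1 + 1 / n0)

print(f"mean of delta_hat:     {delta_hat.mean():.4f} (theory: {mu1 - mu0:.4f})")
print(f"variance of delta_hat: {empirical_var:.4f} (theory: {theoretical_var:.4f})")
```

The simulated mean and variance should land close to \(\mu_1 - \mu_0\) and \(\sigma^2(1/n_1 + 1/n_0)\), up to Monte Carlo noise.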

Since we don’t know \(\sigma^2\), we estimate it with the pooled sample variance:

\[\hat{\sigma}_{\text{pooled}}^2 = \frac{(n_0-1)s_0^2 + (n_1-1)s_1^2}{n_0 + n_1 - 2}\]

where \(s_0^2\) and \(s_1^2\) are the sample variances for each group. This gives us the estimated standard error:

\[SE(\hat{\delta}) = \sqrt{\hat{\sigma}_{\text{pooled}}^2\left(\frac{1}{n_0} + \frac{1}{n_1}\right)}\]
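If it helps to see the pooled variance and standard error as code, here’s a minimal sketch. The helper name `pooled_se` and the tiny dataset are made up for illustration.

```python
import numpy as np

def pooled_se(group0, group1):
    """Estimated standard error of mean(group1) - mean(group0), assuming equal variances."""
    group0 = np.asarray(group0, dtype=float)
    group1 = np.asarray(group1, dtype=float)
    n0, n1 = group0.size, group1.size
    # Sample variances use the n - 1 denominator (ddof=1).
    s0_sq = group0.var(ddof=1)
    s1_sq = group1.var(ddof=1)
    # Pooled variance: weight each sample variance by its degrees of freedom.
    pooled_var = ((n0 - 1) * s0_sq + (n1 - 1) * s1_sq) / (n0 + n1 - 2)
    return np.sqrt(pooled_var * (1 / n0 + 1 / n1))

# Tiny made-up dataset, small enough to check by hand.
g0 = [1.0, 2.0, 3.0]        # sample variance 1
g1 = [2.0, 4.0, 6.0, 8.0]   # sample variance 20/3
se = pooled_se(g0, g1)
print(f"pooled standard error: {se:.4f}")
```

For this toy data the pooled variance works out to \((2 \cdot 1 + 3 \cdot 20/3)/5 = 4.4\), so the standard error is \(\sqrt{4.4 \cdot (1/3 + 1/4)}\).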

The test statistic

We form the test statistic by dividing our estimator by its standard error:

\[t = \frac{\hat{\delta}}{SE(\hat{\delta})} = \frac{\bar{Y}_1 - \bar{Y}_0}{\sqrt{\hat{\sigma}_{\text{pooled}}^2 (1/n_0 + 1/n_1)}}\]

Sampling distribution

Under the null hypothesis \(H_0: \mu_1 = \mu_0\), this test statistic follows a Student’s t-distribution with \(n_0 + n_1 - 2\) degrees of freedom.
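One way to get a feel for that sampling distribution claim is to simulate a bunch of experiments where the null is true and see how often the test rejects. This is just a sketch with arbitrary, hypothetical parameter values; if the statistic really follows \(t_{n_0+n_1-2}\), the rejection rate at the 5% level should come out near 0.05.

```python
import numpy as np
from scipy import stats

# Hypothetical settings for illustration only.
rng = np.random.default_rng(1)
n0, n1 = 20, 25
mu, sigma = 10.0, 2.0   # both groups share the same mean, so H0 is true
n_sims = 10_000
alpha = 0.05

# Each row is one simulated experiment.
group0 = rng.normal(mu, sigma, size=(n_sims, n0))
group1 = rng.normal(mu, sigma, size=(n_sims, n1))

# With axis=1, ttest_ind runs one equal-variance test per row.
t_stats, p_vals = stats.ttest_ind(group1, group0, axis=1, equal_var=True)
rejection_rate = np.mean(p_vals < alpha)

# Same check done directly against the t critical value.
crit = stats.t.ppf(1 - alpha / 2, df=n0 + n1 - 2)
tail_rate = np.mean(np.abs(t_stats) > crit)

print(f"rejection rate from p-values:      {rejection_rate:.4f}")
print(f"share of |t| above critical value: {tail_rate:.4f}")
```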

Having horrifying flashbacks to your intro to stats class yet? No worries. Let’s look at it from a new perspective.

The Regression Approach

The data generating process

We can express the exact same data generating process as a linear regression model. Stack all observations into a single length-\(n\) vector \(Y\) and create a dummy variable \(X \in \{0,1\}\) indexing which group each observation came from:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

where \(\epsilon \overset{iid}{\sim} N(0, \sigma^2)\).

Taking conditional expectations:

\[ E[Y|X=0] = \beta_0 = \mu_0 \]

\[ E[Y|X=1] = \beta_0 + \beta_1 = \mu_1 \]

So we can see that \(\beta_1 = \mu_1 - \mu_0\), meaning the regression coefficient \(\beta_1\) directly represents the difference in population means.

The estimator

The ordinary least squares estimator for \(\beta_1\) is:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\]

For our dummy variable where \(\bar{X} = n_1/(n_0 + n_1)\), after some algebra that you can crank through on your own this simplifies to:

\[\hat{\beta}_1 = \bar{Y}_1 - \bar{Y}_0\]

Well look at that—the regression coefficient estimate is exactly the difference in sample means!
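If you’d rather not crank through that algebra yourself, here’s a sketch of it. Write \(n = n_0 + n_1\), so \(X_i - \bar{X}\) equals \(n_0/n\) when \(X_i = 1\) and \(-n_1/n\) when \(X_i = 0\). Since \(\sum_i (X_i - \bar{X}) = 0\), the \(\bar{Y}\) term drops out of the numerator:

\[\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n}(X_i - \bar{X})Y_i = \frac{n_0}{n}\sum_{i:\,X_i=1} Y_i - \frac{n_1}{n}\sum_{i:\,X_i=0} Y_i = \frac{n_0 n_1}{n}\left(\bar{Y}_1 - \bar{Y}_0\right)\]

The denominator works the same way:

\[\sum_{i=1}^{n}(X_i - \bar{X})^2 = n_1\left(\frac{n_0}{n}\right)^2 + n_0\left(\frac{n_1}{n}\right)^2 = \frac{n_0 n_1 (n_0 + n_1)}{n^2} = \frac{n_0 n_1}{n}\]

Dividing the first by the second, the \(n_0 n_1 / n\) factors cancel and we’re left with \(\bar{Y}_1 - \bar{Y}_0\).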

Expectation of the estimator

OLS is unbiased whenever the errors have mean zero given \(X\), which our model assumes, so:

\[E[\hat{\beta}_1] = \beta_1 = \mu_1 - \mu_0\]

So \(\hat{\beta}_1\) is also an unbiased estimator of the difference in means.

Standard error of the estimator

The standard error formula for an OLS coefficient is:

\[SE(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2 \cdot \frac{1}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}\]

where \(\hat{\sigma}^2\) is the residual variance from the regression:

\[\hat{\sigma}^2 = \frac{1}{n_0 + n_1 - 2}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2\]

For our dummy variable, it turns out that:

  • The residual variance \(\hat{\sigma}^2\) equals the pooled variance \(\hat{\sigma}_{\text{pooled}}^2\)
  • The sum \(\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{n_0 n_1}{n_0 + n_1}\)

Substituting these:

\[SE(\hat{\beta}_1) = \sqrt{\hat{\sigma}_{\text{pooled}}^2 \cdot \frac{n_0 + n_1}{n_0 n_1}} = \sqrt{\hat{\sigma}_{\text{pooled}}^2\left(\frac{1}{n_0} + \frac{1}{n_1}\right)}\]

This is exactly the same standard error we got from the classical approach.
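The residual variance claim is worth a quick sketch, since it’s the less obvious of the two. The OLS intercept is \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\), and plugging in \(\hat{\beta}_1 = \bar{Y}_1 - \bar{Y}_0\) and \(\bar{X} = n_1/(n_0 + n_1)\) gives \(\hat{\beta}_0 = \bar{Y}_0\). So the fitted values are just the group means: \(\hat{Y}_i = \bar{Y}_0\) when \(X_i = 0\) and \(\hat{Y}_i = \bar{Y}_0 + (\bar{Y}_1 - \bar{Y}_0) = \bar{Y}_1\) when \(X_i = 1\). The residual sum of squares then splits by group:

\[\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i:\,X_i=0}(Y_i - \bar{Y}_0)^2 + \sum_{i:\,X_i=1}(Y_i - \bar{Y}_1)^2 = (n_0-1)s_0^2 + (n_1-1)s_1^2\]

Dividing by \(n_0 + n_1 - 2\) gives exactly \(\hat{\sigma}_{\text{pooled}}^2\).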

The test statistic

We form the test statistic by dividing our coefficient estimate by its standard error:

\[t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{\bar{Y}_1 - \bar{Y}_0}{\sqrt{\hat{\sigma}_{\text{pooled}}^2 (1/n_0 + 1/n_1)}}\]

Sampling distribution

Under the null hypothesis \(H_0: \beta_1 = 0\), this test statistic follows a Student’s t-distribution with \(n_0 + n_1 - 2\) degrees of freedom (the residual degrees of freedom from the regression).
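To make the regression recipe concrete, here’s a minimal sketch that computes the slope, residual variance, standard error, and t-statistic straight from the scalar formulas above, then compares them with scipy’s equal-variance t-test. The simulated data and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical simulated data, for illustration only.
rng = np.random.default_rng(7)
n0, n1 = 20, 25
group0 = rng.normal(10.0, 2.0, n0)
group1 = rng.normal(12.0, 2.0, n1)

# Stack the outcomes and build the 0/1 dummy variable.
y = np.concatenate([group0, group1])
x = np.concatenate([np.zeros(n0), np.ones(n1)])
n = n0 + n1

# OLS slope and intercept from the scalar formulas.
sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

# Residual variance with n - 2 degrees of freedom.
residuals = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(residuals**2) / (n - 2)

# Standard error, t-statistic, and two-sided p-value for the slope.
se_beta1 = np.sqrt(sigma2_hat / sxx)
t_reg = beta1_hat / se_beta1
p_reg = 2 * stats.t.sf(np.abs(t_reg), df=n - 2)

# Classical equal-variance t-test for comparison.
t_classic, p_classic = stats.ttest_ind(group1, group0, equal_var=True)

print("slope equals difference in means:", np.isclose(beta1_hat, group1.mean() - group0.mean()))
print("Sxx equals n0*n1/n:", np.isclose(sxx, n0 * n1 / n))
print("t-statistics match:", np.isclose(t_reg, t_classic))
print("p-values match:", np.isclose(p_reg, p_classic))
```

All four checks should agree up to floating-point error, whatever seed or sample sizes you pick.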

The Punchline

See what just happened? The two approaches give us:

  • The same point estimate: \(\hat{\delta} = \hat{\beta}_1 = \bar{Y}_1 - \bar{Y}_0\)
  • The same standard error: \(\sqrt{\hat{\sigma}_{\text{pooled}}^2(1/n_0 + 1/n_1)}\)
  • The same test statistic: \(t = \frac{\bar{Y}_1 - \bar{Y}_0}{\sqrt{\hat{\sigma}_{\text{pooled}}^2 (1/n_0 + 1/n_1)}}\)
  • The same sampling distribution: \(t_{n_0+n_1-2}\)
  • Therefore, the same p-value

In other words, these approaches are mathematically equivalent.

Implementation

Let’s simulate some data and implement both testing approaches.

import numpy as np
from scipy import stats
import statsmodels.api as sm

# Simulate data
np.random.seed(42)
n0, n1 = 20, 25
mu0, mu1 = 10, 12
sigma = 2
group0 = np.random.normal(mu0, sigma, n0)
group1 = np.random.normal(mu1, sigma, n1)

# Traditional t-test
t_stat, p_val_ttest = stats.ttest_ind(group1, group0, equal_var=True)

# Regression approach
y = np.concatenate([group0, group1])
x = np.concatenate([np.zeros(n0), np.ones(n1)])
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Compare
print(f"t-test statistic: {t_stat:.6f}")
print(f"Regression t-stat for β₁: {model.tvalues[1]:.6f}")

print(f"\nt-test p-value: {p_val_ttest:.6f}")
print(f"Regression p-value: {model.pvalues[1]:.6f}")

t-test statistic: 3.258749
Regression t-stat for β₁: 3.258749

t-test p-value: 0.002190
Regression p-value: 0.002190

As promised, the two-sample equal-variance t-test yields identical results to a linear regression with a dummy variable.

Wrapping Up

Ok, I mostly just wanted to prove to myself that I could write a short post that didn’t take an embarrassing amount of time to research and write. Let’s consider this experiment a success! See you next time.

