27/03/2024
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]
Question:
Are the residuals independent and identically distributed, with the same variance \(\sigma^2\)?
From (H2), the errors are iid: \[ \epsilon_i \sim \mathcal{N}(0, \sigma^2 ) \]
The residuals have distribution: \[ \hat{\boldsymbol{\epsilon}} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I}_n - \mathbf{P}^{\mathbf{X}})\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \sigma^2 (\mathbf{I}_n - \mathbf{P}^{\mathbf{X}})) \]
Writing: \[ \mathbf{H} = \mathbf{P}^{\mathbf{X}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \]
The residuals have distribution: \[ \hat{\epsilon}_i \sim \mathcal{N}(0, \sigma^2 (1 - h_{ii})) \]
From (H2), the errors are iid: \[ \epsilon_i \sim \mathcal{N}(0, \sigma^2 ) \] But the variance of the residuals depends on \(i\): \[ \hat{\epsilon}_i \sim \mathcal{N}(0, \sigma^2 (1 - h_{ii})). \]
“Studentization”: try to “uniformize” the residuals so that they are iid.
Normalized residuals: \[ \frac{\hat{\epsilon}_i}{\sqrt{\sigma^2 (1 - h_{ii})}} \sim \mathcal{N}(0, 1). \]
Normalized residuals are iid.
Problem: \(\sigma^2\) is unknown \(\to\) replace it by \(\hat{\sigma}^2\).
What is the distribution of \[ t_i = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2 (1 - h_{ii})}} \sim ~? \]
If \(\hat{\sigma}^2 = \frac{1}{n - p} \| \hat{\epsilon} \|^2\) is the standard unbiased variance estimate, the \(t_i\) do not follow a Student distribution (!)
That is because \(\hat{\sigma}^2\) is not independent of \(\hat{\epsilon}_i\) in general.
Solution: replace \(\hat{\sigma}^2\) with \(\hat{\sigma}^2_{(-i)}\), an estimate of \(\sigma^2\) that is independent of \(\hat{\epsilon}_i\).
Leave-one-out cross validation: for all \(i\):
Remove observation \(i\) from the dataset: \[ \mathbf{y}_{(-i)} = \begin{pmatrix} y_1\\ \vdots\\ y_{i-1}\\ y_{i+1}\\ \vdots\\ y_n \end{pmatrix} \qquad\qquad \mathbf{X}_{(-i)} = \begin{pmatrix} \mathbf{x}^1\\ \vdots\\ \mathbf{x}^{i-1}\\ \mathbf{x}^{i+1}\\ \vdots\\ \mathbf{x}^{n} \end{pmatrix} \]
Fit the model on the dataset without observation \(i\), yielding \(\hat{\boldsymbol{\beta}}_{(-i)}\) and \(\hat{\sigma}^2_{(-i)}\).
Predict the left-out point \(i\) from this fit: \(\hat{y}_{i}^P = \mathbf{x}^i\hat{\boldsymbol{\beta}}_{(-i)}\).
Then, from CM4, we get: \[ \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}} \sim \mathcal{T}_{n - p -1} \]
The prediction error from the leave-one-out cross validation follows a Student distribution with \(n-p-1\) degrees of freedom.
If \(\mathbf{X}\) and all the \(\mathbf{X}_{(-i)}\) have rank \(p\), then:
\[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} = \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}} \]
and: \[ t_i^* \sim \mathcal{T}_{n - p -1}. \]
The studentized residuals are equal to the normalized prediction errors from the leave-one-out cross validation.
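As a numerical sanity check (a minimal sketch on arbitrary simulated data, not part of the course material), we can verify that rstudent() matches the explicit leave-one-out computation:

## Check: studentized residuals = normalized leave-one-out prediction errors
set.seed(42)
n <- 50
x_1 <- runif(n); x_2 <- runif(n)
y <- 1 + 2 * x_1 - x_2 + rnorm(n)
fit <- lm(y ~ x_1 + x_2)
X <- cbind(1, x_1, x_2)
t_star_loo <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y[-i] ~ x_1[-i] + x_2[-i])          # fit without observation i
  y_pred_i <- sum(coef(fit_i) * X[i, ])           # leave-one-out prediction of y_i
  s2_i <- summary(fit_i)$sigma^2                  # sigma^2_(-i)
  v_i <- 1 + t(X[i, ]) %*% solve(crossprod(X[-i, ])) %*% X[i, ]
  (y[i] - y_pred_i) / sqrt(s2_i * drop(v_i))
})
max(abs(t_star_loo - rstudent(fit)))              # should be numerically ~ 0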
Sherman–Morrison formula (a rank-one special case of the Woodbury identity):
For any invertible \(q \times q\) matrix \(\mathbf{A}\) and any vectors \(\mathbf{u}\) and \(\mathbf{v}\) of \(\mathbb{R}^q\), the matrix \(\mathbf{A} + \mathbf{u}\mathbf{v}^T\) is invertible iff \(1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u} \neq 0\), and then: \[ (\mathbf{A} + \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1} \mathbf{u}\mathbf{v}^T \mathbf{A}^{-1}}{1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u}}. \]
Proof: exercise (Wikipedia is a great resource).
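The identity is also easy to check numerically (a minimal sketch with an arbitrary random matrix and vectors, chosen so that \(\mathbf{A}\) is invertible):

## Numerical check of the Sherman-Morrison formula
set.seed(1)
q <- 4
A <- crossprod(matrix(rnorm(q^2), q, q)) + diag(q)   # invertible (positive definite)
u <- rnorm(q); v <- rnorm(q)
A_inv <- solve(A)
lhs <- solve(A + u %*% t(v))                          # direct inverse
rhs <- A_inv - (A_inv %*% u %*% t(v) %*% A_inv) / drop(1 + t(v) %*% A_inv %*% u)
max(abs(lhs - rhs))                                   # should be numerically ~ 0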
\[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} = \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}} \]
Hints:
\[ (\mathbf{A} + \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{ \mathbf{A}^{-1} \mathbf{u}\mathbf{v}^T \mathbf{A}^{-1} }{ 1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u} }. \]
\[ \mathbf{X}^T\mathbf{X} = \mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)} + (\mathbf{x}^{i})^T\mathbf{x}^i \]
\[ \mathbf{X}^T\mathbf{y} = \mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} + (\mathbf{x}^{i})^Ty_i \]
\[ h_{ii} = [\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}]_{ii} = \mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \]
So: \[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X} - (\mathbf{x}^{i})^T\mathbf{x}^i)^{-1} = \cdots \]
\[ \hat{y}_i^P = \mathbf{x}^i\hat{\boldsymbol{\beta}}_{(-i)} = \mathbf{x}^i(\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}\mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} = \cdots \]
\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X} - (\mathbf{x}^{i})^T\mathbf{x}^i)^{-1} = \cdots \]
\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{ (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} }{ 1 - \mathbf{x}^{i}(\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{i})^T } \]
\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \]
Hence: \[ 1 + \mathbf{x}^{i}(\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T = 1 + h_{ii} +\frac{h_{ii}^2}{1 - h_{ii}} = \frac{1}{1 - h_{ii}} \]
\[ \hat{y}_i^P = \mathbf{x}^i\hat{\boldsymbol{\beta}}_{(-i)} = \mathbf{x}^i(\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}\mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} \]
But: \[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \] \[ \mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} = \mathbf{X}^T\mathbf{y} - (\mathbf{x}^{i})^Ty_i \]
Hence: \[ \begin{multline} \hat{y}_i^P = \mathbf{x}^i\left[ (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \right] \\ \times [\mathbf{X}^T\mathbf{y} - (\mathbf{x}^{i})^Ty_i] \end{multline} \]
\[ \begin{multline} \hat{y}_i^P = \mathbf{x}^i (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y} - \mathbf{x}^i (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^Ty_i \\ + \frac{\mathbf{x}^i(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T}{1 - h_{ii}}\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y} \\ - \frac{\mathbf{x}^i(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{i})^T}{1 - h_{ii}} y_i \end{multline} \]
\[ \hat{y}_i^P = \mathbf{x}^i \hat{\boldsymbol{\beta}} - h_{ii}y_i + \frac{h_{ii}}{1-h_{ii}}\mathbf{x}^i \hat{\boldsymbol{\beta}} - \frac{h_{ii}^2}{1-h_{ii}} y_i \]
\[ \hat{y}_i^P = \left(1 + \frac{h_{ii}}{1-h_{ii}}\right) \hat{y}_i - \left(h_{ii} + \frac{h_{ii}^2}{1-h_{ii}}\right) y_i \]
\[ \hat{y}_i^P = \frac{1}{1-h_{ii}}\hat{y}_i - \frac{h_{ii}}{1-h_{ii}} y_i \]
So: \[ y_i - \hat{y}_i^P = - \frac{1}{1-h_{ii}}\hat{y}_i + \left(1 + \frac{h_{ii}}{1-h_{ii}}\right) y_i \]
And: \[ y_i - \hat{y}_i^P = \frac{1}{1-h_{ii}}(y_i - \hat{y}_i) = \frac{1}{1-h_{ii}}\hat{\epsilon}_i \]
We have: \[ 1 + \mathbf{x}^{i}(\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T = \frac{1}{1 - h_{ii}} \] \[ y_i - \hat{y}_i^P = \frac{1}{1-h_{ii}}\hat{\epsilon}_i \]
Hence: \[ \begin{aligned} t_i^* &= \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}}\\ &= \frac{\hat{\epsilon}_i/(1-h_{ii})}{\sqrt{\hat{\sigma}^2_{(-i)} /(1-h_{ii})}} \\ \end{aligned} \]
Hence: \[ \begin{aligned} t_i^* &= \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}}\\ &= \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} \\ &\sim \mathcal{T}_{n - p -1}. \end{aligned} \]
Studentized Residuals: Student iid \[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} \sim \mathcal{T}_{n - p -1}. \]
Also called “externally studentized residuals”, “deleted residuals”, “jackknifed residuals”, or “leave-one-out residuals”. R function: rstudent().
Standardized Residuals: not Student in general \[ t_i = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2 (1 - h_{ii})}} \]
Also called “internally studentized residuals” (although they are not Student). R function: rstandard().
It can be shown that: \[ t_i^* = t_i \sqrt{\frac{n-p-1}{n-p-t_i^2}} \]
So the studentized residuals are cheap to compute: there is no need to actually refit the \(n\) leave-one-out models.
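A quick numerical check of this identity (a minimal sketch on arbitrary simulated data; rstandard() gives the \(t_i\) and rstudent() the \(t_i^*\)):

## Check: t_i^* = t_i * sqrt((n - p - 1) / (n - p - t_i^2))
set.seed(7)
n <- 100; p <- 3
x_1 <- rnorm(n); x_2 <- rnorm(n)
y <- 1 + x_1 - 2 * x_2 + rnorm(n)
fit <- lm(y ~ x_1 + x_2)
t_std <- rstandard(fit)                               # standardized residuals t_i
t_star <- t_std * sqrt((n - p - 1) / (n - p - t_std^2))
max(abs(t_star - rstudent(fit)))                      # should be numerically ~ 0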
Studentized Residuals \[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} \sim \mathcal{T}_{n - p -1}. \]
For each \(i\), with probability \(1 - \alpha\): \[ |t_i^*| \leq t_{n-p-1}(1-\alpha/2) \]
Note: if \(n\) is large and \(\alpha=0.05\), \(t_{n-p-1}(1-\alpha/2) \approx 1.96\).
set.seed(1289)
## Predictors
n <- 500
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)
## Model
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1
## sim
eps <- rnorm(n, mean = 0, sd = 2)
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Fit
p <- 3
fit <- lm(y_sim ~ x_1 + x_2)
## 5% threshold
alp <- 0.05
t_student <- qt(1 - alp/2, n-p-1)
## Studentized residuals
res_student <- rstudent(fit)
## How many above the threshold?
mean(abs(res_student) >= t_student)
## [1] 0.046
## Plot
plot(res_student)
lines(rep(t_student, n), col = "firebrick", lwd = 2)
lines(rep(-t_student, n), col = "firebrick", lwd = 2)
An Outlier is a point such that \[ t_i^* \gg t_{n-p-1}(1-\alpha/2) \]
i.e. a point whose studentized residual is too large for us to believe it was drawn from the Student distribution above.
“Much greater” (\(\gg\)) does not have a formal definition…
## Generate an outlier
y_sim[100] <- 1.3 * max(y_sim)
## Fit
fit <- lm(y_sim ~ x_1 + x_2)
## Plot
plot(rstudent(fit))
lines(rep(t_student, n), col = "firebrick", lwd = 2)
lines(rep(-t_student, n), col = "firebrick", lwd = 2)
Studentized Residuals \[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} \sim \mathcal{T}_{n - p -1}. \]
If \(n\) is large, then \(\mathcal{T}_{n - p -1} \approx \mathcal{N}(0,1)\).
\(\to\) QQ-plot on the Studentized residuals.
## Gaussian sim
eps <- rnorm(n, mean = 0, sd = 2)
## QQ norm
qqnorm(eps)
qqline(eps, col = "lightblue", lwd = 2)
## Sim
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Fit
fit <- lm(y_sim ~ x_1 + x_2)
## Studentized residuals
qqnorm(rstudent(fit)); qqline(rstudent(fit), col = "lightblue", lwd = 2)
## Student sim
eps <- rt(n, 3)
## QQ norm
qqnorm(eps)
qqline(eps, col = "lightblue", lwd = 2)
## Sim
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Fit
fit <- lm(y_sim ~ x_1 + x_2)
## Studentized residuals
qqnorm(rstudent(fit)); qqline(rstudent(fit), col = "lightblue", lwd = 2)
What about the distribution of \(\hat{\beta}_2\) ?
If \(\epsilon_i \sim \mathcal{N}(0, \sigma^2 )\) iid, then \[ \hat{\beta}_2 \sim \mathcal{N}(\beta_2, \sigma^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{22}) \]
If the \(\epsilon_i\) are not Gaussian, this is not true anymore.
Is it approximately true?
set.seed(1289)
Nsim <- 1000
beta_hat_2 <- vector("numeric", Nsim)
for (i in 1:Nsim) {
  ## error
  eps <- rnorm(n, 0, 1)
  ## Sim
  y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
  ## Fit
  fit <- lm(y_sim ~ x_1 + x_2)
  ## Get beta_2 estimate
  beta_hat_2[i] <- coef(fit)[3]
}
## Empirical distribution
qqnorm(beta_hat_2); qqline(beta_hat_2, col = "lightblue", lwd = 2)
set.seed(1289)
Nsim <- 1000
beta_hat_2 <- vector("numeric", Nsim)
for (i in 1:Nsim) {
  ## error
  eps <- rt(n, 3)
  ## Sim
  y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
  ## Fit
  fit <- lm(y_sim ~ x_1 + x_2)
  ## Get beta_2 estimate
  beta_hat_2[i] <- coef(fit)[3]
}
## Empirical distribution
qqnorm(beta_hat_2); qqline(beta_hat_2, col = "lightblue", lwd = 2)
set.seed(1289)
Nsim <- 1000
beta_hat_2 <- vector("numeric", Nsim)
for (i in 1:Nsim) {
  ## error
  eps <- rt(n, 1)
  ## Sim
  y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
  ## Fit
  fit <- lm(y_sim ~ x_1 + x_2)
  ## Get beta_2 estimate
  beta_hat_2[i] <- coef(fit)[3]
}
## Empirical distribution
qqnorm(beta_hat_2); qqline(beta_hat_2, col = "lightblue", lwd = 2)
No rigorous test
Useful to plot \(t^*_i\) against \(\hat{y}_i\)
Detect increasing variance, or changing expectation
set.seed(1289)
## Predictors
n <- 200
x_1 <- sort(runif(n, min = 0, max = 2))
x_2 <- sort(runif(n, min = 0, max = 2))
## Model
beta_0 <- 1; beta_1 <- 3; beta_2 <- 1
## sim
sd_het <- abs(rnorm(n, sqrt(1:n)/2, 0.1))
eps <- rnorm(n, mean = 0, sd = sd_het)
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Fit
fit <- lm(y_sim ~ x_1 + x_2)
## Studentized residuals vs pred
plot(predict(fit), rstudent(fit), xlab = expression(hat(y)[i]), ylab = expression(paste(t[i],"*")))
## Abs Studentized residuals vs pred
plot(predict(fit), abs(rstudent(fit)), xlab = expression(hat(y)[i]), ylab = expression(abs(paste(t[i],"*"))))
lines(lowess(predict(fit), abs(rstudent(fit))), col = "firebrick")
Weight observations:
## Weighting
y_weight <- y_sim / sqrt(1:n)
## Fit
fit <- lm(y_weight ~ x_1 + x_2)
## Studentized residuals vs pred
plot(predict(fit), rstudent(fit), xlab = expression(hat(y)[i]), ylab = expression(t[i]^"*"))
## Abs Studentized residuals vs pred
plot(predict(fit), abs(rstudent(fit)), xlab = expression(hat(y)[i]), ylab = expression(abs(paste(t[i],"*"))))
lines(lowess(predict(fit), abs(rstudent(fit))), col = "firebrick")
Spatial or temporal structure
“Forgotten” predictor
No rigorous test or method
set.seed(1289)
## Predictors
n <- 200
x_1 <- sort(runif(n, min = 0, max = 2))
x_2 <- sin(sort(3 * runif(n, min = 0, max = 2)))
## sim
eps <- rnorm(n, mean = 0, sd = 0.2)
y_sim <- 1 + 2 * x_1 + 0.5 * x_2 + eps
## Fit
fit <- lm(y_sim ~ x_1)
## Studentized residuals vs pred
plot(predict(fit), rstudent(fit), xlab = expression(hat(y)[i]), ylab = expression(paste(t[i],"*")))
lines(lowess(predict(fit), rstudent(fit)), col = "firebrick")
## Fit
fit <- lm(y_sim ~ x_1 + x_2)
## Studentized residuals vs pred
plot(predict(fit), rstudent(fit), xlab = expression(hat(y)[i]), ylab = expression(paste(t[i],"*")))
lines(lowess(predict(fit), rstudent(fit)), col = "firebrick")
## Advertising data
library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")
## linear model
fit <- lm(sales ~ TV + radio, data = ad)
What if \(\texttt{sales}\) depends on \(f(\texttt{TV})\) instead of just \(\texttt{TV}\), with \(f\) known?
I.e. take non-linear effects into account.
E.g. \(f(\cdot)\) is \(\sqrt{\cdot}\), \(\log(\cdot)\), …
No rigorous way to find \(f\) (in this class).
Look at the data.
attach(ad)
par(mfrow = c(1, 2))
plot(TV, sales); plot(sqrt(TV), sales)
## linear model
fit <- lm(sales ~ sqrt(TV) + radio, data = ad)
What if \(\texttt{TV}\) and \(\texttt{radio}\) are not independent?
I.e. are there synergy effects?
If I spend some money on \(\texttt{radio}\), does it increase the effect of \(\texttt{TV}\)?
If I have \(\$100\), is it better to spend it all on either \(\texttt{TV}\) or \(\texttt{radio}\), or \(\$50\) on each?
We can fit the model: \[ \texttt{sales} = \beta_0 + \beta_1 \texttt{TV} + \beta_2 \texttt{radio} + \beta_3 \texttt{TV} \times \texttt{radio} + \epsilon \]
i.e. \[ \texttt{sales} = \beta_0 + (\beta_1 + \beta_3 \texttt{radio}) \texttt{TV} + \beta_2 \texttt{radio} + \epsilon \]
\(\beta_3\): increase in the effectiveness of \(\texttt{TV}\) advertising for a one-unit increase in \(\texttt{radio}\) advertising (and vice versa).
## linear model
fit <- lm(sales ~ sqrt(TV) + radio + radio * sqrt(TV), data = ad)
summary(fit)
## 
## Call:
## lm(formula = sales ~ sqrt(TV) + radio + radio * sqrt(TV), data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0562 -0.2757 -0.0121  0.2758  1.2421 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.4444112  0.1793714  24.778  < 2e-16 ***
## sqrt(TV)       0.4383960  0.0150223  29.183  < 2e-16 ***
## radio         -0.0500957  0.0062645  -7.997 1.09e-13 ***
## sqrt(TV):radio 0.0215106  0.0005179  41.538  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4476 on 196 degrees of freedom
## Multiple R-squared:  0.9928, Adjusted R-squared:  0.9926 
## F-statistic:  8949 on 3 and 196 DF,  p-value: < 2.2e-16
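Side note (not from the original slides): in R's formula syntax, radio * sqrt(TV) already expands to the two main effects plus their interaction, so the same model can be written more compactly. A quick check that the two formulations coincide:

## Equivalent, more compact formula: `*` = main effects + interaction
fit2 <- lm(sales ~ sqrt(TV) * radio, data = ad)
all.equal(coef(fit), coef(fit2))      # should be TRUE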
Question: Are my estimates robust?
I.e. if I remove one observation, do they change a lot?
E.g. outliers can have strong leverage.
set.seed(1289)
## sim
n <- 20
x_1 <- runif(n-1, min = -2, max = 2)
eps <- rnorm(n-1, mean = 0, sd = 1)
y_sim <- 1 - 2 * x_1 + eps
## Outlier
x_1[n] <- 4
y_sim[n] <- 10 * max(y_sim)
## Plot
plot(x_1, y_sim)
abline(a = 1, b = -2, col = "firebrick", lwd = 2)
## Fit
fit <- lm(y_sim ~ x_1)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ x_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.155  -4.801  -1.460   3.757  27.280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.431      2.163   1.586  0.13017   
## x_1            5.480      1.752   3.128  0.00581 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.639 on 18 degrees of freedom
## Multiple R-squared:  0.3522, Adjusted R-squared:  0.3162 
## F-statistic: 9.787 on 1 and 18 DF,  p-value: 0.005806
## Plot
plot(x_1, y_sim)
abline(a = 1, b = -2, col = "firebrick", lwd = 2) ## True
abline(reg = fit, col = "lightblue", lwd = 2) ## Fitted
## Fit without outlier
fit_no <- lm(y_sim[-n] ~ x_1[-n])
summary(fit_no)
## 
## Call:
## lm(formula = y_sim[-n] ~ x_1[-n])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5245 -0.7559  0.1282  0.4912  1.3617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.2107     0.2133   5.677 2.73e-05 ***
## x_1[-n]      -2.3386     0.2442  -9.578 2.90e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9235 on 17 degrees of freedom
## Multiple R-squared:  0.8437, Adjusted R-squared:  0.8345 
## F-statistic: 91.73 on 1 and 17 DF,  p-value: 2.903e-08
## Plot
plot(x_1, y_sim)
abline(a = 1, b = -2, col = "firebrick", lwd = 2) ## True
abline(reg = fit, col = "lightblue", lwd = 2) ## Fitted
abline(reg = fit_no, col = "darkgreen", lwd = 2) ## Fitted - no outlier
Removing one point has a dramatic effect.
Can we compute a weight for the importance of one observation?
Look at the projection matrix \(\to\) leverage.
The prediction is computed from: \[ \hat{\mathbf{y}} = \mathbf{H} \mathbf{y} \] with \[ \mathbf{H} = \mathbf{P}^{\mathbf{X}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \]
i.e. \[ \hat{y}_i = \sum_{j = 1}^n h_{ij} y_j = h_{ii}y_i + \sum_{j\neq i} h_{ij} y_j \]
\(h_{ii}\) gives the “weight” of observation \(i\) on its own prediction.
Properties of \(\mathbf{H}\)?
\[ \mathbf{H} = \mathbf{P}^{\mathbf{X}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \]
We have:
Reminder: \[ \mathbf{H} = \mathbf{H}^2 \quad \text{and} \quad \mathbf{H}^T = \mathbf{H} \]
1- \[ \text{tr}(\mathbf{H}) = \operatorname{rank}(\mathbf{X}) = p \]
2- \[ p = \text{tr}(\mathbf{H}) = \text{tr}(\mathbf{H}^2) = \text{tr}(\mathbf{H}^T\mathbf{H}) = \sum_{i = 1}^n\sum_{j = 1}^n h_{ij}^2 \]
3- \[ \mathbf{H}_{ii} = (\mathbf{H}^2)_{ii} \]
hence: \[ h_{ii} = \sum_{j = 1}^n h_{ij}h_{ji} = \sum_{j = 1}^n h_{ij}^2 = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2 \]
\[ h_{ii}(1 - h_{ii}) = \sum_{j \neq i} h_{ij}^2 \geq 0 \]
and \[ 0 \leq h_{ii} \leq 1 \]
4- \[ h_{ii}(1 - h_{ii}) = \sum_{j \neq i} h_{ij}^2 \]
If \(h_{ii} = 0\) or \(h_{ii} = 1\), then \(h_{ij} = 0\) for all \(j \neq i\).
5- \[ h_{ii}(1 - h_{ii}) = h_{ij}^2 + \sum_{k \neq i,j} h_{ik}^2 \]
\(h_{ii}(1 - h_{ii})\) is maximal at \(h_{ii} = 0.5\), so: \[ 0 \leq h_{ij}^2 + \sum_{k \neq i,j} h_{ik}^2 \leq 0.25 \] \[ 0 \leq h_{ij}^2 \leq 0.25 \] and, for \(j \neq i\): \[ -0.5 \leq h_{ij} \leq 0.5 \]
\[ \hat{y}_i = \sum_{j = 1}^n h_{ij} y_j = h_{ii}y_i + \sum_{j\neq i} h_{ij} y_j \]
As \(\text{tr}(\mathbf{H}) = \operatorname{rank}(\mathbf{X}) = p\), if the \(h_{ii}\) are perfectly balanced (all equal to \(h\)), then \(h = p/n\).
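These properties are easy to check numerically (a minimal sketch on an arbitrary simulated design; hatvalues() returns the diagonal of \(\mathbf{H}\)):

## Check the properties of the diagonal of H
set.seed(3)
n <- 50
x_1 <- rnorm(n); x_2 <- rnorm(n)
fit <- lm(rnorm(n) ~ x_1 + x_2)            # p = 3 (intercept + 2 predictors)
h <- hatvalues(fit)
c(sum_h = sum(h),                          # trace(H) = p = 3
  mean_h = mean(h),                        # = p / n
  min_h = min(h), max_h = max(h))          # all h_ii within [0, 1]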
Leverage point: observation \(i\) is a leverage point if \(h_{ii}\) is large compared to the average leverage \(p/n\) (common rules of thumb: \(h_{ii} > 2p/n\) or \(h_{ii} > 3p/n\)).
## Plot
plot(x_1, y_sim)
abline(a = 1, b = -2, col = "firebrick", lwd = 2) ## True
abline(reg = fit, col = "lightblue", lwd = 2) ## Fitted
abline(reg = fit_no, col = "darkgreen", lwd = 2) ## Fitted - no outlier
## Leverage
h_val <- hatvalues(fit)
## Plot
barplot(h_val)
abline(h = 2 * 2 / n, lty = 2)
abline(h = 3 * 2 / n, lty = 3)
set.seed(1289)
## sim
n <- 20
x_1 <- runif(n-1, min = -2, max = 2)
eps <- rnorm(n-1, mean = 0, sd = 1)
y_sim <- 1 - 2 * x_1 + eps
## (Not an) Outlier
x_1[n] <- 4
y_sim[n] <- 1 - 2 * x_1[n]
Last point is consistent with the other observations.
But its \(x\) value is high \(\to\) high leverage point.
Recall: \[ \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{F}^p_{n-p} \]
Confidence Region: with a probability of \(1-\alpha\): \[ \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \leq f^p_{n-p}(1 - \alpha) \]
Cook Distance: \[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) \]
Leave-one-out cross validation: for all \(i\)
Cook Distance: \[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) \]
We have: \[ C_i = \frac{h_{ii}(y_{i} - \hat{y}_{i}^P)^2}{p\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})^2}\frac{\hat{\epsilon}_i^2}{\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})}t_i^2 \] where \(t_i\) is the standardized residual defined above.
Note:
Cook “distance” is actually the square of a distance.
\[ C_i = \frac{h_{ii}(y_{i} - \hat{y}_{i}^P)^2}{p\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})^2}\frac{\hat{\epsilon}_i^2}{\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})}t_i^2 \]
Hints:
\[ (\mathbf{A} + \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{ \mathbf{A}^{-1} \mathbf{u}\mathbf{v}^T \mathbf{A}^{-1} }{ 1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u} }. \]
\[ \mathbf{X}^T\mathbf{X} = \mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)} + (\mathbf{x}^{i})^T\mathbf{x}^i \]
\[ \mathbf{X}^T\mathbf{y} = \mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} + (\mathbf{x}^{i})^Ty_i \]
\[ h_{ii} = [\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}]_{ii} = \mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \]
So: \[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X} - (\mathbf{x}^{i})^T\mathbf{x}^i)^{-1} = \cdots \]
\[ \hat{\boldsymbol{\beta}}_{(-i)} = (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}\mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} = \cdots \]
\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X} - (\mathbf{x}^{i})^T\mathbf{x}^i)^{-1} = \cdots \]
\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{ (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} }{ 1 - \mathbf{x}^{i}(\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{i})^T } \]
\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \]
\[ \hat{\boldsymbol{\beta}}_{(-i)} = (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}\mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} \]
But: \[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \] \[ \mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} = \mathbf{X}^T\mathbf{y} - (\mathbf{x}^{i})^Ty_i \]
Hence: \[ \begin{multline} \hat{\boldsymbol{\beta}}_{(-i)} = \left[ (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \right] \\ \times [\mathbf{X}^T\mathbf{y} - (\mathbf{x}^{i})^Ty_i] \end{multline} \]
\[ \begin{multline} \hat{\boldsymbol{\beta}}_{(-i)} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y} - (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^Ty_i \\ + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T}{1 - h_{ii}}\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y} \\ - \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{i})^T}{1 - h_{ii}} y_i \end{multline} \]
\[ \begin{multline} \hat{\boldsymbol{\beta}}_{(-i)} = \hat{\boldsymbol{\beta}} - (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^Ty_i + \frac{1}{1-h_{ii}}(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \mathbf{x}^i \hat{\boldsymbol{\beta}} \\ - \frac{h_{ii}}{1-h_{ii}} (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T y_i \end{multline} \]
\[ \hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \left[\frac{1}{1-h_{ii}} \hat{y}_i - \left(1 + \frac{h_{ii}}{1-h_{ii}}\right) y_i \right] \]
\[ \begin{aligned} \hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}} &= (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \frac{1}{1-h_{ii}} (\hat{y}_i - y_i)\\ &= - (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \frac{1}{1-h_{ii}} \hat{\epsilon}_i \end{aligned} \]
But remember: \[ y_i - \hat{y}_i^P = \frac{1}{1-h_{ii}}\hat{\epsilon}_i \] hence: \[ \hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}} = - (y_i - \hat{y}_i^P) (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \]
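This leave-one-out update formula can also be verified numerically (a minimal sketch on a small, arbitrary simulated example, not part of the proof):

## Check: beta_hat_(-i) - beta_hat = -(y_i - y_i^P) (X^T X)^{-1} (x^i)^T
set.seed(5)
n <- 40
x_1 <- rnorm(n); x_2 <- rnorm(n)
y <- 2 + x_1 - x_2 + rnorm(n)
X <- cbind(1, x_1, x_2)
fit <- lm(y ~ x_1 + x_2)
i <- 7                                          # any observation
fit_i <- lm(y[-i] ~ x_1[-i] + x_2[-i])          # fit without observation i
y_pred_i <- sum(coef(fit_i) * X[i, ])           # leave-one-out prediction of y_i
delta <- -(y[i] - y_pred_i) * solve(crossprod(X)) %*% X[i, ]
max(abs((coef(fit_i) - coef(fit)) - delta))     # should be numerically ~ 0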
We have: \[ \hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}} = - (y_i - \hat{y}_i^P) (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \] \[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) \]
Hence: \[ \begin{aligned} C_i &= \frac{1}{p\hat{\sigma}^2} (\mathbf{x}^{i}) (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{X}^T\mathbf{X}) (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T (y_i - \hat{y}_i^P)^2\\ &= \frac{1}{p\hat{\sigma}^2} h_{ii} (y_i - \hat{y}_i^P)^2 \end{aligned} \]
\[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) = \frac{1}{p\hat{\sigma}^2} h_{ii} (y_i - \hat{y}_i^P)^2 \]
and: \[ y_i - \hat{y}_i^P = \frac{1}{1-h_{ii}}\hat{\epsilon}_i \]
hence: \[ C_i = \frac{h_{ii}}{p(1-h_{ii})^2}\frac{\hat{\epsilon}_i^2}{\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})}t_i^2 \]
\[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) \]
If \(C_i\) is “high”, then the point has a strong influence on the estimation.
According to Cook (1977), \(C_i\) can be compared with the quantiles of the \(\mathcal{F}^p_{n-p}\) distribution that defines the confidence region above: a value around or above the median \(f^p_{n-p}(0.5)\) flags an influential point.
\[ C_i = \frac{1}{p}\frac{h_{ii}}{(1-h_{ii})}t_i^2 \]
If \(C_i\) is “high”, then the point has a strong influence on the estimation.
\(C_i\) is the product of two contributions: a leverage term \(h_{ii}/(1-h_{ii})\) and the squared standardized residual \(t_i^2\) (up to the factor \(1/p\)).
## linear model
fit <- lm(sales ~ sqrt(TV) + radio + radio * sqrt(TV), data = ad)
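As a follow-up (a sketch, assuming the fit object from the chunk above), R's cooks.distance() can be compared with the closed-form expression and plotted to spot influential observations:

## Cook's distances for the fitted model
C <- cooks.distance(fit)
## Check against the closed-form expression h_ii / (p (1 - h_ii)) * t_i^2
h <- hatvalues(fit)
p <- length(coef(fit))
C_formula <- h / (p * (1 - h)) * rstandard(fit)^2
max(abs(C - C_formula))                 # should be numerically ~ 0
## Plot the distances to spot influential observations
plot(C, type = "h", ylab = "Cook's distance")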