27/03/2024

Model Validation

Gaussian Model

Model:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

  • \(\mathbf{y}\) random vector of \(n\) responses
  • \(\mathbf{X}\) non-random \(n\times p\) matrix of predictors
  • \(\boldsymbol{\epsilon}\) random vector of \(n\) errors
  • \(\boldsymbol{\beta}\) non-random, unknown vector of \(p\) coefficients

Assumptions:

  • (H1) \(\text{rg}(\mathbf{X}) = p\) (\(\mathbf{X}\) has full column rank)
  • (H2) \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)\)

Advertising Data

Question:
Are the residuals independent, identically distributed, with the same variance \(\sigma^2\)?

Validation

  • Goal: Test hypothesis (H2): \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)\).
  • Check residuals:
    • Are there outliers?
    • Gaussian assumption?
    • Homoscedasticity?
    • Structure?
  • Check fit:
    • Leverage
    • Cook distance

Studentized Residuals

Errors vs Residuals

From (H2), the errors are iid: \[ \epsilon_i \sim \mathcal{N}(0, \sigma^2 ) \]

The residuals have distribution: \[ \hat{\boldsymbol{\epsilon}} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I}_n - \mathbf{P}^{\mathbf{X}})\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \sigma^2 (\mathbf{I}_n - \mathbf{P}^{\mathbf{X}})) \]

Writing: \[ \mathbf{H} = \mathbf{P}^{\mathbf{X}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \]

The residuals have distribution: \[ \hat{\epsilon}_i \sim \mathcal{N}(0, \sigma^2 (1 - h_{ii})) \]

Errors vs Residuals

From (H2), the errors are iid: \[ \epsilon_i \sim \mathcal{N}(0, \sigma^2 ) \] But the residuals variance depends on \(i\): \[ \hat{\epsilon}_i \sim \mathcal{N}(0, \sigma^2 (1 - h_{ii})). \]

“Studentization”: try to “uniformize” the residuals so that they are iid.

Normalization

Normalized residuals: \[ \frac{\hat{\epsilon}_i}{\sqrt{\sigma^2 (1 - h_{ii})}} \sim \mathcal{N}(0, 1). \]

The normalized residuals all follow the same \(\mathcal{N}(0, 1)\) distribution (they remain slightly correlated through \(\mathbf{I}_n - \mathbf{H}\)).

Problem: \(\sigma^2\) is unknown \(\to\) replace it by \(\hat{\sigma}^2\).

Standardized residuals

What is the distribution of \[ t_i = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2 (1 - h_{ii})}} \sim ~? \]

If \(\hat{\sigma}^2 = \frac{1}{n - p} \| \hat{\epsilon} \|^2\) is the standard unbiased variance estimate, then the \(t_i\) do not follow a Student distribution (!)

That is because \(\hat{\sigma}^2\) is not independent of \(\hat{\epsilon}_i\) in general.

Solution: Replace \(\hat{\sigma}^2\) with \(\hat{\sigma}^2_{(-i)}\), an estimate of \(\sigma^2\) that is independent of \(\hat{\epsilon}_i\).

Cross Validation

Leave-one-out cross validation: for all \(i\):

  1. Remove observation \(i\) from the dataset: \[ \mathbf{y}_{(-i)} = \begin{pmatrix} y_1\\ \vdots\\ y_{i-1}\\ y_{i+1}\\ \vdots\\ y_n \end{pmatrix} \qquad\qquad \mathbf{X}_{(-i)} = \begin{pmatrix} \mathbf{x}^1\\ \vdots\\ \mathbf{x}^{i-1}\\ \mathbf{x}^{i+1}\\ \vdots\\ \mathbf{x}^{n} \end{pmatrix} \]

  2. Fit the dataset without observation \(i\):

    • Get \(\hat{\boldsymbol{\beta}}_{(-i)}\) and \(\hat{\sigma}^2_{(-i)}\) using \(\mathbf{y}_{(-i)}\) and \(\mathbf{X}_{(-i)}\).
  3. Predict point \(i\) using fitted values:

    • \(\hat{y}_i^P = \mathbf{x}^i\hat{\boldsymbol{\beta}}_{(-i)}\)

Cross Validation

Then, from CM4, we get: \[ \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}} \sim \mathcal{T}_{n - p -1} \]

The prediction error from the leave-one-out cross validation follows a Student distribution with \(n-p-1\) degrees of freedom.

Studentized Residuals

If \(\mathbf{X}\) and \(\mathbf{X}_{(-i)}\) both have rank \(p\), then:

\[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} = \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}} \]

and: \[ t_i^* \sim \mathcal{T}_{n - p -1}. \]

The studentized residuals are equal to
the normalized predictions from the leave-one-out cross validation.
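
A quick numerical check of this identity (a sketch; the simulated data and the names n_ex, dat_ex, fit_ex, t_loo are illustrative, not part of the course code): rstudent() gives the left-hand side, while the right-hand side is rebuilt from leave-one-out refits, using predict(..., se.fit = TRUE) for the term \(\mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T\).

set.seed(42)
n_ex <- 30
dat_ex <- data.frame(x1 = runif(n_ex), x2 = runif(n_ex))
dat_ex$y <- 1 + 2 * dat_ex$x1 - dat_ex$x2 + rnorm(n_ex)
fit_ex <- lm(y ~ x1 + x2, data = dat_ex)

t_loo <- sapply(seq_len(n_ex), function(i) {
  ## Fit without observation i
  fit_i <- lm(y ~ x1 + x2, data = dat_ex[-i, ])
  pred  <- predict(fit_i, newdata = dat_ex[i, ], se.fit = TRUE)
  ## hat(sigma)_(-i)
  sigma_i <- summary(fit_i)$sigma
  ## denominator^2 = sigma_(-i)^2 * (1 + x^i (X_(-i)^T X_(-i))^(-1) (x^i)^T) = sigma_(-i)^2 + se.fit^2
  (dat_ex$y[i] - pred$fit) / sqrt(sigma_i^2 + pred$se.fit^2)
})

all.equal(unname(rstudent(fit_ex)), unname(t_loo))   ## should be TRUE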

Sherman–Morrison formula

Sherman–Morrison formula (a special case of the Woodbury formula):
For any invertible square \(q \times q\) matrix \(\mathbf{A}\) and vectors \(\mathbf{u}\) and \(\mathbf{v}\) of \(\mathbb{R}^q\), the matrix \(\mathbf{A} + \mathbf{u}\mathbf{v}^T\) is invertible iff \(1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u} \neq 0\), and: \[ (\mathbf{A} + \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{ \mathbf{A}^{-1} \mathbf{u}\mathbf{v}^T \mathbf{A}^{-1} }{ 1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u} }. \]

Proof: exercise (Wikipedia is a great resource).
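
A quick numerical sanity check of the formula (a sketch; the matrix and vectors below are arbitrary):

set.seed(1)
q <- 4
## An (almost surely) invertible q x q matrix and two vectors
A <- crossprod(matrix(rnorm(q * q), q))
u <- rnorm(q); v <- rnorm(q)
A_inv <- solve(A)

lhs <- solve(A + outer(u, v))   ## (A + u v^T)^(-1), computed directly
rhs <- A_inv - (A_inv %*% outer(u, v) %*% A_inv) / drop(1 + t(v) %*% A_inv %*% u)
all.equal(lhs, rhs)             ## should be TRUE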

Studentized Residuals - Proof

\[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} = \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}} \]

Hints:

\[ (\mathbf{A} + \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{ \mathbf{A}^{-1} \mathbf{u}\mathbf{v}^T \mathbf{A}^{-1} }{ 1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u} }. \]

\[ \mathbf{X}^T\mathbf{X} = \mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)} + (\mathbf{x}^{i})^T\mathbf{x}^i \]

\[ \mathbf{X}^T\mathbf{y} = \mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} + (\mathbf{x}^{i})^Ty_i \]

\[ h_{ii} = [\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}]_{ii} = \mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \]

So: \[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X} - (\mathbf{x}^{i})^T\mathbf{x}^i)^{-1} = \cdots \]

\[ \hat{y}_i^P = \mathbf{x}^i\hat{\boldsymbol{\beta}}_{(-i)} = \mathbf{x}^i(\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}\mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} = \cdots \]

Studentized Residuals - Proof - 1/6

\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X} - (\mathbf{x}^{i})^T\mathbf{x}^i)^{-1} = \cdots \]

\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{ (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} }{ 1 - \mathbf{x}^{i}(\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{i})^T } \]

\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \]

Hence: \[ 1 + \mathbf{x}^{i}(\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T = 1 + h_{ii} +\frac{h_{ii}^2}{1 - h_{ii}} = \frac{1}{1 - h_{ii}} \]

Studentized Residuals - Proof - 2/6

\[ \hat{y}_i^P = \mathbf{x}^i\hat{\boldsymbol{\beta}}_{(-i)} = \mathbf{x}^i(\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}\mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} \]

But: \[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \] \[ \mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} = \mathbf{X}^T\mathbf{y} - (\mathbf{x}^{i})^Ty_i \]

Hence: \[ \begin{multline} \hat{y}_i^P = \mathbf{x}^i\left[ (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \right] \\ \times [\mathbf{X}^T\mathbf{y} - (\mathbf{x}^{i})^Ty_i] \end{multline} \]

Studentized Residuals - Proof - 3/6

\[ \begin{multline} \hat{y}_i^P = \mathbf{x}^i (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y} - \mathbf{x}^i (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^Ty_i \\ + \frac{\mathbf{x}^i(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T}{1 - h_{ii}}\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y} \\ - \frac{\mathbf{x}^i(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{i})^T}{1 - h_{ii}} y_i \end{multline} \]

\[ \hat{y}_i^P = \mathbf{x}^i \hat{\boldsymbol{\beta}} - h_{ii}y_i + \frac{h_{ii}}{1-h_{ii}}\mathbf{x}^i \hat{\boldsymbol{\beta}} - \frac{h_{ii}^2}{1-h_{ii}} y_i \]

\[ \hat{y}_i^P = \left(1 + \frac{h_{ii}}{1-h_{ii}}\right) \hat{y}_i - \left(h_{ii} + \frac{h_{ii}^2}{1-h_{ii}}\right) y_i \]

Studentized Residuals - Proof - 4/6

\[ \hat{y}_i^P = \frac{1}{1-h_{ii}}\hat{y}_i - \frac{h_{ii}}{1-h_{ii}} y_i \]

So: \[ y_i - \hat{y}_i^P = - \frac{1}{1-h_{ii}}\hat{y}_i + \left(1 + \frac{h_{ii}}{1-h_{ii}}\right) y_i \]

And: \[ y_i - \hat{y}_i^P = \frac{1}{1-h_{ii}}(y_i - \hat{y}_i) = \frac{1}{1-h_{ii}}\hat{\epsilon}_i \]

Studentized Residuals - Proof - 5/6

We have: \[ 1 + \mathbf{x}^{i}(\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T = \frac{1}{1 - h_{ii}} \] \[ y_i - \hat{y}_i^P = \frac{1}{1-h_{ii}}\hat{\epsilon}_i \]

Hence: \[ \begin{aligned} t_i^* &= \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}}\\ &= \frac{\hat{\epsilon}_i/(1-h_{ii})}{\sqrt{\hat{\sigma}^2_{(-i)} /(1-h_{ii})}} \\ \end{aligned} \]

Studentized Residuals - Proof - 6/6

Hence: \[ \begin{aligned} t_i^* &= \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}}\\ &= \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} \\ &\sim \mathcal{T}_{n - p -1}. \end{aligned} \]

Standardized vs Studentized Residuals

Studentized Residuals: Student distributed \[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} \sim \mathcal{T}_{n - p -1}. \]

Also called “externally studentized residuals”, “deleted residuals”, “jackknifed residuals”, “leave-one-out residuals”. Function rstudent.

Standardized Residuals: not Student in general \[ t_i = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2 (1 - h_{ii})}} \]

Also called “internally studentized residuals” (although they are not Student). Function rstandard.

Standardized vs Studentized Residuals

It can be shown that: \[ t_i^* = t_i \sqrt{\frac{n-p-1}{n-p-t_i^2}} \]

Hence the Studentized residuals are cheap to compute: no extra model fitting is needed.
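
A sketch of this relation with R's built-in functions, wrapped in a small (hypothetical) helper so it can be applied to any lm fit; the built-in cars dataset is used only as a convenient example:

check_student <- function(fit) {
  ## standardized (internally studentized) residuals
  t_std <- rstandard(fit)
  n <- nobs(fit); p <- fit$rank
  all.equal(unname(rstudent(fit)),
            unname(t_std * sqrt((n - p - 1) / (n - p - t_std^2))))
}
check_student(lm(dist ~ speed, data = cars))   ## should return TRUE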

Outliers

Studentized Residuals

Studentized Residuals \[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} \sim \mathcal{T}_{n - p -1}. \]

For each \(i\), with probability \(1 - \alpha\): \[ |t_i^*| \leq t_{n-p-1}(1-\alpha/2) \]

Note: if \(n\) is large and \(\alpha=0.05\), \(t_{n-p-1}(1-\alpha/2) \approx 1.96\).

Simulation

set.seed(1289)

## Predictors
n <- 500
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)

## Model
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1

## sim
eps <- rnorm(n, mean = 0, sd = 2)
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

## Fit
p <- 3
fit <- lm(y_sim ~ x_1 + x_2)

Simulation

## 5% threshold
alp <- 0.05
t_student <- qt(1 - alp/2, n-p-1)

## Studentized residuals
res_student <- rstudent(fit)

## How many above the threshold ?
mean(abs(res_student) >= t_student)
## [1] 0.046

Simulation

## Plot
plot(res_student)
lines(rep(t_student, n), col = "firebrick", lwd = 2)
lines(rep(-t_student, n), col = "firebrick", lwd = 2)

Outliers

An outlier is a point such that \[ t_i^* \gg t_{n-p-1}(1-\alpha/2) \]

i.e. a point that we do not believe was drawn from a Student distribution.

“Very much higher” does not have a formal definition…

Simulation - Outlier

## Generate an outlier
y_sim[100] <- 1.3 * max(y_sim)

## Fit
fit <- lm(y_sim ~ x_1 + x_2)

Simulation - Outlier

## Plot
plot(rstudent(fit))
lines(rep(t_student, n), col = "firebrick", lwd = 2)
lines(rep(-t_student, n), col = "firebrick", lwd = 2)

Gaussian Hypothesis

Studentized Residuals

Studentized Residuals \[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} \sim \mathcal{T}_{n - p -1}. \]

If \(n\) is large, then \(\mathcal{T}_{n - p -1} \approx \mathcal{N}(0,1)\).

\(\to\) QQ-plot on the Studentized residuals.

Simulation - Gaussian

## Gaussian sim
eps <- rnorm(n, mean = 0, sd = 2)

## QQ norm
qqnorm(eps)
qqline(eps, col = "lightblue", lwd = 2)

Simulation - Gaussian

## Sim
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Fit
fit <- lm(y_sim ~ x_1 + x_2)
## Studentized residuals
qqnorm(rstudent(fit)); qqline(rstudent(fit), col = "lightblue", lwd = 2)

Simulation - Not Gaussian

## Non-Gaussian sim (Student t, 3 df)
eps <- rt(n, 3)

## QQ norm
qqnorm(eps)
qqline(eps, col = "lightblue", lwd = 2)

Simulation - Not Gaussian

## Sim
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Fit
fit <- lm(y_sim ~ x_1 + x_2)
## Studentized residuals
qqnorm(rstudent(fit)); qqline(rstudent(fit), col = "lightblue", lwd = 2)

Non Gaussian case - Coefficients

What about the distribution of \(\hat{\beta}_2\)?

If \(\epsilon_i \sim \mathcal{N}(0, \sigma^2 )\) iid, then \[ \hat{\beta}_2 \sim \mathcal{N}(\beta_2, \sigma^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{22}) \]

If the \(\epsilon_i\) are not Gaussian, this is not true anymore.

Is it approximately true?

Simulation - Gaussian

What about the distribution of \(\hat{\beta}_2\)?

set.seed(1289)
Nsim <- 1000
beta_hat_2 <- vector("numeric", Nsim)
for (i in 1:Nsim) {
  ## error
  eps <- rnorm(n, 0, 1)
  ## Sim
  y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
  ## Fit
  fit <- lm(y_sim ~ x_1 + x_2)
  ## Get beta_hat_2 (coefficient of x_2)
  beta_hat_2[i] <- coef(fit)[3]
}

Simulation - Gaussian

What about the distribution of \(\hat{\beta}_2\)?

## Empirical distribution
qqnorm(beta_hat_2); qqline(beta_hat_2, col = "lightblue", lwd = 2)

Simulation - Not Gaussian

What about the distribution of \(\hat{\beta}_2\)?

set.seed(1289)
Nsim <- 1000
beta_hat_2 <- vector("numeric", Nsim)
for (i in 1:Nsim) {
  ## error
  eps <- rt(n, 3)
  ## Sim
  y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
  ## Fit
  fit <- lm(y_sim ~ x_1 + x_2)
  ## Get beta_hat_2 (coefficient of x_2)
  beta_hat_2[i] <- coef(fit)[3]
}

Simulation - Not Gaussian

What about the distribution of \(\hat{\beta}_2\)?

## Empirical distribution
qqnorm(beta_hat_2); qqline(beta_hat_2, col = "lightblue", lwd = 2)

Simulation - Non-finite variance

set.seed(1289)
Nsim <- 1000
beta_hat_2 <- vector("numeric", Nsim)
for (i in 1:Nsim) {
  ## error
  eps <- rt(n, 1)
  ## Sim
  y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
  ## Fit
  fit <- lm(y_sim ~ x_1 + x_2)
  ## Get beta_hat_2 (coefficient of x_2)
  beta_hat_2[i] <- coef(fit)[3]
}

Simulation - Non-finite variance

## Empirical distribution
qqnorm(beta_hat_2); qqline(beta_hat_2, col = "lightblue", lwd = 2)

Simulation - Not Gaussian

  • If \(\epsilon\) has finite variance:
    • The coefficients will be approximately Gaussian (empirically).
    • The confidence intervals won’t be exact.
    • CIs tend to be too narrow (for heavy-tailed distributions); see the coverage sketch after this list.
  • If \(\epsilon\) has non-finite variance:
    • The Gaussian approximation for the coefficients becomes poor.
    • CIs are unreliable.
    • Extra care must be taken (robust regression).
    • Examples for \(\epsilon\): Student with \(1\) or \(2\) df, Cauchy, …
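
A sketch of the coverage point above, reusing x_1, x_2 and the coefficients from the simulations (the number of replicates is arbitrary): it estimates the empirical coverage of the standard 95% confidence interval for \(\beta_2\) under Student errors with 3 df, to be compared with the nominal level.

set.seed(1289)
Nsim <- 1000
covered <- logical(Nsim)
for (i in 1:Nsim) {
  ## error: heavy tails, finite variance
  eps <- rt(n, 3)
  ## Sim
  y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
  ## 95% CI for beta_2
  ci <- confint(lm(y_sim ~ x_1 + x_2))["x_2", ]
  covered[i] <- (ci[1] <= beta_2) && (beta_2 <= ci[2])
}
mean(covered)   ## empirical coverage, to compare with the nominal 0.95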

Homoscedasticity

Homoscedasticity

  • No rigorous test

  • Useful to plot \(t^*_i\) against \(\hat{y}_i\)

  • Detect increasing variance, or changing expectation

Increasing variance - Simulation

set.seed(1289)

## Predictors
n <- 200
x_1 <- sort(runif(n, min = 0, max = 2))
x_2 <- sort(runif(n, min = 0, max = 2))

## Model
beta_0 <- 1; beta_1 <- 3; beta_2 <- 1

## sim
sd_het <- abs(rnorm(n, sqrt(1:n)/2, 0.1))
eps <- rnorm(n, mean = 0, sd = sd_het)
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

## Fit
fit <- lm(y_sim ~ x_1 + x_2)

Increasing variance - Student residuals

## Studentized residuals vs pred
plot(predict(fit), rstudent(fit),
     xlab = expression(hat(y)[i]), ylab = expression(paste(t[i],"*")))

Increasing variance - Student residuals

## Abs Studentized residuals vs pred
plot(predict(fit), abs(rstudent(fit)),
     xlab = expression(hat(y)[i]),
     ylab = expression(abs(paste(t[i],"*"))))
lines(lowess(predict(fit), abs(rstudent(fit))), col = "firebrick")

Increasing variance - Weighting

Weight observations:

## Weighting
y_weight <- y_sim / sqrt(1:n)

## Fit
fit <- lm(y_weight ~ x_1 + x_2)
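
An alternative (a sketch, assuming the variance pattern \(\text{Var}(\epsilon_i) \propto i\) is known) is to keep the original response and use weighted least squares through the weights argument of lm, with weights proportional to \(1/\text{Var}(\epsilon_i)\):

## Weighted least squares: weights proportional to 1 / Var(eps_i), here 1 / i
fit_wls <- lm(y_sim ~ x_1 + x_2, weights = 1 / (1:n))

## Studentized residuals vs pred
plot(predict(fit_wls), rstudent(fit_wls),
     xlab = expression(hat(y)[i]), ylab = expression(t[i]^"*"))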

Increasing variance - Weighting

## Studentized residuals vs pred
plot(predict(fit), rstudent(fit),
     xlab = expression(hat(y)[i]), ylab = expression(t[i]^"*"))

Increasing variance - Weighting

## Abs Studentized residuals vs pred
plot(predict(fit), abs(rstudent(fit)),
     xlab = expression(hat(y)[i]),
     ylab = expression(abs(paste(t[i],"*"))))
lines(lowess(predict(fit), abs(rstudent(fit))), col = "firebrick")

Structure

Structure

  • Spatial or temporal structure

  • “Forgotten” predictor

  • No rigorous test or method

Structure - Simulation

set.seed(1289)

## Predictors
n <- 200
x_1 <- sort(runif(n, min = 0, max = 2))
x_2 <- sin(sort(3 * runif(n, min = 0, max = 2)))

## sim
eps <- rnorm(n, mean = 0, sd = 0.2)
y_sim <- 1 + 2 * x_1 +  0.5 * x_2 + eps

## Fit
fit <- lm(y_sim ~ x_1)

Structure - Studentized residuals

## Studentized residuals vs pred
plot(predict(fit), rstudent(fit),
     xlab = expression(hat(y)[i]), ylab = expression(paste(t[i],"*")))
lines(lowess(predict(fit), rstudent(fit)), col = "firebrick")

Structure - Studentized residuals

## Fit
fit <- lm(y_sim ~ x_1 + x_2)

## Studentized residuals vs pred
plot(predict(fit), rstudent(fit),
     xlab = expression(hat(y)[i]), ylab = expression(paste(t[i],"*")))
lines(lowess(predict(fit), rstudent(fit)), col = "firebrick")

Advertising data

Advertising data - Fit

## Advertising data
library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")

## linear model
fit <- lm(sales ~ TV + radio, data = ad)

Advertising data - Plots

Transform the data

What if sales depends on \(f\)(TV) instead of just TV, with \(f\) known?

I.e. take non-linear effects into account.

E.g. \(f(\cdot)\) is \(\sqrt{\cdot}\), \(\log(\cdot)\), …

No rigorous way to find \(f\) (in this class).

Look at the data.

Advertising data - Plots

attach(ad)
par(mfrow = c(1, 2))
plot(TV, sales); plot(sqrt(TV), sales);

Advertising data - Fit - Transform

## linear model
fit <- lm(sales ~ sqrt(TV) + radio, data = ad)

Interacting effects

What if TV and radio are not independent?

I.e. are there synergy effects?

If I spend some money on radio, does it increase the effect of TV?

If I have \(100\$\), is it better to spend all of it on either TV or radio, or \(50\$\) on each?

Interacting effects

We can fit the model: \[ \texttt{sales} = \beta_0 + \beta_1 \texttt{TV} + \beta_2 \texttt{radio} + \beta_3 \texttt{TV} \times \texttt{radio} + \epsilon \]

i.e \[ \texttt{sales} = \beta_0 + (\beta_1 + \beta_3 \texttt{radio}) \texttt{TV} + \beta_2 \texttt{radio} + \epsilon \]

\(\beta_3\): increase in the effectiveness of TV advertising for a one unit increase in radio advertising (or vice-versa)

Advertising data - Fit - Interactions

## linear model
fit <- lm(sales ~ sqrt(TV) + radio + radio * sqrt(TV), data = ad)

Advertising data - Fit - Interactions

summary(fit)
## 
## Call:
## lm(formula = sales ~ sqrt(TV) + radio + radio * sqrt(TV), data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0562 -0.2757 -0.0121  0.2758  1.2421 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.4444112  0.1793714  24.778  < 2e-16 ***
## sqrt(TV)        0.4383960  0.0150223  29.183  < 2e-16 ***
## radio          -0.0500957  0.0062645  -7.997 1.09e-13 ***
## sqrt(TV):radio  0.0215106  0.0005179  41.538  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4476 on 196 degrees of freedom
## Multiple R-squared:  0.9928, Adjusted R-squared:  0.9926 
## F-statistic:  8949 on 3 and 196 DF,  p-value: < 2.2e-16

Outliers and Leverage

Outliers and Leverage

Question: Are my estimates robust?

i.e. if I remove one observation, do they change a lot?

e.g. outliers can have high leverage.

Outliers and Leverage - Simulation

set.seed(1289)

## sim
n <- 20
x_1 <- runif(n-1, min = -2, max = 2)
eps <- rnorm(n-1, mean = 0, sd = 1)
y_sim <- 1 - 2 * x_1 + eps

## Outlier
x_1[n] <- 4
y_sim[n] <- 10 * max(y_sim)

Outliers and Leverage - Simulation

## Plot
plot(x_1, y_sim)
abline(a = 1, b = -2, col = "firebrick", lwd = 2)

Outliers and Leverage - Simulation

## Fit
fit <- lm(y_sim ~ x_1)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ x_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.155  -4.801  -1.460   3.757  27.280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.431      2.163   1.586  0.13017   
## x_1            5.480      1.752   3.128  0.00581 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.639 on 18 degrees of freedom
## Multiple R-squared:  0.3522, Adjusted R-squared:  0.3162 
## F-statistic: 9.787 on 1 and 18 DF,  p-value: 0.005806

Outliers and Leverage - Simulation

## Plot
plot(x_1, y_sim)
abline(a = 1, b = -2, col = "firebrick", lwd = 2) ## True
abline(reg = fit, col = "lightblue", lwd = 2) ## Fitted

Outliers and Leverage - Simulation

## Fit without outlier
fit_no <- lm(y_sim[-n] ~ x_1[-n])
summary(fit_no)
## 
## Call:
## lm(formula = y_sim[-n] ~ x_1[-n])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5245 -0.7559  0.1282  0.4912  1.3617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.2107     0.2133   5.677 2.73e-05 ***
## x_1[-n]      -2.3386     0.2442  -9.578 2.90e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9235 on 17 degrees of freedom
## Multiple R-squared:  0.8437, Adjusted R-squared:  0.8345 
## F-statistic: 91.73 on 1 and 17 DF,  p-value: 2.903e-08

Outliers and Leverage - Simulation

## Plot
plot(x_1, y_sim)
abline(a = 1, b = -2, col = "firebrick", lwd = 2) ## True
abline(reg = fit, col = "lightblue", lwd = 2) ## Fitted
abline(reg = fit_no, col = "darkgreen", lwd = 2) ## Fitted - no outlier

Outliers and Leverage

Removing one point has a dramatic effect.

Can we compute a weight for the importance of one observation?

Look at the projection matrix \(\to\) leverage.

Orthogonal Projection Matrix

Orthogonal Projection Matrix

The prediction is computed from: \[ \hat{\mathbf{y}} = \mathbf{H} \mathbf{y} \] with \[ \mathbf{H} = \mathbf{P}^{\mathbf{X}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \]

i.e. \[ \hat{y}_i = \sum_{j = 1}^n h_{ij} y_j = h_{ii}y_i + \sum_{j\neq i} h_{ij} y_j \]

\(h_{ii}\) gives the “weight” of each observation on its own prediction.

Properties of \(\mathbf{H}\)?

Properties of \(\mathbf{H}\)

\[ \mathbf{H} = \mathbf{P}^{\mathbf{X}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \]

We have:

  1. \(\text{tr}(\mathbf{H}) = \sum_{i = 1}^n h_{ii} = p\).
  2. \(\sum_{i = 1}^n\sum_{j = 1}^n h_{ij}^2 = p\).
  3. \(0 \leq h_{ii} \leq 1\) for all \(i \in \{1, \dotsc, n\}\).
  4. If \(h_{ii} = 0\) or \(h_{ii} = 1\), then \(h_{ij} = 0\) for all \(j \neq i\).
  5. \(-0.5 \leq h_{ij} \leq 0.5\) for all \(i \neq j\).

Reminder: \[ \mathbf{H} = \mathbf{H}^2 \quad \text{and} \quad \mathbf{H}^T = \mathbf{H} \]
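
A quick numerical check of properties 1, 2 and 3 (a sketch; the design matrix below is arbitrary and separate from the simulations above):

set.seed(1)
X_chk <- cbind(1, matrix(rnorm(50 * 2), 50, 2))           ## 50 x 3 design with intercept
H_chk <- X_chk %*% solve(crossprod(X_chk)) %*% t(X_chk)   ## H = X (X'X)^(-1) X'

sum(diag(H_chk))      ## property 1: trace(H) = p = 3
sum(H_chk^2)          ## property 2: sum of squared entries = p = 3
range(diag(H_chk))    ## property 3: all h_ii within [0, 1]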

Properties of \(\mathbf{H}\) - Proof - 1/4

1- \[ \text{tr}(\mathbf{H}) = \text{rg}(\mathbf{X}) = p \]

2- \[ p = \text{tr}(\mathbf{H}) = \text{tr}(\mathbf{H}^2) = \text{tr}(\mathbf{H}^T\mathbf{H}) = \sum_{i = 1}^n\sum_{j = 1}^n h_{ij}^2 \]

Properties of \(\mathbf{H}\) - Proof - 2/4

3- \[ \mathbf{H}_{ii} = (\mathbf{H}^2)_{ii} \]

hence: \[ h_{ii} = \sum_{j = 1}^n h_{ij}h_{ji} = \sum_{j = 1}^n h_{ij}^2 = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2 \]

\[ h_{ii}(1 - h_{ii}) = \sum_{j \neq i} h_{ij}^2 \geq 0 \]

and \[ 0 \leq h_{ii} \leq 1 \]

Properties of \(\mathbf{H}\) - Proof - 3/4

4- \[ h_{ii}(1 - h_{ii}) = \sum_{j \neq i} h_{ij}^2 \]

If \(h_{ii} = 0\) or \(h_{ii} = 1\), then \(h_{ij} = 0\) for all \(j \neq i\).

Properties of \(\mathbf{H}\) - Proof - 4/4

5- \[ h_{ii}(1 - h_{ii}) = h_{ij}^2 + \sum_{k \neq i,j} h_{ik}^2 \]

\(h_{ii}(1 - h_{ii})\) is maximal at \(h_{ii} = 0.5\), so: \[ 0 \leq h_{ij}^2 + \sum_{k \neq i,j} h_{ik}^2 \leq 0.25 \] \[ 0 \leq h_{ij}^2 \leq 0.25 \] and: \[ -0.5 \leq h_{ij} \leq 0.5 \]

Leverage

\[ \hat{y}_i = \sum_{j = 1}^n h_{ij} y_j = h_{ii}y_i + \sum_{j\neq i} h_{ij} y_j \]

  • If \(h_{ii} = 1\), then \(h_{ij} = 0\) for all \(j \neq i\), and \(\hat{y}_i = h_{ii}y_i\)
    \(\to\) \(\hat{y}_i\) is completely determined by \(y_i\).
  • If \(h_{ii} = 0\), then \(h_{ij} = 0\) for all \(j \neq i\), and \(\hat{y}_i = 0\)
    \(\to\) \(y_i\) does not have any effect on \(\hat{y}_i\).
  • If \(h_{ii}\) is “large”
    \(\to\) \(y_i\) has a strong impact on \(\hat{y}_i\).
  • What is “large”?

Leverage

As \(\text{tr}(\mathbf{H}) = \text{rg}(\mathbf{X}) = p\), if the \(h_{ii}\) are perfectly balanced (all equal to \(h\)), then \(h = p/n\).

Leverage point: point \(i\) is a leverage point if:

  • \(h_{ii} > 2p/n\) [Hoaglin & Welsch, 1978]
  • \(h_{ii} > 3p/n\), \(p > 6\) and \(n - p > 12\) [Velleman & Welsch, 1981]
  • \(h_{ii} > 0.5\) [Huber, 1981]

Leverage - Simulation

## Plot
plot(x_1, y_sim)
abline(a = 1, b = -2, col = "firebrick", lwd = 2) ## True
abline(reg = fit, col = "lightblue", lwd = 2) ## Fitted
abline(reg = fit_no, col = "darkgreen", lwd = 2) ## Fitted - no outlier

Leverage - Simulation

## Leverage
h_val <- hatvalues(fit)
## Plot
barplot(h_val)
abline(h = 2 * 2 / n, lty = 2)
abline(h = 3 * 2 / n, lty = 3)

Leverage - Simulation - bis

set.seed(1289)

## sim
n <- 20
x_1 <- runif(n-1, min = -2, max = 2)
eps <- rnorm(n-1, mean = 0, sd = 1)
y_sim <- 1 - 2 * x_1 + eps

## (Not an) Outlier
x_1[n] <- 4
y_sim[n] <- 1 - 2 * x_1[n]

Leverage - Simulation - bis

The last point is consistent with the other observations.

But its \(x\) value is high \(\to\) high leverage point.
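
A sketch of the corresponding diagnostics: the added point should stand out through its leverage, not through its studentized residual.

## Fit
fit_bis <- lm(y_sim ~ x_1)

## Leverage of the added point, to compare with 2 * p / n = 2 * 2 / 20 = 0.2
hatvalues(fit_bis)[n]

## Its studentized residual: no outlier flag expected here
rstudent(fit_bis)[n]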

Outlier vs High Leverage

  • Outlier:
    Very different from other points (\(y\)).
    Not necessarily high leverage.
    \(~\)

  • High leverage point:
    Strong impact on the estimation
    Very different predictor (\(x\)).
    Not necessarily an outlier.

Cook Distance

Outlier vs High Leverage

  • Outlier:
    • Very different from other points (\(y\)).
    • Not necessarily high leverage.
  • High leverage point:
    • Strong impact on the estimation
    • Very different predictor (\(x\)).
    • Effect of \(y_i\) on \(\hat{y}_i\).
    • Not necessarily an outlier.
  • Cook Distance:
    • “Compromise” between outliers and high leverage points.
    • Effect of \(y_i\) on \(\hat{\boldsymbol{\beta}}\) directly.

Cook Distance

Recall: \[ \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{F}^p_{n-p} \]

Confidence region: with probability \(1-\alpha\): \[ \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \leq f^p_{n-p}(1 - \alpha) \]

Cook Distance: \[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) \]
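
A numerical check of this definition (a sketch; the built-in cars dataset and the names fit_ck, X_ck, p_ck are illustrative, not part of the course code): the distance obtained by actually refitting without each observation should match R's cooks.distance.

fit_ck <- lm(dist ~ speed, data = cars)
X_ck <- model.matrix(fit_ck)
p_ck <- ncol(X_ck); s2_ck <- summary(fit_ck)$sigma^2

C_def <- sapply(seq_len(nrow(cars)), function(i) {
  ## hat(beta)_(-i) - hat(beta)
  d <- coef(lm(dist ~ speed, data = cars[-i, ])) - coef(fit_ck)
  drop(t(d) %*% crossprod(X_ck) %*% d) / (p_ck * s2_ck)
})

all.equal(unname(cooks.distance(fit_ck)), C_def)   ## should be TRUE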

Cross Validation

Leave-one-out cross validation: for all \(i\)

  1. Remove observation \(i\) from the dataset:
    • \(\mathbf{y}_{(-i)} = (y_1, \dotsc, y_{i-1}, y_{i+1}, \dotsc, y_n)^T\) (remove obs \(i\))
    • \(\mathbf{X}_{(-i)} = (\mathbf{x}^1, \dotsc, \mathbf{x}^{i-1}, \mathbf{x}^{i+1}, \dotsc, \mathbf{x}^n)^T\) (remove line \(i\))
  2. Fit the dataset without observation \(i\).
    • Get \(\hat{\boldsymbol{\beta}}_{(-i)}\) and \(\hat{\sigma}^2_{(-i)}\) using \(\mathbf{y}_{(-i)}\) and \(\mathbf{X}_{(-i)}\).
  3. Predict point \(i\) using fitted values
    • \(\hat{y}_i^P = \mathbf{x}^i\hat{\boldsymbol{\beta}}_{(-i)}\)

Cook Distance

Cook Distance: \[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) \]

We have: \[ C_i = \frac{h_{ii}(y_{i} - \hat{y}_{i}^P)^2}{p\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})^2}\frac{\hat{\epsilon}_i^2}{\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})}t_i^2 \]

Note:
Cook “distance” is actually the square of a distance.

Cook Distance - Proof

\[ C_i = \frac{h_{ii}(y_{i} - \hat{y}_{i}^P)^2}{p\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})^2}\frac{\hat{\epsilon}_i^2}{\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})}t_i^2 \]

Hints:

\[ (\mathbf{A} + \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{ \mathbf{A}^{-1} \mathbf{u}\mathbf{v}^T \mathbf{A}^{-1} }{ 1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u} }. \]

\[ \mathbf{X}^T\mathbf{X} = \mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)} + (\mathbf{x}^{i})^T\mathbf{x}^i \]

\[ \mathbf{X}^T\mathbf{y} = \mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} + (\mathbf{x}^{i})^Ty_i \]

\[ h_{ii} = [\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}]_{ii} = \mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \]

So: \[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X} - (\mathbf{x}^{i})^T\mathbf{x}^i)^{-1} = \cdots \]

\[ \hat{\boldsymbol{\beta}}_{(-i)} = (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}\mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} = \cdots \]

Cook Distance - Proof - 1/6

\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X} - (\mathbf{x}^{i})^T\mathbf{x}^i)^{-1} = \cdots \]

\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{ (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} }{ 1 - \mathbf{x}^{i}(\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{i})^T } \]

\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \]

Cook Distance - Proof - 2/6

\[ \hat{\boldsymbol{\beta}}_{(-i)} = (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}\mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} \]

But: \[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \] \[ \mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} = \mathbf{X}^T\mathbf{y} - (\mathbf{x}^{i})^Ty_i \]

Hence: \[ \begin{multline} \hat{\boldsymbol{\beta}}_{(-i)} = \left[ (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \right] \\ \times [\mathbf{X}^T\mathbf{y} - (\mathbf{x}^{i})^Ty_i] \end{multline} \]

Cook Distance - Proof - 3/6

\[ \begin{multline} \hat{\boldsymbol{\beta}}_{(-i)} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y} - (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^Ty_i \\ + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T}{1 - h_{ii}}\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y} \\ - \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{i})^T}{1 - h_{ii}} y_i \end{multline} \]

\[ \begin{multline} \hat{\boldsymbol{\beta}}_{(-i)} = \hat{\boldsymbol{\beta}} - (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^Ty_i + \frac{1}{1-h_{ii}}(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \mathbf{x}^i \hat{\boldsymbol{\beta}} \\ - \frac{h_{ii}}{1-h_{ii}} (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T y_i \end{multline} \]

\[ \hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \left[\frac{1}{1-h_{ii}} \hat{y}_i - \left(1 + \frac{h_{ii}}{1-h_{ii}}\right) y_i \right] \]

Cook Distance - Proof - 4/6

\[ \begin{aligned} \hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}} &= (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \frac{1}{1-h_{ii}} (\hat{y}_i - y_i)\\ &= - (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \frac{1}{1-h_{ii}} \hat{\epsilon}_i \end{aligned} \]

But remember: \[ y_i - \hat{y}_i^P = \frac{1}{1-h_{ii}}\hat{\epsilon}_i \] hence: \[ \hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}} = - (y_i - \hat{y}_i^P) (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \]

Cook Distance - Proof - 5/6

We have: \[ \hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}} = - (y_i - \hat{y}_i^P) (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \] \[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) \]

Hence: \[ \begin{aligned} C_i &= \frac{1}{p\hat{\sigma}^2} (\mathbf{x}^{i}) (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{X}^T\mathbf{X}) (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T (y_i - \hat{y}_i^P)^2\\ &= \frac{1}{p\hat{\sigma}^2} h_{ii} (y_i - \hat{y}_i^P)^2 \end{aligned} \]

Cook Distance - Proof - 6/6

\[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) = \frac{1}{p\hat{\sigma}^2} h_{ii} (y_i - \hat{y}_i^P)^2 \]

and: \[ y_i - \hat{y}_i^P = \frac{1}{1-h_{ii}}\hat{\epsilon}_i \]

hence: \[ C_i = \frac{h_{ii}}{p(1-h_{ii})^2}\frac{\hat{\epsilon}_i^2}{\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})}t_i^2 \]

Cook Distance

\[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) \]

  • If \(C_i\) is “high”, then the point has a strong influence on the estimation.

  • According to Cook (1977):

    • \(C_i \leq f^p_{n-p}(0.1)\): “desirable”
    • \(C_i > f^p_{n-p}(0.5) \approx 1\): “concerning”

Cook Distance

\[ C_i = \frac{1}{p}\frac{h_{ii}}{(1-h_{ii})}t_i^2 \]

  • If \(C_i\) is “high”, then the point has a strong influence on the estimation.

  • \(C_i\) is the product of two contributions:

    • \(\frac{h_{ii}}{(1-h_{ii})}\) high if \(i\) has high leverage
    • \(t_i^2\) high if \(i\) is an outlier.
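
This decomposition can be checked directly with R's built-in functions (a sketch, reusing fit_ck and p_ck from the Cook distance sketch above; \(t_i\) here is the standardized residual given by rstandard):

h_ck <- hatvalues(fit_ck)
all.equal(unname(cooks.distance(fit_ck)),
          unname(h_ck / (p_ck * (1 - h_ck)) * rstandard(fit_ck)^2))   ## should be TRUE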

Cook Distance - Outlier - 1

Cook Distance - Outlier - 2

Cook Distance - High Leverage - 1

Cook Distance - High Leverage - 2

Cook Distance - Outlier and Leverage

Advertising data

Advertising data - Fit with interactions

## linear model
fit <- lm(sales ~ sqrt(TV) + radio + radio * sqrt(TV), data = ad)