27/03/2024
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]
Question:
Are the residuals independent and identically distributed, with the same variance \(\sigma^2\)?
From (H2), the errors are iid: \[ \epsilon_i \sim \mathcal{N}(0, \sigma^2 ) \]
The residuals have distribution: \[ \hat{\boldsymbol{\epsilon}} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I}_n - \mathbf{P}^{\mathbf{X}})\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \sigma^2 (\mathbf{I}_n - \mathbf{P}^{\mathbf{X}})) \]
Writing: \[ \mathbf{H} = \mathbf{P}^{\mathbf{X}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \]
The residuals have distribution: \[ \hat{\epsilon}_i \sim \mathcal{N}(0, \sigma^2 (1 - h_{ii})) \]
From (H2), the errors are iid: \[ \epsilon_i \sim \mathcal{N}(0, \sigma^2 ) \] But the variance of the residuals depends on \(i\): \[ \hat{\epsilon}_i \sim \mathcal{N}(0, \sigma^2 (1 - h_{ii})). \]
“Studentization”: try to “uniformize” the residuals so that they are iid.
Normalized residuals: \[ \frac{\hat{\epsilon}_i}{\sqrt{\sigma^2 (1 - h_{ii})}} \sim \mathcal{N}(0, 1). \]
Normalized residuals are iid.
Problem: \(\sigma^2\) is unknown \(\to\) replace it by \(\hat{\sigma}^2\).
What is the distribution of \[ t_i = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2 (1 - h_{ii})}} \sim ~? \]
If \(\hat{\sigma}^2 = \frac{1}{n - p} \| \hat{\epsilon} \|^2\) is the standard unbiased variance estimate, the \(t_i\) do not follow a Student distribution (!)
That is because \(\hat{\sigma}^2\) is not independent of \(\hat{\epsilon}_i\) in general.
Solution: replace \(\hat{\sigma}^2\) with \(\hat{\sigma}^2_{(-i)}\), an estimate of \(\sigma^2\) that is independent of \(\hat{\epsilon}_i\).
Leave-one-out cross validation: for all \(i\):
Remove observation \(i\) from the dataset: \[ \mathbf{y}_{(-i)} = \begin{pmatrix} y_1\\ \vdots\\ y_{i-1}\\ y_{i+1}\\ \vdots\\ y_n \end{pmatrix} \qquad\qquad \mathbf{X}_{(-i)} = \begin{pmatrix} \mathbf{x}^1\\ \vdots\\ \mathbf{x}^{i-1}\\ \mathbf{x}^{i+1}\\ \vdots\\ \mathbf{x}^{n} \end{pmatrix} \]
Fit the model on the dataset without observation \(i\), yielding \(\hat{\boldsymbol{\beta}}_{(-i)}\) and \(\hat{\sigma}^2_{(-i)}\).
Predict the left-out point \(i\) from this fit: \(\hat{y}_{i}^P = \mathbf{x}^i\hat{\boldsymbol{\beta}}_{(-i)}\).
Then, from CM4, we get: \[ \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}} \sim \mathcal{T}_{n - p -1} \]
The prediction error from the leave-one-out cross validation follows a Student distribution with \(n-p-1\) degrees of freedom.
If \(\mathbf{X}\) and all the \(\mathbf{X}_{(-i)}\) have rank \(p\), then:
\[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} = \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}} \]
and: \[ t_i^* \sim \mathcal{T}_{n - p -1}. \]
The studentized residuals are equal to the normalized prediction errors from the leave-one-out cross validation.
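As a numerical sanity check (a minimal sketch on arbitrary simulated data, not part of the course material), we can verify that rstudent() matches the explicit leave-one-out computation:

## Check: studentized residuals = normalized leave-one-out prediction errors
set.seed(42)
n <- 50
x_1 <- runif(n); x_2 <- runif(n)
y <- 1 + 2 * x_1 - x_2 + rnorm(n)
fit <- lm(y ~ x_1 + x_2)
X <- cbind(1, x_1, x_2)
t_star_loo <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y[-i] ~ x_1[-i] + x_2[-i])          # fit without observation i
  y_pred_i <- sum(coef(fit_i) * X[i, ])           # leave-one-out prediction of y_i
  s2_i <- summary(fit_i)$sigma^2                  # sigma^2_(-i)
  v_i <- 1 + t(X[i, ]) %*% solve(crossprod(X[-i, ])) %*% X[i, ]
  (y[i] - y_pred_i) / sqrt(s2_i * drop(v_i))
})
max(abs(t_star_loo - rstudent(fit)))              # should be numerically ~ 0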
Sherman–Morrison formula (a rank-one special case of the Woodbury identity):
For any invertible \(q \times q\) matrix \(\mathbf{A}\) and any vectors \(\mathbf{u}\) and \(\mathbf{v}\) of \(\mathbb{R}^q\), the matrix \(\mathbf{A} + \mathbf{u}\mathbf{v}^T\) is invertible iff \(1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u} \neq 0\), and then: \[ (\mathbf{A} + \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1} \mathbf{u}\mathbf{v}^T \mathbf{A}^{-1}}{1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u}}. \]
Proof: exercise (Wikipedia is a great resource).
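The identity is also easy to check numerically (a minimal sketch with an arbitrary random matrix and vectors, chosen so that \(\mathbf{A}\) is invertible):

## Numerical check of the Sherman-Morrison formula
set.seed(1)
q <- 4
A <- crossprod(matrix(rnorm(q^2), q, q)) + diag(q)   # invertible (positive definite)
u <- rnorm(q); v <- rnorm(q)
A_inv <- solve(A)
lhs <- solve(A + u %*% t(v))                          # direct inverse
rhs <- A_inv - (A_inv %*% u %*% t(v) %*% A_inv) / drop(1 + t(v) %*% A_inv %*% u)
max(abs(lhs - rhs))                                   # should be numerically ~ 0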
\[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} = \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}} \]
Hints:
\[ (\mathbf{A} + \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{ \mathbf{A}^{-1} \mathbf{u}\mathbf{v}^T \mathbf{A}^{-1} }{ 1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u} }. \]
\[ \mathbf{X}^T\mathbf{X} = \mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)} + (\mathbf{x}^{i})^T\mathbf{x}^i \]
\[ \mathbf{X}^T\mathbf{y} = \mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} + (\mathbf{x}^{i})^Ty_i \]
\[ h_{ii} = [\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}]_{ii} = \mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \]
So: \[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X} - (\mathbf{x}^{i})^T\mathbf{x}^i)^{-1} = \cdots \]
\[ \hat{y}_i^P = \mathbf{x}^i\hat{\boldsymbol{\beta}}_{(-i)} = \mathbf{x}^i(\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}\mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} = \cdots \]
\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X} - (\mathbf{x}^{i})^T\mathbf{x}^i)^{-1} = \cdots \]
\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{ (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} }{ 1 - \mathbf{x}^{i}(\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{i})^T } \]
\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \]
Hence: \[ 1 + \mathbf{x}^{i}(\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T = 1 + h_{ii} +\frac{h_{ii}^2}{1 - h_{ii}} = \frac{1}{1 - h_{ii}} \]
\[ \hat{y}_i^P = \mathbf{x}^i\hat{\boldsymbol{\beta}}_{(-i)} = \mathbf{x}^i(\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}\mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} \]
But: \[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \] \[ \mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} = \mathbf{X}^T\mathbf{y} - (\mathbf{x}^{i})^Ty_i \]
Hence: \[ \begin{multline} \hat{y}_i^P = \mathbf{x}^i\left[ (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \right] \\ \times [\mathbf{X}^T\mathbf{y} - (\mathbf{x}^{i})^Ty_i] \end{multline} \]
\[ \begin{multline} \hat{y}_i^P = \mathbf{x}^i (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y} - \mathbf{x}^i (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^Ty_i \\ + \frac{\mathbf{x}^i(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T}{1 - h_{ii}}\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y} \\ - \frac{\mathbf{x}^i(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{i})^T}{1 - h_{ii}} y_i \end{multline} \]
\[ \hat{y}_i^P = \mathbf{x}^i \hat{\boldsymbol{\beta}} - h_{ii}y_i + \frac{h_{ii}}{1-h_{ii}}\mathbf{x}^i \hat{\boldsymbol{\beta}} - \frac{h_{ii}^2}{1-h_{ii}} y_i \]
\[ \hat{y}_i^P = \left(1 + \frac{h_{ii}}{1-h_{ii}}\right) \hat{y}_i - \left(h_{ii} + \frac{h_{ii}^2}{1-h_{ii}}\right) y_i \]
\[ \hat{y}_i^P = \frac{1}{1-h_{ii}}\hat{y}_i - \frac{h_{ii}}{1-h_{ii}} y_i \]
So: \[ y_i - \hat{y}_i^P = - \frac{1}{1-h_{ii}}\hat{y}_i + \left(1 + \frac{h_{ii}}{1-h_{ii}}\right) y_i \]
And: \[ y_i - \hat{y}_i^P = \frac{1}{1-h_{ii}}(y_i - \hat{y}_i) = \frac{1}{1-h_{ii}}\hat{\epsilon}_i \]
We have: \[ 1 + \mathbf{x}^{i}(\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T = \frac{1}{1 - h_{ii}} \] \[ y_i - \hat{y}_i^P = \frac{1}{1-h_{ii}}\hat{\epsilon}_i \]
Hence: \[ \begin{aligned} t_i^* &= \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}}\\ &= \frac{\hat{\epsilon}_i/(1-h_{ii})}{\sqrt{\hat{\sigma}^2_{(-i)} /(1-h_{ii})}} \\ \end{aligned} \]
Hence: \[ \begin{aligned} t_i^* &= \frac{y_{i} - \hat{y}_{i}^P} {\sqrt{\hat{\sigma}^2_{(-i)} \left(1 + \mathbf{x}^{i} (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}(\mathbf{x}^{i})^T \right)}}\\ &= \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} \\ &\sim \mathcal{T}_{n - p -1}. \end{aligned} \]
Studentized Residuals: Student iid \[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} \sim \mathcal{T}_{n - p -1}. \]
Also called “externally studentized residuals”, “deleted residuals”, “jackknifed residuals”, or “leave-one-out residuals”. R function: rstudent().
Standardized Residuals: not Student in general \[ t_i = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2 (1 - h_{ii})}} \]
Also called “internally studentized residuals” (although they are not Student). R function: rstandard().
It can be shown that: \[ t_i^* = t_i \sqrt{\frac{n-p-1}{n-p-t_i^2}} \]
So the studentized residuals are cheap to compute: there is no need to actually refit the \(n\) leave-one-out models.
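A quick numerical check of this identity (a minimal sketch on arbitrary simulated data; rstandard() gives the \(t_i\) and rstudent() the \(t_i^*\)):

## Check: t_i^* = t_i * sqrt((n - p - 1) / (n - p - t_i^2))
set.seed(7)
n <- 100; p <- 3
x_1 <- rnorm(n); x_2 <- rnorm(n)
y <- 1 + x_1 - 2 * x_2 + rnorm(n)
fit <- lm(y ~ x_1 + x_2)
t_std <- rstandard(fit)                               # standardized residuals t_i
t_star <- t_std * sqrt((n - p - 1) / (n - p - t_std^2))
max(abs(t_star - rstudent(fit)))                      # should be numerically ~ 0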
Studentized Residuals \[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} \sim \mathcal{T}_{n - p -1}. \]
For each \(i\), with probability \(1 - \alpha\): \[ |t_i^*| \leq t_{n-p-1}(1-\alpha/2) \]
Note: if \(n\) is large and \(\alpha=0.05\), \(t_{n-p-1}(1-\alpha/2) \approx 1.96\).
set.seed(1289)
## Predictors
n <- 500
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)
## Model
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1
## sim
eps <- rnorm(n, mean = 0, sd = 2)
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Fit
p <- 3
fit <- lm(y_sim ~ x_1 + x_2)
## 5% threshold
alp <- 0.05
t_student <- qt(1 - alp/2, n-p-1)
## Studentized residuals
res_student <- rstudent(fit)
## How many above the threshold?
mean(abs(res_student) >= t_student)
## [1] 0.046
## Plot
plot(res_student)
lines(rep(t_student, n), col = "firebrick", lwd = 2)
lines(rep(-t_student, n), col = "firebrick", lwd = 2)
An Outlier is a point such that \[ t_i^* \gg t_{n-p-1}(1-\alpha/2) \]
i.e. a point whose studentized residual is too large for us to believe it was drawn from the Student distribution above.
“Much greater” (\(\gg\)) does not have a formal definition…
## Generate an outlier
y_sim[100] <- 1.3 * max(y_sim)
## Fit
fit <- lm(y_sim ~ x_1 + x_2)
## Plot
plot(rstudent(fit))
lines(rep(t_student, n), col = "firebrick", lwd = 2)
lines(rep(-t_student, n), col = "firebrick", lwd = 2)
Studentized Residuals \[ t_i^* = \frac{\hat{\epsilon}_i}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}} \sim \mathcal{T}_{n - p -1}. \]
If \(n\) is large, then \(\mathcal{T}_{n - p -1} \approx \mathcal{N}(0,1)\).
\(\to\) QQ-plot on the Studentized residuals.
## Gaussian sim
eps <- rnorm(n, mean = 0, sd = 2)
## QQ norm
qqnorm(eps)
qqline(eps, col = "lightblue", lwd = 2)
## Sim
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Fit
fit <- lm(y_sim ~ x_1 + x_2)
## Studentized residuals
qqnorm(rstudent(fit)); qqline(rstudent(fit), col = "lightblue", lwd = 2)
## Student sim
eps <- rt(n, 3)
## QQ norm
qqnorm(eps)
qqline(eps, col = "lightblue", lwd = 2)
## Sim
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Fit
fit <- lm(y_sim ~ x_1 + x_2)
## Studentized residuals
qqnorm(rstudent(fit)); qqline(rstudent(fit), col = "lightblue", lwd = 2)
What about the distribution of \(\hat{\beta}_2\) ?
If \(\epsilon_i \sim \mathcal{N}(0, \sigma^2 )\) iid, then \[ \hat{\beta}_2 \sim \mathcal{N}(\beta_2, \sigma^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{22}) \]
If the \(\epsilon_i\) are not Gaussian, this is not true anymore.
Is it approximately true?
set.seed(1289)
Nsim <- 1000
beta_hat_2 <- vector("numeric", Nsim)
for (i in 1:Nsim) {
  ## error
  eps <- rnorm(n, 0, 1)
  ## Sim
  y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
  ## Fit
  fit <- lm(y_sim ~ x_1 + x_2)
  ## Get beta_2 estimate
  beta_hat_2[i] <- coef(fit)[3]
}
## Empirical distribution
qqnorm(beta_hat_2); qqline(beta_hat_2, col = "lightblue", lwd = 2)
set.seed(1289)
Nsim <- 1000
beta_hat_2 <- vector("numeric", Nsim)
for (i in 1:Nsim) {
  ## error
  eps <- rt(n, 3)
  ## Sim
  y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
  ## Fit
  fit <- lm(y_sim ~ x_1 + x_2)
  ## Get beta_2 estimate
  beta_hat_2[i] <- coef(fit)[3]
}
## Empirical distribution
qqnorm(beta_hat_2); qqline(beta_hat_2, col = "lightblue", lwd = 2)
set.seed(1289)
Nsim <- 1000
beta_hat_2 <- vector("numeric", Nsim)
for (i in 1:Nsim) {
  ## error
  eps <- rt(n, 1)
  ## Sim
  y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
  ## Fit
  fit <- lm(y_sim ~ x_1 + x_2)
  ## Get beta_2 estimate
  beta_hat_2[i] <- coef(fit)[3]
}
## Empirical distribution
qqnorm(beta_hat_2); qqline(beta_hat_2, col = "lightblue", lwd = 2)
No rigorous test
Useful to plot \(t^*_i\) against \(\hat{y}_i\)
Detect increasing variance, or changing expectation
set.seed(1289)
## Predictors
n <- 200
x_1 <- sort(runif(n, min = 0, max = 2))
x_2 <- sort(runif(n, min = 0, max = 2))
## Model
beta_0 <- 1; beta_1 <- 3; beta_2 <- 1
## sim
sd_het <- abs(rnorm(n, sqrt(1:n)/2, 0.1))
eps <- rnorm(n, mean = 0, sd = sd_het)
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Fit
fit <- lm(y_sim ~ x_1 + x_2)
## Studentized residuals vs pred
plot(predict(fit), rstudent(fit), xlab = expression(hat(y)[i]), ylab = expression(paste(t[i],"*")))
## Abs Studentized residuals vs pred
plot(predict(fit), abs(rstudent(fit)), xlab = expression(hat(y)[i]), ylab = expression(abs(paste(t[i],"*"))))
lines(lowess(predict(fit), abs(rstudent(fit))), col = "firebrick")
Weight observations:
## Weighting
y_weight <- y_sim / sqrt(1:n)
## Fit
fit <- lm(y_weight ~ x_1 + x_2)
## Studentized residuals vs pred
plot(predict(fit), rstudent(fit), xlab = expression(hat(y)[i]), ylab = expression(t[i]^"*"))
## Abs Studentized residuals vs pred
plot(predict(fit), abs(rstudent(fit)), xlab = expression(hat(y)[i]), ylab = expression(abs(paste(t[i],"*"))))
lines(lowess(predict(fit), abs(rstudent(fit))), col = "firebrick")
Spatial or temporal structure
“Forgotten” predictor
No rigorous test or method
set.seed(1289)
## Predictors
n <- 200
x_1 <- sort(runif(n, min = 0, max = 2))
x_2 <- sin(sort(3 * runif(n, min = 0, max = 2)))
## sim
eps <- rnorm(n, mean = 0, sd = 0.2)
y_sim <- 1 + 2 * x_1 + 0.5 * x_2 + eps
## Fit
fit <- lm(y_sim ~ x_1)
## Studentized residuals vs pred
plot(predict(fit), rstudent(fit), xlab = expression(hat(y)[i]), ylab = expression(paste(t[i],"*")))
lines(lowess(predict(fit), rstudent(fit)), col = "firebrick")
## Fit
fit <- lm(y_sim ~ x_1 + x_2)
## Studentized residuals vs pred
plot(predict(fit), rstudent(fit), xlab = expression(hat(y)[i]), ylab = expression(paste(t[i],"*")))
lines(lowess(predict(fit), rstudent(fit)), col = "firebrick")
## Advertising data
library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")
## linear model
fit <- lm(sales ~ TV + radio, data = ad)
What if \(\texttt{sales}\) depends on \(f(\texttt{TV})\) instead of just \(\texttt{TV}\), with \(f\) known?
I.e. take non-linear effects into account.
E.g. \(f(\cdot)\) is \(\sqrt{\cdot}\), \(\log(\cdot)\), …
No rigorous way to find \(f\) (in this class).
Look at the data.
attach(ad)
par(mfrow = c(1, 2))
plot(TV, sales); plot(sqrt(TV), sales)
## linear model
fit <- lm(sales ~ sqrt(TV) + radio, data = ad)
What if \(\texttt{TV}\) and \(\texttt{radio}\) are not independent?
I.e. are there synergy effects?
If I spend some money on \(\texttt{radio}\), does it increase the effect of \(\texttt{TV}\)?
If I have \(\$100\), is it better to spend it all on either \(\texttt{TV}\) or \(\texttt{radio}\), or \(\$50\) on each?
We can fit the model: \[ \texttt{sales} = \beta_0 + \beta_1 \texttt{TV} + \beta_2 \texttt{radio} + \beta_3 \texttt{TV} \times \texttt{radio} + \epsilon \]
i.e. \[ \texttt{sales} = \beta_0 + (\beta_1 + \beta_3 \texttt{radio}) \texttt{TV} + \beta_2 \texttt{radio} + \epsilon \]
\(\beta_3\): increase in the effectiveness of \(\texttt{TV}\) advertising for a one-unit increase in \(\texttt{radio}\) advertising (and vice versa).
## linear model
fit <- lm(sales ~ sqrt(TV) + radio + radio * sqrt(TV), data = ad)
summary(fit)
## 
## Call:
## lm(formula = sales ~ sqrt(TV) + radio + radio * sqrt(TV), data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0562 -0.2757 -0.0121  0.2758  1.2421 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.4444112  0.1793714  24.778  < 2e-16 ***
## sqrt(TV)       0.4383960  0.0150223  29.183  < 2e-16 ***
## radio         -0.0500957  0.0062645  -7.997 1.09e-13 ***
## sqrt(TV):radio 0.0215106  0.0005179  41.538  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4476 on 196 degrees of freedom
## Multiple R-squared:  0.9928, Adjusted R-squared:  0.9926 
## F-statistic:  8949 on 3 and 196 DF,  p-value: < 2.2e-16
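Side note (not from the original slides): in R's formula syntax, radio * sqrt(TV) already expands to the two main effects plus their interaction, so the same model can be written more compactly. A quick check that the two formulations coincide:

## Equivalent, more compact formula: `*` = main effects + interaction
fit2 <- lm(sales ~ sqrt(TV) * radio, data = ad)
all.equal(coef(fit), coef(fit2))      # should be TRUE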
Question: Are my estimates robust?
I.e. if I remove one observation, do they change a lot?
E.g. outliers can have strong leverage.
set.seed(1289)
## sim
n <- 20
x_1 <- runif(n-1, min = -2, max = 2)
eps <- rnorm(n-1, mean = 0, sd = 1)
y_sim <- 1 - 2 * x_1 + eps
## Outlier
x_1[n] <- 4
y_sim[n] <- 10 * max(y_sim)
## Plot
plot(x_1, y_sim)
abline(a = 1, b = -2, col = "firebrick", lwd = 2)
## Fit
fit <- lm(y_sim ~ x_1)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ x_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.155  -4.801  -1.460   3.757  27.280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.431      2.163   1.586  0.13017   
## x_1            5.480      1.752   3.128  0.00581 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.639 on 18 degrees of freedom
## Multiple R-squared:  0.3522, Adjusted R-squared:  0.3162 
## F-statistic: 9.787 on 1 and 18 DF,  p-value: 0.005806
## Plot
plot(x_1, y_sim)
abline(a = 1, b = -2, col = "firebrick", lwd = 2) ## True
abline(reg = fit, col = "lightblue", lwd = 2) ## Fitted
## Fit without outlier
fit_no <- lm(y_sim[-n] ~ x_1[-n])
summary(fit_no)
## 
## Call:
## lm(formula = y_sim[-n] ~ x_1[-n])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5245 -0.7559  0.1282  0.4912  1.3617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.2107     0.2133   5.677 2.73e-05 ***
## x_1[-n]      -2.3386     0.2442  -9.578 2.90e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9235 on 17 degrees of freedom
## Multiple R-squared:  0.8437, Adjusted R-squared:  0.8345 
## F-statistic: 91.73 on 1 and 17 DF,  p-value: 2.903e-08
## Plot
plot(x_1, y_sim)
abline(a = 1, b = -2, col = "firebrick", lwd = 2) ## True
abline(reg = fit, col = "lightblue", lwd = 2) ## Fitted
abline(reg = fit_no, col = "darkgreen", lwd = 2) ## Fitted - no outlier
Removing one point has a dramatic effect.
Can we compute a weight for the importance of one observation?
Look at the projection matrix \(\to\) leverage.
The prediction is computed from: \[ \hat{\mathbf{y}} = \mathbf{H} \mathbf{y} \] with \[ \mathbf{H} = \mathbf{P}^{\mathbf{X}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \]
i.e. \[ \hat{y}_i = \sum_{j = 1}^n h_{ij} y_j = h_{ii}y_i + \sum_{j\neq i} h_{ij} y_j \]
\(h_{ii}\) gives the “weight” of observation \(i\) on its own prediction.
Properties of \(\mathbf{H}\)?
\[ \mathbf{H} = \mathbf{P}^{\mathbf{X}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \]
We have:
Reminder: \[ \mathbf{H} = \mathbf{H}^2 \quad \text{and} \quad \mathbf{H}^T = \mathbf{H} \]
1- \[ \text{tr}(\mathbf{H}) = \operatorname{rank}(\mathbf{X}) = p \]
2- \[ p = \text{tr}(\mathbf{H}) = \text{tr}(\mathbf{H}^2) = \text{tr}(\mathbf{H}^T\mathbf{H}) = \sum_{i = 1}^n\sum_{j = 1}^n h_{ij}^2 \]
3- \[ \mathbf{H}_{ii} = (\mathbf{H}^2)_{ii} \]
hence: \[ h_{ii} = \sum_{j = 1}^n h_{ij}h_{ji} = \sum_{j = 1}^n h_{ij}^2 = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2 \]
\[ h_{ii}(1 - h_{ii}) = \sum_{j \neq i} h_{ij}^2 \geq 0 \]
and \[ 0 \leq h_{ii} \leq 1 \]
4- \[ h_{ii}(1 - h_{ii}) = \sum_{j \neq i} h_{ij}^2 \]
If \(h_{ii} = 0\) or \(h_{ii} = 1\), then \(h_{ij} = 0\) for all \(j \neq i\).
5- \[ h_{ii}(1 - h_{ii}) = h_{ij}^2 + \sum_{k \neq i,j} h_{ik}^2 \]
\(h_{ii}(1 - h_{ii})\) is maximal at \(h_{ii} = 0.5\), so: \[ 0 \leq h_{ij}^2 + \sum_{k \neq i,j} h_{ik}^2 \leq 0.25 \] \[ 0 \leq h_{ij}^2 \leq 0.25 \] and, for \(j \neq i\): \[ -0.5 \leq h_{ij} \leq 0.5 \]
\[ \hat{y}_i = \sum_{j = 1}^n h_{ij} y_j = h_{ii}y_i + \sum_{j\neq i} h_{ij} y_j \]
As \(\text{tr}(\mathbf{H}) = \operatorname{rank}(\mathbf{X}) = p\), if the \(h_{ii}\) are perfectly balanced (all equal to \(h\)), then \(h = p/n\).
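These properties are easy to check numerically (a minimal sketch on an arbitrary simulated design; hatvalues() returns the diagonal of \(\mathbf{H}\)):

## Check the properties of the diagonal of H
set.seed(3)
n <- 50
x_1 <- rnorm(n); x_2 <- rnorm(n)
fit <- lm(rnorm(n) ~ x_1 + x_2)            # p = 3 (intercept + 2 predictors)
h <- hatvalues(fit)
c(sum_h = sum(h),                          # trace(H) = p = 3
  mean_h = mean(h),                        # = p / n
  min_h = min(h), max_h = max(h))          # all h_ii within [0, 1]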
Leverage point: observation \(i\) is a leverage point if \(h_{ii}\) is large compared to the average leverage \(p/n\) (common rules of thumb: \(h_{ii} > 2p/n\) or \(h_{ii} > 3p/n\)).
## Plot
plot(x_1, y_sim)
abline(a = 1, b = -2, col = "firebrick", lwd = 2) ## True
abline(reg = fit, col = "lightblue", lwd = 2) ## Fitted
abline(reg = fit_no, col = "darkgreen", lwd = 2) ## Fitted - no outlier
## Leverage
h_val <- hatvalues(fit)
## Plot
barplot(h_val)
abline(h = 2 * 2 / n, lty = 2)
abline(h = 3 * 2 / n, lty = 3)
set.seed(1289)
## sim
n <- 20
x_1 <- runif(n-1, min = -2, max = 2)
eps <- rnorm(n-1, mean = 0, sd = 1)
y_sim <- 1 - 2 * x_1 + eps
## (Not an) Outlier
x_1[n] <- 4
y_sim[n] <- 1 - 2 * x_1[n]
Last point is consistent with the other observations.
But its \(x\) value is high \(\to\) high leverage point.
Recall: \[ \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{F}^p_{n-p} \]
Confidence Region: with a probability of \(1-\alpha\): \[ \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \leq f^p_{n-p}(1 - \alpha) \]
Cook Distance: \[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) \]
Leave-one-out cross validation: for all \(i\)
Cook Distance: \[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) \]
We have: \[ C_i = \frac{h_{ii}(y_{i} - \hat{y}_{i}^P)^2}{p\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})^2}\frac{\hat{\epsilon}_i^2}{\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})}t_i^2 \] where \(t_i\) is the standardized residual defined above.
Note:
Cook “distance” is actually the square of a distance.
\[ C_i = \frac{h_{ii}(y_{i} - \hat{y}_{i}^P)^2}{p\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})^2}\frac{\hat{\epsilon}_i^2}{\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})}t_i^2 \]
Hints:
\[ (\mathbf{A} + \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{ \mathbf{A}^{-1} \mathbf{u}\mathbf{v}^T \mathbf{A}^{-1} }{ 1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u} }. \]
\[ \mathbf{X}^T\mathbf{X} = \mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)} + (\mathbf{x}^{i})^T\mathbf{x}^i \]
\[ \mathbf{X}^T\mathbf{y} = \mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} + (\mathbf{x}^{i})^Ty_i \]
\[ h_{ii} = [\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}]_{ii} = \mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \]
So: \[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X} - (\mathbf{x}^{i})^T\mathbf{x}^i)^{-1} = \cdots \]
\[ \hat{\boldsymbol{\beta}}_{(-i)} = (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}\mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} = \cdots \]
\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X} - (\mathbf{x}^{i})^T\mathbf{x}^i)^{-1} = \cdots \]
\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{ (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} }{ 1 - \mathbf{x}^{i}(\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{i})^T } \]
\[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \]
\[ \hat{\boldsymbol{\beta}}_{(-i)} = (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1}\mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} \]
But: \[ (\mathbf{X}_{(-i)}^T\mathbf{X}_{(-i)})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \] \[ \mathbf{X}_{(-i)}^T\mathbf{y}_{(-i)} = \mathbf{X}^T\mathbf{y} - (\mathbf{x}^{i})^Ty_i \]
Hence: \[ \begin{multline} \hat{\boldsymbol{\beta}}_{(-i)} = \left[ (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_{ii}} \right] \\ \times [\mathbf{X}^T\mathbf{y} - (\mathbf{x}^{i})^Ty_i] \end{multline} \]
\[ \begin{multline} \hat{\boldsymbol{\beta}}_{(-i)} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y} - (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^Ty_i \\ + \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T}{1 - h_{ii}}\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y} \\ - \frac{(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T\mathbf{x}^{i} (\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{i})^T}{1 - h_{ii}} y_i \end{multline} \]
\[ \begin{multline} \hat{\boldsymbol{\beta}}_{(-i)} = \hat{\boldsymbol{\beta}} - (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^Ty_i + \frac{1}{1-h_{ii}}(\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \mathbf{x}^i \hat{\boldsymbol{\beta}} \\ - \frac{h_{ii}}{1-h_{ii}} (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T y_i \end{multline} \]
\[ \hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \left[\frac{1}{1-h_{ii}} \hat{y}_i - \left(1 + \frac{h_{ii}}{1-h_{ii}}\right) y_i \right] \]
\[ \begin{aligned} \hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}} &= (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \frac{1}{1-h_{ii}} (\hat{y}_i - y_i)\\ &= - (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \frac{1}{1-h_{ii}} \hat{\epsilon}_i \end{aligned} \]
But remember: \[ y_i - \hat{y}_i^P = \frac{1}{1-h_{ii}}\hat{\epsilon}_i \] hence: \[ \hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}} = - (y_i - \hat{y}_i^P) (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \]
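This leave-one-out update formula can also be verified numerically (a minimal sketch on a small, arbitrary simulated example, not part of the proof):

## Check: beta_hat_(-i) - beta_hat = -(y_i - y_i^P) (X^T X)^{-1} (x^i)^T
set.seed(5)
n <- 40
x_1 <- rnorm(n); x_2 <- rnorm(n)
y <- 2 + x_1 - x_2 + rnorm(n)
X <- cbind(1, x_1, x_2)
fit <- lm(y ~ x_1 + x_2)
i <- 7                                          # any observation
fit_i <- lm(y[-i] ~ x_1[-i] + x_2[-i])          # fit without observation i
y_pred_i <- sum(coef(fit_i) * X[i, ])           # leave-one-out prediction of y_i
delta <- -(y[i] - y_pred_i) * solve(crossprod(X)) %*% X[i, ]
max(abs((coef(fit_i) - coef(fit)) - delta))     # should be numerically ~ 0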
We have: \[ \hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}} = - (y_i - \hat{y}_i^P) (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T \] \[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) \]
Hence: \[ \begin{aligned} C_i &= \frac{1}{p\hat{\sigma}^2} (\mathbf{x}^{i}) (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{X}^T\mathbf{X}) (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{i})^T (y_i - \hat{y}_i^P)^2\\ &= \frac{1}{p\hat{\sigma}^2} h_{ii} (y_i - \hat{y}_i^P)^2 \end{aligned} \]
\[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) = \frac{1}{p\hat{\sigma}^2} h_{ii} (y_i - \hat{y}_i^P)^2 \]
and: \[ y_i - \hat{y}_i^P = \frac{1}{1-h_{ii}}\hat{\epsilon}_i \]
hence: \[ C_i = \frac{h_{ii}}{p(1-h_{ii})^2}\frac{\hat{\epsilon}_i^2}{\hat{\sigma}^2} = \frac{h_{ii}}{p(1-h_{ii})}t_i^2 \]
\[ C_i = \frac{1}{p\hat{\sigma}^2} \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right)^T (\mathbf{X}^T\mathbf{X}) \left(\hat{\boldsymbol{\beta}}_{(-i)} - \hat{\boldsymbol{\beta}}\right) \]
If \(C_i\) is “high”, then the point has a strong influence on the estimation.
According to Cook (1977), \(C_i\) can be compared with the quantiles of the \(\mathcal{F}^p_{n-p}\) distribution that defines the confidence region above: a value around or above the median \(f^p_{n-p}(0.5)\) flags an influential point.
\[ C_i = \frac{1}{p}\frac{h_{ii}}{(1-h_{ii})}t_i^2 \]
If \(C_i\) is “high”, then the point has a strong influence on the estimation.
\(C_i\) is the product of two contributions: a leverage term \(h_{ii}/(1-h_{ii})\) and the squared standardized residual \(t_i^2\) (up to the factor \(1/p\)).
## linear model
fit <- lm(sales ~ sqrt(TV) + radio + radio * sqrt(TV), data = ad)
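As a follow-up (a sketch, assuming the fit object from the chunk above), R's cooks.distance() can be compared with the closed-form expression and plotted to spot influential observations:

## Cook's distances for the fitted model
C <- cooks.distance(fit)
## Check against the closed-form expression h_ii / (p (1 - h_ii)) * t_i^2
h <- hatvalues(fit)
p <- length(coef(fit))
C_formula <- h / (p * (1 - h)) * rstandard(fit)^2
max(abs(C - C_formula))                 # should be numerically ~ 0
## Plot the distances to spot influential observations
plot(C, type = "h", ylab = "Cook's distance")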