20/03/2024

What we know

Gaussian Model

Model:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

  • \(\mathbf{y}\) random vector of \(n\) responses
  • \(\mathbf{X}\) non-random \(n\times p\) matrix of predictors
  • \(\boldsymbol{\epsilon}\) random vector of \(n\) errors
  • \(\boldsymbol{\beta}\) non-random, unknown vector of \(p\) coefficients

Assumptions:

  • (H1) \(rg(\mathbf{X}) = p\)
  • (H2) \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)\)

Distribution of the estimators

Distribution of \(\hat{\boldsymbol{\beta}}\): \[ \hat{\boldsymbol{\beta}} \sim \mathcal{N}\left( \boldsymbol{\beta}; \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \right) \quad \text{and} \quad \frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}} \sim \mathcal{T}_{n-p}. \]

Distribution of \(\hat{\sigma}^2\): \[ \frac{(n-p) \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}. \]

Student t tests

\[ \text{Hypothesis:} \quad \mathcal{H}_0: \beta_k = 0 \quad \text{vs} \quad \mathcal{H}_1: \beta_k \neq 0 \]

\[ \text{Test Statistic:} \quad T_k = \frac{\hat{\beta}_k}{\hat{\sigma}_k} \underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n-p}, \quad \text{with} \quad \hat{\sigma}^2_k = \hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk} \]

\[ \text{Reject Region:} \quad \mathbb{P}[\text{Reject} ~|~ \mathcal{H}_0 \text{ true}] \leq \alpha \iff \mathcal{R}_\alpha = \left\{ t \in \mathbb{R} ~|~ |t| \geq t_{n-p}(1 - \alpha/2) \right\} \]

\[ \text{p-value:} \quad p = \mathbb{P}_{\mathcal{H}_0}\left[|T_k| > |T_k^{obs}|\right] \]

Problems with the \(t\)-test

Simulated Dataset

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

set.seed(1289)

## Predictors
n <- 500
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)

## Noise
eps <- rnorm(n, mean = 0, sd = 2)

## Model sim
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

## Useless predictor
x_junk_1 <- runif(n, min = -2, max = 2)

Simulated Dataset - Fit

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

fit <- lm(y_sim ~ x_1 + x_2 + x_junk_1)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ x_1 + x_2 + x_junk_1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.275 -1.453 -0.028  1.371  4.893 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.82789    0.08814  -9.393   <2e-16 ***
## x_1          3.01467    0.07546  39.952   <2e-16 ***
## x_2         -0.95917    0.07469 -12.842   <2e-16 ***
## x_junk_1     0.01534    0.07791   0.197    0.844    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.957 on 496 degrees of freedom
## Multiple R-squared:  0.7905, Adjusted R-squared:  0.7893 
## F-statistic:   624 on 3 and 496 DF,  p-value: < 2.2e-16
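
As a sanity check, the t values and p-values of this table can be recomputed by hand from the formulas above (a minimal sketch using standard accessors on the fitted object):

## t statistics and p-values by hand
beta_hat <- coef(fit)                ## estimated coefficients
se_hat   <- sqrt(diag(vcov(fit)))    ## sqrt of sigma2_hat * [(X'X)^-1]_kk
t_obs    <- beta_hat / se_hat        ## the "t value" column
2 * pt(abs(t_obs), df = df.residual(fit), lower.tail = FALSE)  ## "Pr(>|t|)"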

Simulated Dataset

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

## Other Useless Predictors
p_junk <- 100
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2),
                 ncol = p_junk)

## Data frame
data_junk <- data.frame(y_sim = y_sim,
                        x_junk = x_junk)

Simulated Dataset - Fit - Junk

fit <- lm(y_sim ~ -1 + ., data = data_junk)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ -1 + ., data = data_junk)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8683  -3.0605  -0.2674   2.3554  11.0821 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## x_junk.1    0.2129516  0.1828010   1.165   0.2447  
## x_junk.2   -0.1075770  0.1919617  -0.560   0.5755  
## x_junk.3    0.4322452  0.1945466   2.222   0.0269 *
## x_junk.4    0.1471909  0.1886285   0.780   0.4357  
## x_junk.5   -0.1110445  0.1848774  -0.601   0.5484  
## x_junk.6    0.1276583  0.1829333   0.698   0.4857  
## x_junk.7    0.0079412  0.1944593   0.041   0.9674  
## x_junk.8   -0.1126224  0.1854571  -0.607   0.5440  
## x_junk.9   -0.0560611  0.1918247  -0.292   0.7702  
## x_junk.10  -0.3215171  0.1883253  -1.707   0.0886 .
## x_junk.11  -0.0875118  0.1891359  -0.463   0.6438  
## x_junk.12  -0.0822697  0.1893373  -0.435   0.6641  
## x_junk.13   0.0862599  0.1874289   0.460   0.6456  
## x_junk.14  -0.1322800  0.1895130  -0.698   0.4856  
## x_junk.15   0.1748496  0.1751535   0.998   0.3188  
## x_junk.16  -0.2152065  0.1795766  -1.198   0.2315  
## x_junk.17   0.0987989  0.1868782   0.529   0.5973  
## x_junk.18  -0.0589122  0.1964012  -0.300   0.7644  
## x_junk.19   0.0407544  0.1896260   0.215   0.8299  
## x_junk.20   0.2788872  0.1817698   1.534   0.1257  
## x_junk.21   0.0780682  0.1901043   0.411   0.6815  
## x_junk.22   0.0001428  0.1936429   0.001   0.9994  
## x_junk.23  -0.2146891  0.1860330  -1.154   0.2492  
## x_junk.24   0.1135748  0.1794805   0.633   0.5272  
## x_junk.25   0.1797087  0.1849661   0.972   0.3318  
## x_junk.26  -0.0095168  0.1885643  -0.050   0.9598  
## x_junk.27  -0.0697160  0.1782409  -0.391   0.6959  
## x_junk.28  -0.1294379  0.1832707  -0.706   0.4804  
## x_junk.29  -0.0138295  0.1830424  -0.076   0.9398  
## x_junk.30   0.2922539  0.1918362   1.523   0.1284  
## x_junk.31   0.1576348  0.1867874   0.844   0.3992  
## x_junk.32   0.0598461  0.1892184   0.316   0.7520  
## x_junk.33   0.2018882  0.1812726   1.114   0.2661  
## x_junk.34  -0.0116431  0.1840672  -0.063   0.9496  
## x_junk.35   0.0713938  0.1833419   0.389   0.6972  
## x_junk.36   0.0190741  0.1798446   0.106   0.9156  
## x_junk.37  -0.0238114  0.1885668  -0.126   0.8996  
## x_junk.38  -0.1867089  0.1838147  -1.016   0.3104  
## x_junk.39  -0.0777722  0.1862288  -0.418   0.6765  
## x_junk.40  -0.0434927  0.1859361  -0.234   0.8152  
## x_junk.41   0.0814161  0.1904542   0.427   0.6693  
## x_junk.42   0.1292109  0.1870557   0.691   0.4901  
## x_junk.43  -0.3022911  0.1835507  -1.647   0.1004  
## x_junk.44   0.0831695  0.1823006   0.456   0.6485  
## x_junk.45   0.1332532  0.1869381   0.713   0.4764  
## x_junk.46  -0.1348223  0.1838432  -0.733   0.4638  
## x_junk.47  -0.0218524  0.1972836  -0.111   0.9119  
## x_junk.48   0.2795708  0.1854204   1.508   0.1324  
## x_junk.49  -0.1531037  0.1798952  -0.851   0.3952  
## x_junk.50   0.0520440  0.1852843   0.281   0.7789  
## x_junk.51   0.1682396  0.1835292   0.917   0.3599  
## x_junk.52  -0.1190719  0.1853241  -0.643   0.5209  
## x_junk.53  -0.0989078  0.1857934  -0.532   0.5948  
## x_junk.54  -0.1325051  0.1840157  -0.720   0.4719  
## x_junk.55   0.1006848  0.1940173   0.519   0.6041  
## x_junk.56   0.2034148  0.1764952   1.153   0.2498  
## x_junk.57   0.1153209  0.1766600   0.653   0.5143  
## x_junk.58  -0.0463597  0.1772648  -0.262   0.7938  
## x_junk.59   0.0240049  0.1889276   0.127   0.8990  
## x_junk.60   0.3726537  0.1909670   1.951   0.0517 .
## x_junk.61   0.0250128  0.1864171   0.134   0.8933  
## x_junk.62  -0.2615601  0.1901999  -1.375   0.1698  
## x_junk.63   0.0774898  0.1871022   0.414   0.6790  
## x_junk.64   0.0545841  0.1966734   0.278   0.7815  
## x_junk.65   0.0725361  0.1882312   0.385   0.7002  
## x_junk.66  -0.1206593  0.1897174  -0.636   0.5251  
## x_junk.67  -0.3336932  0.1818668  -1.835   0.0673 .
## x_junk.68   0.0981128  0.1912111   0.513   0.6082  
## x_junk.69  -0.0153031  0.1858223  -0.082   0.9344  
## x_junk.70   0.1886025  0.1830855   1.030   0.3036  
## x_junk.71  -0.0695791  0.1906992  -0.365   0.7154  
## x_junk.72  -0.1030034  0.1840569  -0.560   0.5760  
## x_junk.73   0.0084075  0.1823678   0.046   0.9633  
## x_junk.74   0.2999893  0.1783620   1.682   0.0934 .
## x_junk.75   0.3355257  0.1843357   1.820   0.0695 .
## x_junk.76  -0.1894886  0.1956830  -0.968   0.3335  
## x_junk.77  -0.1183169  0.1843228  -0.642   0.5213  
## x_junk.78  -0.1651666  0.1832443  -0.901   0.3679  
## x_junk.79   0.0947620  0.1867396   0.507   0.6121  
## x_junk.80   0.2538003  0.1910030   1.329   0.1847  
## x_junk.81   0.0711922  0.1917948   0.371   0.7107  
## x_junk.82  -0.0098761  0.2003307  -0.049   0.9607  
## x_junk.83   0.0102922  0.1798480   0.057   0.9544  
## x_junk.84   0.2163918  0.1807194   1.197   0.2319  
## x_junk.85  -0.2514098  0.1881750  -1.336   0.1823  
## x_junk.86   0.0181126  0.1832811   0.099   0.9213  
## x_junk.87  -0.0907617  0.1809516  -0.502   0.6162  
## x_junk.88   0.0354845  0.1869592   0.190   0.8496  
## x_junk.89  -0.3283331  0.1867203  -1.758   0.0794 .
## x_junk.90  -0.2284999  0.1895045  -1.206   0.2286  
## x_junk.91  -0.0575054  0.1800322  -0.319   0.7496  
## x_junk.92   0.0612578  0.1799226   0.340   0.7337  
## x_junk.93  -0.0192369  0.1866956  -0.103   0.9180  
## x_junk.94  -0.0615478  0.1828148  -0.337   0.7365  
## x_junk.95   0.3675294  0.1778881   2.066   0.0395 *
## x_junk.96  -0.0015247  0.1899413  -0.008   0.9936  
## x_junk.97  -0.1364003  0.1779889  -0.766   0.4439  
## x_junk.98   0.4422968  0.1888431   2.342   0.0197 *
## x_junk.99   0.2366755  0.1815971   1.303   0.1932  
## x_junk.100  0.0010360  0.1905522   0.005   0.9957  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.311 on 400 degrees of freedom
## Multiple R-squared:  0.1889, Adjusted R-squared:  -0.01383 
## F-statistic: 0.9318 on 100 and 400 DF,  p-value: 0.6596

Simulated Dataset - Fit - Junk

p_values <- summary(fit)$coefficients[, 4]
hist(p_values, col = "lightblue", breaks = 20)

Simulated Dataset - Fit - Junk

sum(p_values <= 0.05)
## [1] 3
summary(fit)$coefficients[p_values <= 0.05, ]
##            Estimate Std. Error  t value   Pr(>|t|)
## x_junk.3  0.4322452  0.1945466 2.221807 0.02685525
## x_junk.95 0.3675294  0.1778881 2.066071 0.03946512
## x_junk.98 0.4422968  0.1888431 2.342138 0.01966334

\(~\)

Reject Region: \(\mathbb{P}[\text{Reject} | \mathcal{H}_0 \text{ true}] \leq \alpha\)

Just by chance, we expect about 5% of false discoveries (here, 3 out of 100 junk predictors).

Joint Distribution of the Coefficients - \(\sigma^2\) unknown

Joint distribution of \(\hat{\boldsymbol{\beta}}\)

When \(\sigma^2\) is known:

\[ \hat{\boldsymbol{\beta}} \sim \mathcal{N}\left( \boldsymbol{\beta}; \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \right). \]

When \(\sigma^2\) is unknown: \[ \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{F}^p_{n-p} \]

with \(\mathcal{F}^p_{n-p}\) the Fisher distribution with \(p\) and \(n-p\) degrees of freedom.

Reminder: Fisher Distribution

Let \(U_1 \sim \chi^2_{p_1}\) and \(U_2 \sim \chi^2_{p_2}\), \(U_1\) and \(U_2\) independent. Then \[ F = \frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2} \]

Joint distribution of \(\hat{\boldsymbol{\beta}}\) - Proof - Hints

Let’s show: \[ \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{F}^p_{n-p} \]

Hints:
\[ \hat{\boldsymbol{\beta}} \sim \mathcal{N}\left( \boldsymbol{\beta}; \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \right) \quad \text{and} \quad \frac{(n-p) \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p} \]

\[ \text{If } \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{\Sigma}) \text{ in } \mathbb{R}^p \text{ with } \mathbf{\Sigma} \text{ invertible, then} \quad (\mathbf{X} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu}) \sim \chi^2_p \]

\[ \text{If } U_1 \sim \chi^2_{p_1} \text{ and } U_2 \sim \chi^2_{p_2} \text{ independent, then } \frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2} \]

Joint distribution of \(\hat{\boldsymbol{\beta}}\) - Proof

\[ \hat{\boldsymbol{\beta}} \sim \mathcal{N}\left( \boldsymbol{\beta}; \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \right) \implies \frac{1}{\sigma^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \chi^2_{p} \]

\[ \frac{(n-p) \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p} \quad \text{ independent from } \hat{\boldsymbol{\beta}} \]

Hence: \[ \frac{ \frac{1}{\sigma^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) / p }{ \frac{(n-p) \hat{\sigma}^2}{\sigma^2} / (n-p) } = \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{F}^p_{n-p}. \]
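
This distribution can be checked numerically (a minimal sketch with toy dimensions; all names are local to this check):

## Simulate the statistic many times and compare its quantiles
## to the Fisher distribution
set.seed(42)
n <- 100; p <- 3
X <- cbind(1, matrix(runif(n * (p - 1), min = -2, max = 2), ncol = p - 1))
beta <- c(-1, 3, -1)
stats <- replicate(1000, {
  y <- X %*% beta + rnorm(n, sd = 2)
  fit <- lm(y ~ X - 1)  ## X already contains the intercept column
  b <- coef(fit)
  sigma2_hat <- deviance(fit) / (n - p)
  as.numeric(t(b - beta) %*% crossprod(X) %*% (b - beta)) / (p * sigma2_hat)
})
quantile(stats, 0.95)           ## should be close to ...
qf(0.95, df1 = p, df2 = n - p)  ## ... the theoretical 95% quantile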

Global Fisher F-test

Global Fisher F-test

Hypothesis: \[ \mathcal{H}_0: \beta_1 = \cdots = \beta_p = 0 \quad \text{vs} \quad \mathcal{H}_1: \exists~k ~|~ \beta_k \neq 0 \]

Test Statistic: \[ F = \frac{1}{p\hat{\sigma}^2}\hat{\boldsymbol{\beta}}^T (\mathbf{X}^T\mathbf{X})\hat{\boldsymbol{\beta}} \underset{\mathcal{H}_0}{\sim} \mathcal{F}^p_{n-p} \]

Reject Region: \[ \mathbb{P}[\text{Reject} ~|~ \mathcal{H}_0 \text{ true}] \leq \alpha \iff \mathcal{R}_\alpha = \left\{ f \in \mathbb{R}_+ ~|~ f \geq f^p_{n-p}(1 - \alpha) \right\} \]

p-value: \( p = \mathbb{P}_{\mathcal{H}_0}[F > F^{obs}] \)

Simulated Dataset - Fit - Junk

fit <- lm(y_sim ~ -1 + ., data = data_junk)
f_obs <- summary(fit)$fstatistic
f_obs
##      value      numdf      dendf 
##   0.931791 100.000000 400.000000
p_value_f <- 1 - pf(f_obs[1], f_obs[2], f_obs[3])
p_value_f
##     value 
## 0.6596268

We do not reject the null hypothesis that all the (junk) coefficients are zero (phew).

Fisher F-Test - Geometry

\[ \begin{aligned} F &= \frac{1}{p\hat{\sigma}^2}\hat{\boldsymbol{\beta}}^T (\mathbf{X}^T\mathbf{X})\hat{\boldsymbol{\beta}} = \frac{(\mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{X}\hat{\boldsymbol{\beta}}) / p}{\|\hat{\boldsymbol{\epsilon}}\|^2 / (n - p)} \\ &= \frac{\|\hat{\mathbf{y}}\|^2 / p}{\|\hat{\boldsymbol{\epsilon}}\|^2 / (n - p)} = \frac{\|\hat{\mathbf{y}}\|^2 / p}{\|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p)} \\ \end{aligned} \]

  • \(\mathcal{H}_0: \beta_1 = \cdots = \beta_p = 0\) i.e. the model is useless.
  • If \(\|\hat{\boldsymbol{\epsilon}}\|^2\) is small, then the error is small,
    and \(F\) tends to be large, i.e. we tend to reject \(\mathcal{H}_0\):
    the model is useful.
  • If \(\|\hat{\boldsymbol{\epsilon}}\|^2\) is large, then the error is large,
    and \(F\) tends to be small, i.e. we tend to not reject \(\mathcal{H}_0\):
    the model is rather useless.

Geometrical interpretation

Good model

  • \(\mathbf{y}\) not too far from \(\mathcal{M}(\mathbf{X})\).

  • \(\|\hat{\boldsymbol{\epsilon}}\|^2\) not too large compared to \(\|\hat{\mathbf{y}}\|^2\).

  • \(F\) tends to be large.

\(~\)

Bad model

  • \(\mathbf{y}\) almost orth. to \(\mathcal{M}(\mathbf{X})\).

  • \(\|\hat{\boldsymbol{\epsilon}}\|^2\) rather large compared to \(\|\hat{\mathbf{y}}\|^2\).

  • \(F\) tends to be small.

Simulated Dataset - True and Junk

## Data frame
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)
## Fit
fit <- lm(y_sim ~ ., data = data_all)
## multiple t tests
p_values_t <- summary(fit)$coefficients[, 4]
names(p_values_t[p_values_t <= 0.05])
## [1] "(Intercept)" "x_1"         "x_2"         "x_junk.27"   "x_junk.76"  
## [6] "x_junk.80"
## f test
f_obs <- summary(fit)$fstatistic
p_value_f <- 1 - pf(f_obs[1], f_obs[2], f_obs[3])
p_value_f
## value 
##     0

Nested Models Fisher F-test

Nested Models

Can I test? \[ \begin{aligned} \mathcal{H}_0&: \beta_{p_0 + 1} = \cdots = \beta_p = 0 &\text{vs}\\ \mathcal{H}_1&: \exists~k\in \{p_0+1, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} \]

i.e. decide between the full model: \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

and the nested model: \[ \mathbf{y} = \mathbf{X}_0\boldsymbol{\beta}_0 + \boldsymbol{\epsilon} \]

with \(\mathbf{X} = (\mathbf{X}_0 ~ \mathbf{X}_1)\), \(rg(\mathbf{X}) = p\) and \(rg(\mathbf{X}_0) = p_0\).

Geometrical interpretation

Nested Models

Idea: Compare

  • \(\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2\), the distance between the full and nested fitted values, to

  • \(\|\mathbf{y} - \hat{\mathbf{y}}\|^2\), the residual error.

Then:

  • If \(\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2\) small compared to \(\|\mathbf{y} - \hat{\mathbf{y}}\|^2\), then “\(\hat{\mathbf{y}} \approx \hat{\mathbf{y}}_0\)” and the null model is enough.
  • If \(\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2\) large compared to \(\|\mathbf{y} - \hat{\mathbf{y}}\|^2\), then the full model is useful.

Geometrical interpretation

Good model

  • \(\hat{\mathbf{y}}\) quite different from \(\hat{\mathbf{y}}_0\).

  • Full model does add information.

\(~\)

Bad model

  • \(\hat{\mathbf{y}}\) similar to \(\hat{\mathbf{y}}_0\).

  • Full model does not add much information.

Nested Models - F test

\[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \]

  • When \(F\) is large, full model is useful.
  • When \(F\) is small, null model is enough.
  • Distribution of \(F\)?

Distribution of \(F\) - 1/2

\[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \]

Let \(\mathbf{P}\) projection on \(\mathcal{M}(\mathbf{X})\) and \(\mathbf{P}_0\) projection on \(\mathcal{M}(\mathbf{X}_0)\)

\[ \begin{aligned} \hat{\mathbf{y}} - \hat{\mathbf{y}}_0 &= \mathbf{P}\mathbf{y} - \mathbf{P}_0\mathbf{y} = \mathbf{P}\mathbf{y} - \mathbf{P}_0\mathbf{P}\mathbf{y}\\ &= (\mathbf{I}_n - \mathbf{P}_0) \mathbf{P}\mathbf{y} = \mathbf{P}_0^\bot \mathbf{P}\mathbf{y}\\ &\in \mathcal{M}(\mathbf{X}_0)^\bot \cap \mathcal{M}(\mathbf{X}) \end{aligned} \] (using \(\mathbf{P}_0\mathbf{P} = \mathbf{P}_0\), which holds since \(\mathcal{M}(\mathbf{X}_0) \subset \mathcal{M}(\mathbf{X})\))

\[ \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I}_n - \mathbf{P}) \mathbf{y} = \mathbf{P}^\bot \mathbf{y} \in \mathcal{M}(\mathbf{X})^\bot \]

\(\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\) and \(\mathbf{y} - \hat{\mathbf{y}}\) are orthogonal
i.e. their covariance is zero
i.e. (Gaussian assumption) they are independent.

Distribution of \(F\) - 2/2

From Cochran’s theorem: \[ \frac{1}{\sigma^2}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 = \frac{1}{\sigma^2}\|\mathbf{P}^\bot\mathbf{y}\|^2 = \frac{1}{\sigma^2}\|\mathbf{P}^\bot\boldsymbol{\epsilon}\|^2 \sim \chi^2_{n-p} \]

\[ \frac{1}{\sigma^2}\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0 - \mathbf{P}_0^\bot \mathbf{P}\mathbf{X}\boldsymbol{\beta}\|^2 = \frac{1}{\sigma^2}\|\mathbf{P}_0^\bot \mathbf{P}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\|^2 \sim \chi^2_{p-p_0} \] (\(\mathbf{P}_0^\bot \mathbf{P}\) is a projection on a space of dimension \(p - p_0\))

If \(\mathcal{H}_0\) is true, then \(\mathbf{P}_0^\bot \mathbf{P}\mathbf{X}\boldsymbol{\beta} = 0\), hence:

\[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - p_0}_{n - p}. \]

Nested F test

From Pythagoras’ theorem: \[ \begin{aligned} \|\mathbf{y} - \hat{\mathbf{y}}_0\|^2 &= \|\mathbf{y} - \mathbf{P}\mathbf{y} + \mathbf{P}\mathbf{y} - \mathbf{P}_0\mathbf{y}\|^2 = \|\mathbf{P}^\bot\mathbf{y} + \mathbf{P}_0^\bot\mathbf{P}\mathbf{y}\|^2\\ &= \|\mathbf{P}^\bot\mathbf{y}\|^2 + \|\mathbf{P}_0^\bot\mathbf{P}\mathbf{y}\|^2 = \|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2\\ \end{aligned} \]

i.e. \[ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 = \|\mathbf{y} - \hat{\mathbf{y}}_0\|^2 - \|\mathbf{y} - \hat{\mathbf{y}}\|^2 = RSS_0 - RSS \]

Hence: \[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{n - p}{p - p_0}\frac{RSS_0 - RSS}{RSS} \]

Nested F test

To test: \[ \begin{aligned} \mathcal{H}_0&: \beta_{p_0 + 1} = \cdots = \beta_p = 0 &\text{vs}\\ \mathcal{H}_1&: \exists~k\in \{p_0+1, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} \]

we use the \(F\) statistic: \[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{n - p}{p - p_0}\frac{RSS_0 - RSS}{RSS} \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - p_0}_{n - p}. \]

Simulated Dataset - True vs Junk

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

## Data frame
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)
## Fit true
fit_small <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Fit true and junk
fit_all <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3, data = data_all)
## Comparison
anova(fit_small, fit_all)
## Analysis of Variance Table
## 
## Model 1: y_sim ~ x_1 + x_2
## Model 2: y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    497 1900.5                           
## 2    494 1883.2  3    17.297 1.5124 0.2104
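
The F statistic and p-value of this table can be recovered by hand from the two residual sums of squares (a sketch reusing fit_small and fit_all):

## Nested F statistic by hand
RSS0 <- deviance(fit_small); df0 <- df.residual(fit_small)
RSS  <- deviance(fit_all);   df1 <- df.residual(fit_all)
F_obs <- ((RSS0 - RSS) / (df0 - df1)) / (RSS / df1)
F_obs                                          ## 1.5124, as above
pf(F_obs, df0 - df1, df1, lower.tail = FALSE)  ## 0.2104, as above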

Simulated Dataset - True vs Junk

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

## Fit true
fit_small <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Fit all
fit_all <- lm(y_sim ~ ., data = data_all)
## Comparison
anova(fit_small, fit_all)$`Pr(>F)`[2]
## [1] 0.3548787

We do not reject the null hypothesis that the first two variables are enough to explain the data.

Full \(F\) test

Assuming that we have an intercept, we often test: \[ \begin{aligned} \mathcal{H}_0&: \beta_{2} = \cdots = \beta_p = 0 &\text{vs}\\ \mathcal{H}_1&: \exists~k\in \{2, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} \]

The associated \(F\) statistic is (\(p_0 = 1\)): \[ F = \frac{ \|\hat{\mathbf{y}} - \bar{y}\mathbb{1}\|^2 / (p - 1) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - 1}_{n - p}. \]

That is the default test returned by R.

Full \(F\) test

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ x_1 + x_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2523 -1.4517 -0.0302  1.3494  4.8711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.82877    0.08794  -9.424   <2e-16 ***
## x_1          3.01415    0.07534  40.008   <2e-16 ***
## x_2         -0.95922    0.07462 -12.855   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.955 on 497 degrees of freedom
## Multiple R-squared:  0.7905, Adjusted R-squared:  0.7897 
## F-statistic: 937.8 on 2 and 497 DF,  p-value: < 2.2e-16

One coefficient \(F\) test

If \(p_0 = p-1\), we test: \[ \begin{aligned} \mathcal{H}_0&: \beta_p = 0 &\text{vs} \qquad \mathcal{H}_1&: \beta_p \neq 0 \end{aligned} \]

The associated \(F\) statistic is (\(p_0 = p-1\)): \[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_{-p}\|^2 }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{1}_{n - p}. \]

It can be shown that: \[ F = T^2 \quad \text{with} \quad T = \frac{\hat{\beta}_p}{\hat{\sigma}_p} \] so that it is equivalent to the \(t\)-test.
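
The equivalence is easy to verify numerically (a sketch on the simulated data, dropping x_2):

## Check F = T^2 on one coefficient
fit_full <- lm(y_sim ~ x_1 + x_2, data = data_all)
fit_red  <- lm(y_sim ~ x_1, data = data_all)
F_obs <- anova(fit_red, fit_full)$F[2]
T_obs <- summary(fit_full)$coefficients["x_2", "t value"]
all.equal(F_obs, T_obs^2)  ## TRUE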

Variable Selection

How to choose the “best” model?

Model: \[ y_i = \beta_1 x_{i1} + \dotsb + \beta_p x_{ip} + \epsilon_i \]

Problem:
Can we choose the “best” subset of non-zero \(\beta_k\)?

I.e. can we find the predictors that really have an impact on \(\mathbf{y}\)?

\(F\)-tests on all subsets? Not practical: multiple testing.

Idea:
Find a “score” to assess the quality of a given fit.

Variable selection - \(R^2\)

\[ R^2 = 1 - \frac{RSS}{TSS} \]

## Function to get statistics for one fit
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared)
  return(res)
}

\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]

Variable selection - \(R^2\)

## Fit true
get_stats_fit(lm(y_sim ~ x_1 + x_2, data = data_all))
##          R2
## 1 0.7905255
## Fit bigger
get_stats_fit(lm(y_sim ~ x_1 + x_2 + x_junk.1, data = data_all))
##        R2
## 1 0.79133
## Fit wrong
get_stats_fit(lm(y_sim ~ x_1 + x_junk.1, data = data_all))
##          R2
## 1 0.7252314
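
The helper get_all_stats used on the next slides is not shown in these notes. A minimal sketch, assuming the response is the first column of the data frame and that every non-empty subset of the remaining columns is fitted and summarized with the current get_stats_fit:

## Sketch of `get_all_stats`: fit all predictor subsets, collect stats
get_all_stats <- function(data) {
  response <- colnames(data)[1]
  preds <- colnames(data)[-1]
  ## All non-empty subsets of predictors
  subsets <- unlist(lapply(seq_along(preds),
                           function(k) combn(preds, k, simplify = FALSE)),
                    recursive = FALSE)
  do.call(rbind, lapply(subsets, function(s) {
    fit <- lm(reformulate(s, response = response), data = data)
    cbind(get_stats_fit(fit), preds = paste(s, collapse = "+"))
  }))
}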

Variable selection - \(R^2\)

## Only one "junk" predictor
data_sub <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk_1)
## Fit all possible models
get_all_stats(data_sub)
##             R2          preds
## 1 0.7208723311            x_1
## 2 0.1158793338            x_2
## 3 0.0005972444         x_junk
## 4 0.7905254867        x_1+x_2
## 5 0.7208974123     x_1+x_junk
## 6 0.1164841596     x_2+x_junk
## 7 0.7905418666 x_1+x_2+x_junk

\(R^2\) is not enough.

Variable selection - \(R_a^2\)

\[ R_a^2 = 1 - \frac{RSS / (n-p)}{TSS / (n-1)} \]

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared,
                    R2a = sumfit$adj.r.squared)
  return(res)
}

Variable selection - \(R_a^2\)

get_all_stats(data_sub)
##             R2          R2a          preds
## 1 0.7208723311  0.720311834            x_1
## 2 0.1158793338  0.114103991            x_2
## 3 0.0005972444 -0.001409588         x_junk
## 4 0.7905254867  0.789682531        x_1+x_2
## 5 0.7208974123  0.719774263     x_1+x_junk
## 6 0.1164841596  0.112928764     x_2+x_junk
## 7 0.7905418666  0.789274983 x_1+x_2+x_junk

\(R_a^2\): adjusts for the number of predictors.

Adjusted \(R_a^2\)

\[ R_a^2 = 1 - \frac{RSS / (n-p)}{TSS / (n-1)} \]

  • The larger the better.

  • When \(p\) grows, the fit must improve substantially for \(R_a^2\) to rise.

  • Intuitive, but with little theoretical justification.

Predictive Risk

Simulation

set.seed(1289)

## Predictors
n <- 50
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)

## Model
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1

## sim
eps <- rnorm(n, mean = 0, sd = 2)
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

Simulation - Fit

## Useless predictors
p_junk <- 48
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2),
                 ncol = p_junk)

## Good and Bad dataset
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)

## Bad overfitting fit
fit_bad <- lm(y_sim ~ ., data = data_all)
pred_bad <- predict(fit_bad)

## Good fit
fit_good <- lm(y_sim ~ x_1 + x_2, data = data_all)
pred_good <- predict(fit_good)

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)

RSS: “fitted risk”

The bad (bigger) model is closer to the observed data.

## Distance between bad prediction and data (RSS)
sum((y_sim - pred_bad)^2)
## [1] 2.714946e-27
## Distance between good prediction and data (RSS)
sum((y_sim - pred_good)^2)
## [1] 206.9159

Overfitting: the RSS is smaller for the bad, bigger model. (Here the bad fit is saturated: 51 coefficients for \(n = 50\) observations, so its RSS is numerically zero.)

Predictive Risk

Generate new data according to the same generative model:

## New dataset from the same model
eps_new <- rnorm(n, mean = 0, sd = 2)
y_new <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps_new

Question:
Are the predictions from the bad (bigger) model better or worse than those from the good (smaller) model, when compared to the new data?

Simulation - Fit

plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## New points
points(mu_y, y_new, col = "darkgreen", pch = 3)

Predictive Risk

Compute the risk between predictions and new data:

## Distance between bad prediction and true
sum((y_new - pred_bad)^2)
## [1] 535.296
## Distance between good prediction and true
sum((y_new - pred_good)^2)
## [1] 205.6555

The good (smaller) model behaves better on new data.
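
A single new draw is noisy. Averaging over many new draws gives a Monte Carlo estimate of the predictive risk (a sketch, possible here only because we know the generative model):

## Monte Carlo predictive risk of each set of predictions
pred_risk <- function(pred, n_rep = 1000) {
  mean(replicate(n_rep, {
    y_new <- mu_y + rnorm(n, mean = 0, sd = 2)
    sum((y_new - pred)^2)
  }))
}
pred_risk(pred_good)  ## close to the single-draw value above
pred_risk(pred_bad)   ## much larger: the big model overfits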

Predictive Risk

RSS: \(\|\mathbf{y} - \hat{\mathbf{y}}_p\|^2\) with \(\hat{\mathbf{y}}_p\) learnt on \(\mathbf{y}\)
\(\to\) gets smaller when model gets bigger (overfit).

Predictive Risk: \(\| \mathbf{y}' - \hat{\mathbf{y}}_p\|^2\) with \(\mathbf{y}'\) same distribution as \(\mathbf{y}\).
\(\to\) becomes larger if \(\hat{\mathbf{y}}_p\) is overfit.

Idea:
We want to select the model with a good predictive risk,
i.e. the one that best represents the underlying model.

Problem:
In general, we have no means to generate \(\mathbf{y}'\).
\(\to\) We use an estimate of the (theoretical) predictive risk.

Theoretical Risk

Models

Ideal model: \[ \mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon} \quad \boldsymbol{\mu} \in \mathbb{R}^n \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

\(\boldsymbol{\mu}\) the “true” expectation of the observations \(\mathbf{y}\).

Full linear regression model: \[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \quad rg(\mathbf{X}) = p, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

Assume that \(\boldsymbol{\mu}\) can be approximated as \(\mathbf{X} \boldsymbol{\beta}\), i.e. as an element of \(\mathcal{M}(\mathbf{X})\).

Models

Ideal model: \[ \mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon} \quad \boldsymbol{\mu} \in \mathbb{R}^n \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

Sub-models:
keep only predictors \(\eta \subset \{1, \dotsc, p\}\) \[ \mathbf{y} = \mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta} + \boldsymbol{\epsilon}_{\eta} \quad rg(\mathbf{X}_{\eta}) = |\eta|, \quad \boldsymbol{\epsilon}_{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

with \(\mathbf{X}_{\eta}\) the sub-matrix of \(\mathbf{X}\): \[ \mathbf{X}_{\eta} = (\mathbf{x}_{\eta_1} ~ \mathbf{x}_{\eta_2} \cdots \mathbf{x}_{\eta_{|\eta|}}) \]

Assume that \(\boldsymbol{\mu}\) can be approximated as \(\mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta}\), i.e. as an element of \(\mathcal{M}(\mathbf{X}_{\eta})\).

Models

“Truth”: \(\mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon}\), with \(\boldsymbol{\mu} \in \mathbb{R}^n\) and \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n)\).

Models: \[ \mathbf{y} = \mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta} + \boldsymbol{\epsilon}_{\eta} \quad rg(\mathbf{X}_{\eta}) = |\eta|, \quad \boldsymbol{\epsilon}_{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]

Estimators: \[ \begin{aligned} \hat{\boldsymbol{\beta}}_\eta & = \underset{\boldsymbol{\beta} \in \mathbb{R}^{|\eta|}}{\operatorname{argmin}} \|\mathbf{y} - \mathbf{X}_\eta\boldsymbol{\beta}_\eta \|^2 = (\mathbf{X}_\eta^{T}\mathbf{X}_\eta)^{-1}\mathbf{X}_\eta^{T}\mathbf{y}\\ \hat{\mathbf{y}}_\eta & = \mathbf{X}_\eta\hat{\boldsymbol{\beta}}_\eta = \mathbf{P}^{\mathbf{X}_\eta}\mathbf{y} \end{aligned} \]

True expectation of the fitted values: \[ \mathbf{y}_\eta = \mathbb{E}[\hat{\mathbf{y}}_\eta] = \mathbf{P}^{\mathbf{X}_\eta}\boldsymbol{\mu} \]

Risk

Risk of an estimator (theoretical)

Risk of an estimator: \[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] \]

Notes:

  • \(\boldsymbol{\mu}\) is unknown \(\to\) the risk cannot be computed in practice.
  • Expected squared distance between the true mean and the fitted values.
  • Not subject to overfitting.

Bias-Variance Decomposition

Bias-Variance Decomposition

\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta| \sigma^2 \]

  • Bias: \(\|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2\)
    • Difference between true and projection on the model
    • Best that we could do if we knew the truth
    • Decreases when model is more complex, i.e. \(\eta\) gets larger.
  • Variance: \(|\eta| \sigma^2\)
    • Variance of the estimation
    • Increases when model is more complex, i.e. \(\eta\) gets larger.
  • Risk is a compromise between bias and variance.

Bias-Variance - No overfitting

\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta| \sigma^2 \]

No overfitting:

Assume:

  • \(\boldsymbol{\mu} \in \mathcal{M}(\mathbf{X}_{\eta})\) so that \(\mathbf{y}_{\eta} = \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} = \boldsymbol{\mu}\).

  • \(\eta \subset \eta'\) so that \(\mathbf{y}_{\eta'} = \mathbf{P}^{\mathbf{X}_{\eta'}}\boldsymbol{\mu} = \boldsymbol{\mu}\).

Then: \[ \begin{aligned} R(\hat{\mathbf{y}}_{\eta'}) & = \|\boldsymbol{\mu} - \mathbf{y}_{\eta'} \|^2 + |\eta'| \sigma^2 \\ & \geq \|\boldsymbol{\mu} - \mathbf{y}_{\eta'} \|^2 + |\eta| \sigma^2 = \|\boldsymbol{\mu} - \mathbf{y}_{\eta} \|^2 + |\eta| \sigma^2 \\ & \geq R(\hat{\mathbf{y}}_\eta) \end{aligned} \]

Bias-Variance - Proof - 1/2

\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \mathbb{E}\left[ \|\boldsymbol{\mu} - \mathbf{y}_\eta + \mathbf{y}_\eta - \hat{\mathbf{y}}_\eta \|^2 \right] \]

\[ \begin{aligned} R(\hat{\mathbf{y}}_\eta) &= \mathbb{E}\left[ \|\boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \|^2 \right] \\ &= \mathbb{E}\left[ \|(\mathbf{I}_n - \mathbf{P}^{\mathbf{X}_{\eta}})\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right]\\ &= \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \end{aligned} \]

\[ \begin{aligned} R(\hat{\mathbf{y}}_\eta) &= \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} \|^2 + \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \\ &= \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \\ &= \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \end{aligned} \]

Bias-Variance - Proof - 2/2

\[ R(\hat{\mathbf{y}}_\eta) = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \]

From Cochran’s Theorem: \[ \frac{1}{\sigma^2}\|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \sim \chi^2_{|\eta|} \]

Hence: \[ \mathbb{E}\left[ \frac{1}{\sigma^2}\|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] = |\eta| \]

And: \[ R(\hat{\mathbf{y}}_\eta) = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta|\sigma^2. \]

Penalized Least Squares

Oracle Estimator

We would like: \[ \eta_0 = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} R(\hat{\mathbf{y}}_\eta) \]

  • Minimize the risk over all possible models.
  • Problem: \(R(\hat{\mathbf{y}}_\eta)\) is theoretical \(\to\) cannot be computed.
  • Idea: Minimize an estimator \(\hat{R}(\hat{\mathbf{y}}_\eta)\) of \(R(\hat{\mathbf{y}}_\eta)\): \[ \hat{\eta} = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} \hat{R}(\hat{\mathbf{y}}_\eta). \]

Expectation of the RSS

For any model \(\eta\), the expectation of the RSS is: \[ \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] = R(\hat{\mathbf{y}}_\eta) + n\sigma^2 - 2|\eta|\sigma^2 \]

Expectation of the RSS - Proof

\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= \mathbb{E}[\|\mathbf{y} - \boldsymbol{\mu} + \boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2] \\ &= \mathbb{E}[\|\mathbf{y} - \boldsymbol{\mu}\|^2 + \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 + 2 \langle \mathbf{y} - \boldsymbol{\mu}; \boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \rangle] \end{aligned} \]

\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= \mathbb{E}[\|\boldsymbol{\epsilon}\|^2] + R(\hat{\mathbf{y}}_\eta) + 2 \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] \qquad \end{aligned} \]

\[ \begin{aligned} \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] &= \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} + \boldsymbol{\epsilon}) \rangle] \\ &= \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} \rangle] - \mathbb{E}[\langle \boldsymbol{\epsilon}; \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon} \rangle] \end{aligned} \]

\[ \begin{aligned} \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] &= 0 - \mathbb{E}[\langle \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon}; \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon} \rangle] \\ &= - \mathbb{E}[\| \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon}\|^2] \\ &= - |\eta| \sigma^2 \end{aligned} \]

\[ \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] = n \sigma^2 + R(\hat{\mathbf{y}}_\eta) - 2 |\eta| \sigma^2 \]

Mallows’ \(C_p\)

We have: \[ \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] = R(\hat{\mathbf{y}}_\eta) + n\sigma^2 - 2|\eta|\sigma^2 \]

Mallow’s \(C_p\): \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \]

  • If \(\hat{\sigma}^2\) is unbiased, then: \[ \mathbb{E}[C_p(\hat{\mathbf{y}}_\eta)] = \frac{1}{n}R(\hat{\mathbf{y}}_\eta) + \sigma^2 \]

Mallows’ \(C_p\)

We have: \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \] And: \[ \mathbb{E}[C_p(\hat{\mathbf{y}}_\eta)] = \frac{1}{n}R(\hat{\mathbf{y}}_\eta) + \sigma^2 \]

Hence: \[ \hat{\eta} = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} C_p(\hat{\mathbf{y}}_\eta) = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} \hat{R}(\hat{\mathbf{y}}_\eta). \]

\(\to\) Minimizing Mallows’ \(C_p\) is the same as minimizing an unbiased estimate of the risk.

Mallows’ \(C_p\)

\[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \]

  • Penalize for the number of parameters \(|\eta|\).

  • The smaller the better.

Simulated Dataset

## Mallows' Cp
Cp <- function(fit) {
  n <- nobs(fit)
  p <- n - df.residual(fit)     ## number of fitted coefficients
  RSS <- deviance(fit)
  sigma2_hat <- RSS / (n - p)   ## unbiased variance estimate of this fit
  return((RSS + 2 * p * sigma2_hat) / n)  ## penalty 2|eta|sigma2, as above
}
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(Cp = Cp(fit))
  return(res)
}

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##          Cp          preds
## 1  5.085236            x_1
## 2 16.107190            x_2
## 3 18.207435         x_junk
## 4  3.823952        x_1+x_2
## 5  5.095010     x_1+x_junk
## 6 16.128558     x_2+x_junk
## 7  3.831361 x_1+x_2+x_junk

Mallows’ \(C_p\) is smallest for the true model.

Penalized Log Likelihood

Maximized Log Likelihood

Recall: \[ \begin{aligned} (\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{\eta,ML}) &= \underset{(\boldsymbol{\beta}, \sigma^2) \in \mathbb{R}^{|\eta|}\times\mathbb{R}_+^*}{\operatorname{argmax}} \log L(\boldsymbol{\beta}, \sigma^2 | \mathbf{y})\\ &= \underset{(\boldsymbol{\beta}, \sigma^2) \in \mathbb{R}^{|\eta|}\times\mathbb{R}_+^*}{\operatorname{argmax}} - \frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}_\eta\boldsymbol{\beta} \|^2 \end{aligned} \]

And: \[ \hat{\sigma}^2_{\eta,ML} = \frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2} {n} \quad;\quad \hat{\mathbf{y}}_\eta = \mathbf{X}_\eta\hat{\boldsymbol{\beta}}_\eta \]

Thus: \[ \log L(\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{ML} | \mathbf{y}) = \cdots \]

Maximized Log Likelihood

The maximized log-likelihood is equal to: \[ \log L(\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{ML} | \mathbf{y}) = - \frac{n}{2} \log\left(2\pi\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{1}{2\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}}\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \]

i.e. \[ \log \hat{L}_\eta = - \frac{n}{2} \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{n}{2} \log\left(2\pi\right) - \frac{n}{2} \]
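
This closed form can be checked against R’s logLik (a sketch, reusing data_sub from the variable-selection slides):

## Closed form vs logLik()
fit <- lm(y_sim ~ x_1 + x_2, data = data_sub)
n <- nobs(fit); RSS <- deviance(fit)
- n/2 * log(RSS/n) - n/2 * log(2*pi) - n/2  ## closed form
as.numeric(logLik(fit))                     ## same value (-1043.286 here)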

Maximized Log Likelihood is increasing

Remember, for any \(\eta \subset \eta'\): \[ \|\mathbf{y} - \hat{\mathbf{y}}_{\eta'}\|^2 \leq \|\mathbf{y} - \hat{\mathbf{y}}_\eta\|^2 \]

Hence \(R^2_\eta = 1 - \frac{\|\mathbf{y} - \hat{\mathbf{y}}_\eta\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2}\) always increases when we add a predictor.

Similarly: \[ \log \hat{L}_\eta = - \frac{n}{2} \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{n}{2} \log\left(2\pi\right) - \frac{n}{2} \] always increases when we add some predictors.

Penalized Maximized Log Likelihood

Problem: \(\log \hat{L}_\eta\) always increases when we add predictors.

\(\to\) it cannot be used for model selection (same problem as \(R^2\)).

Idea: add a penalty term that depends on the size of the model

\(\to\) minimize: \[ - \log \hat{L}_\eta + f(|\eta|) \qquad \text{with f increasing} \]

Penalized Maximized Log Likelihood

Minimize: \[ - \log \hat{L}_\eta + f(|\eta|) \qquad \text{with f increasing} \]

For \(\eta \subset \eta'\):

  • \(- \log \hat{L}_\eta\) decreases

  • \(f(|\eta|)\) increases

Goal: find \(f\) that compensates for the “overfit”, i.e. the noisy and insignificant increase of the likelihood when we add some unrelated predictors.

Simulated Dataset

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit))
  return(res)
}

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##   minusLogLik          preds
## 1    1115.053            x_1
## 2    1403.284            x_2
## 3    1433.925         x_junk
## 4    1043.286        x_1+x_2
## 5    1115.030     x_1+x_junk
## 6    1403.113     x_2+x_junk
## 7    1043.266 x_1+x_2+x_junk

When we add “junk” variables, \(- \log \hat{L}_\eta\) decreases, but only slightly.

Question: can we find a good penalty that compensates for this small, noisy increase of the likelihood?

AIC and BIC

Akaike Information Criterion - AIC

\[ AIC = - 2\log \hat{L}_\eta + 2 |\eta| \]

\(~\)

  • \(\hat{L}_\eta\) is the maximized likelihood of the model.

  • \(|\eta|\) is the dimension of the model.

  • The smaller the better.

  • Asymptotic theoretical guarantees.

  • Easy to use and widely adopted.

Bayesian Information Criterion - BIC

\[ BIC = - 2\log \hat{L}_\eta + |\eta| \log(n) \]

\(~\)

  • \(\hat{L}_\eta\) is the maximized likelihood of the model.

  • \(|\eta|\) is the dimension of the model.

  • The smaller the better.

  • Based on an approximation of the marginal likelihood.

  • Easy to use and widely adopted.

See: Lebarbier, Mary-Huard. Le critère BIC : fondements théoriques et interprétation. 2004. [inria-00070685]

AIC vs BIC

\[ AIC = - 2\log \hat{L}_\eta + 2 |\eta| \]

\[ BIC = - 2\log \hat{L}_\eta + |\eta| \log(n) \]

  • Both have asymptotic theoretical justifications.

  • BIC penalizes more than AIC \(\to\) chooses smaller models.

  • Both widely used.

  • There are some (better) non-asymptotic criteria, see
    Giraud C. 2014. Introduction to high-dimensional statistics.
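
Both criteria are easy to compute by hand from logLik (a sketch; note that R’s parameter count df includes \(\sigma^2\) in addition to the coefficients, a constant shift across models that does not affect the selection):

## AIC and BIC by hand
fit <- lm(y_sim ~ x_1 + x_2, data = data_sub)
ll <- logLik(fit)
k <- attr(ll, "df")                       ## coefficients + sigma^2
-2 * as.numeric(ll) + 2 * k               ## = AIC(fit): 2094.572 here
-2 * as.numeric(ll) + k * log(nobs(fit))  ## = BIC(fit): 2111.430 here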

Simulated Dataset

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit),
                    AIC = AIC(fit),
                    BIC = BIC(fit))
  return(res)
}

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##   minusLogLik      AIC      BIC          preds
## 1    1115.053 2236.105 2248.749            x_1
## 2    1403.284 2812.567 2825.211            x_2
## 3    1433.925 2873.850 2886.493         x_junk
## 4    1043.286 2094.572 2111.430        x_1+x_2
## 5    1115.030 2238.060 2254.919     x_1+x_junk
## 6    1403.113 2814.225 2831.084     x_2+x_junk
## 7    1043.266 2096.533 2117.606 x_1+x_2+x_junk
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
##      minusLogLik              AIC              BIC 
## "x_1+x_2+x_junk"        "x_1+x_2"        "x_1+x_2"

Variable selection - Advertising data

## Advertising data
library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")

## Apply stats to all possible combinations
all_stats <- get_all_stats(ad[, c(4, 1:3)])
all_stats
##   minusLogLik       AIC       BIC              preds
## 1    519.0457 1044.0913 1053.9863                 TV
## 2    573.3369 1152.6738 1162.5687              radio
## 3    608.3357 1222.6714 1232.5663          newspaper
## 4    386.1970  780.3941  793.5874           TV+radio
## 5    509.8891 1027.7782 1040.9714       TV+newspaper
## 6    573.2361 1154.4723 1167.6655    radio+newspaper
## 7    386.1811  782.3622  798.8538 TV+radio+newspaper
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
##          minusLogLik                  AIC                  BIC 
## "TV+radio+newspaper"           "TV+radio"           "TV+radio"

Forward and Backward Search

Combinatorial problem

\[ p \text{ predictors } \to 2^p \text{ possible models.} \]

  • Cannot test all possible models.
  • “Manual” solution: forward or backward searches.

Forward search

Forward search - Init

## Full formula
full_formula <- formula(y_sim ~ x_1 + x_2 
                        + x_junk.1 + x_junk.2 + x_junk.3 
                        + x_junk.4 + x_junk.5 + x_junk.6)

## Complete search
2^8
## [1] 256
## No predictors
fit0 <- lm(y_sim ~ 1, data = data_all)

Forward search - Add term

## Add one
library(MASS)
addterm(fit0, full_formula)
## Single term additions
## 
## Model:
## y_sim ~ 1
##          Df Sum of Sq    RSS     AIC
## <none>                731.66 136.165
## x_1       1    466.43 265.23  87.429
## x_2       1     56.39 675.27 134.154
## x_junk.1  1     50.05 681.61 134.622
## x_junk.2  1     32.10 699.56 135.921
## x_junk.3  1      0.41 731.26 138.137
## x_junk.4  1      2.49 729.18 137.995
## x_junk.5  1     54.28 677.38 134.311
## x_junk.6  1     11.59 720.08 137.367

Forward search - Add term

## Add x_1
fit1 <- lm(y_sim ~ 1 + x_1, data = data_all)
## Add another
addterm(fit1, full_formula)
## Single term additions
## 
## Model:
## y_sim ~ 1 + x_1
##          Df Sum of Sq    RSS    AIC
## <none>                265.23 87.429
## x_2       1    58.314 206.92 77.014
## x_junk.1  1     6.038 259.19 88.277
## x_junk.2  1     0.159 265.07 89.399
## x_junk.3  1     0.376 264.85 89.358
## x_junk.4  1     0.403 264.83 89.353
## x_junk.5  1     2.587 262.64 88.939
## x_junk.6  1     6.791 258.44 88.132

Forward search - Add term

## Add x_2
fit2 <- lm(y_sim ~ 1 + x_1 + x_2, data = data_all)
## Add another
addterm(fit2, full_formula)
## Single term additions
## 
## Model:
## y_sim ~ 1 + x_1 + x_2
##          Df Sum of Sq    RSS    AIC
## <none>                206.92 77.014
## x_junk.1  1    3.0199 203.90 78.279
## x_junk.2  1    2.1852 204.73 78.484
## x_junk.3  1    0.8228 206.09 78.815
## x_junk.4  1    1.6416 205.27 78.616
## x_junk.5  1    4.9363 201.98 77.807
## x_junk.6  1    3.7814 203.13 78.092

Backward search

Backward search - Init

## Full formula
fit0 <- lm(y_sim ~ x_1 + x_2 
                        + x_junk.1 + x_junk.2 + x_junk.3 
                        + x_junk.4 + x_junk.5 + x_junk.6,
           data = data_all)
## drop one
dropterm(fit0)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 + x_junk.4 + 
##     x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                191.34  85.100
## x_1       1    336.07 527.41 133.797
## x_2       1     57.75 249.09  96.289
## x_junk.1  1      2.92 194.25  83.857
## x_junk.2  1      2.26 193.60  83.688
## x_junk.3  1      1.46 192.80  83.480
## x_junk.4  1      1.48 192.81  83.485
## x_junk.5  1      2.08 193.42  83.641
## x_junk.6  1      4.38 195.72  84.233

Backward search - Drop term

## drop x_junk.3
fit1 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.1 + x_junk.2
           + x_junk.4 + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit1)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.4 + x_junk.5 + 
##     x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                192.80  83.480
## x_1       1    335.92 528.71 131.921
## x_2       1     57.16 249.96  94.463
## x_junk.1  1      2.30 195.09  82.073
## x_junk.2  1      2.18 194.98  82.043
## x_junk.4  1      1.15 193.95  81.779
## x_junk.5  1      2.27 195.06  82.065
## x_junk.6  1      4.73 197.52  82.691

Backward search - Drop term

## drop x_junk.4
fit2 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.1 + x_junk.2
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit2)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                193.95  81.779
## x_1       1    335.82 529.77 130.021
## x_2       1     56.44 250.39  92.550
## x_junk.1  1      1.95 195.90  80.278
## x_junk.2  1      2.27 196.21  80.359
## x_junk.5  1      2.96 196.91  80.537
## x_junk.6  1      4.74 198.69  80.986

Backward search - Drop term

## drop x_junk.1
fit3 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.2
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit3)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.2 + x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                195.90  80.278
## x_1       1    368.05 563.94 131.147
## x_2       1     59.57 255.47  91.554
## x_junk.2  1      2.46 198.36  78.903
## x_junk.5  1      3.71 199.61  79.217
## x_junk.6  1      4.71 200.60  79.466

Backward search - Drop term

## drop x_junk.2
fit4 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit4)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                198.36  78.903
## x_1       1    422.20 620.56 133.930
## x_2       1     57.59 255.95  89.648
## x_junk.5  1      4.78 203.13  78.092
## x_junk.6  1      3.62 201.98  77.807

Backward search - Drop term

## drop x_junk.6
fit5 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.5, data = data_all)
## Drop another
dropterm(fit5)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.5
##          Df Sum of Sq    RSS     AIC
## <none>                201.98  77.807
## x_1       1    426.25 628.23 132.544
## x_2       1     60.66 262.64  88.939
## x_junk.5  1      4.94 206.92  77.014

Backward search - Drop term

## drop x_junk.5
fit6 <- lm(y_sim ~ x_1 + x_2,
           data = data_all)
## Drop another
dropterm(fit6)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2
##        Df Sum of Sq    RSS     AIC
## <none>              206.92  77.014
## x_1     1    468.35 675.27 134.154
## x_2     1     58.31 265.23  87.429

Forward and backward searches

  • Both are heuristics (an automated version is sketched below).
  • There is no guarantee that the two searches converge to the same solution.
  • No theoretical guarantees on the selected model.
  • Model selection is still an active area of research.
    • See e.g. LASSO and other penalized criteria (M2).
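
For reference, MASS automates this loop: stepAIC chains addterm / dropterm steps until the AIC stops improving (a sketch reusing full_formula and data_all from the slides above; MASS is already loaded):

## Automated forward search with stepAIC
fit_null <- lm(y_sim ~ 1, data = data_all)
fit_forward <- stepAIC(fit_null, scope = full_formula,
                       direction = "forward", trace = FALSE)
formula(fit_forward)  ## y_sim ~ x_1 + x_2, as in the manual search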

AIC vs log FPE

Final Prediction Error - FPE

Mallows’ \(C_p\): \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2_{\text{full}} \right) \]

FPE: \[ FPE(\hat{\mathbf{y}}_\eta) = \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2_\eta \quad \text{with} \quad \hat{\sigma}^2_\eta = \frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2} {n - |\eta|} \]

Hence: \[ FPE(\hat{\mathbf{y}}_\eta) = \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \]
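
In R, the FPE is immediate from a fitted model (a minimal sketch consistent with the formula above, counting all fitted coefficients in \(|\eta|\)):

## FPE from a fitted model
FPE <- function(fit) {
  n <- nobs(fit)
  p <- n - df.residual(fit)  ## number of fitted coefficients
  deviance(fit) * (1 + 2 * p / (n - p))
}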

AIC vs Log FPE

\[ \begin{aligned} \log \frac1n FPE(\hat{\mathbf{y}}_\eta) &= \log \left(\frac1n \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \right)\\ &= \log \left(\frac1n \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \right) + \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \end{aligned} \] But: \[ n \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) = - 2\log \hat{L}_\eta - n \log\left(2\pi\right) - n \]

AIC vs Log FPE

Hence: \[ n\log \left( \frac1n FPE(\hat{\mathbf{y}}_\eta) \right) + n \log\left(2\pi\right) + n = - 2\log \hat{L}_\eta + n \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \]

When \(n\) is large: \[ n \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \approx n \frac{2 |\eta|} {n - |\eta|} \approx 2 |\eta| \]

AIC vs Log FPE

Hence: \[ \begin{aligned} n\log \frac1n FPE(\hat{\mathbf{y}}_\eta) + n \log\left(2\pi\right) + n &\approx - 2\log \hat{L}_\eta + 2 |\eta| \\ &\approx AIC(\hat{\mathbf{y}}_\eta) \end{aligned} \]

The log FPE and the AIC criteria asymptotically select the same models.

BIC and Laplace Approximation

Bayesian Model Selection

In a Bayesian setting,
we have priors on the models \(\eta\) and parameters \(\boldsymbol{\theta}_\eta = (\boldsymbol{\beta}_\eta, \sigma^2_\eta)\): \[ \left\{ \begin{aligned} \eta &\sim \pi_{\eta} \\ \boldsymbol{\theta}_\eta & \sim p(\boldsymbol{\theta} | \eta) \end{aligned} \right. \]

The likelihood given some parameters and a model is: \(p(\mathbf{y} | \boldsymbol{\theta}, \eta)\)

the marginal likelihood given a model is: \[ p(\mathbf{y} | \eta) = \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \]

and the posterior probabilities of the models are: \[ p(\eta | \mathbf{y}). \]

Bayesian Model Selection

Goal: find the model with the highest posterior probability: \[ \hat{\eta} = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) \]

Assume that all models have the same prior probability: \[ \pi_{\eta} \equiv \pi. \]

Then, from Bayes rule: \[ p(\eta | \mathbf{y}) = p(\mathbf{y} | \eta)\pi_{\eta} / p(\mathbf{y}) = p(\mathbf{y} | \eta) \times \pi / p(\mathbf{y}) \]

And: \[ \hat{\eta} = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta). \]

Bayesian Model Selection

Goal: find the model with the highest posterior probability: \[ \begin{aligned} \hat{\eta} & = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta) \\ & = \underset{\eta}{\operatorname{argmax}} \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \end{aligned} \]

Idea:
Assume that \[p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta)\] is concentrated around its maximum \[p(\mathbf{y} | \boldsymbol{\theta}_\eta^*, \eta) p(\boldsymbol{\theta}_\eta^* | \eta)\] and apply Laplace Approximation.

Laplace Approximation

Let \(L: \mathbb{R}^q \to \mathbb{R}\) be a three times differentiable function, with a unique maximum \(\mathbf{u}^*\). Then: \[ \int_{\mathbb{R}^q} e^{n L(\mathbf{u})} d \mathbf{u} = \left(\frac{2\pi}{n}\right)^{q/2} |-L''(\mathbf{u}^*)|^{-1/2}e^{nL(\mathbf{u}^*)} \left(1 + O(n^{-1})\right) \]

Applied to: \[ L(\boldsymbol{\theta}) = \frac1n\log p(\mathbf{y} | \boldsymbol{\theta}, \eta) + \frac1n\log p(\boldsymbol{\theta} | \eta) \] and assuming \(\log p(\mathbf{y} | \boldsymbol{\theta}^*, \eta) \approx \log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta)\), we get: \[ \log \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \approx \log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) - \frac{|\eta|}{2} \log(n). \]
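
In more detail, writing out the Laplace expansion and keeping only the terms that grow with \(n\) (here \(q = |\eta|\); both \(\log p(\boldsymbol{\theta}^* | \eta)\) and the Hessian term stay bounded as \(n \to \infty\)):

\[ \begin{aligned} \log \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} &= n L(\boldsymbol{\theta}^*) + \frac{|\eta|}{2} \log\left(\frac{2\pi}{n}\right) - \frac{1}{2} \log |-L''(\boldsymbol{\theta}^*)| + o(1) \\ &= \log p(\mathbf{y} | \boldsymbol{\theta}^*, \eta) + \log p(\boldsymbol{\theta}^* | \eta) - \frac{|\eta|}{2} \log(n) + O(1) \\ &= \log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) - \frac{|\eta|}{2} \log(n) + O(1). \end{aligned} \]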

BIC

Hence: \[ \begin{aligned} \hat{\eta} & = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta) \\ & \approx \underset{\eta}{\operatorname{argmax}} \left\{\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) - \frac{|\eta|}{2} \log(n)\right\}\\ & \approx \underset{\eta}{\operatorname{argmin}} \left\{-2\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) + |\eta| \log(n) \right\}\\ &\approx \underset{\eta}{\operatorname{argmin}} \left\{- 2\log \hat{L}_\eta + |\eta| \log(n)\right\} = \underset{\eta}{\operatorname{argmin}} BIC(\hat{\mathbf{y}}_\eta) \end{aligned} \]

BIC is, up to the factor \(-1/2\), an approximation of the log marginal likelihood \(\log p(\mathbf{y} | \eta)\).