20/03/2024
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]
Distribution of \(\hat{\boldsymbol{\beta}}\): \[ \hat{\boldsymbol{\beta}} \sim \mathcal{N}\left( \boldsymbol{\beta}; \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \right) \quad \text{and} \quad \frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}} \sim \mathcal{T}_{n-p}. \]
Distribution of \(\hat{\sigma}^2\): \[ \frac{(n-p) \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}. \]
\[ \text{Hypothesis:} \qquad\qquad \mathcal{H}_0: \beta_k = 0 \quad \text{vs} \quad \mathcal{H}_1: \beta_k \neq 0 \]
\[ \text{Test Statistic:} \qquad\qquad\qquad\qquad T_k = \frac{\hat{\beta}_k}{\sqrt{\hat{\sigma}^2_k}} \underset{\mathcal{H}_0}{\sim} \mathcal{T}_{n-p} \]
\[ \text{Reject Region:} \qquad\qquad\qquad\qquad\qquad\qquad\qquad\quad\\ \mathbb{P}[\text{Reject} | \mathcal{H}_0 \text{ true}] \leq \alpha \qquad\qquad\qquad\qquad\qquad\\ \qquad\qquad\qquad \iff \mathcal{R}_\alpha = \left\{ t \in \mathbb{R} ~|~ |t| \geq t_{n-p}(1 - \alpha/2) \right\} \]
\[ \text{p value:}\qquad\qquad\qquad\qquad\quad p = \mathbb{P}_{\mathcal{H}_0}[|T_k| > T_k^{obs}] \]
\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]
set.seed(1289)
## Predictors
n <- 500
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)
## Noise
eps <- rnorm(n, mean = 0, sd = 2)
## Model sim
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Useless predictor
x_junk_1 <- runif(n, min = -2, max = 2)
\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]
fit <- lm(y_sim ~ x_1 + x_2 + x_junk_1)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ x_1 + x_2 + x_junk_1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.275 -1.453 -0.028  1.371  4.893 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.82789    0.08814  -9.393   <2e-16 ***
## x_1          3.01467    0.07546  39.952   <2e-16 ***
## x_2         -0.95917    0.07469 -12.842   <2e-16 ***
## x_junk_1     0.01534    0.07791   0.197    0.844    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.957 on 496 degrees of freedom
## Multiple R-squared:  0.7905, Adjusted R-squared:  0.7893 
## F-statistic:   624 on 3 and 496 DF,  p-value: < 2.2e-16
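To tie this output back to the formulas above, here is a small sketch that recomputes the \(t\) statistic and \(p\)-value of \(x_1\) by hand from the fitted model (`vcov(fit)` returns \(\hat{\sigma}^2 (\mathbf{X}^T\mathbf{X})^{-1}\)); the result should match the `summary()` table.

## Recompute the t statistic and p-value of x_1 by hand (sketch)
beta_hat <- coef(fit)["x_1"]              # estimated coefficient
se_hat <- sqrt(vcov(fit)["x_1", "x_1"])   # sqrt(sigma2_hat * [(X'X)^-1]_kk)
t_obs <- beta_hat / se_hat                # observed t statistic
p_val <- 2 * pt(abs(t_obs), df = df.residual(fit), lower.tail = FALSE)
c(t = unname(t_obs), p = unname(p_val))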
\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]
## Other Useless Predictors
p_junk <- 100
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2), ncol = p_junk)
## Data frame
data_junk <- data.frame(y_sim = y_sim, x_junk = x_junk)
fit <- lm(y_sim ~ -1 + ., data = data_junk)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ -1 + ., data = data_junk)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8683  -3.0605  -0.2674   2.3554  11.0821 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## x_junk.1    0.2129516  0.1828010   1.165   0.2447  
## x_junk.2   -0.1075770  0.1919617  -0.560   0.5755  
## x_junk.3    0.4322452  0.1945466   2.222   0.0269 *
## x_junk.4    0.1471909  0.1886285   0.780   0.4357  
## x_junk.5   -0.1110445  0.1848774  -0.601   0.5484  
## x_junk.6    0.1276583  0.1829333   0.698   0.4857  
## x_junk.7    0.0079412  0.1944593   0.041   0.9674  
## x_junk.8   -0.1126224  0.1854571  -0.607   0.5440  
## x_junk.9   -0.0560611  0.1918247  -0.292   0.7702  
## x_junk.10  -0.3215171  0.1883253  -1.707   0.0886 .
## x_junk.11  -0.0875118  0.1891359  -0.463   0.6438  
## x_junk.12  -0.0822697  0.1893373  -0.435   0.6641  
## x_junk.13   0.0862599  0.1874289   0.460   0.6456  
## x_junk.14  -0.1322800  0.1895130  -0.698   0.4856  
## x_junk.15   0.1748496  0.1751535   0.998   0.3188  
## x_junk.16  -0.2152065  0.1795766  -1.198   0.2315  
## x_junk.17   0.0987989  0.1868782   0.529   0.5973  
## x_junk.18  -0.0589122  0.1964012  -0.300   0.7644  
## x_junk.19   0.0407544  0.1896260   0.215   0.8299  
## x_junk.20   0.2788872  0.1817698   1.534   0.1257  
## x_junk.21   0.0780682  0.1901043   0.411   0.6815  
## x_junk.22   0.0001428  0.1936429   0.001   0.9994  
## x_junk.23  -0.2146891  0.1860330  -1.154   0.2492  
## x_junk.24   0.1135748  0.1794805   0.633   0.5272  
## x_junk.25   0.1797087  0.1849661   0.972   0.3318  
## x_junk.26  -0.0095168  0.1885643  -0.050   0.9598  
## x_junk.27  -0.0697160  0.1782409  -0.391   0.6959  
## x_junk.28  -0.1294379  0.1832707  -0.706   0.4804  
## x_junk.29  -0.0138295  0.1830424  -0.076   0.9398  
## x_junk.30   0.2922539  0.1918362   1.523   0.1284  
## x_junk.31   0.1576348  0.1867874   0.844   0.3992  
## x_junk.32   0.0598461  0.1892184   0.316   0.7520  
## x_junk.33   0.2018882  0.1812726   1.114   0.2661  
## x_junk.34  -0.0116431  0.1840672  -0.063   0.9496  
## x_junk.35   0.0713938  0.1833419   0.389   0.6972  
## x_junk.36   0.0190741  0.1798446   0.106   0.9156  
## x_junk.37  -0.0238114  0.1885668  -0.126   0.8996  
## x_junk.38  -0.1867089  0.1838147  -1.016   0.3104  
## x_junk.39  -0.0777722  0.1862288  -0.418   0.6765  
## x_junk.40  -0.0434927  0.1859361  -0.234   0.8152  
## x_junk.41   0.0814161  0.1904542   0.427   0.6693  
## x_junk.42   0.1292109  0.1870557   0.691   0.4901  
## x_junk.43  -0.3022911  0.1835507  -1.647   0.1004  
## x_junk.44   0.0831695  0.1823006   0.456   0.6485  
## x_junk.45   0.1332532  0.1869381   0.713   0.4764  
## x_junk.46  -0.1348223  0.1838432  -0.733   0.4638  
## x_junk.47  -0.0218524  0.1972836  -0.111   0.9119  
## x_junk.48   0.2795708  0.1854204   1.508   0.1324  
## x_junk.49  -0.1531037  0.1798952  -0.851   0.3952  
## x_junk.50   0.0520440  0.1852843   0.281   0.7789  
## x_junk.51   0.1682396  0.1835292   0.917   0.3599  
## x_junk.52  -0.1190719  0.1853241  -0.643   0.5209  
## x_junk.53  -0.0989078  0.1857934  -0.532   0.5948  
## x_junk.54  -0.1325051  0.1840157  -0.720   0.4719  
## x_junk.55   0.1006848  0.1940173   0.519   0.6041  
## x_junk.56   0.2034148  0.1764952   1.153   0.2498  
## x_junk.57   0.1153209  0.1766600   0.653   0.5143  
## x_junk.58  -0.0463597  0.1772648  -0.262   0.7938  
## x_junk.59   0.0240049  0.1889276   0.127   0.8990  
## x_junk.60   0.3726537  0.1909670   1.951   0.0517 .
## x_junk.61   0.0250128  0.1864171   0.134   0.8933  
## x_junk.62  -0.2615601  0.1901999  -1.375   0.1698  
## x_junk.63   0.0774898  0.1871022   0.414   0.6790  
## x_junk.64   0.0545841  0.1966734   0.278   0.7815  
## x_junk.65   0.0725361  0.1882312   0.385   0.7002  
## x_junk.66  -0.1206593  0.1897174  -0.636   0.5251  
## x_junk.67  -0.3336932  0.1818668  -1.835   0.0673 .
## x_junk.68   0.0981128  0.1912111   0.513   0.6082  
## x_junk.69  -0.0153031  0.1858223  -0.082   0.9344  
## x_junk.70   0.1886025  0.1830855   1.030   0.3036  
## x_junk.71  -0.0695791  0.1906992  -0.365   0.7154  
## x_junk.72  -0.1030034  0.1840569  -0.560   0.5760  
## x_junk.73   0.0084075  0.1823678   0.046   0.9633  
## x_junk.74   0.2999893  0.1783620   1.682   0.0934 .
## x_junk.75   0.3355257  0.1843357   1.820   0.0695 .
## x_junk.76  -0.1894886  0.1956830  -0.968   0.3335  
## x_junk.77  -0.1183169  0.1843228  -0.642   0.5213  
## x_junk.78  -0.1651666  0.1832443  -0.901   0.3679  
## x_junk.79   0.0947620  0.1867396   0.507   0.6121  
## x_junk.80   0.2538003  0.1910030   1.329   0.1847  
## x_junk.81   0.0711922  0.1917948   0.371   0.7107  
## x_junk.82  -0.0098761  0.2003307  -0.049   0.9607  
## x_junk.83   0.0102922  0.1798480   0.057   0.9544  
## x_junk.84   0.2163918  0.1807194   1.197   0.2319  
## x_junk.85  -0.2514098  0.1881750  -1.336   0.1823  
## x_junk.86   0.0181126  0.1832811   0.099   0.9213  
## x_junk.87  -0.0907617  0.1809516  -0.502   0.6162  
## x_junk.88   0.0354845  0.1869592   0.190   0.8496  
## x_junk.89  -0.3283331  0.1867203  -1.758   0.0794 .
## x_junk.90  -0.2284999  0.1895045  -1.206   0.2286  
## x_junk.91  -0.0575054  0.1800322  -0.319   0.7496  
## x_junk.92   0.0612578  0.1799226   0.340   0.7337  
## x_junk.93  -0.0192369  0.1866956  -0.103   0.9180  
## x_junk.94  -0.0615478  0.1828148  -0.337   0.7365  
## x_junk.95   0.3675294  0.1778881   2.066   0.0395 *
## x_junk.96  -0.0015247  0.1899413  -0.008   0.9936  
## x_junk.97  -0.1364003  0.1779889  -0.766   0.4439  
## x_junk.98   0.4422968  0.1888431   2.342   0.0197 *
## x_junk.99   0.2366755  0.1815971   1.303   0.1932  
## x_junk.100  0.0010360  0.1905522   0.005   0.9957  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.311 on 400 degrees of freedom
## Multiple R-squared:  0.1889, Adjusted R-squared:  -0.01383 
## F-statistic: 0.9318 on 100 and 400 DF,  p-value: 0.6596
p_values <- summary(fit)$coefficients[, 4]
hist(p_values, col = "lightblue", breaks = 20)
sum(p_values <= 0.05)
## [1] 3
summary(fit)$coefficients[p_values <= 0.05, ]
##            Estimate Std. Error  t value   Pr(>|t|)
## x_junk.3  0.4322452  0.1945466 2.221807 0.02685525
## x_junk.95 0.3675294  0.1778881 2.066071 0.03946512
## x_junk.98 0.4422968  0.1888431 2.342138 0.01966334
\(~\)
Reject Region: \(\mathbb{P}[\text{Reject} | \mathcal{H}_0 \text{ true}] \leq \alpha\)
Just by chance, we expect about 5% of the truly null coefficients to come out significant: false discoveries.
When \(\sigma^2\) is known:
\[ \hat{\boldsymbol{\beta}} \sim \mathcal{N}\left( \boldsymbol{\beta}; \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \right). \]
When \(\sigma^2\) is unknown: \[ \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{F}^p_{n-p} \]
with \(\mathcal{F}^p_{n-p}\) the Fisher (\(F\)) distribution with \(p\) and \(n-p\) degrees of freedom.
Let \(U_1 \sim \chi^2_{p_1}\) and \(U_2 \sim \chi^2_{p_2}\), \(U_1\) and \(U_2\) independent. Then \[ F = \frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2} \]
Let’s show: \[ \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{F}^p_{n-p} \]
Hints:
\[ \hat{\boldsymbol{\beta}} \sim \mathcal{N}\left( \boldsymbol{\beta}; \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \right) \quad \text{and} \quad \frac{(n-p) \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p} \]
\[ \text{If } \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{\Sigma}) \quad \text{then} \quad (\mathbf{X} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu}) \sim \chi^2_p \]
\[ \text{If } U_1 \sim \chi^2_{p_1} \text{ and } U_2 \sim \chi^2_{p_2} \text{ independent, then } \frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2} \]
\[ \hat{\boldsymbol{\beta}} \sim \mathcal{N}\left( \boldsymbol{\beta}; \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \right) \qquad\qquad \\ \qquad\qquad \implies \frac{1}{\sigma^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \chi^2_{p} \]
\[ \frac{(n-p) \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p} \quad \text{ independent from } \hat{\boldsymbol{\beta}} \]
Hence: \[ \begin{aligned} \frac{ \frac{1}{\sigma^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) / p }{ \frac{(n-p) \hat{\sigma}^2}{\sigma^2} / (n-p) } \qquad\qquad\qquad \\ \qquad\qquad = \frac{1}{p\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T (\mathbf{X}^T\mathbf{X})\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) &\sim \mathcal{F}^p_{n-p}. \end{aligned} \]
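As a sanity check of this result, here is a small Monte Carlo sketch (the design, seed and sample size below are arbitrary choices for illustration, not part of the original material): simulate many datasets, compute the statistic, and compare its empirical quantiles with those of \(\mathcal{F}^p_{n-p}\).

## Monte Carlo check (sketch): the statistic should follow F(p, n - p)
set.seed(42)                                  # arbitrary seed
n_mc <- 50; p_mc <- 3; sigma_mc <- 2
X_mc <- cbind(1, matrix(runif(n_mc * (p_mc - 1), -2, 2), ncol = p_mc - 1))
beta_mc <- c(-1, 3, -1)
f_stats <- replicate(1000, {
  y_mc <- as.numeric(X_mc %*% beta_mc + rnorm(n_mc, sd = sigma_mc))
  fit_mc <- lm(y_mc ~ X_mc - 1)               # X_mc already contains the intercept
  sigma2_hat <- deviance(fit_mc) / (n_mc - p_mc)
  d <- coef(fit_mc) - beta_mc
  as.numeric(t(d) %*% crossprod(X_mc) %*% d) / (p_mc * sigma2_hat)
})
## Empirical vs theoretical quantiles
rbind(empirical = quantile(f_stats, c(0.5, 0.9, 0.95)),
      theoretical = qf(c(0.5, 0.9, 0.95), p_mc, n_mc - p_mc))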
Hypothesis: \[ \mathcal{H}_0: \beta_1 = \cdots = \beta_p = 0 \quad \text{vs} \quad \mathcal{H}_1: \exists~k ~|~ \beta_k \neq 0 \]
Test Statistic: \[ F = \frac{1}{p\hat{\sigma}^2}\hat{\boldsymbol{\beta}}^T (\mathbf{X}^T\mathbf{X})\hat{\boldsymbol{\beta}} \underset{\mathcal{H}_0}{\sim} \mathcal{F}^p_{n-p} \]
Reject Region: \[ \mathbb{P}[\text{Reject} | \mathcal{H}_0 \text{ true}] \leq \alpha \qquad\qquad\qquad\qquad\qquad\\ \qquad\qquad\qquad \!\iff\! \mathcal{R}_\alpha = \left\{ f \in \mathbb{R}_+ ~|~ f \geq f^p_{n-p}(1 - \alpha) \right\} \]
p value:\(\qquad\qquad p = \mathbb{P}_{\mathcal{H}_0}[F > F^{obs}]\)
fit <- lm(y_sim ~ -1 + ., data = data_junk)
f_obs <- summary(fit)$fstatistic
f_obs
##      value      numdf      dendf 
##   0.931791 100.000000 400.000000
p_value_f <- 1 - pf(f_obs[1], f_obs[2], f_obs[3])
p_value_f
##     value 
## 0.6596268
We do not reject the null hypothesis that all the (junk) coefficients are zero (phew).
\[ \begin{aligned} F &= \frac{1}{p\hat{\sigma}^2}\hat{\boldsymbol{\beta}}^T (\mathbf{X}^T\mathbf{X})\hat{\boldsymbol{\beta}} = \frac{(\mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{X}\hat{\boldsymbol{\beta}}) / p}{\|\hat{\boldsymbol{\epsilon}}\|^2 / (n - p)} \\ &= \frac{\|\hat{\mathbf{y}}\|^2 / p}{\|\hat{\boldsymbol{\epsilon}}\|^2 / (n - p)} = \frac{\|\hat{\mathbf{y}}\|^2 / p}{\|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p)} \\ \end{aligned} \]
Good model
\(\mathbf{y}\) not too far from \(\mathcal{M}(\mathbf{X})\).
\(\|\hat{\boldsymbol{\epsilon}}\|^2\) not too large compared to \(\|\hat{\mathbf{y}}\|^2\).
\(F\) tends to be large.
\(~\)
Bad model
\(\mathbf{y}\) almost orthogonal to \(\mathcal{M}(\mathbf{X})\).
\(\|\hat{\boldsymbol{\epsilon}}\|^2\) rather large compared to \(\|\hat{\mathbf{y}}\|^2\).
\(F\) tends to be small.
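As a quick numerical check (a sketch, reusing the no-intercept fit of the 100 junk predictors above), \(\|\hat{\mathbf{y}}\|^2\) and \(\|\mathbf{y} - \hat{\mathbf{y}}\|^2\) can be recovered with `fitted()` and `residuals()`, and their ratio compared with the F statistic reported by `summary()`.

## Recompute the global F statistic from its definition (sketch)
y_hat <- fitted(fit)                     # X beta_hat
res <- residuals(fit)                    # y - y_hat
p_fit <- length(coef(fit))               # here 100
n_fit <- nobs(fit)                       # here 500
F_manual <- (sum(y_hat^2) / p_fit) / (sum(res^2) / (n_fit - p_fit))
c(manual = F_manual, summary = unname(summary(fit)$fstatistic[1]))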
## Data frame
data_all <- data.frame(y_sim = y_sim, x_1 = x_1, x_2 = x_2, x_junk = x_junk)
## Fit
fit <- lm(y_sim ~ ., data = data_all)
## multiple t tests
p_values_t <- summary(fit)$coefficients[, 4]
names(p_values_t[p_values_t <= 0.05])
## [1] "(Intercept)" "x_1"         "x_2"         "x_junk.27"   "x_junk.76"  
## [6] "x_junk.80"
## f test
f_obs <- summary(fit)$fstatistic
p_value_f <- 1 - pf(f_obs[1], f_obs[2], f_obs[3])
p_value_f
## value 
##     0
Can we test: \[ \begin{aligned} \mathcal{H}_0&: \beta_{p_0 + 1} = \cdots = \beta_p = 0 &\text{vs}\\ \mathcal{H}_1&: \exists~k\in \{p_0+1, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} \]
i.e. decide between the full model: \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]
and the nested model: \[ \mathbf{y} = \mathbf{X}_0\boldsymbol{\beta}_0 + \boldsymbol{\epsilon} \]
with \(\mathbf{X} = (\mathbf{X}_0 ~ \mathbf{X}_1)\), \(rg(\mathbf{X}) = p\) and \(rg(\mathbf{X}_0) = p_0\).
Idea: Compare
\(\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2\), the distance between \(\hat{\mathbf{y}}\) and \(\hat{\mathbf{y}}_0\), to
\(\|\mathbf{y} - \hat{\mathbf{y}}\|^2\), the residual error.
Then:
Good model
\(\hat{\mathbf{y}}\) quite different from \(\hat{\mathbf{y}}_0\).
Full model does add information.
\(~\)
Bad model
\(\hat{\mathbf{y}}\) similar to \(\hat{\mathbf{y}}_0\).
Full model does not add much information.
\[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \]
\[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \]
Let \(\mathbf{P}\) be the orthogonal projection onto \(\mathcal{M}(\mathbf{X})\) and \(\mathbf{P}_0\) the orthogonal projection onto \(\mathcal{M}(\mathbf{X}_0)\).
\[ \begin{aligned} \hat{\mathbf{y}} - \hat{\mathbf{y}}_0 &= \mathbf{P}\mathbf{y} - \mathbf{P}_0\mathbf{y} = \mathbf{P}\mathbf{y} - \mathbf{P}_0\mathbf{P}\mathbf{y}\\ &= (\mathbf{I}_n - \mathbf{P}_0) \mathbf{P}\mathbf{y} = \mathbf{P}_0^\bot \mathbf{P}\mathbf{y}\\ &\in \mathcal{M}(\mathbf{X}_0)^\bot \cap \mathcal{M}(\mathbf{X}) \end{aligned} \]
\[ \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I}_n - \mathbf{P}) \mathbf{y} = \mathbf{P}^\bot \mathbf{y} \in \mathcal{M}(\mathbf{X})^\bot \]
\(\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\) and \(\mathbf{y} - \hat{\mathbf{y}}\) are orthogonal
i.e. their covariance is zero
i.e. (Gaussian assumption) they are independent.
From Cochran’s theorem: \[ \frac{1}{\sigma^2}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 = \frac{1}{\sigma^2}\|\mathbf{P}^\bot\mathbf{y}\|^2 = \frac{1}{\sigma^2}\|\mathbf{P}^\bot\boldsymbol{\epsilon}\|^2 \sim \chi^2_{n-p} \]
\[ \frac{1}{\sigma^2}\|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0 - \mathbf{P}_0^\bot \mathbf{P}\mathbf{X}\boldsymbol{\beta}\|^2 = \frac{1}{\sigma^2}\|\mathbf{P}_0^\bot \mathbf{P}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\|^2 \sim \chi^2_{p-p_0} \] (\(\mathbf{P}_0^\bot \mathbf{P}\) is a projection on a space of dimension \(p - p_0\))
If \(\mathcal{H}_0\) is true, then \(\mathbf{P}_0^\bot \mathbf{P}\mathbf{X}\boldsymbol{\beta} = 0\), hence:
\[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - p_0}_{n - p}. \]
From Pythagoras' theorem: \[ \begin{aligned} \|\mathbf{y} - \hat{\mathbf{y}}_0\|^2 &= \|\mathbf{y} - \mathbf{P}\mathbf{y} + \mathbf{P}\mathbf{y} - \mathbf{P}_0\mathbf{y}\|^2 = \|\mathbf{P}^\bot\mathbf{y} + \mathbf{P}_0^\bot\mathbf{P}\mathbf{y}\|^2\\ &= \|\mathbf{P}^\bot\mathbf{y}\|^2 + \|\mathbf{P}_0^\bot\mathbf{P}\mathbf{y}\|^2 = \|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2\\ \end{aligned} \]
i.e. \[ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 = \|\mathbf{y} - \hat{\mathbf{y}}_0\|^2 - \|\mathbf{y} - \hat{\mathbf{y}}\|^2 = RSS_0 - RSS \]
Hence: \[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{n - p}{p - p_0}\frac{RSS_0 - RSS}{RSS} \]
To test: \[ \begin{aligned} \mathcal{H}_0&: \beta_{p_0 + 1} = \cdots = \beta_p = 0 &\text{vs}\\ \mathcal{H}_1&: \exists~k\in \{p_0+1, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} \]
we use the \(F\) statistic: \[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\|^2 / (p - p_0) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } = \frac{n - p}{p - p_0}\frac{RSS_0 - RSS}{RSS} \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - p_0}_{n - p}. \]
\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]
## Data frame
data_all <- data.frame(y_sim = y_sim, x_1 = x_1, x_2 = x_2, x_junk = x_junk)
## Fit true
fit_small <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Fit true and junk
fit_all <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3, data = data_all)
## Comparison
anova(fit_small, fit_all)
## Analysis of Variance Table
## 
## Model 1: y_sim ~ x_1 + x_2
## Model 2: y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    497 1900.5                           
## 2    494 1883.2  3    17.297 1.5124 0.2104
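The same F value can be recovered directly from the residual sums of squares of the two fits, following the \(RSS_0 - RSS\) formula above; both numbers should match the `anova()` table (a sketch):

## Nested-model F statistic by hand (sketch), from RSS_0 and RSS
rss_0 <- deviance(fit_small)                  # RSS of the nested model
rss <- deviance(fit_all)                      # RSS of the full model
df_num <- df.residual(fit_small) - df.residual(fit_all)   # p - p_0
df_den <- df.residual(fit_all)                             # n - p
F_manual <- ((rss_0 - rss) / df_num) / (rss / df_den)
c(F = F_manual, p_value = pf(F_manual, df_num, df_den, lower.tail = FALSE))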
\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]
## Fit true
fit_small <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Fit all
fit_all <- lm(y_sim ~ ., data = data_all)
## Comparison
anova(fit_small, fit_all)$`Pr(>F)`[2]
## [1] 0.3548787
We do not reject the null hypothesis that the first two variables are enough to explain the data.
Assuming that we have an intercept, we often test: \[ \begin{aligned} \mathcal{H}_0&: \beta_{2} = \cdots = \beta_p = 0 &\text{vs}\\ \mathcal{H}_1&: \exists~k\in \{2, \dotsc, p\} ~|~ \beta_k \neq 0 \end{aligned} \]
The associated \(F\) statistic is (\(p_0 = 1\)): \[ F = \frac{ \|\hat{\mathbf{y}} - \bar{y}\mathbb{1}\|^2 / (p - 1) }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{p - 1}_{n - p}. \]
That is the default test returned by R.
\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]
fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ x_1 + x_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2523 -1.4517 -0.0302  1.3494  4.8711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.82877    0.08794  -9.424   <2e-16 ***
## x_1          3.01415    0.07534  40.008   <2e-16 ***
## x_2         -0.95922    0.07462 -12.855   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.955 on 497 degrees of freedom
## Multiple R-squared:  0.7905, Adjusted R-squared:  0.7897 
## F-statistic: 937.8 on 2 and 497 DF,  p-value: < 2.2e-16
If \(p_0 = p-1\), we test: \[ \begin{aligned} \mathcal{H}_0&: \beta_p = 0 &\text{vs} \qquad \mathcal{H}_1&: \beta_p \neq 0 \end{aligned} \]
The associated \(F\) statistic is (\(p_0 = p-1\)): \[ F = \frac{ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}_{-p}\|^2 }{ \|\mathbf{y} - \hat{\mathbf{y}}\|^2 / (n - p) } \underset{\mathcal{H}_0}{\sim} \mathcal{F}^{1}_{n - p}. \]
It can be shown that: \[ F = T^2 \quad \text{with} \quad T = \frac{\hat{\beta}_p}{\hat{\sigma}_p} \] so that it is equivalent to the \(t\)-test.
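A quick numerical illustration of this equivalence (sketch): drop \(x_2\) from the small model fitted on the simulated data and compare the partial F statistic from `anova()` with the squared \(t\) value from `summary()`; the two should coincide.

## Check (sketch): for a single dropped coefficient, F equals T^2
fit_x1 <- lm(y_sim ~ x_1, data = data_all)         # model without x_2
fit_x12 <- lm(y_sim ~ x_1 + x_2, data = data_all)  # model with x_2
F_obs <- anova(fit_x1, fit_x12)$F[2]               # partial F for x_2
t_obs <- summary(fit_x12)$coefficients["x_2", "t value"]
c(F = F_obs, T_squared = t_obs^2)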
Model: \[ y_i = \beta_1 x_{i1} + \dotsb + \beta_p x_{ip} + \epsilon_i \]
Problem:
Can we choose the “best” subset of non-zero \(\beta_k\)?
I.e., can we find the predictors that really have an impact on \(\mathbf{y}\)?
An \(F\)-test on all subsets? Not practical: multiple testing.
Idea:
Find a “score” to assess the quality of a given fit.
\[ R^2 = 1 - \frac{RSS}{TSS} \]
## Function to get statistics for one fit
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared)
  return(res)
}
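The helper `get_all_stats()` used in the next chunks is not shown in this section. A minimal sketch of what it could look like, assuming the response is in the first column of the data frame and that it applies `get_stats_fit()` to every non-empty subset of predictors:

## Possible implementation of get_all_stats() (sketch)
get_all_stats <- function(data) {
  response <- colnames(data)[1]
  preds <- colnames(data)[-1]
  res <- NULL
  for (k in seq_along(preds)) {
    for (vars in combn(preds, k, simplify = FALSE)) {
      fit <- lm(reformulate(vars, response = response), data = data)
      stats <- get_stats_fit(fit)
      stats$preds <- paste(vars, collapse = "+")
      res <- rbind(res, stats)
    }
  }
  return(res)
}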
\[ y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i \]
## Fit true
get_stats_fit(lm(y_sim ~ x_1 + x_2, data = data_all))
##          R2
## 1 0.7905255
## Fit bigger
get_stats_fit(lm(y_sim ~ x_1 + x_2 + x_junk.1, data = data_all))
##        R2
## 1 0.79133
## Fit wrong
get_stats_fit(lm(y_sim ~ x_1 + x_junk.1, data = data_all))
##          R2
## 1 0.7252314
## Only one "junk" predictor
data_sub <- data.frame(y_sim = y_sim, x_1 = x_1, x_2 = x_2, x_junk = x_junk_1)
## Fit all possible models
get_all_stats(data_sub)
##             R2 preds
## 1 0.7208723311 x_1
## 2 0.1158793338 x_2
## 3 0.0005972444 x_junk
## 4 0.7905254867 x_1+x_2
## 5 0.7208974123 x_1+x_junk
## 6 0.1164841596 x_2+x_junk
## 7 0.7905418666 x_1+x_2+x_junk
\(R^2\) is not enough.
\[ R_a^2 = 1 - \frac{RSS / (n-p)}{TSS / (n-1)} \]
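A quick check of this formula (sketch): recompute \(R_a^2\) by hand on the small data set and compare it with `summary()$adj.r.squared`.

## Adjusted R^2 by hand (sketch)
fit_check <- lm(y_sim ~ x_1 + x_2, data = data_sub)
n_fit <- nobs(fit_check)
p_fit <- length(coef(fit_check))              # intercept included
rss <- deviance(fit_check)
tss <- sum((data_sub$y_sim - mean(data_sub$y_sim))^2)
c(manual = 1 - (rss / (n_fit - p_fit)) / (tss / (n_fit - 1)),
  summary = summary(fit_check)$adj.r.squared)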
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared,
                    R2a = sumfit$adj.r.squared)
  return(res)
}
get_all_stats(data_sub)
##             R2          R2a preds
## 1 0.7208723311  0.720311834 x_1
## 2 0.1158793338  0.114103991 x_2
## 3 0.0005972444 -0.001409588 x_junk
## 4 0.7905254867  0.789682531 x_1+x_2
## 5 0.7208974123  0.719774263 x_1+x_junk
## 6 0.1164841596  0.112928764 x_2+x_junk
## 7 0.7905418666  0.789274983 x_1+x_2+x_junk
\(R_a^2\): adjusts for the number of predictors.
\[ R_a^2 = 1 - \frac{RSS / (n-p)}{TSS / (n-1)} \]
The larger the better.
When \(p\) increases, the fit must be markedly better for \(R_a^2\) to rise.
Intuitive, but with little theoretical justification.
set.seed(1289)
## Predictors
n <- 50
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)
## Model
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1
## sim
eps <- rnorm(n, mean = 0, sd = 2)
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Useless predictors
p_junk <- 48
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2), ncol = p_junk)
## Good and Bad dataset
data_all <- data.frame(y_sim = y_sim, x_1 = x_1, x_2 = x_2, x_junk = x_junk)
## Bad overfitting fit
fit_bad <- lm(y_sim ~ ., data = data_all)
pred_bad <- predict(fit_bad)
## Good fit
fit_good <- lm(y_sim ~ x_1 + x_2, data = data_all)
pred_good <- predict(fit_good)
mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3, ylab = "y",
     xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3, ylab = "y",
     xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3, ylab = "y",
     xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)
The bad (bigger) model is closer to the observed data.
## Distance between bad prediction and data (RSS)
sum((y_sim - pred_bad)^2)
## [1] 2.714946e-27
## Distance between good prediction and data (RSS)
sum((y_sim - pred_good)^2)
## [1] 206.9159
Overfitting: the RSS is smaller for the bad, bigger model.
Generate new data according to the same generative model:
## New dataset from the same model
eps_new <- rnorm(n, mean = 0, sd = 2)
y_new <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps_new
Question: are the predictions from the bad (bigger) model better or worse than the predictions from the good (smaller) model, when compared to the new data?
plot(mu_y, y_sim, pch = 3, ylab = "y",
     xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## New points
points(mu_y, y_new, col = "darkgreen", pch = 3)
Compute the squared distance between the predictions and the new data:
## Distance between bad prediction and new data
sum((y_new - pred_bad)^2)
## [1] 535.296
## Distance between good prediction and new data
sum((y_new - pred_good)^2)
## [1] 205.6555
The good (smaller) model behaves better on new data.
RSS: \(\|\mathbf{y} - \hat{\mathbf{y}}_p\|^2\) with \(\hat{\mathbf{y}}_p\) learnt on \(\mathbf{y}\)
\(\to\) gets smaller as the model gets bigger (overfit).
Predictive Risk: \(\| \mathbf{y}' - \hat{\mathbf{y}}_p\|^2\) with \(\mathbf{y}'\) same distribution as \(\mathbf{y}\).
\(\to\) becomes larger if \(\hat{\mathbf{y}}_p\) is overfit.
Idea:
We want to select a model that has a good predictive risk.
i.e. that represents the underlying model best.
Problem:
In general, we don’t have means to generate \(\mathbf{y}'\).
\(\to\) We use an estimate of the (theoretical) predictive risk.
Ideal model: \[ \mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon} \quad \boldsymbol{\mu} \in \mathbb{R}^n \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]
\(\boldsymbol{\mu}\) the “true” expectation of the observations \(\mathbf{y}\).
Full linear regression model: \[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \quad rg(\mathbf{X}) = p, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]
Assume that \(\boldsymbol{\mu}\) can be approximated as \(\mathbf{X} \boldsymbol{\beta}\), i.e. as an element of \(\mathcal{M}(\mathbf{X})\).
Ideal model: \[ \mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon} \quad \boldsymbol{\mu} \in \mathbb{R}^n \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]
Sub-models:
keep only predictors \(\eta \subset \{1, \dotsc, p\}\) \[ \mathbf{y} = \mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta} + \boldsymbol{\epsilon}_{\eta} \quad rg(\mathbf{X}_{\eta}) = |\eta|, \quad \boldsymbol{\epsilon}_{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]
with \(\mathbf{X}_{\eta}\) the sub-matrix of \(\mathbf{X}\): \[ \mathbf{X}_{\eta} = (\mathbf{x}_{\eta_1} ~ \mathbf{x}_{\eta_2} \cdots \mathbf{x}_{\eta_{|\eta|}}) \]
Assume that \(\boldsymbol{\mu}\) can be approximated as \(\mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta}\), i.e. as an element of \(\mathcal{M}(\mathbf{X}_{\eta})\).
“Truth”: \(\mathbf{y} = \boldsymbol{\mu} + \boldsymbol{\epsilon}\), with \(\boldsymbol{\mu} \in \mathbb{R}^n\) and \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n)\).
Models: \[ \mathbf{y} = \mathbf{X}_{\eta} \boldsymbol{\beta}_{\eta} + \boldsymbol{\epsilon}_{\eta} \quad rg(\mathbf{X}_{\eta}) = |\eta|, \quad \boldsymbol{\epsilon}_{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \]
Estimators: \[ \begin{aligned} \hat{\boldsymbol{\beta}}_\eta & = \underset{\boldsymbol{\beta} \in \mathbb{R}^{|\eta|}}{\operatorname{argmin}} \|\mathbf{y} - \mathbf{X}_\eta\boldsymbol{\beta}_\eta \|^2 = (\mathbf{X}_\eta^{T}\mathbf{X}_\eta)^{-1}\mathbf{X}_\eta^{T}\mathbf{y}\\ \hat{\mathbf{y}}_\eta & = \mathbf{X}_\eta\hat{\boldsymbol{\beta}}_\eta = \mathbf{P}^{\mathbf{X}_\eta}\mathbf{y} \end{aligned} \]
True expectation: \[ \begin{aligned} \mathbf{y}_\eta = \mathbf{P}^{\mathbf{X}_\eta}\boldsymbol{\mu} \end{aligned} \]
Risk of an estimator: \[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] \]
Notes:
\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta| \sigma^2 \]
\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta| \sigma^2 \]
No overfitting:
Assume:
\(\boldsymbol{\mu} \in \mathcal{M}(\mathbf{X}_{\eta})\) so that \(\mathbf{y}_{\eta} = \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} = \boldsymbol{\mu}\).
\(\eta \subset \eta'\) so that \(\mathbf{y}_{\eta'} = \mathbf{P}^{\mathbf{X}_{\eta'}}\boldsymbol{\mu} = \boldsymbol{\mu}\).
Then: \[ \begin{aligned} R(\hat{\mathbf{y}}_{\eta'}) & = \|\boldsymbol{\mu} - \mathbf{y}_{\eta'} \|^2 + |\eta'| \sigma^2 \\ & \geq \|\boldsymbol{\mu} - \mathbf{y}_{\eta'} \|^2 + |\eta| \sigma^2 = \|\boldsymbol{\mu} - \mathbf{y}_{\eta} \|^2 + |\eta| \sigma^2 \\ & \geq R(\hat{\mathbf{y}}_\eta) \end{aligned} \]
\[ R(\hat{\mathbf{y}}_\eta) = \mathbb{E}\left[ \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 \right] = \mathbb{E}\left[ \|\boldsymbol{\mu} - \mathbf{y}_\eta + \mathbf{y}_\eta - \hat{\mathbf{y}}_\eta \|^2 \right] \]
\[ \begin{aligned} R(\hat{\mathbf{y}}_\eta) &= \mathbb{E}\left[ \|\boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \|^2 \right] \\ &= \mathbb{E}\left[ \|(\mathbf{I}_n - \mathbf{P}^{\mathbf{X}_{\eta}})\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right]\\ &= \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} + \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \end{aligned} \]
\[ \begin{aligned} R(\hat{\mathbf{y}}_\eta) &= \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} \|^2 + \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \\ &= \|\mathbf{P}^{\mathbf{X}_{\eta}^\bot}\boldsymbol{\mu} \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \\ &= \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \end{aligned} \]
\[ R(\hat{\mathbf{y}}_\eta) = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + \mathbb{E}\left[ \|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] \]
From Cochran’s Theorem: \[ \frac{1}{\sigma^2}\|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \sim \chi^2_{|\eta|} \]
Hence: \[ \mathbb{E}\left[ \frac{1}{\sigma^2}\|\mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} - \mathbf{y}) \|^2 \right] = |\eta| \]
And: \[ R(\hat{\mathbf{y}}_\eta) = \|\boldsymbol{\mu} - \mathbf{y}_\eta \|^2 + |\eta|\sigma^2. \]
We would like: \[ \eta_0 = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} R(\hat{\mathbf{y}}_\eta) \]
For any model \(\eta\), the expectation of the RSS is: \[ \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] = R(\hat{\mathbf{y}}_\eta) + n\sigma^2 - 2|\eta|\sigma^2 \]
\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= \mathbb{E}[\|\mathbf{y} - \boldsymbol{\mu} + \boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2] \\ &= \mathbb{E}[\|\mathbf{y} - \boldsymbol{\mu}\|^2 + \|\boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \|^2 + 2 \langle \mathbf{y} - \boldsymbol{\mu}; \boldsymbol{\mu} - \hat{\mathbf{y}}_\eta \rangle] \end{aligned} \]
\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= \mathbb{E}[\|\boldsymbol{\epsilon}\|^2] + R(\hat{\mathbf{y}}_\eta) + 2 \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] \qquad \end{aligned} \]
\[ \begin{aligned} \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] &= \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}(\boldsymbol{\mu} + \boldsymbol{\epsilon}) \rangle] \\ &= \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\mu} \rangle] - \mathbb{E}[\langle \boldsymbol{\epsilon}; \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon} \rangle] \end{aligned} \]
\[ \begin{aligned} \mathbb{E}[\langle \boldsymbol{\epsilon}; \boldsymbol{\mu} - \mathbf{P}^{\mathbf{X}_{\eta}}\mathbf{y} \rangle] &= 0 - \mathbb{E}[\langle \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon}; \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon} \rangle] \qquad\qquad~~ \\ &= - \mathbb{E}[\| \mathbf{P}^{\mathbf{X}_{\eta}}\boldsymbol{\epsilon}\|^2] \\ &= - |\eta| \sigma^2 \end{aligned} \]
\[ \begin{aligned} \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] &= n \sigma^2 + R(\hat{\mathbf{y}}_\eta) - 2 |\eta| \sigma^2 \qquad\qquad\qquad\qquad \end{aligned} \]
We have: \[ \mathbb{E}[\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2] = R(\hat{\mathbf{y}}_\eta) + n\sigma^2 - 2|\eta|\sigma^2 \]
Mallows' \(C_p\): \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \]
We have: \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \] And: \[ \mathbb{E}[C_p(\hat{\mathbf{y}}_\eta)] = \frac{1}{n}R(\hat{\mathbf{y}}_\eta) + \sigma^2 \]
Hence: \[ \hat{\eta} = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} C_p(\hat{\mathbf{y}}_\eta) = \underset{\eta \subset \{1, \cdots, p\}}{\operatorname{argmin}} \hat{R}(\hat{\mathbf{y}}_\eta). \]
\(\to\) Minimizing Mallows' \(C_p\) is the same as minimizing an unbiased estimate of the risk.
\[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2 \right) \]
Penalize for the number of parameters \(|\eta|\).
The smaller the better.
## Mallows' Cp
Cp <- function(fit) {
  n <- nobs(fit)
  p <- n - df.residual(fit)
  RSS <- deviance(fit)
  ## NB: as written, the penalty added here is p * RSS / (n - p), i.e. |eta| * sigma2_hat
  ## (the formula above uses 2 * |eta| * sigma2_hat)
  return( (RSS + p * RSS / (n - p)) / n)
}
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(Cp = Cp(fit))
  return(res)
}
## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##          Cp preds
## 1  5.085236 x_1
## 2 16.107190 x_2
## 3 18.207435 x_junk
## 4  3.823952 x_1+x_2
## 5  5.095010 x_1+x_junk
## 6 16.128558 x_2+x_junk
## 7  3.831361 x_1+x_2+x_junk
Mallows' \(C_p\) is smallest for the true model.
Recall: \[ \begin{aligned} (\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{\eta,ML}) &= \underset{(\boldsymbol{\beta}, \sigma^2) \in \mathbb{R}^{|\eta|}\times\mathbb{R}_+^*}{\operatorname{argmax}} \log L(\boldsymbol{\beta}, \sigma^2 | \mathbf{y})\\ &= \underset{(\boldsymbol{\beta}, \sigma^2) \in \mathbb{R}^{|\eta|}\times\mathbb{R}_+^*}{\operatorname{argmax}} - \frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}_\eta\boldsymbol{\beta} \|^2 \end{aligned} \]
And: \[ \hat{\sigma}^2_{\eta,ML} = \frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2} {n} \quad;\quad \hat{\mathbf{y}}_\eta = \mathbf{X}_\eta\hat{\boldsymbol{\beta}}_\eta \]
Thus: \[ \log L(\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{ML} | \mathbf{y}) = \cdots \]
The maximized log-likelihood is equal to: \[ \log L(\hat{\boldsymbol{\beta}}_\eta, \hat{\sigma}^2_{ML} | \mathbf{y}) = - \frac{n}{2} \log\left(2\pi\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{1}{2\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}}\|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \]
i.e. \[ \log \hat{L}_\eta = - \frac{n}{2} \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{n}{2} \log\left(2\pi\right) - \frac{n}{2} \]
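As a check (sketch), this closed form should match what `logLik()` returns for a fitted linear model:

## Maximized log-likelihood by hand (sketch)
fit_check <- lm(y_sim ~ x_1 + x_2, data = data_sub)
n_fit <- nobs(fit_check)
rss <- deviance(fit_check)
ll_manual <- - n_fit / 2 * log(rss / n_fit) - n_fit / 2 * log(2 * pi) - n_fit / 2
c(manual = ll_manual, logLik = as.numeric(logLik(fit_check)))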
Remember, for any \(\eta \subset \eta'\): \[ \|\mathbf{y} - \hat{\mathbf{y}}_{\eta'}\|^2 \leq \|\mathbf{y} - \hat{\mathbf{y}}_\eta\|^2 \]
Hence \(R^2_\eta = 1 - \frac{\|\mathbf{y} - \hat{\mathbf{y}}_\eta\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2}\) always increases when we add a predictor.
Similarly: \[ \log \hat{L}_\eta = - \frac{n}{2} \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) - \frac{n}{2} \log\left(2\pi\right) - \frac{n}{2} \] always increases when we add some predictors.
Problem: \(\log \hat{L}_\eta\) always increases when we add some predictors.
\(\to\) it cannot be used for model selection (same as \(R^2\)).
Idea: add a penalty term that depends on the size of the model
\(\to\) minimize: \[ - \log \hat{L}_\eta + f(|\eta|) \qquad \text{with } f \text{ increasing} \]
Minimize: \[ - \log \hat{L}_\eta + f(|\eta|) \qquad \text{with } f \text{ increasing} \]
For \(\eta \subset \eta'\):
\(- \log \hat{L}_\eta\) decreases
\(f(|\eta|)\) increases
Goal: find an \(f\) that compensates for the “overfit”, i.e. the small, noise-driven increase of the likelihood when we add unrelated predictors.
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit))
  return(res)
}
## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##   minusLogLik preds
## 1    1115.053 x_1
## 2    1403.284 x_2
## 3    1433.925 x_junk
## 4    1043.286 x_1+x_2
## 5    1115.030 x_1+x_junk
## 6    1403.113 x_2+x_junk
## 7    1043.266 x_1+x_2+x_junk
When we add a “junk” variable, \(- \log \hat{L}_\eta\) decreases, but only slightly.
Question: can we find a penalty that compensates for this noise-driven increase of the likelihood?
\[ AIC = - 2\log \hat{L}_\eta + 2 |\eta| \]
\(~\)
\(\hat{L}_\eta\) is the maximized likelihood of the model.
\(|\eta|\) is the dimension of the model.
The smaller the better.
Asymptotic theoretical guarantees.
Easy to use and widely used.
\[ BIC = - 2\log \hat{L}_\eta + |\eta| \log(n) \]
\(~\)
\(\hat{L}_\eta\) is the maximized likelihood of the model.
\(|\eta|\) is the dimension of the model.
The smaller the better.
Based on an approximation of the marginal likelihood.
Easy to use and widely used.
See: Lebarbier, Mary-Huard. Le critère BIC : fondements théoriques et interprétation. 2004. [inria-00070685]
\[ AIC = - 2\log \hat{L}_\eta + 2 |\eta| \]
\[ BIC = - 2\log \hat{L}_\eta + |\eta| \log(n) \]
Both have asymptotic theoretical justifications.
BIC penalizes more than AIC \(\to\) chooses smaller models.
Both widely used.
There are some (better) non-asymptotic criteria, see
Giraud C. 2014. Introduction to high-dimensional statistics.
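In R, `AIC()` and `BIC()` are computed from `logLik()`; note that the number of parameters counted by R includes the variance \(\sigma^2\) in addition to the regression coefficients. A minimal check (sketch):

## AIC and BIC from the log-likelihood (sketch)
fit_check <- lm(y_sim ~ x_1 + x_2, data = data_sub)
ll <- logLik(fit_check)
k <- attr(ll, "df")     # number of estimated parameters (coefficients + sigma^2)
c(AIC_manual = -2 * as.numeric(ll) + 2 * k,
  AIC = AIC(fit_check),
  BIC_manual = -2 * as.numeric(ll) + k * log(nobs(fit_check)),
  BIC = BIC(fit_check))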
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit),
                    AIC = AIC(fit),
                    BIC = BIC(fit))
  return(res)
}
## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##   minusLogLik      AIC      BIC preds
## 1    1115.053 2236.105 2248.749 x_1
## 2    1403.284 2812.567 2825.211 x_2
## 3    1433.925 2873.850 2886.493 x_junk
## 4    1043.286 2094.572 2111.430 x_1+x_2
## 5    1115.030 2238.060 2254.919 x_1+x_junk
## 6    1403.113 2814.225 2831.084 x_2+x_junk
## 7    1043.266 2096.533 2117.606 x_1+x_2+x_junk
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
##      minusLogLik              AIC              BIC 
## "x_1+x_2+x_junk"        "x_1+x_2"        "x_1+x_2"
## Advertising data
library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")
## Apply stats to all possible combinations
all_stats <- get_all_stats(ad[, c(4, 1:3)])
all_stats
##   minusLogLik       AIC       BIC preds
## 1    519.0457 1044.0913 1053.9863 TV
## 2    573.3369 1152.6738 1162.5687 radio
## 3    608.3357 1222.6714 1232.5663 newspaper
## 4    386.1970  780.3941  793.5874 TV+radio
## 5    509.8891 1027.7782 1040.9714 TV+newspaper
## 6    573.2361 1154.4723 1167.6655 radio+newspaper
## 7    386.1811  782.3622  798.8538 TV+radio+newspaper
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
##          minusLogLik                  AIC                  BIC 
## "TV+radio+newspaper"           "TV+radio"           "TV+radio"
\[ p \text{ predictors } \to 2^p \text{ possible models.} \]
## Full formula
full_formula <- formula(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 +
                          x_junk.4 + x_junk.5 + x_junk.6)
## Complete search
2^8
## [1] 256
## No predictors
fit0 <- lm(y_sim ~ 1, data = data_all)
## Add one
library(MASS)
addterm(fit0, full_formula)
## Single term additions
## 
## Model:
## y_sim ~ 1
##          Df Sum of Sq    RSS     AIC
## <none>                731.66 136.165
## x_1       1    466.43 265.23  87.429
## x_2       1     56.39 675.27 134.154
## x_junk.1  1     50.05 681.61 134.622
## x_junk.2  1     32.10 699.56 135.921
## x_junk.3  1      0.41 731.26 138.137
## x_junk.4  1      2.49 729.18 137.995
## x_junk.5  1     54.28 677.38 134.311
## x_junk.6  1     11.59 720.08 137.367
## Add x_1
fit1 <- lm(y_sim ~ 1 + x_1, data = data_all)
## Add another
addterm(fit1, full_formula)
## Single term additions
## 
## Model:
## y_sim ~ 1 + x_1
##          Df Sum of Sq    RSS    AIC
## <none>                265.23 87.429
## x_2       1    58.314 206.92 77.014
## x_junk.1  1     6.038 259.19 88.277
## x_junk.2  1     0.159 265.07 89.399
## x_junk.3  1     0.376 264.85 89.358
## x_junk.4  1     0.403 264.83 89.353
## x_junk.5  1     2.587 262.64 88.939
## x_junk.6  1     6.791 258.44 88.132
## Add x_2
fit2 <- lm(y_sim ~ 1 + x_1 + x_2, data = data_all)
## Add another
addterm(fit2, full_formula)
## Single term additions
## 
## Model:
## y_sim ~ 1 + x_1 + x_2
##          Df Sum of Sq    RSS    AIC
## <none>                206.92 77.014
## x_junk.1  1    3.0199 203.90 78.279
## x_junk.2  1    2.1852 204.73 78.484
## x_junk.3  1    0.8228 206.09 78.815
## x_junk.4  1    1.6416 205.27 78.616
## x_junk.5  1    4.9363 201.98 77.807
## x_junk.6  1    3.7814 203.13 78.092
## Full formula
fit0 <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 +
             x_junk.4 + x_junk.5 + x_junk.6, data = data_all)
## drop one
dropterm(fit0)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 + x_junk.4 + 
##     x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                191.34  85.100
## x_1       1    336.07 527.41 133.797
## x_2       1     57.75 249.09  96.289
## x_junk.1  1      2.92 194.25  83.857
## x_junk.2  1      2.26 193.60  83.688
## x_junk.3  1      1.46 192.80  83.480
## x_junk.4  1      1.48 192.81  83.485
## x_junk.5  1      2.08 193.42  83.641
## x_junk.6  1      4.38 195.72  84.233
## drop x_junk.3
fit1 <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + + x_junk.4 +
             x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit1)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + +x_junk.4 + x_junk.5 + 
##     x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                192.80  83.480
## x_1       1    335.92 528.71 131.921
## x_2       1     57.16 249.96  94.463
## x_junk.1  1      2.30 195.09  82.073
## x_junk.2  1      2.18 194.98  82.043
## x_junk.4  1      1.15 193.95  81.779
## x_junk.5  1      2.27 195.06  82.065
## x_junk.6  1      4.73 197.52  82.691
## drop x_junk.4
fit2 <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + + x_junk.5 + x_junk.6,
           data = data_all)
## Drop another
dropterm(fit2)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + +x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                193.95  81.779
## x_1       1    335.82 529.77 130.021
## x_2       1     56.44 250.39  92.550
## x_junk.1  1      1.95 195.90  80.278
## x_junk.2  1      2.27 196.21  80.359
## x_junk.5  1      2.96 196.91  80.537
## x_junk.6  1      4.74 198.69  80.986
## drop x_junk.1
fit3 <- lm(y_sim ~ x_1 + x_2 + x_junk.2 + + x_junk.5 + x_junk.6,
           data = data_all)
## Drop another
dropterm(fit3)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.2 + +x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                195.90  80.278
## x_1       1    368.05 563.94 131.147
## x_2       1     59.57 255.47  91.554
## x_junk.2  1      2.46 198.36  78.903
## x_junk.5  1      3.71 199.61  79.217
## x_junk.6  1      4.71 200.60  79.466
## drop x_junk.2
fit4 <- lm(y_sim ~ x_1 + x_2 + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit4)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                198.36  78.903
## x_1       1    422.20 620.56 133.930
## x_2       1     57.59 255.95  89.648
## x_junk.5  1      4.78 203.13  78.092
## x_junk.6  1      3.62 201.98  77.807
## drop x_junk.6
fit5 <- lm(y_sim ~ x_1 + x_2 + x_junk.5, data = data_all)
## Drop another
dropterm(fit5)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.5
##          Df Sum of Sq    RSS     AIC
## <none>                201.98  77.807
## x_1       1    426.25 628.23 132.544
## x_2       1     60.66 262.64  88.939
## x_junk.5  1      4.94 206.92  77.014
## drop x_junk.5
fit6 <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Drop another
dropterm(fit6)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2
##     Df Sum of Sq    RSS     AIC
## <none>           206.92  77.014
## x_1  1    468.35 675.27 134.154
## x_2  1     58.31 265.23  87.429
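The manual `addterm()`/`dropterm()` search above can be automated with `step()` (or `MASS::stepAIC()`), which repeats these single-term moves until the AIC no longer improves. A sketch for the backward search, which here should stop at `y_sim ~ x_1 + x_2`, as in the manual steps:

## Automated backward search on AIC (sketch)
fit_full <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 +
                 x_junk.4 + x_junk.5 + x_junk.6, data = data_all)
fit_back <- step(fit_full, direction = "backward", trace = 0)
formula(fit_back)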
Mallows' \(C_p\): \[ C_p(\hat{\mathbf{y}}_\eta) = \frac{1}{n} \left( \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2_{\text{full}} \right) \]
FPE: \[ FPE(\hat{\mathbf{y}}_\eta) = \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 + 2 |\eta| \hat{\sigma}^2_\eta \quad \text{with} \quad \hat{\sigma}^2_\eta = \frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2} {n - |\eta|} \]
Hence: \[ FPE(\hat{\mathbf{y}}_\eta) = \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \]
\[ \begin{aligned} \log \frac1n FPE(\hat{\mathbf{y}}_\eta) &= \log \left(\frac1n \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \right)\\ &= \log \left(\frac1n \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2 \right) + \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \end{aligned} \] But: \[ n \log\left(\frac{ \|\mathbf{y} - \hat{\mathbf{y}}_\eta \|^2}{n}\right) = - 2\log \hat{L}_\eta - n \log\left(2\pi\right) - n \]
Hence: \[ n\log \frac1n FPE(\hat{\mathbf{y}}_\eta) + n \log\left(2\pi\right) + n \qquad \qquad \qquad \qquad\\ \qquad \qquad \qquad \qquad = - 2\log \hat{L}_\eta + n \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right)\\ \]
When \(n\) is large: \[ n \log \left(1 + \frac{2 |\eta|} {n - |\eta|}\right) \approx n \frac{2 |\eta|} {n - |\eta|} \approx 2 |\eta| \]
Hence: \[ \begin{aligned} n\log \frac1n FPE(\hat{\mathbf{y}}_\eta) + n \log\left(2\pi\right) + n &\approx - 2\log \hat{L}_\eta + 2 |\eta| \\ &\approx AIC(\hat{\mathbf{y}}_\eta) \end{aligned} \]
The log-FPE and AIC criteria select the same models (asymptotically).
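A sketch of the FPE criterion, using the same convention as the `Cp()` function above (model dimension taken as the number of fitted coefficients); on the small simulated example it should favour the same model as AIC.

## FPE criterion (sketch)
FPE <- function(fit) {
  n <- nobs(fit)
  p <- n - df.residual(fit)      # number of fitted coefficients
  deviance(fit) * (1 + 2 * p / (n - p))
}
c(FPE_x1_x2 = FPE(lm(y_sim ~ x_1 + x_2, data = data_sub)),
  FPE_all = FPE(lm(y_sim ~ x_1 + x_2 + x_junk, data = data_sub)))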
In a Bayesian setting, we have priors on the models \(\eta\) and parameters \(\boldsymbol{\theta}_\eta = (\boldsymbol{\beta}_\eta, \sigma^2_\eta)\): \[ \left\{ \begin{aligned} \eta &\sim \pi_{\eta} \\ \boldsymbol{\theta}_\eta & \sim p(\boldsymbol{\theta} | \eta) \end{aligned} \right. \]
The likelihood given some parameters and a model is: \(p(\mathbf{y} | \boldsymbol{\theta}, \eta)\)
the marginal likelihood given a model is: \[ p(\mathbf{y} | \eta) = \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \]
and the posterior probability of a model is: \[ p(\eta | \mathbf{y}). \]
Goal: find the model with the highest posterior probability: \[ \hat{\eta} = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) \]
Assume that all models have the same prior probability: \[ \pi_{\eta} \equiv \pi. \]
Then, from Bayes' rule: \[ p(\eta | \mathbf{y}) = p(\mathbf{y} | \eta)\pi_{\eta} / p(\mathbf{y}) = p(\mathbf{y} | \eta) \times \pi / p(\mathbf{y}) \]
And: \[ \hat{\eta} = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta). \]
Goal: find the model with the highest posterior probability: \[ \begin{aligned} \hat{\eta} & = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta) \\ & = \underset{\eta}{\operatorname{argmax}} \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \end{aligned} \]
Idea:
Assume that \[p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta)\] is concentrated around its maximum \[p(\mathbf{y} | \boldsymbol{\theta}_\eta^*, \eta) p(\boldsymbol{\theta}_\eta^* | \eta)\] and apply the Laplace approximation.
Let \(L: \mathbb{R}^q \to \mathbb{R}\) be a three times differentiable function, with a unique maximum \(\mathbf{u}^*\). Then: \[ \int_{\mathbb{R}^q} e^{n L(\mathbf{u})} d \mathbf{u} = \left(\frac{2\pi}{n}\right)^{q/2} |-L''(\mathbf{u}^*)|^{-1/2}e^{nL(\mathbf{u}^*)} + o(n^{-1}) \]
Applied to: \[ L(\boldsymbol{\theta}) = \frac1n\log p(\mathbf{y} | \boldsymbol{\theta}, \eta) + \frac1n\log p(\boldsymbol{\theta} | \eta) \] and assuming \(\log p(\mathbf{y} | \boldsymbol{\theta}^*, \eta) \approx \log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta)\), we get: \[ \log \int p(\mathbf{y} | \boldsymbol{\theta}, \eta) p(\boldsymbol{\theta} | \eta) d\boldsymbol{\theta} \approx \log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) - \frac{|\eta|}{2} \log(n). \]
Hence: \[ \begin{aligned} \hat{\eta} & = \underset{\eta}{\operatorname{argmax}} p(\eta | \mathbf{y}) = \underset{\eta}{\operatorname{argmax}} p(\mathbf{y} | \eta) \\ & \approx \underset{\eta}{\operatorname{argmax}} \left\{\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) - \frac{|\eta|}{2} \log(n)\right\}\\ & \approx \underset{\eta}{\operatorname{argmin}} \left\{-2\log p(\mathbf{y} | \hat{\boldsymbol{\theta}}, \eta) + |\eta| \log(n) \right\}\\ &\approx \underset{\eta}{\operatorname{argmin}} \left\{- 2\log \hat{L}_\eta + |\eta| \log(n)\right\} = \underset{\eta}{\operatorname{argmin}} BIC(\hat{\mathbf{y}}_\eta) \end{aligned} \]
BIC is an approximation of the marginal likelihood.