Paul Bastide - Ibrahim Bouzalmat
20/03/2024
$$y = X\beta + \epsilon$$
Distribution of $\hat{\beta}$:
$$\hat{\beta} \sim \mathcal{N}\left(\beta;\ \sigma^2 (X^TX)^{-1}\right) \quad\text{and}\quad \frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2\left[(X^TX)^{-1}\right]_{kk}}} \sim \mathcal{T}_{n-p}.$$
Distribution of $\hat{\sigma}^2$:
$$\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}.$$
Hypothesis:
$$H_0: \beta_k = 0 \quad\text{vs}\quad H_1: \beta_k \neq 0$$
Test statistic:
$$T_k = \frac{\hat{\beta}_k}{\sqrt{\hat{\sigma}^2_k}} \underset{H_0}{\sim} \mathcal{T}_{n-p}$$
Rejection region:
$$\mathbb{P}[\text{Reject} \mid H_0 \text{ true}] \leq \alpha \iff R_\alpha = \left\{ t \in \mathbb{R} \ \middle|\ |t| \geq t_{n-p}(1 - \alpha/2) \right\}$$
p-value:
$$p = \mathbb{P}_{H_0}\left[|T_k| > T_k^{obs}\right]$$
$$y_i = -1 + 3x_{i1} - x_{i2} + \epsilon_i$$
set.seed(1289)
## Predictors
n <- 500
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)
## Noise
eps <- rnorm(n, mean = 0, sd = 2)
## Model sim
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Useless predictor
x_junk_1 <- runif(n, min = -2, max = 2)
$$y_i = -1 + 3x_{i1} - x_{i2} + \epsilon_i$$
fit <- lm(y_sim ~ x_1 + x_2 + x_junk_1)
summary(fit)
##
## Call:
## lm(formula = y_sim ~ x_1 + x_2 + x_junk_1)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -5.275 -1.453 -0.028  1.371  4.893
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.82789    0.08814  -9.393   <2e-16 ***
## x_1          3.01467    0.07546  39.952   <2e-16 ***
## x_2         -0.95917    0.07469 -12.842   <2e-16 ***
## x_junk_1     0.01534    0.07791   0.197    0.844
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.957 on 496 degrees of freedom
## Multiple R-squared: 0.7905, Adjusted R-squared: 0.7893
## F-statistic: 624 on 3 and 496 DF, p-value: < 2.2e-16
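As a sanity check, the t test reported in the summary table can be recomputed by hand. A minimal sketch for the x_1 coefficient, using the fit above (the result should match the Pr(>|t|) column):

## Recompute the t test for x_1 by hand
beta_hat <- coef(fit)["x_1"]              # estimate
se_hat   <- sqrt(diag(vcov(fit)))["x_1"]  # standard error
t_obs    <- beta_hat / se_hat             # observed t statistic
2 * pt(abs(t_obs), df = df.residual(fit), lower.tail = FALSE)  # p-value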
$$y_i = -1 + 3x_{i1} - x_{i2} + \epsilon_i$$
## Other Useless Predictors
p_junk <- 100
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2), ncol = p_junk)
## Data frame
data_junk <- data.frame(y_sim = y_sim, x_junk = x_junk)
fit <- lm(y_sim ~ -1 + ., data = data_junk)
summary(fit)
##
## Call:
## lm(formula = y_sim ~ -1 + ., data = data_junk)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -10.8683  -3.0605  -0.2674   2.3554  11.0821
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## x_junk.1     0.2129516  0.1828010   1.165   0.2447
## x_junk.2    -0.1075770  0.1919617  -0.560   0.5755
## x_junk.3     0.4322452  0.1945466   2.222   0.0269 *
## x_junk.4     0.1471909  0.1886285   0.780   0.4357
## x_junk.5    -0.1110445  0.1848774  -0.601   0.5484
## x_junk.6     0.1276583  0.1829333   0.698   0.4857
## x_junk.7     0.0079412  0.1944593   0.041   0.9674
## x_junk.8    -0.1126224  0.1854571  -0.607   0.5440
## x_junk.9    -0.0560611  0.1918247  -0.292   0.7702
## x_junk.10   -0.3215171  0.1883253  -1.707   0.0886 .
## x_junk.11   -0.0875118  0.1891359  -0.463   0.6438
## x_junk.12   -0.0822697  0.1893373  -0.435   0.6641
## x_junk.13    0.0862599  0.1874289   0.460   0.6456
## x_junk.14   -0.1322800  0.1895130  -0.698   0.4856
## x_junk.15    0.1748496  0.1751535   0.998   0.3188
## x_junk.16   -0.2152065  0.1795766  -1.198   0.2315
## x_junk.17    0.0987989  0.1868782   0.529   0.5973
## x_junk.18   -0.0589122  0.1964012  -0.300   0.7644
## x_junk.19    0.0407544  0.1896260   0.215   0.8299
## x_junk.20    0.2788872  0.1817698   1.534   0.1257
## x_junk.21    0.0780682  0.1901043   0.411   0.6815
## x_junk.22    0.0001428  0.1936429   0.001   0.9994
## x_junk.23   -0.2146891  0.1860330  -1.154   0.2492
## x_junk.24    0.1135748  0.1794805   0.633   0.5272
## x_junk.25    0.1797087  0.1849661   0.972   0.3318
## x_junk.26   -0.0095168  0.1885643  -0.050   0.9598
## x_junk.27   -0.0697160  0.1782409  -0.391   0.6959
## x_junk.28   -0.1294379  0.1832707  -0.706   0.4804
## x_junk.29   -0.0138295  0.1830424  -0.076   0.9398
## x_junk.30    0.2922539  0.1918362   1.523   0.1284
## x_junk.31    0.1576348  0.1867874   0.844   0.3992
## x_junk.32    0.0598461  0.1892184   0.316   0.7520
## x_junk.33    0.2018882  0.1812726   1.114   0.2661
## x_junk.34   -0.0116431  0.1840672  -0.063   0.9496
## x_junk.35    0.0713938  0.1833419   0.389   0.6972
## x_junk.36    0.0190741  0.1798446   0.106   0.9156
## x_junk.37   -0.0238114  0.1885668  -0.126   0.8996
## x_junk.38   -0.1867089  0.1838147  -1.016   0.3104
## x_junk.39   -0.0777722  0.1862288  -0.418   0.6765
## x_junk.40   -0.0434927  0.1859361  -0.234   0.8152
## x_junk.41    0.0814161  0.1904542   0.427   0.6693
## x_junk.42    0.1292109  0.1870557   0.691   0.4901
## x_junk.43   -0.3022911  0.1835507  -1.647   0.1004
## x_junk.44    0.0831695  0.1823006   0.456   0.6485
## x_junk.45    0.1332532  0.1869381   0.713   0.4764
## x_junk.46   -0.1348223  0.1838432  -0.733   0.4638
## x_junk.47   -0.0218524  0.1972836  -0.111   0.9119
## x_junk.48    0.2795708  0.1854204   1.508   0.1324
## x_junk.49   -0.1531037  0.1798952  -0.851   0.3952
## x_junk.50    0.0520440  0.1852843   0.281   0.7789
## x_junk.51    0.1682396  0.1835292   0.917   0.3599
## x_junk.52   -0.1190719  0.1853241  -0.643   0.5209
## x_junk.53   -0.0989078  0.1857934  -0.532   0.5948
## x_junk.54   -0.1325051  0.1840157  -0.720   0.4719
## x_junk.55    0.1006848  0.1940173   0.519   0.6041
## x_junk.56    0.2034148  0.1764952   1.153   0.2498
## x_junk.57    0.1153209  0.1766600   0.653   0.5143
## x_junk.58   -0.0463597  0.1772648  -0.262   0.7938
## x_junk.59    0.0240049  0.1889276   0.127   0.8990
## x_junk.60    0.3726537  0.1909670   1.951   0.0517 .
## x_junk.61    0.0250128  0.1864171   0.134   0.8933
## x_junk.62   -0.2615601  0.1901999  -1.375   0.1698
## x_junk.63    0.0774898  0.1871022   0.414   0.6790
## x_junk.64    0.0545841  0.1966734   0.278   0.7815
## x_junk.65    0.0725361  0.1882312   0.385   0.7002
## x_junk.66   -0.1206593  0.1897174  -0.636   0.5251
## x_junk.67   -0.3336932  0.1818668  -1.835   0.0673 .
## x_junk.68    0.0981128  0.1912111   0.513   0.6082
## x_junk.69   -0.0153031  0.1858223  -0.082   0.9344
## x_junk.70    0.1886025  0.1830855   1.030   0.3036
## x_junk.71   -0.0695791  0.1906992  -0.365   0.7154
## x_junk.72   -0.1030034  0.1840569  -0.560   0.5760
## x_junk.73    0.0084075  0.1823678   0.046   0.9633
## x_junk.74    0.2999893  0.1783620   1.682   0.0934 .
## x_junk.75    0.3355257  0.1843357   1.820   0.0695 .
## x_junk.76   -0.1894886  0.1956830  -0.968   0.3335
## x_junk.77   -0.1183169  0.1843228  -0.642   0.5213
## x_junk.78   -0.1651666  0.1832443  -0.901   0.3679
## x_junk.79    0.0947620  0.1867396   0.507   0.6121
## x_junk.80    0.2538003  0.1910030   1.329   0.1847
## x_junk.81    0.0711922  0.1917948   0.371   0.7107
## x_junk.82   -0.0098761  0.2003307  -0.049   0.9607
## x_junk.83    0.0102922  0.1798480   0.057   0.9544
## x_junk.84    0.2163918  0.1807194   1.197   0.2319
## x_junk.85   -0.2514098  0.1881750  -1.336   0.1823
## x_junk.86    0.0181126  0.1832811   0.099   0.9213
## x_junk.87   -0.0907617  0.1809516  -0.502   0.6162
## x_junk.88    0.0354845  0.1869592   0.190   0.8496
## x_junk.89   -0.3283331  0.1867203  -1.758   0.0794 .
## x_junk.90   -0.2284999  0.1895045  -1.206   0.2286
## x_junk.91   -0.0575054  0.1800322  -0.319   0.7496
## x_junk.92    0.0612578  0.1799226   0.340   0.7337
## x_junk.93   -0.0192369  0.1866956  -0.103   0.9180
## x_junk.94   -0.0615478  0.1828148  -0.337   0.7365
## x_junk.95    0.3675294  0.1778881   2.066   0.0395 *
## x_junk.96   -0.0015247  0.1899413  -0.008   0.9936
## x_junk.97   -0.1364003  0.1779889  -0.766   0.4439
## x_junk.98    0.4422968  0.1888431   2.342   0.0197 *
## x_junk.99    0.2366755  0.1815971   1.303   0.1932
## x_junk.100   0.0010360  0.1905522   0.005   0.9957
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.311 on 400 degrees of freedom
## Multiple R-squared: 0.1889, Adjusted R-squared: -0.01383
## F-statistic: 0.9318 on 100 and 400 DF, p-value: 0.6596
p_values <- summary(fit)$coefficients[, 4]
hist(p_values, col = "lightblue", breaks = 20)
sum(p_values <= 0.05)
## [1] 3
summary(fit)$coefficients[p_values <= 0.05, ]
##            Estimate Std. Error  t value   Pr(>|t|)
## x_junk.3  0.4322452  0.1945466 2.221807 0.02685525
## x_junk.95 0.3675294  0.1778881 2.066071 0.03946512
## x_junk.98 0.4422968  0.1888431 2.342138 0.01966334
Rejection region: $\mathbb{P}[\text{Reject} \mid H_0 \text{ true}] \leq \alpha$
Just by chance, we expect about 5% of the tests to be false discoveries: here, 3 of the 100 junk predictors are wrongly declared significant at the 5% level.
When $\sigma^2$ is known:
$$\hat{\beta} \sim \mathcal{N}\left(\beta;\ \sigma^2 (X^TX)^{-1}\right).$$
When $\sigma^2$ is unknown:
$$\frac{1}{p\hat{\sigma}^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta) \sim \mathcal{F}^{p}_{n-p}$$
with $\mathcal{F}^{p}_{n-p}$ the Fisher distribution.
Let $U_1 \sim \chi^2_{p_1}$ and $U_2 \sim \chi^2_{p_2}$, with $U_1$ and $U_2$ independent. Then:
$$F = \frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2}$$
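This definition is easy to check by simulation; a quick sketch (the degrees of freedom 3 and 10 are arbitrary choices for the illustration):

## Simulate the ratio of two independent scaled chi-squares
u_1 <- rchisq(1e5, df = 3)
u_2 <- rchisq(1e5, df = 10)
f_sim <- (u_1 / 3) / (u_2 / 10)
## Compare to the Fisher quantiles: points should lie close to the y = x line
qqplot(qf(ppoints(1e5), 3, 10), f_sim)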
Let's show:
$$\frac{1}{p\hat{\sigma}^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta) \sim \mathcal{F}^{p}_{n-p}$$
Hints:
$$\hat{\beta} \sim \mathcal{N}\left(\beta;\ \sigma^2 (X^TX)^{-1}\right) \quad\text{and}\quad \frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$$
If $X \sim \mathcal{N}(\mu, \Sigma)$ is $p$-dimensional, then $(X - \mu)^T\Sigma^{-1}(X - \mu) \sim \chi^2_p$.
If $U_1 \sim \chi^2_{p_1}$ and $U_2 \sim \chi^2_{p_2}$ are independent, then $\frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2}$.
$$\hat{\beta} \sim \mathcal{N}\left(\beta;\ \sigma^2 (X^TX)^{-1}\right) \implies \frac{1}{\sigma^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta) \sim \chi^2_p$$
$$\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}, \text{ independent from } \hat{\beta}$$
Hence:
$$\frac{\frac{1}{\sigma^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta)/p}{\frac{(n-p)\hat{\sigma}^2}{\sigma^2}/(n-p)} = \frac{1}{p\hat{\sigma}^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta) \sim \mathcal{F}^{p}_{n-p}.$$
Hypothesis:
$$H_0: \beta_1 = \dots = \beta_p = 0 \quad\text{vs}\quad H_1: \exists\, k \mid \beta_k \neq 0$$
Test statistic:
$$F = \frac{1}{p\hat{\sigma}^2}\hat{\beta}^T(X^TX)\hat{\beta} \underset{H_0}{\sim} \mathcal{F}^{p}_{n-p}$$
Rejection region:
$$\mathbb{P}[\text{Reject} \mid H_0 \text{ true}] \leq \alpha \iff R_\alpha = \left\{ f \in \mathbb{R}_+ \ \middle|\ f \geq f^{p}_{n-p}(1-\alpha) \right\}$$
p-value:
$$p = \mathbb{P}_{H_0}\left[F > F^{obs}\right]$$
fit <- lm(y_sim ~ -1 + ., data = data_junk)
f_obs <- summary(fit)$fstatistic
f_obs
##      value      numdf      dendf
##   0.931791 100.000000 400.000000
p_value_f <- 1 - pf(f_obs[1], f_obs[2], f_obs[3])
p_value_f
##     value
## 0.6596268
We do not reject the null hypothesis that all the (junk) coefficients are zero (phew).
$$F = \frac{1}{p\hat{\sigma}^2}\hat{\beta}^T(X^TX)\hat{\beta} = \frac{(X\hat{\beta})^T(X\hat{\beta})/p}{\|\hat{\epsilon}\|^2/(n-p)} = \frac{\|\hat{y}\|^2/p}{\|\hat{\epsilon}\|^2/(n-p)} = \frac{\|\hat{y}\|^2/p}{\|y - \hat{y}\|^2/(n-p)}$$
Good model:
$y$ not too far from $\mathcal{M}(X)$.
$\|\hat{\epsilon}\|^2$ not too large compared to $\|\hat{y}\|^2$.
$F$ tends to be large.
Bad model:
$y$ almost orthogonal to $\mathcal{M}(X)$.
$\|\hat{\epsilon}\|^2$ rather large compared to $\|\hat{y}\|^2$.
$F$ tends to be small.
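This decomposition can be checked numerically on the junk fit above; a minimal sketch (note that the model has no intercept, so R's F statistic uses the uncentered fitted values):

## Recompute F from the decomposition above
y_hat <- fitted(fit)
rss   <- sum(residuals(fit)^2)
p     <- length(coef(fit))
(sum(y_hat^2) / p) / (rss / (nobs(fit) - p))  # matches summary(fit)$fstatistic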
## Data frame
data_all <- data.frame(y_sim = y_sim, x_1 = x_1, x_2 = x_2, x_junk = x_junk)
## Fit
fit <- lm(y_sim ~ ., data = data_all)
## Multiple t tests
p_values_t <- summary(fit)$coefficients[, 4]
names(p_values_t[p_values_t <= 0.05])
## [1] "(Intercept)" "x_1"         "x_2"         "x_junk.27"   "x_junk.76"
## [6] "x_junk.80"
## F test
f_obs <- summary(fit)$fstatistic
p_value_f <- 1 - pf(f_obs[1], f_obs[2], f_obs[3])
p_value_f
## value
##     0
Can we test:
$$H_0: \beta_{p_0+1} = \dots = \beta_p = 0 \quad\text{vs}\quad H_1: \exists\, k \in \{p_0+1, \dots, p\} \mid \beta_k \neq 0\ ?$$
i.e. decide between the full model:
$$y = X\beta + \epsilon$$
and the nested model:
$$y = X_0\beta_0 + \epsilon,$$
with $X = (X_0\ X_1)$, $\operatorname{rank}(X) = p$ and $\operatorname{rank}(X_0) = p_0$.
Idea: compare
$\|\hat{y} - \hat{y}_0\|^2$, the distance between $\hat{y}$ and $\hat{y}_0$,
to $\|y - \hat{y}\|^2$, the residual error.
Then:
Good model:
$\hat{y}$ quite different from $\hat{y}_0$.
The full model does add information.
Bad model:
$\hat{y}$ similar to $\hat{y}_0$.
The full model does not add much information.
$$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p-p_0)}{\|y - \hat{y}\|^2/(n-p)}$$
Let $P$ be the orthogonal projection onto $\mathcal{M}(X)$ and $P_0$ the orthogonal projection onto $\mathcal{M}(X_0)$. Then:
$$\hat{y} - \hat{y}_0 = Py - P_0y = Py - P_0Py = (I_n - P_0)Py = P_0^{\perp}Py \in \mathcal{M}(X_0)^{\perp} \cap \mathcal{M}(X)$$
$$y - \hat{y} = (I_n - P)y = P^{\perp}y \in \mathcal{M}(X)^{\perp}$$
$\hat{y} - \hat{y}_0$ and $y - \hat{y}$ are orthogonal,
i.e. their covariance is zero,
i.e. (Gaussian assumption) they are independent.
From Cochran's theorem:
$$\frac{1}{\sigma^2}\|y - \hat{y}\|^2 = \frac{1}{\sigma^2}\|P^{\perp}y\|^2 = \frac{1}{\sigma^2}\|P^{\perp}\epsilon\|^2 \sim \chi^2_{n-p}$$
$$\frac{1}{\sigma^2}\left\|\hat{y} - \hat{y}_0 - P_0^{\perp}PX\beta\right\|^2 = \frac{1}{\sigma^2}\left\|P_0^{\perp}P(y - X\beta)\right\|^2 \sim \chi^2_{p-p_0}$$
If $H_0$ is true, then $P_0^{\perp}PX\beta = 0$, hence:
$$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p-p_0)}{\|y - \hat{y}\|^2/(n-p)} \underset{H_0}{\sim} \mathcal{F}^{p-p_0}_{n-p}.$$
From Pythagoras's theorem:
$$\|y - \hat{y}_0\|^2 = \|y - Py + Py - P_0y\|^2 = \|P^{\perp}y + P_0^{\perp}Py\|^2 = \|P^{\perp}y\|^2 + \|P_0^{\perp}Py\|^2 = \|y - \hat{y}\|^2 + \|\hat{y} - \hat{y}_0\|^2$$
i.e.
$$\|\hat{y} - \hat{y}_0\|^2 = \|y - \hat{y}_0\|^2 - \|y - \hat{y}\|^2 = RSS_0 - RSS$$
Hence:
$$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p-p_0)}{\|y - \hat{y}\|^2/(n-p)} = \frac{n-p}{p-p_0}\,\frac{RSS_0 - RSS}{RSS}$$
To test:
$$H_0: \beta_{p_0+1} = \dots = \beta_p = 0 \quad\text{vs}\quad H_1: \exists\, k \in \{p_0+1, \dots, p\} \mid \beta_k \neq 0$$
we use the F statistic:
$$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p-p_0)}{\|y - \hat{y}\|^2/(n-p)} = \frac{n-p}{p-p_0}\,\frac{RSS_0 - RSS}{RSS} \underset{H_0}{\sim} \mathcal{F}^{p-p_0}_{n-p}.$$
$$y_i = -1 + 3x_{i1} - x_{i2} + \epsilon_i$$
## Data frame
data_all <- data.frame(y_sim = y_sim, x_1 = x_1, x_2 = x_2, x_junk = x_junk)
## Fit true
fit_small <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Fit true and junk
fit_all <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3,
              data = data_all)
## Comparison
anova(fit_small, fit_all)
## Analysis of Variance Table
##
## Model 1: y_sim ~ x_1 + x_2
## Model 2: y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    497 1900.5
## 2    494 1883.2  3    17.297 1.5124 0.2104
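The F statistic and p-value reported by anova can be recomputed directly from the residual sums of squares of the two fits; a minimal sketch:

## Recompute the nested-model F test from the two RSS
rss_0   <- deviance(fit_small)     # RSS of the nested model
rss     <- deviance(fit_all)       # RSS of the full model
df_0    <- df.residual(fit_small)  # n - p_0
df_full <- df.residual(fit_all)    # n - p
f_stat  <- ((rss_0 - rss) / (df_0 - df_full)) / (rss / df_full)
pf(f_stat, df_0 - df_full, df_full, lower.tail = FALSE)  # matches anova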
$$y_i = -1 + 3x_{i1} - x_{i2} + \epsilon_i$$
## Fit true
fit_small <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Fit all
fit_all <- lm(y_sim ~ ., data = data_all)
## Comparison
anova(fit_small, fit_all)$`Pr(>F)`[2]
## [1] 0.3548787
We do not reject the null hypothesis that the first two variables are enough to explain the data.
Assuming that the model has an intercept, we often test:
$$H_0: \beta_2 = \dots = \beta_p = 0 \quad\text{vs}\quad H_1: \exists\, k \in \{2, \dots, p\} \mid \beta_k \neq 0$$
The associated F statistic is ($p_0 = 1$):
$$F = \frac{\|\hat{y} - \bar{y}\mathbb{1}\|^2/(p-1)}{\|y - \hat{y}\|^2/(n-p)} \underset{H_0}{\sim} \mathcal{F}^{p-1}_{n-p}.$$
That is the default test reported by R.
$$y_i = -1 + 3x_{i1} - x_{i2} + \epsilon_i$$
fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)
##
## Call:
## lm(formula = y_sim ~ x_1 + x_2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.2523 -1.4517 -0.0302  1.3494  4.8711
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.82877    0.08794  -9.424   <2e-16 ***
## x_1          3.01415    0.07534  40.008   <2e-16 ***
## x_2         -0.95922    0.07462 -12.855   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.955 on 497 degrees of freedom
## Multiple R-squared: 0.7905, Adjusted R-squared: 0.7897
## F-statistic: 937.8 on 2 and 497 DF, p-value: < 2.2e-16
If $p_0 = p - 1$, we test:
$$H_0: \beta_p = 0 \quad\text{vs}\quad H_1: \beta_p \neq 0$$
The associated F statistic is ($p_0 = p - 1$):
$$F = \frac{\|\hat{y} - \hat{y}_{-p}\|^2}{\|y - \hat{y}\|^2/(n-p)} \underset{H_0}{\sim} \mathcal{F}^{1}_{n-p}.$$
It can be shown that:
$$F = T^2 \quad\text{with}\quad T = \frac{\hat{\beta}_p}{\hat{\sigma}_p}$$
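This identity is easy to verify numerically; a quick sketch testing the coefficient of x_2 in the model with x_1 and x_2 (the objects fit_0 and fit_full are introduced here just for the check):

## Check F = T^2 when testing a single coefficient
fit_0    <- lm(y_sim ~ x_1, data = data_all)
fit_full <- lm(y_sim ~ x_1 + x_2, data = data_all)
f_stat   <- anova(fit_0, fit_full)$F[2]
t_stat   <- summary(fit_full)$coefficients["x_2", "t value"]
all.equal(f_stat, t_stat^2)  # TRUE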
Model:
$$y_i = \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i$$
Problem:
Can we choose the "best" subset of non-zero $\beta_k$?
I.e. can we find the predictors that really have an impact on $y$?
An F-test on all subsets? Not practical, and it raises multiple-testing issues.
Idea:
Find a “score” to assess the quality of a given fit.
$$R^2 = 1 - \frac{RSS}{TSS}$$
## Function to get statistics for one fit
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared)
  return(res)
}
$$y_i = -1 + 3x_{i1} - x_{i2} + \epsilon_i$$
## Fit true
get_stats_fit(lm(y_sim ~ x_1 + x_2, data = data_all))
##          R2
## 1 0.7905255
## Fit bigger
get_stats_fit(lm(y_sim ~ x_1 + x_2 + x_junk.1, data = data_all))
##        R2
## 1 0.79133
## Fit wrong
get_stats_fit(lm(y_sim ~ x_1 + x_junk.1, data = data_all))
##          R2
## 1 0.7252314
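The helper get_all_stats used below is not shown in these slides; here is a minimal sketch consistent with its output, which fits every non-empty subset of predictors and collects the statistics returned by get_stats_fit (it assumes the response is the first column of the data frame):

## Possible implementation of `get_all_stats` (not given in the slides)
get_all_stats <- function(data) {
  response <- colnames(data)[1]
  preds <- colnames(data)[-1]
  ## All subsets of size 1, 2, ..., length(preds)
  subsets <- unlist(lapply(seq_along(preds),
                           function(k) combn(preds, k, simplify = FALSE)),
                    recursive = FALSE)
  res <- do.call(rbind, lapply(subsets, function(s) {
    fit <- lm(reformulate(s, response = response), data = data)
    cbind(get_stats_fit(fit), preds = paste(s, collapse = "+"))
  }))
  rownames(res) <- NULL
  res
}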
## Only one "junk" predictor
data_sub <- data.frame(y_sim = y_sim, x_1 = x_1, x_2 = x_2, x_junk = x_junk_1)
## Fit all possible models
get_all_stats(data_sub)
##             R2          preds
## 1 0.7208723311            x_1
## 2 0.1158793338            x_2
## 3 0.0005972444         x_junk
## 4 0.7905254867        x_1+x_2
## 5 0.7208974123     x_1+x_junk
## 6 0.1164841596     x_2+x_junk
## 7 0.7905418666 x_1+x_2+x_junk
$R^2$ alone is not enough: the biggest model always has the largest $R^2$.
$$R^2_a = 1 - \frac{RSS/(n-p)}{TSS/(n-1)}$$
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared,
                    R2a = sumfit$adj.r.squared)
  return(res)
}
get_all_stats(data_sub)
##             R2          R2a          preds
## 1 0.7208723311  0.720311834            x_1
## 2 0.1158793338  0.114103991            x_2
## 3 0.0005972444 -0.001409588         x_junk
## 4 0.7905254867  0.789682531        x_1+x_2
## 5 0.7208974123  0.719774263     x_1+x_junk
## 6 0.1164841596  0.112928764     x_2+x_junk
## 7 0.7905418666  0.789274983 x_1+x_2+x_junk
$R^2_a$: adjusted for the number of predictors.
$$R^2_a = 1 - \frac{RSS/(n-p)}{TSS/(n-1)}$$
The larger the better.
When $p$ increases, the fit must improve substantially for $R^2_a$ to rise.
Intuitive, but with little theoretical justification.
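Equivalently, $R^2_a = 1 - (1 - R^2)\frac{n-1}{n-p}$: for a fixed $R^2$, increasing $p$ mechanically lowers $R^2_a$, so the gain in $R^2$ must outweigh this penalty.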
set.seed(1289)
## Predictors
n <- 50
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)
## Model
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1
## Sim
eps <- rnorm(n, mean = 0, sd = 2)
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
## Useless predictors
p_junk <- 48
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2), ncol = p_junk)
## Good and Bad dataset
data_all <- data.frame(y_sim = y_sim, x_1 = x_1, x_2 = x_2, x_junk = x_junk)
## Bad overfitting fit
fit_bad <- lm(y_sim ~ ., data = data_all)
pred_bad <- predict(fit_bad)
## Good fit
fit_good <- lm(y_sim ~ x_1 + x_2, data = data_all)
pred_good <- predict(fit_good)
mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3, ylab = "y",
     xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3, ylab = "y",
     xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3, ylab = "y",
     xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)
The bad (bigger) model is closer to the observed data.
## Distance between bad prediction and data (RSS)
sum((y_sim - pred_bad)^2)
## [1] 2.714946e-27
## Distance between good prediction and data (RSS)
sum((y_sim - pred_good)^2)
## [1] 206.9159
Overfitting: the RSS is smaller for the bad, bigger model. Indeed, the big model has 51 coefficients for only n = 50 observations, so it interpolates the data exactly: its RSS is zero up to numerical rounding.
Generate new data according to the same generative model:
## New dataset from the same model
eps_new <- rnorm(n, mean = 0, sd = 2)
y_new <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps_new
Question: are the predictions from the bad (bigger) model better or worse than the predictions from the good (smaller) model, when compared to the new data?
plot(mu_y, y_sim, pch = 3, ylab = "y",
     xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## New points
points(mu_y, y_new, col = "darkgreen", pch = 3)
Compute the squared distance between the predictions and the new data:
## Distance between bad prediction and new data
sum((y_new - pred_bad)^2)
## [1] 535.296
## Distance between good prediction and new data
sum((y_new - pred_good)^2)
## [1] 205.6555
The good (smaller) model behaves better on new data.
RSS: $\|y - \hat{y}_p\|^2$, with $\hat{y}_p$ learnt on $y$
→ gets smaller when the model gets bigger (overfit).
Predictive risk: $\|y' - \hat{y}_p\|^2$, with $y'$ having the same distribution as $y$
→ becomes larger if $\hat{y}_p$ overfits.
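In this simulation the generative model is known, so the predictive risk of both fits can be approximated by Monte Carlo; a hypothetical check (the number of replicates n_rep is arbitrary):

## Monte-Carlo approximation of the predictive risk
n_rep <- 1000
risk_bad <- risk_good <- numeric(n_rep)
for (r in seq_len(n_rep)) {
  y_prime <- mu_y + rnorm(n, mean = 0, sd = 2)  # new data, same distribution
  risk_bad[r]  <- sum((y_prime - pred_bad)^2)
  risk_good[r] <- sum((y_prime - pred_good)^2)
}
mean(risk_bad)
mean(risk_good)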
Idea:
We want to select a model that has a good predictive risk.
i.e. that represents the underlying model best.
Problem:
In general, we have no way to generate $y'$.
→ We use an estimate of the (theoretical) predictive risk.
Ideal model:
$$y = \mu + \epsilon, \quad \mu \in \mathbb{R}^n, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I_n).$$
$\mu$ is the "true" expectation of the observations $y$.
Full linear regression model:
$$y = X\beta + \epsilon, \quad \operatorname{rank}(X) = p, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I_n).$$
Assume that $\mu$ can be approximated by $X\beta$, i.e. by an element of $\mathcal{M}(X)$.
Ideal model:
$$y = \mu + \epsilon, \quad \mu \in \mathbb{R}^n, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I_n).$$
Sub-models: keep only the predictors in $\eta \subset \{1, \dots, p\}$:
$$y = X_\eta\beta_\eta + \epsilon_\eta, \quad \operatorname{rank}(X_\eta) = |\eta|, \quad \epsilon_\eta \sim \mathcal{N}(0, \sigma^2 I_n),$$
with $X_\eta$ the sub-matrix of $X$:
$$X_\eta = \left(x_{\eta_1}\ x_{\eta_2} \cdots x_{\eta_{|\eta|}}\right)$$
Assume that $\mu$ can be approximated by $X_\eta\beta_\eta$, i.e. by an element of $\mathcal{M}(X_\eta)$.
"Truth": $y = \mu + \epsilon$, with $\mu \in \mathbb{R}^n$ and $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$.
Models:
$$y = X_\eta\beta_\eta + \epsilon_\eta, \quad \operatorname{rank}(X_\eta) = |\eta|, \quad \epsilon_\eta \sim \mathcal{N}(0, \sigma^2 I_n).$$
Estimators:
$$\hat{\beta}_\eta = \underset{\beta \in \mathbb{R}^{|\eta|}}{\arg\min} \|y - X_\eta\beta\|^2 = (X_\eta^TX_\eta)^{-1}X_\eta^Ty, \qquad \hat{y}_\eta = X_\eta\hat{\beta}_\eta = P_{X_\eta}y$$
Projection of the true expectation: $y_\eta = P_{X_\eta}\mu$
Risk of an estimator:
$$R(\hat{y}_\eta) = \mathbb{E}\left[\|\mu - \hat{y}_\eta\|^2\right]$$
Notes:
$$R(\hat{y}_\eta) = \mathbb{E}\left[\|\mu - \hat{y}_\eta\|^2\right] = \|\mu - y_\eta\|^2 + |\eta|\sigma^2$$
No overfitting:
Assume:
$\mu \in \mathcal{M}(X_\eta)$, so that $y_\eta = P_{X_\eta}\mu = \mu$;
$\eta \subset \eta'$, so that $y_{\eta'} = P_{X_{\eta'}}\mu = \mu$.
Then:
$$R(\hat{y}_{\eta'}) = \|\mu - y_{\eta'}\|^2 + |\eta'|\sigma^2 \geq \|\mu - y_{\eta'}\|^2 + |\eta|\sigma^2 = \|\mu - y_\eta\|^2 + |\eta|\sigma^2 = R(\hat{y}_\eta)$$
$$R(\hat{y}_\eta) = \mathbb{E}\left[\|\mu - \hat{y}_\eta\|^2\right] = \mathbb{E}\left[\|\mu - y_\eta + y_\eta - \hat{y}_\eta\|^2\right]$$
$$R(\hat{y}_\eta) = \mathbb{E}\left[\|\mu - P_{X_\eta}\mu + P_{X_\eta}\mu - P_{X_\eta}y\|^2\right] = \mathbb{E}\left[\|(I_n - P_{X_\eta})\mu + P_{X_\eta}(\mu - y)\|^2\right] = \mathbb{E}\left[\|P_{X_\eta}^{\perp}\mu + P_{X_\eta}(\mu - y)\|^2\right]$$
$$R(\hat{y}_\eta) = \mathbb{E}\left[\|P_{X_\eta}^{\perp}\mu\|^2 + \|P_{X_\eta}(\mu - y)\|^2\right] = \|P_{X_\eta}^{\perp}\mu\|^2 + \mathbb{E}\left[\|P_{X_\eta}(\mu - y)\|^2\right] = \|\mu - y_\eta\|^2 + \mathbb{E}\left[\|P_{X_\eta}(\mu - y)\|^2\right]$$
$$R(\hat{y}_\eta) = \|\mu - y_\eta\|^2 + \mathbb{E}\left[\|P_{X_\eta}(\mu - y)\|^2\right]$$
From Cochran's theorem:
$$\frac{1}{\sigma^2}\|P_{X_\eta}(\mu - y)\|^2 \sim \chi^2_{|\eta|}$$
Hence:
$$\mathbb{E}\left[\frac{1}{\sigma^2}\|P_{X_\eta}(\mu - y)\|^2\right] = |\eta|$$
And:
$$R(\hat{y}_\eta) = \|\mu - y_\eta\|^2 + |\eta|\sigma^2.$$
We would like:
$$\eta_0 = \underset{\eta \subset \{1,\dots,p\}}{\arg\min}\, R(\hat{y}_\eta)$$
For any model $\eta$, the expectation of the RSS is:
$$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = R(\hat{y}_\eta) + n\sigma^2 - 2|\eta|\sigma^2$$
$$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = \mathbb{E}\left[\|y - \mu + \mu - \hat{y}_\eta\|^2\right] = \mathbb{E}\left[\|y - \mu\|^2 + \|\mu - \hat{y}_\eta\|^2 + 2\langle y - \mu;\ \mu - \hat{y}_\eta\rangle\right]$$
$$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = \mathbb{E}\left[\|\epsilon\|^2\right] + R(\hat{y}_\eta) + 2\,\mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}y\rangle\right]$$
$$\mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}y\rangle\right] = \mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}(\mu + \epsilon)\rangle\right] = \mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}\mu\rangle\right] - \mathbb{E}\left[\langle \epsilon;\ P_{X_\eta}\epsilon\rangle\right]$$
$$\mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}y\rangle\right] = 0 - \mathbb{E}\left[\langle P_{X_\eta}\epsilon;\ P_{X_\eta}\epsilon\rangle\right] = -\mathbb{E}\left[\|P_{X_\eta}\epsilon\|^2\right] = -|\eta|\sigma^2$$
$$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = n\sigma^2 + R(\hat{y}_\eta) - 2|\eta|\sigma^2$$
We have:
$$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = R(\hat{y}_\eta) + n\sigma^2 - 2|\eta|\sigma^2$$
Mallows' $C_p$:
$$C_p(\hat{y}_\eta) = \frac{1}{n}\left(\|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2\right)$$
We have:
$$C_p(\hat{y}_\eta) = \frac{1}{n}\left(\|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2\right) = \frac{1}{n}\left(\hat{R}(\hat{y}_\eta) + n\hat{\sigma}^2\right), \quad\text{where}\quad \hat{R}(\hat{y}_\eta) = \|y - \hat{y}_\eta\|^2 - n\hat{\sigma}^2 + 2|\eta|\hat{\sigma}^2$$
is the estimate of the risk suggested by the identity above. Since $n\hat{\sigma}^2$ does not depend on $\eta$:
$$\hat{\eta} = \underset{\eta \subset \{1,\dots,p\}}{\arg\min}\, C_p(\hat{y}_\eta) = \underset{\eta \subset \{1,\dots,p\}}{\arg\min}\, \hat{R}(\hat{y}_\eta).$$
→ Minimizing Mallows' $C_p$ is the same as minimizing an unbiased estimate of the risk.
$$C_p(\hat{y}_\eta) = \frac{1}{n}\left(\|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2\right)$$
Penalizes for the number of parameters $|\eta|$.
The smaller the better.
## Mallows' Cp
Cp <- function(fit) {
  n <- nobs(fit)
  p <- n - df.residual(fit)  # number of fitted coefficients |eta|
  RSS <- deviance(fit)       # residual sum of squares
  return((RSS + p * RSS / (n - p)) / n)  # penalty uses sigma2_hat = RSS / (n - p)
}
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(Cp = Cp(fit))
  return(res)
}
## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##          Cp          preds
## 1  5.085236            x_1
## 2 16.107190            x_2
## 3 18.207435         x_junk
## 4  3.823952        x_1+x_2
## 5  5.095010     x_1+x_junk
## 6 16.128558     x_2+x_junk
## 7  3.831361 x_1+x_2+x_junk
Mallows' $C_p$ is smallest for the true model, x_1+x_2.
Recall:
$$(\hat{\beta}_\eta, \hat{\sigma}^2_{\eta,ML}) = \underset{(\beta,\sigma^2) \in \mathbb{R}^{|\eta|}\times\mathbb{R}^*_+}{\arg\max} \log L(\beta, \sigma^2 \mid y) = \underset{(\beta,\sigma^2) \in \mathbb{R}^{|\eta|}\times\mathbb{R}^*_+}{\arg\max}\ -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|y - X_\eta\beta\|^2$$
And:
$$\hat{\sigma}^2_{\eta,ML} = \frac{\|y - \hat{y}_\eta\|^2}{n}; \qquad \hat{y}_\eta = X_\eta\hat{\beta}_\eta$$
Thus:
$$\log L(\hat{\beta}_\eta, \hat{\sigma}^2_{ML} \mid y) = \cdots$$
The maximized log-likelihood is equal to:
$$\log L(\hat{\beta}_\eta, \hat{\sigma}^2_{ML} \mid y) = -\frac{n}{2}\log\left(2\pi\frac{\|y - \hat{y}_\eta\|^2}{n}\right) - \frac{1}{2}\,\frac{\|y - \hat{y}_\eta\|^2}{\|y - \hat{y}_\eta\|^2/n}$$
i.e.
$$\log\hat{L}_\eta = -\frac{n}{2}\log\left(\frac{\|y - \hat{y}_\eta\|^2}{n}\right) - \frac{n}{2}\log(2\pi) - \frac{n}{2}$$
Remember, for any $\eta \subset \eta'$:
$$\|y - \hat{y}_{\eta'}\|^2 \leq \|y - \hat{y}_\eta\|^2$$
Hence
$$R^2_\eta = 1 - \frac{\|y - \hat{y}_\eta\|^2}{\|y - \bar{y}\mathbb{1}\|^2}$$
always increases when we add a predictor.
Similarly:
$$\log\hat{L}_\eta = -\frac{n}{2}\log\left(\frac{\|y - \hat{y}_\eta\|^2}{n}\right) - \frac{n}{2}\log(2\pi) - \frac{n}{2}$$
Problem: $\log\hat{L}_\eta$ always increases when we add predictors
→ it cannot be used for model selection (same problem as $R^2$).
Idea: add a penalty term that depends on the size of the model,
→ minimize:
$$-\log\hat{L}_\eta + f(|\eta|), \quad\text{with } f \text{ increasing}$$
Minimize:
$$-\log\hat{L}_\eta + f(|\eta|), \quad\text{with } f \text{ increasing}$$
For $\eta \subset \eta'$:
$-\log\hat{L}_\eta$ decreases;
$f(|\eta|)$ increases.
Goal: find an $f$ that compensates for the "overfit", i.e. the noisy and insignificant increase of the likelihood when we add unrelated predictors.
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit))
  return(res)
}
## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##   minusLogLik          preds
## 1    1115.053            x_1
## 2    1403.284            x_2
## 3    1433.925         x_junk
## 4    1043.286        x_1+x_2
## 5    1115.030     x_1+x_junk
## 6    1403.113     x_2+x_junk
## 7    1043.266 x_1+x_2+x_junk
When we add a "junk" variable, $-\log\hat{L}_\eta$ decreases, but not by much.
Question: can we find a penalty that compensates for this noisy increase of the likelihood?
$$AIC = -2\log\hat{L}_\eta + 2|\eta|$$
$\hat{L}_\eta$ is the maximized likelihood of the model.
$|\eta|$ is the dimension of the model.
The smaller the better.
Has asymptotic theoretical guarantees.
Easy to use and widely used.
$$BIC = -2\log\hat{L}_\eta + |\eta|\log(n)$$
$\hat{L}_\eta$ is the maximized likelihood of the model.
$|\eta|$ is the dimension of the model.
The smaller the better.
Based on an approximation of the marginal likelihood.
Easy to use and widely used.
See: Lebarbier, É. and Mary-Huard, T. (2004). Le critère BIC : fondements théoriques et interprétation. INRIA research report [inria-00070685].
$$AIC = -2\log\hat{L}_\eta + 2|\eta|$$
$$BIC = -2\log\hat{L}_\eta + |\eta|\log(n)$$
Both have asymptotic theoretical justifications.
BIC penalizes more than AIC → it chooses smaller models.
Both are widely used.
There are some (better) non-asymptotic criteria; see Giraud, C. (2014). Introduction to High-Dimensional Statistics.
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit),
                    AIC = AIC(fit),
                    BIC = BIC(fit))
  return(res)
}
## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##   minusLogLik      AIC      BIC          preds
## 1    1115.053 2236.105 2248.749            x_1
## 2    1403.284 2812.567 2825.211            x_2
## 3    1433.925 2873.850 2886.493         x_junk
## 4    1043.286 2094.572 2111.430        x_1+x_2
## 5    1115.030 2238.060 2254.919     x_1+x_junk
## 6    1403.113 2814.225 2831.084     x_2+x_junk
## 7    1043.266 2096.533 2117.606 x_1+x_2+x_junk
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
##      minusLogLik              AIC              BIC
## "x_1+x_2+x_junk"        "x_1+x_2"        "x_1+x_2"
## Advertising data
library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")
## Apply stats to all possible combinations
all_stats <- get_all_stats(ad[, c(4, 1:3)])
all_stats
##   minusLogLik       AIC       BIC              preds
## 1    519.0457 1044.0913 1053.9863                 TV
## 2    573.3369 1152.6738 1162.5687              radio
## 3    608.3357 1222.6714 1232.5663          newspaper
## 4    386.1970  780.3941  793.5874           TV+radio
## 5    509.8891 1027.7782 1040.9714       TV+newspaper
## 6    573.2361 1154.4723 1167.6655    radio+newspaper
## 7    386.1811  782.3622  798.8538 TV+radio+newspaper
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
##          minusLogLik                  AIC                  BIC
## "TV+radio+newspaper"           "TV+radio"           "TV+radio"
$p$ predictors → $2^p$ possible models.
## Full formula
full_formula <- formula(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 +
                          x_junk.4 + x_junk.5 + x_junk.6)
## Complete search
2^8
## [1] 256
## No predictors
fit0 <- lm(y_sim ~ 1, data = data_all)
## Add one
library(MASS)
addterm(fit0, full_formula)
## Single term additions
##
## Model:
## y_sim ~ 1
##          Df Sum of Sq    RSS     AIC
## <none>               731.66 136.165
## x_1       1    466.43 265.23  87.429
## x_2       1     56.39 675.27 134.154
## x_junk.1  1     50.05 681.61 134.622
## x_junk.2  1     32.10 699.56 135.921
## x_junk.3  1      0.41 731.26 138.137
## x_junk.4  1      2.49 729.18 137.995
## x_junk.5  1     54.28 677.38 134.311
## x_junk.6  1     11.59 720.08 137.367
## Add x_1
fit1 <- lm(y_sim ~ 1 + x_1, data = data_all)
## Add another
addterm(fit1, full_formula)
## Single term additions
##
## Model:
## y_sim ~ 1 + x_1
##          Df Sum of Sq    RSS    AIC
## <none>               265.23 87.429
## x_2       1    58.314 206.92 77.014
## x_junk.1  1     6.038 259.19 88.277
## x_junk.2  1     0.159 265.07 89.399
## x_junk.3  1     0.376 264.85 89.358
## x_junk.4  1     0.403 264.83 89.353
## x_junk.5  1     2.587 262.64 88.939
## x_junk.6  1     6.791 258.44 88.132
## Add x_2
fit2 <- lm(y_sim ~ 1 + x_1 + x_2, data = data_all)
## Add another
addterm(fit2, full_formula)
## Single term additions
##
## Model:
## y_sim ~ 1 + x_1 + x_2
##          Df Sum of Sq    RSS    AIC
## <none>               206.92 77.014
## x_junk.1  1    3.0199 203.90 78.279
## x_junk.2  1    2.1852 204.73 78.484
## x_junk.3  1    0.8228 206.09 78.815
## x_junk.4  1    1.6416 205.27 78.616
## x_junk.5  1    4.9363 201.98 77.807
## x_junk.6  1    3.7814 203.13 78.092
## Full formula
fit0 <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 +
             x_junk.4 + x_junk.5 + x_junk.6, data = data_all)
## Drop one
dropterm(fit0)
## Single term deletions
##
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 + x_junk.4 +
##     x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>               191.34  85.100
## x_1       1    336.07 527.41 133.797
## x_2       1     57.75 249.09  96.289
## x_junk.1  1      2.92 194.25  83.857
## x_junk.2  1      2.26 193.60  83.688
## x_junk.3  1      1.46 192.80  83.480
## x_junk.4  1      1.48 192.81  83.485
## x_junk.5  1      2.08 193.42  83.641
## x_junk.6  1      4.38 195.72  84.233
## Drop x_junk.3
fit1 <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.4 +
             x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit1)
## Single term deletions
##
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.4 + x_junk.5 +
##     x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>               192.80  83.480
## x_1       1    335.92 528.71 131.921
## x_2       1     57.16 249.96  94.463
## x_junk.1  1      2.30 195.09  82.073
## x_junk.2  1      2.18 194.98  82.043
## x_junk.4  1      1.15 193.95  81.779
## x_junk.5  1      2.27 195.06  82.065
## x_junk.6  1      4.73 197.52  82.691
## Drop x_junk.4
fit2 <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.5 +
             x_junk.6, data = data_all)
## Drop another
dropterm(fit2)
## Single term deletions
##
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>               193.95  81.779
## x_1       1    335.82 529.77 130.021
## x_2       1     56.44 250.39  92.550
## x_junk.1  1      1.95 195.90  80.278
## x_junk.2  1      2.27 196.21  80.359
## x_junk.5  1      2.96 196.91  80.537
## x_junk.6  1      4.74 198.69  80.986
## Drop x_junk.1
fit3 <- lm(y_sim ~ x_1 + x_2 + x_junk.2 + x_junk.5 + x_junk.6,
           data = data_all)
## Drop another
dropterm(fit3)
## Single term deletions
##
## Model:
## y_sim ~ x_1 + x_2 + x_junk.2 + x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>               195.90  80.278
## x_1       1    368.05 563.94 131.147
## x_2       1     59.57 255.47  91.554
## x_junk.2  1      2.46 198.36  78.903
## x_junk.5  1      3.71 199.61  79.217
## x_junk.6  1      4.71 200.60  79.466
## Drop x_junk.2
fit4 <- lm(y_sim ~ x_1 + x_2 + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit4)
## Single term deletions
##
## Model:
## y_sim ~ x_1 + x_2 + x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>               198.36  78.903
## x_1       1    422.20 620.56 133.930
## x_2       1     57.59 255.95  89.648
## x_junk.5  1      4.78 203.13  78.092
## x_junk.6  1      3.62 201.98  77.807
## Drop x_junk.6
fit5 <- lm(y_sim ~ x_1 + x_2 + x_junk.5, data = data_all)
## Drop another
dropterm(fit5)
## Single term deletions
##
## Model:
## y_sim ~ x_1 + x_2 + x_junk.5
##          Df Sum of Sq    RSS     AIC
## <none>               201.98  77.807
## x_1       1    426.25 628.23 132.544
## x_2       1     60.66 262.64  88.939
## x_junk.5  1      4.94 206.92  77.014
## Drop x_junk.5
fit6 <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Drop another
dropterm(fit6)
## Single term deletions
##
## Model:
## y_sim ~ x_1 + x_2
##        Df Sum of Sq    RSS     AIC
## <none>             206.92  77.014
## x_1     1    468.35 675.27 134.154
## x_2     1     58.31 265.23  87.429
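These manual forward and backward loops can be automated, for instance with step from base R (or stepAIC from MASS), which by default uses the same AIC criterion; a minimal sketch:

## Automated backward selection from the full model
fit_full <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 +
                 x_junk.4 + x_junk.5 + x_junk.6, data = data_all)
fit_back <- step(fit_full, direction = "backward", trace = 0)
formula(fit_back)  # should stop at y_sim ~ x_1 + x_2, as above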
Mallows' $C_p$:
$$C_p(\hat{y}_\eta) = \frac{1}{n}\left(\|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2_{\text{full}}\right)$$
FPE:
$$FPE(\hat{y}_\eta) = \|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2_\eta \quad\text{with}\quad \hat{\sigma}^2_\eta = \frac{\|y - \hat{y}_\eta\|^2}{n - |\eta|}$$
Hence:
$$FPE(\hat{y}_\eta) = \|y - \hat{y}_\eta\|^2\left(1 + \frac{2|\eta|}{n - |\eta|}\right)$$
$$\log\frac{1}{n}FPE(\hat{y}_\eta) = \log\left(\frac{1}{n}\|y - \hat{y}_\eta\|^2\left(1 + \frac{2|\eta|}{n - |\eta|}\right)\right) = \log\left(\frac{1}{n}\|y - \hat{y}_\eta\|^2\right) + \log\left(1 + \frac{2|\eta|}{n - |\eta|}\right)$$
Hence:
$$n\log\frac{1}{n}FPE(\hat{y}_\eta) + n\log(2\pi) + n = -2\log\hat{L}_\eta + n\log\left(1 + \frac{2|\eta|}{n - |\eta|}\right)$$
When $n$ is large:
$$n\log\left(1 + \frac{2|\eta|}{n - |\eta|}\right) \approx n\,\frac{2|\eta|}{n - |\eta|} \approx 2|\eta|$$
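For instance, with $n = 500$ and $|\eta| = 3$: $n\log\left(1 + \frac{2|\eta|}{n - |\eta|}\right) = 500\log(1 + 6/497) \approx 6.00 \approx 2|\eta|$.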
Hence:
$$n\log\frac{1}{n}FPE(\hat{y}_\eta) + n\log(2\pi) + n \approx -2\log\hat{L}_\eta + 2|\eta| = AIC(\hat{y}_\eta)$$
The log-FPE and AIC criteria (asymptotically) select the same models.
In a Bayesian setting, we put priors on the models $\eta$ and on the parameters $\theta_\eta = (\beta_\eta, \sigma^2_\eta)$:
$$\begin{cases}\eta \sim \pi_\eta \\ \theta_\eta \sim p(\theta \mid \eta)\end{cases}$$
The likelihood given the parameters and a model is: $p(y \mid \theta, \eta)$;
the marginal likelihood given a model is:
$$p(y \mid \eta) = \int p(y \mid \theta, \eta)\, p(\theta \mid \eta)\, d\theta;$$
and the posterior probabilities of the models are: $p(\eta \mid y)$.
Goal: find the model with the highest posterior probability:
$$\hat{\eta} = \underset{\eta}{\arg\max}\ p(\eta \mid y)$$
Assume that all models have the same prior probability: $\pi_\eta \equiv \pi$.
Then, from Bayes' rule:
$$p(\eta \mid y) = p(y \mid \eta)\,\pi_\eta / p(y) = p(y \mid \eta) \times \pi / p(y)$$
And:
$$\hat{\eta} = \underset{\eta}{\arg\max}\ p(\eta \mid y) = \underset{\eta}{\arg\max}\ p(y \mid \eta).$$
Goal: find the model with the highest posterior probability:
$$\hat{\eta} = \underset{\eta}{\arg\max}\ p(\eta \mid y) = \underset{\eta}{\arg\max}\ p(y \mid \eta) = \underset{\eta}{\arg\max} \int p(y \mid \theta, \eta)\, p(\theta \mid \eta)\, d\theta$$
Idea: write the integrand as
$$p(y \mid \theta, \eta)\, p(\theta \mid \eta) = e^{nL(\theta)}$$
and apply Laplace's method.
Laplace's method: let $L: \mathbb{R}^q \to \mathbb{R}$ be a three times differentiable function with a unique maximum $u^*$. Then:
$$\int_{\mathbb{R}^q} e^{nL(u)}\,du = \left(\frac{2\pi}{n}\right)^{q/2} \left|-L''(u^*)\right|^{-1/2} e^{nL(u^*)}\left(1 + O(n^{-1})\right)$$
Applied to:
$$L(\theta) = \frac{1}{n}\log p(y \mid \theta, \eta) + \frac{1}{n}\log p(\theta \mid \eta)$$
Hence:
$$\hat{\eta} = \underset{\eta}{\arg\max}\ p(\eta \mid y) = \underset{\eta}{\arg\max}\ p(y \mid \eta) \approx \underset{\eta}{\arg\max}\left\{\log p(y \mid \hat{\theta}, \eta) - \frac{|\eta|}{2}\log(n)\right\} \approx \underset{\eta}{\arg\min}\left\{-2\log p(y \mid \hat{\theta}, \eta) + |\eta|\log(n)\right\} \approx \underset{\eta}{\arg\min}\left\{-2\log\hat{L}_\eta + |\eta|\log(n)\right\} = \underset{\eta}{\arg\min}\ BIC(\hat{y}_\eta)$$
BIC is thus an approximation of ($-2$ times the log of) the marginal likelihood.