Multiple Regression - Variable Selection

Paul Bastide - Ibrahim Bouzalmat

20/03/2024

What we know

Gaussian Model

Model:

$$y = X\beta + \epsilon$$

  • $y$ random vector of $n$ responses
  • $X$ non-random $n \times p$ matrix of predictors
  • $\epsilon$ random vector of $n$ errors
  • $\beta$ non-random, unknown vector of $p$ coefficients

Assumptions:

  • (H1) $\operatorname{rg}(X) = p$
  • (H2) $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$

Distribution of the estimators

Distribution of $\hat{\beta}$: $$\hat{\beta} \sim \mathcal{N}\left(\beta,\ \sigma^2(X^TX)^{-1}\right) \quad\text{and}\quad \frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2\left[(X^TX)^{-1}\right]_{kk}}} \sim \mathcal{T}_{n-p}.$$

Distribution of $\hat{\sigma}^2$: $$\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}.$$

Student t tests

Hypothesis: $$H_0: \beta_k = 0 \quad\text{vs}\quad H_1: \beta_k \neq 0$$

Test statistic: $$T_k = \frac{\hat{\beta}_k}{\hat{\sigma}_k} \underset{H_0}{\sim} \mathcal{T}_{n-p}, \qquad \hat{\sigma}_k^2 = \hat{\sigma}^2\left[(X^TX)^{-1}\right]_{kk}$$

Rejection region: $$\mathbb{P}[\text{Reject} \mid H_0 \text{ true}] \leq \alpha \quad\Longrightarrow\quad \mathcal{R}_\alpha = \left\{ t \in \mathbb{R} \ \middle|\ |t| \geq t_{n-p}(1 - \alpha/2) \right\}$$

p-value: $$p = \mathbb{P}_{H_0}\left[|T_k| > T_k^{obs}\right]$$

Problems with the t-test

Simulated Dataset

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

set.seed(1289)

## Predictors
n <- 500
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)

## Noise
eps <- rnorm(n, mean = 0, sd = 2)

## Model sim
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

## Useless predictor
x_junk_1 <- runif(n, min = -2, max = 2)

Simulated Dataset - Fit

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

fit <- lm(y_sim ~ x_1 + x_2 + x_junk_1)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ x_1 + x_2 + x_junk_1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.275 -1.453 -0.028  1.371  4.893 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.82789    0.08814  -9.393   <2e-16 ***
## x_1          3.01467    0.07546  39.952   <2e-16 ***
## x_2         -0.95917    0.07469 -12.842   <2e-16 ***
## x_junk_1     0.01534    0.07791   0.197    0.844    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.957 on 496 degrees of freedom
## Multiple R-squared:  0.7905, Adjusted R-squared:  0.7893 
## F-statistic:   624 on 3 and 496 DF,  p-value: < 2.2e-16
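As a sanity check of the formulas above, the $T_k$ statistic and its p-value can be recomputed by hand from the fitted model; a minimal sketch, using the fact that vcov(fit) returns $\hat{\sigma}^2(X^TX)^{-1}$:

## t test for beta_1 recomputed by hand
beta_hat <- coef(fit)["x_1"]
sigma_k <- sqrt(diag(vcov(fit))["x_1"])   # hat(sigma)_k
t_obs <- beta_hat / sigma_k               # matches the "t value" column above
2 * pt(abs(t_obs), df = df.residual(fit), lower.tail = FALSE)   # p-value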

Simulated Dataset

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

## Other Useless Predictors
p_junk <- 100
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2),
                 ncol = p_junk)

## Data frame
data_junk <- data.frame(y_sim = y_sim,
                        x_junk = x_junk)

Simulated Dataset - Fit - Junk

fit <- lm(y_sim ~ -1 + ., data = data_junk)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ -1 + ., data = data_junk)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8683  -3.0605  -0.2674   2.3554  11.0821 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## x_junk.1    0.2129516  0.1828010   1.165   0.2447  
## x_junk.2   -0.1075770  0.1919617  -0.560   0.5755  
## x_junk.3    0.4322452  0.1945466   2.222   0.0269 *
## x_junk.4    0.1471909  0.1886285   0.780   0.4357  
## x_junk.5   -0.1110445  0.1848774  -0.601   0.5484  
## x_junk.6    0.1276583  0.1829333   0.698   0.4857  
## x_junk.7    0.0079412  0.1944593   0.041   0.9674  
## x_junk.8   -0.1126224  0.1854571  -0.607   0.5440  
## x_junk.9   -0.0560611  0.1918247  -0.292   0.7702  
## x_junk.10  -0.3215171  0.1883253  -1.707   0.0886 .
## x_junk.11  -0.0875118  0.1891359  -0.463   0.6438  
## x_junk.12  -0.0822697  0.1893373  -0.435   0.6641  
## x_junk.13   0.0862599  0.1874289   0.460   0.6456  
## x_junk.14  -0.1322800  0.1895130  -0.698   0.4856  
## x_junk.15   0.1748496  0.1751535   0.998   0.3188  
## x_junk.16  -0.2152065  0.1795766  -1.198   0.2315  
## x_junk.17   0.0987989  0.1868782   0.529   0.5973  
## x_junk.18  -0.0589122  0.1964012  -0.300   0.7644  
## x_junk.19   0.0407544  0.1896260   0.215   0.8299  
## x_junk.20   0.2788872  0.1817698   1.534   0.1257  
## x_junk.21   0.0780682  0.1901043   0.411   0.6815  
## x_junk.22   0.0001428  0.1936429   0.001   0.9994  
## x_junk.23  -0.2146891  0.1860330  -1.154   0.2492  
## x_junk.24   0.1135748  0.1794805   0.633   0.5272  
## x_junk.25   0.1797087  0.1849661   0.972   0.3318  
## x_junk.26  -0.0095168  0.1885643  -0.050   0.9598  
## x_junk.27  -0.0697160  0.1782409  -0.391   0.6959  
## x_junk.28  -0.1294379  0.1832707  -0.706   0.4804  
## x_junk.29  -0.0138295  0.1830424  -0.076   0.9398  
## x_junk.30   0.2922539  0.1918362   1.523   0.1284  
## x_junk.31   0.1576348  0.1867874   0.844   0.3992  
## x_junk.32   0.0598461  0.1892184   0.316   0.7520  
## x_junk.33   0.2018882  0.1812726   1.114   0.2661  
## x_junk.34  -0.0116431  0.1840672  -0.063   0.9496  
## x_junk.35   0.0713938  0.1833419   0.389   0.6972  
## x_junk.36   0.0190741  0.1798446   0.106   0.9156  
## x_junk.37  -0.0238114  0.1885668  -0.126   0.8996  
## x_junk.38  -0.1867089  0.1838147  -1.016   0.3104  
## x_junk.39  -0.0777722  0.1862288  -0.418   0.6765  
## x_junk.40  -0.0434927  0.1859361  -0.234   0.8152  
## x_junk.41   0.0814161  0.1904542   0.427   0.6693  
## x_junk.42   0.1292109  0.1870557   0.691   0.4901  
## x_junk.43  -0.3022911  0.1835507  -1.647   0.1004  
## x_junk.44   0.0831695  0.1823006   0.456   0.6485  
## x_junk.45   0.1332532  0.1869381   0.713   0.4764  
## x_junk.46  -0.1348223  0.1838432  -0.733   0.4638  
## x_junk.47  -0.0218524  0.1972836  -0.111   0.9119  
## x_junk.48   0.2795708  0.1854204   1.508   0.1324  
## x_junk.49  -0.1531037  0.1798952  -0.851   0.3952  
## x_junk.50   0.0520440  0.1852843   0.281   0.7789  
## x_junk.51   0.1682396  0.1835292   0.917   0.3599  
## x_junk.52  -0.1190719  0.1853241  -0.643   0.5209  
## x_junk.53  -0.0989078  0.1857934  -0.532   0.5948  
## x_junk.54  -0.1325051  0.1840157  -0.720   0.4719  
## x_junk.55   0.1006848  0.1940173   0.519   0.6041  
## x_junk.56   0.2034148  0.1764952   1.153   0.2498  
## x_junk.57   0.1153209  0.1766600   0.653   0.5143  
## x_junk.58  -0.0463597  0.1772648  -0.262   0.7938  
## x_junk.59   0.0240049  0.1889276   0.127   0.8990  
## x_junk.60   0.3726537  0.1909670   1.951   0.0517 .
## x_junk.61   0.0250128  0.1864171   0.134   0.8933  
## x_junk.62  -0.2615601  0.1901999  -1.375   0.1698  
## x_junk.63   0.0774898  0.1871022   0.414   0.6790  
## x_junk.64   0.0545841  0.1966734   0.278   0.7815  
## x_junk.65   0.0725361  0.1882312   0.385   0.7002  
## x_junk.66  -0.1206593  0.1897174  -0.636   0.5251  
## x_junk.67  -0.3336932  0.1818668  -1.835   0.0673 .
## x_junk.68   0.0981128  0.1912111   0.513   0.6082  
## x_junk.69  -0.0153031  0.1858223  -0.082   0.9344  
## x_junk.70   0.1886025  0.1830855   1.030   0.3036  
## x_junk.71  -0.0695791  0.1906992  -0.365   0.7154  
## x_junk.72  -0.1030034  0.1840569  -0.560   0.5760  
## x_junk.73   0.0084075  0.1823678   0.046   0.9633  
## x_junk.74   0.2999893  0.1783620   1.682   0.0934 .
## x_junk.75   0.3355257  0.1843357   1.820   0.0695 .
## x_junk.76  -0.1894886  0.1956830  -0.968   0.3335  
## x_junk.77  -0.1183169  0.1843228  -0.642   0.5213  
## x_junk.78  -0.1651666  0.1832443  -0.901   0.3679  
## x_junk.79   0.0947620  0.1867396   0.507   0.6121  
## x_junk.80   0.2538003  0.1910030   1.329   0.1847  
## x_junk.81   0.0711922  0.1917948   0.371   0.7107  
## x_junk.82  -0.0098761  0.2003307  -0.049   0.9607  
## x_junk.83   0.0102922  0.1798480   0.057   0.9544  
## x_junk.84   0.2163918  0.1807194   1.197   0.2319  
## x_junk.85  -0.2514098  0.1881750  -1.336   0.1823  
## x_junk.86   0.0181126  0.1832811   0.099   0.9213  
## x_junk.87  -0.0907617  0.1809516  -0.502   0.6162  
## x_junk.88   0.0354845  0.1869592   0.190   0.8496  
## x_junk.89  -0.3283331  0.1867203  -1.758   0.0794 .
## x_junk.90  -0.2284999  0.1895045  -1.206   0.2286  
## x_junk.91  -0.0575054  0.1800322  -0.319   0.7496  
## x_junk.92   0.0612578  0.1799226   0.340   0.7337  
## x_junk.93  -0.0192369  0.1866956  -0.103   0.9180  
## x_junk.94  -0.0615478  0.1828148  -0.337   0.7365  
## x_junk.95   0.3675294  0.1778881   2.066   0.0395 *
## x_junk.96  -0.0015247  0.1899413  -0.008   0.9936  
## x_junk.97  -0.1364003  0.1779889  -0.766   0.4439  
## x_junk.98   0.4422968  0.1888431   2.342   0.0197 *
## x_junk.99   0.2366755  0.1815971   1.303   0.1932  
## x_junk.100  0.0010360  0.1905522   0.005   0.9957  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.311 on 400 degrees of freedom
## Multiple R-squared:  0.1889, Adjusted R-squared:  -0.01383 
## F-statistic: 0.9318 on 100 and 400 DF,  p-value: 0.6596

Simulated Dataset - Fit - Junk

p_values <- summary(fit)$coefficients[, 4]
hist(p_values, col = "lightblue", breaks = 20)

Simulated Dataset - Fit - Junk

sum(p_values <= 0.05)
## [1] 3
summary(fit)$coefficients[p_values <= 0.05, ]
##            Estimate Std. Error  t value   Pr(>|t|)
## x_junk.3  0.4322452  0.1945466 2.221807 0.02685525
## x_junk.95 0.3675294  0.1778881 2.066071 0.03946512
## x_junk.98 0.4422968  0.1888431 2.342138 0.01966334

 

Reject Region: $$\mathbb{P}[\text{Reject} \mid H_0 \text{ true}] \leq \alpha$$

Just by chance, about 5% of the tests come out as false discoveries (here, 3 out of 100 junk predictors).

Joint Distribution of the Coefficients - $\sigma^2$ unknown

Joint distribution of $\hat{\beta}$

When $\sigma^2$ is known:

$$\hat{\beta} \sim \mathcal{N}\left(\beta,\ \sigma^2(X^TX)^{-1}\right).$$

When $\sigma^2$ is unknown: $$\frac{1}{p\hat{\sigma}^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta) \sim \mathcal{F}^p_{n-p}$$

with $\mathcal{F}^p_{n-p}$ the Fisher distribution.

Reminder: Fisher Distribution

Let $U_1 \sim \chi^2_{p_1}$ and $U_2 \sim \chi^2_{p_2}$, with $U_1$ and $U_2$ independent. Then $$F = \frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2}$$
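This reminder can be checked by a quick simulation; a minimal sketch, with arbitrary degrees of freedom $p_1 = 3$ and $p_2 = 10$:

## Simulate the ratio of two independent scaled chi-square variables
u_1 <- rchisq(1e5, df = 3)
u_2 <- rchisq(1e5, df = 10)
f_sim <- (u_1 / 3) / (u_2 / 10)
## Compare with the theoretical F(3, 10) quantiles
qqplot(qf(ppoints(1e5), df1 = 3, df2 = 10), f_sim)
abline(0, 1)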

Joint distribution of $\hat{\beta}$ - Proof - Hints

Let’s show: $$\frac{1}{p\hat{\sigma}^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta) \sim \mathcal{F}^p_{n-p}$$

Hints:
$$\hat{\beta} \sim \mathcal{N}\left(\beta,\ \sigma^2(X^TX)^{-1}\right) \quad\text{and}\quad \frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$$

If $X \sim \mathcal{N}(\mu, \Sigma)$, then $(X - \mu)^T\Sigma^{-1}(X - \mu) \sim \chi^2_p$

If $U_1 \sim \chi^2_{p_1}$ and $U_2 \sim \chi^2_{p_2}$ are independent, then $\frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2}$

Joint distribution of $\hat{\beta}$ - Proof

$$\hat{\beta} \sim \mathcal{N}\left(\beta,\ \sigma^2(X^TX)^{-1}\right) \implies \frac{1}{\sigma^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta) \sim \chi^2_p$$

$$\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}, \text{ independent from } \hat{\beta}$$

Hence: $$\frac{\frac{1}{\sigma^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta)\,/\,p}{\frac{(n-p)\hat{\sigma}^2}{\sigma^2}\,/\,(n-p)} = \frac{1}{p\hat{\sigma}^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta) \sim \mathcal{F}^p_{n-p}.$$

Global Fisher F-test

Global Fisher F-test

Hypothesis: $$H_0: \beta_1 = \dots = \beta_p = 0 \quad\text{vs}\quad H_1: \exists\, k \;|\; \beta_k \neq 0$$

Test Statistic: $$F = \frac{1}{p\hat{\sigma}^2}\hat{\beta}^T(X^TX)\hat{\beta} \underset{H_0}{\sim} \mathcal{F}^p_{n-p}$$

Reject Region: $$\mathbb{P}[\text{Reject} \mid H_0 \text{ true}] \leq \alpha \quad\Longrightarrow\quad \mathcal{R}_\alpha = \left\{ f \in \mathbb{R}_+ \ \middle|\ f \geq f^p_{n-p}(1 - \alpha) \right\}$$

p-value: $$p = \mathbb{P}_{H_0}\left[F > F^{obs}\right]$$

Simulated Dataset - Fit - Junk

fit <- lm(y_sim ~ -1 + ., data = data_junk)
f_obs <- summary(fit)$fstatistic
f_obs
##      value      numdf      dendf 
##   0.931791 100.000000 400.000000
p_value_f <- 1 - pf(f_obs[1], f_obs[2], f_obs[3])
p_value_f
##     value 
## 0.6596268

We do not reject the null hypothesis that all the (junk) coefficients are zero (phew).

Fisher F-Test - Geometry

$$F = \frac{1}{p\hat{\sigma}^2}\hat{\beta}^T(X^TX)\hat{\beta} = \frac{\|X\hat{\beta}\|^2/p}{\|\hat{\epsilon}\|^2/(n-p)} = \frac{\|\hat{y}\|^2/p}{\|\hat{\epsilon}\|^2/(n-p)} = \frac{\|\hat{y}\|^2/p}{\|y - \hat{y}\|^2/(n-p)}$$

  • $H_0: \beta_1 = \dots = \beta_p = 0$, i.e. the model is useless.
  • If $\|\hat{\epsilon}\|^2$ is small, then the error is small,
    and $F$ tends to be large, i.e. we tend to reject $H_0$:
    the model is useful.
  • If $\|\hat{\epsilon}\|^2$ is large, then the error is large,
    and $F$ tends to be small, i.e. we tend not to reject $H_0$:
    the model is rather useless.

Geometrical interpretation

Good model

  • $y$ not too far from $\mathcal{M}(X)$.

  • $\|\hat{\epsilon}\|^2$ not too large compared to $\|\hat{y}\|^2$.

  • $F$ tends to be large.

Bad model

  • $y$ almost orthogonal to $\mathcal{M}(X)$.

  • $\|\hat{\epsilon}\|^2$ rather large compared to $\|\hat{y}\|^2$.

  • $F$ tends to be small.

Simulated Dataset - True and Junk

## Data frame
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)
## Fit
fit <- lm(y_sim ~ ., data = data_all)
## multiple t tests
p_values_t <- summary(fit)$coefficients[, 4]
names(p_values_t[p_values_t <= 0.05])
## [1] "(Intercept)" "x_1"         "x_2"         "x_junk.27"   "x_junk.76"  
## [6] "x_junk.80"
## f test
f_obs <- summary(fit)$fstatistic
p_value_f <- 1 - pf(f_obs[1], f_obs[2], f_obs[3])
p_value_f
## value 
##     0

Nested Models Fisher F-test

Nested Models

Can we test: $$H_0: \beta_{p_0+1} = \dots = \beta_p = 0 \quad\text{vs}\quad H_1: \exists\, k \in \{p_0+1, \dots, p\} \;|\; \beta_k \neq 0\ ?$$

I.e. decide between the full model: $$y = X\beta + \epsilon$$

and the nested model: $$y = X_0\beta_0 + \epsilon$$

with $X = (X_0\ X_1)$, $\operatorname{rg}(X) = p$ and $\operatorname{rg}(X_0) = p_0$.

Geometrical interpretation

Nested Models

Idea: Compare

  • $\|\hat{y} - \hat{y}_0\|^2$, the distance between $\hat{y}$ and $\hat{y}_0$, to

  • $\|y - \hat{y}\|^2$, the residual error.

Then:

  • If $\|\hat{y} - \hat{y}_0\|^2$ is small compared to $\|y - \hat{y}\|^2$, then “$\hat{y} \approx \hat{y}_0$” and the null model is enough.
  • If $\|\hat{y} - \hat{y}_0\|^2$ is large compared to $\|y - \hat{y}\|^2$, then the full model is useful.

Geometrical interpretation

Good model

  • $\hat{y}$ quite different from $\hat{y}_0$.

  • The full model does add information.

Bad model

  • $\hat{y}$ similar to $\hat{y}_0$.

  • The full model does not add much information.

Nested Models - F test

$$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p - p_0)}{\|y - \hat{y}\|^2/(n - p)}$$

  • When $F$ is large, the full model is useful.
  • When $F$ is small, the null model is enough.
  • Distribution of $F$?

Distribution of F - 1/2

$$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p - p_0)}{\|y - \hat{y}\|^2/(n - p)}$$

Let $P$ be the projection on $\mathcal{M}(X)$ and $P_0$ the projection on $\mathcal{M}(X_0)$.

$$\hat{y} - \hat{y}_0 = Py - P_0y = Py - P_0Py = (I_n - P_0)Py = P_0^\perp Py \in \mathcal{M}(X_0)^\perp \cap \mathcal{M}(X)$$

$$y - \hat{y} = (I_n - P)y = P^\perp y \in \mathcal{M}(X)^\perp$$

$\Longrightarrow$ $\hat{y} - \hat{y}_0$ and $y - \hat{y}$ are orthogonal,
i.e. their covariance is zero,
i.e. (Gaussian assumption) they are independent.

Distribution of F - 2/2

From Cochran’s theorem: $$\frac{1}{\sigma^2}\|y - \hat{y}\|^2 = \frac{1}{\sigma^2}\|P^\perp y\|^2 = \frac{1}{\sigma^2}\|P^\perp \epsilon\|^2 \sim \chi^2_{n-p}$$

$$\frac{1}{\sigma^2}\|\hat{y} - \hat{y}_0 - P_0^\perp PX\beta\|^2 = \frac{1}{\sigma^2}\|P_0^\perp P(y - X\beta)\|^2 \sim \chi^2_{p-p_0}$$

($P_0^\perp P$ is a projection onto a space of dimension $p - p_0$.)

If $H_0$ is true, then $P_0^\perp PX\beta = 0$, hence:

$$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p - p_0)}{\|y - \hat{y}\|^2/(n - p)} \underset{H_0}{\sim} \mathcal{F}^{p-p_0}_{n-p}.$$

Nested F test

From Pythagoras’ theorem: $$\|y - \hat{y}_0\|^2 = \|y - Py + Py - P_0y\|^2 = \|P^\perp y + P_0^\perp Py\|^2 = \|P^\perp y\|^2 + \|P_0^\perp Py\|^2 = \|y - \hat{y}\|^2 + \|\hat{y} - \hat{y}_0\|^2$$

i.e. $$\|\hat{y} - \hat{y}_0\|^2 = \|y - \hat{y}_0\|^2 - \|y - \hat{y}\|^2 = RSS_0 - RSS$$

Hence: $$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p - p_0)}{\|y - \hat{y}\|^2/(n - p)} = \frac{n - p}{p - p_0}\,\frac{RSS_0 - RSS}{RSS}$$

Nested F test

To test: $$H_0: \beta_{p_0+1} = \dots = \beta_p = 0 \quad\text{vs}\quad H_1: \exists\, k \in \{p_0+1, \dots, p\} \;|\; \beta_k \neq 0$$

we use the $F$ statistic: $$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p - p_0)}{\|y - \hat{y}\|^2/(n - p)} = \frac{n - p}{p - p_0}\,\frac{RSS_0 - RSS}{RSS} \underset{H_0}{\sim} \mathcal{F}^{p-p_0}_{n-p}.$$
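This statistic can be computed by hand from two fitted models and checked against anova(); a minimal sketch, reusing y_sim and data_all from the simulations above:

## Nested F test by hand
fit_0 <- lm(y_sim ~ x_1 + x_2, data = data_all)             # null model
fit_1 <- lm(y_sim ~ x_1 + x_2 + x_junk.1, data = data_all)  # full model
RSS_0 <- deviance(fit_0); RSS_1 <- deviance(fit_1)
n <- nobs(fit_1)
p <- n - df.residual(fit_1); p_0 <- n - df.residual(fit_0)
F_obs <- (n - p) / (p - p_0) * (RSS_0 - RSS_1) / RSS_1
pf(F_obs, p - p_0, n - p, lower.tail = FALSE)  # matches anova(fit_0, fit_1)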

Simulated Dataset - True vs Junk

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

## Data frame
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)
## Fit true
fit_small <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Fit true and junk
fit_all <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3, data = data_all)
## Comparison
anova(fit_small, fit_all)
## Analysis of Variance Table
## 
## Model 1: y_sim ~ x_1 + x_2
## Model 2: y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    497 1900.5                           
## 2    494 1883.2  3    17.297 1.5124 0.2104

Simulated Dataset - True vs Junk

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

## Fit true
fit_small <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Fit all
fit_all <- lm(y_sim ~ ., data = data_all)
## Comparison
anova(fit_small, fit_all)$`Pr(>F)`[2]
## [1] 0.3548787

We do not reject the null hypothesis that the first two variables are enough to explain the data.

Full F test

Assuming that the model has an intercept, we often test: $$H_0: \beta_2 = \dots = \beta_p = 0 \quad\text{vs}\quad H_1: \exists\, k \in \{2, \dots, p\} \;|\; \beta_k \neq 0$$

The associated $F$ statistic is ($p_0 = 1$): $$F = \frac{\|\hat{y} - \bar{y}\mathbb{1}\|^2/(p - 1)}{\|y - \hat{y}\|^2/(n - p)} \underset{H_0}{\sim} \mathcal{F}^{p-1}_{n-p}.$$

That is the default test returned by R.

Full F test

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ x_1 + x_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2523 -1.4517 -0.0302  1.3494  4.8711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.82877    0.08794  -9.424   <2e-16 ***
## x_1          3.01415    0.07534  40.008   <2e-16 ***
## x_2         -0.95922    0.07462 -12.855   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.955 on 497 degrees of freedom
## Multiple R-squared:  0.7905, Adjusted R-squared:  0.7897 
## F-statistic: 937.8 on 2 and 497 DF,  p-value: < 2.2e-16

One coefficient F test

If $p_0 = p - 1$, we test: $$H_0: \beta_p = 0 \quad\text{vs}\quad H_1: \beta_p \neq 0$$

The associated $F$ statistic is ($p_0 = p - 1$, with $\hat{y}_0$ the fit without the $p$-th predictor): $$F = \frac{\|\hat{y} - \hat{y}_0\|^2}{\|y - \hat{y}\|^2/(n - p)} \underset{H_0}{\sim} \mathcal{F}^1_{n-p}.$$

It can be shown that: $$F = T^2 \quad\text{with}\quad T = \frac{\hat{\beta}_p}{\hat{\sigma}_p}$$

so that it is equivalent to the t-test.
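A quick numerical check of this equivalence, reusing the simulated data_all from above (a sketch):

## F test of one coefficient vs squared t test
fit_0 <- lm(y_sim ~ x_1, data = data_all)
fit_1 <- lm(y_sim ~ x_1 + x_2, data = data_all)
F_obs <- anova(fit_0, fit_1)$F[2]
t_obs <- summary(fit_1)$coefficients["x_2", "t value"]
F_obs - t_obs^2   # numerically zero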

Variable Selection

How to choose the “best” model?

Model: $$y_i = \beta_1 x_{i,1} + \dots + \beta_p x_{i,p} + \epsilon_i$$

Problem:
Can we choose the “best” subset of non-zero $\beta_k$?

I.e. can we find the predictors that really have an impact on $y$?

F-test on all subsets? Not practical: too many subsets, and a multiple testing problem.

Idea:
Find a “score” to assess the quality of a given fit.

Variable selection - $R^2$

$$R^2 = 1 - \frac{RSS}{TSS}$$

## Function to get statistics for one fit
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared)
  return(res)
}
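The helper get_all_stats, used repeatedly below, is not shown on the slides; here is a minimal sketch of what it could look like, assuming the response is the first column of the data frame. It picks up whatever get_stats_fit is currently defined, so it works unchanged as that function is redefined on the following slides.

## Hypothetical helper: fit every non-empty subset of predictors
## and collect the statistics returned by get_stats_fit()
get_all_stats <- function(data) {
  response <- colnames(data)[1]
  preds <- colnames(data)[-1]
  ## all non-empty subsets, smallest first
  subsets <- unlist(lapply(seq_along(preds),
                           function(k) combn(preds, k, simplify = FALSE)),
                    recursive = FALSE)
  res <- do.call(rbind, lapply(subsets, function(s) {
    fit <- lm(reformulate(s, response = response), data = data)
    cbind(get_stats_fit(fit), preds = paste(s, collapse = "+"))
  }))
  return(res)
}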

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

Variable selection - $R^2$

## Fit true
get_stats_fit(lm(y_sim ~ x_1 + x_2, data = data_all))
##          R2
## 1 0.7905255
## Fit bigger
get_stats_fit(lm(y_sim ~ x_1 + x_2 + x_junk.1, data = data_all))
##        R2
## 1 0.79133
## Fit wrong
get_stats_fit(lm(y_sim ~ x_1 + x_junk.1, data = data_all))
##          R2
## 1 0.7252314

Variable selection - $R^2$

## Only one "junk" predictor
data_sub <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk_1)
## Fit all possible models
get_all_stats(data_sub)
##             R2          preds
## 1 0.7208723311            x_1
## 2 0.1158793338            x_2
## 3 0.0005972444         x_junk
## 4 0.7905254867        x_1+x_2
## 5 0.7208974123     x_1+x_junk
## 6 0.1164841596     x_2+x_junk
## 7 0.7905418666 x_1+x_2+x_junk

$R^2$ is not enough.

Variable selection - $R^2_a$

$$R^2_a = 1 - \frac{RSS/(n-p)}{TSS/(n-1)}$$

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared,
                    R2a = sumfit$adj.r.squared)
  return(res)
}

Variable selection - $R^2_a$

get_all_stats(data_sub)
##             R2          R2a          preds
## 1 0.7208723311  0.720311834            x_1
## 2 0.1158793338  0.114103991            x_2
## 3 0.0005972444 -0.001409588         x_junk
## 4 0.7905254867  0.789682531        x_1+x_2
## 5 0.7208974123  0.719774263     x_1+x_junk
## 6 0.1164841596  0.112928764     x_2+x_junk
## 7 0.7905418666  0.789274983 x_1+x_2+x_junk

$R^2_a$: adjusts for the number of predictors.

Adjusted $R^2_a$

$$R^2_a = 1 - \frac{RSS/(n-p)}{TSS/(n-1)}$$

  • The larger the better.

  • When $p$ is bigger, the fit must be really better for $R^2_a$ to rise.

  • Intuitive, but without much theoretical justification.

Predictive Risk

Simulation

set.seed(1289)

## Predictors
n <- 50
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)

## Model
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1

## sim
eps <- rnorm(n, mean = 0, sd = 2)
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

Simulation - Fit

## Useless predictors
p_junk <- 48
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2),
                 ncol = p_junk)

## Good and Bad dataset
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)

## Bad overfitting fit
fit_bad <- lm(y_sim ~ ., data = data_all)
pred_bad <- predict(fit_bad)

## Good fit
fit_good <- lm(y_sim ~ x_1 + x_2, data = data_all)
pred_good <- predict(fit_good)

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)

RSS: “fitted risk”

The bad (bigger) model is closer to the observed data.

## Distance between bad prediction and data (RSS)
sum((y_sim - pred_bad)^2)
## [1] 2.714946e-27
## Distance between good prediction and data (RSS)
sum((y_sim - pred_good)^2)
## [1] 206.9159

Overfitting: the RSS is smaller for the bad, bigger model. (Here the big model has 51 coefficients for $n = 50$ observations, so it interpolates the data and its RSS is numerically zero.)

Predictive Risk

Generate new data according to the same generative model:

## New dataset from the same model
eps_new <- rnorm(n, mean = 0, sd = 2)
y_new <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps_new

Question:
Are the predictions from the bad (bigger) model
better or worse
than the predictions from the good (smaller) model,
when compared to the new data?

Simulation - Fit

plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## New points
points(mu_y, y_new, col = "darkgreen", pch = 3)

Predictive Risk

Compute the risk between predictions and new data:

## Distance between bad prediction and new data
sum((y_new - pred_bad)^2)
## [1] 535.296
## Distance between good prediction and new data
sum((y_new - pred_good)^2)
## [1] 205.6555

The good (smaller) model behaves better on new data.

Predictive Risk

RSS: $\|y - \hat{y}_p\|^2$, with $\hat{y}_p$ learnt on $y$:
gets smaller when the model gets bigger (overfit).

Predictive Risk: $\|y' - \hat{y}_p\|^2$, with $y'$ an independent copy of $y$:
becomes larger if $\hat{y}_p$ is overfit.

Idea:
We want to select a model that has a good predictive risk,
i.e. that represents the underlying model best.

Problem:
In general, we have no means to generate $y'$.
We use an estimate of the (theoretical) predictive risk.

Theoretical Risk

Models

Ideal model: $$y = \mu + \epsilon, \qquad \mu \in \mathbb{R}^n, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I_n).$$

$\mu$ is the “true” expectation of the observations $y$.

Full linear regression model: $$y = X\beta + \epsilon, \qquad \operatorname{rg}(X) = p, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I_n).$$

Assume that $\mu$ can be approximated as $X\beta$, i.e. as an element of $\mathcal{M}(X)$.

Models

Ideal model: $$y = \mu + \epsilon, \qquad \mu \in \mathbb{R}^n, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I_n).$$

Sub-models:
keep only the predictors in $\eta \subset \{1, \dots, p\}$: $$y = X_\eta\beta_\eta + \epsilon_\eta, \qquad \operatorname{rg}(X_\eta) = |\eta|, \qquad \epsilon_\eta \sim \mathcal{N}(0, \sigma^2 I_n).$$

with $X_\eta$ the sub-matrix of $X$: $$X_\eta = (x_{\eta_1}\ x_{\eta_2} \cdots x_{\eta_{|\eta|}})$$

Assume that $\mu$ can be approximated as $X_\eta\beta_\eta$, i.e. as an element of $\mathcal{M}(X_\eta)$.

Models

“Truth”: $y = \mu + \epsilon$, with $\mu \in \mathbb{R}^n$ and $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$.

Models: $$y = X_\eta\beta_\eta + \epsilon_\eta, \qquad \operatorname{rg}(X_\eta) = |\eta|, \qquad \epsilon_\eta \sim \mathcal{N}(0, \sigma^2 I_n).$$

Estimators: $$\hat{\beta}_\eta = \underset{\beta \in \mathbb{R}^{|\eta|}}{\operatorname{argmin}} \|y - X_\eta\beta\|^2 = (X_\eta^TX_\eta)^{-1}X_\eta^Ty, \qquad \hat{y}_\eta = X_\eta\hat{\beta}_\eta = P_{X_\eta}y$$

Projected true expectation: $$y^*_\eta = P_{X_\eta}\mu$$

Models

Risk

Risk of an estimator (theoretical)

Risk of an estimator: $$R(\hat{y}_\eta) = \mathbb{E}\left[\|\mu - \hat{y}_\eta\|^2\right]$$

Notes:

  • $\mu$ is unknown $\Rightarrow$ the risk cannot be computed in practice.
  • Expectation of the squared distance between the true mean and the estimated projection.
  • Not subject to overfitting.

Bias-Variance Decomposition

Bias-Variance Decomposition

$$R(\hat{y}_\eta) = \mathbb{E}\left[\|\mu - \hat{y}_\eta\|^2\right] = \|\mu - y^*_\eta\|^2 + |\eta|\sigma^2$$

  • Bias: $\|\mu - y^*_\eta\|^2$
    • Difference between the truth and its projection on the model.
    • Best that we could do if we knew the truth.
    • Decreases when the model gets more complex, i.e. $\eta$ gets larger.
  • Variance: $|\eta|\sigma^2$
    • Variance of the estimation.
    • Increases when the model gets more complex, i.e. $\eta$ gets larger.
  • The risk is a compromise between bias and variance.
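The decomposition can be illustrated by Monte Carlo on the simulation above (a sketch; mu_y, x_1, x_2 and n = 50 come from the previous slides; the small model contains $\mu$, so its bias is zero and the risk should be close to $|\eta|\sigma^2 = 3 \times 4 = 12$):

## Monte Carlo estimate of the risk of the small model
n_rep <- 1000
risk_sim <- replicate(n_rep, {
  y_rep <- mu_y + rnorm(n, mean = 0, sd = 2)  # new data from the true model
  fit_rep <- lm(y_rep ~ x_1 + x_2)
  sum((mu_y - fitted(fit_rep))^2)             # ||mu - hat(y)_eta||^2
})
mean(risk_sim)   # should be close to |eta| * sigma^2 = 12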

Bias-Variance - No overfitting

$$R(\hat{y}_\eta) = \mathbb{E}\left[\|\mu - \hat{y}_\eta\|^2\right] = \|\mu - y^*_\eta\|^2 + |\eta|\sigma^2$$

No overfitting:

Assume:

  • $\mu \in \mathcal{M}(X_\eta)$, so that $y^*_\eta = P_{X_\eta}\mu = \mu$.

  • $\eta \subset \eta'$, so that $y^*_{\eta'} = P_{X_{\eta'}}\mu = \mu$.

Then: $$R(\hat{y}_\eta) = \|\mu - y^*_\eta\|^2 + |\eta|\sigma^2 = |\eta|\sigma^2 \leq |\eta'|\sigma^2 = \|\mu - y^*_{\eta'}\|^2 + |\eta'|\sigma^2 = R(\hat{y}_{\eta'})$$

Bias-Variance - Proof - 1/2

$$R(\hat{y}_\eta) = \mathbb{E}\left[\|\mu - \hat{y}_\eta\|^2\right] = \mathbb{E}\left[\|\mu - y^*_\eta + y^*_\eta - \hat{y}_\eta\|^2\right]$$

$$\mu - \hat{y}_\eta = \mu - P_{X_\eta}\mu + P_{X_\eta}\mu - P_{X_\eta}y = (I_n - P_{X_\eta})\mu + P_{X_\eta}(\mu - y) = P^\perp_{X_\eta}\mu + P_{X_\eta}(\mu - y)$$

The two terms being orthogonal: $$R(\hat{y}_\eta) = \mathbb{E}\left[\|P^\perp_{X_\eta}\mu\|^2 + \|P_{X_\eta}(\mu - y)\|^2\right] = \|P^\perp_{X_\eta}\mu\|^2 + \mathbb{E}\left[\|P_{X_\eta}(\mu - y)\|^2\right] = \|\mu - y^*_\eta\|^2 + \mathbb{E}\left[\|P_{X_\eta}(\mu - y)\|^2\right]$$

Bias-Variance - Proof - 2/2

$$R(\hat{y}_\eta) = \|\mu - y^*_\eta\|^2 + \mathbb{E}\left[\|P_{X_\eta}(\mu - y)\|^2\right]$$

From Cochran’s Theorem: $$\frac{1}{\sigma^2}\|P_{X_\eta}(\mu - y)\|^2 \sim \chi^2_{|\eta|}$$

Hence: $$\mathbb{E}\left[\frac{1}{\sigma^2}\|P_{X_\eta}(\mu - y)\|^2\right] = |\eta|$$

And: $$R(\hat{y}_\eta) = \|\mu - y^*_\eta\|^2 + |\eta|\sigma^2.$$

Penalized Least Squares

Oracle Estimator

We would like: $$\eta_0 = \underset{\eta \subset \{1, \dots, p\}}{\operatorname{argmin}}\ R(\hat{y}_\eta)$$

  • Minimize the risk over all possible models.
  • Problem: $R(\hat{y}_\eta)$ is theoretical $\Rightarrow$ it cannot be computed.
  • Idea: minimize an estimator $\hat{R}(\hat{y}_\eta)$ of $R(\hat{y}_\eta)$: $$\hat{\eta} = \underset{\eta \subset \{1, \dots, p\}}{\operatorname{argmin}}\ \hat{R}(\hat{y}_\eta).$$

Expectation of the RSS

For any model $\eta$, the expectation of the RSS is: $$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = R(\hat{y}_\eta) + n\sigma^2 - 2|\eta|\sigma^2$$

Expectation of the RSS - Proof

$$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = \mathbb{E}\left[\|y - \mu + \mu - \hat{y}_\eta\|^2\right] = \mathbb{E}\left[\|y - \mu\|^2 + \|\mu - \hat{y}_\eta\|^2 + 2\langle y - \mu;\ \mu - \hat{y}_\eta\rangle\right]$$

$$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = \mathbb{E}\left[\|\epsilon\|^2\right] + R(\hat{y}_\eta) + 2\,\mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}y\rangle\right]$$

$$\mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}y\rangle\right] = \mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}(\mu + \epsilon)\rangle\right] = \mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}\mu\rangle\right] - \mathbb{E}\left[\langle \epsilon;\ P_{X_\eta}\epsilon\rangle\right]$$

$$\mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}y\rangle\right] = 0 - \mathbb{E}\left[\langle P_{X_\eta}\epsilon;\ P_{X_\eta}\epsilon\rangle\right] = -\mathbb{E}\left[\|P_{X_\eta}\epsilon\|^2\right] = -|\eta|\sigma^2$$

$$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = n\sigma^2 + R(\hat{y}_\eta) - 2|\eta|\sigma^2$$

Mallow’s Cp

We have: $$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = R(\hat{y}_\eta) + n\sigma^2 - 2|\eta|\sigma^2$$

Mallow’s $C_p$: $$C_p(\hat{y}_\eta) = \frac{1}{n}\left(\|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2\right)$$

  • If $\hat{\sigma}^2$ is unbiased, then: $$\mathbb{E}\left[C_p(\hat{y}_\eta)\right] = \frac{1}{n}R(\hat{y}_\eta) + \sigma^2$$

Mallow’s Cp

We have: $$C_p(\hat{y}_\eta) = \frac{1}{n}\left(\|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2\right)$$

And: $$\mathbb{E}\left[C_p(\hat{y}_\eta)\right] = \frac{1}{n}R(\hat{y}_\eta) + \sigma^2$$

Hence: $$\hat{\eta} = \underset{\eta \subset \{1, \dots, p\}}{\operatorname{argmin}}\ C_p(\hat{y}_\eta) = \underset{\eta \subset \{1, \dots, p\}}{\operatorname{argmin}}\ \hat{R}(\hat{y}_\eta).$$

Minimizing Mallow’s $C_p$ is the same as minimizing an unbiased estimate of the risk.

Mallow’s Cp

$$C_p(\hat{y}_\eta) = \frac{1}{n}\left(\|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2\right)$$

  • Penalizes for the number of parameters $|\eta|$.

  • The smaller the better.

Simulated Dataset

## Mallow's Cp
Cp <- function(fit) {
  n <- nobs(fit)
  p <- n - df.residual(fit)     # |eta|: number of fitted coefficients
  RSS <- deviance(fit)          # residual sum of squares
  sigma2_hat <- RSS / (n - p)   # unbiased estimate of sigma^2
  return((RSS + 2 * p * sigma2_hat) / n)
}
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(Cp = Cp(fit))
  return(res)
}

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##          Cp          preds
## 1  5.085236            x_1
## 2 16.107190            x_2
## 3 18.207435         x_junk
## 4  3.823952        x_1+x_2
## 5  5.095010     x_1+x_junk
## 6 16.128558     x_2+x_junk
## 7  3.831361 x_1+x_2+x_junk

Mallow’s Cp is smallest for the true model.

Penalized Log Likelihood

Maximized Log Likelihood

Recall: $$(\hat{\beta}_\eta, \hat{\sigma}^2_{\eta, ML}) = \underset{(\beta, \sigma^2) \in \mathbb{R}^{|\eta|} \times \mathbb{R}_+}{\operatorname{argmax}} \log L(\beta, \sigma^2 \mid y) = \underset{(\beta, \sigma^2) \in \mathbb{R}^{|\eta|} \times \mathbb{R}_+}{\operatorname{argmax}} \left\{ -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|y - X_\eta\beta\|^2 \right\}$$

And: $$\hat{\sigma}^2_{\eta, ML} = \frac{\|y - \hat{y}_\eta\|^2}{n}, \qquad \hat{y}_\eta = X_\eta\hat{\beta}_\eta$$

Thus: $$\log L(\hat{\beta}_\eta, \hat{\sigma}^2_{ML} \mid y) = \ ?$$

Maximized Log Likelihood

The maximized log-likelihood is equal to: $$\log L(\hat{\beta}_\eta, \hat{\sigma}^2_{ML} \mid y) = -\frac{n}{2}\log\left(2\pi\frac{\|y - \hat{y}_\eta\|^2}{n}\right) - \frac{1}{2}\,\frac{n}{\|y - \hat{y}_\eta\|^2}\,\|y - \hat{y}_\eta\|^2$$

i.e. $$\log\hat{L}_\eta = -\frac{n}{2}\log\left(\frac{\|y - \hat{y}_\eta\|^2}{n}\right) - \frac{n}{2}\log(2\pi) - \frac{n}{2}$$
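This closed form can be checked against R’s logLik function; a minimal sketch, reusing data_all from above:

## Closed-form maximized log-likelihood vs logLik()
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
n <- nobs(fit); RSS <- deviance(fit)
- n / 2 * log(RSS / n) - n / 2 * log(2 * pi) - n / 2
as.numeric(logLik(fit))   # identical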

Maximized Log Likelihood is increasing

Remember, for any $\eta \subset \eta'$: $$\|y - \hat{y}_{\eta'}\|^2 \leq \|y - \hat{y}_\eta\|^2$$

Hence $$R^2_\eta = 1 - \frac{\|y - \hat{y}_\eta\|^2}{\|y - \bar{y}\mathbb{1}\|^2}$$ always increases when we add a predictor.

Similarly: $$\log\hat{L}_\eta = -\frac{n}{2}\log\left(\frac{\|y - \hat{y}_\eta\|^2}{n}\right) - \frac{n}{2}\log(2\pi) - \frac{n}{2}$$

always increases when we add some predictors.

Penalized Maximized Log Likelihood

Problem: $\log\hat{L}_\eta$ always increases when we add some predictors,

$\Longrightarrow$ it cannot be used for model selection (same problem as $R^2$).

Idea: add a penalty term that depends on the size of the model,

and minimize: $$-\log\hat{L}_\eta + f(|\eta|) \quad\text{with } f \text{ increasing}$$

Penalized Maximized Log Likelihood

Minimize: $$-\log\hat{L}_\eta + f(|\eta|) \quad\text{with } f \text{ increasing}$$

For $\eta \subset \eta'$:

  • $-\log\hat{L}_{\eta'} \leq -\log\hat{L}_\eta$: the likelihood term decreases,

  • $f(|\eta'|) \geq f(|\eta|)$: the penalty increases.

Goal: find $f$ that compensates for the “overfit”, i.e. the noisy and insignificant increase of the likelihood when we add unrelated predictors.

Simulated Dataset

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit))
  return(res)
}

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##   minusLogLik          preds
## 1    1115.053            x_1
## 2    1403.284            x_2
## 3    1433.925         x_junk
## 4    1043.286        x_1+x_2
## 5    1115.030     x_1+x_junk
## 6    1403.113     x_2+x_junk
## 7    1043.266 x_1+x_2+x_junk

When we add a “junk” variable, $-\log\hat{L}_\eta$ decreases, but not by much.

Question: can we find a good penalty that compensates for this noisy increase?

AIC and BIC

Akaike Information Criterion - AIC

$$AIC = -2\log\hat{L}_\eta + 2|\eta|$$

  • $\hat{L}_\eta$ is the maximized likelihood of the model.

  • $|\eta|$ is the dimension of the model.

  • The smaller the better.

  • Asymptotic theoretical guarantees.

  • Easy to use and widespread.
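Note that R’s AIC() counts the variance $\hat{\sigma}^2$ (and the intercept) among the estimated parameters, so the $|\eta|$ it uses is the number of coefficients plus one; a quick check (sketch):

## How AIC() counts parameters in R
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
AIC(fit)
-2 * as.numeric(logLik(fit)) + 2 * attr(logLik(fit), "df")   # identical
attr(logLik(fit), "df")   # 3 coefficients + 1 for sigma^2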

Bayesian Information Criterion - BIC

$$BIC = -2\log\hat{L}_\eta + |\eta|\log(n)$$

  • $\hat{L}_\eta$ is the maximized likelihood of the model.

  • $|\eta|$ is the dimension of the model.

  • The smaller the better.

  • Based on an approximation of the marginal likelihood.

  • Easy to use and widespread.

See: Lebarbier, Mary-Huard. Le critère BIC : fondements théoriques et interprétation. 2004. [inria-00070685]

AIC vs BIC

$$AIC = -2\log\hat{L}_\eta + 2|\eta|$$

$$BIC = -2\log\hat{L}_\eta + |\eta|\log(n)$$

  • Both have asymptotic theoretical justifications.

  • BIC penalizes more than AIC $\Rightarrow$ it chooses smaller models.

  • Both widely used.

  • There are some (better) non-asymptotic criteria, see
    Giraud C. 2014. Introduction to High-Dimensional Statistics.

Simulated Dataset

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit),
                    AIC = AIC(fit),
                    BIC = BIC(fit))
  return(res)
}

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##   minusLogLik      AIC      BIC          preds
## 1    1115.053 2236.105 2248.749            x_1
## 2    1403.284 2812.567 2825.211            x_2
## 3    1433.925 2873.850 2886.493         x_junk
## 4    1043.286 2094.572 2111.430        x_1+x_2
## 5    1115.030 2238.060 2254.919     x_1+x_junk
## 6    1403.113 2814.225 2831.084     x_2+x_junk
## 7    1043.266 2096.533 2117.606 x_1+x_2+x_junk
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
##      minusLogLik              AIC              BIC 
## "x_1+x_2+x_junk"        "x_1+x_2"        "x_1+x_2"

Variable selection - Advertising data

## Advertising data
library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")

## Apply stats to all possible combinations
all_stats <- get_all_stats(ad[, c(4, 1:3)])
all_stats
##   minusLogLik       AIC       BIC              preds
## 1    519.0457 1044.0913 1053.9863                 TV
## 2    573.3369 1152.6738 1162.5687              radio
## 3    608.3357 1222.6714 1232.5663          newspaper
## 4    386.1970  780.3941  793.5874           TV+radio
## 5    509.8891 1027.7782 1040.9714       TV+newspaper
## 6    573.2361 1154.4723 1167.6655    radio+newspaper
## 7    386.1811  782.3622  798.8538 TV+radio+newspaper
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
##          minusLogLik                  AIC                  BIC 
## "TV+radio+newspaper"           "TV+radio"           "TV+radio"

Forward and Backward Search

Combinatorial problem

$p$ predictors $\Rightarrow$ $2^p$ possible models.

  • We cannot test all possible models.
  • “Manual” solution: forward or backward searches (automated with stepAIC in the sketch at the end of this section).

Forward search

Forward search - Init

## Full formula
full_formula <- formula(y_sim ~ x_1 + x_2 
                        + x_junk.1 + x_junk.2 + x_junk.3 
                        + x_junk.4 + x_junk.5 + x_junk.6)

## Complete search
2^8
## [1] 256
## No predictors
fit0 <- lm(y_sim ~ 1, data = data_all)

Forward search - Add term

## Add one
library(MASS)
addterm(fit0, full_formula)
## Single term additions
## 
## Model:
## y_sim ~ 1
##          Df Sum of Sq    RSS     AIC
## <none>                731.66 136.165
## x_1       1    466.43 265.23  87.429
## x_2       1     56.39 675.27 134.154
## x_junk.1  1     50.05 681.61 134.622
## x_junk.2  1     32.10 699.56 135.921
## x_junk.3  1      0.41 731.26 138.137
## x_junk.4  1      2.49 729.18 137.995
## x_junk.5  1     54.28 677.38 134.311
## x_junk.6  1     11.59 720.08 137.367

Forward search - Add term

## Add x_1
fit1 <- lm(y_sim ~ 1 + x_1, data = data_all)
## Add another
addterm(fit1, full_formula)
## Single term additions
## 
## Model:
## y_sim ~ 1 + x_1
##          Df Sum of Sq    RSS    AIC
## <none>                265.23 87.429
## x_2       1    58.314 206.92 77.014
## x_junk.1  1     6.038 259.19 88.277
## x_junk.2  1     0.159 265.07 89.399
## x_junk.3  1     0.376 264.85 89.358
## x_junk.4  1     0.403 264.83 89.353
## x_junk.5  1     2.587 262.64 88.939
## x_junk.6  1     6.791 258.44 88.132

Forward search - Add term

## Add x_2
fit2 <- lm(y_sim ~ 1 + x_1 + x_2, data = data_all)
## Add another
addterm(fit2, full_formula)
## Single term additions
## 
## Model:
## y_sim ~ 1 + x_1 + x_2
##          Df Sum of Sq    RSS    AIC
## <none>                206.92 77.014
## x_junk.1  1    3.0199 203.90 78.279
## x_junk.2  1    2.1852 204.73 78.484
## x_junk.3  1    0.8228 206.09 78.815
## x_junk.4  1    1.6416 205.27 78.616
## x_junk.5  1    4.9363 201.98 77.807
## x_junk.6  1    3.7814 203.13 78.092

Backward search

Backward search - Init

## Full formula
fit0 <- lm(y_sim ~ x_1 + x_2 
                        + x_junk.1 + x_junk.2 + x_junk.3 
                        + x_junk.4 + x_junk.5 + x_junk.6,
           data = data_all)
## drop one
dropterm(fit0)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 + x_junk.4 + 
##     x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                191.34  85.100
## x_1       1    336.07 527.41 133.797
## x_2       1     57.75 249.09  96.289
## x_junk.1  1      2.92 194.25  83.857
## x_junk.2  1      2.26 193.60  83.688
## x_junk.3  1      1.46 192.80  83.480
## x_junk.4  1      1.48 192.81  83.485
## x_junk.5  1      2.08 193.42  83.641
## x_junk.6  1      4.38 195.72  84.233

Backward search - Drop term

## drop x_junk.3
fit1 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.1 + x_junk.2
           + x_junk.4 + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit1)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.4 + x_junk.5 + 
##     x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                192.80  83.480
## x_1       1    335.92 528.71 131.921
## x_2       1     57.16 249.96  94.463
## x_junk.1  1      2.30 195.09  82.073
## x_junk.2  1      2.18 194.98  82.043
## x_junk.4  1      1.15 193.95  81.779
## x_junk.5  1      2.27 195.06  82.065
## x_junk.6  1      4.73 197.52  82.691

Backward search - Drop term

## drop x_junk.4
fit2 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.1 + x_junk.2
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit2)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                193.95  81.779
## x_1       1    335.82 529.77 130.021
## x_2       1     56.44 250.39  92.550
## x_junk.1  1      1.95 195.90  80.278
## x_junk.2  1      2.27 196.21  80.359
## x_junk.5  1      2.96 196.91  80.537
## x_junk.6  1      4.74 198.69  80.986

Backward search - Drop term

## drop x_junk.1
fit3 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.2
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit3)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.2 + x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                195.90  80.278
## x_1       1    368.05 563.94 131.147
## x_2       1     59.57 255.47  91.554
## x_junk.2  1      2.46 198.36  78.903
## x_junk.5  1      3.71 199.61  79.217
## x_junk.6  1      4.71 200.60  79.466

Backward search - Drop term

## drop x_junk.2
fit4 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit4)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                198.36  78.903
## x_1       1    422.20 620.56 133.930
## x_2       1     57.59 255.95  89.648
## x_junk.5  1      4.78 203.13  78.092
## x_junk.6  1      3.62 201.98  77.807

Backward search - Drop term

## drop x_junk.6
fit5 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.5, data = data_all)
## Drop another
dropterm(fit5)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.5
##          Df Sum of Sq    RSS     AIC
## <none>                201.98  77.807
## x_1       1    426.25 628.23 132.544
## x_2       1     60.66 262.64  88.939
## x_junk.5  1      4.94 206.92  77.014

Backward search - Drop term

## drop x_junk.5
fit6 <- lm(y_sim ~ x_1 + x_2,
           data = data_all)
## Drop another
dropterm(fit6)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2
##        Df Sum of Sq    RSS     AIC
## <none>              206.92  77.014
## x_1     1    468.35 675.27 134.154
## x_2     1     58.31 265.23  87.429
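These add/drop loops can be automated, e.g. with stepAIC from MASS (or step from base R); a minimal sketch with the same full model as above:

## Automated backward search
library(MASS)
fit_full <- lm(y_sim ~ x_1 + x_2 
               + x_junk.1 + x_junk.2 + x_junk.3 
               + x_junk.4 + x_junk.5 + x_junk.6,
               data = data_all)
fit_sel <- stepAIC(fit_full, direction = "backward", trace = FALSE)
formula(fit_sel)   # ends at y_sim ~ x_1 + x_2, as in the manual search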

Forward and backward searches

  • Both are heuristics.
  • There is no guarantee that the two searches converge to the same solution.
  • No theoretical guarantees on the selected model.
  • Model selection is still an active area of research.
    • See e.g.: the LASSO and other penalized criteria (M2).

AIC vs log FPE

Final Prediction Error - FPE

Mallow’s $C_p$: $$C_p(\hat{y}_\eta) = \frac{1}{n}\left(\|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2_{full}\right)$$

FPE: $$FPE(\hat{y}_\eta) = \|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2_\eta \quad\text{with}\quad \hat{\sigma}^2_\eta = \frac{\|y - \hat{y}_\eta\|^2}{n - |\eta|}$$

Hence: $$FPE(\hat{y}_\eta) = \|y - \hat{y}_\eta\|^2\left(1 + \frac{2|\eta|}{n - |\eta|}\right)$$
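Following the formula above, the FPE of a fitted model is straightforward to compute; a minimal sketch:

## FPE criterion
FPE <- function(fit) {
  n <- nobs(fit)
  p <- n - df.residual(fit)   # |eta|
  RSS <- deviance(fit)
  return(RSS * (1 + 2 * p / (n - p)))
}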

AIC vs Log FPE

$$\log \frac{1}{n} FPE(\hat{y}_\eta) = \log\left(\frac{1}{n}\|y - \hat{y}_\eta\|^2\left(1 + \frac{2|\eta|}{n - |\eta|}\right)\right) = \log\left(\frac{1}{n}\|y - \hat{y}_\eta\|^2\right) + \log\left(1 + \frac{2|\eta|}{n - |\eta|}\right)$$

But: $$n\log\left(\frac{\|y - \hat{y}_\eta\|^2}{n}\right) = -2\log\hat{L}_\eta - n\log(2\pi) - n$$

AIC vs Log FPE

Hence: $$n\log\frac{1}{n}FPE(\hat{y}_\eta) + n\log(2\pi) + n = -2\log\hat{L}_\eta + n\log\left(1 + \frac{2|\eta|}{n - |\eta|}\right)$$

When $n$ is large: $$n\log\left(1 + \frac{2|\eta|}{n - |\eta|}\right) \approx n\,\frac{2|\eta|}{n - |\eta|} \approx 2|\eta|$$
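For instance, with $n = 500$ and $|\eta| = 3$, the two sides are already very close (sketch):

## Numerical check of the asymptotic equivalence
n <- 500; p <- 3
n * log(1 + 2 * p / (n - p))   # ~ 6.00, close to 2 * p = 6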

AIC vs Log FPE

Hence: $$n\log\frac{1}{n}FPE(\hat{y}_\eta) + n\log(2\pi) + n \approx -2\log\hat{L}_\eta + 2|\eta| = AIC(\hat{y}_\eta)$$

The log FPE and AIC criteria select the same models (asymptotically).

BIC and Laplace Approximation

Bayesian Model Selection

In a Bayesian setting,
we have priors on the models $\eta$ and on the parameters $\theta_\eta = (\beta_\eta, \sigma^2_\eta)$: $$\eta \sim \pi_\eta, \qquad \theta_\eta \sim p(\theta \mid \eta)$$

The likelihood given the parameters and a model is: $$p(y \mid \theta, \eta)$$

the marginal likelihood given a model is: $$p(y \mid \eta) = \int p(y \mid \theta, \eta)\,p(\theta \mid \eta)\,d\theta$$

and the posterior probability of a model is: $$p(\eta \mid y).$$

Bayesian Model Selection

Goal: find the model with the highest posterior probability: $$\hat{\eta} = \underset{\eta}{\operatorname{argmax}}\ p(\eta \mid y)$$

Assume that all models have the same prior probability: $\pi_\eta \equiv \pi$.

Then, from Bayes’ rule: $$p(\eta \mid y) = p(y \mid \eta)\,\pi_\eta / p(y) = p(y \mid \eta) \times \pi / p(y)$$

And: $$\hat{\eta} = \underset{\eta}{\operatorname{argmax}}\ p(\eta \mid y) = \underset{\eta}{\operatorname{argmax}}\ p(y \mid \eta).$$

Bayesian Model Selection

Goal: find the model with the highest posterior probability: $$\hat{\eta} = \underset{\eta}{\operatorname{argmax}}\ p(\eta \mid y) = \underset{\eta}{\operatorname{argmax}}\ p(y \mid \eta) = \underset{\eta}{\operatorname{argmax}} \int p(y \mid \theta, \eta)\,p(\theta \mid \eta)\,d\theta$$

Idea:
Assume that $$p(y \mid \theta, \eta)\,p(\theta \mid \eta)$$

is concentrated around its maximum $p(y \mid \theta^*_\eta, \eta)\,p(\theta^*_\eta \mid \eta)$,
and apply the Laplace Approximation.

Laplace Approximation

Let $L: \mathbb{R}^q \to \mathbb{R}$ be a three times differentiable function, with a unique maximum $u^*$. Then: $$\int_{\mathbb{R}^q} e^{nL(u)}\,du = \left(\frac{2\pi}{n}\right)^{q/2} \left|-L''(u^*)\right|^{-1/2} e^{nL(u^*)}\left(1 + o(1)\right)$$

Applied to: $$L(\theta) = \frac{1}{n}\log p(y \mid \theta, \eta) + \frac{1}{n}\log p(\theta \mid \eta)$$

and assuming $\log p(y \mid \theta^*, \eta) \approx \log p(y \mid \hat{\theta}, \eta)$, we get: $$\log \int p(y \mid \theta, \eta)\,p(\theta \mid \eta)\,d\theta \approx \log p(y \mid \hat{\theta}, \eta) - \frac{|\eta|}{2}\log(n).$$

BIC

Hence: $$\hat{\eta} = \underset{\eta}{\operatorname{argmax}}\ p(\eta \mid y) = \underset{\eta}{\operatorname{argmax}}\ p(y \mid \eta) \approx \underset{\eta}{\operatorname{argmax}} \left\{ \log p(y \mid \hat{\theta}, \eta) - \frac{|\eta|}{2}\log(n) \right\} = \underset{\eta}{\operatorname{argmin}} \left\{ -2\log p(y \mid \hat{\theta}, \eta) + |\eta|\log(n) \right\} = \underset{\eta}{\operatorname{argmin}} \left\{ -2\log\hat{L}_\eta + |\eta|\log(n) \right\} = \underset{\eta}{\operatorname{argmin}}\ BIC(\hat{y}_\eta)$$

BIC is an approximation of the marginal likelihood.