Multiple Regression - Variable Selection

Paul Bastide - Ibrahim Bouzalmat

20/03/2024

What we know

Gaussian Model

Model:

$$y = X\beta + \epsilon$$

  • $y$ random vector of $n$ responses
  • $X$ non-random $n \times p$ matrix of predictors
  • $\epsilon$ random vector of $n$ errors
  • $\beta$ non-random, unknown vector of $p$ coefficients

Assumptions:

  • (H1) $\operatorname{rg}(X) = p$
  • (H2) $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$

Distribution of the estimators

Distribution of $\hat{\beta}$: $$\hat{\beta} \sim \mathcal{N}\left(\beta,\ \sigma^2(X^TX)^{-1}\right) \quad\text{and}\quad \frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2\left[(X^TX)^{-1}\right]_{kk}}} \sim \mathcal{T}_{n-p}.$$

Distribution of $\hat{\sigma}^2$: $$\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}.$$

Student t tests

Hypothesis: $$H_0: \beta_k = 0 \quad\text{vs}\quad H_1: \beta_k \neq 0$$

Test statistic: $$T_k = \frac{\hat{\beta}_k}{\hat{\sigma}_k} \underset{H_0}{\sim} \mathcal{T}_{n-p}, \qquad \hat{\sigma}_k^2 = \hat{\sigma}^2\left[(X^TX)^{-1}\right]_{kk}$$

Rejection region: $$\mathbb{P}[\text{Reject} \mid H_0 \text{ true}] \leq \alpha \quad\Longrightarrow\quad \mathcal{R}_\alpha = \left\{ t \in \mathbb{R} \ \middle|\ |t| \geq t_{n-p}(1 - \alpha/2) \right\}$$

p-value: $$p = \mathbb{P}_{H_0}\left[|T_k| > T_k^{obs}\right]$$

Problems with the t-test

Simulated Dataset

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

set.seed(1289)

## Predictors
n <- 500
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)

## Noise
eps <- rnorm(n, mean = 0, sd = 2)

## Model sim
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

## Useless predictor
x_junk_1 <- runif(n, min = -2, max = 2)

Simulated Dataset - Fit

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

fit <- lm(y_sim ~ x_1 + x_2 + x_junk_1)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ x_1 + x_2 + x_junk_1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.275 -1.453 -0.028  1.371  4.893 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.82789    0.08814  -9.393   <2e-16 ***
## x_1          3.01467    0.07546  39.952   <2e-16 ***
## x_2         -0.95917    0.07469 -12.842   <2e-16 ***
## x_junk_1     0.01534    0.07791   0.197    0.844    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.957 on 496 degrees of freedom
## Multiple R-squared:  0.7905, Adjusted R-squared:  0.7893 
## F-statistic:   624 on 3 and 496 DF,  p-value: < 2.2e-16
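As a sanity check of the formulas above, the $T_k$ statistic and its p-value can be recomputed by hand from the fitted model; a minimal sketch, using the fact that vcov(fit) returns $\hat{\sigma}^2(X^TX)^{-1}$:

## t test for beta_1 recomputed by hand
beta_hat <- coef(fit)["x_1"]
sigma_k <- sqrt(diag(vcov(fit))["x_1"])   # hat(sigma)_k
t_obs <- beta_hat / sigma_k               # matches the "t value" column above
2 * pt(abs(t_obs), df = df.residual(fit), lower.tail = FALSE)   # p-value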

Simulated Dataset

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

## Other Useless Predictors
p_junk <- 100
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2),
                 ncol = p_junk)

## Data frame
data_junk <- data.frame(y_sim = y_sim,
                        x_junk = x_junk)

Simulated Dataset - Fit - Junk

fit <- lm(y_sim ~ -1 + ., data = data_junk)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ -1 + ., data = data_junk)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8683  -3.0605  -0.2674   2.3554  11.0821 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## x_junk.1    0.2129516  0.1828010   1.165   0.2447  
## x_junk.2   -0.1075770  0.1919617  -0.560   0.5755  
## x_junk.3    0.4322452  0.1945466   2.222   0.0269 *
## x_junk.4    0.1471909  0.1886285   0.780   0.4357  
## x_junk.5   -0.1110445  0.1848774  -0.601   0.5484  
## x_junk.6    0.1276583  0.1829333   0.698   0.4857  
## x_junk.7    0.0079412  0.1944593   0.041   0.9674  
## x_junk.8   -0.1126224  0.1854571  -0.607   0.5440  
## x_junk.9   -0.0560611  0.1918247  -0.292   0.7702  
## x_junk.10  -0.3215171  0.1883253  -1.707   0.0886 .
## x_junk.11  -0.0875118  0.1891359  -0.463   0.6438  
## x_junk.12  -0.0822697  0.1893373  -0.435   0.6641  
## x_junk.13   0.0862599  0.1874289   0.460   0.6456  
## x_junk.14  -0.1322800  0.1895130  -0.698   0.4856  
## x_junk.15   0.1748496  0.1751535   0.998   0.3188  
## x_junk.16  -0.2152065  0.1795766  -1.198   0.2315  
## x_junk.17   0.0987989  0.1868782   0.529   0.5973  
## x_junk.18  -0.0589122  0.1964012  -0.300   0.7644  
## x_junk.19   0.0407544  0.1896260   0.215   0.8299  
## x_junk.20   0.2788872  0.1817698   1.534   0.1257  
## x_junk.21   0.0780682  0.1901043   0.411   0.6815  
## x_junk.22   0.0001428  0.1936429   0.001   0.9994  
## x_junk.23  -0.2146891  0.1860330  -1.154   0.2492  
## x_junk.24   0.1135748  0.1794805   0.633   0.5272  
## x_junk.25   0.1797087  0.1849661   0.972   0.3318  
## x_junk.26  -0.0095168  0.1885643  -0.050   0.9598  
## x_junk.27  -0.0697160  0.1782409  -0.391   0.6959  
## x_junk.28  -0.1294379  0.1832707  -0.706   0.4804  
## x_junk.29  -0.0138295  0.1830424  -0.076   0.9398  
## x_junk.30   0.2922539  0.1918362   1.523   0.1284  
## x_junk.31   0.1576348  0.1867874   0.844   0.3992  
## x_junk.32   0.0598461  0.1892184   0.316   0.7520  
## x_junk.33   0.2018882  0.1812726   1.114   0.2661  
## x_junk.34  -0.0116431  0.1840672  -0.063   0.9496  
## x_junk.35   0.0713938  0.1833419   0.389   0.6972  
## x_junk.36   0.0190741  0.1798446   0.106   0.9156  
## x_junk.37  -0.0238114  0.1885668  -0.126   0.8996  
## x_junk.38  -0.1867089  0.1838147  -1.016   0.3104  
## x_junk.39  -0.0777722  0.1862288  -0.418   0.6765  
## x_junk.40  -0.0434927  0.1859361  -0.234   0.8152  
## x_junk.41   0.0814161  0.1904542   0.427   0.6693  
## x_junk.42   0.1292109  0.1870557   0.691   0.4901  
## x_junk.43  -0.3022911  0.1835507  -1.647   0.1004  
## x_junk.44   0.0831695  0.1823006   0.456   0.6485  
## x_junk.45   0.1332532  0.1869381   0.713   0.4764  
## x_junk.46  -0.1348223  0.1838432  -0.733   0.4638  
## x_junk.47  -0.0218524  0.1972836  -0.111   0.9119  
## x_junk.48   0.2795708  0.1854204   1.508   0.1324  
## x_junk.49  -0.1531037  0.1798952  -0.851   0.3952  
## x_junk.50   0.0520440  0.1852843   0.281   0.7789  
## x_junk.51   0.1682396  0.1835292   0.917   0.3599  
## x_junk.52  -0.1190719  0.1853241  -0.643   0.5209  
## x_junk.53  -0.0989078  0.1857934  -0.532   0.5948  
## x_junk.54  -0.1325051  0.1840157  -0.720   0.4719  
## x_junk.55   0.1006848  0.1940173   0.519   0.6041  
## x_junk.56   0.2034148  0.1764952   1.153   0.2498  
## x_junk.57   0.1153209  0.1766600   0.653   0.5143  
## x_junk.58  -0.0463597  0.1772648  -0.262   0.7938  
## x_junk.59   0.0240049  0.1889276   0.127   0.8990  
## x_junk.60   0.3726537  0.1909670   1.951   0.0517 .
## x_junk.61   0.0250128  0.1864171   0.134   0.8933  
## x_junk.62  -0.2615601  0.1901999  -1.375   0.1698  
## x_junk.63   0.0774898  0.1871022   0.414   0.6790  
## x_junk.64   0.0545841  0.1966734   0.278   0.7815  
## x_junk.65   0.0725361  0.1882312   0.385   0.7002  
## x_junk.66  -0.1206593  0.1897174  -0.636   0.5251  
## x_junk.67  -0.3336932  0.1818668  -1.835   0.0673 .
## x_junk.68   0.0981128  0.1912111   0.513   0.6082  
## x_junk.69  -0.0153031  0.1858223  -0.082   0.9344  
## x_junk.70   0.1886025  0.1830855   1.030   0.3036  
## x_junk.71  -0.0695791  0.1906992  -0.365   0.7154  
## x_junk.72  -0.1030034  0.1840569  -0.560   0.5760  
## x_junk.73   0.0084075  0.1823678   0.046   0.9633  
## x_junk.74   0.2999893  0.1783620   1.682   0.0934 .
## x_junk.75   0.3355257  0.1843357   1.820   0.0695 .
## x_junk.76  -0.1894886  0.1956830  -0.968   0.3335  
## x_junk.77  -0.1183169  0.1843228  -0.642   0.5213  
## x_junk.78  -0.1651666  0.1832443  -0.901   0.3679  
## x_junk.79   0.0947620  0.1867396   0.507   0.6121  
## x_junk.80   0.2538003  0.1910030   1.329   0.1847  
## x_junk.81   0.0711922  0.1917948   0.371   0.7107  
## x_junk.82  -0.0098761  0.2003307  -0.049   0.9607  
## x_junk.83   0.0102922  0.1798480   0.057   0.9544  
## x_junk.84   0.2163918  0.1807194   1.197   0.2319  
## x_junk.85  -0.2514098  0.1881750  -1.336   0.1823  
## x_junk.86   0.0181126  0.1832811   0.099   0.9213  
## x_junk.87  -0.0907617  0.1809516  -0.502   0.6162  
## x_junk.88   0.0354845  0.1869592   0.190   0.8496  
## x_junk.89  -0.3283331  0.1867203  -1.758   0.0794 .
## x_junk.90  -0.2284999  0.1895045  -1.206   0.2286  
## x_junk.91  -0.0575054  0.1800322  -0.319   0.7496  
## x_junk.92   0.0612578  0.1799226   0.340   0.7337  
## x_junk.93  -0.0192369  0.1866956  -0.103   0.9180  
## x_junk.94  -0.0615478  0.1828148  -0.337   0.7365  
## x_junk.95   0.3675294  0.1778881   2.066   0.0395 *
## x_junk.96  -0.0015247  0.1899413  -0.008   0.9936  
## x_junk.97  -0.1364003  0.1779889  -0.766   0.4439  
## x_junk.98   0.4422968  0.1888431   2.342   0.0197 *
## x_junk.99   0.2366755  0.1815971   1.303   0.1932  
## x_junk.100  0.0010360  0.1905522   0.005   0.9957  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.311 on 400 degrees of freedom
## Multiple R-squared:  0.1889, Adjusted R-squared:  -0.01383 
## F-statistic: 0.9318 on 100 and 400 DF,  p-value: 0.6596

Simulated Dataset - Fit - Junk

p_values <- summary(fit)$coefficients[, 4]
hist(p_values, col = "lightblue", breaks = 20)

Simulated Dataset - Fit - Junk

sum(p_values <= 0.05)
## [1] 3
summary(fit)$coefficients[p_values <= 0.05, ]
##            Estimate Std. Error  t value   Pr(>|t|)
## x_junk.3  0.4322452  0.1945466 2.221807 0.02685525
## x_junk.95 0.3675294  0.1778881 2.066071 0.03946512
## x_junk.98 0.4422968  0.1888431 2.342138 0.01966334

 

Reject Region: $$\mathbb{P}[\text{Reject} \mid H_0 \text{ true}] \leq \alpha$$

Just by chance, about 5% of the tests come out as false discoveries (here, 3 out of 100 junk predictors).

Joint Distribution of the Coefficients - $\sigma^2$ unknown

Joint distribution of $\hat{\beta}$

When $\sigma^2$ is known:

$$\hat{\beta} \sim \mathcal{N}\left(\beta,\ \sigma^2(X^TX)^{-1}\right).$$

When $\sigma^2$ is unknown: $$\frac{1}{p\hat{\sigma}^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta) \sim \mathcal{F}^p_{n-p}$$

with $\mathcal{F}^p_{n-p}$ the Fisher distribution.

Reminder: Fisher Distribution

Let $U_1 \sim \chi^2_{p_1}$ and $U_2 \sim \chi^2_{p_2}$, with $U_1$ and $U_2$ independent. Then $$F = \frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2}$$
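This reminder can be checked by a quick simulation; a minimal sketch, with arbitrary degrees of freedom $p_1 = 3$ and $p_2 = 10$:

## Simulate the ratio of two independent scaled chi-square variables
u_1 <- rchisq(1e5, df = 3)
u_2 <- rchisq(1e5, df = 10)
f_sim <- (u_1 / 3) / (u_2 / 10)
## Compare with the theoretical F(3, 10) quantiles
qqplot(qf(ppoints(1e5), df1 = 3, df2 = 10), f_sim)
abline(0, 1)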

Joint distribution of $\hat{\beta}$ - Proof - Hints

Let’s show: $$\frac{1}{p\hat{\sigma}^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta) \sim \mathcal{F}^p_{n-p}$$

Hints:
$$\hat{\beta} \sim \mathcal{N}\left(\beta,\ \sigma^2(X^TX)^{-1}\right) \quad\text{and}\quad \frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$$

If $X \sim \mathcal{N}(\mu, \Sigma)$, then $(X - \mu)^T\Sigma^{-1}(X - \mu) \sim \chi^2_p$

If $U_1 \sim \chi^2_{p_1}$ and $U_2 \sim \chi^2_{p_2}$ are independent, then $\frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2}$

Joint distribution of $\hat{\beta}$ - Proof

$$\hat{\beta} \sim \mathcal{N}\left(\beta,\ \sigma^2(X^TX)^{-1}\right) \implies \frac{1}{\sigma^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta) \sim \chi^2_p$$

$$\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}, \text{ independent from } \hat{\beta}$$

Hence: $$\frac{\frac{1}{\sigma^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta)\,/\,p}{\frac{(n-p)\hat{\sigma}^2}{\sigma^2}\,/\,(n-p)} = \frac{1}{p\hat{\sigma}^2}(\hat{\beta} - \beta)^T(X^TX)(\hat{\beta} - \beta) \sim \mathcal{F}^p_{n-p}.$$

Global Fisher F-test

Global Fisher F-test

Hypothesis: $$H_0: \beta_1 = \dots = \beta_p = 0 \quad\text{vs}\quad H_1: \exists\, k \;|\; \beta_k \neq 0$$

Test Statistic: $$F = \frac{1}{p\hat{\sigma}^2}\hat{\beta}^T(X^TX)\hat{\beta} \underset{H_0}{\sim} \mathcal{F}^p_{n-p}$$

Reject Region: $$\mathbb{P}[\text{Reject} \mid H_0 \text{ true}] \leq \alpha \quad\Longrightarrow\quad \mathcal{R}_\alpha = \left\{ f \in \mathbb{R}_+ \ \middle|\ f \geq f^p_{n-p}(1 - \alpha) \right\}$$

p-value: $$p = \mathbb{P}_{H_0}\left[F > F^{obs}\right]$$

Simulated Dataset - Fit - Junk

fit <- lm(y_sim ~ -1 + ., data = data_junk)
f_obs <- summary(fit)$fstatistic
f_obs
##      value      numdf      dendf 
##   0.931791 100.000000 400.000000
p_value_f <- 1 - pf(f_obs[1], f_obs[2], f_obs[3])
p_value_f
##     value 
## 0.6596268

We do not reject the null hypothesis that all the (junk) coefficients are zero (phew).

Fisher F-Test - Geometry

$$F = \frac{1}{p\hat{\sigma}^2}\hat{\beta}^T(X^TX)\hat{\beta} = \frac{\|X\hat{\beta}\|^2/p}{\|\hat{\epsilon}\|^2/(n-p)} = \frac{\|\hat{y}\|^2/p}{\|\hat{\epsilon}\|^2/(n-p)} = \frac{\|\hat{y}\|^2/p}{\|y - \hat{y}\|^2/(n-p)}$$

  • $H_0: \beta_1 = \dots = \beta_p = 0$, i.e. the model is useless.
  • If $\|\hat{\epsilon}\|^2$ is small, then the error is small,
    and $F$ tends to be large, i.e. we tend to reject $H_0$:
    the model is useful.
  • If $\|\hat{\epsilon}\|^2$ is large, then the error is large,
    and $F$ tends to be small, i.e. we tend not to reject $H_0$:
    the model is rather useless.

Geometrical interpretation

Good model

  • $y$ not too far from $\mathcal{M}(X)$.

  • $\|\hat{\epsilon}\|^2$ not too large compared to $\|\hat{y}\|^2$.

  • $F$ tends to be large.

Bad model

  • $y$ almost orthogonal to $\mathcal{M}(X)$.

  • $\|\hat{\epsilon}\|^2$ rather large compared to $\|\hat{y}\|^2$.

  • $F$ tends to be small.

Simulated Dataset - True and Junk

## Data frame
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)
## Fit
fit <- lm(y_sim ~ ., data = data_all)
## multiple t tests
p_values_t <- summary(fit)$coefficients[, 4]
names(p_values_t[p_values_t <= 0.05])
## [1] "(Intercept)" "x_1"         "x_2"         "x_junk.27"   "x_junk.76"  
## [6] "x_junk.80"
## f test
f_obs <- summary(fit)$fstatistic
p_value_f <- 1 - pf(f_obs[1], f_obs[2], f_obs[3])
p_value_f
## value 
##     0

Nested Models Fisher F-test

Nested Models

Can we test: $$H_0: \beta_{p_0+1} = \dots = \beta_p = 0 \quad\text{vs}\quad H_1: \exists\, k \in \{p_0+1, \dots, p\} \;|\; \beta_k \neq 0\ ?$$

I.e. decide between the full model: $$y = X\beta + \epsilon$$

and the nested model: $$y = X_0\beta_0 + \epsilon$$

with $X = (X_0\ X_1)$, $\operatorname{rg}(X) = p$ and $\operatorname{rg}(X_0) = p_0$.

Geometrical interpretation

Nested Models

Idea: Compare

  • $\|\hat{y} - \hat{y}_0\|^2$, the distance between $\hat{y}$ and $\hat{y}_0$, to

  • $\|y - \hat{y}\|^2$, the residual error.

Then:

  • If $\|\hat{y} - \hat{y}_0\|^2$ is small compared to $\|y - \hat{y}\|^2$, then “$\hat{y} \approx \hat{y}_0$” and the null model is enough.
  • If $\|\hat{y} - \hat{y}_0\|^2$ is large compared to $\|y - \hat{y}\|^2$, then the full model is useful.

Geometrical interpretation

Good model

  • $\hat{y}$ quite different from $\hat{y}_0$.

  • The full model does add information.

Bad model

  • $\hat{y}$ similar to $\hat{y}_0$.

  • The full model does not add much information.

Nested Models - F test

$$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p - p_0)}{\|y - \hat{y}\|^2/(n - p)}$$

  • When $F$ is large, the full model is useful.
  • When $F$ is small, the null model is enough.
  • Distribution of $F$?

Distribution of F - 1/2

$$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p - p_0)}{\|y - \hat{y}\|^2/(n - p)}$$

Let $P$ be the projection on $\mathcal{M}(X)$ and $P_0$ the projection on $\mathcal{M}(X_0)$.

$$\hat{y} - \hat{y}_0 = Py - P_0y = Py - P_0Py = (I_n - P_0)Py = P_0^\perp Py \in \mathcal{M}(X_0)^\perp \cap \mathcal{M}(X)$$

$$y - \hat{y} = (I_n - P)y = P^\perp y \in \mathcal{M}(X)^\perp$$

$\Longrightarrow$ $\hat{y} - \hat{y}_0$ and $y - \hat{y}$ are orthogonal,
i.e. their covariance is zero,
i.e. (Gaussian assumption) they are independent.

Distribution of F - 2/2

From Cochran’s theorem: $$\frac{1}{\sigma^2}\|y - \hat{y}\|^2 = \frac{1}{\sigma^2}\|P^\perp y\|^2 = \frac{1}{\sigma^2}\|P^\perp \epsilon\|^2 \sim \chi^2_{n-p}$$

$$\frac{1}{\sigma^2}\|\hat{y} - \hat{y}_0 - P_0^\perp PX\beta\|^2 = \frac{1}{\sigma^2}\|P_0^\perp P(y - X\beta)\|^2 \sim \chi^2_{p-p_0}$$

($P_0^\perp P$ is a projection onto a space of dimension $p - p_0$.)

If $H_0$ is true, then $P_0^\perp PX\beta = 0$, hence:

$$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p - p_0)}{\|y - \hat{y}\|^2/(n - p)} \underset{H_0}{\sim} \mathcal{F}^{p-p_0}_{n-p}.$$

Nested F test

From Pythagoras’ theorem: $$\|y - \hat{y}_0\|^2 = \|y - Py + Py - P_0y\|^2 = \|P^\perp y + P_0^\perp Py\|^2 = \|P^\perp y\|^2 + \|P_0^\perp Py\|^2 = \|y - \hat{y}\|^2 + \|\hat{y} - \hat{y}_0\|^2$$

i.e. $$\|\hat{y} - \hat{y}_0\|^2 = \|y - \hat{y}_0\|^2 - \|y - \hat{y}\|^2 = RSS_0 - RSS$$

Hence: $$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p - p_0)}{\|y - \hat{y}\|^2/(n - p)} = \frac{n - p}{p - p_0}\,\frac{RSS_0 - RSS}{RSS}$$

Nested F test

To test: $$H_0: \beta_{p_0+1} = \dots = \beta_p = 0 \quad\text{vs}\quad H_1: \exists\, k \in \{p_0+1, \dots, p\} \;|\; \beta_k \neq 0$$

we use the $F$ statistic: $$F = \frac{\|\hat{y} - \hat{y}_0\|^2/(p - p_0)}{\|y - \hat{y}\|^2/(n - p)} = \frac{n - p}{p - p_0}\,\frac{RSS_0 - RSS}{RSS} \underset{H_0}{\sim} \mathcal{F}^{p-p_0}_{n-p}.$$
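This statistic can be computed by hand from two fitted models and checked against anova(); a minimal sketch, reusing y_sim and data_all from the simulations above:

## Nested F test by hand
fit_0 <- lm(y_sim ~ x_1 + x_2, data = data_all)             # null model
fit_1 <- lm(y_sim ~ x_1 + x_2 + x_junk.1, data = data_all)  # full model
RSS_0 <- deviance(fit_0); RSS_1 <- deviance(fit_1)
n <- nobs(fit_1)
p <- n - df.residual(fit_1); p_0 <- n - df.residual(fit_0)
F_obs <- (n - p) / (p - p_0) * (RSS_0 - RSS_1) / RSS_1
pf(F_obs, p - p_0, n - p, lower.tail = FALSE)  # matches anova(fit_0, fit_1)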

Simulated Dataset - True vs Junk

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

## Data frame
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)
## Fit true
fit_small <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Fit true and junk
fit_all <- lm(y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3, data = data_all)
## Comparison
anova(fit_small, fit_all)
## Analysis of Variance Table
## 
## Model 1: y_sim ~ x_1 + x_2
## Model 2: y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    497 1900.5                           
## 2    494 1883.2  3    17.297 1.5124 0.2104

Simulated Dataset - True vs Junk

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

## Fit true
fit_small <- lm(y_sim ~ x_1 + x_2, data = data_all)
## Fit all
fit_all <- lm(y_sim ~ ., data = data_all)
## Comparison
anova(fit_small, fit_all)$`Pr(>F)`[2]
## [1] 0.3548787

We do not reject the null hypothesis that the first two variables are enough to explain the data.

Full F test

Assuming that the model has an intercept, we often test: $$H_0: \beta_2 = \dots = \beta_p = 0 \quad\text{vs}\quad H_1: \exists\, k \in \{2, \dots, p\} \;|\; \beta_k \neq 0$$

The associated $F$ statistic is ($p_0 = 1$): $$F = \frac{\|\hat{y} - \bar{y}\mathbb{1}\|^2/(p - 1)}{\|y - \hat{y}\|^2/(n - p)} \underset{H_0}{\sim} \mathcal{F}^{p-1}_{n-p}.$$

That is the default test returned by R.

Full F test

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ x_1 + x_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2523 -1.4517 -0.0302  1.3494  4.8711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.82877    0.08794  -9.424   <2e-16 ***
## x_1          3.01415    0.07534  40.008   <2e-16 ***
## x_2         -0.95922    0.07462 -12.855   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.955 on 497 degrees of freedom
## Multiple R-squared:  0.7905, Adjusted R-squared:  0.7897 
## F-statistic: 937.8 on 2 and 497 DF,  p-value: < 2.2e-16

One coefficient F test

If $p_0 = p - 1$, we test: $$H_0: \beta_p = 0 \quad\text{vs}\quad H_1: \beta_p \neq 0$$

The associated $F$ statistic is ($p_0 = p - 1$, with $\hat{y}_0$ the fit without the $p$-th predictor): $$F = \frac{\|\hat{y} - \hat{y}_0\|^2}{\|y - \hat{y}\|^2/(n - p)} \underset{H_0}{\sim} \mathcal{F}^1_{n-p}.$$

It can be shown that: $$F = T^2 \quad\text{with}\quad T = \frac{\hat{\beta}_p}{\hat{\sigma}_p}$$

so that it is equivalent to the t-test.
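A quick numerical check of this equivalence, reusing the simulated data_all from above (a sketch):

## F test of one coefficient vs squared t test
fit_0 <- lm(y_sim ~ x_1, data = data_all)
fit_1 <- lm(y_sim ~ x_1 + x_2, data = data_all)
F_obs <- anova(fit_0, fit_1)$F[2]
t_obs <- summary(fit_1)$coefficients["x_2", "t value"]
F_obs - t_obs^2   # numerically zero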

Variable Selection

How to choose the “best” model?

Model: $$y_i = \beta_1 x_{i,1} + \dots + \beta_p x_{i,p} + \epsilon_i$$

Problem:
Can we choose the “best” subset of non-zero $\beta_k$?

I.e. can we find the predictors that really have an impact on $y$?

F-test on all subsets? Not practical: too many subsets, and a multiple testing problem.

Idea:
Find a “score” to assess the quality of a given fit.

Variable selection - $R^2$

$$R^2 = 1 - \frac{RSS}{TSS}$$

## Function to get statistics for one fit
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared)
  return(res)
}
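The helper get_all_stats, used repeatedly below, is not shown on the slides; here is a minimal sketch of what it could look like, assuming the response is the first column of the data frame. It picks up whatever get_stats_fit is currently defined, so it works unchanged as that function is redefined on the following slides.

## Hypothetical helper: fit every non-empty subset of predictors
## and collect the statistics returned by get_stats_fit()
get_all_stats <- function(data) {
  response <- colnames(data)[1]
  preds <- colnames(data)[-1]
  ## all non-empty subsets, smallest first
  subsets <- unlist(lapply(seq_along(preds),
                           function(k) combn(preds, k, simplify = FALSE)),
                    recursive = FALSE)
  res <- do.call(rbind, lapply(subsets, function(s) {
    fit <- lm(reformulate(s, response = response), data = data)
    cbind(get_stats_fit(fit), preds = paste(s, collapse = "+"))
  }))
  return(res)
}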

$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

Variable selection - $R^2$

## Fit true
get_stats_fit(lm(y_sim ~ x_1 + x_2, data = data_all))
##          R2
## 1 0.7905255
## Fit bigger
get_stats_fit(lm(y_sim ~ x_1 + x_2 + x_junk.1, data = data_all))
##        R2
## 1 0.79133
## Fit wrong
get_stats_fit(lm(y_sim ~ x_1 + x_junk.1, data = data_all))
##          R2
## 1 0.7252314

Variable selection - $R^2$

## Only one "junk" predictor
data_sub <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk_1)
## Fit all possible models
get_all_stats(data_sub)
##             R2          preds
## 1 0.7208723311            x_1
## 2 0.1158793338            x_2
## 3 0.0005972444         x_junk
## 4 0.7905254867        x_1+x_2
## 5 0.7208974123     x_1+x_junk
## 6 0.1164841596     x_2+x_junk
## 7 0.7905418666 x_1+x_2+x_junk

$R^2$ is not enough.

Variable selection - $R^2_a$

$$R^2_a = 1 - \frac{RSS/(n-p)}{TSS/(n-1)}$$

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(R2 = sumfit$r.squared,
                    R2a = sumfit$adj.r.squared)
  return(res)
}

Variable selection - $R^2_a$

get_all_stats(data_sub)
##             R2          R2a          preds
## 1 0.7208723311  0.720311834            x_1
## 2 0.1158793338  0.114103991            x_2
## 3 0.0005972444 -0.001409588         x_junk
## 4 0.7905254867  0.789682531        x_1+x_2
## 5 0.7208974123  0.719774263     x_1+x_junk
## 6 0.1164841596  0.112928764     x_2+x_junk
## 7 0.7905418666  0.789274983 x_1+x_2+x_junk

$R^2_a$: adjusts for the number of predictors.

Adjusted $R^2_a$

$$R^2_a = 1 - \frac{RSS/(n-p)}{TSS/(n-1)}$$

  • The larger the better.

  • When $p$ is bigger, the fit must be really better for $R^2_a$ to rise.

  • Intuitive, but without much theoretical justification.

Predictive Risk

Simulation

set.seed(1289)

## Predictors
n <- 50
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = -2, max = 2)

## Model
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1

## sim
eps <- rnorm(n, mean = 0, sd = 2)
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

Simulation - Fit

## Useless predictors
p_junk <- 48
x_junk <- matrix(runif(n * p_junk, min = -2, max = 2),
                 ncol = p_junk)

## Good and Bad dataset
data_all <- data.frame(y_sim = y_sim,
                       x_1 = x_1,
                       x_2 = x_2,
                       x_junk = x_junk)

## Bad overfitting fit
fit_bad <- lm(y_sim ~ ., data = data_all)
pred_bad <- predict(fit_bad)

## Good fit
fit_good <- lm(y_sim ~ x_1 + x_2, data = data_all)
pred_good <- predict(fit_good)

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)

Simulation - Fit

mu_y <- beta_0 + beta_1 * x_1 + beta_2 * x_2
plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)

RSS: “fitted risk”

The bad (bigger) model is closer to the observed data.

## Distance between bad prediction and data (RSS)
sum((y_sim - pred_bad)^2)
## [1] 2.714946e-27
## Distance between good prediction and data (RSS)
sum((y_sim - pred_good)^2)
## [1] 206.9159

Overfitting: the RSS is smaller for the bad, bigger model. (Here the big model has 51 coefficients for $n = 50$ observations, so it interpolates the data and its RSS is numerically zero.)

Predictive Risk

Generate new data according to the same generative model:

## New dataset from the same model
eps_new <- rnorm(n, mean = 0, sd = 2)
y_new <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps_new

Question:
Are the predictions from the bad (bigger) model
better or worse
than the predictions from the good (smaller) model,
when compared to the new data?

Simulation - Fit

plot(mu_y, y_sim, pch = 3,
     ylab = "y", xlab = expression(beta[0] + beta[1]*x[1] + beta[2]*x[2]))
## Bad prediction (big model)
points(mu_y, pred_bad, col = "firebrick", pch = 4)
## Good prediction (small model)
points(mu_y, pred_good, col = "lightblue", pch = 4)
## New points
points(mu_y, y_new, col = "darkgreen", pch = 3)

Predictive Risk

Compute the risk between predictions and new data:

## Distance between bad prediction and new data
sum((y_new - pred_bad)^2)
## [1] 535.296
## Distance between good prediction and new data
sum((y_new - pred_good)^2)
## [1] 205.6555

The good (smaller) model behaves better on new data.

Predictive Risk

RSS: $\|y - \hat{y}_p\|^2$, with $\hat{y}_p$ learnt on $y$:
gets smaller when the model gets bigger (overfit).

Predictive Risk: $\|y' - \hat{y}_p\|^2$, with $y'$ an independent copy of $y$:
becomes larger if $\hat{y}_p$ is overfit.

Idea:
We want to select a model that has a good predictive risk,
i.e. that represents the underlying model best.

Problem:
In general, we have no means to generate $y'$.
We use an estimate of the (theoretical) predictive risk.

Theoretical Risk

Models

Ideal model: $$y = \mu + \epsilon, \qquad \mu \in \mathbb{R}^n, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I_n).$$

$\mu$ is the “true” expectation of the observations $y$.

Full linear regression model: $$y = X\beta + \epsilon, \qquad \operatorname{rg}(X) = p, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I_n).$$

Assume that $\mu$ can be approximated as $X\beta$, i.e. as an element of $\mathcal{M}(X)$.

Models

Ideal model: $$y = \mu + \epsilon, \qquad \mu \in \mathbb{R}^n, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I_n).$$

Sub-models:
keep only the predictors in $\eta \subset \{1, \dots, p\}$: $$y = X_\eta\beta_\eta + \epsilon_\eta, \qquad \operatorname{rg}(X_\eta) = |\eta|, \qquad \epsilon_\eta \sim \mathcal{N}(0, \sigma^2 I_n).$$

with $X_\eta$ the sub-matrix of $X$: $$X_\eta = (x_{\eta_1}\ x_{\eta_2} \cdots x_{\eta_{|\eta|}})$$

Assume that $\mu$ can be approximated as $X_\eta\beta_\eta$, i.e. as an element of $\mathcal{M}(X_\eta)$.

Models

“Truth”: $y = \mu + \epsilon$, with $\mu \in \mathbb{R}^n$ and $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$.

Models: $$y = X_\eta\beta_\eta + \epsilon_\eta, \qquad \operatorname{rg}(X_\eta) = |\eta|, \qquad \epsilon_\eta \sim \mathcal{N}(0, \sigma^2 I_n).$$

Estimators: $$\hat{\beta}_\eta = \underset{\beta \in \mathbb{R}^{|\eta|}}{\operatorname{argmin}} \|y - X_\eta\beta\|^2 = (X_\eta^TX_\eta)^{-1}X_\eta^Ty, \qquad \hat{y}_\eta = X_\eta\hat{\beta}_\eta = P_{X_\eta}y$$

Projected true expectation: $$y^*_\eta = P_{X_\eta}\mu$$

Models

Risk

Risk of an estimator (theoretical)

Risk of an estimator: $$R(\hat{y}_\eta) = \mathbb{E}\left[\|\mu - \hat{y}_\eta\|^2\right]$$

Notes:

  • $\mu$ is unknown $\Rightarrow$ the risk cannot be computed in practice.
  • Expectation of the squared distance between the true mean and the estimated projection.
  • Not subject to overfitting.

Bias-Variance Decomposition

Bias-Variance Decomposition

$$R(\hat{y}_\eta) = \mathbb{E}\left[\|\mu - \hat{y}_\eta\|^2\right] = \|\mu - y^*_\eta\|^2 + |\eta|\sigma^2$$

  • Bias: $\|\mu - y^*_\eta\|^2$
    • Difference between the truth and its projection on the model.
    • Best that we could do if we knew the truth.
    • Decreases when the model gets more complex, i.e. $\eta$ gets larger.
  • Variance: $|\eta|\sigma^2$
    • Variance of the estimation.
    • Increases when the model gets more complex, i.e. $\eta$ gets larger.
  • The risk is a compromise between bias and variance.
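The decomposition can be illustrated by Monte Carlo on the simulation above (a sketch; mu_y, x_1, x_2 and n = 50 come from the previous slides; the small model contains $\mu$, so its bias is zero and the risk should be close to $|\eta|\sigma^2 = 3 \times 4 = 12$):

## Monte Carlo estimate of the risk of the small model
n_rep <- 1000
risk_sim <- replicate(n_rep, {
  y_rep <- mu_y + rnorm(n, mean = 0, sd = 2)  # new data from the true model
  fit_rep <- lm(y_rep ~ x_1 + x_2)
  sum((mu_y - fitted(fit_rep))^2)             # ||mu - hat(y)_eta||^2
})
mean(risk_sim)   # should be close to |eta| * sigma^2 = 12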

Bias-Variance - No overfitting

$$R(\hat{y}_\eta) = \mathbb{E}\left[\|\mu - \hat{y}_\eta\|^2\right] = \|\mu - y^*_\eta\|^2 + |\eta|\sigma^2$$

No overfitting:

Assume:

  • $\mu \in \mathcal{M}(X_\eta)$, so that $y^*_\eta = P_{X_\eta}\mu = \mu$.

  • $\eta \subset \eta'$, so that $y^*_{\eta'} = P_{X_{\eta'}}\mu = \mu$.

Then: $$R(\hat{y}_\eta) = \|\mu - y^*_\eta\|^2 + |\eta|\sigma^2 = |\eta|\sigma^2 \leq |\eta'|\sigma^2 = \|\mu - y^*_{\eta'}\|^2 + |\eta'|\sigma^2 = R(\hat{y}_{\eta'})$$

Bias-Variance - Proof - 1/2

$$R(\hat{y}_\eta) = \mathbb{E}\left[\|\mu - \hat{y}_\eta\|^2\right] = \mathbb{E}\left[\|\mu - y^*_\eta + y^*_\eta - \hat{y}_\eta\|^2\right]$$

$$\mu - \hat{y}_\eta = \mu - P_{X_\eta}\mu + P_{X_\eta}\mu - P_{X_\eta}y = (I_n - P_{X_\eta})\mu + P_{X_\eta}(\mu - y) = P^\perp_{X_\eta}\mu + P_{X_\eta}(\mu - y)$$

The two terms being orthogonal: $$R(\hat{y}_\eta) = \mathbb{E}\left[\|P^\perp_{X_\eta}\mu\|^2 + \|P_{X_\eta}(\mu - y)\|^2\right] = \|P^\perp_{X_\eta}\mu\|^2 + \mathbb{E}\left[\|P_{X_\eta}(\mu - y)\|^2\right] = \|\mu - y^*_\eta\|^2 + \mathbb{E}\left[\|P_{X_\eta}(\mu - y)\|^2\right]$$

Bias-Variance - Proof - 2/2

$$R(\hat{y}_\eta) = \|\mu - y^*_\eta\|^2 + \mathbb{E}\left[\|P_{X_\eta}(\mu - y)\|^2\right]$$

From Cochran’s Theorem: $$\frac{1}{\sigma^2}\|P_{X_\eta}(\mu - y)\|^2 \sim \chi^2_{|\eta|}$$

Hence: $$\mathbb{E}\left[\frac{1}{\sigma^2}\|P_{X_\eta}(\mu - y)\|^2\right] = |\eta|$$

And: $$R(\hat{y}_\eta) = \|\mu - y^*_\eta\|^2 + |\eta|\sigma^2.$$

Penalized Least Squares

Oracle Estimator

We would like: $$\eta_0 = \underset{\eta \subset \{1, \dots, p\}}{\operatorname{argmin}}\ R(\hat{y}_\eta)$$

  • Minimize the risk over all possible models.
  • Problem: $R(\hat{y}_\eta)$ is theoretical $\Rightarrow$ it cannot be computed.
  • Idea: minimize an estimator $\hat{R}(\hat{y}_\eta)$ of $R(\hat{y}_\eta)$: $$\hat{\eta} = \underset{\eta \subset \{1, \dots, p\}}{\operatorname{argmin}}\ \hat{R}(\hat{y}_\eta).$$

Expectation of the RSS

For any model $\eta$, the expectation of the RSS is: $$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = R(\hat{y}_\eta) + n\sigma^2 - 2|\eta|\sigma^2$$

Expectation of the RSS - Proof

$$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = \mathbb{E}\left[\|y - \mu + \mu - \hat{y}_\eta\|^2\right] = \mathbb{E}\left[\|y - \mu\|^2 + \|\mu - \hat{y}_\eta\|^2 + 2\langle y - \mu;\ \mu - \hat{y}_\eta\rangle\right]$$

$$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = \mathbb{E}\left[\|\epsilon\|^2\right] + R(\hat{y}_\eta) + 2\,\mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}y\rangle\right]$$

$$\mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}y\rangle\right] = \mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}(\mu + \epsilon)\rangle\right] = \mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}\mu\rangle\right] - \mathbb{E}\left[\langle \epsilon;\ P_{X_\eta}\epsilon\rangle\right]$$

$$\mathbb{E}\left[\langle \epsilon;\ \mu - P_{X_\eta}y\rangle\right] = 0 - \mathbb{E}\left[\langle P_{X_\eta}\epsilon;\ P_{X_\eta}\epsilon\rangle\right] = -\mathbb{E}\left[\|P_{X_\eta}\epsilon\|^2\right] = -|\eta|\sigma^2$$

$$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = n\sigma^2 + R(\hat{y}_\eta) - 2|\eta|\sigma^2$$

Mallow’s Cp

We have: $$\mathbb{E}\left[\|y - \hat{y}_\eta\|^2\right] = R(\hat{y}_\eta) + n\sigma^2 - 2|\eta|\sigma^2$$

Mallow’s $C_p$: $$C_p(\hat{y}_\eta) = \frac{1}{n}\left(\|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2\right)$$

  • If $\hat{\sigma}^2$ is unbiased, then: $$\mathbb{E}\left[C_p(\hat{y}_\eta)\right] = \frac{1}{n}R(\hat{y}_\eta) + \sigma^2$$

Mallow’s Cp

We have: $$C_p(\hat{y}_\eta) = \frac{1}{n}\left(\|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2\right)$$

And: $$\mathbb{E}\left[C_p(\hat{y}_\eta)\right] = \frac{1}{n}R(\hat{y}_\eta) + \sigma^2$$

Hence: $$\hat{\eta} = \underset{\eta \subset \{1, \dots, p\}}{\operatorname{argmin}}\ C_p(\hat{y}_\eta) = \underset{\eta \subset \{1, \dots, p\}}{\operatorname{argmin}}\ \hat{R}(\hat{y}_\eta).$$

Minimizing Mallow’s $C_p$ is the same as minimizing an unbiased estimate of the risk.

Mallow’s Cp

$$C_p(\hat{y}_\eta) = \frac{1}{n}\left(\|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2\right)$$

  • Penalizes for the number of parameters $|\eta|$.

  • The smaller the better.

Simulated Dataset

## Mallow's Cp
Cp <- function(fit) {
  n <- nobs(fit)
  p <- n - df.residual(fit)     # |eta|: number of fitted coefficients
  RSS <- deviance(fit)          # residual sum of squares
  sigma2_hat <- RSS / (n - p)   # unbiased estimate of sigma^2
  return((RSS + 2 * p * sigma2_hat) / n)
}
## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(Cp = Cp(fit))
  return(res)
}

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##          Cp          preds
## 1  5.085236            x_1
## 2 16.107190            x_2
## 3 18.207435         x_junk
## 4  3.823952        x_1+x_2
## 5  5.095010     x_1+x_junk
## 6 16.128558     x_2+x_junk
## 7  3.831361 x_1+x_2+x_junk

Mallow’s Cp is smallest for the true model.

Penalized Log Likelihood

Maximized Log Likelihood

Recall: $$(\hat{\beta}_\eta, \hat{\sigma}^2_{\eta, ML}) = \underset{(\beta, \sigma^2) \in \mathbb{R}^{|\eta|} \times \mathbb{R}_+}{\operatorname{argmax}} \log L(\beta, \sigma^2 \mid y) = \underset{(\beta, \sigma^2) \in \mathbb{R}^{|\eta|} \times \mathbb{R}_+}{\operatorname{argmax}} \left\{ -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|y - X_\eta\beta\|^2 \right\}$$

And: $$\hat{\sigma}^2_{\eta, ML} = \frac{\|y - \hat{y}_\eta\|^2}{n}, \qquad \hat{y}_\eta = X_\eta\hat{\beta}_\eta$$

Thus: $$\log L(\hat{\beta}_\eta, \hat{\sigma}^2_{ML} \mid y) = \ ?$$

Maximized Log Likelihood

The maximized log-likelihood is equal to: $$\log L(\hat{\beta}_\eta, \hat{\sigma}^2_{ML} \mid y) = -\frac{n}{2}\log\left(2\pi\frac{\|y - \hat{y}_\eta\|^2}{n}\right) - \frac{1}{2}\,\frac{n}{\|y - \hat{y}_\eta\|^2}\,\|y - \hat{y}_\eta\|^2$$

i.e. $$\log\hat{L}_\eta = -\frac{n}{2}\log\left(\frac{\|y - \hat{y}_\eta\|^2}{n}\right) - \frac{n}{2}\log(2\pi) - \frac{n}{2}$$
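This closed form can be checked against R’s logLik function; a minimal sketch, reusing data_all from above:

## Closed-form maximized log-likelihood vs logLik()
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
n <- nobs(fit); RSS <- deviance(fit)
- n / 2 * log(RSS / n) - n / 2 * log(2 * pi) - n / 2
as.numeric(logLik(fit))   # identical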

Maximized Log Likelihood is increasing

Remember, for any $\eta \subset \eta'$: $$\|y - \hat{y}_{\eta'}\|^2 \leq \|y - \hat{y}_\eta\|^2$$

Hence $$R^2_\eta = 1 - \frac{\|y - \hat{y}_\eta\|^2}{\|y - \bar{y}\mathbb{1}\|^2}$$ always increases when we add a predictor.

Similarly: $$\log\hat{L}_\eta = -\frac{n}{2}\log\left(\frac{\|y - \hat{y}_\eta\|^2}{n}\right) - \frac{n}{2}\log(2\pi) - \frac{n}{2}$$

always increases when we add some predictors.

Penalized Maximized Log Likelihood

Problem: $\log\hat{L}_\eta$ always increases when we add some predictors,

$\Longrightarrow$ it cannot be used for model selection (same problem as $R^2$).

Idea: add a penalty term that depends on the size of the model,

and minimize: $$-\log\hat{L}_\eta + f(|\eta|) \quad\text{with } f \text{ increasing}$$

Penalized Maximized Log Likelihood

Minimize: $$-\log\hat{L}_\eta + f(|\eta|) \quad\text{with } f \text{ increasing}$$

For $\eta \subset \eta'$:

  • $-\log\hat{L}_{\eta'} \leq -\log\hat{L}_\eta$: the likelihood term decreases,

  • $f(|\eta'|) \geq f(|\eta|)$: the penalty increases.

Goal: find $f$ that compensates for the “overfit”, i.e. the noisy and insignificant increase of the likelihood when we add unrelated predictors.

Simulated Dataset

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit))
  return(res)
}

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##   minusLogLik          preds
## 1    1115.053            x_1
## 2    1403.284            x_2
## 3    1433.925         x_junk
## 4    1043.286        x_1+x_2
## 5    1115.030     x_1+x_junk
## 6    1403.113     x_2+x_junk
## 7    1043.266 x_1+x_2+x_junk

When we add a “junk” variable, $-\log\hat{L}_\eta$ decreases, but not by much.

Question: can we find a good penalty that compensates for this noisy increase?

AIC and BIC

Akaike Information Criterion - AIC

$$AIC = -2\log\hat{L}_\eta + 2|\eta|$$

  • $\hat{L}_\eta$ is the maximized likelihood of the model.

  • $|\eta|$ is the dimension of the model.

  • The smaller the better.

  • Asymptotic theoretical guarantees.

  • Easy to use and widespread.
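Note that R’s AIC() counts the variance $\hat{\sigma}^2$ (and the intercept) among the estimated parameters, so the $|\eta|$ it uses is the number of coefficients plus one; a quick check (sketch):

## How AIC() counts parameters in R
fit <- lm(y_sim ~ x_1 + x_2, data = data_all)
AIC(fit)
-2 * as.numeric(logLik(fit)) + 2 * attr(logLik(fit), "df")   # identical
attr(logLik(fit), "df")   # 3 coefficients + 1 for sigma^2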

Bayesian Information Criterion - BIC

$$BIC = -2\log\hat{L}_\eta + |\eta|\log(n)$$

  • $\hat{L}_\eta$ is the maximized likelihood of the model.

  • $|\eta|$ is the dimension of the model.

  • The smaller the better.

  • Based on an approximation of the marginal likelihood.

  • Easy to use and widespread.

See: Lebarbier, Mary-Huard. Le critère BIC : fondements théoriques et interprétation. 2004. [inria-00070685]

AIC vs BIC

$$AIC = -2\log\hat{L}_\eta + 2|\eta|$$

$$BIC = -2\log\hat{L}_\eta + |\eta|\log(n)$$

  • Both have asymptotic theoretical justifications.

  • BIC penalizes more than AIC $\Rightarrow$ it chooses smaller models.

  • Both widely used.

  • There are some (better) non-asymptotic criteria, see
    Giraud C. 2014. Introduction to High-Dimensional Statistics.

Simulated Dataset

## Function to get statistics
get_stats_fit <- function(fit) {
  sumfit <- summary(fit)
  res <- data.frame(minusLogLik = -logLik(fit),
                    AIC = AIC(fit),
                    BIC = BIC(fit))
  return(res)
}

Variable selection

## Apply stats to all possible combinations
all_stats <- get_all_stats(data_sub)
all_stats
##   minusLogLik      AIC      BIC          preds
## 1    1115.053 2236.105 2248.749            x_1
## 2    1403.284 2812.567 2825.211            x_2
## 3    1433.925 2873.850 2886.493         x_junk
## 4    1043.286 2094.572 2111.430        x_1+x_2
## 5    1115.030 2238.060 2254.919     x_1+x_junk
## 6    1403.113 2814.225 2831.084     x_2+x_junk
## 7    1043.266 2096.533 2117.606 x_1+x_2+x_junk
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
##      minusLogLik              AIC              BIC 
## "x_1+x_2+x_junk"        "x_1+x_2"        "x_1+x_2"

Variable selection - Advertising data

## Advertising data
library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")

## Apply stats to all possible combinations
all_stats <- get_all_stats(ad[, c(4, 1:3)])
all_stats
##   minusLogLik       AIC       BIC              preds
## 1    519.0457 1044.0913 1053.9863                 TV
## 2    573.3369 1152.6738 1162.5687              radio
## 3    608.3357 1222.6714 1232.5663          newspaper
## 4    386.1970  780.3941  793.5874           TV+radio
## 5    509.8891 1027.7782 1040.9714       TV+newspaper
## 6    573.2361 1154.4723 1167.6655    radio+newspaper
## 7    386.1811  782.3622  798.8538 TV+radio+newspaper
## Select model
apply(all_stats[, -ncol(all_stats)], 2,
      function(x) all_stats[which.min(x), ncol(all_stats)])
##          minusLogLik                  AIC                  BIC 
## "TV+radio+newspaper"           "TV+radio"           "TV+radio"

Forward and Backward Search

Combinatorial problem

$p$ predictors $\Rightarrow$ $2^p$ possible models.

  • We cannot test all possible models.
  • “Manual” solution: forward or backward searches (automated with stepAIC in the sketch at the end of this section).

Forward search

Forward search - Init

## Full formula
full_formula <- formula(y_sim ~ x_1 + x_2 
                        + x_junk.1 + x_junk.2 + x_junk.3 
                        + x_junk.4 + x_junk.5 + x_junk.6)

## Complete search
2^8
## [1] 256
## No predictors
fit0 <- lm(y_sim ~ 1, data = data_all)

Forward search - Add term

## Add one
library(MASS)
addterm(fit0, full_formula)
## Single term additions
## 
## Model:
## y_sim ~ 1
##          Df Sum of Sq    RSS     AIC
## <none>                731.66 136.165
## x_1       1    466.43 265.23  87.429
## x_2       1     56.39 675.27 134.154
## x_junk.1  1     50.05 681.61 134.622
## x_junk.2  1     32.10 699.56 135.921
## x_junk.3  1      0.41 731.26 138.137
## x_junk.4  1      2.49 729.18 137.995
## x_junk.5  1     54.28 677.38 134.311
## x_junk.6  1     11.59 720.08 137.367

Forward search - Add term

## Add x_1
fit1 <- lm(y_sim ~ 1 + x_1, data = data_all)
## Add another
addterm(fit1, full_formula)
## Single term additions
## 
## Model:
## y_sim ~ 1 + x_1
##          Df Sum of Sq    RSS    AIC
## <none>                265.23 87.429
## x_2       1    58.314 206.92 77.014
## x_junk.1  1     6.038 259.19 88.277
## x_junk.2  1     0.159 265.07 89.399
## x_junk.3  1     0.376 264.85 89.358
## x_junk.4  1     0.403 264.83 89.353
## x_junk.5  1     2.587 262.64 88.939
## x_junk.6  1     6.791 258.44 88.132

Forward search - Add term

## Add x_2
fit2 <- lm(y_sim ~ 1 + x_1 + x_2, data = data_all)
## Add another
addterm(fit2, full_formula)
## Single term additions
## 
## Model:
## y_sim ~ 1 + x_1 + x_2
##          Df Sum of Sq    RSS    AIC
## <none>                206.92 77.014
## x_junk.1  1    3.0199 203.90 78.279
## x_junk.2  1    2.1852 204.73 78.484
## x_junk.3  1    0.8228 206.09 78.815
## x_junk.4  1    1.6416 205.27 78.616
## x_junk.5  1    4.9363 201.98 77.807
## x_junk.6  1    3.7814 203.13 78.092

Backward search

Backward search - Init

## Full formula
fit0 <- lm(y_sim ~ x_1 + x_2 
                        + x_junk.1 + x_junk.2 + x_junk.3 
                        + x_junk.4 + x_junk.5 + x_junk.6,
           data = data_all)
## drop one
dropterm(fit0)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.3 + x_junk.4 + 
##     x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                191.34  85.100
## x_1       1    336.07 527.41 133.797
## x_2       1     57.75 249.09  96.289
## x_junk.1  1      2.92 194.25  83.857
## x_junk.2  1      2.26 193.60  83.688
## x_junk.3  1      1.46 192.80  83.480
## x_junk.4  1      1.48 192.81  83.485
## x_junk.5  1      2.08 193.42  83.641
## x_junk.6  1      4.38 195.72  84.233

Backward search - Drop term

## drop x_junk.3
fit1 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.1 + x_junk.2
           + x_junk.4 + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit1)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.4 + x_junk.5 + 
##     x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                192.80  83.480
## x_1       1    335.92 528.71 131.921
## x_2       1     57.16 249.96  94.463
## x_junk.1  1      2.30 195.09  82.073
## x_junk.2  1      2.18 194.98  82.043
## x_junk.4  1      1.15 193.95  81.779
## x_junk.5  1      2.27 195.06  82.065
## x_junk.6  1      4.73 197.52  82.691

Backward search - Drop term

## drop x_junk.4
fit2 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.1 + x_junk.2
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit2)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.1 + x_junk.2 + x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                193.95  81.779
## x_1       1    335.82 529.77 130.021
## x_2       1     56.44 250.39  92.550
## x_junk.1  1      1.95 195.90  80.278
## x_junk.2  1      2.27 196.21  80.359
## x_junk.5  1      2.96 196.91  80.537
## x_junk.6  1      4.74 198.69  80.986

Backward search - Drop term

## drop x_junk.1
fit3 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.2
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit3)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.2 + x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                195.90  80.278
## x_1       1    368.05 563.94 131.147
## x_2       1     59.57 255.47  91.554
## x_junk.2  1      2.46 198.36  78.903
## x_junk.5  1      3.71 199.61  79.217
## x_junk.6  1      4.71 200.60  79.466

Backward search - Drop term

## drop x_junk.2
fit4 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.5 + x_junk.6, data = data_all)
## Drop another
dropterm(fit4)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.5 + x_junk.6
##          Df Sum of Sq    RSS     AIC
## <none>                198.36  78.903
## x_1       1    422.20 620.56 133.930
## x_2       1     57.59 255.95  89.648
## x_junk.5  1      4.78 203.13  78.092
## x_junk.6  1      3.62 201.98  77.807

Backward search - Drop term

## drop x_junk.6
fit5 <- lm(y_sim ~ x_1 + x_2 
           + x_junk.5, data = data_all)
## Drop another
dropterm(fit5)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2 + x_junk.5
##          Df Sum of Sq    RSS     AIC
## <none>                201.98  77.807
## x_1       1    426.25 628.23 132.544
## x_2       1     60.66 262.64  88.939
## x_junk.5  1      4.94 206.92  77.014

Backward search - Drop term

## drop x_junk.5
fit6 <- lm(y_sim ~ x_1 + x_2,
           data = data_all)
## Drop another
dropterm(fit6)
## Single term deletions
## 
## Model:
## y_sim ~ x_1 + x_2
##        Df Sum of Sq    RSS     AIC
## <none>              206.92  77.014
## x_1     1    468.35 675.27 134.154
## x_2     1     58.31 265.23  87.429
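These add/drop loops can be automated, e.g. with stepAIC from MASS (or step from base R); a minimal sketch with the same full model as above:

## Automated backward search
library(MASS)
fit_full <- lm(y_sim ~ x_1 + x_2 
               + x_junk.1 + x_junk.2 + x_junk.3 
               + x_junk.4 + x_junk.5 + x_junk.6,
               data = data_all)
fit_sel <- stepAIC(fit_full, direction = "backward", trace = FALSE)
formula(fit_sel)   # ends at y_sim ~ x_1 + x_2, as in the manual search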

Forward and backward searches

  • Both are heuristics.
  • There is no guarantee that the two searches converge to the same solution.
  • No theoretical guarantees on the selected model.
  • Model selection is still an active area of research.
    • See e.g.: the LASSO and other penalized criteria (M2).

AIC vs log FPE

Final Prediction Error - FPE

Mallow’s $C_p$: $$C_p(\hat{y}_\eta) = \frac{1}{n}\left(\|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2_{full}\right)$$

FPE: $$FPE(\hat{y}_\eta) = \|y - \hat{y}_\eta\|^2 + 2|\eta|\hat{\sigma}^2_\eta \quad\text{with}\quad \hat{\sigma}^2_\eta = \frac{\|y - \hat{y}_\eta\|^2}{n - |\eta|}$$

Hence: $$FPE(\hat{y}_\eta) = \|y - \hat{y}_\eta\|^2\left(1 + \frac{2|\eta|}{n - |\eta|}\right)$$
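Following the formula above, the FPE of a fitted model is straightforward to compute; a minimal sketch:

## FPE criterion
FPE <- function(fit) {
  n <- nobs(fit)
  p <- n - df.residual(fit)   # |eta|
  RSS <- deviance(fit)
  return(RSS * (1 + 2 * p / (n - p)))
}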

AIC vs Log FPE

$$\log \frac{1}{n} FPE(\hat{y}_\eta) = \log\left(\frac{1}{n}\|y - \hat{y}_\eta\|^2\left(1 + \frac{2|\eta|}{n - |\eta|}\right)\right) = \log\left(\frac{1}{n}\|y - \hat{y}_\eta\|^2\right) + \log\left(1 + \frac{2|\eta|}{n - |\eta|}\right)$$

But: $$n\log\left(\frac{\|y - \hat{y}_\eta\|^2}{n}\right) = -2\log\hat{L}_\eta - n\log(2\pi) - n$$

AIC vs Log FPE

Hence: $$n\log\frac{1}{n}FPE(\hat{y}_\eta) + n\log(2\pi) + n = -2\log\hat{L}_\eta + n\log\left(1 + \frac{2|\eta|}{n - |\eta|}\right)$$

When $n$ is large: $$n\log\left(1 + \frac{2|\eta|}{n - |\eta|}\right) \approx n\,\frac{2|\eta|}{n - |\eta|} \approx 2|\eta|$$
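For instance, with $n = 500$ and $|\eta| = 3$, the two sides are already very close (sketch):

## Numerical check of the asymptotic equivalence
n <- 500; p <- 3
n * log(1 + 2 * p / (n - p))   # ~ 6.00, close to 2 * p = 6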

AIC vs Log FPE

Hence: $$n\log\frac{1}{n}FPE(\hat{y}_\eta) + n\log(2\pi) + n \approx -2\log\hat{L}_\eta + 2|\eta| = AIC(\hat{y}_\eta)$$

The log FPE and AIC criteria select the same models (asymptotically).

BIC and Laplace Approximation

Bayesian Model Selection

In a Bayesian setting,
we have priors on the models $\eta$ and on the parameters $\theta_\eta = (\beta_\eta, \sigma^2_\eta)$: $$\eta \sim \pi_\eta, \qquad \theta_\eta \sim p(\theta \mid \eta)$$

The likelihood given the parameters and a model is: $$p(y \mid \theta, \eta)$$

the marginal likelihood given a model is: $$p(y \mid \eta) = \int p(y \mid \theta, \eta)\,p(\theta \mid \eta)\,d\theta$$

and the posterior probability of a model is: $$p(\eta \mid y).$$

Bayesian Model Selection

Goal: find the model with the highest posterior probability: $$\hat{\eta} = \underset{\eta}{\operatorname{argmax}}\ p(\eta \mid y)$$

Assume that all models have the same prior probability: $\pi_\eta \equiv \pi$.

Then, from Bayes’ rule: $$p(\eta \mid y) = p(y \mid \eta)\,\pi_\eta / p(y) = p(y \mid \eta) \times \pi / p(y)$$

And: $$\hat{\eta} = \underset{\eta}{\operatorname{argmax}}\ p(\eta \mid y) = \underset{\eta}{\operatorname{argmax}}\ p(y \mid \eta).$$

Bayesian Model Selection

Goal: find the model with the highest posterior probability: $$\hat{\eta} = \underset{\eta}{\operatorname{argmax}}\ p(\eta \mid y) = \underset{\eta}{\operatorname{argmax}}\ p(y \mid \eta) = \underset{\eta}{\operatorname{argmax}} \int p(y \mid \theta, \eta)\,p(\theta \mid \eta)\,d\theta$$

Idea:
Assume that $$p(y \mid \theta, \eta)\,p(\theta \mid \eta)$$

is concentrated around its maximum $p(y \mid \theta^*_\eta, \eta)\,p(\theta^*_\eta \mid \eta)$,
and apply the Laplace Approximation.

Laplace Approximation

Let $L: \mathbb{R}^q \to \mathbb{R}$ be a three times differentiable function, with a unique maximum $u^*$. Then: $$\int_{\mathbb{R}^q} e^{nL(u)}\,du = \left(\frac{2\pi}{n}\right)^{q/2} \left|-L''(u^*)\right|^{-1/2} e^{nL(u^*)}\left(1 + o(1)\right)$$

Applied to: $$L(\theta) = \frac{1}{n}\log p(y \mid \theta, \eta) + \frac{1}{n}\log p(\theta \mid \eta)$$

and assuming $\log p(y \mid \theta^*, \eta) \approx \log p(y \mid \hat{\theta}, \eta)$, we get: $$\log \int p(y \mid \theta, \eta)\,p(\theta \mid \eta)\,d\theta \approx \log p(y \mid \hat{\theta}, \eta) - \frac{|\eta|}{2}\log(n).$$

BIC

Hence: $$\hat{\eta} = \underset{\eta}{\operatorname{argmax}}\ p(\eta \mid y) = \underset{\eta}{\operatorname{argmax}}\ p(y \mid \eta) \approx \underset{\eta}{\operatorname{argmax}} \left\{ \log p(y \mid \hat{\theta}, \eta) - \frac{|\eta|}{2}\log(n) \right\} = \underset{\eta}{\operatorname{argmin}} \left\{ -2\log p(y \mid \hat{\theta}, \eta) + |\eta|\log(n) \right\} = \underset{\eta}{\operatorname{argmin}} \left\{ -2\log\hat{L}_\eta + |\eta|\log(n) \right\} = \underset{\eta}{\operatorname{argmin}}\ BIC(\hat{y}_\eta)$$

BIC is an approximation of the marginal likelihood.