Multiple Regression - Gaussian Model

Paul Bastide - Ibrahim Bouzalmat

14/02/2024

What we know

Model

Model:

$$y = X\beta + \epsilon$$

  • $y$ random vector of $n$ responses
  • $X$ non-random $n \times p$ matrix of predictors
  • $\epsilon$ random vector of $n$ errors
  • $\beta$ non-random, unknown vector of $p$ coefficients

Assumptions:

  • (H1) $\operatorname{rg}(X) = p$
  • (H2) $\mathbb{E}[\epsilon] = 0$ and $\operatorname{Var}[\epsilon] = \sigma^2 I_n$

OLS Estimators

$$\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}} \|y - X\beta\|^2$$

$$\hat{\beta} = (X^TX)^{-1}X^Ty \qquad \mathbb{E}[\hat{\beta}] = \beta \qquad \operatorname{Var}[\hat{\beta}] = \sigma^2 (X^TX)^{-1}.$$

$$\hat{y} = X\hat{\beta} = P^X y \qquad \hat{\epsilon} = y - \hat{y} = (I_n - P^X)y = P^{X^\perp} y$$

$$\mathbb{E}[\hat{\epsilon}] = 0 \qquad \operatorname{Var}[\hat{\epsilon}] = \sigma^2 P^{X^\perp}$$

$$\mathbb{E}[\hat{y}] = X\beta \qquad \operatorname{Var}[\hat{y}] = \sigma^2 P^X$$

$$\operatorname{Cov}[\hat{\epsilon}; \hat{y}] = 0.$$

$$\hat{\sigma}^2 = \frac{1}{n-p}\|\hat{\epsilon}\|^2 \qquad \mathbb{E}[\hat{\sigma}^2] = \sigma^2$$
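These matrix formulas translate directly into a few lines of R. Below is a minimal sketch (the simulated `X`, `y` and the use of `solve`/`crossprod` are illustrative choices, not part of the course material):

## OLS quantities computed from the matrix formulas (illustrative sketch)
set.seed(1)
n <- 50; p <- 3
X <- cbind(1, runif(n), runif(n))        # design matrix with an intercept column
beta <- c(-1, 3, -1)
y <- X %*% beta + rnorm(n)

beta_hat   <- solve(crossprod(X), crossprod(X, y))  # (X^T X)^{-1} X^T y
y_hat      <- X %*% beta_hat                        # fitted values P^X y
eps_hat    <- y - y_hat                             # residuals P^{X perp} y
sigma2_hat <- sum(eps_hat^2) / (n - p)              # unbiased estimator of sigma^2

## same coefficients as lm()
cbind(by_hand = drop(beta_hat), lm = coef(lm(y ~ X - 1)))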

Questions

  • Confidence intervals for $\hat{\beta}$ and $\hat{\sigma}^2$?

  • Prediction intervals?

  • Assumptions on the moments are not enough.
    We need assumptions on the specific distribution of the $\epsilon_i$.

  • Most common assumption: the $\epsilon_i$ are Gaussian.

Gaussian Model

Gaussian Model

Model:

$$y = X\beta + \epsilon$$

  • $y$ random vector of $n$ responses
  • $X$ non-random $n \times p$ matrix of predictors
  • $\epsilon$ random vector of $n$ errors
  • $\beta$ non-random, unknown vector of $p$ coefficients

Assumptions:

  • (H1) $\operatorname{rg}(X) = p$
  • (H2) $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$

Reminder - Notations

The column and row vectors of $X$ are written as:

$$X = \begin{pmatrix} \mathbf{x}^1 & \cdots & \mathbf{x}^p \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_n \end{pmatrix}$$

Hence: $y_i = \mathbf{x}_i \beta + \epsilon_i$ for all $1 \leq i \leq n$.

Maximum Likelihood Estimators

Distribution of y

  • Model: $$y_i = \mathbf{x}_i\beta + \epsilon_i \quad \text{with} \quad \epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2) \quad \text{for} \quad 1 \leq i \leq n$$

  • Distribution of $y_i$: $$y_i \overset{ind}{\sim} \mathcal{N}(\mathbf{x}_i\beta, \sigma^2) \quad \text{for} \quad 1 \leq i \leq n$$

Likelihood of y

  • Distribution of $y_i$: $$y_i \overset{ind}{\sim} \mathcal{N}(\mathbf{x}_i\beta, \sigma^2) \quad \text{for} \quad 1 \leq i \leq n$$
  • Likelihood of $y = (y_1, \dots, y_n)^T$: $$L(\beta, \sigma^2 \,|\, y_1, \dots, y_n) = \prod_{i=1}^n L(\beta, \sigma^2 \,|\, y_i) \quad [\text{ind.}]$$

and $L(\beta, \sigma^2 \,|\, y_i) = \; ?$

Likelihood of y

  • Distribution of $y_i$: $$y_i \overset{ind}{\sim} \mathcal{N}(\mathbf{x}_i\beta, \sigma^2) \quad \text{for} \quad 1 \leq i \leq n$$

  • Likelihood of $y = (y_1, \dots, y_n)^T$: $$L(\beta, \sigma^2 \,|\, y_1, \dots, y_n) = \prod_{i=1}^n L(\beta, \sigma^2 \,|\, y_i) \quad [\text{ind.}]$$

and

$$L(\beta, \sigma^2 \,|\, y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{x}_i\beta)^2}{2\sigma^2}\right)$$

Likelihood of y

  • Likelihood of $y = (y_1, \dots, y_n)^T$: $$L(\beta, \sigma^2 \,|\, y_1, \dots, y_n) = \prod_{i=1}^n L(\beta, \sigma^2 \,|\, y_i) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{x}_i\beta)^2}{2\sigma^2}\right) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i\beta)^2\right)$$

$$\log L(\beta, \sigma^2 \,|\, y_1, \dots, y_n) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i\beta)^2$$
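As a quick check, this log-likelihood can be coded directly and compared with the sum of Gaussian log-densities (a minimal sketch; the toy `X`, `y` and the function name `loglik` are illustrative):

## Gaussian log-likelihood written from the formula above (illustrative sketch)
loglik <- function(beta, sigma2, y, X) {
  n <- length(y)
  -n / 2 * log(2 * pi * sigma2) - sum((y - X %*% beta)^2) / (2 * sigma2)
}
## sanity check on toy data: matches the sum of Gaussian log-densities
set.seed(4)
X <- cbind(1, runif(10)); y <- X %*% c(-1, 3) + rnorm(10)
loglik(c(-1, 3), 1, y, X)
sum(dnorm(y, mean = X %*% c(-1, 3), sd = 1, log = TRUE))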

ML Estimators

  • The Maximum Likelihood estimators are: $$(\hat{\beta}_{ML}, \hat{\sigma}^2_{ML}) = \underset{(\beta, \sigma^2) \in \mathbb{R}^p \times \mathbb{R}_+}{\operatorname{argmax}} \; \log L(\beta, \sigma^2 \,|\, y)$$

with: $$\log L(\beta, \sigma^2 \,|\, y) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i\beta)^2$$

ML Estimators - $\beta$

$$\log L(\beta, \sigma^2 \,|\, y) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\underbrace{\sum_{i=1}^n (y_i - \mathbf{x}_i\beta)^2}_{\text{Sum of Squares } \|y - X\beta\|^2}$$

  • For any $\sigma^2 > 0$, $$\hat{\beta}_{ML} = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmax}} \; \log L(\beta, \sigma^2 \,|\, y) = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}} \; \|y - X\beta\|^2 = \hat{\beta}_{OLS}$$

For $\beta$, the ML estimator is the same as the OLS estimator.

ML Estimators - $\sigma^2$

$$\hat{\sigma}^2_{ML} = \underset{\sigma^2 \in \mathbb{R}_+}{\operatorname{argmax}} \; \log L(\hat{\beta}, \sigma^2 \,|\, y) = \underset{\sigma^2 \in \mathbb{R}_+}{\operatorname{argmax}} \; \left\{-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i\hat{\beta})^2\right\} = \underset{\sigma^2 \in \mathbb{R}_+}{\operatorname{argmin}} \; \left\{\frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n \hat{\epsilon}_i^2\right\}$$

We get:

$$\hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^n \hat{\epsilon}_i^2 = \frac{\|\hat{\epsilon}\|^2}{n} = \frac{\|y - X\hat{\beta}\|^2}{n}$$

ML Estimators - $\sigma^2$ - Remarks

  • The ML estimator is different from the unbiased estimator of the variance we saw earlier.

$$\hat{\sigma}^2_{ML} = \frac{\|\hat{\epsilon}\|^2}{n} \neq \frac{\|\hat{\epsilon}\|^2}{n - p} = \hat{\sigma}^2$$

  • The ML estimator of the variance is biased: $$\mathbb{E}[\hat{\sigma}^2_{ML}] = \frac{n-p}{n}\sigma^2 \neq \sigma^2$$
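A small simulation makes the distinction concrete (illustrative sketch; the toy data and variable names are not from the course):

## Biased ML estimator vs unbiased estimator of the variance (sketch)
set.seed(2)
n <- 30
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x); p <- length(coef(fit))
sigma2_ml  <- mean(residuals(fit)^2)            # ||eps_hat||^2 / n       (biased)
sigma2_hat <- sum(residuals(fit)^2) / (n - p)   # ||eps_hat||^2 / (n - p) (unbiased)
c(ML = sigma2_ml, unbiased = sigma2_hat, lm = summary(fit)$sigma^2)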

Reminder - Gaussian Vectors

Reminder: Gaussian Distribution 1/4

Let $X$ be a Gaussian r.v. with expectation $\mu$ and variance $\sigma^2$: $X \sim \mathcal{N}(\mu, \sigma^2)$.

It admits a probability density function (pdf): $$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \quad \forall x \in \mathbb{R}.$$

Moments: $$\mathbb{E}[X] = \mu \qquad \operatorname{Var}[X] = \sigma^2$$

Reminder: Gaussian Vector 2/4

Let $X$ be a Gaussian vector with expectation $\mu$ and variance $\Sigma$: $X \sim \mathcal{N}(\mu, \Sigma)$.

It admits a probability density function (pdf): $$f(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) \quad \forall x \in \mathbb{R}^n.$$

Any linear combination of its coordinates is Gaussian.

Moments: $$\mathbb{E}[X] = \mu \qquad \operatorname{Var}[X] = \mathbb{E}[(X - \mu)(X - \mu)^T] = \Sigma$$

Reminder: Gaussian Vector 3/4

Property:
If $X \sim \mathcal{N}(\mu, \Sigma)$, then for any matrix $a$ and vector $b$ of compatible dimensions,
$Y = aX + b \sim \mathcal{N}(a\mu + b, \; a\Sigma a^T)$.

Property:
Let $X \sim \mathcal{N}(\mu, \Sigma)$, with $\Sigma$ invertible. Then $\Sigma^{-1}$ is symmetric, positive definite, and it is possible to find $\Sigma^{-1/2}$ invertible such that: $$[\Sigma^{-1/2}]^T \Sigma^{-1/2} = \Sigma^{-1}$$

Then: $$Y = \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, I).$$

Reminder: Gaussian Vector 4/4

Proof:
$$Y = \Sigma^{-1/2}(X - \mu)$$

Take $$a = \Sigma^{-1/2} \quad\text{and}\quad b = -\Sigma^{-1/2}\mu$$

Then: $$a\mu + b = \Sigma^{-1/2}\mu - \Sigma^{-1/2}\mu = 0$$

and: $$a\Sigma a^T = \Sigma^{-1/2}\Sigma[\Sigma^{-1/2}]^T = I.$$
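In R, a valid $\Sigma^{-1/2}$ can be built from a Cholesky factor (one possible choice among others). The sketch below standardizes a simulated Gaussian vector and checks that the result has roughly zero mean and identity variance; `MASS::mvrnorm` and the toy $\mu$, $\Sigma$ are illustrative choices.

## Standardizing a Gaussian vector with a matrix square root (sketch)
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
mu <- c(1, -1)
set.seed(3)
X <- MASS::mvrnorm(10000, mu = mu, Sigma = Sigma)   # rows are samples
R <- chol(Sigma)                                    # Sigma = R^T R
Y <- t(solve(t(R), t(X) - mu))                      # Y = [R^T]^{-1} (X - mu)
round(colMeans(Y), 2)                               # approximately (0, 0)
round(var(Y), 2)                                    # approximately I_2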

Distribution of the Coefficients - $\sigma^2$ known

Distribution of $\hat{\beta}$

  • We already know the moments of the estimator:

$$\hat{\beta} = (X^TX)^{-1}X^Ty \qquad \mathbb{E}[\hat{\beta}] = \beta \qquad \operatorname{Var}[\hat{\beta}] = \sigma^2(X^TX)^{-1}.$$

  • As $y$ is a Gaussian vector and $\hat{\beta}$ is a linear transformation of $y$, $\hat{\beta}$ is also a Gaussian vector.

  • Hence, when the variance $\sigma^2$ is known, we get:

$$\hat{\beta} \sim \mathcal{N}\left(\beta, \; \sigma^2(X^TX)^{-1}\right).$$
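A short simulation illustrates this result by comparing the empirical covariance of $\hat{\beta}$ over repeated datasets with $\sigma^2(X^TX)^{-1}$ (illustrative sketch with a toy design):

## Empirical check of the distribution of beta_hat (sketch)
set.seed(9)
n <- 40; X <- cbind(1, runif(n)); beta <- c(1, 2); sigma <- 1
beta_hats <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  drop(solve(crossprod(X), crossprod(X, y)))
})
round(var(t(beta_hats)), 3)              # empirical covariance of beta_hat
round(sigma^2 * solve(crossprod(X)), 3)  # theoretical sigma^2 (X^T X)^{-1}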

Cochran Theorem

Chi Squared Distribution

Definition:
Let $X_1, \dots, X_p$ be $p$ iid standard normal r.v.: $X_i \sim \mathcal{N}(0, 1)$. Then $$X = \sum_{i=1}^p X_i^2 \sim \chi^2_p$$

is a Chi squared r.v. with $p$ degrees of freedom.

Property:
Let $X$ be a Gaussian vector of size $p$ with expectation $\mu$ and variance $\Sigma$: $X \sim \mathcal{N}(\mu, \Sigma)$. Then, if $\Sigma$ is invertible, $$(X - \mu)^T \Sigma^{-1} (X - \mu) \sim \chi^2_p$$

is a Chi squared r.v. with $p$ degrees of freedom.

Chi Squared Distribution - Proof

$X$ is a Gaussian vector $X \sim \mathcal{N}(\mu, \Sigma)$ with $\Sigma$ invertible, so: $$Y = \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, I_p).$$

And: $$(X - \mu)^T \Sigma^{-1} (X - \mu) = Y^T Y = \sum_{i=1}^p Y_i^2$$

with $Y_1, \dots, Y_p$ being $p$ iid standard normal r.v.

Hence: $$(X - \mu)^T \Sigma^{-1} (X - \mu) = \sum_{i=1}^p Y_i^2 \sim \chi^2_p.$$
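The quadratic form property is easy to check by simulation, e.g. with `mahalanobis`, which computes $(x-\mu)^T\Sigma^{-1}(x-\mu)$ row by row (illustrative sketch, toy parameters):

## Simulation check of the chi-squared property (sketch)
set.seed(5)
p <- 3
mu <- c(1, 2, 3)
Sigma <- diag(p) + 0.5                          # a positive definite covariance
X <- MASS::mvrnorm(10000, mu = mu, Sigma = Sigma)
Q <- mahalanobis(X, center = mu, cov = Sigma)   # (X - mu)^T Sigma^{-1} (X - mu)
c(empirical = mean(Q), theoretical = p)         # mean of a chi^2_p is p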

Cochran Theorem

Theorem: Let

  • $Y$ a Gaussian vector $Y \sim \mathcal{N}(\mu, \sigma^2 I_n)$;
  • $\mathcal{M}$ a subspace of $\mathbb{R}^n$ of dimension $p$;
  • $P$ the orthogonal projection matrix on $\mathcal{M}$;
  • $P^\perp = I_n - P$ the orthogonal projection matrix on $\mathcal{M}^\perp$.

Then we have:

  1. $PY \sim \mathcal{N}(P\mu, \sigma^2 P)$ and $P^\perp Y \sim \mathcal{N}(P^\perp\mu, \sigma^2 P^\perp)$;
  2. $PY$ and $P^\perp Y$ are independent;
  3. $\frac{1}{\sigma^2}\|P(Y - \mu)\|^2 \sim \chi^2_p$ and $\frac{1}{\sigma^2}\|P^\perp(Y - \mu)\|^2 \sim \chi^2_{n-p}$

Cochran Theorem - Proof - hints

Theorem:

  • $Y$ a Gaussian vector $Y \sim \mathcal{N}(\mu, \sigma^2 I_n)$;
  • $P$ the orthogonal projection matrix on $\mathcal{M}$ of dimension $p$;
  • $P^\perp = I_n - P$ the orthogonal projection matrix on $\mathcal{M}^\perp$.
  1. $PY \sim \mathcal{N}(P\mu, \sigma^2 P)$ and $P^\perp Y \sim \mathcal{N}(P^\perp\mu, \sigma^2 P^\perp)$;
  2. $PY$ and $P^\perp Y$ are independent;
  3. $\frac{1}{\sigma^2}\|P(Y - \mu)\|^2 \sim \chi^2_p$ and $\frac{1}{\sigma^2}\|P^\perp(Y - \mu)\|^2 \sim \chi^2_{n-p}$

Hints:
1. $X \sim \mathcal{N}(\mu, \Sigma) \implies aX + b \sim \mathcal{N}(a\mu + b, a\Sigma a^T)$.
2. $P = U\Delta U^T$ with $UU^T = I_n$ and $\Delta = \begin{pmatrix} I_p & 0 \\ 0 & 0 \end{pmatrix}$.
2. (bis) $Z = U^TY \sim \mathcal{N}(\;\cdot\;, \;\cdot\;)$ and $\Delta Z = \begin{pmatrix} Z_p \\ 0_{n-p} \end{pmatrix}$

Cochran Theorem - Proof - (1)

From the linear property of Gaussian vectors: $$PY \sim \mathcal{N}\left(P\mu, \; P[\sigma^2 I_n]P^T\right)$$

but: $$P[\sigma^2 I_n]P^T = \sigma^2 PP^T = \sigma^2 PP \quad [\text{orthogonal}] \quad = \sigma^2 P \quad [\text{projection}]$$

Same for $P^\perp Y$.

Cochran Theorem - Proof - (2) - 1/2

As $P$ is an orthogonal projection matrix: $$P = U\Delta U^T \qquad UU^T = I_n \qquad \Delta = \begin{pmatrix} I_p & 0 \\ 0 & 0 \end{pmatrix}$$

Define: $$Z = U^TY \sim \mathcal{N}\left(U^T\mu, \; U^T[\sigma^2 I_n]U = \sigma^2 I_n\right).$$

Then: $$\Delta Z = \begin{pmatrix} Z_p \\ 0_{n-p} \end{pmatrix} \quad \text{independent from} \quad (I_n - \Delta)Z = \begin{pmatrix} 0_p \\ Z_{n-p} \end{pmatrix}$$

Cochran Theorem - Proof - (2) - 2/2

$$P = U\Delta U^T \qquad UU^T = I_n \qquad \Delta = \begin{pmatrix} I_p & 0 \\ 0 & 0 \end{pmatrix}$$

$$\Delta Z = \begin{pmatrix} Z_p \\ 0_{n-p} \end{pmatrix} \quad \text{independent from} \quad (I_n - \Delta)Z = \begin{pmatrix} 0_p \\ Z_{n-p} \end{pmatrix}$$

This leads to: $U\Delta Z = U\Delta U^TY = PY$
independent from $U(I_n - \Delta)Z = U(I_n - \Delta)U^TY = P^\perp Y$.

Cochran Theorem - Proof - (3) - 1/3

$$\frac{1}{\sigma^2}\|P(Y - \mu)\|^2 = \frac{1}{\sigma^2}\|U\Delta U^T(Y - \mu)\|^2 = \; ?$$

$$\frac{1}{\sigma^2}\|P(Y - \mu)\|^2 = \frac{1}{\sigma^2}\left[U\Delta U^T(Y - \mu)\right]^T\left[U\Delta U^T(Y - \mu)\right] = \frac{1}{\sigma^2}\left[\Delta U^T(Y - \mu)\right]^T U^T U \left[\Delta U^T(Y - \mu)\right] = \frac{1}{\sigma^2}\left(\Delta U^TY - \Delta U^T\mu\right)^T\left(\Delta U^TY - \Delta U^T\mu\right)$$

Cochran Theorem - Proof - (3) - 2/3

$$\frac{1}{\sigma^2}\|P(Y - \mu)\|^2 = \frac{1}{\sigma^2}\left(\Delta U^TY - \Delta U^T\mu\right)^T\left(\Delta U^TY - \Delta U^T\mu\right)$$

But: $$\Delta U^TY = \Delta Z = \begin{pmatrix} Z_p \\ 0_{n-p} \end{pmatrix} \quad \text{and} \quad \Delta U^T\mu = \begin{pmatrix} (U^T\mu)_p \\ 0_{n-p} \end{pmatrix}$$

Hence: $$\frac{1}{\sigma^2}\|P(Y - \mu)\|^2 = \frac{1}{\sigma^2}\left(Z_p - (U^T\mu)_p\right)^T\left(Z_p - (U^T\mu)_p\right)$$

Cochran Theorem - Proof - (3) - 3/3

$$\frac{1}{\sigma^2}\|P(Y - \mu)\|^2 = \frac{1}{\sigma^2}\left(Z_p - (U^T\mu)_p\right)^T\left(Z_p - (U^T\mu)_p\right) = \left(Z_p - (U^T\mu)_p\right)^T\left[\sigma^2 I_p\right]^{-1}\left(Z_p - (U^T\mu)_p\right)$$

But: $$Z \sim \mathcal{N}(U^T\mu, \sigma^2 I_n) \quad \text{hence} \quad Z_p \sim \mathcal{N}\left((U^T\mu)_p, \sigma^2 I_p\right)$$

Thanks to the previous lemma: $$\frac{1}{\sigma^2}\|P(Y - \mu)\|^2 \sim \chi^2_p$$

Same for $\frac{1}{\sigma^2}\|P^\perp(Y - \mu)\|^2 \sim \chi^2_{n-p}$.
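The statement can also be checked numerically: project a Gaussian vector on the orthogonal of a $p$-dimensional column space and compare the normalized squared norm with a $\chi^2_{n-p}$ (illustrative sketch with a toy design):

## Numerical check of Cochran's theorem (sketch)
set.seed(6)
n <- 20; p <- 2
X <- cbind(1, runif(n))                        # columns span a p-dimensional subspace
P <- X %*% solve(crossprod(X)) %*% t(X)        # orthogonal projection on M(X)
P_perp <- diag(n) - P
mu <- X %*% c(1, 2); sigma <- 2
Q <- replicate(5000, {
  Y <- mu + rnorm(n, sd = sigma)
  sum((P_perp %*% (Y - mu))^2) / sigma^2       # should follow a chi^2_{n-p}
})
c(empirical = mean(Q), theoretical = n - p)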

Distribution of $\hat{\sigma}^2$

Distribution of $\hat{\sigma}^2$

When the variance $\sigma^2$ is known, we get:

$$\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$$

and

$\hat{\beta}$ and $\hat{\sigma}^2$ are independent.

Proof: Use Cochran Theorem

We have:

  • $Y$ a Gaussian vector $Y \sim \mathcal{N}(X\beta, \sigma^2 I_n)$;
  • $\mathcal{M}(X) = \operatorname{span}\{\mathbf{x}^1, \dots, \mathbf{x}^p\}$;
  • $P^X = X(X^TX)^{-1}X^T$ the orthogonal projection on $\mathcal{M}(X)$;
  • $P^{X^\perp} = I_n - P^X$

And:

  • $\hat{\beta} = (X^TX)^{-1}X^Ty$
  • $\hat{\sigma}^2 = \dfrac{\|\hat{\epsilon}\|^2}{n-p} = \dfrac{\|y - X\hat{\beta}\|^2}{n-p}$

Proof - Chi squared

$$\hat{\sigma}^2 = \frac{\|y - X\hat{\beta}\|^2}{n-p} = \frac{\|y - \hat{y}\|^2}{n-p} = \frac{\|y - P^Xy\|^2}{n-p} = \frac{\|P^{X^\perp}y\|^2}{n-p}$$

Hence: $$\frac{n-p}{\sigma^2}\hat{\sigma}^2 = \frac{1}{\sigma^2}\|P^{X^\perp}y\|^2$$

Moreover, $P^{X^\perp}X\beta = 0$, so $P^{X^\perp}y = P^{X^\perp}(y - X\beta)$, and from Cochran's Theorem: $$\frac{n-p}{\sigma^2}\hat{\sigma}^2 = \frac{1}{\sigma^2}\|P^{X^\perp}(y - X\beta)\|^2 \sim \chi^2_{n-p}.$$

Proof - Independence

$$\hat{\beta} = (X^TX)^{-1}X^Ty = (X^TX)^{-1}(X^TX)(X^TX)^{-1}X^Ty = (X^TX)^{-1}X^T\left(X(X^TX)^{-1}X^T\right)y = (X^TX)^{-1}X^TP^Xy$$

and $\hat{\sigma}^2 = \dfrac{1}{n-p}\|P^{X^\perp}y\|^2$

But $P^Xy$ and $P^{X^\perp}y$ are independent from Cochran's theorem.

Hence, $\hat{\beta}$ and $\hat{\sigma}^2$ are independent.
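Both conclusions can be illustrated with a quick simulation: the rescaled $\hat{\sigma}^2$ has the mean of a $\chi^2_{n-p}$, and $\hat{\sigma}^2$ is empirically uncorrelated with the coefficients, consistent with independence (illustrative sketch, toy design):

## Simulation: distribution of sigma2_hat and independence from beta_hat (sketch)
set.seed(7)
n <- 50; p <- 3; sigma <- 1.5
X <- cbind(1, runif(n), runif(n)); beta <- c(-1, 3, -1)
sims <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  fit <- lm(y ~ X - 1)
  c(coef(fit)[2], summary(fit)$sigma^2)        # (beta_hat_1, sigma2_hat)
})
mean((n - p) * sims[2, ] / sigma^2)            # close to n - p, mean of a chi^2_{n-p}
cor(sims[1, ], sims[2, ])                      # close to 0, consistent with independence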

Distribution of the Coefficients - $\sigma^2$ unknown

Distribution of $\hat{\beta}$

When $\sigma^2$ is known:

$$\hat{\beta} \sim \mathcal{N}\left(\beta, \; \sigma^2(X^TX)^{-1}\right).$$

i.e. for $1 \leq k \leq p$:

$$\hat{\beta}_k \sim \mathcal{N}\left(\beta_k, \; \sigma^2\left[(X^TX)^{-1}\right]_{kk}\right)$$

i.e.

$$\frac{\hat{\beta}_k - \beta_k}{\sqrt{\sigma^2\left[(X^TX)^{-1}\right]_{kk}}} \sim \mathcal{N}(0, 1).$$

  • Problem: $\sigma^2$ is generally unknown.
  • Solution: replace $\sigma^2$ by $\hat{\sigma}^2$.

Distribution of $\hat{\beta}$

When $\sigma^2$ is unknown, for any $1 \leq k \leq p$: $$\frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2\left[(X^TX)^{-1}\right]_{kk}}} \sim \mathcal{T}_{n-p}.$$

Notation: $$\hat{\sigma}^2_k = \hat{\sigma}^2_{\hat{\beta}_k} = \hat{\sigma}^2\left[(X^TX)^{-1}\right]_{kk}$$

Attention:
$$\left[(X^TX)^{-1}\right]_{kk} \neq \left[(X^TX)_{kk}\right]^{-1}$$
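The quantities $\hat{\sigma}^2_k$ are exactly the squared standard errors reported by `lm`; the sketch below recomputes them from $\hat{\sigma}^2\left[(X^TX)^{-1}\right]_{kk}$ on simulated data (data and variable names are illustrative):

## Standard errors of the coefficients from the formula (sketch)
set.seed(8)
n <- 100
x_1 <- runif(n); x_2 <- runif(n)
y <- -1 + 3 * x_1 - x_2 + rnorm(n)
fit <- lm(y ~ x_1 + x_2)
X <- model.matrix(fit); p <- ncol(X)
sigma2_hat <- sum(residuals(fit)^2) / (n - p)
se_hat <- sqrt(sigma2_hat * diag(solve(crossprod(X))))   # sqrt(sigma2_hat [(X^T X)^{-1}]_kk)
cbind(by_hand = se_hat, lm = summary(fit)$coefficients[, "Std. Error"])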

Reminder: Student Distribution

  • $Z \sim \mathcal{N}(0, 1)$
  • $X \sim \chi^2_p$
  • $Z$ and $X$ independent

Then $$T = \frac{Z}{\sqrt{X/p}} \sim \mathcal{T}_p$$

Distribution of $\hat{\beta}$ - Proof

$$\frac{\hat{\beta}_k - \beta_k}{\sqrt{\sigma^2\left[(X^TX)^{-1}\right]_{kk}}} \sim \mathcal{N}(0, 1) \quad\text{and}\quad \frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$$

$$\frac{\hat{\beta}_k - \beta_k}{\sqrt{\sigma^2\left[(X^TX)^{-1}\right]_{kk}}} \quad\text{and}\quad \frac{(n-p)\hat{\sigma}^2}{\sigma^2} \quad \text{independent}$$

Hence: $$\frac{\dfrac{\hat{\beta}_k - \beta_k}{\sqrt{\sigma^2\left[(X^TX)^{-1}\right]_{kk}}}}{\sqrt{\dfrac{(n-p)\hat{\sigma}^2}{\sigma^2} \Big/ (n-p)}} = \frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2\left[(X^TX)^{-1}\right]_{kk}}} \sim \mathcal{T}_{n-p}$$

Confidence Intervals and Tests

Confidence intervals

$$\frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2_k}} \sim \mathcal{T}_{n-p} \quad\text{and}\quad \frac{n-p}{\sigma^2}\hat{\sigma}^2 \sim \chi^2_{n-p}$$

With probability $1 - \alpha$: $$\beta_k \in \left[\hat{\beta}_k \pm t_{n-p}(1 - \alpha/2)\sqrt{\hat{\sigma}^2_k}\right] \qquad \sigma^2 \in \left[\frac{(n-p)\hat{\sigma}^2}{c_{n-p}(1 - \alpha/2)}; \frac{(n-p)\hat{\sigma}^2}{c_{n-p}(\alpha/2)}\right]$$

where $t_{n-p}(\cdot)$ and $c_{n-p}(\cdot)$ denote the quantile functions of the $\mathcal{T}_{n-p}$ and $\chi^2_{n-p}$ distributions.
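In R, the interval for $\beta_k$ is given by `confint`, and the interval for $\sigma^2$ can be computed from the $\chi^2_{n-p}$ quantiles with `qchisq` (illustrative sketch on toy data):

## Confidence intervals from the formulas (sketch)
set.seed(10)
n <- 100; x <- runif(n); y <- 1 + 2 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x); p <- length(coef(fit)); alpha <- 0.05
confint(fit, level = 1 - alpha)                   # intervals for the beta_k
s2 <- sum(residuals(fit)^2) / (n - p)             # sigma2_hat
(n - p) * s2 / qchisq(c(1 - alpha / 2, alpha / 2), df = n - p)  # interval for sigma^2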

Tests

$$\text{Hypothesis:} \quad H_0: \beta_k = 0 \quad\text{vs}\quad H_1: \beta_k \neq 0$$

$$\text{Test Statistic:} \quad T_k = \frac{\hat{\beta}_k}{\sqrt{\hat{\sigma}^2_k}} \underset{H_0}{\sim} \mathcal{T}_{n-p}$$

$$\text{Rejection Region:} \quad \mathbb{P}[\text{Reject} \,|\, H_0 \text{ true}] \leq \alpha \quad\Longrightarrow\quad \mathcal{R}_\alpha = \left\{t \in \mathbb{R} \;\middle|\; |t| \geq t_{n-p}(1 - \alpha/2)\right\}$$

$$p\text{ value:} \quad p = \mathbb{P}_{H_0}\left[|T_k| > |T_k^{obs}|\right]$$
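The test statistic and the $p$ value can be recomputed by hand and compared with the `t value` and `Pr(>|t|)` columns of `summary(lm)` (sketch, reusing the fitted model `fit`, `n` and `p` from the previous sketch):

## t statistics and p values for H0: beta_k = 0, by hand (sketch)
tab <- summary(fit)$coefficients
t_k <- tab[, "Estimate"] / tab[, "Std. Error"]            # T_k = beta_hat_k / sigma_hat_k
p_val <- 2 * pt(abs(t_k), df = n - p, lower.tail = FALSE)
cbind(t_k, p_val, tab[, c("t value", "Pr(>|t|)")])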

Test - Null Distribution

Test - Do not reject $H_0$

Test - Reject $H_0$

Test - $p$ value

Simulated Dataset

Simulated Dataset

$$y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i$$

set.seed(12890926)

## Predictors
n <- 100
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = 0, max = 4)

## Noise
eps <- rnorm(n, mean = 0, sd = 1)

## Model sim
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

Simulated Dataset - Fit

$$y_i = -1 + 3 x_{i1} - x_{i2} + \epsilon_i$$

fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)
## 
## Call:
## lm(formula = y_sim ~ x_1 + x_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.54413 -0.71088  0.00976  0.66096  1.98562 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.83037    0.19076  -4.353 3.33e-05 ***
## x_1          2.87065    0.08777  32.706  < 2e-16 ***
## x_2         -1.07506    0.08229 -13.064  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9761 on 97 degrees of freedom
## Multiple R-squared:  0.9376, Adjusted R-squared:  0.9363 
## F-statistic: 728.9 on 2 and 97 DF,  p-value: < 2.2e-16

Simulated Dataset - Wrong Fit

## unrelated noise variable x_3
x_3 <- runif(n, min = -4, max = 0)
## Fit
fit <- lm(y_sim ~ x_1 + x_2 + x_3); summary(fit)
## 
## Call:
## lm(formula = y_sim ~ x_1 + x_2 + x_3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6257 -0.7069  0.0366  0.6483  1.9548 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.99755    0.25455  -3.919 0.000167 ***
## x_1          2.87506    0.08789  32.711  < 2e-16 ***
## x_2         -1.07733    0.08233 -13.086  < 2e-16 ***
## x_3         -0.08755    0.08826  -0.992 0.323698    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9762 on 96 degrees of freedom
## Multiple R-squared:  0.9382, Adjusted R-squared:  0.9363 
## F-statistic: 486.2 on 3 and 96 DF,  p-value: < 2.2e-16

Prediction

Predict a new point

  • We fitted $\hat{\beta}_1, \dots, \hat{\beta}_p$ on $\left((y_1, \mathbf{x}_1), \dots, (y_n, \mathbf{x}_n)\right)$
  • A new line of predictors $\mathbf{x}_{n+1}$ comes along. How can we guess $y_{n+1}$?
  • We use the same model: $y_{n+1} = \mathbf{x}_{n+1}\beta + \epsilon_{n+1}$
    with $\epsilon_{n+1} \sim \mathcal{N}(0, \sigma^2)$, independent from all the $(\epsilon_i)_{1 \leq i \leq n}$.
  • We predict $y_{n+1}$ with: $\hat{y}_{n+1} = \mathbf{x}_{n+1}\hat{\beta}$
  • Question: What is the error $\hat{\epsilon}_{n+1} = y_{n+1} - \hat{y}_{n+1}$?

Prediction Error - Known variance

The prediction error $\hat{\epsilon}_{n+1} = y_{n+1} - \hat{y}_{n+1}$ is such that: $$\hat{\epsilon}_{n+1} \sim \mathcal{N}\left(0, \; \sigma^2\left(1 + \mathbf{x}_{n+1}(X^TX)^{-1}\mathbf{x}_{n+1}^T\right)\right)$$

Proof:

  • $\hat{\epsilon}_{n+1}$ is Gaussian as the difference of two independent Gaussian variables ($\hat{y}_{n+1}$ depends only on $\epsilon_1, \dots, \epsilon_n$, while $y_{n+1}$ depends on $\epsilon_{n+1}$).
  • We already know its mean and variance (see previous lesson).

Remark: $\mathbf{x}_{n+1}(X^TX)^{-1}\mathbf{x}_{n+1}^T$ is a scalar.

Prediction Error - Unknown variance

We get: $$\frac{y_{n+1} - \hat{y}_{n+1}}{\sqrt{\hat{\sigma}^2\left(1 + \mathbf{x}_{n+1}(X^TX)^{-1}\mathbf{x}_{n+1}^T\right)}} \sim \mathcal{T}_{n-p}$$

Proof:

$$\frac{y_{n+1} - \hat{y}_{n+1}}{\sqrt{\hat{\sigma}^2\left(1 + \mathbf{x}_{n+1}(X^TX)^{-1}\mathbf{x}_{n+1}^T\right)}} = \frac{\dfrac{y_{n+1} - \hat{y}_{n+1}}{\sqrt{\sigma^2\left(1 + \mathbf{x}_{n+1}(X^TX)^{-1}\mathbf{x}_{n+1}^T\right)}}}{\sqrt{\dfrac{(n-p)\hat{\sigma}^2}{\sigma^2} \Big/ (n-p)}}$$

Prediction - Confidence Interval

With probability $1 - \alpha$: $$y_{n+1} \in \left[\hat{y}_{n+1} \pm t_{n-p}(1 - \alpha/2)\sqrt{\hat{\sigma}^2\left(1 + \mathbf{x}_{n+1}(X^TX)^{-1}\mathbf{x}_{n+1}^T\right)}\right]$$
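On the simulated dataset of the previous section, this interval is what `predict(..., interval = "prediction")` returns; the sketch below recomputes it from the formula for one hypothetical new point $(x_1, x_2) = (1, 2)$, reusing `y_sim`, `x_1`, `x_2` from the simulation chunk above:

## Prediction interval: predict() vs the formula (sketch)
fit <- lm(y_sim ~ x_1 + x_2)
x_new <- data.frame(x_1 = 1, x_2 = 2)
predict(fit, newdata = x_new, interval = "prediction", level = 0.95)
## by hand
X <- model.matrix(fit); n <- nrow(X); p <- ncol(X)
x_vec <- c(1, 1, 2)                                  # (intercept, x_1, x_2)
s2 <- sum(residuals(fit)^2) / (n - p)
y_new <- sum(x_vec * coef(fit))
half <- qt(0.975, df = n - p) *
  sqrt(s2 * (1 + drop(t(x_vec) %*% solve(crossprod(X)) %*% x_vec)))
c(fit = y_new, lwr = y_new - half, upr = y_new + half)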

Advertising

Advertising

fit_all <- lm(sales ~ TV + radio + newspaper, data = ad)
summary(fit_all)
## 
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
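The same machinery applies to the advertising fit; for instance, 95% confidence intervals for the coefficients are obtained with `confint` (a sketch, output not shown):

confint(fit_all, level = 0.95)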