Multiple Regression - Least Squares

Paul Bastide - Ibrahim Bouzalmat

31/01/2024

Advertising Data

Advertising - Data

library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")
head(ad)
##      TV radio newspaper sales
## 1 230.1  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
## 4 151.5  41.3      58.5  18.5
## 5 180.8  10.8      58.4  12.9
## 6   8.7  48.9      75.0   7.2
  • TV, radio, newspaper: advertising budgets (thousands of $)
  • sales: number of sales (thousands of units)

Advertising - Data

attach(ad)
par(mfrow = c(1, 3))
plot(TV, sales); plot(radio, sales); plot(newspaper, sales)

Advertising - Questions

  • Previous chapters: use only one medium to predict sales
    $\text{sales} \approx \beta_0^{TV} + \beta_1^{TV} \cdot \text{TV}$
    or
    $\text{sales} \approx \beta_0^{radio} + \beta_1^{radio} \cdot \text{radio}$
    or
    $\text{sales} \approx \beta_0^{newspaper} + \beta_1^{newspaper} \cdot \text{newspaper}$
  • Can we use all variables at once?
    $\text{sales} \approx \beta_0 + \beta_1 \, \text{TV} + \beta_2 \, \text{radio} + \beta_3 \, \text{newspaper}$
  • Makes it possible to answer new questions:
    • Is the global fit better?
    • Which media contribute to sales?
    • Is there synergy among the media?

From Simple to Multiple Regression

Model - Simple Regression

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad 1 \le i \le n$$

  • $y_i$: quantitative response for $i$ (sales)
  • $x_i$: quantitative predicting variable for $i$ (TV)
  • $\epsilon_i$: “error” for $i$
    • random variable
    • (H1) $\mathbb{E}[\epsilon_i] = 0$ for all $i$ (centered)
    • (H2) $\mathrm{Var}[\epsilon_i] = \sigma^2$ for all $i$ (identical variance)
    • (H3) $\mathrm{Cov}[\epsilon_i; \epsilon_j] = 0$ for all $i \neq j$ (uncorrelated)

Model - Multiple Regression

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i, \quad 1 \le i \le n$$

  • $y_i$: quantitative response for $i$ (sales)
  • $x_{ij}$: quantitative predicting variable $j$ for observation $i$
    (TV, radio and newspaper, i.e. $p = 3$)
  • $\epsilon_i$: “error” for $i$
    • random variable
    • (H1) $\mathbb{E}[\epsilon_i] = 0$ for all $i$ (centered)
    • (H2) $\mathrm{Var}[\epsilon_i] = \sigma^2$ for all $i$ (identical variance)
    • (H3) $\mathrm{Cov}[\epsilon_i; \epsilon_j] = 0$ for all $i \neq j$ (uncorrelated)

Simple Regression - Vector notation

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \beta_0 \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} + \beta_1 \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

i.e. $y = \beta_0 \mathbb{1} + \beta_1 x + \epsilon$

  • $y = (y_1, \dots, y_n)^T$ random vector of responses
  • $\mathbb{1} = (1, \dots, 1)^T$ vector of ones
  • $x = (x_1, \dots, x_n)^T$ non random vector of predictors
  • $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T$ random vector of errors
  • $\beta_0$, $\beta_1$ non random, unknown coefficients

Multiple Regression - Matrix notation

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i, \quad 1 \le i \le n$$

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \beta_0 \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} + \beta_1 \begin{pmatrix} x_{11} \\ \vdots \\ x_{n1} \end{pmatrix} + \cdots + \beta_p \begin{pmatrix} x_{1p} \\ \vdots \\ x_{np} \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

$$y = X\beta + \epsilon$$

Multiple Regression - Matrix notation

$$y_i = \beta_0 + \sum_{k=1}^{p} \beta_k x_{ik} + \epsilon_i, \quad 1 \le i \le n$$

i.e. $y = X\beta + \epsilon$

  • y random vector of responses
  • X non random matrix of predictors
  • ϵ random vector of errors
  • β non random, unknown vector of coefficients

Multiple Regression - Intercept

With the intercept, the model is written as:
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

Re-numbering, we get:
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1(p+1)} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{n(p+1)} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{p+1} \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$
So that we see the intercept $\mathbb{1} = (1, \dots, 1)^T$ as a predictor.

Multiple Regression - Intercept

Without loss of generality, we write:

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

We use this convention in the rest of the text.

Model

Multiple Regression - Model

Model:

$$y = X\beta + \epsilon$$

  • y random vector of n responses
  • X non random n×p matrix of predictors
  • ϵ random vector of n errors
  • β non random, unknown vector of p coefficients

Assumptions:

  • (H1) $\mathrm{rg}(X) = p$
  • (H2) $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}[\epsilon] = \sigma^2 I_n$
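
As a quick illustration in R (not in the original slides), the design matrix $X$ of this model can be built with model.matrix() on the Advertising data loaded above; the rank check corresponds to assumption (H1).

X <- model.matrix(sales ~ TV + radio + newspaper, data = ad)
dim(X)      ## n = 200 rows and p = 4 columns (intercept + 3 predictors)
head(X, 3)
qr(X)$rank  ## equals p = 4 here, so (H1) rg(X) = p holds for these data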

Notations

The column and row vectors of $X$ are written as:

$$X = \begin{pmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_p \end{pmatrix} = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$$

where $\mathbf{x}_k$ denotes the $k$-th column of $X$, and $x_i$ its $i$-th row (a line-vector).

For $1 \le i \le n$: $y_i = [X\beta]_i + \epsilon_i = x_i \beta + \epsilon_i$

And: $y = \sum_{k=1}^{p} \beta_k \mathbf{x}_k + \epsilon$

Degrees of freedom

For the model $y = X\beta + \epsilon$ with $\mathrm{rg}(X) = p$,

there are $n - p$ degrees of freedom.

Degrees of freedom - simple regression

In a simple regression, we have:
$$y = \beta_0 \mathbb{1} + \beta_1 x + \epsilon = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

  • There are $n - 2$ degrees of freedom.

  • But only one “explanatory variable” $x$.

Degrees of freedom - with intercept

If we write:
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

  • There are $n - (p + 1) = n - p - 1$ degrees of freedom.

  • But only $p$ “explanatory variables” $x_1, \dots, x_p$.

Degrees of freedom - general case

If we write:
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

  • There are $n - p$ degrees of freedom.

  • If $\mathbf{x}_1 = \mathbb{1}$ is the intercept,
    then there are $p - 1$ “explanatory variables” $\mathbf{x}_2, \dots, \mathbf{x}_p$.
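
A short check of these counts on the Advertising data (not in the original slides): with the intercept seen as a predictor, $p = 4$, and there are $n - p = 196$ degrees of freedom, the value reported by lm() at the end of this document.

fit <- lm(sales ~ TV + radio + newspaper, data = ad)
nobs(fit)          ## n = 200
length(coef(fit))  ## p = 4 (intercept counted as a predictor)
df.residual(fit)   ## n - p = 196 degrees of freedom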

Reminders: Orthogonal Projection

Reminder - Euclidean Geometry 1/2

Let $x = (x_1, \dots, x_n)^T$ and $y = (y_1, \dots, y_n)^T$ be two vectors of $\mathbb{R}^n$.

The squared Euclidean norm of $x$ is given by: $$\|x\|^2 = \sum_{i=1}^{n} x_i^2 = x^T x$$

The Euclidean scalar product between $x$ and $y$ is given by: $$\langle x; y \rangle = \sum_{i=1}^{n} x_i y_i = x^T y = y^T x$$

Reminder - Euclidean Geometry 2/2

We have the following formulas:

$$\|x + y\|^2 = \|x\|^2 + \|y\|^2 + 2\langle x; y \rangle$$

$$\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\langle x; y \rangle$$

$$\langle x; y \rangle = \frac{1}{4}\left(\|x + y\|^2 - \|x - y\|^2\right)$$

$$|\langle x; y \rangle| \le \|x\| \, \|y\| \quad \text{(Cauchy-Schwarz)}$$

$$\|x + y\| \le \|x\| + \|y\| \quad \text{(Triangle Inequality)}$$
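
A minimal numerical illustration of these identities (not in the original slides), on two arbitrary vectors:

set.seed(42)
x <- rnorm(5); y <- rnorm(5)
c(sum(x * y),                                   ## <x; y>
  (sum((x + y)^2) - sum((x - y)^2)) / 4)        ## polarization identity: same value
abs(sum(x * y)) <= sqrt(sum(x^2)) * sqrt(sum(y^2))       ## Cauchy-Schwarz: TRUE
sqrt(sum((x + y)^2)) <= sqrt(sum(x^2)) + sqrt(sum(y^2))  ## triangle inequality: TRUE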

Reminder: Orthogonal Projection - 1/4

Let $y \in \mathbb{R}^n$,
$(\mathbf{x}_1, \dots, \mathbf{x}_p) \in \mathbb{R}^n \times \dots \times \mathbb{R}^n$ be $p$ linearly independent vectors,
and $\mathcal{M}(X) = \mathrm{span}\{\mathbf{x}_1, \dots, \mathbf{x}_p\}$.

The orthogonal projection of $y$ on $\mathcal{M}(X)$ is defined as the unique vector $\mathrm{Proj}_{\mathcal{M}(X)}(y) \in \mathcal{M}(X)$ such that:
$$\mathrm{Proj}_{\mathcal{M}(X)}(y) = \underset{\tilde{y} = X\beta,\ \beta \in \mathbb{R}^p}{\mathrm{argmin}} \|y - \tilde{y}\|^2.$$

It is also the unique vector such that $y - \mathrm{Proj}_{\mathcal{M}(X)}(y)$ is orthogonal to $\mathcal{M}(X)$, i.e. orthogonal to $\mathbf{x}_k$ for all $1 \le k \le p$.

There is a unique matrix $P_X$ such that $\mathrm{Proj}_{\mathcal{M}(X)}(y) = P_X y$.

Reminder: Orthogonal Projection - 2/4

Reminder: Orthogonal Projection - 3/4

Definition:

  • An $n \times n$ matrix $P$ is a projection matrix if $P^2 = P$.
  • If $P^2 = P$ and $P$ is symmetric,
    then $P$ is an orthogonal projection matrix.
  • For any $x \in \mathbb{R}^n$, $Px$ is the projection on $\mathrm{Im}(P)$, parallel to $\mathrm{Ker}(P)$.
  • If $P$ is a projection matrix, then $P$ is an orthogonal projection matrix iff for any $x \in \mathbb{R}^n$, we have: $x = Px + (I_n - P)x$
    with $Px$ and $(I_n - P)x$ orthogonal.

Reminder: Matrix of Projection - 4/4

Definition:
An $n \times n$ matrix $U$ is said to be orthogonal if $UU^T = I_n$.
The columns of $U$ are then an orthonormal basis of $\mathbb{R}^n$.

Property:
If $P$ is an orthogonal projection matrix, then there exist an orthogonal matrix $U$ and a diagonal matrix $\Delta$ such that: $P = U \Delta U^T$

with:
$$\Delta = \begin{pmatrix} I_p & 0 \\ 0 & 0 \end{pmatrix} \quad \text{and} \quad p = \dim(\mathrm{Im}(P)) = \mathrm{rg}(P)$$
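
A small R sketch (not in the original slides): build the orthogonal projection on the span of $p = 3$ arbitrary independent columns, using the formula $P = X(X^T X)^{-1}X^T$ derived later in these slides, and check the properties above.

set.seed(1)
n <- 10; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))  ## n x p matrix of rank p
P <- X %*% solve(t(X) %*% X) %*% t(X)                ## orthogonal projection on span(X)
max(abs(P %*% P - P))       ## ~ 0 : P^2 = P
max(abs(P - t(P)))          ## ~ 0 : P is symmetric
round(eigen(P)$values, 10)  ## p ones and n - p zeros (the Delta above)
sum(diag(P))                ## trace = rg(P) = p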

Ordinary Least Squares (OLS)

OLS - Definition

The OLS estimator $\hat{\beta}$ is given by:

$$\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \left\{ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \right\}$$

Goal: minimize the squared errors between:

  • the prediction of the model $\beta_1 x_{i1} + \cdots + \beta_p x_{ip}$ and
  • the actual observed value $y_i$.

OLS - Projection

$$\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \left\{ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \right\} = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \left\{ \sum_{i=1}^{n} \big( y_i - (X\beta)_i \big)^2 \right\} = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \|y - X\beta\|^2$$

OLS - Projection

$$\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \left\{ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \right\} = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \|y - X\beta\|^2$$

hence

$$\hat{y} = X\hat{\beta} = \underset{\tilde{y} = X\beta,\ \beta \in \mathbb{R}^p}{\mathrm{argmin}} \|y - \tilde{y}\|^2 = \mathrm{Proj}_{\mathcal{M}(X)}(y) = P_X y$$

$\hat{y}$ is the orthogonal projection of $y$ on $\mathcal{M}(X) = \mathrm{span}\{\mathbf{x}_1, \dots, \mathbf{x}_p\}$,

the subspace spanned by the columns of $X$.

Geometrical Interpretation

OLS - Projection

$$\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \left\{ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \right\} = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \|y - X\beta\|^2$$

$$\hat{y} = X\hat{\beta} = \underset{\tilde{y} = X\beta,\ \beta \in \mathbb{R}^p}{\mathrm{argmin}} \|y - \tilde{y}\|^2 = \mathrm{Proj}_{\mathcal{M}(X)}(y) = P_X y$$

$P_X$ is the orthogonal projection matrix on $\mathcal{M}(X)$.

$$y = P_X y + (I_n - P_X) y$$

with $P_X y$ and $(I_n - P_X) y$ orthogonal.

OLS - Expression

OLS - Expression

$$\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \|y - X\beta\|^2 \qquad \text{and} \qquad \hat{y} = X\hat{\beta} = P_X y$$

The OLS estimator is given by:
$$\hat{\beta} = (X^T X)^{-1} X^T y$$

And the matrix of orthogonal projection on $\mathcal{M}(X)$ is:
$$P_X = X (X^T X)^{-1} X^T$$
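
As a check (not in the original slides), the closed form can be computed directly on the Advertising data and compared with the coefficients returned by lm():

X <- model.matrix(sales ~ TV + radio + newspaper, data = ad)
y <- ad$sales
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  ## (X^T X)^{-1} X^T y
cbind(beta_hat, lm = coef(lm(sales ~ TV + radio + newspaper, data = ad)))
## the two columns coincide up to numerical error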

OLS - Proof (Geometrical) - 1/2

$\hat{y} = X\hat{\beta}$ is the orthogonal projection of $y$ on $\mathcal{M}(X)$.

It is the only vector such that $y - X\hat{\beta}$ is orthogonal to $\mathcal{M}(X)$.

I.e. $y - X\hat{\beta}$ is orthogonal to all the $\mathbf{x}_k$:
$$\langle \mathbf{x}_k, y - X\hat{\beta} \rangle = \mathbf{x}_k^T (y - X\hat{\beta}) = 0 \quad 1 \le k \le p$$

i.e.
$$\begin{pmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_p \end{pmatrix}^T (y - X\hat{\beta}) = \begin{pmatrix} \mathbf{x}_1^T (y - X\hat{\beta}) \\ \vdots \\ \mathbf{x}_p^T (y - X\hat{\beta}) \end{pmatrix} = 0$$

and $X^T (y - X\hat{\beta}) = 0$.

OLS - Proof (Geometrical) - 2/2

$$X^T (y - X\hat{\beta}) = 0.$$

$$X^T X \hat{\beta} = X^T y$$

As $\mathrm{rg}(X) = p$, $X^T X$ is invertible, and:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

Then: $$\hat{y} = P_X y = X\hat{\beta} = X (X^T X)^{-1} X^T y$$

As this equality is true for any $y$, we get:

$$P_X = X (X^T X)^{-1} X^T.$$

OLS - Proof (Analytical)

The minimum of the quadratic function $S$ is obtained at the point where the gradient is zero:

$$S(\beta) = \|y - X\beta\|^2 = \beta^T X^T X \beta - 2 \beta^T X^T y + y^T y$$

$$\nabla S(\hat{\beta}) = \cdots = 0$$

$$\nabla S(\hat{\beta}) = 2 X^T X \hat{\beta} - 2 X^T y = 0$$

$$X^T X \hat{\beta} = X^T y$$

Same conclusion as before.

Reminder - Expectation and Variance of random vectors

Expectation and Variance

Let z be a random vector of dimension n.

  • The expectation of $z$ is a vector of size $n$: $$\mathbb{E}[z] = \big(\mathbb{E}[z_1], \dots, \mathbb{E}[z_n]\big)^T$$

  • The variance of $z$ is an $n \times n$ matrix:
    $$\mathrm{Var}[z] = \big[\mathrm{Cov}[z_i, z_j]\big]_{1 \le i, j \le n} = \mathbb{E}\big[(z - \mathbb{E}[z])(z - \mathbb{E}[z])^T\big] = \mathbb{E}[z z^T] - \mathbb{E}[z]\,\mathbb{E}[z]^T$$

Expectation and Variance - Properties

Let $A$ be an $m \times n$ deterministic matrix, and $b$ an $m$-vector. Then:

$$\mathbb{E}[Az + b] = A\, \mathbb{E}[z] + b$$

$$\mathrm{Var}[Az + b] = A\, \mathrm{Var}[z]\, A^T.$$

Attention: the transpose is on the right (the variance is a matrix).
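
A quick Monte Carlo illustration of the second property (not in the original slides), with an arbitrary $2 \times 3$ matrix $A$ and $z \sim \mathcal{N}(0, I_3)$:

set.seed(3)
A <- matrix(c(1, 2, 0, -1, 1, 3), nrow = 2)  ## deterministic 2 x 3 matrix
b <- c(1, -1)
z <- matrix(rnorm(3 * 1e5), ncol = 3)        ## 1e5 draws of z, with Var[z] = I_3
w <- t(A %*% t(z) + b)                       ## corresponding draws of A z + b
round(cov(w), 2)                             ## empirical Var[A z + b]
A %*% t(A)                                   ## theoretical A Var[z] A^T, close to the above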

The OLS Estimator is unbiased

OLS estimator is unbiased

The OLS estimator $\hat{\beta} = (X^T X)^{-1} X^T y$

is unbiased: $\mathbb{E}[\hat{\beta}] = \beta$.

OLS estimator is unbiased - Proof - 1/3

$$\mathbb{E}[\hat{\beta}] = \mathbb{E}\big[(X^T X)^{-1} X^T y\big] = \dots$$

OLS estimator is unbiased - Proof - 2/3

$$\mathbb{E}[\hat{\beta}] = \mathbb{E}\big[(X^T X)^{-1} X^T y\big] = (X^T X)^{-1} X^T\, \mathbb{E}[y]$$

And: $$\mathbb{E}[y] = \mathbb{E}[X\beta + \epsilon] = \dots$$

OLS estimator is unbiased - Proof - 3/3

$$\mathbb{E}[\hat{\beta}] = \mathbb{E}\big[(X^T X)^{-1} X^T y\big] = (X^T X)^{-1} X^T\, \mathbb{E}[y]$$

And: $$\mathbb{E}[y] = \mathbb{E}[X\beta + \epsilon] = X\beta + \mathbb{E}[\epsilon] = X\beta$$

Hence: $$\mathbb{E}[\hat{\beta}] = (X^T X)^{-1} X^T X \beta = \beta.$$
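
A simulation sketch (not in the original slides): averaging $\hat{\beta}$ over many data sets simulated from a known $\beta$ illustrates the absence of bias. All names and numerical values below are arbitrary.

set.seed(7)
n <- 50; beta <- c(1, 2, -1)
X <- cbind(1, rnorm(n), runif(n))         ## fixed design with rg(X) = 3
one_beta_hat <- function() {
  y <- X %*% beta + rnorm(n, sd = 2)      ## y = X beta + epsilon
  drop(solve(t(X) %*% X, t(X) %*% y))     ## OLS estimate on this data set
}
rowMeans(replicate(1000, one_beta_hat())) ## close to the true beta = (1, 2, -1)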

Variance of the OLS Estimator

Variance of the OLS Estimator

The OLS estimator $\hat{\beta} = (X^T X)^{-1} X^T y$

has variance: $\mathrm{Var}[\hat{\beta}] = \sigma^2 (X^T X)^{-1}$.

Variance of OLS Estimator - Proof - 1/3

$$\mathrm{Var}[\hat{\beta}] = \mathrm{Var}\big[(X^T X)^{-1} X^T y\big] = \dots$$

Variance of OLS Estimator - Proof - 2/3

$$\begin{aligned}
\mathrm{Var}[\hat{\beta}] &= \mathrm{Var}\big[(X^T X)^{-1} X^T y\big] = (X^T X)^{-1} X^T\, \mathrm{Var}[y]\, \big[(X^T X)^{-1} X^T\big]^T \\
&= (X^T X)^{-1} X^T\, \mathrm{Var}[y]\, X (X^T X)^{-1}
\end{aligned}$$

And: $$\mathrm{Var}[y] = \mathrm{Var}[X\beta + \epsilon] = \dots$$

Variance of OLS Estimator - Proof - 3/3

$$\begin{aligned}
\mathrm{Var}[\hat{\beta}] &= \mathrm{Var}\big[(X^T X)^{-1} X^T y\big] = (X^T X)^{-1} X^T\, \mathrm{Var}[y]\, \big[(X^T X)^{-1} X^T\big]^T \\
&= (X^T X)^{-1} X^T\, \mathrm{Var}[y]\, X (X^T X)^{-1}
\end{aligned}$$

And: $$\mathrm{Var}[y] = \mathrm{Var}[X\beta + \epsilon] = \mathrm{Var}[\epsilon] = \sigma^2 I_n$$

Hence:
$$\mathrm{Var}[\hat{\beta}] = (X^T X)^{-1} X^T [\sigma^2 I_n] X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1} (X^T X) (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}.$$
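
As a check (not in the original slides), this is the covariance matrix reported by vcov() for the Advertising fit, with $\sigma^2$ replaced by the estimate $\hat{\sigma}^2$ introduced later; the two should match up to numerical error.

fit <- lm(sales ~ TV + radio + newspaper, data = ad)
X <- model.matrix(fit)
sigma2_hat <- summary(fit)$sigma^2
max(abs(vcov(fit) - sigma2_hat * solve(t(X) %*% X)))  ## ~ 0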

Gauss - Markov Theorem

Partial order on Symmetric matrices

Definition: Let $S_1$ and $S_2$ be two real $n \times n$ symmetric matrices.
We say that $S_1$ is smaller than $S_2$, and write $S_1 \preceq S_2$,

iff $S_2 - S_1$ is positive semi-definite, i.e.:
$$z^T (S_2 - S_1)\, z \ge 0 \quad \forall z \in \mathbb{R}^n,$$

or, equivalently:
$$z^T S_1 z \le z^T S_2 z \quad \forall z \in \mathbb{R}^n.$$

Gauss-Markov theorem

The OLS estimator $\hat{\beta}$ is the BLUE
(Best Linear Unbiased Estimator):
it is the linear unbiased estimator with the minimal variance.

Remark: The OLS estimator $\hat{\beta} = (X^T X)^{-1} X^T y = A y$

is indeed linear, and it is unbiased.

Reminder - Variance of the Sum

Let $u$ and $v$ be two random vectors of size $n$. Then:
$$\mathrm{Var}[u + v] = \mathrm{Var}[u] + \mathrm{Var}[v] + \mathrm{Cov}[u; v] + \mathrm{Cov}[v; u]$$

where:
$$\mathrm{Cov}[u; v] = \mathbb{E}[u v^T] - \mathbb{E}[u]\, \mathbb{E}[v]^T = \mathrm{Cov}[v; u]^T$$

Gauss-Markov - Proof - 1/7

Let $\bar{\beta}$ be a linear unbiased estimator of $\beta$.
Let's show that $\mathrm{Var}[\hat{\beta}] \preceq \mathrm{Var}[\bar{\beta}]$.

$$\mathrm{Var}[\bar{\beta}] = \mathrm{Var}[\bar{\beta} - \hat{\beta} + \hat{\beta}] = \dots$$

Gauss-Markov - Proof - 2/7

Let $\bar{\beta}$ be a linear unbiased estimator of $\beta$.
Let's show that $\mathrm{Var}[\hat{\beta}] \preceq \mathrm{Var}[\bar{\beta}]$.

$$\mathrm{Var}[\bar{\beta}] = \mathrm{Var}[\bar{\beta} - \hat{\beta} + \hat{\beta}] = \mathrm{Var}[\bar{\beta} - \hat{\beta}] + \mathrm{Var}[\hat{\beta}] + \mathrm{Cov}[\bar{\beta} - \hat{\beta}; \hat{\beta}] + \mathrm{Cov}[\bar{\beta} - \hat{\beta}; \hat{\beta}]^T$$

i.e.
$$\mathrm{Var}[\bar{\beta}] - \mathrm{Var}[\hat{\beta}] = \mathrm{Var}[\bar{\beta} - \hat{\beta}] + \mathrm{Cov}[\bar{\beta} - \hat{\beta}; \hat{\beta}] + \mathrm{Cov}[\bar{\beta} - \hat{\beta}; \hat{\beta}]^T$$

We need to prove that $\mathrm{Var}[\bar{\beta}] - \mathrm{Var}[\hat{\beta}]$ is positive semi-definite.

Gauss-Markov - Proof - 3/7

$$\mathrm{Var}[\bar{\beta}] - \mathrm{Var}[\hat{\beta}] = \mathrm{Var}[\bar{\beta} - \hat{\beta}] + \mathrm{Cov}[\bar{\beta} - \hat{\beta}; \hat{\beta}] + \mathrm{Cov}[\bar{\beta} - \hat{\beta}; \hat{\beta}]^T$$

We need to prove that $\mathrm{Var}[\bar{\beta}] - \mathrm{Var}[\hat{\beta}]$ is positive semi-definite.

As $\mathrm{Var}[\bar{\beta} - \hat{\beta}]$ is positive semi-definite, we just need to prove that:
$$\mathrm{Cov}[\bar{\beta} - \hat{\beta}; \hat{\beta}] = 0.$$

Gauss-Markov - Proof - 4/7

$\bar{\beta}$ is a linear unbiased estimator of $\beta$.
Hence: $\bar{\beta} = B y$,

and:
$$\mathbb{E}[\bar{\beta}] = \mathbb{E}[By] = B\,\mathbb{E}[y] = B\,\mathbb{E}[X\beta + \epsilon] = BX\beta = \beta$$

for all $\beta$, hence: $BX = I_p$.

Gauss-Markov - Proof - 5/7

Let's show that: $\mathrm{Cov}[\bar{\beta} - \hat{\beta}; \hat{\beta}] = 0$.

$$\mathrm{Cov}[\bar{\beta} - \hat{\beta}; \hat{\beta}] = \mathrm{Cov}[\bar{\beta}; \hat{\beta}] - \mathrm{Var}[\hat{\beta}] = \dots$$

Gauss-Markov - Proof - 6/7

$$\begin{aligned}
\mathrm{Cov}[\bar{\beta}; \hat{\beta}] &= \mathbb{E}[\bar{\beta} \hat{\beta}^T] - \mathbb{E}[\bar{\beta}]\, \mathbb{E}[\hat{\beta}]^T = \mathbb{E}\Big[ B y \big[(X^T X)^{-1} X^T y\big]^T \Big] - \beta\beta^T \\
&= B\, \mathbb{E}[y y^T]\, X (X^T X)^{-1} - \beta\beta^T
\end{aligned}$$

and: $$\mathbb{E}[y y^T] = \mathrm{Var}[y] + \mathbb{E}[y]\,\mathbb{E}[y]^T = \sigma^2 I_n + X\beta\beta^T X^T$$

hence: $$\mathrm{Cov}[\bar{\beta}; \hat{\beta}] = \sigma^2 B X (X^T X)^{-1} + B X \beta\beta^T X^T X (X^T X)^{-1} - \beta\beta^T$$

as $BX = I_p$: $$\mathrm{Cov}[\bar{\beta}; \hat{\beta}] = \sigma^2 (X^T X)^{-1} + \beta\beta^T - \beta\beta^T = \sigma^2 (X^T X)^{-1}$$

Gauss-Markov - Proof - 7/7

Finally:
$$\mathrm{Cov}[\bar{\beta} - \hat{\beta}; \hat{\beta}] = \mathrm{Cov}[\bar{\beta}; \hat{\beta}] - \mathrm{Var}[\hat{\beta}] = \sigma^2 (X^T X)^{-1} - \sigma^2 (X^T X)^{-1} = 0,$$

and:
$$\mathrm{Var}[\bar{\beta}] - \mathrm{Var}[\hat{\beta}] = \mathrm{Var}[\bar{\beta} - \hat{\beta}]$$

is positive semi-definite, i.e. $\mathrm{Var}[\hat{\beta}] \preceq \mathrm{Var}[\bar{\beta}]$.

Which ends the proof.
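
A simulation sketch (not in the original slides): compare OLS with another linear unbiased estimator, here a weighted least squares estimator with arbitrary weights, a choice made up for illustration. Both should look unbiased, but OLS should show the smaller empirical variances, as Gauss-Markov predicts.

set.seed(8)
n <- 50; beta <- c(1, 2)
X <- cbind(1, rnorm(n))
W <- diag(runif(n, 0.1, 10))                     ## arbitrary weights
A_ols <- solve(t(X) %*% X) %*% t(X)              ## OLS: beta_hat = A_ols y
A_wls <- solve(t(X) %*% W %*% X) %*% t(X) %*% W  ## another linear B with B X = I_p
sims <- replicate(5000, {
  y <- X %*% beta + rnorm(n)                     ## homoscedastic errors
  c(ols = A_ols %*% y, wls = A_wls %*% y)
})
rowMeans(sims)       ## all close to (1, 2, 1, 2): both estimators are unbiased
apply(sims, 1, var)  ## OLS variances (first two) smaller than WLS (last two)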

Residuals

Geometrical Interpretation

Residuals - Definitions

$$\hat{\epsilon} = y - \hat{y} = (I_n - P_X)\, y = P_X^{\perp} y = P_X^{\perp} (X\beta + \epsilon) = P_X^{\perp} \epsilon$$

where $P_X^{\perp} = I_n - P_X$ is the orthogonal projection matrix on $\mathcal{M}(X)^{\perp}$ (so that $P_X^{\perp} X\beta = 0$).

Residuals - Bias and Variance

$$\mathbb{E}[\hat{\epsilon}] = 0 \qquad \mathrm{Var}[\hat{\epsilon}] = \sigma^2 P_X^{\perp}$$

Residuals - Bias and Variance - Proof

Bias: $$\mathbb{E}[\hat{\epsilon}] = \mathbb{E}[P_X^{\perp} \epsilon] = \dots$$

$$\mathbb{E}[\hat{\epsilon}] = P_X^{\perp}\, \mathbb{E}[\epsilon] = 0.$$

Variance: $$\mathrm{Var}[\hat{\epsilon}] = \mathrm{Var}[P_X^{\perp} \epsilon] = \dots$$

$$\mathrm{Var}[\hat{\epsilon}] = P_X^{\perp}\, \mathrm{Var}[\epsilon]\, [P_X^{\perp}]^T = \sigma^2 P_X^{\perp} [P_X^{\perp}]^T = \dots$$

$$\mathrm{Var}[\hat{\epsilon}] = \sigma^2 P_X^{\perp} P_X^{\perp} = \sigma^2 P_X^{\perp}.$$

$\hat{y}$ - Bias and Variance

$$\mathbb{E}[\hat{y}] = X\beta \qquad \mathrm{Var}[\hat{y}] = \sigma^2 P_X$$

$\hat{y}$ - Bias and Variance - Proof

Bias: $$\mathbb{E}[\hat{y}] = \mathbb{E}[X\hat{\beta}] = \dots$$

$$\mathbb{E}[\hat{y}] = X\, \mathbb{E}[\hat{\beta}] = X\beta$$

Variance: $$\mathrm{Var}[\hat{y}] = \mathrm{Var}[X\hat{\beta}] = \dots$$

$$\mathrm{Var}[\hat{y}] = X\, \mathrm{Var}[\hat{\beta}]\, X^T = X \big[\sigma^2 (X^T X)^{-1}\big] X^T = \dots$$

$$\mathrm{Var}[\hat{y}] = \sigma^2 X (X^T X)^{-1} X^T = \sigma^2 P_X.$$

Covariance of $\hat{\epsilon}$ and $\hat{y}$

$$\mathrm{Cov}[\hat{\epsilon}; \hat{y}] = 0.$$

Covariance of $\hat{\epsilon}$ and $\hat{y}$ - Proof

$$\mathrm{Cov}[\hat{\epsilon}; \hat{y}] = \mathrm{Cov}[\hat{\epsilon}; y - \hat{\epsilon}] = \dots$$

$$\mathrm{Cov}[\hat{\epsilon}; \hat{y}] = \mathrm{Cov}[\hat{\epsilon}; y] - \mathrm{Var}[\hat{\epsilon}] = \dots$$

$$\mathrm{Cov}[\hat{\epsilon}; \hat{y}] = \mathrm{Cov}[P_X^{\perp} y; y] - \sigma^2 P_X^{\perp} = \dots$$

$$\mathrm{Cov}[\hat{\epsilon}; \hat{y}] = P_X^{\perp}\, \mathrm{Var}[y] - \sigma^2 P_X^{\perp} = \dots$$

$$\mathrm{Cov}[\hat{\epsilon}; \hat{y}] = P_X^{\perp} [\sigma^2 I_n] - \sigma^2 P_X^{\perp} = \dots$$

$$\mathrm{Cov}[\hat{\epsilon}; \hat{y}] = \sigma^2 P_X^{\perp} - \sigma^2 P_X^{\perp} = 0.$$
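
On real data this translates into numerical orthogonality of the residual and fitted vectors; a quick check on the Advertising fit (not in the original slides):

fit <- lm(sales ~ TV + radio + newspaper, data = ad)
sum(residuals(fit) * fitted(fit))  ## ~ 0 : hat(eps) is orthogonal to hat(y)
sum(residuals(fit))                ## ~ 0 : hat(eps) is orthogonal to the intercept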

Variance Estimation

Unbiased Variance Estimator

$$\hat{\sigma}^2 = \frac{1}{n - p} \|\hat{\epsilon}\|^2 = \frac{1}{n - p} RSS$$

is an unbiased estimator of $\sigma^2$.

Note: $p$ parameters $\Rightarrow$ $n - p$ degrees of freedom.

Unbiased Var Estimator - Proof - 1/2

$$\mathbb{E}[\hat{\sigma}^2] = \frac{1}{n - p} \mathbb{E}\big[\|\hat{\epsilon}\|^2\big] = \dots$$

Magic trick: $$\|\hat{\epsilon}\|^2 = \mathrm{Tr}\big(\|\hat{\epsilon}\|^2\big) = \mathrm{Tr}(\hat{\epsilon}^T \hat{\epsilon}) = \mathrm{Tr}(\hat{\epsilon}\, \hat{\epsilon}^T)$$

Unbiased Var Estimator - Proof - 2/2

We get:
$$\begin{aligned}
\mathbb{E}\big[\|\hat{\epsilon}\|^2\big] &= \mathbb{E}\big[\mathrm{Tr}(\hat{\epsilon}\, \hat{\epsilon}^T)\big] = \mathrm{Tr}\big(\mathbb{E}[\hat{\epsilon}\, \hat{\epsilon}^T]\big) && \text{[Trace is linear]} \\
&= \mathrm{Tr}\big(\mathrm{Var}[\hat{\epsilon}]\big) && \text{[since } \mathbb{E}[\hat{\epsilon}] = 0\text{]} \\
&= \mathrm{Tr}\big(\sigma^2 P_X^{\perp}\big) = \sigma^2\, \mathrm{Tr}\big(P_X^{\perp}\big) = \sigma^2 (n - p)
\end{aligned}$$

as $P_X^{\perp}$ is the orthogonal projection matrix on a space of dimension $n - p$.
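
A quick check (not in the original slides): $\hat{\sigma} = \sqrt{RSS / (n - p)}$ computed by hand matches the “Residual standard error” reported by summary() on the Advertising fit.

fit <- lm(sales ~ TV + radio + newspaper, data = ad)
rss <- sum(residuals(fit)^2)
c(by_hand = sqrt(rss / df.residual(fit)),
  from_summary = summary(fit)$sigma)  ## identical values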

Prediction

Predict a new point

  • We fitted $\hat{\beta}_1, \dots, \hat{\beta}_p$ on $((y_1, x_1), \dots, (y_n, x_n))$.
  • A new line of predictors $x_{n+1}$ comes along. How can we guess $y_{n+1}$?
  • We use the same model: $y_{n+1} = x_{n+1} \beta + \epsilon_{n+1}$
    with $\mathbb{E}[\epsilon_{n+1}] = 0$, $\mathrm{Var}[\epsilon_{n+1}] = \sigma^2$ and $\mathrm{Cov}[\epsilon_{n+1}; \epsilon_i] = 0$ for all $1 \le i \le n$.
  • We predict $y_{n+1}$ with: $\hat{y}_{n+1} = x_{n+1} \hat{\beta}$
  • Question: What is the error $\hat{\epsilon}_{n+1} = y_{n+1} - \hat{y}_{n+1}$?

Prediction Error

The prediction error $\hat{\epsilon}_{n+1} = y_{n+1} - \hat{y}_{n+1}$ is such that:
$$\mathbb{E}[\hat{\epsilon}_{n+1}] = 0 \qquad \mathrm{Var}[\hat{\epsilon}_{n+1}] = \sigma^2 \big(1 + x_{n+1} (X^T X)^{-1} (x_{n+1})^T\big)$$

Remarks:

  • $x_{n+1}$ is a line-vector of dimension $1 \times p$.

  • $X$ is a matrix of dimension $n \times p$.

  • $X^T X$ is a matrix of dimension $p \times p$.

  • $x_{n+1} (X^T X)^{-1} (x_{n+1})^T$ is a scalar.

Prediction Error - Proof 1/2

$$\mathbb{E}[\hat{\epsilon}_{n+1}] = \mathbb{E}[y_{n+1} - x_{n+1} \hat{\beta}] = \dots$$

$$\mathbb{E}[\hat{\epsilon}_{n+1}] = \mathbb{E}[y_{n+1} - x_{n+1} \hat{\beta}] = \mathbb{E}[y_{n+1}] - x_{n+1} \mathbb{E}[\hat{\beta}] = x_{n+1}\beta - x_{n+1}\beta = 0$$

Prediction Error - Proof 2/2

Because $\hat{y}_{n+1}$ does not depend on $\epsilon_{n+1}$:

$$\mathrm{Var}[\hat{\epsilon}_{n+1}] = \mathrm{Var}[y_{n+1} - \hat{y}_{n+1}] = \mathrm{Var}[y_{n+1}] + \mathrm{Var}[\hat{y}_{n+1}]$$

Hence:
$$\mathrm{Var}[\hat{\epsilon}_{n+1}] = \sigma^2 + \mathrm{Var}[x_{n+1} \hat{\beta}] = \sigma^2 + x_{n+1}\, \mathrm{Var}[\hat{\beta}]\, (x_{n+1})^T = \sigma^2 + x_{n+1}\, \sigma^2 (X^T X)^{-1} (x_{n+1})^T$$
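
A sketch of this computation in R (not in the original slides); the new budget values are made up for illustration, and the unknown $\sigma^2$ is replaced by its estimate $\hat{\sigma}^2$:

fit <- lm(sales ~ TV + radio + newspaper, data = ad)
X <- model.matrix(fit)
x_new <- c(1, TV = 100, radio = 20, newspaper = 30)  ## hypothetical new line of predictors
sigma2_hat <- summary(fit)$sigma^2
pred <- sum(x_new * coef(fit))                       ## hat(y)_{n+1} = x_{n+1} hat(beta)
pred_var <- sigma2_hat * (1 + drop(x_new %*% solve(t(X) %*% X) %*% x_new))
c(prediction = pred, prediction_sd = sqrt(pred_var))
predict(fit, newdata = data.frame(TV = 100, radio = 20, newspaper = 30))  ## same point prediction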

Geometrical Interpretation

Variance Decomposition

Assuming $\mathbb{1}$ is in the model:

Using the Pythagorean theorem:

$$\|y - \bar{y}\mathbb{1}\|^2 = \|\hat{y} - \bar{y}\mathbb{1} + y - \hat{y}\|^2 = \|\hat{y} - \bar{y}\mathbb{1} + \hat{\epsilon}\|^2 = \|\hat{y} - \bar{y}\mathbb{1}\|^2 + \|\hat{\epsilon}\|^2$$

since $\hat{y} - \bar{y}\mathbb{1} \in \mathcal{M}(X)$ and $\hat{\epsilon}$ is orthogonal to $\mathcal{M}(X)$.

Variance Decomposition

$$\|y - \bar{y}\mathbb{1}\|^2 = \|\hat{y} - \bar{y}\mathbb{1}\|^2 + \|\hat{\epsilon}\|^2$$
$$TSS = ESS + RSS$$

  • TSS: Total Sum of Squares.
    Amount of variability in the response $y$ before the regression is performed.
  • RSS: Residual Sum of Squares.
    Amount of variability that is left after performing the regression.
  • ESS: Explained Sum of Squares.
    Amount of variability that is explained by the regression.
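
A check of the decomposition on the Advertising fit (not in the original slides; the model contains an intercept, as required):

fit <- lm(sales ~ TV + radio + newspaper, data = ad)
y <- ad$sales
TSS <- sum((y - mean(y))^2)
RSS <- sum(residuals(fit)^2)
ESS <- sum((fitted(fit) - mean(y))^2)
c(TSS = TSS, ESS_plus_RSS = ESS + RSS)  ## equal, as stated above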

$R^2$ Statistic

$$R^2 = \frac{ESS}{TSS} = \frac{\|\hat{y} - \bar{y}\mathbb{1}\|^2}{\|y - \bar{y}\mathbb{1}\|^2} = 1 - \frac{\|\hat{\epsilon}\|^2}{\|y - \bar{y}\mathbb{1}\|^2} = 1 - \frac{RSS}{TSS}$$

  • $R^2$ is the proportion of variability in $y$ that can be explained by the regression.
  • $0 \le R^2 \le 1$
  • $R^2 = 1$: the regression is perfect.
    $y = \hat{y}$, and $y$ is in $\mathcal{M}(X)$.
  • $R^2 = 0$: the regression is useless.
    $\hat{y} = \bar{y}\mathbb{1}$: the empirical mean is sufficient.

$R^2$ Statistic

$$R^2 = \frac{ESS}{TSS} = \frac{\|\hat{y} - \bar{y}\mathbb{1}\|^2}{\|y - \bar{y}\mathbb{1}\|^2} = \cos^2(\theta)$$

$R^2$ Statistic

$$\begin{aligned}
TSS &= \|y - \bar{y}\mathbb{1}\|^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 && \text{“total inertia”} \\
RSS &= \|y - \hat{y}\|^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 && \text{“intra-class inertia”} \\
ESS &= \|\hat{y} - \bar{y}\mathbb{1}\|^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 && \text{“inter-class inertia”}
\end{aligned}$$

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$

$R^2$ Statistic - Issue - 1/3

$$y_i = -2 + 3 x_{i1} - x_{i2} + \epsilon_i$$

set.seed(12890926)

## Predictors
n <- 100
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = 0, max = 4)

## Noise
eps <- rnorm(n, mean = 0, sd = 5)

## Model sim
beta_0 <- -2; beta_1 <- 3; beta_2 <- -1
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

$R^2$ Statistic - Issue - 2/3

$$y_i = -2 + 3 x_{i1} - x_{i2} + \epsilon_i$$

fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)$r.squared
## [1] 0.3337178
## unrelated noise variable x_3
x_3 <- runif(n, min = -4, max = 0)
## Fit
fit2 <- lm(y_sim ~ x_1 + x_2 + x_3)
summary(fit2)$r.squared
## [1] 0.3404782

The more predictors there are, the better $R^2$ gets!

$R^2$ Statistic - Issue - 3/3

$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\|\hat{\epsilon}\|^2}{\|y - \bar{y}\mathbb{1}\|^2} = 1 - \frac{\|y - X\hat{\beta}\|^2}{\|y - \bar{y}\mathbb{1}\|^2}$$

If $X' = (X \ \mathbf{x}_{p+1})$ has one more column, then:
$$\|y - X'\hat{\beta}'\|^2 = \min_{\beta' \in \mathbb{R}^{p+1}} \|y - X'\beta'\|^2 = \min_{\beta \in \mathbb{R}^p,\ \beta_{p+1} \in \mathbb{R}} \|y - X\beta - \mathbf{x}_{p+1}\beta_{p+1}\|^2 \le \min_{\beta \in \mathbb{R}^p} \|y - X\beta\|^2$$

Hence: $\|y - X'\hat{\beta}'\|^2 \le \|y - X\hat{\beta}\|^2$, and the $R^2$ can only increase.

Adjusted $R^2$ Statistic

$$R^2_a = 1 - \frac{RSS / (n - p)}{TSS / (n - 1)} = 1 - \frac{(n - 1)\, RSS}{(n - p)\, TSS} = 1 - \frac{n - 1}{n - p}\,(1 - R^2)$$

  • Penalizes for the number of predictors $p$.

  • Attention: here $p$ includes the intercept ($p$ is the rank of the matrix $X$).
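
A check by hand on the Advertising fit (not in the original slides); here $n = 200$ and $p = 4$, counting the intercept:

fit <- lm(sales ~ TV + radio + newspaper, data = ad)
n <- nobs(fit); p <- length(coef(fit))
r2 <- 1 - sum(residuals(fit)^2) / sum((ad$sales - mean(ad$sales))^2)
c(by_hand = 1 - (n - 1) / (n - p) * (1 - r2),
  from_summary = summary(fit)$adj.r.squared)  ## identical values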

Adjusted $R^2$ Statistic

$$y_i = -2 + 3 x_{i1} - x_{i2} + \epsilon_i$$

fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)$adj.r.squared
## [1] 0.31998
## unrelated noise variable x_3
x_3 <- runif(n, min = -4, max = 0)
## Fit
fit2 <- lm(y_sim ~ x_1 + x_2 + x_3)
summary(fit2)$adj.r.squared
## [1] 0.3133534

Advertising

Advertising

fit_all <- lm(sales ~ TV + radio + newspaper, data = ad)
summary(fit_all)
## 
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

To be continued

Questions

  • Confidence intervals for $(\hat{\beta}, \hat{\sigma}^2)$?

  • Can we test $\beta = 0$ (i.e. no linear trend)?

  • Assumptions on the moments are not enough.
    We need assumptions on the specific distribution of the $\epsilon_i$.
  • Most common assumption: the $\epsilon_i$ are Gaussian.