Paul Bastide - Ibrahim Bouzalmat
31/01/2024
```r
library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")
head(ad)
```

```
##      TV radio newspaper sales
## 1 230.1  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
## 4 151.5  41.3      58.5  18.5
## 5 180.8  10.8      58.4  12.9
## 6   8.7  48.9      75.0   7.2
```
- `TV`, `radio`, `newspaper`: advertising budgets (thousands of $)
- `sales`: number of sales (thousands of units)

```r
attach(ad)
par(mfrow = c(1, 3))
plot(TV, sales); plot(radio, sales); plot(newspaper, sales)
```
Simple linear regressions, one for each predictor:
$$\text{sales} \approx \beta_0^{TV} + \beta_1^{TV} \, \text{TV}$$
$$\text{sales} \approx \beta_0^{radio} + \beta_1^{radio} \, \text{radio}$$
$$\text{sales} \approx \beta_0^{newspaper} + \beta_1^{newspaper} \, \text{newspaper}$$

Multiple linear regression, with all predictors at once:
$$\text{sales} \approx \beta_0 + \beta_1 \, \text{TV} + \beta_2 \, \text{radio} + \beta_3 \, \text{newspaper}$$
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \forall \, 1 \leq i \leq n$$

- $y$: response (e.g. `sales`)
- $x$: predictor (e.g. `TV`)

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i, \quad \forall \, 1 \leq i \leq n$$
- $y$: response (e.g. `sales`)
- $x_1, \dots, x_p$: predictors (e.g. `TV`, `radio` and `newspaper`, i.e. $p = 3$)

In vector form, the simple regression reads:
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \beta_0 \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} + \beta_1 \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i, \quad \forall \, 1 \leq i \leq n$$

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \beta_0 \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} + \beta_1 \begin{pmatrix} x_{11} \\ \vdots \\ x_{n1} \end{pmatrix} + \cdots + \beta_p \begin{pmatrix} x_{1p} \\ \vdots \\ x_{np} \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \dots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

$$y = X\beta + \epsilon$$
$$y_i = \beta_0 + \sum_{k=1}^{p} \beta_k x_{ik} + \epsilon_i, \quad \forall \, 1 \leq i \leq n,$$
i.e. $y = X\beta + \epsilon$.
With the intercept, the model is written as:
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \dots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$
Without loss of generality, we write:
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & \dots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \dots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$
We use this convention in the rest of the text.
$$y = X\beta + \epsilon$$
The column and row vectors of $X$ are written as:
$$X = \begin{pmatrix} x_1 & \cdots & x_p \end{pmatrix} = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix},$$
where $x_k$ with $1 \leq k \leq p$ denotes the $k$-th column of $X$ (a vector of $\mathbb{R}^n$), and $x_i$ with $1 \leq i \leq n$ denotes the $i$-th row (a row vector of dimension $1 \times p$); the index letter makes clear which one is meant.

For $1 \leq i \leq n$:
$$y_i = [X\beta]_i + \epsilon_i = x_i \beta + \epsilon_i$$

And:
$$y = \sum_{k=1}^{p} \beta_k x_k + \epsilon$$
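To make the matrix notation concrete, here is a minimal sketch (assuming the Advertising data `ad` loaded above) that builds the response vector and the design matrix with `model.matrix()`; the first column is the column of ones for the intercept.

```r
## Response vector and design matrix for the Advertising data
X <- model.matrix(~ TV + radio + newspaper, data = ad)  # n x p, first column = intercept
y <- ad$sales
dim(X)      # 200 rows, 4 columns (intercept, TV, radio, newspaper)
head(X, 3)
```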
For the model $y = X\beta + \epsilon$ with $\operatorname{rank}(X) = p$, there are $n - p$ degrees of freedom.
In a simple regression, we have:
$$y = \beta_0 \mathbb{1} + \beta_1 x + \epsilon = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$
There are $n - 2$ degrees of freedom, but only one "explanatory variable" $x$.
If we write:
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \dots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix},$$
there are $n - (p + 1) = n - p - 1$ degrees of freedom, but only $p$ "explanatory variables" $x_1, \dots, x_p$.
If we write:
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & \dots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \dots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix},$$
there are $n - p$ degrees of freedom. If $x_1 = \mathbb{1}$ is the intercept, then there are $p - 1$ "explanatory variables" $x_2, \dots, x_p$.
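As a quick check of these conventions in R (a sketch, assuming `ad` is loaded), `df.residual()` returns $n$ minus the number of estimated coefficients, intercept included:

```r
fit_simple <- lm(sales ~ TV, data = ad)                 # intercept + one predictor
c(n = nrow(ad), df_residual = df.residual(fit_simple))  # 200 and 200 - 2 = 198
```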
Let $x = (x_1, \dots, x_n)^T$ and $y = (y_1, \dots, y_n)^T$ be two vectors of $\mathbb{R}^n$.

The squared Euclidean norm of $x$ is given by:
$$\|x\|^2 = \sum_{i=1}^{n} x_i^2 = x^T x$$

The Euclidean scalar product between $x$ and $y$ is given by:
$$\langle x ; y \rangle = \sum_{i=1}^{n} x_i y_i = x^T y = y^T x$$
We have the following formulas:
- $\|x + y\|^2 = \|x\|^2 + \|y\|^2 + 2\langle x ; y \rangle$
- $\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\langle x ; y \rangle$
- $\langle x ; y \rangle = \frac{1}{4}\left(\|x + y\|^2 - \|x - y\|^2\right)$
- $|\langle x ; y \rangle| \leq \|x\| \, \|y\|$ (Cauchy-Schwarz)
- $\|x + y\| \leq \|x\| + \|y\|$ (Triangle Inequality)
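These identities are easy to check numerically; a small sketch on random vectors:

```r
set.seed(42)
x <- rnorm(10); y <- rnorm(10)
sum(x * y)                                               # <x ; y>
(sum((x + y)^2) - sum((x - y)^2)) / 4                    # polarization identity: same value
abs(sum(x * y)) <= sqrt(sum(x^2) * sum(y^2))             # Cauchy-Schwarz: TRUE
sqrt(sum((x + y)^2)) <= sqrt(sum(x^2)) + sqrt(sum(y^2))  # triangle inequality: TRUE
```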
Let $y \in \mathbb{R}^n$, let $(x_1, \dots, x_p) \in \mathbb{R}^n \times \cdots \times \mathbb{R}^n$ be $p$ linearly independent vectors, and let $\mathcal{M}(X) = \operatorname{span}\{x_1, \dots, x_p\}$.

The orthogonal projection of $y$ on $\mathcal{M}(X)$ is defined as the unique vector $\operatorname{Proj}_{\mathcal{M}(X)}(y) \in \mathcal{M}(X)$ such that:
$$\operatorname{Proj}_{\mathcal{M}(X)}(y) = \underset{\substack{\tilde{y} = X\beta \\ \beta \in \mathbb{R}^p}}{\operatorname{argmin}} \; \|y - \tilde{y}\|^2.$$

It is also the unique vector such that $y - \operatorname{Proj}_{\mathcal{M}(X)}(y)$ is orthogonal to $\mathcal{M}(X)$, i.e. orthogonal to $x_k$ for all $1 \leq k \leq p$.

There is a unique matrix $P_X$ such that $\operatorname{Proj}_{\mathcal{M}(X)}(y) = P_X y$.
Definition:

An $n \times n$ matrix $U$ is said to be orthogonal if $U U^T = I_n$. The columns of $U$ then form an orthonormal basis of $\mathbb{R}^n$.
Property:
If $P$ is an orthogonal projection matrix, then there exist an orthogonal matrix $U$ and a diagonal matrix $\Delta$ (with diagonal entries equal to 0 or 1) such that:
$$P = U \Delta U^T$$
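As an illustration (a sketch, not from the slides): the projection onto $\operatorname{span}\{\mathbb{1}\}$ in $\mathbb{R}^n$ is the matrix with all entries $1/n$, and its eigen-decomposition has exactly the announced form.

```r
n <- 4
P <- matrix(1 / n, n, n)              # orthogonal projection onto span{1} in R^4
e <- eigen(P, symmetric = TRUE)
round(e$values, 10)                   # diagonal of Delta: 1, 0, 0, 0
max(abs(P - e$vectors %*% diag(e$values) %*% t(e$vectors)))  # P = U Delta U^T, up to rounding
```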
The OLS estimator $\hat{\beta}$ is given by:
$$\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}} \left\{ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \right\}$$
Goal: minimize the squared errors between the observations $y_i$ and the linear predictions $\sum_{j=1}^{p} \beta_j x_{ij}$:
$$\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}} \left\{ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \right\} = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}} \left\{ \sum_{i=1}^{n} \big( y_i - (X\beta)_i \big)^2 \right\} = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}} \; \|y - X\beta\|^2$$

hence

$$\hat{y} = X\hat{\beta} = \underset{\substack{\tilde{y} = X\beta \\ \beta \in \mathbb{R}^p}}{\operatorname{argmin}} \; \|y - \tilde{y}\|^2 = \operatorname{Proj}_{\mathcal{M}(X)}(y) = P_X y$$

$\hat{y}$ is the orthogonal projection of $y$ on $\mathcal{M}(X) = \operatorname{span}\{x_1, \dots, x_p\}$, and $P_X$ is the orthogonal projection matrix on $\mathcal{M}(X)$.
$$y = P_X y + (I_n - P_X) y,$$
with $P_X y$ and $(I_n - P_X) y$ orthogonal.

$$\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}} \; \|y - X\beta\|^2 \quad \text{and} \quad \hat{y} = X\hat{\beta} = P_X y$$
The OLS estimator is given by:
$$\hat{\beta} = (X^T X)^{-1} X^T y$$

And the matrix of orthogonal projection on $\mathcal{M}(X)$ is:
$$P_X = X (X^T X)^{-1} X^T$$
$\hat{y} = X\hat{\beta}$ is the orthogonal projection of $y$ on $\mathcal{M}(X)$. It is the only vector such that $y - X\hat{\beta}$ is orthogonal to $\mathcal{M}(X)$, i.e. such that $y - X\hat{\beta}$ is orthogonal to all the $x_k$:
$$\langle x_k ; y - X\hat{\beta} \rangle = x_k^T (y - X\hat{\beta}) = 0, \quad 1 \leq k \leq p,$$
i.e.
$$\begin{pmatrix} x_1 & \dots & x_p \end{pmatrix}^T (y - X\hat{\beta}) = \begin{pmatrix} x_1^T (y - X\hat{\beta}) \\ \vdots \\ x_p^T (y - X\hat{\beta}) \end{pmatrix} = 0,$$
and
$$X^T (y - X\hat{\beta}) = 0.$$

Hence:
$$X^T X \hat{\beta} = X^T y.$$

As $\operatorname{rank}(X) = p$, $X^T X$ is invertible, and:
$$\hat{\beta} = (X^T X)^{-1} X^T y$$

Then:
$$\hat{y} = P_X y = X\hat{\beta} = X (X^T X)^{-1} X^T y$$

As this equality is true for any $y$, we get:
$$P_X = X (X^T X)^{-1} X^T.$$
The minimum of the quadratic function $S$ is obtained at the point where the gradient is zero:
$$S(\beta) = \|y - X\beta\|^2 = \beta^T X^T X \beta - 2 \beta^T X^T y + y^T y$$
$$\nabla S(\hat{\beta}) = 2 X^T X \hat{\beta} - 2 X^T y = 0$$
$$X^T X \hat{\beta} = X^T y$$
Same conclusion as before.
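A minimal numerical check (a sketch, assuming `ad` is loaded): compute $\hat{\beta}$ from the closed-form formula and compare with `lm()`.

```r
X <- model.matrix(~ TV + radio + newspaper, data = ad)
y <- ad$sales
beta_hat <- solve(crossprod(X), crossprod(X, y))        # solves X^T X beta = X^T y
max(abs(crossprod(X, y - X %*% beta_hat)))              # X^T (y - X beta_hat) = 0, up to rounding
cbind(closed_form = drop(beta_hat),
      lm = coef(lm(sales ~ TV + radio + newspaper, data = ad)))
```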
Let $z$ be a random vector of dimension $n$.

The expectation of $z$ is an $n$ vector:
$$\mathbb{E}[z] = (\mathbb{E}[z_1], \cdots, \mathbb{E}[z_n])^T$$

The variance of $z$ is an $n \times n$ matrix:
$$\operatorname{Var}[z] = \big[\operatorname{Cov}[z_i, z_j]\big]_{1 \leq i, j \leq n} = \mathbb{E}\big[(z - \mathbb{E}[z])(z - \mathbb{E}[z])^T\big] = \mathbb{E}[z z^T] - \mathbb{E}[z]\mathbb{E}[z]^T$$
Let $A$ be an $m \times n$ deterministic matrix, and $b$ be an $m$ vector. Then:

- $\mathbb{E}[Az + b] = A\,\mathbb{E}[z] + b$
- $\operatorname{Var}[Az + b] = A \operatorname{Var}[z] A^T$.
Attention: Transpose is on the right (variance is a matrix).
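A quick Monte Carlo sanity check of these two rules (a sketch with an arbitrary $A$ and $b$, and $z \sim \mathcal{N}(0, I_3)$):

```r
set.seed(2)
A <- matrix(c(1, 2, 0, 1, -1, 3), nrow = 2)   # a 2 x 3 matrix
b <- c(1, -1)
z <- matrix(rnorm(3 * 1e5), nrow = 3)         # 1e5 draws of z, with E[z] = 0, Var[z] = I_3
w <- A %*% z + b                              # b is recycled column-wise
rowMeans(w)                                   # close to A E[z] + b = b
var(t(w))                                     # close to A Var[z] A^T = A A^T
A %*% t(A)
```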
The OLS estimator $\hat{\beta} = (X^T X)^{-1} X^T y$ is unbiased:
$$\mathbb{E}[\hat{\beta}] = \mathbb{E}\big[(X^T X)^{-1} X^T y\big] = (X^T X)^{-1} X^T \mathbb{E}[y]$$
And:
$$\mathbb{E}[y] = \mathbb{E}[X\beta + \epsilon] = X\beta + \mathbb{E}[\epsilon] = X\beta$$
Hence:
$$\mathbb{E}[\hat{\beta}] = (X^T X)^{-1} X^T X \beta = \beta.$$
The OLS estimator $\hat{\beta} = (X^T X)^{-1} X^T y$ has variance $\sigma^2 (X^T X)^{-1}$:
$$\operatorname{Var}[\hat{\beta}] = \operatorname{Var}\big[(X^T X)^{-1} X^T y\big] = (X^T X)^{-1} X^T \operatorname{Var}[y] \big[(X^T X)^{-1} X^T\big]^T = (X^T X)^{-1} X^T \operatorname{Var}[y]\, X (X^T X)^{-1}$$
And:
$$\operatorname{Var}[y] = \operatorname{Var}[X\beta + \epsilon] = \operatorname{Var}[\epsilon] = \sigma^2 I_n$$
Hence:
$$\operatorname{Var}[\hat{\beta}] = (X^T X)^{-1} X^T [\sigma^2 I_n] X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1} (X^T X)(X^T X)^{-1} = \sigma^2 (X^T X)^{-1}.$$
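These two formulas can be checked by simulation; a sketch with a fixed design and known $\beta$ and $\sigma$ (values chosen arbitrarily here):

```r
set.seed(1)
n <- 50; sigma <- 2
X <- cbind(1, runif(n), runif(n))     # fixed design, intercept included
beta <- c(-2, 3, -1)
XtX_inv <- solve(crossprod(X))
beta_hats <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  drop(XtX_inv %*% crossprod(X, y))
})
rowMeans(beta_hats)                              # close to beta: E[beta_hat] = beta
max(abs(var(t(beta_hats)) - sigma^2 * XtX_inv))  # small: Var[beta_hat] = sigma^2 (X^T X)^{-1}
```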
Definition: Let $S_1$ and $S_2$ be two real $n \times n$ symmetric matrices. We say that $S_1$ is smaller than $S_2$, and write $S_1 \leq S_2$, if $S_2 - S_1$ is positive semi-definite, or, equivalently:
$$z^T S_1 z \leq z^T S_2 z \quad \forall z \in \mathbb{R}^n.$$
The OLS estimator $\hat{\beta}$ is the BLUE (Best Linear Unbiased Estimator): it is the linear unbiased estimator with minimal variance.
Remark: The OLS estimator $\hat{\beta} = (X^T X)^{-1} X^T y = A y$ is indeed a linear function of $y$, with $A = (X^T X)^{-1} X^T$.
Let $u$ and $v$ be two random vectors of size $n$. Then:
$$\operatorname{Var}[u + v] = \operatorname{Var}[u] + \operatorname{Var}[v] + \operatorname{Cov}[u ; v] + \operatorname{Cov}[v ; u],$$
where:
$$\operatorname{Cov}[u ; v] = \mathbb{E}[u v^T] - \mathbb{E}[u]\mathbb{E}[v]^T = \operatorname{Cov}[v ; u]^T$$
Let $\bar{\beta}$ be a linear unbiased estimator of $\beta$. Let's show that $\operatorname{Var}[\hat{\beta}] \leq \operatorname{Var}[\bar{\beta}]$.
$$\operatorname{Var}[\bar{\beta}] = \operatorname{Var}[\bar{\beta} - \hat{\beta} + \hat{\beta}] = \operatorname{Var}[\bar{\beta} - \hat{\beta}] + \operatorname{Var}[\hat{\beta}] + \operatorname{Cov}[\bar{\beta} - \hat{\beta} ; \hat{\beta}] + \operatorname{Cov}[\bar{\beta} - \hat{\beta} ; \hat{\beta}]^T,$$
i.e.
$$\operatorname{Var}[\bar{\beta}] - \operatorname{Var}[\hat{\beta}] = \operatorname{Var}[\bar{\beta} - \hat{\beta}] + \operatorname{Cov}[\bar{\beta} - \hat{\beta} ; \hat{\beta}] + \operatorname{Cov}[\bar{\beta} - \hat{\beta} ; \hat{\beta}]^T$$

We need to prove that $\operatorname{Var}[\bar{\beta}] - \operatorname{Var}[\hat{\beta}]$ is positive semi-definite. As $\operatorname{Var}[\bar{\beta} - \hat{\beta}]$ is positive semi-definite, we just need to prove:
$$\operatorname{Cov}[\bar{\beta} - \hat{\beta} ; \hat{\beta}] = 0.$$
$\bar{\beta}$ is a linear unbiased estimator of $\beta$. Hence $\bar{\beta} = B y$ for some deterministic matrix $B$, and:
$$\mathbb{E}[\bar{\beta}] = \mathbb{E}[B y] = B \mathbb{E}[y] = B \mathbb{E}[X\beta + \epsilon] = B X \beta = \beta$$
As this holds for all $\beta$, we get $B X = I_p$.
Let's show that $\operatorname{Cov}[\bar{\beta} - \hat{\beta} ; \hat{\beta}] = 0$. We have:
$$\operatorname{Cov}[\bar{\beta} - \hat{\beta} ; \hat{\beta}] = \operatorname{Cov}[\bar{\beta} ; \hat{\beta}] - \operatorname{Var}[\hat{\beta}]$$
$$\operatorname{Cov}[\bar{\beta} ; \hat{\beta}] = \mathbb{E}[\bar{\beta}\hat{\beta}^T] - \mathbb{E}[\bar{\beta}]\mathbb{E}[\hat{\beta}]^T = \mathbb{E}\big[B y [(X^T X)^{-1} X^T y]^T\big] - \beta\beta^T = B\,\mathbb{E}[y y^T]\,X (X^T X)^{-1} - \beta\beta^T$$
and:
$$\mathbb{E}[y y^T] = \operatorname{Var}[y] + \mathbb{E}[y]\mathbb{E}[y]^T = \sigma^2 I_n + X\beta\beta^T X^T,$$
hence:
$$\operatorname{Cov}[\bar{\beta} ; \hat{\beta}] = \sigma^2 B X (X^T X)^{-1} + B X \beta\beta^T X^T X (X^T X)^{-1} - \beta\beta^T$$
As $B X = I_p$:
$$\operatorname{Cov}[\bar{\beta} ; \hat{\beta}] = \sigma^2 (X^T X)^{-1} + \beta\beta^T - \beta\beta^T = \sigma^2 (X^T X)^{-1}$$
Finally:
$$\operatorname{Cov}[\bar{\beta} - \hat{\beta} ; \hat{\beta}] = \operatorname{Cov}[\bar{\beta} ; \hat{\beta}] - \operatorname{Var}[\hat{\beta}] = \sigma^2 (X^T X)^{-1} - \sigma^2 (X^T X)^{-1} = 0.$$
and:
$$\operatorname{Var}[\bar{\beta}] - \operatorname{Var}[\hat{\beta}] = \operatorname{Var}[\bar{\beta} - \hat{\beta}]$$
is positive semi-definite, i.e. $\operatorname{Var}[\bar{\beta}] \geq \operatorname{Var}[\hat{\beta}]$.
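To illustrate the Gauss-Markov property, here is a small simulation (a sketch, not from the slides) comparing the OLS slope of a simple regression with another linear unbiased estimator of the slope, the two-point estimator $(y_n - y_1)/(x_n - x_1)$: both are unbiased, but OLS has the smaller variance.

```r
set.seed(3)
n <- 30
x <- seq(1, 10, length.out = n)
sims <- replicate(5000, {
  y <- 1 + 2 * x + rnorm(n)                       # true intercept 1, slope 2, sigma = 1
  c(ols       = unname(coef(lm(y ~ x))[2]),       # OLS slope
    two_point = (y[n] - y[1]) / (x[n] - x[1]))    # another linear unbiased estimator
})
rowMeans(sims)        # both close to 2 (unbiased)
apply(sims, 1, var)   # the OLS slope has the smaller variance
```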
$$\hat{\epsilon} = y - \hat{y} = (I_n - P_X) y = P_{X^\perp} y = P_{X^\perp}(X\beta + \epsilon) = P_{X^\perp}\epsilon$$

$$\mathbb{E}[\hat{\epsilon}] = 0 \qquad \operatorname{Var}[\hat{\epsilon}] = \sigma^2 P_{X^\perp}$$

Bias:
$$\mathbb{E}[\hat{\epsilon}] = \mathbb{E}[P_{X^\perp}\epsilon] = P_{X^\perp}\mathbb{E}[\epsilon] = 0.$$

Variance:
$$\operatorname{Var}[\hat{\epsilon}] = \operatorname{Var}[P_{X^\perp}\epsilon] = P_{X^\perp}\operatorname{Var}[\epsilon]\,[P_{X^\perp}]^T = \sigma^2 P_{X^\perp}[P_{X^\perp}]^T = \sigma^2 P_{X^\perp} P_{X^\perp} = \sigma^2 P_{X^\perp}.$$
$$\mathbb{E}[\hat{y}] = X\beta \qquad \operatorname{Var}[\hat{y}] = \sigma^2 P_X$$

Bias:
$$\mathbb{E}[\hat{y}] = \mathbb{E}[X\hat{\beta}] = X\,\mathbb{E}[\hat{\beta}] = X\beta$$

Variance:
$$\operatorname{Var}[\hat{y}] = \operatorname{Var}[X\hat{\beta}] = X\operatorname{Var}[\hat{\beta}]X^T = X[\sigma^2(X^T X)^{-1}]X^T = \sigma^2 X(X^T X)^{-1}X^T = \sigma^2 P_X.$$
$$\operatorname{Cov}[\hat{\epsilon} ; \hat{y}] = 0.$$

$$\operatorname{Cov}[\hat{\epsilon} ; \hat{y}] = \operatorname{Cov}[\hat{\epsilon} ; y - \hat{\epsilon}] = \operatorname{Cov}[\hat{\epsilon} ; y] - \operatorname{Var}[\hat{\epsilon}] = \operatorname{Cov}[P_{X^\perp} y ; y] - \sigma^2 P_{X^\perp}$$
$$= P_{X^\perp}\operatorname{Var}[y] - \sigma^2 P_{X^\perp} = P_{X^\perp}[\sigma^2 I_n] - \sigma^2 P_{X^\perp} = \sigma^2 P_{X^\perp} - \sigma^2 P_{X^\perp} = 0.$$
$$\hat{\sigma}^2 = \frac{1}{n-p}\|\hat{\epsilon}\|^2 = \frac{1}{n-p}\,\mathrm{RSS}$$
is an unbiased estimator of $\sigma^2$.

Note: $p$ parameters → $n - p$ degrees of freedom.
$$\mathbb{E}[\hat{\sigma}^2] = \frac{1}{n-p}\mathbb{E}\big[\|\hat{\epsilon}\|^2\big]$$

Magic trick:
$$\|\hat{\epsilon}\|^2 = \operatorname{Tr}(\|\hat{\epsilon}\|^2) = \operatorname{Tr}(\hat{\epsilon}^T\hat{\epsilon}) = \operatorname{Tr}(\hat{\epsilon}\hat{\epsilon}^T)$$

We get:
$$\mathbb{E}\big[\|\hat{\epsilon}\|^2\big] = \mathbb{E}\big[\operatorname{Tr}(\hat{\epsilon}\hat{\epsilon}^T)\big] = \operatorname{Tr}\big(\mathbb{E}[\hat{\epsilon}\hat{\epsilon}^T]\big) \quad \text{[Trace is linear]}$$
$$= \operatorname{Tr}\big(\operatorname{Var}[\hat{\epsilon}]\big) \quad [\mathbb{E}[\hat{\epsilon}] = 0]$$
$$= \operatorname{Tr}(\sigma^2 P_{X^\perp}) = \sigma^2\operatorname{Tr}(P_{X^\perp}) = \sigma^2(n - p),$$
as $P_{X^\perp}$ is the projection matrix on a space of dimension $n - p$. Hence $\mathbb{E}[\hat{\sigma}^2] = \sigma^2$.
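This $\hat{\sigma}^2 = \mathrm{RSS}/(n - p)$ is, up to a square root, what `summary.lm()` reports as the residual standard error; a quick check on the Advertising data (a sketch, assuming `ad` is loaded):

```r
fit_ad <- lm(sales ~ TV + radio + newspaper, data = ad)
rss <- sum(residuals(fit_ad)^2)
n <- nrow(ad); p <- length(coef(fit_ad))   # p includes the intercept
c(by_hand = sqrt(rss / (n - p)), summary = summary(fit_ad)$sigma)  # both 1.686
```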
The prediction error $\hat{\epsilon}_{n+1} = y_{n+1} - \hat{y}_{n+1}$ is such that:
$$\mathbb{E}[\hat{\epsilon}_{n+1}] = 0 \qquad \operatorname{Var}[\hat{\epsilon}_{n+1}] = \sigma^2\big(1 + x_{n+1}(X^T X)^{-1}(x_{n+1})^T\big)$$

Remarks:

- $x_{n+1}$ is a row vector of dimension $1 \times p$.
- $X$ is a matrix of dimension $n \times p$.
- $X^T X$ is a matrix of dimension $p \times p$.
- $x_{n+1}(X^T X)^{-1}(x_{n+1})^T$ is a scalar.
Bias:
$$\mathbb{E}[\hat{\epsilon}_{n+1}] = \mathbb{E}[y_{n+1} - x_{n+1}\hat{\beta}] = \mathbb{E}[y_{n+1}] - x_{n+1}\mathbb{E}[\hat{\beta}] = x_{n+1}\beta - x_{n+1}\beta = 0$$

Because $\hat{y}_{n+1}$ does not depend on $\epsilon_{n+1}$ (it is built from the first $n$ observations only):
$$\operatorname{Var}[\hat{\epsilon}_{n+1}] = \operatorname{Var}[y_{n+1} - \hat{y}_{n+1}] = \operatorname{Var}[y_{n+1}] + \operatorname{Var}[\hat{y}_{n+1}]$$

Hence:
$$\operatorname{Var}[\hat{\epsilon}_{n+1}] = \sigma^2 + \operatorname{Var}[x_{n+1}\hat{\beta}] = \sigma^2 + x_{n+1}\operatorname{Var}[\hat{\beta}](x_{n+1})^T = \sigma^2 + \sigma^2 x_{n+1}(X^T X)^{-1}(x_{n+1})^T$$
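A sketch of this variance formula on the Advertising data, with $\sigma^2$ replaced by $\hat{\sigma}^2$ and a hypothetical new budget $x_{n+1}$ (the values below are made up for illustration):

```r
fit_ad <- lm(sales ~ TV + radio + newspaper, data = ad)
X <- model.matrix(fit_ad)
x_new <- c(1, 150, 20, 30)                 # hypothetical budgets: intercept, TV, radio, newspaper
sigma2_hat <- summary(fit_ad)$sigma^2
h <- drop(t(x_new) %*% solve(crossprod(X)) %*% x_new)   # x_{n+1} (X^T X)^{-1} x_{n+1}^T
sqrt(sigma2_hat * (1 + h))                 # estimated sd of the prediction error
```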
Assuming $\mathbb{1}$ is in the model, and using Pythagoras' theorem (valid since $\hat{y} - \bar{y}\mathbb{1} \in \mathcal{M}(X)$ and $\hat{\epsilon} \perp \mathcal{M}(X)$):
$$\|y - \bar{y}\mathbb{1}\|^2 = \|\hat{y} - \bar{y}\mathbb{1} + y - \hat{y}\|^2 = \|\hat{y} - \bar{y}\mathbb{1} + \hat{\epsilon}\|^2 = \|\hat{y} - \bar{y}\mathbb{1}\|^2 + \|\hat{\epsilon}\|^2$$
$$\|y - \bar{y}\mathbb{1}\|^2 = \|\hat{y} - \bar{y}\mathbb{1}\|^2 + \|\hat{\epsilon}\|^2 \qquad \Longleftrightarrow \qquad \mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}$$
$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\|\hat{y} - \bar{y}\mathbb{1}\|^2}{\|y - \bar{y}\mathbb{1}\|^2} = 1 - \frac{\|\hat{\epsilon}\|^2}{\|y - \bar{y}\mathbb{1}\|^2} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$
$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\|\hat{y} - \bar{y}\mathbb{1}\|^2}{\|y - \bar{y}\mathbb{1}\|^2} = \cos^2(\theta),$$
with $\theta$ the angle between $y - \bar{y}\mathbb{1}$ and $\hat{y} - \bar{y}\mathbb{1}$.

$$\mathrm{TSS} = \|y - \bar{y}\mathbb{1}\|^2 = \sum_{i=1}^n (y_i - \bar{y})^2 \quad \text{("total inertia")}$$
$$\mathrm{RSS} = \|y - \hat{y}\|^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \quad \text{("intra-class inertia")}$$
$$\mathrm{ESS} = \|\hat{y} - \bar{y}\mathbb{1}\|^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \quad \text{("inter-class inertia")}$$

$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$
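A direct check of the decomposition $\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}$ and of $R^2$ on the Advertising fit (a sketch, assuming `ad` is loaded):

```r
fit_ad <- lm(sales ~ TV + radio + newspaper, data = ad)
y_obs <- ad$sales; y_hat <- fitted(fit_ad)
TSS <- sum((y_obs - mean(y_obs))^2)
ESS <- sum((y_hat - mean(y_obs))^2)
RSS <- sum((y_obs - y_hat)^2)
c(TSS = TSS, ESS_plus_RSS = ESS + RSS)                        # equal
c(by_hand = ESS / TSS, summary = summary(fit_ad)$r.squared)   # both 0.8972
```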
$$y_i = -2 + 3 x_{i1} - x_{i2} + \epsilon_i$$

```r
set.seed(12890926)
## Predictors
n <- 100
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = 0, max = 4)
## Noise
eps <- rnorm(n, mean = 0, sd = 5)
## Model sim
beta_0 <- -2; beta_1 <- 3; beta_2 <- -1
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
```
```r
fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)$r.squared
```

```
## [1] 0.3337178
```

```r
## unrelated noise variable x_3
x_3 <- runif(n, min = -4, max = 0)
## Fit
fit2 <- lm(y_sim ~ x_1 + x_2 + x_3)
summary(fit2)$r.squared
```

```
## [1] 0.3404782
```
The more predictors there are, the higher $R^2$ gets!
$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\|\hat{\epsilon}\|^2}{\|y - \bar{y}\mathbb{1}\|^2} = 1 - \frac{\|y - X\hat{\beta}\|^2}{\|y - \bar{y}\mathbb{1}\|^2}$$

If $X' = (X \;\; x_{p+1})$ has one more column, then:
$$\|y - X'\hat{\beta}'\|^2 = \min_{\beta' \in \mathbb{R}^{p+1}} \|y - X'\beta'\|^2 = \min_{\substack{\beta \in \mathbb{R}^p \\ \beta_{p+1} \in \mathbb{R}}} \|y - X\beta - x_{p+1}\beta_{p+1}\|^2 \leq \min_{\beta \in \mathbb{R}^p} \|y - X\beta\|^2$$
Hence:
$$\|y - X'\hat{\beta}'\|^2 \leq \|y - X\hat{\beta}\|^2,$$
and $R^2$ can only increase when a predictor is added.
$$R^2_a = 1 - \frac{\mathrm{RSS}/(n-p)}{\mathrm{TSS}/(n-1)} = 1 - \frac{(n-1)\,\mathrm{RSS}}{(n-p)\,\mathrm{TSS}} = 1 - \frac{n-1}{n-p}\,(1 - R^2)$$

Penalize for the number of predictors $p$.

Attention: $p$ includes the intercept (rank of the matrix $X$).
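A sketch computing $R^2_a$ by hand on the Advertising fit, matching `summary()` (here $n = 200$ and $p = 4$, intercept included):

```r
fit_ad <- lm(sales ~ TV + radio + newspaper, data = ad)
n_ad <- nrow(ad); p_ad <- length(coef(fit_ad))        # 200 and 4
r2 <- summary(fit_ad)$r.squared
c(by_hand = 1 - (n_ad - 1) / (n_ad - p_ad) * (1 - r2),
  summary = summary(fit_ad)$adj.r.squared)            # both 0.8956
```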
$$y_i = -2 + 3 x_{i1} - x_{i2} + \epsilon_i$$

```r
fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)$adj.r.squared
```

```
## [1] 0.31998
```

```r
## unrelated noise variable x_3
x_3 <- runif(n, min = -4, max = 0)
## Fit
fit2 <- lm(y_sim ~ x_1 + x_2 + x_3)
summary(fit2)$adj.r.squared
```

```
## [1] 0.3133534
```
```r
fit_all <- lm(sales ~ TV + radio + newspaper, data = ad)
summary(fit_all)
```

```
## 
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
```
Confidence intervals for $(\hat{\beta}, \hat{\sigma}^2)$?

Can we test $\beta = 0$ (i.e. no linear trend)?