Paul Bastide - Ibrahim Bouzalmat
14/02/2024
Model:
$$y = X\beta + \epsilon$$

OLS estimator:
$$\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}} \|y - X\beta\|^2$$

Fitted values and residuals:
$$\hat{y} = X\hat{\beta} = P_X y \qquad \hat{\epsilon} = y - \hat{y} = (I_n - P_X)y = P_{X^\perp} y$$

Moments:
$$\mathbb{E}[\hat{\epsilon}] = 0 \qquad \operatorname{Var}[\hat{\epsilon}] = \sigma^2 P_{X^\perp}$$
$$\mathbb{E}[\hat{y}] = X\beta \qquad \operatorname{Var}[\hat{y}] = \sigma^2 P_X$$
$$\operatorname{Cov}[\hat{\epsilon}; \hat{y}] = 0.$$

Variance estimator:
$$\hat{\sigma}^2 = \frac{1}{n-p}\|\hat{\epsilon}\|^2 \qquad \mathbb{E}[\hat{\sigma}^2] = \sigma^2$$
Confidence intervals for $\hat{\beta}$, $\hat{\sigma}^2$?

Prediction intervals?

Assumptions on the moments are not enough.

↪ We need assumptions on the specific distribution of the $\epsilon_i$.

Most common assumption: the $\epsilon_i$ are Gaussian.
The linear model:
$$y = X\beta + \epsilon$$

The column and row vectors of $X$ are written as:
$$X = \begin{pmatrix} x^1 & \cdots & x^p \end{pmatrix} = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$$
with $x^j$ the $j$-th column of $X$ and $x_i$ its $i$-th row.

Hence:
$$y_i = x_i\beta + \epsilon_i \quad \forall\, 1 \le i \le n$$
Model:
$$y_i = x_i\beta + \epsilon_i \quad \text{with} \quad \epsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2) \quad \text{for } 1 \le i \le n$$

Distribution of $y_i$: the $y_i$ are independent, with
$$y_i \sim \mathcal{N}(x_i\beta, \sigma^2) \quad \text{for } 1 \le i \le n$$
and
$$L(\beta, \sigma^2 \mid y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - x_i\beta)^2}{2\sigma^2}\right)$$

Likelihood of $y = (y_1, \dots, y_n)^T$:
$$L(\beta, \sigma^2 \mid y_1, \dots, y_n) = \prod_{i=1}^{n} L(\beta, \sigma^2 \mid y_i) \quad [\text{ind.}]$$

Hence the log-likelihood:
$$\log L(\beta, \sigma^2 \mid y) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\underbrace{\sum_{i=1}^{n}(y_i - x_i\beta)^2}_{\text{Sum of Squares } \|y - X\beta\|^2}$$

For any fixed $\sigma^2$, maximizing $\log L$ in $\beta$ amounts to minimizing $\|y - X\beta\|^2$.

↪ The ML estimator of $\beta$ is the same as the OLS estimator.
$$\begin{aligned}
\hat{\sigma}^2_{ML} &= \underset{\sigma^2 \in \mathbb{R}_+^*}{\operatorname{argmax}}\ \log L(\hat{\beta}, \sigma^2 \mid y)
= \underset{\sigma^2 \in \mathbb{R}_+^*}{\operatorname{argmax}}\ \left\{ -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i\hat{\beta})^2 \right\} \\
&= \underset{\sigma^2 \in \mathbb{R}_+^*}{\operatorname{argmin}}\ \left\{ \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}\hat{\epsilon}_i^2 \right\}
\end{aligned}$$

Setting the derivative in $\sigma^2$ to zero gives:
$$\hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}\hat{\epsilon}_i^2 = \frac{\|\hat{\epsilon}\|^2}{n} = \frac{\|y - X\hat{\beta}\|^2}{n}$$

Attention:
$$\hat{\sigma}^2_{ML} = \frac{\|\hat{\epsilon}\|^2}{n} \ne \frac{\|\hat{\epsilon}\|^2}{n-p} = \hat{\sigma}^2$$
i.e. the ML estimator of the variance is biased.
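As a quick numerical check (a minimal sketch on hypothetical simulated data; the fit and values are illustrative), R's `sigma()` reports the unbiased estimate $\hat{\sigma}^2$, while `logLik()` evaluates the Gaussian log-likelihood at the ML estimate $\hat{\sigma}^2_{ML}$:

```r
## Minimal sketch on toy simulated data (illustrative values).
set.seed(42)
n <- 50; p <- 2
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)   ## ||eps_hat||^2
rss / n                        ## ML estimate (biased)
sigma(fit)^2                   ## unbiased estimate, rss / (n - p)
## logLik() plugs the ML variance into the Gaussian log-likelihood:
-n / 2 * log(2 * pi * rss / n) - n / 2
logLik(fit)                    ## same value
```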
Let $X$ be a Gaussian r.v. with expectation $\mu$ and variance $\sigma^2$: $X \sim \mathcal{N}(\mu, \sigma^2)$.

It admits a probability density function (pdf):
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \quad \forall x \in \mathbb{R}.$$

Moments:
$$\mathbb{E}[X] = \mu \qquad \operatorname{Var}[X] = \sigma^2$$
Let $X$ be a Gaussian vector with expectation $\mu$ and variance $\Sigma$: $X \sim \mathcal{N}(\mu, \Sigma)$.

When $\Sigma$ is invertible, it admits a probability density function (pdf):
$$f(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) \quad \forall\, x \in \mathbb{R}^n.$$

Any linear combination of its coordinates is Gaussian.

Moments:
$$\mathbb{E}[X] = \mu \qquad \operatorname{Var}[X] = \mathbb{E}[(X - \mu)(X - \mu)^T] = \Sigma$$
Property:
If $X \sim \mathcal{N}(\mu, \Sigma)$, then for any matrix $a$ and vector $b$ of compatible dimensions:
$$Y = aX + b \sim \mathcal{N}(a\mu + b, a\Sigma a^T).$$
Property:
Let $X \sim \mathcal{N}(\mu, \Sigma)$, with $\Sigma$ invertible. Then $\Sigma^{-1}$ is symmetric, positive definite, and it is possible to find $\Sigma^{-1/2}$ invertible such that:
$$[\Sigma^{-1/2}]^T \Sigma^{-1/2} = \Sigma^{-1}$$

Proof (standardization): define
$$Y = \Sigma^{-1/2}(X - \mu),$$
i.e. take $a = \Sigma^{-1/2}$ and $b = -\Sigma^{-1/2}\mu$.

Then:
$$a\mu + b = \Sigma^{-1/2}\mu - \Sigma^{-1/2}\mu = 0$$
and:
$$a\Sigma a^T = \Sigma^{-1/2}\Sigma[\Sigma^{-1/2}]^T = I.$$

Hence $Y = \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, I)$.
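A short simulation sketch of this standardization, assuming the `MASS` package and building $\Sigma^{-1/2}$ from an eigendecomposition (the covariance matrix below is an arbitrary illustrative choice):

```r
library(MASS)  ## for mvrnorm
set.seed(42)
Sigma <- matrix(c(2, 1, 1, 2), 2, 2)
mu <- c(1, -1)
## Sigma = V D V^T, so Sigma^{-1/2} = D^{-1/2} V^T satisfies
## [Sigma^{-1/2}]^T Sigma^{-1/2} = V D^{-1} V^T = Sigma^{-1}.
eig <- eigen(Sigma, symmetric = TRUE)
Sigma_inv_sqrt <- diag(1 / sqrt(eig$values)) %*% t(eig$vectors)
## Simulate X ~ N(mu, Sigma) and standardize: Y = Sigma^{-1/2}(X - mu)
X <- mvrnorm(10000, mu = mu, Sigma = Sigma)
Y <- t(Sigma_inv_sqrt %*% (t(X) - mu))
colMeans(Y)  ## approximately (0, 0)
cov(Y)       ## approximately the identity I_2
```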
$$\hat{\beta} = (X^TX)^{-1}X^Ty \qquad \mathbb{E}[\hat{\beta}] = \beta \qquad \operatorname{Var}[\hat{\beta}] = \sigma^2(X^TX)^{-1}.$$

As $y$ is a Gaussian vector and $\hat{\beta}$ is a linear transformation of $y$, $\hat{\beta}$ is also a Gaussian vector.

Hence, when the variance $\sigma^2$ is known, we get:
$$\hat{\beta} \sim \mathcal{N}\left(\beta, \sigma^2(X^TX)^{-1}\right).$$
Definition:
Let $X_1, \dots, X_p$ be $p$ iid standard normal r.v.: $X_i \sim \mathcal{N}(0, 1)$. Then
$$X = \sum_{i=1}^{p} X_i^2 \sim \chi^2_p$$
is a Chi-squared r.v. with $p$ degrees of freedom.
Property:
Let $X$ be a Gaussian vector of size $p$ with expectation $\mu$ and variance $\Sigma$: $X \sim \mathcal{N}(\mu, \Sigma)$. Then, if $\Sigma$ is invertible,
$$(X - \mu)^T \Sigma^{-1} (X - \mu) \sim \chi^2_p,$$
a Chi-squared r.v. with $p$ degrees of freedom.
Proof: $X$ is a Gaussian vector $X \sim \mathcal{N}(\mu, \Sigma)$ with $\Sigma$ invertible, so:
$$Y = \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, I_p).$$
And:
$$(X - \mu)^T \Sigma^{-1} (X - \mu) = Y^TY = \sum_{i=1}^{p} Y_i^2$$
Hence:
$$(X - \mu)^T \Sigma^{-1} (X - \mu) = \sum_{i=1}^{p} Y_i^2 \sim \chi^2_p.$$
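This property can be checked empirically (a sketch with an arbitrary illustrative $\Sigma$; `mahalanobis()` computes exactly the quadratic form above):

```r
library(MASS)
set.seed(42)
Sigma <- matrix(c(2, 1, 1, 2), 2, 2)
mu <- c(1, -1)
X <- mvrnorm(10000, mu = mu, Sigma = Sigma)
## Quadratic form (X - mu)^T Sigma^{-1} (X - mu), row by row
Q <- mahalanobis(X, center = mu, cov = Sigma)
mean(Q); var(Q)               ## close to p = 2 and 2p = 4
ks.test(Q, "pchisq", df = 2)  ## consistent with a chi^2_2
```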
Theorem (Cochran): Let $Y \sim \mathcal{N}(\mu, \sigma^2 I_n)$ be a Gaussian vector of size $n$, and let $P$ be the matrix of an orthogonal projection onto a subspace of dimension $p$, with $P^\perp = I_n - P$.
Then we have:
- $PY$ and $P^\perp Y$ are independent Gaussian vectors;
- $\frac{1}{\sigma^2}\|P(Y - \mu)\|^2 \sim \chi^2_p$ and $\frac{1}{\sigma^2}\|P^\perp(Y - \mu)\|^2 \sim \chi^2_{n-p}$.

Hints:
1. $X \sim \mathcal{N}(\mu, \Sigma) \implies aX + b \sim \mathcal{N}(a\mu + b, a\Sigma a^T)$.
2. $P = U\Delta U^T$, $\quad UU^T = I_n$, $\quad \Delta = \begin{pmatrix} I_p & 0 \\ 0 & 0 \end{pmatrix}$.
2 (bis). $Z = U^TY \sim \mathcal{N}(\dots, \dots)$ and $\Delta Z = \begin{pmatrix} Z_p \\ 0_{n-p} \end{pmatrix}$.
From the linearity property of Gaussian vectors:
$$PY \sim \mathcal{N}(P\mu, P[\sigma^2 I_n]P^T)$$
but:
$$P[\sigma^2 I_n]P^T = \sigma^2 PP^T = \sigma^2 PP \quad [\text{orthogonal: } P^T = P] \quad = \sigma^2 P \quad [\text{projection: } P^2 = P]$$
Same for $P^\perp Y$.
As $P$ is an orthogonal projection matrix, it diagonalizes as:
$$P = U\Delta U^T, \quad UU^T = I_n, \quad \Delta = \begin{pmatrix} I_p & 0 \\ 0 & 0 \end{pmatrix}$$

Define:
$$Z = U^TY \sim \mathcal{N}(U^T\mu, U^T[\sigma^2 I_n]U) = \mathcal{N}(U^T\mu, \sigma^2 I_n).$$

Then, as the coordinates of $Z$ are independent (diagonal covariance matrix) and the two vectors below involve disjoint blocks of coordinates:
$$\Delta Z = \begin{pmatrix} Z_p \\ 0_{n-p} \end{pmatrix} \quad \text{independent from} \quad (I_n - \Delta)Z = \begin{pmatrix} 0_p \\ Z_{n-p} \end{pmatrix}$$
$$\begin{aligned}
\frac{1}{\sigma^2}\|P(Y - \mu)\|^2
&= \frac{1}{\sigma^2}\|U\Delta U^T(Y - \mu)\|^2 \\
&= \frac{1}{\sigma^2}[\Delta U^T(Y - \mu)]^T U^TU [\Delta U^T(Y - \mu)] \\
&= \frac{1}{\sigma^2}(\Delta U^TY - \Delta U^T\mu)^T(\Delta U^TY - \Delta U^T\mu) \quad [U^TU = I_n]
\end{aligned}$$

But:
$$\Delta U^TY = \Delta Z = \begin{pmatrix} Z_p \\ 0_{n-p} \end{pmatrix} \quad \text{and} \quad \Delta U^T\mu = \begin{pmatrix} (U^T\mu)_p \\ 0_{n-p} \end{pmatrix}$$

Hence:
$$\frac{1}{\sigma^2}\|P(Y - \mu)\|^2 = \frac{1}{\sigma^2}(Z_p - (U^T\mu)_p)^T(Z_p - (U^T\mu)_p) = (Z_p - (U^T\mu)_p)^T[\sigma^2 I_p]^{-1}(Z_p - (U^T\mu)_p)$$

But:
$$Z \sim \mathcal{N}(U^T\mu, \sigma^2 I_n) \quad \text{hence} \quad Z_p \sim \mathcal{N}((U^T\mu)_p, \sigma^2 I_p)$$

so, thanks to the previous lemma:
$$\frac{1}{\sigma^2}\|P(Y - \mu)\|^2 \sim \chi^2_p$$

Same for $\frac{1}{\sigma^2}\|P^\perp(Y - \mu)\|^2 \sim \chi^2_{n-p}$.
When the variance $\sigma^2$ is known, we get:
$$\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$$
and:
$\hat{\beta}$ and $\hat{\sigma}^2$ are independent.
We have:
$$y = X\beta + \epsilon \sim \mathcal{N}(X\beta, \sigma^2 I_n).$$

And:
$$\hat{\sigma}^2 = \frac{\|y - X\hat{\beta}\|^2}{n - p} = \frac{\|y - \hat{y}\|^2}{n - p} = \frac{\|y - P_Xy\|^2}{n - p} = \frac{\|P_{X^\perp}y\|^2}{n - p}$$

Hence:
$$\frac{n - p}{\sigma^2}\hat{\sigma}^2 = \frac{1}{\sigma^2}\|P_{X^\perp}y\|^2$$

As $P_{X^\perp}X\beta = 0$, we have $P_{X^\perp}y = P_{X^\perp}(y - X\beta)$, and, from Cochran's Theorem:
$$\frac{n - p}{\sigma^2}\hat{\sigma}^2 = \frac{1}{\sigma^2}\|P_{X^\perp}(y - X\beta)\|^2 \sim \chi^2_{n-p}.$$
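A simulation sketch of this result (hypothetical design and parameters): replicate the model many times and compare the rescaled variance estimator with a $\chi^2_{n-p}$ distribution.

```r
set.seed(42)
n <- 30; sigma2 <- 4
x <- runif(n)
p <- 2  ## intercept + slope
stat <- replicate(5000, {
  y <- 1 + 2 * x + rnorm(n, sd = sqrt(sigma2))
  fit <- lm(y ~ x)
  (n - p) * sigma(fit)^2 / sigma2   ## (n - p) sigma_hat^2 / sigma^2
})
mean(stat)                          ## close to n - p = 28
ks.test(stat, "pchisq", df = n - p) ## consistent with a chi^2_{n-p}
```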
$$\hat{\beta} = (X^TX)^{-1}X^Ty = (X^TX)^{-1}(X^TX)(X^TX)^{-1}X^Ty = (X^TX)^{-1}X^T\left(X(X^TX)^{-1}X^T\right)y = (X^TX)^{-1}X^TP_Xy$$

and:
$$\hat{\sigma}^2 = \frac{1}{n - p}\|P_{X^\perp}y\|^2.$$

But $P_Xy$ and $P_{X^\perp}y$ are independent from Cochran's theorem.

Hence, $\hat{\beta}$ and $\hat{\sigma}^2$ are independent.
When $\sigma^2$ is known:
$$\hat{\beta} \sim \mathcal{N}\left(\beta, \sigma^2(X^TX)^{-1}\right),$$
i.e. for $1 \le k \le p$:
$$\hat{\beta}_k \sim \mathcal{N}\left(\beta_k, \sigma^2[(X^TX)^{-1}]_{kk}\right),$$
i.e.
$$\frac{\hat{\beta}_k - \beta_k}{\sqrt{\sigma^2[(X^TX)^{-1}]_{kk}}} \sim \mathcal{N}(0, 1).$$

When $\sigma^2$ is unknown, for any $1 \le k \le p$:
$$\frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2[(X^TX)^{-1}]_{kk}}} \sim \mathcal{T}_{n-p}.$$
Attention:
$$[(X^TX)^{-1}]_{kk} \ne [(X^TX)_{kk}]^{-1}$$
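A two-line numerical illustration of this warning (arbitrary 2×2 matrix):

```r
A <- matrix(c(2, 1, 1, 2), 2, 2)
solve(A)[1, 1]  ## invert the matrix, then take entry (1,1): 2/3
1 / A[1, 1]     ## take entry (1,1), then invert: 1/2
```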
Definition: Let $Z \sim \mathcal{N}(0, 1)$ and $X \sim \chi^2_p$ be independent. Then
$$T = \frac{Z}{\sqrt{X/p}} \sim \mathcal{T}_p$$
is a Student r.v. with $p$ degrees of freedom.
We have:
$$\frac{\hat{\beta}_k - \beta_k}{\sqrt{\sigma^2[(X^TX)^{-1}]_{kk}}} \sim \mathcal{N}(0, 1) \quad \text{and} \quad \frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$$

with:
$$\frac{\hat{\beta}_k - \beta_k}{\sqrt{\sigma^2[(X^TX)^{-1}]_{kk}}} \quad \text{and} \quad \frac{(n-p)\hat{\sigma}^2}{\sigma^2} \quad \text{independent.}$$

Hence:
$$\frac{\dfrac{\hat{\beta}_k - \beta_k}{\sqrt{\sigma^2[(X^TX)^{-1}]_{kk}}}}{\sqrt{\dfrac{(n-p)\hat{\sigma}^2}{\sigma^2} \Big/ (n-p)}} = \frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2[(X^TX)^{-1}]_{kk}}} \sim \mathcal{T}_{n-p}$$
Writing $\hat{\sigma}^2_k = \hat{\sigma}^2[(X^TX)^{-1}]_{kk}$, we have:
$$\frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma}^2_k}} \sim \mathcal{T}_{n-p} \quad \text{and} \quad \frac{n-p}{\sigma^2}\hat{\sigma}^2 \sim \chi^2_{n-p}$$

Hence, with probability $1 - \alpha$:
$$\beta_k \in \left[\hat{\beta}_k \pm t_{n-p}(1 - \alpha/2)\sqrt{\hat{\sigma}^2_k}\right] \qquad \sigma^2 \in \left[\frac{(n-p)\hat{\sigma}^2}{c_{n-p}(1 - \alpha/2)}; \frac{(n-p)\hat{\sigma}^2}{c_{n-p}(\alpha/2)}\right]$$

where $t_{n-p}(\cdot)$ and $c_{n-p}(\cdot)$ denote the quantile functions of the $\mathcal{T}_{n-p}$ and $\chi^2_{n-p}$ distributions.
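In R, `confint()` returns the intervals for the $\beta_k$; the interval for $\sigma^2$ can be assembled from the $\chi^2$ quantiles (a sketch on hypothetical simulated data):

```r
set.seed(42)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
confint(fit, level = 0.95)   ## intervals for the beta_k
## Interval for sigma^2 from the chi-squared quantiles:
df <- fit$df.residual        ## n - p
s2 <- sigma(fit)^2           ## sigma_hat^2
df * s2 / qchisq(c(0.975, 0.025), df = df)  ## (lower, upper)
```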
Hypotheses: $H_0: \beta_k = 0$ vs $H_1: \beta_k \ne 0$

Test statistic: $T_k = \dfrac{\hat{\beta}_k}{\sqrt{\hat{\sigma}^2_k}} \underset{H_0}{\sim} \mathcal{T}_{n-p}$

Rejection region: $\mathbb{P}[\text{Reject} \mid H_0 \text{ true}] \le \alpha \iff R_\alpha = \{t \in \mathbb{R} \,:\, |t| \ge t_{n-p}(1 - \alpha/2)\}$

p-value: $p = \mathbb{P}_{H_0}[|T_k| > T_k^{obs}]$
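These are exactly the `t value` and `Pr(>|t|)` columns of `summary(lm(...))`; a sketch recomputing them by hand on hypothetical data:

```r
set.seed(42)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
X <- model.matrix(fit)
se <- sqrt(sigma(fit)^2 * diag(solve(t(X) %*% X)))  ## sqrt(sigma_hat^2_k)
t_obs <- coef(fit) / se                             ## T_k under H0: beta_k = 0
p_val <- 2 * pt(-abs(t_obs), df = fit$df.residual)  ## two-sided p-value
cbind(t_obs, p_val)
summary(fit)$coefficients  ## same t values and p-values
```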
Simulate data according to the model:
$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

```r
set.seed(12890926)
## Predictors
n <- 100
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = 0, max = 4)
## Noise
eps <- rnorm(n, mean = 0, sd = 1)
## Model sim
beta_0 <- -1; beta_1 <- 3; beta_2 <- -1
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps
```
Fit the model:
$$y_i = -1 + 3x_{i,1} - x_{i,2} + \epsilon_i$$

```r
fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)
```
```
## 
## Call:
## lm(formula = y_sim ~ x_1 + x_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.54413 -0.71088  0.00976  0.66096  1.98562 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.83037    0.19076  -4.353 3.33e-05 ***
## x_1          2.87065    0.08777  32.706  < 2e-16 ***
## x_2         -1.07506    0.08229 -13.064  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9761 on 97 degrees of freedom
## Multiple R-squared:  0.9376, Adjusted R-squared:  0.9363 
## F-statistic: 728.9 on 2 and 97 DF,  p-value: < 2.2e-16
```
```r
## unrelated noise variable x_3
x_3 <- runif(n, min = -4, max = 0)
## Fit
fit <- lm(y_sim ~ x_1 + x_2 + x_3)
summary(fit)
```
```
## 
## Call:
## lm(formula = y_sim ~ x_1 + x_2 + x_3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6257 -0.7069  0.0366  0.6483  1.9548 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.99755    0.25455  -3.919 0.000167 ***
## x_1          2.87506    0.08789  32.711  < 2e-16 ***
## x_2         -1.07733    0.08233 -13.086  < 2e-16 ***
## x_3         -0.08755    0.08826  -0.992 0.323698    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9762 on 96 degrees of freedom
## Multiple R-squared:  0.9382, Adjusted R-squared:  0.9363 
## F-statistic: 486.2 on 3 and 96 DF,  p-value: < 2.2e-16
```
The prediction error $\hat{\epsilon}_{n+1} = y_{n+1} - \hat{y}_{n+1}$ is such that:
$$\hat{\epsilon}_{n+1} \sim \mathcal{N}\left(0, \sigma^2\left(1 + x_{n+1}(X^TX)^{-1}x_{n+1}^T\right)\right)$$

Remark: $x_{n+1}(X^TX)^{-1}x_{n+1}^T$ is a scalar.

We get:
$$\frac{y_{n+1} - \hat{y}_{n+1}}{\sqrt{\hat{\sigma}^2\left(1 + x_{n+1}(X^TX)^{-1}x_{n+1}^T\right)}} \sim \mathcal{T}_{n-p}$$
Proof:
$$\frac{y_{n+1} - \hat{y}_{n+1}}{\sqrt{\hat{\sigma}^2\left(1 + x_{n+1}(X^TX)^{-1}x_{n+1}^T\right)}} = \frac{\dfrac{y_{n+1} - \hat{y}_{n+1}}{\sqrt{\sigma^2\left(1 + x_{n+1}(X^TX)^{-1}x_{n+1}^T\right)}}}{\sqrt{\dfrac{(n-p)\hat{\sigma}^2}{\sigma^2} \Big/ (n-p)}}$$

with a standard normal numerator, independent of the $\chi^2_{n-p}$ variable in the denominator, hence the $\mathcal{T}_{n-p}$ distribution.
Hence, with probability $1 - \alpha$:
$$y_{n+1} \in \left[\hat{y}_{n+1} \pm t_{n-p}(1 - \alpha/2)\sqrt{\hat{\sigma}^2\left(1 + x_{n+1}(X^TX)^{-1}x_{n+1}^T\right)}\right]$$
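In R, `predict()` with `interval = "prediction"` returns exactly this interval (a sketch on hypothetical data and a hypothetical new point):

```r
set.seed(42)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
## Prediction interval for a new observation at x = 0.5
predict(fit, newdata = data.frame(x = 0.5),
        interval = "prediction", level = 0.95)
```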
```r
fit_all <- lm(sales ~ TV + radio + newspaper, data = ad)
summary(fit_all)
```
```
## 
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
```