31/01/2024

Advertising Data

Advertising - Data

library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")
head(ad)
##      TV radio newspaper sales
## 1 230.1  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
## 4 151.5  41.3      58.5  18.5
## 5 180.8  10.8      58.4  12.9
## 6   8.7  48.9      75.0   7.2
  • TV, radio, newspaper: advertising budgets (thousands of $)
  • sales: number of sales (thousands of units)

Advertising - Data

attach(ad)
par(mfrow = c(1, 3))
plot(TV, sales); plot(radio, sales); plot(newspaper, sales)

Advertising - Questions

  • Previous chapters: use only one medium to predict sales
    sales \(\approx \beta_0^{TV} + \beta_1^{TV}\) TV
    or
    sales \(\approx \beta_0^{radio} + \beta_1^{radio}\) radio
    or
    sales \(\approx \beta_0^{newspaper} + \beta_1^{newspaper}\) newspaper
  • Can we use all variables at once?
    sales \(\approx \beta_0 + \beta_1\) TV \(+ \beta_2\) radio \(+ \beta_3\) newspaper
  • This makes it possible to answer new questions:
    • Is the global fit better?
    • Which media contribute to sales?
    • Is there synergy among the media?

From Simple to Multiple Regression

Model - Simple Regression

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \forall 1 \leq i \leq n\]

  • \(y_i\): quantitative response for \(i\) (sales)
  • \(x_i\): quantitative predicting variable for \(i\) (TV)
  • \(\epsilon_i\): “error” for \(i\)
    • random variable
    • (H1) \(\mathbb{E}[\epsilon_i] = 0\) for all \(i\) (centered)
    • (H2) \(\mathbb{Var}[\epsilon_i] = \sigma^2\) for all \(i\) (identical variance)
    • (H3) \(\mathbb{Cov}[\epsilon_i; \epsilon_j] = 0\) for all \(i \neq j\) (uncorrelated)

Model - Multiple Regression

\[y_i = \beta_0 + \beta_1 x_{i1} + \dotsb + \beta_p x_{ip} + \epsilon_i, \quad \forall 1 \leq i \leq n\]

  • \(y_i\): quantitative response for \(i\) (sales)
  • \(x_{ij}\): quantitative predicting variable \(j\) for observation \(i\)
    (TV, radio and newspaper, i.e. \(p=3\))
  • \(\epsilon_i\): “error” for \(i\)
    • random variable
    • (H1) \(\mathbb{E}[\epsilon_i] = 0\) for all \(i\) (centered)
    • (H2) \(\mathbb{Var}[\epsilon_i] = \sigma^2\) for all \(i\) (identical variance)
    • (H3) \(\mathbb{Cov}[\epsilon_i; \epsilon_j] = 0\) for all \(i \neq j\) (uncorrelated)

Simple Regression - Vector notation

\[ \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \beta_0 \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} + \beta_1 \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \] i.e. \[\mathbf{y} = \beta_0 \mathbb{1} + \beta_1 \mathbf{x} + \boldsymbol{\epsilon}\]

  • \(\mathbf{y} = (y_1, \dotsc, y_n)^T\) random vector of responses
  • \(\mathbb{1} = (1, \dotsc, 1)^T\) vector of ones
  • \(\mathbf{x} = (x_1, \dotsc, x_n)^T\) non random vector of predictors
  • \(\boldsymbol{\epsilon} = (\epsilon_1, \dotsc, \epsilon_n)^T\) random vector of errors
  • \(\beta_0\), \(\beta_1\) non random, unknown coefficients

Multiple Regression - Matrix notation

\[y_i = \beta_0 + \beta_1 x_{i1} + \dotsb + \beta_p x_{ip} + \epsilon_i, \quad \forall 1 \leq i \leq n\]

\[ \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \beta_0 \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} + \beta_1 \begin{pmatrix} x_{11} \\ \vdots \\ x_{n1} \end{pmatrix} + \dotsb + \beta_p \begin{pmatrix} x_{1p} \\ \vdots \\ x_{np} \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \]

\[ \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \dotsc & x_{1p}\\ \vdots \\ 1 & x_{n1} & \dotsc & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \]

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

Multiple Regression - Matrix notation

\[y_i = \beta_0 + \sum_{k = 1}^p \beta_k x_{ik} + \epsilon_i, \quad \forall 1 \leq i \leq n\]

i.e. \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

  • \(\mathbf{y}\) random vector of responses
  • \(\mathbf{X}\) non random matrix of predictors
  • \(\boldsymbol{\epsilon}\) random vector of errors
  • \(\boldsymbol{\beta}\) non random, unknown vector of coefficients

Multiple Regression - Intercept

With the intercept, the model is written as: \[ \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \dotsc & x_{1p}\\ \vdots \\ 1 & x_{n1} & \dotsc & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \] Re-numbering, we get: \[ \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x'_{11} & x'_{12} & \dotsc & x'_{1(p+1)}\\ \vdots \\ x'_{n1} & x'_{n2} & \dotsc & x'_{n(p+1)} \end{pmatrix} \begin{pmatrix} \beta'_1 \\ \beta'_2 \\ \vdots \\ \beta'_{p+1} \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \] This way, the intercept column \(\mathbb{1} = (1, \dotsc, 1)^T\) is treated as just another predictor.

Multiple Regression - Intercept

Without loss of generality, we write:

\[ \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & \dotsc & x_{1p}\\ \vdots \\ x_{n1} & \dotsc & x_{np} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \]

We use this convention in the rest of the text.

Model

Multiple Regression - Model

Model:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

  • \(\mathbf{y}\) random vector of \(n\) responses
  • \(\mathbf{X}\) non random \(n\times p\) matrix of predictors
  • \(\boldsymbol{\epsilon}\) random vector of \(n\) errors
  • \(\boldsymbol{\beta}\) non random, unknown vector of \(p\) coefficients

Assumptions:

  • (H1) \(rg(\mathbf{X}) = p\)
  • (H2) \(\mathbb{E}[\boldsymbol{\epsilon}] = 0\) and \(\mathbb{Var}[\boldsymbol{\epsilon}] = \sigma^2 \mathbf{I}_n\)

Notations

The column and row vectors of \(\mathbf{X}\) are written as:

\[ \mathbf{X} = \begin{pmatrix} \mathbf{x}_{1} & \cdots & \mathbf{x}_{p} \end{pmatrix} = \begin{pmatrix} \mathbf{x}^{1} \\ \vdots \\ \mathbf{x}^{n} \end{pmatrix} \]

For \(1\leq i \leq n\): \[ y_i = [\mathbf{X}\boldsymbol{\beta}]_i + \epsilon_i = \mathbf{x}^{i} \boldsymbol{\beta} + \epsilon_i \]

And: \[ \mathbf{y} = \sum_{k = 1}^p \beta_k \mathbf{x}_k + \boldsymbol{\epsilon} \]

Degrees of freedom

For the model \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \quad \text{with} \quad rg(\mathbf{X}) = p \]

there are \(n-p\) degrees of freedom.
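As a quick sanity check in R (a sketch, assuming the Advertising data frame ad loaded earlier): with \(n = 200\) observations and a design matrix containing the intercept plus the three media budgets, \(p = 4\) and there are \(196\) residual degrees of freedom.

fit <- lm(sales ~ TV + radio + newspaper, data = ad)
nrow(ad)          ## n = 200 observations
length(coef(fit)) ## p = 4 columns in the design matrix (intercept included)
df.residual(fit)  ## n - p = 196 degrees of freedom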

Degrees of freedom - simple regression

In a simple regression, we have: \[ \mathbf{y} = \beta_0 \mathbb{1} + \beta_1 \mathbf{x} + \boldsymbol{\epsilon} = \begin{pmatrix} 1 & x_{1}\\ \vdots & \vdots \\ 1 & x_{n} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \]

  • There are \(n - 2\) degrees of freedom.

  • But only one “explanatory variable” \(\mathbf{x}\).

Degrees of freedom - with intercept

If we write: \[ \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \dotsc & x_{1p}\\ \vdots \\ 1 & x_{n1} & \dotsc & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \]

  • There are \(n - (p + 1) = n - p - 1\) degrees of freedom.

  • But only \(p\) “explanatory variables” \(\mathbf{x}_1, \cdots, \mathbf{x}_p\).

Degrees of freedom - general case

If we write: \[ \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & \dotsc & x_{1p}\\ \vdots \\ x_{n1} & \dotsc & x_{np} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \]

  • There are \(n - p\) degrees of freedom.

  • If \(\mathbf{x}_1 = \mathbb{1}\) is the intercept,
    then there are \(p - 1\) “explanatory variables” \(\mathbf{x}_2, \cdots, \mathbf{x}_p\).

Reminders: Orthogonal Projection

Reminder - Euclidean Geometry 1/2

Let \(\mathbf{x} = (x_1, \dotsc, x_n)^T\) and \(\mathbf{y} = (y_1, \dotsc, y_n)^T\) be two vectors of \(\mathbb{R}^n\).

The squared Euclidean norm of \(\mathbf{x}\) is given by: \[ \|\mathbf{x}\|^2 = \sum_{i = 1}^n x_i^2 = \mathbf{x}^T\mathbf{x} \]

The Euclidean scalar product between \(\mathbf{x}\) and \(\mathbf{y}\) is given by: \[ \langle \mathbf{x} ; \mathbf{y}\rangle = \sum_{i = 1}^n x_iy_i = \mathbf{x}^T\mathbf{y} = \mathbf{y}^T\mathbf{x} \]

Reminder - Euclidean Geometry 2/2

We have the following formulas:

\[ \|\mathbf{x} + \mathbf{y}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 + 2\langle \mathbf{x} ; \mathbf{y}\rangle \]

\[ \|\mathbf{x} - \mathbf{y}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - 2\langle \mathbf{x} ; \mathbf{y}\rangle \]

\[ \langle \mathbf{x} ; \mathbf{y}\rangle = \frac{1}{4} \left(\|\mathbf{x} + \mathbf{y}\|^2 - \|\mathbf{x} - \mathbf{y}\|^2 \right) \]

\[ |\langle \mathbf{x} ; \mathbf{y}\rangle| \leq \|\mathbf{x}\|\|\mathbf{y}\| \qquad \text{(Cauchy-Schwarz)} \]

\[ \|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\| \qquad \text{(Triangle Inequality)} \]
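These identities are easy to verify numerically. A minimal R sketch on two arbitrary (made-up) vectors:

u <- c(1, -2, 3)
v <- c(2, 0, -1)
## Scalar product, directly and via the polarization identity
sum(u * v)
(sum((u + v)^2) - sum((u - v)^2)) / 4
## Cauchy-Schwarz and the triangle inequality
abs(sum(u * v)) <= sqrt(sum(u^2)) * sqrt(sum(v^2))
sqrt(sum((u + v)^2)) <= sqrt(sum(u^2)) + sqrt(sum(v^2))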

Reminder: Orthogonal Projection - 1/4

Let \(\mathbf{y} \in \mathbb{R}^n\),
\((\mathbf{x}_1, \dotsc, \mathbf{x}_p) \in \mathbb{R}^n \times \dotsb \times \mathbb{R}^n\) \(p\) linearly independent vectors,
and \(\mathcal{M}(\mathbf{X}) = \text{span}\{\mathbf{x}_1, \dotsc, \mathbf{x}_p\}\).

The orthogonal projection of \(\mathbf{y}\) on \(\mathcal{M}(\mathbf{X})\) is defined as the unique vector \(\text{Proj}_{\mathcal{M}(\mathbf{X})}(\mathbf{y}) \in \mathcal{M}(\mathbf{X})\) such that: \[ \text{Proj}_{\mathcal{M}(\mathbf{X})}(\mathbf{y}) = \underset{\tilde{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}\\ \boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{argmin}} \|\mathbf{y} - \tilde{\mathbf{y}} \|^2. \]

It is also the unique vector such that \(\mathbf{y} - \text{Proj}_{\mathcal{M}(\mathbf{X})}(\mathbf{y})\) is orthogonal to \(\mathcal{M}(\mathbf{X})\), i.e. orthogonal to \(\mathbf{x}_k\) for all \(1 \leq k \leq p\).

There is a unique matrix \(\mathbf{P}^\mathbf{X}\) such that \(\text{Proj}_{\mathcal{M}(\mathbf{X})}(\mathbf{y}) = \mathbf{P}^\mathbf{X} \mathbf{y}\).

Reminder: Orthogonal Projection - 2/4

Reminder: Orthogonal Projection - 3/4

Definition:

  • An \(n \times n\) matrix \(\mathbf{P}\) is a projection matrix if \(\mathbf{P}^2 = \mathbf{P}\).
  • If \(\mathbf{P}^2 = \mathbf{P}\) and \(\mathbf{P}\) is symmetric,
    then \(\mathbf{P}\) is an orthogonal projection matrix.
  • For any \(\mathbf{x} \in \mathbb{R}^n\), \(\mathbf{P}\mathbf{x}\) is the projection on \(\text{Im}(\mathbf{P})\), parallel to \(\text{Ker}(\mathbf{P})\).
  • If \(\mathbf{P}\) is a projection matrix, then \(\mathbf{P}\) is an orthogonal projection matrix iff for any \(\mathbf{x} \in \mathbb{R}^n\), we have: \[ \mathbf{x} = \mathbf{P}\mathbf{x} + (\mathbf{I}_n - \mathbf{P})\mathbf{x} \] with \(\mathbf{P}\mathbf{x}\) and \((\mathbf{I}_n - \mathbf{P})\mathbf{x}\) orthogonal.

Reminder: Matrix of Projection - 4/4

Definition:
An \(n \times n\) matrix \(\mathbf{U}\) is said to be orthogonal if \(\mathbf{U}\mathbf{U}^T = \mathbf{U}^T\mathbf{U} = \mathbf{I}_n\).
The columns of \(\mathbf{U}\) then form an orthonormal basis of \(\mathbb{R}^n\).

Property:
If \(\mathbf{P}\) is an orthogonal projection matrix, then there exists an orthogonal matrix \(\mathbf{U}\) and a diagonal matrix \(\mathbf{\Delta}\) such that: \[ \mathbf{P} = \mathbf{U}\mathbf{\Delta}\mathbf{U}^T \] with: \[ \mathbf{\Delta} = \begin{pmatrix} \mathbf{I}_p & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \\ \end{pmatrix} \quad \text{and} \quad p = \dim(\text{Im}(\mathbf{P})) = \text{rg}(\mathbf{P}) \]
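A small numerical illustration of this property (a sketch with an arbitrary simulated matrix): build the orthogonal projector onto the column space of a random \(n \times p\) matrix from an orthonormal basis of that space, then check that its eigenvalues are only 0s and 1s and that its trace equals its rank.

set.seed(1)
n <- 10; p <- 3
X <- matrix(rnorm(n * p), n, p)
Q <- qr.Q(qr(X))            ## orthonormal basis of the column space of X
P <- tcrossprod(Q)          ## orthogonal projector P = Q Q^T
range(P - t(P))             ## ~ 0: P is symmetric
range(P %*% P - P)          ## ~ 0: P is idempotent
round(eigen(P)$values, 10)  ## p ones and n - p zeros
sum(diag(P))                ## trace = rank = p = 3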

Ordinary Least Squares (OLS)

OLS - Definition

The OLS estimator \(\hat{\boldsymbol{\beta}}\) is given by:

\[ \hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{argmin}} \left\{ \sum_{i = 1}^n \left(y_i - \sum_{j = 1}^p \beta_j x_{ij}\right)^2 \right\} \]

Goal: Minimize the squared errors between:

  • the prediction of the model \(\beta_1 x_{i1} + \dotsb + \beta_p x_{ip}\) and
  • the actual observed value \(y_i\).

OLS - Projection

\[ \begin{aligned} \hat{\boldsymbol{\beta}} & = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{argmin}} \left\{ \sum_{i = 1}^n \left(y_i - \sum_{j = 1}^p \beta_j x_{ij}\right)^2 \right\}\\ & = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{argmin}} \left\{ \sum_{i = 1}^n \left(y_i - (\mathbf{X}\boldsymbol{\beta})_i\right)^2 \right\}\\ & = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{argmin}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta} \|^2 \end{aligned} \]

OLS - Projection

\[ \hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{argmin}} \left\{ \sum_{i = 1}^n \left(y_i - \sum_{j = 1}^p \beta_j x_{ij}\right)^2 \right\} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{argmin}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta} \|^2 \]

hence

\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \underset{\tilde{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}\\ \boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{argmin}} \|\mathbf{y} - \tilde{\mathbf{y}} \|^2 = \text{Proj}_{\mathcal{M}(\mathbf{X})}(\mathbf{y}) = \mathbf{P}^{\mathbf{X}}\mathbf{y} \]

\(\hat{\mathbf{y}}\) is the orthogonal projection of \(\mathbf{y}\) on \[\mathcal{M}(\mathbf{X}) = \text{span}\{\mathbf{x}_1, \dotsc, \mathbf{x}_p\}\] the linear subspace spanned by the columns of \(\mathbf{X}\).

Geometrical Interpretation

OLS - Projection

\[ \hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{argmin}} \left\{ \sum_{i = 1}^n \left(y_i - \sum_{j = 1}^p \beta_j x_{ij}\right)^2 \right\} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{argmin}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta} \|^2 \]

\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \underset{\tilde{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}\\ \boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{argmin}} \|\mathbf{y} - \tilde{\mathbf{y}} \|^2 = \text{Proj}_{\mathcal{M}(\mathbf{X})}(\mathbf{y}) = \mathbf{P}^{\mathbf{X}}\mathbf{y} \]

\(\mathbf{P}^{\mathbf{X}}\) is the orthogonal projection matrix on \(\mathcal{M}(\mathbf{X})\).

\[ \mathbf{y} = \mathbf{P}^{\mathbf{X}}\mathbf{y} + (\mathbf{I}_n - \mathbf{P}^{\mathbf{X}})\mathbf{y} \]

with \(\mathbf{P}^{\mathbf{X}}\mathbf{y}\) and \((\mathbf{I}_n - \mathbf{P}^{\mathbf{X}})\mathbf{y}\) orthogonal.

OLS - Expression

OLS - Expression

\[ \hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{argmin}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta} \|^2 \quad \text{and} \quad \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{P}^{\mathbf{X}}\mathbf{y} \]

The OLS estimator is given by: \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \]

And the matrix of orthogonal projection on \(\mathcal{M}(\mathbf{X})\) is: \[ \mathbf{P}^{\mathbf{X}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \]
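On the Advertising data, \(\hat{\boldsymbol{\beta}}\) can be computed directly from this formula and compared with lm() (a sketch assuming the ad data frame loaded above; note that lm() actually uses a QR decomposition rather than inverting \(\mathbf{X}^T\mathbf{X}\), which is numerically more stable):

X <- model.matrix(~ TV + radio + newspaper, data = ad) ## n x p design matrix, intercept included
y <- ad$sales
beta_hat <- solve(crossprod(X), crossprod(X, y))       ## solves X^T X beta = X^T y
cbind(beta_hat, coef(lm(sales ~ TV + radio + newspaper, data = ad)))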

OLS - Proof (Geometrical) - 1/2

\(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) is the orthogonal projection of \(\mathbf{y}\) on \(\mathcal{M}(\mathbf{X})\).

It is the only vector such that \(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\) is orthogonal to \(\mathcal{M}(\mathbf{X})\).

I.e. \(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\) is orthogonal to all \(\mathbf{x}_k\): \[ \langle \mathbf{x}_k, \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} \rangle = \mathbf{x}_k^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = 0 \qquad 1 \leq k \leq p \]

i.e. \[ (\mathbf{x}_1 \dotsc \mathbf{x}_p)^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \begin{pmatrix} \mathbf{x}_1^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\\ \vdots\\ \mathbf{x}_p^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \end{pmatrix} = \mathbf{0} \]

and therefore \[ \mathbf{X}^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}. \]

OLS - Proof (Geometrical) - 2/2

\[ \mathbf{X}^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = 0. \]

hence the normal equations: \[ \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y} \]

As \(rg(\mathbf{X}) = p\), \(\mathbf{X}^T\mathbf{X}\) is invertible, and:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]

Then: \[ \hat{\mathbf{y}} = \mathbf{P}^{\mathbf{X}}\mathbf{y} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]

As this equality is true for any \(\mathbf{y}\), we get:

\[ \mathbf{P}^{\mathbf{X}} = \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T. \]
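Continuing the sketch above with the same X and y, the projection matrix \(\mathbf{P}^{\mathbf{X}}\) can be formed explicitly and checked to be symmetric, idempotent, and to send \(\mathbf{y}\) to the fitted values:

P_X <- X %*% solve(crossprod(X)) %*% t(X)  ## n x n "hat" matrix
range(P_X - t(P_X))                        ## ~ 0: symmetric
range(P_X %*% P_X - P_X)                   ## ~ 0: idempotent
fit <- lm(sales ~ TV + radio + newspaper, data = ad)
range(P_X %*% y - fitted(fit))             ## ~ 0: P^X y = y_hat
sum(diag(P_X))                             ## trace = p = 4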

OLS - Proof (Analytical)

The convex quadratic function \(S\) attains its minimum at the point where its gradient vanishes:

\[ S(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta} \|^2 = \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} - 2 \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \mathbf{y}^T \mathbf{y} \]

\[ \nabla S(\hat{\boldsymbol{\beta}}) = \cdots = 0 \]

\[ \nabla S(\hat{\boldsymbol{\beta}}) = 2 \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} - 2 \mathbf{X}^T\mathbf{y} = 0 \]

\[ \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y} \]

Same conclusion as before.

Reminder - Expectation and Variance of random vectors

Expectation and Variance

Let \(\mathbf{z}\) be a random vector of dimension \(n\).

  • The expectation of \(\mathbf{z}\) is an \(n\) vector: \[ \mathbb{E}[\mathbf{z}] = (\mathbb{E}[z_1], \cdots, \mathbb{E}[z_n])^T \]

  • The variance of \(\mathbf{z}\) is an \(n \times n\) matrix: \[ \begin{aligned} \mathbb{Var}[\mathbf{z}] &= \left[\mathbb{Cov}[z_i, z_j]\right]_{1 \leq i, j \leq n}\\ &= \mathbb{E}\left[(\mathbf{z} - \mathbb{E}[\mathbf{z}])(\mathbf{z} - \mathbb{E}[\mathbf{z}])^T\right]\\ &= \mathbb{E}[\mathbf{z}\mathbf{z}^T] - \mathbb{E}[\mathbf{z}]\mathbb{E}[\mathbf{z}]^T \end{aligned} \]

Expectation and Variance - Properties

Let \(\mathbf{A}\) be an \(m \times n\) deterministic matrix, and \(\mathbf{b}\) an \(m\)-vector. Then:

\[ \mathbb{E}[\mathbf{A}\mathbf{z} + \mathbf{b}] = \mathbf{A}\mathbb{E}[\mathbf{z}] + \mathbf{b} \]

\[ \mathbb{Var}[\mathbf{A}\mathbf{z} + \mathbf{b}] = \mathbf{A}\mathbb{Var}[\mathbf{z}]\mathbf{A}^T. \]

Careful: the transpose is on the right (the variance is a matrix, so the order matters).

The OLS Estimator is unbiased

OLS estimator is unbiased

The OLS estimator \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \] is unbiased: \[ \mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}. \]

OLS estimator is unbiased - Proof - 1/3

\[ \mathbb{E}[\hat{\boldsymbol{\beta}}] = \mathbb{E}[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}] = \cdots \]

OLS estimator is unbiased - Proof - 2/3

\[ \mathbb{E}[\hat{\boldsymbol{\beta}}] = \mathbb{E}[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbb{E}[\mathbf{y}] \]

And: \[ \mathbb{E}[\mathbf{y}] = \mathbb{E}[\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}] = \cdots \]

OLS estimator is unbiased - Proof - 3/3

\[ \mathbb{E}[\hat{\boldsymbol{\beta}}] = \mathbb{E}[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbb{E}[\mathbf{y}] \]

And: \[ \mathbb{E}[\mathbf{y}] = \mathbb{E}[\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}] = \mathbf{X}\boldsymbol{\beta} + \mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{X}\boldsymbol{\beta} \]

Hence: \[ \mathbb{E}[\hat{\boldsymbol{\beta}}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}. \]

Variance of the OLS Estimator

Variance of the OLS Estimator

The OLS estimator \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \] has variance: \[ \mathbb{Var}[\hat{\boldsymbol{\beta}}] = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}. \]

Variance of OLS Estimator - Proof - 1/3

\[ \mathbb{Var}[\hat{\boldsymbol{\beta}}] = \mathbb{Var}[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}] = \cdots \]

Variance of OLS Estimator - Proof - 2/3

\[ \begin{aligned} \mathbb{Var}[\hat{\boldsymbol{\beta}}] & = \mathbb{Var}[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}]\\ & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbb{Var}[\mathbf{y}] \left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\right]^T \\ & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbb{Var}[\mathbf{y}] \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1} \end{aligned} \]

And: \[ \mathbb{Var}[\mathbf{y}] = \mathbb{Var}[\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}] = \cdots \]

Variance of OLS Estimator - Proof - 3/3

\[ \begin{aligned} \mathbb{Var}[\hat{\boldsymbol{\beta}}] & = \mathbb{Var}[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}]\\ & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbb{Var}[\mathbf{y}] \left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\right]^T \\ & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbb{Var}[\mathbf{y}] \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1} \end{aligned} \]

And: \[ \begin{aligned} \mathbb{Var}[\mathbf{y}] = \mathbb{Var}[\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}] = \mathbb{Var}[\boldsymbol{\epsilon}] = \sigma^2 \mathbf{I}_n \end{aligned} \]

Hence: \[ \begin{aligned} \mathbb{Var}[\hat{\boldsymbol{\beta}}] & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \left[\sigma^2 \mathbf{I}_n\right] \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1}\\ &= \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{X}^T\mathbf{X}) (\mathbf{X}^T\mathbf{X})^{-1}\\ &= \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}. \end{aligned} \]
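Both results can be checked by simulation. A Monte Carlo sketch with a small made-up design (everything below is illustrative): the empirical mean and covariance of \(\hat{\boldsymbol{\beta}}\) over many simulated error vectors should be close to \(\boldsymbol{\beta}\) and \(\sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}\).

set.seed(42)
n <- 50; sigma <- 2
X <- cbind(1, runif(n, -1, 1))        ## fixed design: intercept + one predictor (p = 2)
beta <- c(1, -3)
XtX_inv <- solve(crossprod(X))
B <- 5000
beta_hats <- replicate(B, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  drop(XtX_inv %*% crossprod(X, y))   ## OLS estimate for this replicate
})
rowMeans(beta_hats)                   ## ~ beta = (1, -3): unbiased
cov(t(beta_hats))                     ## ~ sigma^2 (X^T X)^{-1}
sigma^2 * XtX_inv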

Gauss - Markov Theorem

Partial order on Symmetric matrices

Definition: Let \(\mathbf{S}_1\) and \(\mathbf{S}_2\) be two real \(n\times n\) symmetric matrices.
We say that \(\mathbf{S}_1\) is smaller than \(\mathbf{S}_2\), and write \[\mathbf{S}_1 \leq \mathbf{S}_2\] if and only if \(\mathbf{S}_2 - \mathbf{S}_1\) is positive semi-definite, i.e.: \[ \mathbf{z}^T(\mathbf{S}_2 - \mathbf{S}_1)\mathbf{z} \geq 0 \qquad \forall \mathbf{z} \in \mathbb{R}^n, \]

or, equivalently: \[ \mathbf{z}^T \mathbf{S}_1\mathbf{z} \leq \mathbf{z}^T \mathbf{S}_2\mathbf{z} \qquad \forall \mathbf{z} \in \mathbb{R}^n. \]

Gauss-Markov theorem

The OLS estimator \(\hat{\boldsymbol{\beta}}\) is the BLUE
(Best Linear Unbiased Estimator):
it is the linear unbiased estimator with the minimal variance.

Remark: The OLS estimator \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{A}\mathbf{y} \] is indeed linear, and it is unbiased.

Reminder - Variance of the Sum

Let \(\mathbf{u}\) and \(\mathbf{v}\) be two random vectors of size \(n\). Then: \[ \mathbb{Var}[\mathbf{u} + \mathbf{v}] = \mathbb{Var}[\mathbf{u}] + \mathbb{Var}[\mathbf{v}] + \mathbb{Cov}[\mathbf{u}; \mathbf{v}] + \mathbb{Cov}[\mathbf{v}; \mathbf{u}] \]

where: \[ \mathbb{Cov}[\mathbf{u}; \mathbf{v}] = \mathbb{E}[\mathbf{u}\mathbf{v}^T] - \mathbb{E}[\mathbf{u}]\mathbb{E}[\mathbf{v}]^T = \mathbb{Cov}[\mathbf{v}; \mathbf{u}]^T \]

Gauss-Markov - Proof - 1/7

Let \(\bar{\boldsymbol{\beta}}\) be a linear unbiased estimator of \(\boldsymbol{\beta}\).
Let’s show that \(\mathbb{Var}[\hat{\boldsymbol{\beta}}] \leq \mathbb{Var}[\bar{\boldsymbol{\beta}}]\).

\[ \mathbb{Var}[\bar{\boldsymbol{\beta}}] = \mathbb{Var}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\beta}}] = \cdots \]

Gauss-Markov - Proof - 2/7

Let \(\bar{\boldsymbol{\beta}}\) be a linear unbiased estimator of \(\boldsymbol{\beta}\).
Let’s show that \(\mathbb{Var}[\hat{\boldsymbol{\beta}}] \leq \mathbb{Var}[\bar{\boldsymbol{\beta}}]\).

\[ \begin{aligned} \mathbb{Var}[\bar{\boldsymbol{\beta}}] & = \mathbb{Var}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\beta}}]\\ & = \mathbb{Var}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}] + \mathbb{Var}[\hat{\boldsymbol{\beta}}]\\ & \qquad + \mathbb{Cov}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}] + \mathbb{Cov}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}]^T \end{aligned} \]

i.e. \[ \begin{aligned} \mathbb{Var}[\bar{\boldsymbol{\beta}}] - \mathbb{Var}[\hat{\boldsymbol{\beta}}] & = \mathbb{Var}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}] \\ & \qquad + \mathbb{Cov}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}] + \mathbb{Cov}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}]^T \end{aligned} \]

We need to prove: \(\mathbb{Var}[\bar{\boldsymbol{\beta}}] - \mathbb{Var}[\hat{\boldsymbol{\beta}}]\) is positive semi-definite.

Gauss-Markov - Proof - 3/7

\[ \begin{aligned} \mathbb{Var}[\bar{\boldsymbol{\beta}}] - \mathbb{Var}[\hat{\boldsymbol{\beta}}] & = \mathbb{Var}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}] \\ & \qquad + \mathbb{Cov}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}] + \mathbb{Cov}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}]^T \end{aligned} \]

We need to prove: \(\mathbb{Var}[\bar{\boldsymbol{\beta}}] - \mathbb{Var}[\hat{\boldsymbol{\beta}}]\) is positive semi-definite.

As \(\mathbb{Var}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}]\) is positive semi-definite, we just need to prove: \[ \mathbb{Cov}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}] = 0. \]

Gauss-Markov - Proof - 4/7

\(\bar{\boldsymbol{\beta}}\) is a linear unbiased estimator of \(\boldsymbol{\beta}\).
Since it is linear, it can be written as: \[ \bar{\boldsymbol{\beta}} = \mathbf{B} \mathbf{y} \] for some non random \(p \times n\) matrix \(\mathbf{B}\),

and, since it is unbiased: \[ \begin{aligned} \mathbb{E}[\bar{\boldsymbol{\beta}}] &= \mathbb{E}[\mathbf{B} \mathbf{y}] = \mathbf{B} \mathbb{E}[\mathbf{y}] = \mathbf{B} \mathbb{E}[\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}] = \mathbf{B} \mathbf{X}\boldsymbol{\beta}\\ &= \boldsymbol{\beta} \end{aligned} \] for all \(\boldsymbol{\beta}\), hence: \[ \mathbf{B} \mathbf{X} = \mathbf{I}_p. \]

Gauss-Markov - Proof - 5/7

Let’s show that: \[ \mathbb{Cov}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}] = 0. \]

\[ \begin{aligned} \mathbb{Cov}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}] & = \mathbb{Cov}[\bar{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}] - \mathbb{Var}[\hat{\boldsymbol{\beta}}]\\ & = \cdots \end{aligned} \]

Gauss-Markov - Proof - 6/7

\[ \begin{aligned} \mathbb{Cov}[\bar{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}] & = \mathbb{E}[\bar{\boldsymbol{\beta}} \hat{\boldsymbol{\beta}}^T] - \mathbb{E}[\bar{\boldsymbol{\beta}}]\mathbb{E}[\hat{\boldsymbol{\beta}}]^T \\ & = \mathbb{E}[\mathbf{B} \mathbf{y} [(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}]^T] - \boldsymbol{\beta}\boldsymbol{\beta}^T \\ & = \mathbf{B} \mathbb{E}[\mathbf{y}\mathbf{y}^T]\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} - \boldsymbol{\beta}\boldsymbol{\beta}^T \\ \end{aligned} \]

and: \[ \mathbb{E}[\mathbf{y}\mathbf{y}^T] = \mathbb{Var}[\mathbf{y}] + \mathbb{E}[\mathbf{y}]\mathbb{E}[\mathbf{y}]^T = \sigma^2 \mathbf{I}_n + \mathbf{X}\boldsymbol{\beta}\boldsymbol{\beta}^T\mathbf{X}^T \]

hence: \[ \begin{aligned} \mathbb{Cov}[\bar{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}] = \sigma^2 \mathbf{B}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{B}\mathbf{X}\boldsymbol{\beta}\boldsymbol{\beta}^T\mathbf{X}^T \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} - \boldsymbol{\beta}\boldsymbol{\beta}^T \\ \end{aligned} \]

as \(\mathbf{B} \mathbf{X} = \mathbf{I}_p\): \[ \begin{aligned} \mathbb{Cov}[\bar{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}] = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} + \boldsymbol{\beta}\boldsymbol{\beta}^T - \boldsymbol{\beta}\boldsymbol{\beta}^T = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \end{aligned} \]

Gauss-Markov - Proof - 7/7

Finally: \[ \begin{aligned} \mathbb{Cov}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}] & = \mathbb{Cov}[\bar{\boldsymbol{\beta}}; \hat{\boldsymbol{\beta}}] - \mathbb{Var}[\hat{\boldsymbol{\beta}}]\\ & = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} - \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}\\ &= 0. \end{aligned} \]

and: \[ \mathbb{Var}[\bar{\boldsymbol{\beta}}] - \mathbb{Var}[\hat{\boldsymbol{\beta}}] = \mathbb{Var}[\bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}] \]

is positive semi-definite, i.e.: \[ \mathbb{Var}[\bar{\boldsymbol{\beta}}] \geq \mathbb{Var}[\hat{\boldsymbol{\beta}}]. \] This ends the proof.

Residuals

Geometrical Interpretation

Residuals - Definitions

\[ \begin{aligned} \hat{\boldsymbol{\epsilon}} & = \mathbf{y} - \hat{\mathbf{y}} \\ & = (\mathbf{I}_n - \mathbf{P}^{\mathbf{X}})\mathbf{y}\\ & = \mathbf{P}^{\mathbf{X}^\bot}\mathbf{y}\\ & = \mathbf{P}^{\mathbf{X}^\bot}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon})\\ & = \mathbf{P}^{\mathbf{X}^\bot}\boldsymbol{\epsilon} \end{aligned} \]
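A quick check in R on the Advertising fit (a sketch): the residual vector returned by residuals() equals \(\mathbf{y} - \hat{\mathbf{y}}\) and is orthogonal to every column of \(\mathbf{X}\), i.e. to \(\mathcal{M}(\mathbf{X})\), up to floating-point noise.

fit <- lm(sales ~ TV + radio + newspaper, data = ad)
eps_hat <- residuals(fit)
range(eps_hat - (ad$sales - fitted(fit)))  ## ~ 0: eps_hat = y - y_hat
crossprod(model.matrix(fit), eps_hat)      ## ~ 0: orthogonal to the columns of X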

Residuals - Bias and Variance

\[ \begin{aligned} \mathbb{E}[\hat{\boldsymbol{\epsilon}}] & = \mathbf{0} \\ \mathbb{Var}[\hat{\boldsymbol{\epsilon}}] & = \sigma^2 \mathbf{P}^{\mathbf{X}^\bot} \end{aligned} \]

Residuals - Bias and Variance - Proof

Bias: \[ \mathbb{E}[\hat{\boldsymbol{\epsilon}}] = \mathbb{E}[\mathbf{P}^{\mathbf{X}^\bot}\boldsymbol{\epsilon}] = \cdots \]

\[ \mathbb{E}[\hat{\boldsymbol{\epsilon}}] = \mathbf{P}^{\mathbf{X}^\bot}\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}. \]

Variance: \[ \mathbb{Var}[\hat{\boldsymbol{\epsilon}}] = \mathbb{Var}[\mathbf{P}^{\mathbf{X}^\bot}\boldsymbol{\epsilon}] = \cdots \]

\[ \mathbb{Var}[\hat{\boldsymbol{\epsilon}}] = \mathbf{P}^{\mathbf{X}^\bot}\mathbb{Var}[\boldsymbol{\epsilon}][\mathbf{P}^{\mathbf{X}^\bot}]^T = \sigma^2 \mathbf{P}^{\mathbf{X}^\bot}[\mathbf{P}^{\mathbf{X}^\bot}]^T = \cdots \]

\[ \mathbb{Var}[\hat{\boldsymbol{\epsilon}}] = \sigma^2 \mathbf{P}^{\mathbf{X}^\bot}\mathbf{P}^{\mathbf{X}^\bot} = \sigma^2 \mathbf{P}^{\mathbf{X}^\bot}. \]

\(\hat{\mathbf{y}}\) - Bias and Variance

\[ \begin{aligned} \mathbb{E}[\hat{\mathbf{y}}] & = \mathbf{X}\boldsymbol{\beta} \\ \mathbb{Var}[\hat{\mathbf{y}}] & = \sigma^2 \mathbf{P}^{\mathbf{X}} \end{aligned} \]

\(\hat{\mathbf{y}}\) - Bias and Variance - Proof

Bias: \[ \mathbb{E}[\hat{\mathbf{y}}] = \mathbb{E}[\mathbf{X}\hat{\boldsymbol{\beta}}] = \cdots \]

\[ \mathbb{E}[\hat{\mathbf{y}}] = \mathbf{X}\mathbb{E}[\hat{\boldsymbol{\beta}}] = \mathbf{X}\boldsymbol{\beta} \]

Variance: \[ \mathbb{Var}[\hat{\mathbf{y}}] = \mathbb{Var}[\mathbf{X}\hat{\boldsymbol{\beta}}] = \cdots \]

\[ \mathbb{Var}[\hat{\mathbf{y}}] = \mathbf{X}\mathbb{Var}[\hat{\boldsymbol{\beta}}]\mathbf{X}^T = \mathbf{X}[\sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}]\mathbf{X}^T = \cdots \]

\[ \mathbb{Var}[\hat{\mathbf{y}}] = \sigma^2 \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T = \sigma^2 \mathbf{P}^{\mathbf{X}}. \]

Covariance of \(\hat{\boldsymbol{\epsilon}}\) and \(\hat{\mathbf{y}}\)

\[ \mathbb{Cov}[\hat{\boldsymbol{\epsilon}}; \hat{\mathbf{y}}] = \mathbf{0}. \]

Covariance of \(\hat{\boldsymbol{\epsilon}}\) and \(\hat{\mathbf{y}}\) - Proof

\[ \mathbb{Cov}[\hat{\boldsymbol{\epsilon}}; \hat{\mathbf{y}}] = \mathbb{Cov}[\hat{\boldsymbol{\epsilon}}; \mathbf{y} - \hat{\boldsymbol{\epsilon}}] = \cdots \]

\[ \mathbb{Cov}[\hat{\boldsymbol{\epsilon}}; \hat{\mathbf{y}}] = \mathbb{Cov}[\hat{\boldsymbol{\epsilon}}; \mathbf{y}] - \mathbb{Var}[\hat{\boldsymbol{\epsilon}}] = \cdots \]

\[ \mathbb{Cov}[\hat{\boldsymbol{\epsilon}}; \hat{\mathbf{y}}] = \mathbb{Cov}[\mathbf{P}^{\mathbf{X}^\bot}\mathbf{y}; \mathbf{y}] - \sigma^2\mathbf{P}^{\mathbf{X}^\bot} = \cdots \]

\[ \mathbb{Cov}[\hat{\boldsymbol{\epsilon}}; \hat{\mathbf{y}}] = \mathbf{P}^{\mathbf{X}^\bot}\mathbb{Var}[\mathbf{y}] - \sigma^2\mathbf{P}^{\mathbf{X}^\bot} = \cdots \]

\[ \mathbb{Cov}[\hat{\boldsymbol{\epsilon}}; \hat{\mathbf{y}}] = \mathbf{P}^{\mathbf{X}^\bot}[\sigma^2\mathbf{I}_n] - \sigma^2\mathbf{P}^{\mathbf{X}^\bot} = \cdots \]

\[ \mathbb{Cov}[\hat{\boldsymbol{\epsilon}}; \hat{\mathbf{y}}] = \sigma^2 \mathbf{P}^{\mathbf{X}^\bot}- \sigma^2\mathbf{P}^{\mathbf{X}^\bot} = \mathbf{0}. \]

Variance Estimation

Unbiased Variance Estimator

\[ \hat{\sigma}^2 = \frac{1}{n - p} \| \hat{\epsilon} \|^2 = \frac{1}{n - p} RSS \]

is an unbiased estimator of \(\sigma^2\).

Note: p parameters \(\to\) n - p degrees of freedom.
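This is the squared “residual standard error” reported by summary.lm(). A quick check on the Advertising fit (a sketch; the value ~1.686 matches the summary output shown later):

fit <- lm(sales ~ TV + radio + newspaper, data = ad)
RSS <- sum(residuals(fit)^2)
sigma2_hat <- RSS / df.residual(fit)  ## RSS / (n - p)
sqrt(sigma2_hat)                      ## ~ 1.686: residual standard error
summary(fit)$sigma                    ## same value, as computed by R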

Unbiased Var Estimator - Proof - 1/2

\[ \mathbb{E}[\hat{\sigma}^2] = \frac{1}{n - p} \mathbb{E}[\| \hat{\epsilon} \|^2] = \cdots \]

Magic trick: \[ \| \hat{\epsilon} \|^2 = \hat{\epsilon}^T \hat{\epsilon} = Tr(\hat{\epsilon}^T \hat{\epsilon}) = Tr(\hat{\epsilon} \hat{\epsilon}^T) \]

Unbiased Var Estimator - Proof - 2/2

We get: \[ \begin{aligned} \mathbb{E}[\| \hat{\epsilon} \|^2] &= \mathbb{E}[Tr(\hat{\epsilon} \hat{\epsilon}^T)] \\ &= Tr(\mathbb{E}[\hat{\epsilon} \hat{\epsilon}^T]) & [\text{Trace is linear}]\\ &= Tr(\mathbb{Var}[\hat{\epsilon}]) & [\mathbb{E}[\hat{\epsilon}] = 0] \\ &= Tr(\sigma^2 \mathbf{P}^{\mathbf{X}^\bot}) \\ &= \sigma^2 Tr(\mathbf{P}^{\mathbf{X}^\bot}) \\ &= \sigma^2 (n - p) \end{aligned} \]

as \(\mathbf{P}^{\mathbf{X}^\bot}\) is the projection matrix on a space of dimension \(n-p\).

Prediction

Predict a new point

  • We fitted \(\hat{\beta}_1, \dots, \hat{\beta}_p\) on \(((y_1, \mathbf{x}^1), \dotsc, (y_n, \mathbf{x}^n))\)
  • A new row of predictors \(\mathbf{x}^{n+1}\) comes along. How can we predict \(y_{n+1}\)?
  • We use the same model: \[ y_{n+1} = \mathbf{x}^{n+1}\boldsymbol{\beta} + \epsilon_{n+1} \] with \(\mathbb{E}[\epsilon_{n+1}] = 0\), \(\mathbb{Var}[\epsilon_{n+1}] = \sigma^2\) and \(\mathbb{Cov}[\epsilon_{n+1}; \epsilon_i] = 0\) for all \(1 \leq i \leq n\).
  • We predict \(y_{n+1}\) with: \[ \hat{y}_{n+1} = \mathbf{x}^{n+1}\hat{\boldsymbol{\beta}} \]
  • Question: What is the error \(\hat{\epsilon}_{n+1} = y_{n+1} - \hat{y}_{n+1}\)?

Prediction Error

The prediction error \(\hat{\epsilon}_{n+1} = y_{n+1} - \hat{y}_{n+1}\) is such that: \[ \begin{aligned} \mathbb{E}[\hat{\epsilon}_{n+1}] &= 0\\ \mathbb{Var}[\hat{\epsilon}_{n+1}] &= \sigma^2 \left(1 + \mathbf{x}^{n+1} (\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{n+1})^T\right)\\ \end{aligned} \]

Remarks:

  • \(\mathbf{x}^{n+1}\) is a row vector of dimension \(1 \times p\).

  • \(\mathbf{X}\) is a matrix of dimension \(n \times p\).

  • \(\mathbf{X}^T\mathbf{X}\) is a matrix of dimension \(p \times p\).

  • \(\mathbf{x}^{n+1} (\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{x}^{n+1})^T\) is a scalar.
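A sketch of this formula in R, for a hypothetical new budget allocation (the values below are made up): compute the prediction standard error by hand and compare with the prediction interval returned by predict(), whose half-width is the 97.5% t quantile times this standard error.

fit <- lm(sales ~ TV + radio + newspaper, data = ad)
X <- model.matrix(fit)
sigma2_hat <- sum(residuals(fit)^2) / df.residual(fit)
new_data <- data.frame(TV = 100, radio = 20, newspaper = 30)  ## hypothetical budgets
x_new <- c(1, 100, 20, 30)                                    ## matching row of the design
se_pred <- drop(sqrt(sigma2_hat * (1 + t(x_new) %*% solve(crossprod(X)) %*% x_new)))
predict(fit, newdata = new_data, interval = "prediction")
predict(fit, newdata = new_data) + c(-1, 1) * qt(0.975, df.residual(fit)) * se_pred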

Prediction Error - Proof 1/2

\[ \begin{aligned} \mathbb{E}[\hat{\epsilon}_{n+1}] &= \mathbb{E}[y_{n+1} - \mathbf{x}^{n+1}\hat{\boldsymbol{\beta}}]\\ &= \cdots \end{aligned} \]

\[ \begin{aligned} \mathbb{E}[\hat{\epsilon}_{n+1}] &= \mathbb{E}[y_{n+1} - \mathbf{x}^{n+1}\hat{\boldsymbol{\beta}}]\\ &= \mathbb{E}[y_{n+1}] - \mathbf{x}^{n+1} \mathbb{E}[\hat{\boldsymbol{\beta}}]\\ &= \mathbf{x}^{n+1}\boldsymbol{\beta} - \mathbf{x}^{n+1}\boldsymbol{\beta}\\ &= 0 \end{aligned} \]

Prediction Error - Proof 2/2

Because \(\hat{y}_{n+1}\) depends only on \(\epsilon_1, \dotsc, \epsilon_n\), it is uncorrelated with \(y_{n+1}\), so:

\[ \begin{aligned} \mathbb{Var}[\hat{\epsilon}_{n+1}] &= \mathbb{Var}[y_{n+1} - \hat{y}_{n+1}]\\ &= \mathbb{Var}[y_{n+1}] + \mathbb{Var}[\hat{y}_{n+1}]\\ \end{aligned} \]

Hence: \[ \begin{aligned} \mathbb{Var}[\hat{\epsilon}_{n+1}] &= \sigma^2 + \mathbb{Var}[\mathbf{x}^{n+1}\hat{\boldsymbol{\beta}}]\\ &= \sigma^2 + \mathbf{x}^{n+1} \mathbb{Var}[\hat{\boldsymbol{\beta}}](\mathbf{x}^{n+1})^T\\ &= \sigma^2 + \mathbf{x}^{n+1} \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{x}^{n+1})^T\\ \end{aligned} \]

Geometrical Interpretation

Variance Decomposition

Assuming \(\mathbb{1}\) is in the model:

Using Pythagoras' theorem (\(\hat{\boldsymbol{\epsilon}} = \mathbf{y} - \hat{\mathbf{y}}\) is orthogonal to \(\hat{\mathbf{y}} - \bar{y} \mathbb{1}\), which lies in \(\mathcal{M}(\mathbf{X})\)):

\[ \begin{aligned} \|\mathbf{y} - \bar{y} \mathbb{1}\|^2 &= \|\hat{\mathbf{y}} - \bar{y} \mathbb{1} + \mathbf{y} - \hat{\mathbf{y}}\|^2 = \|\hat{\mathbf{y}} - \bar{y} \mathbb{1} + \hat{\boldsymbol{\epsilon}}\|^2 \\ &= \|\hat{\mathbf{y}} - \bar{y} \mathbb{1}\|^2 + \|\hat{\boldsymbol{\epsilon}}\|^2 \end{aligned} \]

Variance Decomposition

\[ \begin{aligned} &\|\mathbf{y} - \bar{y} \mathbb{1}\|^2 &=&&& \|\hat{\mathbf{y}} - \bar{y} \mathbb{1}\|^2 &&+& \|\hat{\boldsymbol{\epsilon}}\|^2 \\ &TSS &=&&& ESS &&+& RSS \end{aligned} \]

  • TSS: Total Sum of Squares
    \(\to\) Amount of variability in the response \(\mathbf{y}\) before the regression is performed.
  • RSS: Residual Sum of Squares
    \(\to\) Amount of variability that is left after performing the regression.
  • ESS: Explained Sum of Squares
    \(\to\) Amount of variability that is explained by the regression.
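A numerical check of the decomposition on the Advertising fit (a sketch; the model contains an intercept, so the identity applies):

fit <- lm(sales ~ TV + radio + newspaper, data = ad)
y <- ad$sales
TSS <- sum((y - mean(y))^2)
RSS <- sum((y - fitted(fit))^2)
ESS <- sum((fitted(fit) - mean(y))^2)
TSS - (ESS + RSS)  ## ~ 0: TSS = ESS + RSS
ESS / TSS          ## ~ 0.8972: the multiple R-squared of the fit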

\(R^2\) Statistic

\[ R^2 = \frac{ESS}{TSS} = \frac{\|\hat{\mathbf{y}} - \bar{y} \mathbb{1}\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2} = 1 - \frac{\|\hat{\epsilon}\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2} = 1 - \frac{RSS}{TSS} \]

  • \(R^2\) is the proportion of variability in \(\mathbf{y}\) that can be explained by the regression.
  • \(0 \leq R^2 \leq 1\)
  • \(R^2\) = 1: the regression is perfect.
    \(\to\) \(\mathbf{y} = \hat{\mathbf{y}}\) and \(\mathbf{y}\) is in \(\mathcal{M}(\mathbf{X})\).
  • \(R^2\) = 0: the regression is useless.
    \(\to\) \(\hat{\mathbf{y}} = \bar{y}\mathbb{1}\), the empirical mean is sufficient.

\(R^2\) Statistic

\[ R^2 = \frac{ESS}{TSS} = \frac{\|\hat{\mathbf{y}} - \bar{y} \mathbb{1}\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2} = \cos^2(\theta) \] where \(\theta\) is the angle between \(\mathbf{y} - \bar{y} \mathbb{1}\) and its projection \(\hat{\mathbf{y}} - \bar{y} \mathbb{1}\) onto \(\mathcal{M}(\mathbf{X})\).

\(R^2\) Statistic

\[ \begin{aligned} TSS &= \|\mathbf{y} - \bar{y} \mathbb{1}\|^2 = \sum_{i = 1}^n (y_i - \bar{y})^2 & \text{"total inertia"}\\ RSS &= \|\mathbf{y} - \hat{\mathbf{y}}\|^2 = \sum_{i = 1}^n (y_i - \hat{y}_i)^2 & \text{"intra-class inertia"}\\ ESS &= \|\hat{\mathbf{y}} - \bar{y} \mathbb{1}\|^2 = \sum_{i = 1}^n (\hat{y}_i - \bar{y})^2 & \text{"inter-class inertia"} \end{aligned} \] \[ ~ \]

\[ R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} \]

\(R^2\) Statistic - Issue - 1/3

\[ y_i = -2 + 3 x_{i1} - x_{i2} + \epsilon_i \]

set.seed(12890926)

## Predictors
n <- 100
x_1 <- runif(n, min = -2, max = 2)
x_2 <- runif(n, min = 0, max = 4)

## Noise
eps <- rnorm(n, mean = 0, sd = 5)

## Model sim
beta_0 <- -2; beta_1 <- 3; beta_2 <- -1
y_sim <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

\(R^2\) Statistic - Issue - 2/3

\[ y_i = -2 + 3 x_{i1} - x_{i2} + \epsilon_i \]

fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)$r.squared
## [1] 0.3337178
## unrelated noise variable x_3
x_3 <- runif(n, min = -4, max = 0)
## Fit
fit2 <- lm(y_sim ~ x_1 + x_2 + x_3)
summary(fit2)$r.squared
## [1] 0.3404782

The more predictors we add, the larger \(R^2\) becomes!

\(R^2\) Statistic - Issue - 3/3

\[ R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\|\hat{\epsilon}\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2} = 1 - \frac{\|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2} \]

If \(\mathbf{X}' = (\mathbf{X} ~ \mathbf{x}_{p+1})\) has one more column, then: \[ \begin{aligned} \|\mathbf{y} - \mathbf{X}'\hat{\boldsymbol{\beta}}'\|^2 & = \underset{\boldsymbol{\beta}' \in \mathbb{R}^{p+1}}{\operatorname{min}} \|\mathbf{y} - \mathbf{X}'\boldsymbol{\beta}' \|^2\\ & = \underset{\boldsymbol{\beta} \in \mathbb{R}^p\\ \beta_{p+1} \in \mathbb{R}}{\operatorname{min}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{x}_{p+1}\beta_{p+1} \|^2\\ & \leq \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\operatorname{min}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 \end{aligned} \]

\[ \text{Hence:}\quad \|\mathbf{y} - \mathbf{X}'\hat{\boldsymbol{\beta}}'\|^2 \leq \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 \]

Adjusted \(R^2\) Statistic

\[ \begin{aligned} R_a^2 &= 1 - \frac{RSS / (n-p)}{TSS / (n-1)} = 1 - \frac{(n-1) RSS}{(n-p) TSS} \\ &= 1 - \frac{n-1}{n-p} (1 - R^2) \end{aligned} \]

  • Penalizes for the number of predictors \(p\).

  • Note that \(p\) includes the intercept (it is the rank of the matrix \(\mathbf{X}\)); see the check below.
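As a check (a sketch on the Advertising fit, where \(n = 200\) and \(p = 4\) with the intercept):

fit <- lm(sales ~ TV + radio + newspaper, data = ad)
n <- nrow(ad); p <- length(coef(fit))  ## p = 4, intercept included
R2 <- summary(fit)$r.squared
1 - (n - 1) / (n - p) * (1 - R2)       ## ~ 0.8956: adjusted R-squared
summary(fit)$adj.r.squared             ## same value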

Adjusted \(R^2\) Statistic

\[ y_i = -2 + 3 x_{i1} - x_{i2} + \epsilon_i \]

fit <- lm(y_sim ~ x_1 + x_2)
summary(fit)$adj.r.squared
## [1] 0.31998
## unrelated noise variable x_3
x_3 <- runif(n, min = -4, max = 0)
## Fit
fit2 <- lm(y_sim ~ x_1 + x_2 + x_3)
summary(fit2)$adj.r.squared
## [1] 0.3133534

Advertising

Advertising

fit_all <- lm(sales ~ TV + radio + newspaper, data = ad)
summary(fit_all)
## 
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

To be continued

Questions

  • Confidence interval for \((\hat{\boldsymbol{\beta}}, \hat{\sigma}^2)\)?

  • Can we test \(\boldsymbol{\beta} = 0\) (i.e. no linear trend)?

  • Assumptions on the moments are not enough.
    \(\hookrightarrow\) We need assumptions on the specific distribution of the \(\epsilon_i\).
  • Most common assumption: \(\epsilon_i\) are Gaussian.