23/01/2024
Lecture notes (polycopié) by Arnaud Guyader
https://perso.lpsm.paris/~aguyader/files/teaching/Regression.pdf
An Introduction to Statistical Learning,
G. James, D. Witten, T. Hastie and R. Tibshirani
https://statlearning.com/
Moodle
https://moodle.umontpellier.fr/course/view.php?id=25368
The slides are in slides
Summary cheat sheets are in cheatsheets
Deployed at https://hax814x.netlify.app
R Markdown (RStudio): https://rmarkdown.rstudio.com/
Markdown for the text, R code in chunks, combined in an R Markdown document.
The Maunga Whau (Mt Eden) volcano, Auckland, New Zealand
filled.contour(volcano, color.palette = terrain.colors, asp = 1)
library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")
head(ad)
##      TV radio newspaper sales
## 1 230.1  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
## 4 151.5  41.3      58.5  18.5
## 5 180.8  10.8      58.4  12.9
## 6   8.7  48.9      75.0   7.2
TV, radio, newspaper: advertising budgets (thousands of $)
sales: number of sales (thousands of units)

attach(ad)
par(mfrow = c(1, 3))
plot(TV, sales); plot(radio, sales); plot(newspaper, sales)
plot(TV, sales)
Question: Can we write \[ y_i \approx \beta_0 + \beta_1 x_i \;?\]
plot(TV, sales)
abline(a = 15, b = 0, col = "blue")

plot(TV, sales)
abline(a = 5, b = 0.07, col = "blue")

plot(TV, sales)
abline(a = 7, b = 0.05, col = "blue")
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \forall 1 \leq i \leq n\]
\[ \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \beta_0 \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} + \beta_1 \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \]
\[\mathbf{y} = \beta_0 \mathbb{1} + \beta_1 \mathbf{x} + \boldsymbol{\epsilon}\]
The OLS estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are given by:
\[ (\hat{\beta}_0, \hat{\beta}_1) = \underset{(\beta_0, \beta_1) \in \mathbb{R}^2}{\operatorname{argmin}} \left\{ \sum_{i = 1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \right\}\] \(~\)
Goal: Minimize the squared errors between the observed values \(y_i\) and the values \(\beta_0 + \beta_1 x_i\) predicted by the line.
\[RSS = \sum_{i = 1}^n (y_i - \beta_0 - \beta_1 x_i)^2 = 5417.14875\]
\[RSS = \sum_{i = 1}^n (y_i - \beta_0 - \beta_1 x_i)^2 = 6232.7659\]
\[RSS = \sum_{i = 1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = 2102.5305831\]
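A quick way to compare such candidate lines (a sketch, assuming the Advertising data ad loaded above; rss() is an ad-hoc helper, not part of the original code):

## RSS of a candidate line (beta_0, beta_1) on the Advertising data
rss <- function(beta_0, beta_1, x = ad$TV, y = ad$sales) {
  sum((y - beta_0 - beta_1 * x)^2)
}
## RSS of the three candidate lines plotted above
rss(15, 0)
rss(5, 0.07)
rss(7, 0.05)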
Goal: Minimize \(f(\beta_0, \beta_1) = \sum_{i = 1}^n (y_i - \beta_0 - \beta_1 x_i)^2\).
\[ \begin{aligned} \frac{\partial f(\hat{\beta}_0, \hat{\beta}_1)}{\partial \beta_0} &= \cdots &= 0 \\ \frac{\partial f(\hat{\beta}_0, \hat{\beta}_1)}{\partial \beta_1} &= \cdots &= 0 \\ \end{aligned} \]
Goal: Minimize \(f(\beta_0, \beta_1) = \sum_{i = 1}^n (y_i - \beta_0 - \beta_1 x_i)^2\).
\[ \begin{aligned} \frac{\partial f(\hat{\beta}_0, \hat{\beta}_1)}{\partial \beta_0} &= -2 \sum_{i = 1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) &= 0 \\ \frac{\partial f(\hat{\beta}_0, \hat{\beta}_1)}{\partial \beta_1} &= -2 \sum_{i = 1}^n x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) &= 0 \\ \end{aligned} \]
First equation gives:
\[ \begin{aligned} 0 &= -2 \sum_{i = 1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) \\ n \hat{\beta}_0 &= \sum_{i = 1}^n y_i - \hat{\beta}_1 \sum_{i = 1}^n x_i \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \\ \end{aligned} \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \rightarrow \bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x} \]
First equation gives:
\[ \begin{aligned} \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \\ \end{aligned} \]
Second equation gives:
\[ \begin{aligned} -2 \sum_{i = 1}^n x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) &= 0 \\ \hat{\beta}_0 \sum_{i = 1}^n x_i + \hat{\beta}_1 \sum_{i = 1}^n x_i^2 &= \sum_{i = 1}^n x_iy_i \\ \hat{\beta}_1 \left(\sum_{i = 1}^n x_i^2 - \bar{x} \sum_{i = 1}^n x_i \right) &= - \bar{y}\sum_{i = 1}^n x_i + \sum_{i = 1}^n x_iy_i \\ \end{aligned} \]
First equation gives:
\[ \begin{aligned} \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \\ \end{aligned} \]
Second equation gives:
\[ \begin{aligned} \hat{\beta}_1 &= \frac{ \sum_{i = 1}^n x_iy_i - \sum_{i = 1}^n x_i\bar{y}}{\sum_{i = 1}^n x_i^2 - \sum_{i = 1}^n x_i\bar{x}} = \frac{ \sum_{i = 1}^n x_i(y_i - \bar{y})}{\sum_{i = 1}^n x_i(x_i - \bar{x})} \\ \hat{\beta}_1 &= \frac{ \sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n (x_i -\bar{x})(x_i - \bar{x})} \\ \end{aligned} \]
because \[ \sum_{i = 1}^n (y_i - \bar{y}) = 0 = \sum_{i = 1}^n (x_i - \bar{x}) \]
\[ (\hat{\beta}_0, \hat{\beta}_1) = \underset{(\beta_0, \beta_1) \in \mathbb{R}^2}{\operatorname{argmin}} \left\{ \sum_{i = 1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \right\} \]
Closed form expressions: \[ \hat{\beta}_1 = \frac{\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n (x_i - \bar{x})^2} = \frac{s_{\mathbf{y},\mathbf{x}}^2}{s_{\mathbf{x}}^2} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} \]
With \[ s_{\mathbf{x}}^2 = \frac{1}{n}\sum_{i = 1}^n (x_i - \bar{x})^2 \qquad s_{\mathbf{y},\mathbf{x}}^2 = \frac{1}{n}\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y}) \] the empirical variance and covariance of \(\mathbf{x}\) and \(\mathbf{y}\).
\[ \hat{\beta}_1 = \frac{\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n (x_i - \bar{x})^2} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} \]
\(\hat{\beta}_0 + \hat{\beta}_1 \bar{x} = \bar{y}\)
The OLS regression line passes through the center of gravity \((\bar{x}, \bar{y})\) of the data.
\(\hat{\beta}_0\) and \(\hat{\beta}_1\) are linear in the data \(\mathbf{y} = (y_1, \cdots, y_n)^T\).
\(\hat{\beta}_1 = \frac{s_{\mathbf{y},\mathbf{x}}^2}{s_{\mathbf{x}}^2}\).
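As a sanity check (not in the slides; variable names are ad-hoc), the closed-form expressions can be applied to the Advertising data and compared with R's built-in lm():

x <- ad$TV; y <- ad$sales
beta_1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0_hat <- mean(y) - beta_1_hat * mean(x)
c(beta_0_hat, beta_1_hat)
## lm() minimizes the same RSS and should return the same coefficients
coef(lm(sales ~ TV, data = ad))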
Simulate according to the model: \[\mathbf{y} = 2 \cdot \mathbb{1} + 3 \cdot \mathbf{x} + \boldsymbol{\epsilon}\]
## Set the seed (Quia Sapientia)
set.seed(12890926)
## Number of samples
n <- 100
## vector of x
x_test <- runif(n, -2, 2)
## coefficients
beta_0 <- 2; beta_1 <- 3
## epsilon
error_test <- rnorm(n, mean = 0, sd = 10)
## y = 2 + 3 * x + epsilon
y_test <- beta_0 + beta_1 * x_test + error_test
Find the OLS regression line:
\[ \hat{\beta}_1 = \frac{s_{\mathbf{y},\mathbf{x}}^2}{s_{\mathbf{x}}^2} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} \]
## var() with two arguments returns the empirical covariance
## (the 1/(n-1) factors cancel in the ratio)
beta_hat_1 <- var(y_test, x_test) / var(x_test)
beta_hat_0 <- mean(y_test) - beta_hat_1 * mean(x_test)
## Dataset
plot(x_test, y_test, pch = 16, cex = 0.7)
## Ideal line
abline(a = beta_0, b = beta_1, col = "red", lwd = 2)
## Regression line
abline(a = beta_hat_0, b = beta_hat_1, col = "lightblue", lwd = 2)
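For comparison (a quick check, not in the original code), lm() on the simulated data recovers the same hand-computed estimates:

## Hand-computed OLS estimates vs lm() on the simulated data
c(beta_hat_0, beta_hat_1)
coef(lm(y_test ~ x_test))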
\(~\)
\[ \mathbb{E}[\hat{\beta}_0] = \beta_0 \qquad \mathbb{E}[\hat{\beta}_1] = \beta_1 \]
Valid as long as the residuals are centered: \(\mathbb{E}[\epsilon_i] = 0\) for all \(i\).
\[ \mathbb{E}[\hat{\beta}_1] = \mathbb{E}\left[\frac{\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n (x_i - \bar{x})^2}\right] = \cdots \\ \]
\[ \mathbb{E}[\hat{\beta}_1] = \mathbb{E}\left[\frac{\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n (x_i - \bar{x})^2}\right] = \frac{\sum_{i = 1}^n (x_i - \bar{x})\mathbb{E}[(y_i - \bar{y})]}{\sum_{i = 1}^n (x_i - \bar{x})^2} \\ \]
According to our model: \[y_i = \beta_0 + \beta_1 x_i + \epsilon_i \quad \& \quad \mathbb{E}[\epsilon_i] = 0\]
so: \[ \mathbb{E}[y_i] = \beta_0 + \beta_1 x_i \\ \mathbb{E}[y_i - \bar{y}] = \beta_0 + \beta_1 x_i - (\beta_0 + \beta_1 \bar{x}) = \beta_1(x_i - \bar{x}) \]
and: \[ \mathbb{E}[\hat{\beta}_1] = \frac{\sum_{i = 1}^n (x_i - \bar{x})\beta_1(x_i - \bar{x})}{\sum_{i = 1}^n (x_i - \bar{x})^2} = \beta_1 \]
\[ \begin{aligned} \mathbb{E}[\hat{\beta}_0] &= \mathbb{E}\left[\bar{y} - \hat{\beta}_1\bar{x}\right]\\ &= \cdots \end{aligned} \]
\[ \begin{aligned} \mathbb{E}[\hat{\beta}_0] &= \mathbb{E}\left[\bar{y} - \hat{\beta}_1\bar{x}\right]\\ &= \mathbb{E}[\bar{y}] - \mathbb{E}[\hat{\beta}_1]\bar{x}\\ &= \beta_0 + \beta_1 \bar{x} - \beta_1 \bar{x}\\ &= \beta_0 \end{aligned} \]
\[ \begin{aligned} \hat{\beta}_1 &= \frac{1}{n s_{\mathbf{x}}^2}\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})\\ &= \frac{1}{n s_{\mathbf{x}}^2}\sum_{i = 1}^n (x_i - \bar{x})(\beta_0 + \beta_1x_i + \epsilon_i - [\beta_0 + \beta_1\bar{x}+\overline{\epsilon}])\\ &= \beta_1 + \frac{1}{n s_{\mathbf{x}}^2} \sum_{i = 1}^n (x_i - \bar{x})\epsilon_i \end{aligned} \]
This expression is theoretical (the \(\epsilon_i\) are not observed), but it makes it easy to prove that \[\mathbb{E}[\hat{\beta}_1] = \beta_1.\]
\[ \mathbb{Var}[\hat{\beta}_0] = \frac{\sigma^2}{n} \left( 1 + \frac{\bar{x}^2}{s_{\mathbf{x}}^2} \right) \]
\[ \mathbb{Var}[\hat{\beta}_1] = \frac{\sigma^2}{n}\frac{1}{s_{\mathbf{x}}^2} \]
Valid as long as the residuals are centered, have common variance \(\sigma^2\) [H2], and are uncorrelated [H3].
\[ \begin{aligned} \mathbb{Var}[\hat{\beta}_1] &= \mathbb{Var}\left[\beta_1 + \frac{1}{n s_{\mathbf{x}}^2} \sum_{i = 1}^n (x_i - \bar{x})\epsilon_i\right] \\ &= \cdots \end{aligned} \]
\[ \begin{aligned} \mathbb{Var}[\hat{\beta}_1] &= \mathbb{Var}\left[\beta_1 + \frac{1}{n s_{\mathbf{x}}^2} \sum_{i = 1}^n (x_i - \bar{x})\epsilon_i\right] \\ &= \frac{1}{(n s_{\mathbf{x}}^2)^2} \sum_{i = 1}^n (x_i - \bar{x})^2\mathbb{Var}[\epsilon_i] & [H3]\\ &= \frac{n s_{\mathbf{x}}^2}{(n s_{\mathbf{x}}^2)^2} \sigma^2 & [H2] \\ &= \frac{\sigma^2}{n s_{\mathbf{x}}^2} \end{aligned} \]
\[ \begin{aligned} \mathbb{Var}[\hat{\beta}_0] &= \mathbb{Var}[\bar{y} - \hat{\beta}_1\bar{x}] = \cdots \end{aligned} \]
Careful: \(\bar{y}\) and \(\hat{\beta}_1\) might be correlated (they involve the same \(\epsilon_i\)).
\[ \begin{aligned} \mathbb{Cov}[\bar{y}; \hat{\beta}_1] &= \cdots \end{aligned} \]
\[ \begin{aligned} \mathbb{Cov}[\bar{y}; \hat{\beta}_1] &= \mathbb{Cov}\left[\beta_0 + \beta_1\bar{x} + \bar{\epsilon}; \beta_1 + \frac{1}{n s_{\mathbf{x}}^2} \sum_{i = 1}^n (x_i - \bar{x})\epsilon_i\right] \\ &= \cdots \end{aligned} \]
\[ \begin{aligned} \mathbb{Cov}[\bar{y}; \hat{\beta}_1] &= \mathbb{Cov}\left[\beta_0 + \beta_1\bar{x} + \bar{\epsilon}; \beta_1 + \frac{1}{n s_{\mathbf{x}}^2} \sum_{i = 1}^n (x_i - \bar{x})\epsilon_i\right] \\ &= \mathbb{Cov}\left[\frac{1}{n}\sum_{i = 1}^n\epsilon_i; \frac{1}{n s_{\mathbf{x}}^2} \sum_{i = 1}^n (x_i - \bar{x})\epsilon_i\right] \\ &= \frac{1}{n^2 s_{\mathbf{x}}^2} \sum_{i = 1}^n \sum_{j = 1}^n \mathbb{Cov}\left[\epsilon_i; (x_j - \bar{x})\epsilon_j\right] \\ &= \frac{1}{n^2 s_{\mathbf{x}}^2} \sum_{i = 1}^n \mathbb{Cov}\left[\epsilon_i; (x_i - \bar{x})\epsilon_i\right] \\ &= \frac{1}{n^2 s_{\mathbf{x}}^2} \sum_{i = 1}^n (x_i - \bar{x}) \sigma^2 = 0 \end{aligned} \]
Since \(\mathbb{Cov}[\bar{y}; \hat{\beta}_1] = 0\):
\[ \begin{aligned} \mathbb{Var}[\hat{\beta}_0] &= \mathbb{Var}[\bar{y} - \hat{\beta}_1\bar{x}] = \mathbb{Var}[\bar{y}] + \mathbb{Var}[\hat{\beta}_1\bar{x}]\\ &= \cdots \end{aligned} \]
Since \(\mathbb{Cov}[\bar{y}; \hat{\beta}_1] = 0\):
\[ \begin{aligned} \mathbb{Var}[\hat{\beta}_0] &= \mathbb{Var}[\bar{y} - \hat{\beta}_1\bar{x}] = \mathbb{Var}[\bar{y}] + \mathbb{Var}[\hat{\beta}_1\bar{x}]\\ &= \frac{1}{n^2}\mathbb{Var}\left[\sum_{i=1}^n\epsilon_i\right] + \bar{x}^2\mathbb{Var}[\hat{\beta}_1]\\ &= \frac{1}{n}\sigma^2 + \frac{\bar{x}^2 \sigma^2}{n s_{\mathbf{x}}^2} \\ &= \frac{\sigma^2}{n} \left(1 + \frac{\bar{x}^2}{s_{\mathbf{x}}^2}\right) \end{aligned} \]
\[ \mathbb{Cov}[\hat{\beta}_0; \hat{\beta}_1] = - \frac{\sigma^2}{n} \frac{\bar{x}}{s_{\mathbf{x}}^2} \]
Valid as long as the residuals are centered, have common variance \(\sigma^2\) [H2], and are uncorrelated [H3].
\[ \begin{aligned} \mathbb{Cov}[\hat{\beta}_0; \hat{\beta}_1] & = \mathbb{Cov}[\bar{y} - \hat{\beta}_1\bar{x}; \hat{\beta}_1] = \cdots \end{aligned} \]
\[ \begin{aligned} \mathbb{Cov}[\hat{\beta}_0; \hat{\beta}_1] &= \mathbb{Cov}[\bar{y} - \hat{\beta}_1\bar{x}; \hat{\beta}_1]\\ &= \mathbb{Cov}[\bar{y}; \hat{\beta}_1] - \bar{x}\mathbb{Cov}[\hat{\beta}_1; \hat{\beta}_1]\\ &= 0 - \bar{x}\frac{\sigma^2}{n s_{\mathbf{x}}^2}\\ &= - \frac{\sigma^2 \bar{x}}{n s_{\mathbf{x}}^2} \end{aligned} \]
\[ \mathbb{Var}[\hat{\beta}_0] = \frac{\sigma^2}{n} \left( 1 + \frac{\bar{x}^2}{s_{\mathbf{x}}^2} \right) \quad \mathbb{Var}[\hat{\beta}_1] = \frac{\sigma^2}{n}\frac{1}{s_{\mathbf{x}}^2} \]
\[ \mathbb{Cov}[\hat{\beta}_0; \hat{\beta}_1] = - \frac{\sigma^2}{n} \frac{\bar{x}}{s_{\mathbf{x}}^2} \]
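These formulas can be checked by Monte Carlo simulation (a sketch reusing x_test, n, beta_0, beta_1 and sd = 10 from the simulated example above; the seed and the number of replicates are arbitrary):

## Monte Carlo check of Var[beta_1_hat] = sigma^2 / (n * s_x^2)
set.seed(42)
n_rep <- 10000
sigma <- 10
s2_x <- mean((x_test - mean(x_test))^2)   # empirical variance s_x^2 (factor 1/n)
beta_1_sims <- replicate(n_rep, {
  y_sim <- beta_0 + beta_1 * x_test + rnorm(n, sd = sigma)
  sum((x_test - mean(x_test)) * (y_sim - mean(y_sim))) / (n * s2_x)
})
var(beta_1_sims)       # empirical variance across replicates
sigma^2 / (n * s2_x)   # theoretical value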
The OLS estimator is the BLUE
(Best Linear Unbiased Estimator):
Among all unbiased estimators that are linear in \(\mathbf{y}\), the OLS estimators are the ones that have the smallest variance.
\[ \hat{\beta}_1 = \sum_{i = 1}^n p_i y_i \quad \text{ with } \quad p_i = \frac{x_i - \bar{x}}{n s_{\mathbf{x}}^2} \]
\[ \begin{aligned} \mathbb{E}[\tilde{\beta}_1] &= \mathbb{E}\left[\sum_{i = 1}^n q_i y_i\right] = \cdots = \beta_1 & \text{(unbiased)} \end{aligned} \]
\[ \begin{aligned} \mathbb{E}[\tilde{\beta}_1] &= \sum_{i = 1}^n \mathbb{E}[q_i y_i] \\ &= \sum_{i = 1}^n \mathbb{E}[q_i (\beta_0 + \beta_1 x_i + \epsilon_i)] \\ \beta_1 &= \beta_0 \sum_{i = 1}^n q_i + \beta_1 \sum_{i = 1}^n q_ix_i & \text{(unbiased)} \end{aligned} \]
\[ \hat{\beta}_1 = \sum_{i = 1}^n p_i y_i \quad \text{ with } \quad p_i = \frac{x_i - \bar{x}}{n s_{\mathbf{x}}^2} \]
Let \(\tilde{\beta}_1\) be another linear unbiased estimator.
\(\tilde{\beta}_1\) is linear: \(\tilde{\beta}_1 = \sum_{i = 1}^n q_i y_i\)
We have \(\sum_{i=1}^n q_i = 0\) and \(\sum_{i=1}^n q_ix_i = 1\)
\[ \begin{aligned} \mathbb{Cov}[\tilde{\beta}_1 - \hat{\beta}_1; \hat{\beta}_1] &= \mathbb{Cov}[\tilde{\beta}_1; \hat{\beta}_1] - \mathbb{Var}[\hat{\beta}_1] \\ &= \cdots \end{aligned} \]
\[ \begin{aligned} \mathbb{Cov}[\tilde{\beta}_1 - \hat{\beta}_1; \hat{\beta}_1] &= \mathbb{Cov}[\tilde{\beta}_1; \hat{\beta}_1] - \mathbb{Var}[\hat{\beta}_1] \\ &= \sum_{i=1}^n p_i q_i \sigma^2 - \frac{\sigma^2}{n s_{\mathbf{x}}^2} \\ &=\frac{\sigma^2}{n s_{\mathbf{x}}^2}\left(\sum_{i=1}^nq_ix_i - \sum_{i=1}^nq_i\bar{x} - 1\right) \\ &= 0 \end{aligned} \]
Hence \(\mathbb{Var}[\tilde{\beta}_1] = \mathbb{Var}[\hat{\beta}_1 + (\tilde{\beta}_1 - \hat{\beta}_1)] = \mathbb{Var}[\hat{\beta}_1] + \mathbb{Var}[\tilde{\beta}_1 - \hat{\beta}_1] \geq \mathbb{Var}[\hat{\beta}_1]\).
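To illustrate the BLUE property by simulation (not in the slides): compare the OLS slope with another linear unbiased estimator, here the naive slope built from the two observations with the smallest and largest \(x_i\). Both are unbiased, but the OLS slope has the smaller variance:

## OLS slope vs "two extreme points" slope, over repeated simulations
set.seed(2024)
i_min <- which.min(x_test); i_max <- which.max(x_test)
sims <- replicate(10000, {
  y_sim <- beta_0 + beta_1 * x_test + rnorm(n, sd = 10)
  ols   <- cov(x_test, y_sim) / var(x_test)
  naive <- (y_sim[i_max] - y_sim[i_min]) / (x_test[i_max] - x_test[i_min])
  c(ols = ols, naive = naive)
})
rowMeans(sims)        # both means close to beta_1 = 3 (unbiased)
apply(sims, 1, var)   # the OLS variance is the smaller one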
\(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i\)
\(\hat{\epsilon}_i = y_i - \hat{y}_i\)
\[ \begin{aligned} \sum_{i = 1}^n \hat{\epsilon}_i &= \sum_{i = 1}^n (y_i - \hat{y}_i) \\ &= \cdots \\ &= 0 \end{aligned} \]
\[ \begin{aligned} \sum_{i = 1}^n \hat{\epsilon}_i &= \sum_{i = 1}^n (y_i - \hat{y}_i)\\ &= \sum_{i = 1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) & [\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}]\\ &= \sum_{i = 1}^n \left((y_i - \bar{y}) - \hat{\beta}_1(x_i - \bar{x})\right) \\ &= 0 \end{aligned} \]
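A quick numerical check on the Advertising data (fit is introduced here for illustration; it is not defined in the slides):

fit <- lm(sales ~ TV, data = ad)
sum(residuals(fit))   # numerically zero, up to floating-point error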
\[ \hat{\sigma}^2 = \frac{1}{n-2} \sum_{i = 1}^n \hat{\epsilon}_i^2 = \frac{1}{n-2} \cdot RSS \]
is an unbiased estimator of the variance \(\sigma^2\).
We estimate 2 parameters \((\beta_0, \beta_1)\), hence \(n - 2\) degrees of freedom.
\[ \hat{\sigma}^2 = \frac{1}{n-2} \cdot RSS = \frac{1}{n-2} \sum_{i = 1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = 10.62 \]
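The same value can be recomputed from the residuals (reusing the illustrative fit defined above; summary() reports the residual standard error \(\hat\sigma\)):

n_obs <- nrow(ad)
sigma2_hat <- sum(residuals(fit)^2) / (n_obs - 2)
sigma2_hat              # approx. 10.62
summary(fit)$sigma^2    # same quantity: squared residual standard error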
\[ \begin{aligned} \mathbb{E}\left[\sum_{i=1}^n\hat{\epsilon}_i^2\right] &= \sum_{i=1}^n\mathbb{E}\left[\hat{\epsilon}_i^2\right] = \sum_{i=1}^n\mathbb{Var}\left[\hat{\epsilon}_i\right] \end{aligned} \]
\(~\)
\[ \begin{aligned} \mathbb{Var}[\hat{\epsilon}_i] &= \mathbb{Var}[y_i - \hat{y}_i] \\ &= \mathbb{Var}[\beta_0 + \beta_1 x_i + \epsilon_i - \hat{\beta}_0 - \hat{\beta}_1 x_i]\\ &= \cdots \end{aligned} \]
\[ \begin{aligned} \mathbb{Var}[\hat{\epsilon}_i] &= \mathbb{Var}[y_i - \hat{y}_i] \\ &= \mathbb{Var}[\beta_0 + \beta_1 x_i + \epsilon_i - \hat{\beta}_0 - \hat{\beta}_1 x_i]\\ &= \mathbb{Var}[\epsilon_i - \hat{\beta}_0 - \hat{\beta}_1 x_i]\\ &= \mathbb{Var}[\epsilon_i] + \mathbb{Var}[\hat{\beta}_0 + \hat{\beta}_1 x_i] - 2\mathbb{Cov}[\epsilon_i; \hat{\beta}_0 + \hat{\beta}_1 x_i] \end{aligned} \]
\[ \begin{aligned} \mathbb{Var}[\hat{\epsilon}_i] &= \mathbb{Var}[\epsilon_i] + \mathbb{Var}[\hat{\beta}_0 + \hat{\beta}_1 x_i] - 2\mathbb{Cov}[\epsilon_i; \hat{\beta}_0 + \hat{\beta}_1 x_i] \end{aligned} \] \(~\)
\[ \begin{aligned} \mathbb{Var}[\hat{\beta}_0 + \hat{\beta}_1 x_i] &= \mathbb{Var}[\bar{y} + \hat{\beta}_1 (x_i - \bar{x})] \qquad~~ \{\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\}\\ &= \mathbb{Var}[\bar{y}] + \mathbb{Var}[\hat{\beta}_1 (x_i - \bar{x})] ~~ \{\mathbb{Cov}[\bar{y},\hat{\beta}_1] = 0\}\\ &= \frac{\sigma^2}{n} + (x_i - \bar{x})^2\frac{\sigma^2}{n}\frac{1}{s_{\mathbf{x}}^2} \end{aligned} \] \(~\)
\[ \begin{aligned} \mathbb{Cov}[\hat{\beta}_0 + \hat{\beta}_1 x_i; \epsilon_i] &= \mathbb{Cov}[\bar{y}; \epsilon_i] + \mathbb{Cov}[\hat{\beta}_1 (x_i - \bar{x}); \epsilon_i]\\ &= \frac{\sigma^2}{n} + (x_i - \bar{x})\frac{1}{ns_{\mathbf{x}}^2}(x_i - \bar{x})\sigma^2 \end{aligned} \]
\[ \begin{aligned} \mathbb{Var}[\hat{\epsilon}_i] &= \mathbb{Var}[\epsilon_i] + \mathbb{Var}[\hat{\beta}_0 + \hat{\beta}_1 x_i] - 2\mathbb{Cov}[\epsilon_i; \hat{\beta}_0 + \hat{\beta}_1 x_i]\\ &= \sigma^2 - \frac{\sigma^2}{n} - \frac{\sigma^2 (x_i - \bar{x})^2}{ns_{\mathbf{x}}^2} \end{aligned} \]
\[ \begin{aligned} \mathbb{E}\left[\sum_{i=1}^n\hat{\epsilon}_i^2\right] &= \sum_{i=1}^n\mathbb{Var}\left[\hat{\epsilon}_i\right] = \sum_{i=1}^n \left(\sigma^2 - \frac{\sigma^2}{n} - \frac{\sigma^2 (x_i - \bar{x})^2}{ns_{\mathbf{x}}^2}\right)\\ &= n\sigma^2 - \sigma^2 - \sigma^2 = (n-2)\sigma^2 \end{aligned} \]
\(~\)
\[ \mathbb{E}\left[\hat{\sigma}^2\right] = \mathbb{E}\left[\frac{1}{n-2}\sum_{i=1}^n\hat{\epsilon}_i^2\right] = \sigma^2 \]
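A simulation-based sanity check (reusing x_test, n, beta_0, beta_1 and sd = 10 from the simulated example; seed and number of replicates are arbitrary):

## Average of sigma^2_hat over many simulated datasets vs true sigma^2 = 100
set.seed(123)
sigma2_hats <- replicate(10000, {
  y_sim <- beta_0 + beta_1 * x_test + rnorm(n, sd = 10)
  sum(residuals(lm(y_sim ~ x_test))^2) / (n - 2)
})
mean(sigma2_hats)   # should be close to 100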
The prediction error \(\hat{\epsilon}_{n+1} = y_{n+1} - \hat{y}_{n+1}\) is such that: \[ \begin{aligned} \mathbb{E}[\hat{\epsilon}_{n+1}] &= 0\\ \mathbb{Var}[\hat{\epsilon}_{n+1}] &= \sigma^2 \left(1 + \frac{1}{n} + \frac{1}{ns_{\mathbf{x}}^2} (x_{n+1} - \bar{x})^2\right)\\ \end{aligned} \]
\[ \begin{aligned} \mathbb{E}[\hat{\epsilon}_{n+1}] &= \mathbb{E}[y_{n+1} - \hat{\beta}_0 - \hat{\beta}_1 x_{n+1}]\\ &= \cdots \end{aligned} \]
\[ \begin{aligned} \mathbb{E}[\hat{\epsilon}_{n+1}] &= \mathbb{E}[y_{n+1} - \hat{\beta}_0 - \hat{\beta}_1 x_{n+1}]\\ &= \mathbb{E}[y_{n+1}] - \mathbb{E}[\hat{\beta}_0] - \mathbb{E}[\hat{\beta}_1] x_{n+1}\\ &= \beta_0 + \beta_1 x_{n+1} - \beta_0 - \beta_1 x_{n+1} \\ &= 0 \end{aligned} \]
Because \(\hat{y}_{n+1}\) does not depend on \(\epsilon_{n+1}\):
\[ \begin{aligned} \mathbb{Var}[\hat{\epsilon}_{n+1}] &= \mathbb{Var}[y_{n+1} - \hat{y}_{n+1}]\\ &= \mathbb{Var}[y_{n+1}] + \mathbb{Var}[\hat{y}_{n+1}]\\ \end{aligned} \]
Hence: \[ \begin{aligned} \mathbb{Var}[\hat{\epsilon}_{n+1}] &= \sigma^2 + \mathbb{Var}[\hat{\beta}_0 + \hat{\beta}_1 x_{n+1}]\\ &= \sigma^2 + \frac{\sigma^2}{n} + \frac{\sigma^2}{ns_{\mathbf{x}}^2} (x_{n+1} - \bar{x})^2\\ \end{aligned} \]
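In R, this prediction variance is what underlies the prediction intervals returned by predict() (an illustration reusing the illustrative fit from above, with a hypothetical new TV budget; the interval itself also relies on a Gaussian assumption on the errors):

new_data <- data.frame(TV = 100)   # hypothetical new budget (thousands of $)
predict(fit, newdata = new_data, interval = "prediction", level = 0.95)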
\[ \begin{aligned} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} &= \beta_0 \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} &&+ \beta_1 \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} &&+ \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \\ ~ \\ \mathbf{y} ~~~\; &= \beta_0 ~~~\mathbb{1} &&+ \beta_1 ~~~\; \mathbf{x} &&+ ~~~ \boldsymbol{\epsilon} \end{aligned} \]
\[ \begin{aligned} \text{Proj}_{\mathcal{M}(\mathbf{x})}\mathbf{y} &= \underset{\tilde{\mathbf{y}} \in \mathcal{M}(\mathbf{x})}{\operatorname{argmin}}\{\|\mathbf{y} -\tilde{\mathbf{y}}\|^2 \}\\ &= \underset{\substack{\tilde{\mathbf{y}} = \beta_0 \mathbb{1} + \beta_1 \mathbf{x}\\(\beta_0, \beta_1) \in \mathbb{R}^2}}{\operatorname{argmin}}\{\|\mathbf{y} - (\beta_0 \mathbb{1} + \beta_1 \mathbf{x})\|^2 \} \\ &= \underset{\substack{\tilde{\mathbf{y}} = \beta_0 \mathbb{1} + \beta_1 \mathbf{x}\\(\beta_0, \beta_1) \in \mathbb{R}^2}}{\operatorname{argmin}}\{\sum_{i = 1}^n \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2 \} \\ &= \hat{\mathbf{y}} \\ \end{aligned} \]
\[ \hat{\sigma}^2 = \frac{1}{n-2} \cdot RSS = \frac{1}{n-2} \sum_{i = 1}^n \hat{\epsilon}_i^2 = \frac{1}{n-2} \|\hat{\epsilon}\|^2 \]
Using Pythagoras’s theorem:
\[ \begin{aligned} \|\mathbf{y} - \bar{y} \mathbb{1}\|^2 &= \|\hat{\mathbf{y}} - \bar{y} \mathbb{1} + \mathbf{y} - \hat{\mathbf{y}}\|^2 \\ &= \|\hat{\mathbf{y}} - \bar{y} \mathbb{1} + \hat{\boldsymbol{\epsilon}}\|^2 \\ &= \|\hat{\mathbf{y}} - \bar{y} \mathbb{1}\|^2 + \|\hat{\boldsymbol{\epsilon}}\|^2 \end{aligned} \]
\[ \begin{aligned} &\|\mathbf{y} - \bar{y} \mathbb{1}\|^2 &=&&& \|\hat{\mathbf{y}} - \bar{y} \mathbb{1}\|^2 &&+& \|\hat{\boldsymbol{\epsilon}}\|^2 \\ &TSS &=&&& ESS &&+& RSS \end{aligned} \]
\[ R^2 = \frac{ESS}{TSS} = \frac{\|\hat{\mathbf{y}} - \bar{y} \mathbb{1}\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2} = 1 - \frac{\|\hat{\epsilon}\|^2}{\|\mathbf{y} - \bar{y} \mathbb{1}\|^2} = 1 - \frac{RSS}{TSS} \]
\[ \hat{\sigma}^2 = \frac{1}{n-2} \cdot RSS = 10.62 \quad R^2 = 1 - \frac{RSS}{TSS} = 0.61 \]
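These values can be reproduced from the fit (a sketch reusing the illustrative fit defined above):

rss_hat <- sum(residuals(fit)^2)
tss     <- sum((ad$sales - mean(ad$sales))^2)
c(sigma2_hat = rss_hat / (nrow(ad) - 2), R2 = 1 - rss_hat / tss)
summary(fit)$r.squared   # same R^2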
Confidence intervals for \((\beta_0, \beta_1, \sigma^2)\)?
Can we test \(\beta_1 = 0\) (i.e. no linear trend)?