24/01/2024

What we know

Model

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \forall 1 \leq i \leq n\]

  • \(y_i\): quantitative response for \(i\) (sales)
  • \(x_i\): quantitative predicting variable for \(i\) (TV)
  • \(\epsilon_i\): “error” for \(i\)
    • random variable
    • (H1) \(\mathbb{E}[\epsilon_i] = 0\) for all \(i\) (centered)
    • (H2) \(\mathbb{Var}[\epsilon_i] = \sigma^2\) for all \(i\) (identical variance)
    • (H3) \(\mathbb{Cov}[\epsilon_i; \epsilon_j] = 0\) for all \(i \neq j\) (uncorrelated)
  • Note: Assumptions on first and second moments only.

OLS Estimators

\[ (\hat{\beta}_0, \hat{\beta}_1) = \underset{(\beta_0, \beta_1) \in \mathbb{R}^2}{\operatorname{argmin}} \left\{ \sum_{i = 1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \right\}\]

  • Properties on the first and second moments only.

\[ \begin{aligned} \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1\bar{x} & \hat{\beta}_1 &= \frac{s_{\mathbf{y},\mathbf{x}}^2}{s_{\mathbf{x}}^2} \\ \mathbb{E}[\hat{\beta}_0] &= \beta_0 & \mathbb{E}[\hat{\beta}_1] &= \beta_1 \\ \mathbb{Var}[\hat{\beta}_0] &= \frac{\sigma^2}{n} \left( 1 + \frac{\bar{x}^2}{s_{\mathbf{x}}^2} \right) & \mathbb{Var}[\hat{\beta}_1] &= \frac{\sigma^2}{n}\frac{1}{s_{\mathbf{x}}^2} \\ \mathbb{Cov}[\hat{\beta}_0; \hat{\beta}_1] &= - \frac{\sigma^2}{n} \frac{\bar{x}}{s_{\mathbf{x}}^2} \end{aligned} \]
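As a quick sanity check, here is a minimal R sketch (with made-up vectors x and y; any numeric vectors of equal length would do) computing the closed-form estimates and comparing them with lm():

# Closed-form OLS estimates vs lm() (hypothetical data)
x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3)
y <- c(2.0, 4.1, 5.2, 7.9, 8.1, 10.4)
beta1_hat <- cov(x, y) / var(x)          # s_{y,x} / s_x^2 (the 1/(n-1) factors cancel)
beta0_hat <- mean(y) - beta1_hat * mean(x)
c(beta0_hat, beta1_hat)
coef(lm(y ~ x))                          # should match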

Questions

  • Confidence intervals for \(\beta_0\) and \(\beta_1\)?

  • Can we test \(\beta_1 = 0\) (i.e. no linear trend)?

  • Assumptions on the moments are not enough.
    \(\hookrightarrow\) We need assumptions on the specific distribution of the \(\epsilon_i\).
  • Most common assumption: \(\epsilon_i\) are Gaussian.

Gaussian Model

Gaussian Model

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \forall 1 \leq i \leq n\]

  • \(y_i\): quantitative response for \(i\) (sales)
  • \(x_i\): quantitative predicting variable for \(i\) (TV)
  • \(\epsilon_i\): “error” for \(i\)
    • random variable (r.v.)
    • independent, identically distributed (iid), Gaussian
    • Centered, with variance \(\sigma^2\):

\[ \epsilon_i \underset{\text{iid}}{\operatorname{\sim}} \mathcal{N}(0, \sigma^2) \quad \text{for} \quad 1 \leq i \leq n \]

  • [Note: (H1), (H2) and (H3) still hold.]

Reminder: Gaussian Distribution 1/2

Let \(X\) be a Gaussian r.v. with expectation \(\mu\) and variance \(\sigma^2\): \[ X \sim \mathcal{N}(\mu, \sigma^2). \]

It admits a probability density function (pdf): \[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \quad \forall x \in \mathbb{R}, \]

And a characteristic function: \[ \phi_{X}(t) = \mathbb{E}[e^{itX}] = e^{i\mu t - \frac{\sigma^2 t^2}{2}} \quad \forall t \in \mathbb{R}. \]

Reminder: Gaussian Distribution 2/2

Distribution of \(\mathbf{y}\)

  • Model: \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \quad \text{with} \quad \epsilon_i \underset{\text{iid}}{\operatorname{\sim}} \mathcal{N}(0, \sigma^2) \quad \text{for} \quad 1 \leq i \leq n \]
  • Distribution of \(y_i\)?

Reminder:
Linear function of a Gaussian

If \(X \sim \mathcal{N}(\mu, \sigma^2)\), then for any \((a, b) \in \mathbb{R}^2\),
\[ Y = aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2). \]

  • Proof: Exercise.
    [Hint: use the characteristic function.]
  • \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\)
    Apply the property with \(a = 1\) and \(b = \beta_0 + \beta_1 x_i\).
    [Reminder: \(x_i\) are assumed to be non random here.]

Distribution of \(\mathbf{y}\)

  • Model: \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \quad \text{with} \quad \epsilon_i \underset{\text{iid}}{\operatorname{\sim}} \mathcal{N}(0, \sigma^2) \quad \text{for} \quad 1 \leq i \leq n \]

  • Distribution of \(y_i\): \[ y_i \underset{\text{iid}}{\operatorname{\sim}} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2) \quad \text{for} \quad 1 \leq i \leq n \]

  • Question: What is the likelihood of \(\mathbf{y} = (y_1, \dotsc, y_n)^T\)?

Maximum Likelihood Estimators

Reminder: Likelihood

If \(X\) is a rv that admits a density \(f_{\theta}\) that depends on a parameter \(\theta\), then the likelihood function is the density, seen as a function of \(\theta\) given an outcome \(X = x\): \[ \theta \mapsto L(\theta | x) = f_{\theta}(x). \]

Example:
Assume that we observe one realization \(x_{obs}\) of a Gaussian random variable \(X\), with unknown expectation \(\mu\), but known variance \(\sigma^2 = 1\).
Then the likelihood of the observation \(x_{obs}\) is the function: \[ \mu \mapsto \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_{obs} - \mu)^2}{2\sigma^2}\right). \]

Reminder: Maximum Likelihood 1/2

The Maximum Likelihood (ML) estimator \(\hat{\theta}\) of the parameter \(\theta\) is the one that maximizes the likelihood function with respect to \(\theta\), given an observation \(x\):

\[ \hat{\theta} = \underset{\theta}{\operatorname{argmax}} L(\theta | x). \] Example:
Let \(x_{obs}\) be one realization of a Gaussian r.v. \(X\), with unknown expectation \(\mu\), but known variance \(\sigma^2 = 1\).
Then the ML estimator \(\hat{\mu}\) of \(\mu\) is: \[ \hat{\mu} = \underset{\mu \in \mathbb{R}}{\operatorname{argmax}} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_{obs} - \mu)^2}{2\sigma^2}\right). \]
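In this example the maximizer is simply \(\hat{\mu} = x_{obs}\); a minimal numerical check in R (the value of \(x_{obs}\) below is hypothetical):

# Numerical ML for the Gaussian-mean example (sigma^2 = 1 known)
x_obs <- 1.7                                        # hypothetical observation
loglik <- function(mu) dnorm(x_obs, mean = mu, sd = 1, log = TRUE)
optimize(loglik, interval = c(-10, 10), maximum = TRUE)$maximum   # approx. x_obs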

Reminder: Maximum Likelihood 2/2


Likelihood of \(\mathbf{y}\)

Distribution of \(y_i\): \[ y_i \underset{\text{iid}}{\operatorname{\sim}} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2) \quad \text{for} \quad 1 \leq i \leq n \]

Likelihood of \(\mathbf{y} = (y_1, \dotsc, y_n)^T\):

\[ \begin{aligned} &L(\beta_0, \beta_1, \sigma^2 | y_1, \dotsc, y_n) = \prod_{i = 1}^n L(\beta_0, \beta_1, \sigma^2 | y_i) & [ind.]\\ \end{aligned} \]

and \[ \begin{aligned} L(\beta_0, \beta_1, \sigma^2 | y_i) = \cdots \end{aligned} \]

Likelihood of \(\mathbf{y}\)

Distribution of \(y_i\): \[ y_i \underset{\text{iid}}{\operatorname{\sim}} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2) \quad \text{for} \quad 1 \leq i \leq n \]

Likelihood of \(\mathbf{y} = (y_1, \dotsc, y_n)^T\):

\[ \begin{aligned} &L(\beta_0, \beta_1, \sigma^2 | y_1, \dotsc, y_n) = \prod_{i = 1}^n L(\beta_0, \beta_1, \sigma^2 | y_i) & [ind.]\\ \end{aligned} \]

and \[ \begin{aligned} L(\beta_0, \beta_1, \sigma^2 | y_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right) \end{aligned} \]

Likelihood of \(\mathbf{y}\)

  • Likelihood of \(\mathbf{y} = (y_1, \dotsc, y_n)^T\): \[ \begin{aligned} &L(\beta_0, \beta_1, \sigma^2 | y_1, \dotsc, y_n) = \prod_{i = 1}^n L(\beta_0, \beta_1, \sigma^2 | y_i)\\ &\qquad = \prod_{i = 1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right)\\ &\qquad = \frac{1}{\left(\sqrt{2\pi\sigma^2}\right)^n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i = 1}^n (y_i - \beta_0 - \beta_1 x_i)^2\right)\\ &\log L(\beta_0, \beta_1, \sigma^2 | y_1, \dotsc, y_n) = \cdots \end{aligned} \]

Log-Likelihood of \(\mathbf{y}\)

\[ \begin{aligned} &\log L(\beta_0, \beta_1, \sigma^2 | y_1, \dotsc, y_n) = \\ &\qquad - \frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i = 1}^n (y_i - \beta_0 - \beta_1 x_i)^2\\ \end{aligned} \]

  • Question: What are the Maximum Likelihood estimators of \(\beta_0\), \(\beta_1\) and \(\sigma^2\)?
  • We need to find: \[ (\hat{\beta}_0^{ML}, \hat{\beta}_1^{ML}, \hat{\sigma}^2_{ML}) = \underset{(\beta_0, \beta_1, \sigma^2) \in \mathbb{R}^2\times\mathbb{R}_+^*}{\operatorname{argmax}} \log L(\beta_0, \beta_1, \sigma^2 | \mathbf{y}) \]
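Before solving this analytically, here is a minimal numerical sketch (on hypothetical simulated data) checking that a generic optimizer applied to the log-likelihood recovers the OLS estimates:

# Numerical maximization of the log-likelihood (simulated, hypothetical data)
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- -2 + 3 * x + rnorm(n, sd = 2)
negloglik <- function(par) {             # par = (beta0, beta1, log(sigma2))
  sigma2 <- exp(par[3])                  # log-parametrization keeps sigma2 > 0
  -sum(dnorm(y, mean = par[1] + par[2] * x, sd = sqrt(sigma2), log = TRUE))
}
fit_ml <- optim(c(mean(y), 0, log(var(y))), negloglik, control = list(maxit = 2000))
fit_ml$par[1:2]                          # close to coef(lm(y ~ x))
exp(fit_ml$par[3])                       # ML estimate of sigma^2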

ML Estimators - \(\beta_0\) and \(\beta_1\)

\[ \begin{aligned} &\log L(\beta_0, \beta_1, \sigma^2 | \mathbf{y}) = \\ &\qquad - \frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i = 1}^n (y_i - \beta_0 - \beta_1 x_i)^2\\ \end{aligned} \]

ML Estimators - \(\beta_0\) and \(\beta_1\)

\[ \begin{aligned} &\log L(\beta_0, \beta_1, \sigma^2 | \mathbf{y}) = \\ &\qquad - \frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\underbrace{\sum_{i = 1}^n (y_i - \beta_0 - \beta_1 x_i)^2}_{\text{Sum of Squares } SS(\beta_0, \beta_1)}\\ \end{aligned} \]

  • For any \(\sigma^2 > 0\), \[ \begin{aligned} (\hat{\beta}_0^{ML}, \hat{\beta}_1^{ML}) &= \underset{(\beta_0, \beta_1) \in \mathbb{R}^2}{\operatorname{argmax}} \log L(\beta_0, \beta_1, \sigma^2 | \mathbf{y})\\ &= \underset{(\beta_0, \beta_1) \in \mathbb{R}^2}{\operatorname{argmin}} SS(\beta_0, \beta_1)\\ &= (\hat{\beta}_0^{OLS}, \hat{\beta}_1^{OLS}) \end{aligned} \]

ML Estimators - \(\beta_0\) and \(\beta_1\)

\[ \begin{aligned} &\log L(\beta_0, \beta_1, \sigma^2 | \mathbf{y}) = \\ &\qquad - \frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\underbrace{\sum_{i = 1}^n (y_i - \beta_0 - \beta_1 x_i)^2}_{\text{Sum of Squares } SS(\beta_0, \beta_1)}\\ \end{aligned} \]

\[ (\hat{\beta}_0^{ML}, \hat{\beta}_1^{ML}) = (\hat{\beta}_0^{OLS}, \hat{\beta}_1^{OLS}) \]

\(\hookrightarrow\) The ML estimators are the same as the OLS estimators.

ML Estimators - \(\sigma^2\)

\[ \begin{aligned} \hat{\sigma}^2_{ML} &= \underset{\sigma^2 \in \mathbb{R}_+^*}{\operatorname{argmax}} \log L(\hat{\beta}_0, \hat{\beta}_1, \sigma^2 | \mathbf{y})\\ &= \underset{\sigma^2 \in \mathbb{R}_+^*}{\operatorname{argmax}} - \frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i = 1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\\ &= \underset{\sigma^2 \in \mathbb{R}_+^*}{\operatorname{argmin}} \frac{n}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i = 1}^n \hat{\epsilon}_i^2\\ \end{aligned} \]

ML Estimators - \(\sigma^2\)

\[ \frac{\partial}{\partial \sigma^2} \left[\frac{n}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i = 1}^n \hat{\epsilon}_i^2\right] = \cdots = 0 \]

ML Estimators - \(\sigma^2\)

\[ \begin{aligned} \frac{\partial}{\partial \sigma^2} \left[\frac{n}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i = 1}^n \hat{\epsilon}_i^2\right] &= \frac{n}{2} \frac{1}{\sigma^2} - \frac{1}{2\sigma^4}\sum_{i = 1}^n \hat{\epsilon}_i^2 = 0 \end{aligned} \] And we get:

\[ \hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i = 1}^n \hat{\epsilon}_i^2 \]

ML Estimators - \(\sigma^2\) - Remarks

  • The ML estimator is different from the unbiased estimator of the variance we saw earlier.

\[ \hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i = 1}^n \hat{\epsilon}_i^2 \neq \frac{1}{n-2}\sum_{i = 1}^n \hat{\epsilon}_i^2 = \hat{\sigma}^2 \]

  • The ML estimator of the variance is biased: \[ \mathbb{E}[\hat{\sigma}^2_{ML}] = \frac{n - 2}{n}\sigma^2 \neq \sigma^2 \]
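A minimal numerical illustration of this difference, reusing the simulated x and y from the optim() sketch above:

# Biased ML estimate vs unbiased estimate of sigma^2
fit <- lm(y ~ x)
res <- residuals(fit)
n <- length(res)
c(ml = sum(res^2) / n,                   # ML estimator (biased)
  unbiased = sum(res^2) / (n - 2),       # unbiased estimator
  sigma_fit = sigma(fit)^2)              # sigma(fit) uses the n - 2 denominator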

Distribution of the Coefficients - \(\sigma^2\) known

Moments of the estimators

  • We already know the moments of the estimators:

\[ \begin{aligned} \mathbb{E}\left[ \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} \right] &= \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \\ \mathbb{Var}\left[ \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} \right] &= \frac{\sigma^2}{n} \begin{pmatrix} 1 + \frac{\bar{x}^2}{s_{\mathbf{x}}^2} & -\frac{\bar{x}}{s_{\mathbf{x}}^2}\\ -\frac{\bar{x}}{s_{\mathbf{x}}^2} & \frac{1}{s_{\mathbf{x}}^2} \end{pmatrix} = \sigma^2 \mathbf{V}_n \end{aligned} \]

  • What can we say about their distribution?

Distribution of \((\hat{\beta}_0, \hat{\beta}_1)\)

When the variance \(\sigma^2\) is known, we find that the vector of estimators \((\hat{\beta}_0, \hat{\beta}_1)\) is a Gaussian vector, with expectation \((\beta_0, \beta_1)\) and variance \(\sigma^2 \mathbf{V}_n\):

\[ \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix}; \sigma^2 \mathbf{V}_n \right) \]

  • Proof delayed to next chapter (multivariate regression).

Simulated dataset

Simulate according to the model: \[\mathbf{y} = -2 \cdot \mathbb{1} + 3 \cdot \mathbf{x} + \boldsymbol{\epsilon}\]

Simulated dataset: replicates

Simulate according to the model: \[\mathbf{y} = -2 \cdot \mathbb{1} + 3 \cdot \mathbf{x} + \boldsymbol{\epsilon}\]


\[ \hat{\beta}_0 \sim \mathcal{N}\left(\beta_0, \frac{\sigma^2}{n}\left(1 + \frac{\bar{x}^2}{s_{\mathbf{x}}^2}\right)\right) \quad \hat{\beta}_1 \sim \mathcal{N}\left(\beta_1, \frac{\sigma^2}{n}\frac{1}{s_{\mathbf{x}}^2}\right) \]
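A minimal simulation sketch of these replicates (the design x and the noise level below are hypothetical; \(\beta_0 = -2\) and \(\beta_1 = 3\) as in the model above), checking the empirical mean and variance of \(\hat{\beta}_1\) against the formula:

# Replicated simulations: empirical distribution of beta1_hat
set.seed(42)
n <- 100; sigma <- 1
x <- runif(n, 0, 5)                      # fixed design, kept across replicates
s2_x <- mean((x - mean(x))^2)            # empirical variance (1/n convention)
beta1_hat <- replicate(1000, {
  y <- -2 + 3 * x + rnorm(n, sd = sigma)
  coef(lm(y ~ x))[2]
})
c(mean(beta1_hat), var(beta1_hat))       # approx. 3 and sigma^2 / (n * s2_x)
sigma^2 / (n * s2_x)                     # theoretical variance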

Distribution of the Variance - \(\sigma^2\) known

Unbiased estimator of the Variance

\[ \hat{\sigma}^2 = \frac{1}{n-2}\sum_{i = 1}^n \hat{\epsilon}_i^2 \qquad \mathbb{E}[\hat{\sigma}^2] = \sigma^2 \]

  • Question: Distribution of \(\hat{\sigma}^2\)?

Distribution of \(\hat{\sigma}^2\)

\[ \frac{n-2}{\sigma^2}\hat{\sigma}^2 \sim \chi^2_{n-2} \]

  • The rescaled variance estimator \(\frac{n-2}{\sigma^2}\hat{\sigma}^2\) follows a chi-squared distribution with \(n-2\) degrees of freedom (a simulation check is sketched after this list).

  • \(\hat{\boldsymbol{\beta}}\) and \(\hat{\sigma}^2\) are independent.

  • Proof delayed to next chapter (multivariate regression).
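A minimal simulation sketch of the first result (hypothetical design and noise level), comparing the empirical moments of \(\frac{n-2}{\sigma^2}\hat{\sigma}^2\) with those of a \(\chi^2_{n-2}\):

# Empirical check of the chi-squared distribution of the scaled variance estimator
set.seed(42)
n <- 30; sigma <- 1
x <- runif(n, 0, 5)
stat <- replicate(2000, {
  y <- -2 + 3 * x + rnorm(n, sd = sigma)
  (n - 2) * sigma(lm(y ~ x))^2 / sigma^2
})
c(mean(stat), var(stat))                 # approx. n - 2 and 2 * (n - 2)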

Reminder: Chi Squared Distribution 1/3

Let \(X_1, \dotsc, X_p\) be \(p\) iid standard normal r.v.: \(X_i \sim \mathcal{N}(0, 1)\). Then \[ X = \sum_{i = 1}^p X_i^2 \sim \chi^2_p \]

is a Chi squared r.v. with \(p\) degrees of freedom.

We have: \[ \mathbb{E}[X] = p \quad \mathbb{Var}[X] = 2p. \]

As \(p\) grows, it converges in distribution towards the Gaussian with matching moments: \((X - p)/\sqrt{2p} \to \mathcal{N}(0, 1)\).

Reminder: Chi Squared Distribution 2/3

Reminder: Chi Squared Distribution 3/3

Distribution of the Coefficients - \(\sigma^2\) unknown

Distribution of \((\hat{\beta}_0, \hat{\beta}_1)\)

When \(\sigma^2\) is known:

\[ \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix}; \sigma^2 \mathbf{V}_n \right) \qquad \mathbf{V}_n= \frac{1}{n} \begin{pmatrix} 1 + \frac{\bar{x}^2}{s_{\mathbf{x}}^2} & -\frac{\bar{x}}{s_{\mathbf{x}}^2}\\ -\frac{\bar{x}}{s_{\mathbf{x}}^2} & \frac{1}{s_{\mathbf{x}}^2} \end{pmatrix} \]

i.e.

\[ \frac{\hat{\beta}_0 - \beta_0}{\sqrt{\sigma^2(1 + \bar{x}^2/s_{\mathbf{x}}^2)/n}} \sim \mathcal{N}(0, 1) \qquad \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2(1/s_{\mathbf{x}}^2)/n}} \sim \mathcal{N}(0, 1) \]

  • Problem: \(\sigma^2\) is generally unknown.
  • Solution: Replace \(\sigma^2\) by \(\hat{\sigma}^2\).
  • How does this change the distribution?

Distribution of \((\hat{\beta}_0, \hat{\beta}_1)\)

When \(\sigma^2\) is known:

\[ \frac{\hat{\beta}_0 - \beta_0}{\sqrt{\sigma^2(1 + \bar{x}^2/s_{\mathbf{x}}^2)/n}} \sim \mathcal{N}(0, 1) \qquad \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2(1/s_{\mathbf{x}}^2)/n}} \sim \mathcal{N}(0, 1) \]

When \(\sigma^2\) is unknown: \[ \frac{\hat{\beta}_0 - \beta_0}{\sqrt{\hat{\sigma}^2(1 + \bar{x}^2/s_{\mathbf{x}}^2)/n}} \sim \mathcal{T}_{n-2} \qquad \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2(1/s_{\mathbf{x}}^2)/n}} \sim \mathcal{T}_{n-2} \]

  • The standard Gaussian is replaced by the Student distribution with \(n-2\) degrees of freedom.

Reminder: Student Distribution 1/2

Let \(Z\) be a standard normal r.v.: \(Z \sim \mathcal{N}(0, 1)\),
and \(X\) a chi-squared r.v. with \(p\) degrees of freedom: \(X \sim \chi^2_p\), with \(Z\) and \(X\) independent. Then \[ T = \frac{Z}{\sqrt{X/p}} \sim \mathcal{T}_p \]

is a Student r.v. with \(p\) degrees of freedom.

We have (for \(p > 2\)): \[ \mathbb{E}[T] = 0 \quad \mathbb{Var}[T] = \frac{p}{p-2}. \]

When \(p\) goes to infinity, it converges in distribution towards the standard Gaussian.

Reminder: Student Distribution 2/2

Confidence Intervals

Confidence intervals

\[ \frac{\hat{\beta}_0 - \beta_0}{\sqrt{\hat{\sigma}^2_0}} \sim \mathcal{T}_{n-2} \qquad \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2_1}} \sim \mathcal{T}_{n-2} \qquad \frac{n-2}{\sigma^2}\hat{\sigma}^2 \sim \chi^2_{n-2} \]

with \(\hat{\sigma}^2_0 = \frac{\hat{\sigma}^2}{n}\left(1 + \frac{\bar{x}^2}{s_{\mathbf{x}}^2}\right)\) and \(\hat{\sigma}^2_1 = \frac{\hat{\sigma}^2}{n\,s_{\mathbf{x}}^2}\) the estimated variances of \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

  • With probability \(1-\alpha\): \[ \begin{aligned} \beta_0 &\in \left[\hat{\beta}_0 \pm t_{n-2}(1 - \alpha/2) \sqrt{\hat{\sigma}_0^2}\right]\\ \beta_1 &\in \left[\hat{\beta}_1 \pm t_{n-2}(1 - \alpha/2) \sqrt{\hat{\sigma}_1^2}\right]\\ \sigma^2 &\in \left[\frac{(n-2)\hat{\sigma}^2}{c_{n-2}(1 - \alpha/2)}; \frac{(n-2)\hat{\sigma}^2}{c_{n-2}(\alpha/2)}\right] \end{aligned} \] where \(t_{n-2}(\cdot)\) and \(c_{n-2}(\cdot)\) denote the quantile functions of the \(\mathcal{T}_{n-2}\) and \(\chi^2_{n-2}\) distributions.
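In R these intervals can be obtained with confint() and the chi-squared quantiles; a minimal sketch on hypothetical simulated data (a stand-in for the y_test ~ x_test example on the next slides):

# Confidence intervals for beta_0, beta_1 and sigma^2 (alpha = 0.05)
set.seed(1)
x_sim <- runif(100, 0, 5)
y_sim <- -2 + 3 * x_sim + rnorm(100)
fit_sim <- lm(y_sim ~ x_sim)
confint(fit_sim, level = 0.95)           # t-based intervals for beta_0 and beta_1
df <- df.residual(fit_sim)               # n - 2
df * sigma(fit_sim)^2 / qchisq(c(0.975, 0.025), df)   # interval for sigma^2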

Simulated dataset

\[\mathbf{y} = -2 \cdot \mathbb{1} + 3 \cdot \mathbf{x} + \boldsymbol{\epsilon}\]

  • With probability \(1-\alpha\): \[ \begin{aligned} t_{n-2}(\alpha/2) \leq \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2_1}} \leq t_{n-2}(1 - \alpha/2) \end{aligned} \]

Simulated dataset - R output

\[\mathbf{y} = -2 \cdot \mathbb{1} + 3 \cdot \mathbf{x} + \boldsymbol{\epsilon}\]

linreg <- lm(y_test ~ x_test)
summary(linreg)
## 
## Call:
## lm(formula = y_test ~ x_test)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.58688 -0.66510  0.05019  0.69597  2.86079 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.9448     0.1017  -19.12   <2e-16 ***
## x_test        3.0979     0.0866   35.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9812 on 98 degrees of freedom
## Multiple R-squared:  0.9289, Adjusted R-squared:  0.9281 
## F-statistic:  1280 on 1 and 98 DF,  p-value: < 2.2e-16

Prediction Intervals

Prediction error

Recall that the prediction error \(\hat{\epsilon}_{n+1} = y_{n+1} - \hat{y}_{n+1}\) is such that: \[ \begin{aligned} \mathbb{E}[\hat{\epsilon}_{n+1}] &= 0\\ \mathbb{Var}[\hat{\epsilon}_{n+1}] &= \sigma^2 \left(1 + \frac{1}{n} + \frac{1}{ns_{\mathbf{x}}^2} (x_{n+1} - \bar{x})^2\right)\\ \end{aligned} \]

Prediction Distribution

We get (variance known):

\[ y_{n+1} - \hat{y}_{n+1} \sim \mathcal{N}\left(0, \sigma^2 \left(1 + \frac{1}{n} + \frac{1}{ns_{\mathbf{x}}^2} (x_{n+1} - \bar{x})^2\right)\right) \]

and (variance unknown): \[ \frac{y_{n+1} - \hat{y}_{n+1}}{\sqrt{\hat{\sigma}^2 \left(1 + \frac{1}{n} + \frac{1}{ns_{\mathbf{x}}^2} (x_{n+1} - \bar{x})^2\right)}} \sim \mathcal{T}_{n-2} \]
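A minimal sketch computing a 95% prediction interval by hand and comparing it with predict(), assuming the x_test, y_test and linreg objects from the simulated-dataset slides are still in the workspace (the new point x_new is hypothetical):

# Prediction interval at a new point x_{n+1}
x_new <- 2.5
n <- length(x_test)
s2_x <- mean((x_test - mean(x_test))^2)  # 1/n variance, as in the formula above
y_new_hat <- predict(linreg, data.frame(x_test = x_new))
half_width <- qt(0.975, n - 2) *
  sqrt(sigma(linreg)^2 * (1 + 1 / n + (x_new - mean(x_test))^2 / (n * s2_x)))
c(y_new_hat - half_width, y_new_hat + half_width)
predict(linreg, data.frame(x_test = x_new), interval = "prediction")   # should match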

Advertising Dataset

Advertising

library(here)
ad <- read.csv(here("data", "Advertising.csv"), row.names = "X")
head(ad)
##      TV radio newspaper sales
## 1 230.1  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
## 4 151.5  41.3      58.5  18.5
## 5 180.8  10.8      58.4  12.9
## 6   8.7  48.9      75.0   7.2
  • TV, radio, newspaper: advertising budgets (thousands of $)
  • sales: number of sales (thousands of units)

Advertising - TV - 1/6

fit_tv <- lm(sales ~ TV, data = ad)
plot(ad$TV, ad$sales)
abline(fit_tv, col = "blue")

Advertising - TV - 2/6

summary(fit_tv)
## 
## Call:
## lm(formula = sales ~ TV, data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

Advertising - TV - 3/6

  • Is there a relationship between TV advertising budget and sales?
    • Student t test: p-values < 2e-16
  • How strong is the relationship?
    • RSE = 3.259, RSE / \(\bar{y}\) \(\approx\) 23% (rough percentage of error)
    • \(R^2 = 0.6119\): only around 61% of the variance is explained.
  • How large is the effect of TV on sales?
    • \(\hat{\beta}_1 \approx 0.047537 \pm 1.96 \times 0.002691\) (see the exact interval below)
    • CI away from 0
    • When we spend 1000$ more on TV, we sell about 47.5 more units.
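The interval above uses the normal quantile 1.96 as an approximation; the exact t-based interval comes directly from the fitted model (fit_tv from the previous slides):

# Exact 95% confidence interval for the TV coefficient
confint(fit_tv, "TV", level = 0.95)
# equals 0.047537 +/- qt(0.975, 198) * 0.002691, up to printed rounding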

Advertising - TV - 4/6

## lm fit
fit_tv <- lm(sales ~ TV, data = ad)
plot(ad$TV, ad$sales)
abline(fit_tv, col = "black")
## colors
contrast_col <- hcl.colors(2)
## Confidence interval
newx <- seq(0, 300, by = 1)
conf_int <- predict(fit_tv, data.frame(TV = newx), interval = "conf")
# Plot
lines(newx, conf_int[, 2], col = contrast_col[1], lty = 2, lwd = 2)
lines(newx, conf_int[, 3], col = contrast_col[1], lty = 2, lwd = 2)
# Prediction intervals
pred_int <- predict(fit_tv, data.frame(TV = newx), interval = "pred")
# Plot
lines(newx, pred_int[, 2], col = contrast_col[2], lty = 2, lwd = 2)
lines(newx, pred_int[, 3], col = contrast_col[2], lty = 2, lwd = 2)

Advertising - TV - 5/6

  • 95% confidence interval (dark purple) and prediction interval (light yellow)

Advertising - TV - 6/6

  • How accurately can we predict future sales?
    • See prediction intervals.
  • Is the relationship linear?
    • Does it look like it? More on that later.
  • Which media contribute to sales?
    • Start again with newspapers and radio.
  • Is there synergy among the media?
    • ??
  • To answer these questions better, we need multiple linear regression.
    • Next chapter.

Joint Distribution of the Coefficients - \(\sigma^2\) unknown

Distribution of \((\hat{\beta}_0, \hat{\beta}_1)\)

When \(\sigma^2\) is known:

\[ \hat{\boldsymbol{\beta}} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} \sim \mathcal{N}\left( \boldsymbol{\beta} = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix}; \sigma^2 \mathbf{V}_n \right) \]

When \(\sigma^2\) is unknown: \[ \frac{1}{2\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T \mathbf{V}_n^{-1}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{F}^2_{n-2} \]

  • with \(\mathcal{F}^2_{n-2}\) the Fisher distribution with \((2, n-2)\) degrees of freedom.

Reminder: Fisher Distribution 1/2

Let \(U_1 \sim \chi^2_{p_1}\) and \(U_2 \sim \chi^2_{p_2}\), \(U_1\) and \(U_2\) independent. Then \[ F = \frac{U_1/p_1}{U_2/p_2} \sim \mathcal{F}^{p_1}_{p_2} \]

is a Fisher r.v. with \((p_1, p_2)\) degrees of freedom.

We have (for \(p_2 > 2\)): \[ \mathbb{E}[F] = \frac{p_2}{p_2 - 2}. \]

When \(p_2\) goes to infinity, \(\mathcal{F}^{p_1}_{p_2}\) converges in distribution to \(\frac{1}{p_1}\chi^2_{p_1}\).

Reminder: Fisher Distribution 2/2

Confidence Region

Confidence Intervals

  • Marginal distributions:

\[ \frac{\hat{\beta}_0 - \beta_0}{\sqrt{\hat{\sigma}^2_0}} \sim \mathcal{T}_{n-2} \qquad \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2_1}} \sim \mathcal{T}_{n-2} \]

  • Give confidence intervals:

\[ \begin{aligned} \beta_0 &\in \left[\hat{\beta}_0 \pm t_{n-2}(1 - \alpha/2) \sqrt{\hat{\sigma}_0^2}\right]\\ \beta_1 &\in \left[\hat{\beta}_1 \pm t_{n-2}(1 - \alpha/2) \sqrt{\hat{\sigma}_1^2}\right]\\ \end{aligned} \]

Confidence Region

  • Joint distribution:

\[ \frac{1}{2\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T \mathbf{V}_n^{-1}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{F}^2_{n-2} \]

  • Gives confidence region:

\[ \frac{1}{2\hat{\sigma}^2}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^T \mathbf{V}_n^{-1}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \leq f_{n-2}^2(1 - \alpha) \]

Since \(\mathbf{V}_n^{-1}\) is positive definite, this is the equation of an elliptic region.
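A minimal sketch tracing the boundary of this elliptic region for a fitted model (reusing linreg from the simulated dataset; vcov(linreg) plays the role of \(\hat{\sigma}^2 \mathbf{V}_n\)):

# 95% confidence ellipse for (beta_0, beta_1)
b_hat <- coef(linreg)
V <- vcov(linreg)                        # estimated covariance of the coefficients
f_quant <- qf(0.95, 2, df.residual(linreg))
theta <- seq(0, 2 * pi, length.out = 200)
circle <- rbind(cos(theta), sin(theta))  # unit circle, parametrized
R <- chol(V)                             # V = t(R) %*% R
ellipse <- t(b_hat + sqrt(2 * f_quant) * t(R) %*% circle)
plot(ellipse, type = "l", xlab = "beta0", ylab = "beta1")
points(b_hat[1], b_hat[2], pch = 19)     # center: the estimated coefficients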

Simulated dataset

\[\mathbf{y} = -2 \cdot \mathbb{1} + 3 \cdot \mathbf{x} + \boldsymbol{\epsilon}\]

Difference between:

  • the confidence intervals (green) and
  • the confidence region (purple).