Prediction intervals

Lucy D’Agostino McGowan

Confidence intervals

If we used the same sampling method to select many different samples and computed an interval estimate from each one, we would expect about 95% of those intervals to contain the true population parameter ( \(\beta_1\) ).
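One way to see this repeated-sampling idea is a small simulation. The sketch below is not from the original data: the true slope, sample size, and noise level are all assumptions chosen for illustration. It draws many samples, fits a line to each, and records how often the 95% interval contains the true slope.

```r
# Simulated setup (all values here are assumptions for illustration)
set.seed(1)
beta_1 <- 2      # the "true" slope we pretend to know
n_sims <- 1000   # number of repeated samples

covered <- replicate(n_sims, {
  x <- runif(50, 0, 10)
  y <- 1 + beta_1 * x + rnorm(50, sd = 3)
  ci <- confint(lm(y ~ x))["x", ]
  # Does this sample's interval contain the true slope?
  ci[1] <= beta_1 && beta_1 <= ci[2]
})

# Proportion of intervals that contain the true slope; expect roughly 0.95
mean(covered)
```

Any single interval either contains \(\beta_1\) or it does not; the 95% describes the procedure across many samples.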

Confidence interval for \(\beta_1\)

How do we calculate the confidence interval for the slope?

\[\hat\beta_1\pm t^*SE_{\hat\beta_1}\]

How do we calculate it in R?

With the confint function:

mod <- lm(battery_percent ~ screen_time, data = data)
summary(mod)


Call:
lm(formula = battery_percent ~ screen_time, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-61.818 -17.353   2.546  19.108 115.720 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 68.150787   2.503928  27.218  < 2e-16 ***
screen_time -0.022447   0.008347  -2.689  0.00735 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.53 on 630 degrees of freedom
Multiple R-squared:  0.01135,   Adjusted R-squared:  0.009781 
F-statistic: 7.233 on 1 and 630 DF,  p-value: 0.007349

confint(mod)

                 2.5 %       97.5 %
(Intercept) 63.2337308 73.067842966
screen_time -0.0388383 -0.006056541

How do we calculate it in R?

“by hand”

t_star <- qt(0.025, df = nrow(data) - 2, lower.tail = FALSE)
# or
t_star <- qt(0.975, df =  nrow(data) - 2)

-0.022447 - t_star * 0.008347

[1] -0.03883831

-0.022447 + t_star * 0.008347

[1] -0.00605569
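Rather than typing the estimate and standard error in by hand, they can be pulled from the fitted model. The sketch below uses a simulated stand-in for the data (the column names mirror the model above, but the values are made up), so the numbers will not match the output shown earlier. It checks that the by-hand interval agrees with confint.

```r
# Simulated stand-in for the slides' data (values are made up for illustration)
set.seed(10)
sim_data <- data.frame(screen_time = runif(100, 0, 600))
sim_data$battery_percent <- 68 - 0.02 * sim_data$screen_time +
  rnorm(100, sd = 25)

sim_mod <- lm(battery_percent ~ screen_time, data = sim_data)

# Pull the slope estimate and its standard error from the coefficient table
coefs <- coef(summary(sim_mod))
b1 <- coefs["screen_time", "Estimate"]
se_b1 <- coefs["screen_time", "Std. Error"]

# Critical value with n - 2 residual degrees of freedom
t_star <- qt(0.975, df = df.residual(sim_mod))

by_hand <- c(b1 - t_star * se_b1, b1 + t_star * se_b1)
by_hand

# Should agree with the built-in interval
confint(sim_mod)["screen_time", ]
```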

Confidence intervals

There are ✌️ other types of intervals we may want to calculate

The confidence interval for the mean response in \(y\) for a given \(x^*\) value
The prediction interval for an individual response \(y\) for a given \(x^*\) value
Why are these different? Which do you think is easier to estimate?
It is harder to predict one response than to predict a mean response. What does this mean in terms of the standard error?
The SE of the prediction interval is going to be larger

Confidence intervals

confidence interval for \(\mu_y\) and prediction interval

\[ \hat{y}\pm t^* SE\]

\(\hat{y}\) is the predicted \(y\) for a given \(x^*\)
\(t^*\) is the critical value for the \(t_{n-2}\) density curve
\(SE\) takes ✌️ different values depending on which interval you’re interested in

\(SE_{\hat\mu}\)

\(SE_{\hat{y}}\)

Which will be larger?

\(SE_{\hat\mu} = \hat{\sigma}_\epsilon\sqrt{\frac{1}{n}+\frac{(x^*-\bar{x})^2}{\Sigma(x-\bar{x})^2}}\)
\(SE_{\hat{y}}=\hat{\sigma}_\epsilon\sqrt{1 + \frac{1}{n}+\frac{(x^*-\bar{x})^2}{\Sigma(x-\bar{x})^2}}\)

What is the difference between these two equations?

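The extra 1 under the square root in \(SE_{\hat{y}}\): an individual response varies around the mean response \(\mu_y\) with a standard deviation of \(\sigma_\epsilon\), so the prediction interval has to account for that variability on top of the uncertainty in the estimated mean.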
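The two standard errors can also be computed directly from their formulas. This is a minimal sketch on simulated data (the data values and the choice \(x^* = 300\) are assumptions for illustration), comparing the by-hand intervals with what predict returns.

```r
# Simulated stand-in for the slides' data (values are made up for illustration)
set.seed(20)
sim_data <- data.frame(screen_time = runif(100, 0, 600))
sim_data$battery_percent <- 68 - 0.02 * sim_data$screen_time +
  rnorm(100, sd = 25)
sim_mod <- lm(battery_percent ~ screen_time, data = sim_data)

x_star <- 300   # hypothetical x* value
new_x <- data.frame(screen_time = x_star)
x <- sim_data$screen_time
n <- nrow(sim_data)

sigma_hat <- sigma(sim_mod)                        # residual standard error
y_hat <- unname(predict(sim_mod, newdata = new_x)) # predicted y at x*
t_star <- qt(0.975, df = n - 2)

# Shared term under the square root: 1/n + (x* - xbar)^2 / sum((x - xbar)^2)
shared <- 1 / n + (x_star - mean(x))^2 / sum((x - mean(x))^2)
se_mu <- sigma_hat * sqrt(shared)      # SE for the mean response
se_y <- sigma_hat * sqrt(1 + shared)   # SE for an individual response

ci_by_hand <- c(y_hat - t_star * se_mu, y_hat + t_star * se_mu)
pi_by_hand <- c(y_hat - t_star * se_y, y_hat + t_star * se_y)

# Compare with predict()
ci_by_hand
predict(sim_mod, newdata = new_x, interval = "confidence")
pi_by_hand
predict(sim_mod, newdata = new_x, interval = "prediction")
```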

Let’s do it in R!

mod <- lm(battery_percent ~ screen_time, data = data)
predict(mod)

       1        2        3 
62.69606 61.25943 59.91258

mod <- lm(battery_percent ~ screen_time, data = data)
predict(mod, interval = "confidence")

       fit      lwr      upr
1 62.69606 60.70358 64.68855
2 61.25943 59.27788 63.24097
3 59.91258 57.48676 62.33840

mod <- lm(battery_percent ~ screen_time, data = data)
predict(mod, interval = "prediction")

## WARNING predictions on current data refer to _future_ responses

       fit      lwr      upr
1 62.69606 14.47649 110.9156
2 61.25943 13.04031 109.4785
3 59.91258 11.67317 108.1520

Let’s do it in R!

What if we have new data?

new_data <- data.frame(
  screen_time = c(300, 250, 523)
)
new_data

  screen_time
1         300
2         250
3         523

predict(
  mod, 
  newdata = new_data, 
  interval = "prediction")

       fit       lwr      upr
1 61.41656 13.198504 109.6346
2 62.53893 14.320523 110.7573
3 56.41079  8.024989 104.7966
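To compare the two interval types at the same new \(x^*\) values, both can be requested from predict and their widths placed side by side. The sketch below again uses simulated data (made-up values with the same column names), so its numbers will differ from the output above.

```r
# Simulated stand-in for the slides' data (values are made up for illustration)
set.seed(30)
sim_data <- data.frame(screen_time = runif(100, 0, 600))
sim_data$battery_percent <- 68 - 0.02 * sim_data$screen_time +
  rnorm(100, sd = 25)
sim_mod <- lm(battery_percent ~ screen_time, data = sim_data)

new_data <- data.frame(screen_time = c(300, 250, 523))

conf_int <- predict(sim_mod, newdata = new_data, interval = "confidence")
pred_int <- predict(sim_mod, newdata = new_data, interval = "prediction")

# Same fitted values, different widths
widths <- data.frame(
  screen_time = new_data$screen_time,
  fit = conf_int[, "fit"],
  ci_width = conf_int[, "upr"] - conf_int[, "lwr"],
  pi_width = pred_int[, "upr"] - pred_int[, "lwr"]
)
widths
```

The fitted values are identical for both interval types; only the standard error, and therefore the width, changes.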

`Application Exercise`

Open appex-10.qmd
You are interested in the predicted Porsche Price for Porsche cars that have 50,000 miles previously driven on average. Calculate this value with an appropriate confidence interval.
You are interested in the predicted Porsche Price for a particular Porsche with 40,000 miles previously driven. Calculate this value with an appropriate interval.
