Drawing Inference

Lucy D’Agostino McGowan

Data

data 
# A tibble: 604 × 5
      id screen_time battery_percent campus_location pit_meals
   <int>       <dbl>           <dbl> <chr>               <dbl>
 1    44         243              79 Quad                    2
 2    30         307              19 South Campus            7
 3    30         367              54 South Campus           14
 4    44         175              63 North Campus            2
 5    44         366              47 North Campus            1
 6    44         143              74 Quad                    5
 7    44         288              70 Quad                    4
 8    44         118              94 Quad                    0
 9    24         134              90 Quad                    9
10    24         300              82 Quad                    4
# ℹ 594 more rows

When is a simple linear regression model a useful descriptive summary?

  • Linearity holds
  • The residuals have “zero mean” (this is always true!)
  • The datapoints are independent

What if we want to draw inference on another sample?

Inference

  • So far we’ve only been able to describe our sample
  • For example, we’ve just been describing \(\hat{\beta}_1\) the estimated slope of the relationship between \(x\) and \(y\)
  • What if we want to extend these claims to the population?

Survey data

How can I visualize a single continuous variable?

Code
ggplot(data, aes(x = screen_time)) +
  geom_histogram(bins = 50) + 
  labs(x = "Average Daily Screen Time (minutes)")

Survey data

Code
set.seed(1)
ggplot(data, 
       aes(x = screen_time, y = 1)) + 
  geom_boxplot(outlier.color = NA) + 
  geom_jitter() + 
  labs(x = "Average Daily Screen Time (minutes)") +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

Survey data

Code
set.seed(1)

ggplot(data, 
       aes(x = screen_time, y = 1)) + 
  geom_boxplot(outlier.colour = NA) + 
  geom_jitter() + 
  geom_jitter(data = data_sample, color = "cornflower blue", size = 3) + 
  labs(x = "Average Daily Screen Time (minutes)") +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

Survey data

Code
set.seed(1)

ggplot(data,
       aes(x = screen_time, 
           y = id == 9,
           color = id == 9)) + 
  geom_boxplot() + 
  geom_jitter() + 
  scale_color_manual(values = c("black", "cornflower blue")) + 
  theme(legend.position = "none") + 
  labs(x = "Average Daily Screen Time (minutes)",
       y = "in sample")

Application Exercise

  1. Create a new project from this template in RStudio Pro:
https://github.com/sta-112-f23/appex-07.git
  1. Run the load-packages and load-data chunks.
  2. Go to: https://lucy.shinyapps.io/sta-112-f23-id-lookup/
  3. Enter your email address to look up your ID
  4. Update the filter(id == 9) line in the filter-data chunk to your ID and run this chunk.
05:00

Survey data

How can I calculate the average daily screen time in my sample in R?

data_sample |>
  summarize(mean_screen_time = mean(screen_time))
# A tibble: 1 × 1
  mean_screen_time
             <dbl>
1             235.
lm(screen_time ~ 1, data = data_sample)

Call:
lm(formula = screen_time ~ 1, data = data_sample)

Coefficients:
(Intercept)  
      235.1  

Survey data

What if I want to know the average daily screen time for Wake Forest Students?


How can we quantify how much we’d expect the mean to differ from one random sample to another?

  • We need a measure of uncertainty
  • How about the standard error of the mean?
  • The standard error is how much we expect the sample mean to vary from one random sample to another.

Survey data

How can we quantify how much we’d expect the mean to differ from one random sample to another?

mod <- lm(screen_time ~ 1, data = data_sample)
summary(mod)

Call:
lm(formula = screen_time ~ 1, data = data_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.091 -42.091  -2.091  38.909  63.909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   235.09       7.08   33.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.67 on 32 degrees of freedom

Application Exercise

  1. Fit an intercept only model to calculate the average screen time in your sample
  2. Use the summary function on the linear model you fit
  3. What is the standard error for the mean? Interpret this value.
05:00

confidence intervals

If we use the same sampling method to select different samples and computed an interval estimate for each sample, we would expect the true population parameter (the average screen time for Wake Forest Students) to fall within the interval estimates 95% of the time.

Confidence interval

\[\bar{x} \pm t^∗ \times SE_{\bar{x}}\]

  • \(t^*\) is the critical value for the \(t_{n−1}\) density curve to obtain the desired confidence level
  • Often we want a 95% confidence level.

Demo

Let’s do it in R!

mod <- lm(screen_time ~ 1, data = data_sample)
summary(mod) 

Call:
lm(formula = screen_time ~ 1, data = data_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.091 -42.091  -2.091  38.909  63.909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   235.09       7.08   33.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.67 on 32 degrees of freedom
qt(0.025, df = nrow(data_sample) - 1, lower.tail = FALSE)
[1] 2.036933

Let’s do it in R!

Why 0.025?

mod <- lm(screen_time ~ 1, data = data_sample)
summary(mod) 

Call:
lm(formula = screen_time ~ 1, data = data_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.091 -42.091  -2.091  38.909  63.909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   235.09       7.08   33.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.67 on 32 degrees of freedom
qt(0.025, df = nrow(data_sample) - 1, lower.tail = FALSE)
[1] 2.036933

Let’s do it in R!

Why lower.tail = FALSE?

mod <- lm(screen_time ~ 1, data = data_sample)
summary(mod) 

Call:
lm(formula = screen_time ~ 1, data = data_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.091 -42.091  -2.091  38.909  63.909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   235.09       7.08   33.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.67 on 32 degrees of freedom
qt(0.025, df = nrow(data_sample) - 1, lower.tail = FALSE)
[1] 2.036933

Let’s do it in R!

mod <- lm(screen_time ~ 1, data = data_sample)
summary(mod) 

Call:
lm(formula = screen_time ~ 1, data = data_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.091 -42.091  -2.091  38.909  63.909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   235.09       7.08   33.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.67 on 32 degrees of freedom
t_star <- qt(0.025, df = nrow(data_sample) - 1, lower.tail = FALSE)
310.14 + t_star * 20.23
[1] 351
310.14 - t_star * 20.23
[1] 269
confint(mod) 
            2.5 % 97.5 %
(Intercept)   221    250

confidence intervals

If we use the same sampling method to select different samples and computed an interval estimate for each sample, we would expect the true population parameter (the mean) to fall within the interval estimates 95% of the time.

Application Exercise

  1. Open appex-07.qmd
  2. Calculate the \(t^*\) value for your confidence interval
  3. Calculate the confidence interval “by hand” using the \(t^*\) value from exercise 2 and the mean and standard error from the previous application exercise
  4. Calculate the confidence interval using the confint function
  5. Interpret this value
05:00

Confidence Intervals

Code
data |>
  group_by(id) |>
  mutate(screen_time = ifelse(screen_time == 1210, 132, screen_time)) |>
  summarise(m = mean(screen_time, na.rm = TRUE),
            sd = sd(screen_time, na.rm = TRUE),
            n = n(),
            t = qt(0.025, df = n - 1, lower.tail = FALSE),
            lb = m - t * sd,
            ub = m + t * sd,
            yes = ifelse(lb < mean(data$screen_time) & ub > mean(data$screen_time), 1, 0)) |>
  filter(n > 2) |>
  ggplot(aes(y = factor(id), xmin = lb, x = m, xmax = ub, color = yes)) +
  geom_pointrange() +
  geom_vline(xintercept = mean(data$screen_time), lty = 2) + 
  theme(legend.position = "none") + 
  ylab("id") + 
  xlab("Average Daily Screen Time (Minutes)")