Drawing Inference

Lucy D’Agostino McGowan

Data

data

# A tibble: 604 × 5
      id screen_time battery_percent campus_location pit_meals
   <int>       <dbl>           <dbl> <chr>               <dbl>
 1    44         243              79 Quad                    2
 2    30         307              19 South Campus            7
 3    30         367              54 South Campus           14
 4    44         175              63 North Campus            2
 5    44         366              47 North Campus            1
 6    44         143              74 Quad                    5
 7    44         288              70 Quad                    4
 8    44         118              94 Quad                    0
 9    24         134              90 Quad                    9
10    24         300              82 Quad                    4
# ℹ 594 more rows

When is a simple linear regression model a useful descriptive summary?

Linearity holds
The residuals have “zero mean” (this is always true!)
The datapoints are independent

What if we want to draw inference on another sample?

Inference

So far we’ve only been able to describe our sample
For example, we’ve just been describing \(\hat{\beta}_1\) the estimated slope of the relationship between \(x\) and \(y\)
What if we want to extend these claims to the population?

Survey data

How can I visualize a single continuous variable?

Code

ggplot(data, aes(x = screen_time)) +
  geom_histogram(bins = 50) + 
  labs(x = "Average Daily Screen Time (minutes)")

Survey data

Code

set.seed(1)
ggplot(data, 
       aes(x = screen_time, y = 1)) + 
  geom_boxplot(outlier.color = NA) + 
  geom_jitter() + 
  labs(x = "Average Daily Screen Time (minutes)") +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

Survey data

Code

set.seed(1)

ggplot(data, 
       aes(x = screen_time, y = 1)) + 
  geom_boxplot(outlier.colour = NA) + 
  geom_jitter() + 
  geom_jitter(data = data_sample, color = "cornflower blue", size = 3) + 
  labs(x = "Average Daily Screen Time (minutes)") +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

Survey data

Code

set.seed(1)

ggplot(data,
       aes(x = screen_time, 
           y = id == 9,
           color = id == 9)) + 
  geom_boxplot() + 
  geom_jitter() + 
  scale_color_manual(values = c("black", "cornflower blue")) + 
  theme(legend.position = "none") + 
  labs(x = "Average Daily Screen Time (minutes)",
       y = "in sample")

`Application Exercise`

Create a new project from this template in RStudio Pro:

https://github.com/sta-112-f23/appex-07.git

Run the load-packages and load-data chunks.
Go to: https://lucy.shinyapps.io/sta-112-f23-id-lookup/
Enter your email address to look up your ID
Update the filter(id == 9) line in the filter-data chunk to your ID and run this chunk.

05:00

Survey data

How can I calculate the average daily screen time in my sample in R?

data_sample |>
  summarize(mean_screen_time = mean(screen_time))

# A tibble: 1 × 1
  mean_screen_time
             <dbl>
1             235.

lm(screen_time ~ 1, data = data_sample)


Call:
lm(formula = screen_time ~ 1, data = data_sample)

Coefficients:
(Intercept)  
      235.1

Survey data

What if I want to know the average daily screen time for Wake Forest Students?

How can we quantify how much we’d expect the mean to differ from one random sample to another?

We need a measure of uncertainty
How about the standard error of the mean?
The standard error is how much we expect the sample mean to vary from one random sample to another.

Survey data

How can we quantify how much we’d expect the mean to differ from one random sample to another?

mod <- lm(screen_time ~ 1, data = data_sample)
summary(mod)


Call:
lm(formula = screen_time ~ 1, data = data_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.091 -42.091  -2.091  38.909  63.909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   235.09       7.08   33.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.67 on 32 degrees of freedom

`Application Exercise`

Fit an intercept only model to calculate the average screen time in your sample
Use the summary function on the linear model you fit
What is the standard error for the mean? Interpret this value.

05:00

confidence intervals

If we use the same sampling method to select different samples and computed an interval estimate for each sample, we would expect the true population parameter (the average screen time for Wake Forest Students) to fall within the interval estimates 95% of the time.

Confidence interval

\[\bar{x} \pm t^∗ \times SE_{\bar{x}}\]

\(t^*\) is the critical value for the \(t_{n−1}\) density curve to obtain the desired confidence level
Often we want a 95% confidence level.

Demo

Let’s do it in R!

mod <- lm(screen_time ~ 1, data = data_sample)
summary(mod)


Call:
lm(formula = screen_time ~ 1, data = data_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.091 -42.091  -2.091  38.909  63.909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   235.09       7.08   33.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.67 on 32 degrees of freedom

qt(0.025, df = nrow(data_sample) - 1, lower.tail = FALSE)

[1] 2.036933

Let’s do it in R!

Why 0.025?

mod <- lm(screen_time ~ 1, data = data_sample)
summary(mod)


Call:
lm(formula = screen_time ~ 1, data = data_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.091 -42.091  -2.091  38.909  63.909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   235.09       7.08   33.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.67 on 32 degrees of freedom

qt(0.025, df = nrow(data_sample) - 1, lower.tail = FALSE)

[1] 2.036933

Let’s do it in R!

Why lower.tail = FALSE?

mod <- lm(screen_time ~ 1, data = data_sample)
summary(mod)


Call:
lm(formula = screen_time ~ 1, data = data_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.091 -42.091  -2.091  38.909  63.909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   235.09       7.08   33.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.67 on 32 degrees of freedom

qt(0.025, df = nrow(data_sample) - 1, lower.tail = FALSE)

[1] 2.036933

Let’s do it in R!

mod <- lm(screen_time ~ 1, data = data_sample)
summary(mod)


Call:
lm(formula = screen_time ~ 1, data = data_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.091 -42.091  -2.091  38.909  63.909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   235.09       7.08   33.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.67 on 32 degrees of freedom

t_star <- qt(0.025, df = nrow(data_sample) - 1, lower.tail = FALSE)
310.14 + t_star * 20.23

[1] 351

310.14 - t_star * 20.23

[1] 269

confint(mod)

            2.5 % 97.5 %
(Intercept)   221    250

confidence intervals

If we use the same sampling method to select different samples and computed an interval estimate for each sample, we would expect the true population parameter (the mean) to fall within the interval estimates 95% of the time.

`Application Exercise`

Open appex-07.qmd
Calculate the \(t^*\) value for your confidence interval
Calculate the confidence interval “by hand” using the \(t^*\) value from exercise 2 and the mean and standard error from the previous application exercise
Calculate the confidence interval using the confint function
Interpret this value

05:00

Confidence Intervals

Code

data |>
  group_by(id) |>
  mutate(screen_time = ifelse(screen_time == 1210, 132, screen_time)) |>
  summarise(m = mean(screen_time, na.rm = TRUE),
            sd = sd(screen_time, na.rm = TRUE),
            n = n(),
            t = qt(0.025, df = n - 1, lower.tail = FALSE),
            lb = m - t * sd,
            ub = m + t * sd,
            yes = ifelse(lb < mean(data$screen_time) & ub > mean(data$screen_time), 1, 0)) |>
  filter(n > 2) |>
  ggplot(aes(y = factor(id), xmin = lb, x = m, xmax = ub, color = yes)) +
  geom_pointrange() +
  geom_vline(xintercept = mean(data$screen_time), lty = 2) + 
  theme(legend.position = "none") + 
  ylab("id") + 
  xlab("Average Daily Screen Time (Minutes)")