Partitioning Variability

Lucy D’Agostino McGowan

Partitioning variability

Example

ggplot(data, aes(x = battery_percent)) + 
  geom_histogram(bins = 30)

Example

ggplot(data, aes(x = battery_percent)) + 
  geom_histogram(bins = 30) + 
  # dashed vertical lines at the mean plus/minus one standard deviation
  geom_vline(
    xintercept = c(
      mean(data$battery_percent) + sd(data$battery_percent),
      mean(data$battery_percent) - sd(data$battery_percent)
    ),
    lty = 2
  )

Total variation in response y

\[SSTotal = \sum (y - \bar{y})^2\]

data %>%
  summarise(
    sstotal = 
      sum((______ - ______)^2)
    )

Total variation in response y

\[SSTotal = \sum (y - \bar{y})^2\]

data %>%
  summarise(
    sstotal = 
      sum((battery_percent - mean(battery_percent))^2)
    )
# A tibble: 1 × 1
  sstotal
    <dbl>
1 369840.

Total variation in response y

\[SSTotal = \sum (y - \bar{y})^2\]

data %>%
  summarise(
    sstotal = 
      var(battery_percent) * (n()-1)
    )
# A tibble: 1 × 1
  sstotal
    <dbl>
1 369840.
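
This works because the sample variance is the sum of squared deviations divided by \(n - 1\). As a quick sanity check, assuming the same data object:

all.equal(
  var(data$battery_percent) * (nrow(data) - 1),
  sum((data$battery_percent - mean(data$battery_percent))^2)
)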

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(battery_percent ~ screen_time, data = data)


data %>%
  summarise(
    sse = 
      sum((______ - _______)^2)
    )

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(battery_percent ~ screen_time, data = data)


data %>%
  summarise(
    sse = 
      sum((battery_percent - fitted(mod))^2)
    )
# A tibble: 1 × 1
      sse
    <dbl>
1 366449.

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(battery_percent ~ screen_time, data = data)


data %>%
  summarise(
    sse = 
      sum(residuals(mod)^2)
    )
# A tibble: 1 × 1
      sse
    <dbl>
1 366449.

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(battery_percent ~ screen_time, data = data)


data %>%
  summarise(
    sse = 
      sigma(mod)^2 * (n() - 2)
    )
# A tibble: 1 × 1
      sse
    <dbl>
1 366449.
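
This works because sigma(mod) returns the residual standard error, whose square times the residual degrees of freedom recovers the SSE. As a quick sanity check, assuming mod from above:

all.equal(sigma(mod)^2 * df.residual(mod), sum(residuals(mod)^2))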

Variation explained by the model

\[SSModel = \sum (\hat{y}-\bar{y})^2\]

data %>%
  summarise(
    ssmodel = 
      sum((______ - ______)^2)
    )

Variation explained by the model

\[SSModel = \sum (\hat{y}-\bar{y})^2\]

data %>%
  summarise(
    ssmodel = 
      sum((fitted(mod) - mean(battery_percent))^2)
    )
# A tibble: 1 × 1
  ssmodel
    <dbl>
1   3391.

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2)
    )
# A tibble: 1 × 3
  sstotal ssmodel     sse
    <dbl>   <dbl>   <dbl>
1 369840.   3391. 366449.

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse
    )


What will this be?

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse
    )
# A tibble: 1 × 4
  sstotal ssmodel     sse `ssmodel + sse`
    <dbl>   <dbl>   <dbl>           <dbl>
1 369840.   3391. 366449.         369840.

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel
    )


What will this be?

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel
    )
# A tibble: 1 × 5
  sstotal ssmodel     sse `ssmodel + sse` `sstotal - ssmodel`
    <dbl>   <dbl>   <dbl>           <dbl>               <dbl>
1 369840.   3391. 366449.         369840.             366449.

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel,
    sstotal - sse
    )


What will this be?

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel,
    sstotal - sse
    )
# A tibble: 1 × 6
  sstotal ssmodel     sse `ssmodel + sse` `sstotal - ssmodel` `sstotal - sse`
    <dbl>   <dbl>   <dbl>           <dbl>               <dbl>           <dbl>
1 369840.   3391. 366449.         369840.             366449.           3391.
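
The table shows the identity \(SSTotal = SSModel + SSE\) numerically; a minimal check with all.equal(), which tolerates floating-point error, makes it explicit:

# verify SSTotal = SSModel + SSE up to floating-point error
with(data, {
  sstotal <- sum((battery_percent - mean(battery_percent))^2)
  ssmodel <- sum((fitted(mod) - mean(battery_percent))^2)
  sse     <- sum(residuals(mod)^2)
  all.equal(sstotal, ssmodel + sse)
})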

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^n (y - \bar{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} (y - \bar{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \bar{y})^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \require{color}\colorbox{#86a293}{$\bar{y}$})^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \bar{y})^2\]

How many degrees of freedom?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \bar{y})^2\]

\[\Large df_{SSTotal}=n-1\]

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - \hat{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} (y - \hat{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - \hat{y})^2\]

How is \(\hat{y}\) estimated with simple linear regression?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

How is \(\hat{y}\) estimated with simple linear regression?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\require{color}\colorbox{#86a293}{$\hat{\beta}_0$}+\colorbox{#86a293}{$\hat{\beta}_1$}x))^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

How many degrees of freedom?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

\[\Large df_{SSE} = n - 2\]

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = SSModel + SSE\]

\[df_{SSTotal} = df_{SSModel} + df_{SSE} \]

\[n - 1 = df_{SSModel} + (n - 2)\]

Application Exercise

How many degrees of freedom does SSModel have?

\[n - 1 = df_{SSModel} + (n - 2)\]


Mean squares

\[MSE = \frac{SSE}{n - 2}\]

\[MSModel = \frac{SSModel}{1}\]

What is the pattern?

\[\Large F = \frac{MSModel}{MSE}\]
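
Putting the pieces together, here is a minimal sketch that computes the F statistic by hand from the sums of squares above, reusing mod and data from the earlier slides:

data %>%
  summarise(
    msmodel = sum((fitted(mod) - mean(battery_percent))^2) / 1,  # MSModel = SSModel / 1
    mse     = sum(residuals(mod)^2) / (n() - 2),                 # MSE = SSE / (n - 2)
    f_stat  = msmodel / mse                                      # F = MSModel / MSE
  )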

F-distribution

Under the null hypothesis

f <- data.frame(
  stat = rf(n = 10000, df1 = 1, df2 = 629)
)

ggplot(f) + 
  geom_histogram(aes(stat), bins = 40) + 
  labs(x = "F Statistic")
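
The histogram approximates the F(1, 629) distribution by simulation. As a sketch, you could overlay the theoretical density to check the agreement (after_stat() assumes ggplot2 3.4 or later):

ggplot(f, aes(stat)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 40) + 
  stat_function(fun = stats::df, args = list(df1 = 1, df2 = 629)) +  # theoretical F density
  labs(x = "F Statistic")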

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the SSModel?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the MSModel?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the SSE?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the MSE?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the SSTotal?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the F statistic?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Is the F-statistic statistically significant?

p-value

The probability of getting a statistic as extreme as or more extreme than the observed test statistic, given that the null hypothesis is true

F-Distribution

Under the null hypothesis

ggplot(f) + 
  geom_histogram(aes(stat), bins = 40) + 
  labs(x = "F Statistic")

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = ?\)

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = 630\)

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = 630\)
  • \(df_{SSE} = ?\)

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = 630\)
  • \(df_{SSE} = n - 2 = 629\)

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = 630\)
  • \(df_{SSE} = n - 2 = 629\)
  • \(df_{SSModel} = ?\)

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = 630\)
  • \(df_{SSE} = n - 2 = 629\)
  • \(df_{SSModel} = 630 - 629 = 1\)
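
These degrees of freedom can also be read straight off the fitted model; a quick check, assuming mod and data from earlier:

nrow(data) - 1                        # df for SSTotal: 630
df.residual(mod)                      # df for SSE: 629
(nrow(data) - 1) - df.residual(mod)   # df for SSModel: 1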

Example

To calculate the p-value under the t-distribution we use pt(). What do you think we use to calculate the p-value under the F-distribution?

anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • pf()
  • it takes 3 arguments: q, df1, and df2. What do you think we would plug in for q?

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = 630\)
  • \(df_{SSE} = n - 2 = 629\) (this is df2)
  • \(df_{SSModel} = 630 - 629 = 1\) (this is df1)

Example

To calculate the p-value under the t-distribution we use pt(). What do you think we use to calculate the p-value under the F-distribution?

anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pf(5.8199, 1, 629, lower.tail = FALSE)
[1] 0.01613098
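
Rather than retyping the rounded statistic, you can pull the exact value out of the anova table; a small sketch, assuming mod from above:

f_val <- anova(mod)["screen_time", "F value"]     # exact F statistic
pf(f_val, df1 = 1, df2 = 629, lower.tail = FALSE)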

Example

Why don’t we multiply this p-value by 2 when we use pf()?

anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pf(5.8199, 1, 629, lower.tail = FALSE)
[1] 0.01613098

F-Distribution

Under the null hypothesis

ggplot(f) + 
  geom_histogram(aes(stat), bins = 40) + 
  labs(x = "F Statistic")

F-Distribution

Under the null hypothesis

f$shaded <- ifelse(f$stat > 5.8199, TRUE, FALSE)

ggplot(f) + 
  geom_histogram(aes(stat, fill = shaded), bins = 40) + 
  geom_vline(xintercept = 5.8199, lwd = 1.5) +
  labs(x = "F Statistic") +
  theme(legend.position = "none")

  • We observed an F-statistic of 5.8199
  • Are there any negative values in an F-distribution?

F-Distribution

Under the null hypothesis

f$shaded <- ifelse(f$stat > 5.8199, TRUE, FALSE)

ggplot(f) + 
  geom_histogram(aes(stat, fill = shaded), bins = 40) + 
  geom_vline(xintercept = 5.8199, lwd = 1.5) +
  labs(x = "F Statistic") +
  theme(legend.position = "none")

  • The p-value counts values “as extreme or more extreme” than the observed statistic. In the t-distribution, “more extreme” values, defined as farther from 0, can be positive or negative. Not so for the F: its values are never negative, so the only tail is the upper one.

Example

anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod)

Call:
lm(formula = battery_percent ~ screen_time, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-61.468 -17.443   2.593  19.190  46.905 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 67.496166   2.563073  26.334   <2e-16 ***
screen_time -0.020871   0.008652  -2.412   0.0161 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.14 on 629 degrees of freedom
Multiple R-squared:  0.009168,  Adjusted R-squared:  0.007592 
F-statistic:  5.82 on 1 and 629 DF,  p-value: 0.01613
  • Notice the p-value for the F-test is the same as the p-value for the \(\hat\beta_1\) t-test
  • This is always true for simple linear regression (with just one \(x\) variable)

What is the F-test testing?

anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • null hypothesis: the fit of the intercept-only model (with \(\hat\beta_0\) only) and your model (\(\hat\beta_0 + \hat\beta_1x\)) are equivalent (illustrated in the sketch below)
  • alternative hypothesis: the fit of the intercept-only model is significantly worse than the fit of your model
  • When we only have one variable in our model, \(x\), the p-values from the F and t tests are going to be equivalent
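
One way to see this null hypothesis directly is to fit the intercept-only model yourself and compare the two fits with anova(); this sketch should reproduce the F statistic and p-value from the table above:

mod0 <- lm(battery_percent ~ 1, data = data)  # intercept-only model
anova(mod0, mod)                              # same F (5.8199) and p-value (0.01613)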

Relating the F and the t

anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod)

Call:
lm(formula = battery_percent ~ screen_time, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-61.468 -17.443   2.593  19.190  46.905 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 67.496166   2.563073  26.334   <2e-16 ***
screen_time -0.020871   0.008652  -2.412   0.0161 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.14 on 629 degrees of freedom
Multiple R-squared:  0.009168,  Adjusted R-squared:  0.007592 
F-statistic:  5.82 on 1 and 629 DF,  p-value: 0.01613
Squaring the (rounded) t-statistic for screen_time recovers the F-statistic:

(-2.412)^2
[1] 5.817744
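
As a quick check with the exact stored statistics rather than the rounded ones, assuming mod from above:

t_val <- coef(summary(mod))["screen_time", "t value"]
f_val <- anova(mod)["screen_time", "F value"]
all.equal(t_val^2, f_val)  # TRUE: t squared equals F in simple linear regression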

Application Exercise

  1. Open appex-08.qmd
  2. Using your data, predict battery percent from screen time
  3. What are the degrees of freedom for the Sum of Squares Total, Sum of Squares Model, and Sum of Squares Error?
  4. Calculate the following quantities: Sum of Squares Total, Sum of Squares Model, and Sum of Squares Error
  5. Calculate the F-statistic for the model and the p-value
  6. What is the null hypothesis? What is the alternative?