Partitioning Variability

Lucy D’Agostino McGowan

Partitioning variability

Example

ggplot(data, aes(x = battery_percent)) + 
  geom_histogram(bins = 30)

Example

ggplot(data, aes(x = battery_percent)) + 
  geom_histogram(bins = 30) + 
  # dashed vertical lines at the mean plus/minus one standard deviation
  geom_vline(
    xintercept = c(
      mean(data$battery_percent) + sd(data$battery_percent),
      mean(data$battery_percent) - sd(data$battery_percent)
    ),
    lty = 2
  )

Total variation in response y

\[SSTotal = \sum (y - \bar{y})^2\]

data %>%
  summarise(
    sstotal = 
      sum((______ - ______)^2)
    )

Total variation in response y

\[SSTotal = \sum (y - \bar{y})^2\]

data %>%
  summarise(
    sstotal = 
      sum((battery_percent - mean(battery_percent))^2)
    )
# A tibble: 1 × 1
  sstotal
    <dbl>
1 369840.

Total variation in response y

\[SSTotal = \sum (y - \bar{y})^2\]

data %>%
  summarise(
    sstotal = 
      var(battery_percent) * (n()-1)
    )
# A tibble: 1 × 1
  sstotal
    <dbl>
1 369840.
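
This works because the sample variance is the sum of squared deviations divided by \(n - 1\). As a quick sanity check, assuming the same data object:

all.equal(
  var(data$battery_percent) * (nrow(data) - 1),
  sum((data$battery_percent - mean(data$battery_percent))^2)
)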

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(battery_percent ~ screen_time, data = data)


data %>%
  summarise(
    sse = 
      sum((______ - _______)^2)
    )

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(battery_percent ~ screen_time, data = data)


data %>%
  summarise(
    sse = 
      sum((battery_percent - fitted(mod))^2)
    )
# A tibble: 1 × 1
      sse
    <dbl>
1 366449.

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(battery_percent ~ screen_time, data = data)


data %>%
  summarise(
    sse = 
      sum(residuals(mod)^2)
    )
# A tibble: 1 × 1
      sse
    <dbl>
1 366449.

Unexplained variation from the residuals

\[SSE = \sum (y - \hat{y})^2\]

mod <- lm(battery_percent ~ screen_time, data = data)


data %>%
  summarise(
    sse = 
      sigma(mod)^2 * (n() - 2)
    )
# A tibble: 1 × 1
      sse
    <dbl>
1 366449.
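
This works because sigma(mod) returns the residual standard error, whose square times the residual degrees of freedom recovers the SSE. As a quick sanity check, assuming mod from above:

all.equal(sigma(mod)^2 * df.residual(mod), sum(residuals(mod)^2))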

Variation explained by the model

\[SSModel = \sum (\hat{y}-\bar{y})^2\]

data %>%
  summarise(
    ssmodel = 
      sum((______ - ______)^2)
    )

Variation explained by the model

\[SSModel = \sum (\hat{y}-\bar{y})^2\]

data %>%
  summarise(
    ssmodel = 
      sum((fitted(mod) - mean(battery_percent))^2)
    )
# A tibble: 1 × 1
  ssmodel
    <dbl>
1   3391.

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2)
    )
# A tibble: 1 × 3
  sstotal ssmodel     sse
    <dbl>   <dbl>   <dbl>
1 369840.   3391. 366449.

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse
    )


What will this be?

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse
    )
# A tibble: 1 × 4
  sstotal ssmodel     sse `ssmodel + sse`
    <dbl>   <dbl>   <dbl>           <dbl>
1 369840.   3391. 366449.         369840.

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel
    )


What will this be?

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel
    )
# A tibble: 1 × 5
  sstotal ssmodel     sse `ssmodel + sse` `sstotal - ssmodel`
    <dbl>   <dbl>   <dbl>           <dbl>               <dbl>
1 369840.   3391. 366449.         369840.             366449.

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel,
    sstotal - sse
    )


What will this be?

Partitioning variability

data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel,
    sstotal - sse
    )
# A tibble: 1 × 6
  sstotal ssmodel     sse `ssmodel + sse` `sstotal - ssmodel` `sstotal - sse`
    <dbl>   <dbl>   <dbl>           <dbl>               <dbl>           <dbl>
1 369840.   3391. 366449.         369840.             366449.           3391.
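
The table shows the identity \(SSTotal = SSModel + SSE\) numerically; a minimal check with all.equal(), which tolerates floating-point error, makes it explicit:

# verify SSTotal = SSModel + SSE up to floating-point error
with(data, {
  sstotal <- sum((battery_percent - mean(battery_percent))^2)
  ssmodel <- sum((fitted(mod) - mean(battery_percent))^2)
  sse     <- sum(residuals(mod)^2)
  all.equal(sstotal, ssmodel + sse)
})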

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^n (y - \bar{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} (y - \bar{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \bar{y})^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \require{color}\colorbox{#86a293}{$\bar{y}$})^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \bar{y})^2\]

How many degrees of freedom?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = \sum_{i=1}^{n} (y - \bar{y})^2\]

\[\Large df_{SSTotal}=n-1\]

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - \hat{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} (y - \hat{y})^2\]

How many observations?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - \hat{y})^2\]

How is \(\hat{y}\) estimated with simple linear regression?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

How is \(\hat{y}\) estimated with simple linear regression?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\require{color}\colorbox{#86a293}{$\hat{\beta}_0$}+\colorbox{#86a293}{$\hat{\beta}_1$}x))^2\]

How many things are “estimated”?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

How many degrees of freedom?

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]

\[\Large df_{SSE} = n - 2\]

Degrees of freedom

  • The number of observations used to estimate the statistic minus the number of things you are estimating

\[SSTotal = SSModel + SSE\]

\[df_{SSTotal} = df_{SSModel} + df_{SSE} \]

\[n - 1 = df_{SSModel} + (n - 2)\]

Application Exercise

How many degrees of freedom does SSModel have?

\[n - 1 = df_{SSModel} + (n - 2)\]


Mean squares

\[MSE = \frac{SSE}{n - 2}\]

\[MSModel = \frac{SSModel}{1}\]

What is the pattern?

\[\Large F = \frac{MSModel}{MSE}\]
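
Putting the pieces together, here is a minimal sketch that computes the F statistic by hand from the sums of squares above, reusing mod and data from the earlier slides:

data %>%
  summarise(
    msmodel = sum((fitted(mod) - mean(battery_percent))^2) / 1,  # MSModel = SSModel / 1
    mse     = sum(residuals(mod)^2) / (n() - 2),                 # MSE = SSE / (n - 2)
    f_stat  = msmodel / mse                                      # F = MSModel / MSE
  )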

F-distribution

Under the null hypothesis

f <- data.frame(
  stat = rf(n = 10000, df1 = 1, df2 = 629)
)

ggplot(f) + 
  geom_histogram(aes(stat), bins = 40) + 
  labs(x = "F Statistic")
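
The histogram approximates the F(1, 629) distribution by simulation. As a sketch, you could overlay the theoretical density to check the agreement (after_stat() assumes ggplot2 3.4 or later):

ggplot(f, aes(stat)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 40) + 
  stat_function(fun = stats::df, args = list(df1 = 1, df2 = 629)) +  # theoretical F density
  labs(x = "F Statistic")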

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the SSModel?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the MSModel?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the SSE?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the MSE?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the SSTotal?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the F statistic?

Example

We can see all of these statistics by using the anova() function on the output of lm()

mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Is the F-statistic statistically significant?

p-value

The probability of getting a statistic as extreme as or more extreme than the observed test statistic, given that the null hypothesis is true

F-Distribution

Under the null hypothesis

ggplot(f) + 
  geom_histogram(aes(stat), bins = 40) + 
  labs(x = "F Statistic")

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = ?\)

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = 630\)

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = 630\)
  • \(df_{SSE} = ?\)

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = 630\)
  • \(df_{SSE} = n - 2 = 629\)

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = 630\)
  • \(df_{SSE} = n - 2 = 629\)
  • \(df_{SSModel} = ?\)

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = 630\)
  • \(df_{SSE} = n - 2 = 629\)
  • \(df_{SSModel} = 630 - 629 = 1\)
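
These degrees of freedom can also be read straight off the fitted model; a quick check, assuming mod and data from earlier:

nrow(data) - 1                        # df for SSTotal: 630
df.residual(mod)                      # df for SSE: 629
(nrow(data) - 1) - df.residual(mod)   # df for SSModel: 1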

Example

To calculate the p-value under the t-distribution we use pt(). What do you think we use to calculate the p-value under the F-distribution?

anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • pf()
  • it takes 3 arguments: q, df1, and df2. What do you think we would plug in for q?

Degrees of freedom

  • \(n = 631\)
  • \(df_{SSTotal} = 630\)
  • \(df_{SSE} = n - 2 = 629\) (this is df2)
  • \(df_{SSModel} = 630 - 629 = 1\) (this is df1)

Example

To calculate the p-value under the t-distribution we use pt(). What do you think we use to calculate the p-value under the F-distribution?

anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pf(5.8199, 1, 629, lower.tail = FALSE)
[1] 0.01613098
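
Rather than retyping the rounded statistic, you can pull the exact value out of the anova table; a small sketch, assuming mod from above:

f_val <- anova(mod)["screen_time", "F value"]     # exact F statistic
pf(f_val, df1 = 1, df2 = 629, lower.tail = FALSE)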

Example

Why don’t we multiply this p-value by 2 when we use pf()?

anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pf(5.8199, 1, 629, lower.tail = FALSE)
[1] 0.01613098

F-Distribution

Under the null hypothesis

ggplot(f) + 
  geom_histogram(aes(stat), bins = 40) + 
  labs(x = "F Statistic")

F-Distribution

Under the null hypothesis

f$shaded <- ifelse(f$stat > 5.8199, TRUE, FALSE)

ggplot(f) + 
  geom_histogram(aes(stat, fill = shaded), bins = 40) + 
  geom_vline(xintercept = 5.8199, lwd = 1.5) +
  labs(x = "F Statistic") +
  theme(legend.position = "none")

  • We observed an F-statistic of 5.8199
  • Are there any negative values in an F-distribution?

F-Distribution

Under the null hypothesis

f$shaded <- ifelse(f$stat > 5.8199, TRUE, FALSE)

ggplot(f) + 
  geom_histogram(aes(stat, fill = shaded), bins = 40) + 
  geom_vline(xintercept = 5.8199, lwd = 1.5) +
  labs(x = "F Statistic") +
  theme(legend.position = "none")

  • The p-value counts values “as extreme or more extreme” than the observed statistic. In the t-distribution, “more extreme” values, defined as farther from 0, can be positive or negative. Not so for the F: its values are never negative, so the only tail is the upper one.

Example

anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod)

Call:
lm(formula = battery_percent ~ screen_time, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-61.468 -17.443   2.593  19.190  46.905 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 67.496166   2.563073  26.334   <2e-16 ***
screen_time -0.020871   0.008652  -2.412   0.0161 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.14 on 629 degrees of freedom
Multiple R-squared:  0.009168,  Adjusted R-squared:  0.007592 
F-statistic:  5.82 on 1 and 629 DF,  p-value: 0.01613
  • Notice the p-value for the F-test is the same as the p-value for the \(\hat\beta_1\) t-test
  • This is always true for simple linear regression (with just one \(x\) variable)

What is the F-test testing?

anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • null hypothesis: the fit of the intercept-only model (with \(\hat\beta_0\) only) and your model (\(\hat\beta_0 + \hat\beta_1x\)) are equivalent (illustrated in the sketch below)
  • alternative hypothesis: the fit of the intercept-only model is significantly worse than the fit of your model
  • When we only have one variable in our model, \(x\), the p-values from the F and t tests are going to be equivalent
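
One way to see this null hypothesis directly is to fit the intercept-only model yourself and compare the two fits with anova(); this sketch should reproduce the F statistic and p-value from the table above:

mod0 <- lm(battery_percent ~ 1, data = data)  # intercept-only model
anova(mod0, mod)                              # same F (5.8199) and p-value (0.01613)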

Relating the F and the t

anova(mod)
Analysis of Variance Table

Response: battery_percent
             Df Sum Sq Mean Sq F value  Pr(>F)  
screen_time   1   3391  3390.6  5.8199 0.01613 *
Residuals   629 366449   582.6                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod)

Call:
lm(formula = battery_percent ~ screen_time, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-61.468 -17.443   2.593  19.190  46.905 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 67.496166   2.563073  26.334   <2e-16 ***
screen_time -0.020871   0.008652  -2.412   0.0161 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.14 on 629 degrees of freedom
Multiple R-squared:  0.009168,  Adjusted R-squared:  0.007592 
F-statistic:  5.82 on 1 and 629 DF,  p-value: 0.01613
Squaring the (rounded) t-statistic for screen_time recovers the F-statistic:

(-2.412)^2
[1] 5.817744
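
As a quick check with the exact stored statistics rather than the rounded ones, assuming mod from above:

t_val <- coef(summary(mod))["screen_time", "t value"]
f_val <- anova(mod)["screen_time", "F value"]
all.equal(t_val^2, f_val)  # TRUE: t squared equals F in simple linear regression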

Application Exercise

  1. Open appex-08.qmd
  2. Using your data, predict battery percent from screen time
  3. What are the degrees of freedom for the Sum of Squares Total, Sum of Squares Model, and Sum of Squares Error?
  4. Calculate the following quantities: Sum of Squares Total, Sum of Squares Model, and Sum of Squares Error
  5. Calculate the F-statistic for the model and the p-value
  6. What is the null hypothesis? What is the alternative?