\[SSTotal = \sum (y - \bar{y})^2\]
\[SSE = \sum (y - \hat{y})^2\]
\[SSModel = \sum (\hat{y}-\bar{y})^2\]
What will this be?
data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel
  )
# A tibble: 1 × 5
sstotal ssmodel sse `ssmodel + sse` `sstotal - ssmodel`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 369840. 3391. 366449. 369840. 366449.
What will this be?
data %>%
  summarise(
    sstotal = sum((battery_percent - mean(battery_percent))^2),
    ssmodel = sum((fitted(mod) - mean(battery_percent))^2),
    sse = sum(residuals(mod)^2),
    ssmodel + sse,
    sstotal - ssmodel,
    sstotal - sse
  )
# A tibble: 1 × 6
sstotal ssmodel sse `ssmodel + sse` `sstotal - ssmodel` `sstotal - sse`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 369840. 3391. 366449. 369840. 366449. 3391.
\[SSTotal = \sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} (y - \bar{y})^2\]
How many observations?
\[SSTotal = \sum_{i=1}^{n} (y - \require{color}\colorbox{#86a293}{$\bar{y}$})^2\]
How many things are “estimated”?
\[SSTotal = \sum_{i=1}^{n} (y - \bar{y})^2\]
How many degrees of freedom?
\[\Large df_{SSTOTAL}=n-1\]
\[SSE = \sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} (y - \hat{y})^2\]
How many observations?
\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]
How is \(\hat{y}\) estimated with simple linear regression?
\[SSE = \sum_{i=1}^{n} (y - (\require{color}\colorbox{#86a293}{$\hat{\beta}_0$}+\colorbox{#86a293}{$\hat{\beta}_1$}x))^2\]
How many things are “estimated”?
\[SSE = \sum_{i=1}^{n} (y - (\hat{\beta}_0+\hat{\beta}_1x))^2\]
How many degrees of freedom?
\[\Large df_{SSE} = n - 2\]
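A quick numeric check of these degrees-of-freedom formulas, using the n = 631 observations behind the slide output (the Residuals Df of 629 in the anova() table implies n - 2 = 629):

```r
# Degrees of freedom for the slide data set (n = 631 observations).
n <- 631
df_sstotal <- n - 1  # 630
df_sse     <- n - 2  # 629, the Residuals Df in the anova() output
```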
\[SSTotal = SSModel + SSE\]
\[df_{SSTotal} = df_{SSModel} + df_{SSE} \]
\[n - 1 = df_{SSModel} + (n - 2)\]
Application Exercise
How many degrees of freedom does SSModel have?
\[n - 1 = df_{SSModel} + (n - 2)\]
\[MSE = \frac{SSE}{n - 2}\]
\[MSModel = \frac{SSModel}{1}\]
What is the pattern?
\[\Large F = \frac{MSModel}{MSE}\]
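As a sketch, we can rebuild the F statistic by hand from the sums of squares computed earlier (3390.6 for the model, 366449 for the error, with 629 residual degrees of freedom):

```r
# Recompute the F statistic from the sums of squares shown earlier.
ssmodel <- 3390.6   # SSModel, df = 1
sse     <- 366449   # SSE, df = n - 2 = 629

msmodel <- ssmodel / 1    # MSModel = SSModel / df_SSModel
mse     <- sse / 629      # MSE = SSE / df_SSE

f_stat <- msmodel / mse
f_stat  # approximately 5.82, matching the anova() output
```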
Under the null hypothesis
We can see all of these statistics by using the anova() function on the output of lm().
Analysis of Variance Table
Response: battery_percent
Df Sum Sq Mean Sq F value Pr(>F)
screen_time 1 3391 3390.6 5.8199 0.01613 *
Residuals 629 366449 582.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
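A minimal sketch of the call that produces a table like the one above. The data here are simulated stand-ins for the course data set (the variable names match the slides, but the numbers will differ):

```r
# Simulated stand-in for the course data (n = 631, names as in the slides).
set.seed(1)
data <- data.frame(screen_time = runif(631, 0, 600))
data$battery_percent <- 67.5 - 0.02 * data$screen_time + rnorm(631, sd = 24)

# Fit the simple linear regression and request the ANOVA table.
mod <- lm(battery_percent ~ screen_time, data = data)
anova(mod)  # columns: Df, Sum Sq, Mean Sq, F value, Pr(>F)
```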
What is the SSModel?
What is the MSModel?
What is the SSE?
What is the MSE?
What is the SSTotal?
What is the F statistic?
Is the F-statistic statistically significant?
The probability of getting a statistic as extreme as or more extreme than the observed test statistic, given that the null hypothesis is true
Under the null hypothesis
To calculate the p-value under the t-distribution we use pt(). What do you think we use to calculate the p-value under the F-distribution?
Analysis of Variance Table
Response: battery_percent
Df Sum Sq Mean Sq F value Pr(>F)
screen_time 1 3391 3390.6 5.8199 0.01613 *
Residuals 629 366449 582.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pf() takes the arguments q, df1, and df2. What do you think we would plug in for q? For df1? For df2?
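As a sketch, plugging the values from the table above into the upper tail of the F distribution reproduces the reported p-value:

```r
# Upper-tail area of the F(1, 629) distribution at the observed statistic.
# Values taken from the anova() output: F = 5.8199, df1 = 1, df2 = 629.
pf(5.8199, df1 = 1, df2 = 629, lower.tail = FALSE)
# approximately 0.01613, matching Pr(>F) in the table
```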
Why don’t we multiply this p-value by 2 when we use pf()?
Under the null hypothesis
Analysis of Variance Table
Response: battery_percent
Df Sum Sq Mean Sq F value Pr(>F)
screen_time 1 3391 3390.6 5.8199 0.01613 *
Residuals 629 366449 582.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = battery_percent ~ screen_time, data = data)
Residuals:
Min 1Q Median 3Q Max
-61.468 -17.443 2.593 19.190 46.905
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.496166 2.563073 26.334 <2e-16 ***
screen_time -0.020871 0.008652 -2.412 0.0161 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 24.14 on 629 degrees of freedom
Multiple R-squared: 0.009168, Adjusted R-squared: 0.007592
F-statistic: 5.82 on 1 and 629 DF, p-value: 0.01613
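The two outputs agree: in simple linear regression the slope’s t-test and the model F-test are the same test, so squaring the t statistic from summary() gives the F statistic from anova() (and the p-values, 0.0161 and 0.01613, match):

```r
# Slope estimate and standard error from the summary() output above.
t_stat <- -0.020871 / 0.008652  # the reported t value, about -2.412
t_stat^2                        # approximately 5.82, the F statistic
```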
Application Exercise
appex-08.qmd