Summarizing data, a Review

Lucy D’Agostino McGowan

Learning objectives

Recall how to summarize one continuous variable
Identify variables where a mean is a good summary measure (or not)
Explain why we summarize data (what is the big picture?)

One continuous variable

How can we visualize a single continuous variable?

Histogram

library(tidyverse)  # loads ggplot2, dplyr (starwars, %>%), tidyr (drop_na), tibble

starwars %>%
  drop_na(height) %>%
  ggplot(aes(x = height)) +
  geom_histogram(bins = 30, fill = "#86a293")

Density

starwars %>%
  drop_na(height) %>%
  ggplot(aes(x = height)) +
  geom_density(color = "#86a293")

Boxplot

starwars %>%
  drop_na(height) %>%
  ggplot(aes(x = height, y = 1)) +
  geom_boxplot(outlier.shape = NA, color = "#86a293") + 
  geom_jitter(color = "#86a293") + 
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

One continuous variable

How can we numerically summarize a single continuous variable?

starwars %>%
  summarise(mean = mean(height, na.rm = TRUE))

# A tibble: 1 × 1
   mean
  <dbl>
1  174.

One continuous variable

library(geomtextpath)
starwars %>%
  drop_na(height) %>%
  ggplot(aes(x = height)) +
  geom_histogram(bins = 30, fill = "#86a293") +
  geom_textvline(xintercept = 174, 
                 size = 6,        # text size of the label
                 linewidth = 2,   # thickness of the vertical line
                 label = "mean = 174",
                 hjust = 0.25)

One continuous variable

Why do we calculate a mean?

To reduce the dimensionality of the data (from $n$ values to 1)
To get a sense of a “typical” observation
- When is this an accurate representation?

Meaningful means

Symmetric

set.seed(1)

d1 <- tibble(x = rnorm(1000, mean = 10))
ggplot(d1, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293")

Bimodal

d2 <- tibble(x = c(rnorm(500, mean = 10),
                   rnorm(500, mean = 20)))
ggplot(d2, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293")

Skewed

d3 <- tibble(x = rbeta(1000, 2, 5))
ggplot(d3, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293")

Guess the mean for each of these variables.

Meaningful means

Symmetric

ggplot(d1, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293") + 
  geom_vline(xintercept = mean(d1$x), lwd = 2)

Bimodal

ggplot(d2, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293") + 
  geom_vline(xintercept = mean(d2$x), lwd = 2)

Skewed

ggplot(d3, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293") + 
  geom_vline(xintercept = mean(d3$x), lwd = 2)

Does this value represent a “typical” observation?
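
One rough way to put a number on "typical" is to ask what share of the observations falls close to the mean. The sketch below is not part of the original slides: the helper `share_near_mean()` and the cutoff of 1 unit are assumptions chosen for illustration, and the data are simulated afresh rather than reusing `d1` and `d2`.

```r
# Hypothetical check: what share of observations lies within 1 unit of the mean?
set.seed(1)

symmetric <- rnorm(1000, mean = 10)
bimodal   <- c(rnorm(500, mean = 10), rnorm(500, mean = 20))

# Proportion of values within `width` units of the mean of x
# (helper name and default width are assumptions for this sketch)
share_near_mean <- function(x, width = 1) {
  mean(abs(x - mean(x)) < width)
}

share_symmetric <- share_near_mean(symmetric)
share_bimodal   <- share_near_mean(bimodal)

print(c(symmetric = share_symmetric, bimodal = share_bimodal))
```

For the symmetric variable most observations sit near the mean, while for the bimodal variable the mean lands in the gap between the two peaks, where almost no observations fall.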

Math speak

\[\Large\bar{x} =\sum_{i=1}^n \frac{x_i}{n}\]

Math speak

\[\Large{\require{color}\colorbox{#86a293}{$\bar{x}$}} =\sum_{i=1}^n \frac{x_i}{n}\]

the mean of the variable $x$

Math speak

\[\Large\bar{x} ={\require{color}\colorbox{#86a293}{$\sum$}}_{i=1}^n \frac{x_i}{n}\]

add up the observations

Math speak

\[\Large\bar{x} =\sum_{{\require{color}\colorbox{#86a293}{$i=1$}}}^n \frac{x_i}{n}\]

starting from the first observation ($i = 1$)

Math speak

\[\Large\bar{x} =\sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} \frac{x_i}{{\require{color}\colorbox{#86a293}{$n$}}}\]

total number of observations

Math speak

\[\Large\bar{x} =\sum_{i=1}^n \frac{{\require{color}\colorbox{#86a293}{$x_i$}}}{n}\]

the value of the continuous variable for observation $i$

Math speak

\[\Large\bar{x} =\sum_{i=1}^n \frac{x_i}{\require{color}\colorbox{#86a293}{${n}$}}\]

divide by the total number of observations
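
The formula can be translated into R almost symbol by symbol. This is a minimal sketch with hypothetical values (the vector below is not from the slides), comparing the hand-written sum against the built-in `mean()`.

```r
# Hypothetical values chosen for illustration (not from the slides)
x <- c(2, 4, 9)

n <- length(x)        # total number of observations
x_bar <- sum(x / n)   # add up x_i / n for i = 1, ..., n

x_bar                 # the mean computed from the formula
mean(x)               # the built-in function gives the same value
```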

`Application Exercise`

data
$x_1$	3
$x_2$	5
$x_3$	1
$x_4$	7
$x_5$	8

Using the data above, what is $n$?
What is $\bar{x}$?

Data = model + error

Data

d <- tibble(
  i = 1:5,
  x = c(3, 5, 1, 7, 8),
  model = mean(x),
  error = x - model
) 

knitr::kable(d)

i	x	model	error
1	3	4.8	-1.8
2	5	4.8	0.2
3	1	4.8	-3.8
4	7	4.8	2.2
5	8	4.8	3.2
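
As a check on the table above, the model column and the first error can be worked out by hand from the five observations:

\[\bar{x} = \frac{3 + 5 + 1 + 7 + 8}{5} = \frac{24}{5} = 4.8\]

\[x_1 - \bar{x} = 3 - 4.8 = -1.8\]

The five errors add up to zero, which is always the case when the model is the mean:

\[-1.8 + 0.2 - 3.8 + 2.2 + 3.2 = 0\]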

Data

ggplot(d, aes(x = 1, y = x)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$x), label = "mean = 4.8") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Data

ggplot(d, aes(x = i, y = x)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$x), label = "mean = 4.8") + 
  geom_segment(aes(y = x, yend = mean(x), x = i, xend = i), color = "blue") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Math speak

\[\Large x = \beta_0 + \varepsilon\]

Math speak

\[\Large {\require{color}\colorbox{#86a293}{$x$}} = \beta_0 + \varepsilon\]

This is the vector $x=\{x_1,\dots,x_n\}$

Math speak

\[\Large x = {\require{color}\colorbox{#86a293}{$\beta_0$}} + \varepsilon\]

We call this the “intercept”. When there are no other variables, it is just the mean, $\bar{x}$.

Math speak

\[\Large x = \beta_0 + {\require{color}\colorbox{#86a293}{$\varepsilon$}}\]

the error
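
Written one observation at a time, the same statement reads as follows. The subscript on $\varepsilon$ is notation assumed here, mirroring the $x_i$ used in the formula for the mean:

\[\Large x_i = \beta_0 + \varepsilon_i, \quad i = 1, \dots, n\]

Rearranging shows that each error is the gap between an observation and the intercept:

\[\Large \varepsilon_i = x_i - \beta_0\]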

Data

ggplot(d, aes(x = i, y = x)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$x), size = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_segment(aes(y = x, yend = mean(x), x = i, xend = i), color = "blue") +
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Data

ggplot(d, aes(x = i, y = x)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$x), size = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_textsegment(aes(y = x, yend = mean(x), x = i, xend = i), color = "blue",
                   label = as.character(expression(epsilon)), parse = TRUE,
                   size = 5) +
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Data

ggplot(d, aes(x = 1, y = x)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$x), size = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_segment(aes(y = x, yend = mean(x), x = 1, xend = 1), color = "blue") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Data

ggplot(d, aes(x = 1, y = x)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$x), size = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_segment(aes(y = x, yend = mean(x), x = 1, xend = 1), color = "blue") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank())

d2 <- d[2:4]
names(d2) <- c("$\\mathbf{x}$", "$\\beta_0$", "$\\varepsilon$")

knitr::kable(d2)

$\mathbf{x}$	$\beta_0$	$\varepsilon$
3	4.8	-1.8
5	4.8	0.2
1	4.8	-3.8
7	4.8	2.2
8	4.8	3.2

Calculating the mean in R

summarise(d, mean_x = mean(x))

# A tibble: 1 × 1
  mean_x
   <dbl>
1    4.8

lm(x ~ 1, data = d)


Call:
lm(formula = x ~ 1, data = d)

Coefficients:
(Intercept)  
        4.8

“intercept only model”
lm: linear model
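
To see that the two approaches agree, the pieces of the fitted model can be pulled out and compared with the mean directly. This is a small sketch using the five values from the slides. The object names `fit`, `beta_0`, and `errors` are chosen here for illustration.

```r
# Data from the slides
x <- c(3, 5, 1, 7, 8)

# Intercept-only linear model: x = beta_0 + error
fit <- lm(x ~ 1)

beta_0 <- unname(coef(fit))       # estimated intercept
errors <- unname(residuals(fit))  # one error per observation

all.equal(beta_0, mean(x))        # intercept matches the mean
all.equal(errors, x - mean(x))    # residuals match x minus the mean
```

`coef()` and `residuals()` are the standard extractor functions for `lm` objects; `unname()` only drops the labels so the values compare cleanly.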

`Application Exercise`

Open your 04-appex.qmd file. Load the packages by running the top R chunk of code.

Copy the code below into an R chunk at the bottom of the file:

d <- tibble(
  x = c(3, 5, 1, 7, 8)
)

What do you think this code does? Try typing ?tibble in the Console - what does this function do?

Calculate the mean of x. Do this two ways: using the summarise function and using the lm function.
Add a new variable called error to the data set d that is equal to x minus the mean of x.

Recap

When is the mean an appropriate summary measure to calculate?

What assumptions need to be true in order to use a mean to represent your single continuous variable?
