Simulation and the bootstrap

packages

library(tidyverse)

Simulation

Sometimes you know the exact mathematical answer to a problem, eg:
- \(X_1, \ldots X_n \sim N(\mu, \sigma^2)\), what is distribution of \(\bar{X}\)? (Ans: \(N(\mu, \sigma^2/n)\).)
More often, though, you don’t:
- if \(X \sim Bin(2, 0.5), Y \sim Bin(3, 0.2)\), what is dist of \(Z = X+Y\)?
Simulation: generate random \(X\) and \(Y\), calculate sum, repeat many times. Gives you (approx) dist of \(X+Y\) without any mathematics!

Random numbers in R

R knows about a lot of distributions, eg: norm binom pois exp gamma t chisq (type ?distributions in Console to see more)
to generate random numbers from a distribution, put r on front of these; inputs are number of random values to generate, and parameters of distribution to simulate from.
Examples:

rnorm(5, 100, 15)

[1] 111.96072  89.80331 113.74231  90.74167  89.86461

and

rpois(10, 3.5)

 [1] 5 2 5 5 3 5 1 3 4 0

Our problem: simulating once

if \(X \sim Bin(2, 0.5), Y \sim Bin(3, 0.2)\), what is dist of \(Z = X+Y\)?

x <- rbinom(1, 2, 0.5)
y <- rbinom(1, 3, 0.2)
x

[1] 1

[1] 1

x + y

[1] 2

To simulate many times: - set up dataframe with space for each simulated value - work rowwise - do one simulation per row

Simulating many times

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(x = rbinom(1, 2, 0.5),
         y = rbinom(1, 3, 0.2),
         z = x + y) -> d

Results

Distribution of sum

Make a bar chart rather than a histogram because distribution of \(Z\) is discrete:

ggplot(d, aes(x = z)) + geom_bar()

(Simulated) probability that the sum is at least 4:

d %>% count(z >= 4)

Only this much:

285/10000

[1] 0.0285

A sum of 5 is possible though very unlikely.

The bootstrap

Source: Hesterberg et al

Sampling distribution of a statistic is distribution of that statistic over “all possible samples” from population of interest.
“Plug-in principle”: sample mean estimates population mean, sample variance estimates population variance, etc.
Also, sample is estimate of population (precisely, proportion of sample values \(\le x\) estimates probability of drawing value \(\le x\) from population, for any \(x\)).
As long as your sample is representative, sampling from the sample (!) is an estimate of sampling from the population. Called a bootstrap sample.
Sample from sample with replacement, or else you get original sample back.

Blue Jays attendances:

ggplot(jays, aes(sample = attendance)) + 
  stat_qq() + stat_qq_line()

\(t\) procedure for the mean may not be a good idea because the distribution is skewed.
Previously: hand-waving with sample size.

What actually matters

It’s not the distribution of the data that has to be approx normal (for a \(t\) procedure).
What matters is the sampling distribution of the sample mean.
If the sample size is large enough, the sampling distribution will be normal enough even if the data distribution is not.
- This is why we had to consider the sample size as well as the shape.
But how do we know whether this is the case or not? We only have one sample.
Use the bootstrap to simulate sampling distribution.

Simulating the sampling distribution of sample statistic

Sample from our sample with replacement.
Calculate statistic
Repeat many times (simulation).
This gives an idea of how our statistic might vary in repeated samples: that is, its sampling distribution.
Called the bootstrap distribution of the statistic.
If the bootstrap distribution is approx normal, infer that the true sampling distribution also approx normal, therefore inference about the mean such as \(t\) is good enough.
If not, we should be more careful.

Bootstrapping the Blue Jays attendances

Sampling with replacement is done like this (the default sample size is as long as the original data):

boot <- sample(jays$attendance, replace=TRUE)
mean(boot)

[1] 23764.76

That’s one bootstrapped mean. We need a whole bunch.

Comparing the actual sample with the bootstrapped one

sort(jays$attendance)

 [1] 14184 14433 15062 15086 15168 15606 16402 17264 17276 18581
[11] 19014 19217 21195 21312 21397 21519 29306 30430 33086 34743
[21] 37929 42419 42917 44794 48414

sort(boot)

 [1] 14184 14433 14433 15086 15086 15168 15168 16402 17264 17264
[11] 17276 17276 17276 17276 18581 19014 21312 21519 33086 37929
[21] 42419 42419 42917 42917 48414

Bootstrap sample has repeats plus missing values from original sample.

A whole bunch

We are now doing a simulation. I like 10,000 samples when testing for normality:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(boot_sample = list(sample(jays$attendance, replace = TRUE))) %>% 
  mutate(my_mean = mean(boot_sample)) -> samples

for each row:
- obtain a bootstrap sample (list because we are saving the whole sample in one cell of the dataframe)
- work out the mean of that bootstrap sample.

Bootstrap sample means

samples

Sampling distribution of sample mean

ggplot(samples, aes(x=my_mean)) + geom_histogram(bins=10)

Is that a slightly long right tail?

Normal quantile plot

might be better than a histogram:

ggplot(samples, aes(sample = my_mean)) + 
  stat_qq()+stat_qq_line()

a very very slight right-skewness, but very close to normal.
hence the \(t\)-test is fine for the Blue Jays attendances.

Kids learning to read

Normal quantile plots, one for each sample:

ggplot(kids, aes(sample = score)) + 
  stat_qq() + stat_qq_line() +
  facet_wrap(~ group)

These both look close to normal.

Control group

Pull out control group children
Obtain bootstrap sampling distribution of scores

kids %>% filter(group == "c") -> controls
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(controls$score, 
                                 replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) -> samples1

Bootstrap sample means

samples1

The bootstrap sampling distribution

ggplot(samples1, aes(sample = my_mean)) +
  stat_qq() + stat_qq_line()

Very close to normal.

Same, for treatment group

kids %>% filter(group == "t") -> treated
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(treated$score, 
                                 replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) -> samples2

The bootstrap sampling distribution

ggplot(samples2, aes(sample = my_mean)) +
  stat_qq() + stat_qq_line()

Very slightly left-skewed, but close to normal. Not a problem.

Pain relief

With matched pairs, assumption is of normality of differences:

pain %>% mutate(diff = druga - drugb) -> pain
pain

Bootstrap sampling distribution of differences

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(pain$diff, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) -> samples

Result

samples

Assess it for normality

ggplot(samples, aes(sample = my_mean)) +
  stat_qq() + stat_qq_line()

Very skewed to the left (because of low outlier)
Matched pairs \(t\) not to be trusted at all.

Histogram with many bins

ggplot(samples, aes(x = my_mean)) + geom_histogram(bins = 60)

actually a very multimodal distribution: one mode for each time the low outlier appears in the bootstrap sampling distribution (can be none at all up to several times).