Topics in Scientific and Statistical Computing
Likelihood: a (relative) measure of how well our observations fit a specific hypothesis or model
known parameter, variable data
We know that the oceans cover 70% of the Earth’s surface. If we were to sample 10 random GPS coordinates on Earth, what’s the probability of getting 5 ocean points?
What probability distribution should we use? Which function in R?
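This is binomial sampling, so we can use `dbinom`. A one-line check with the known \(p = 0.7\):

```r
## probability of exactly 5 ocean points in 10 trials, with p = 0.7
dbinom(5, size = 10, prob = 0.7) # ~ 0.103
```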
If we repeat this calculation for every possible outcome of ocean points (out of 10), we can visualize the probability distribution, for a known parameter (\(p\)).
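A minimal sketch, building `ocean_p` as the probability of every outcome from 0 to 10:

```r
## probability of each possible number of ocean points out of 10
ocean_p <- dbinom(0:10, size = 10, prob = 0.7)

## visualize the probability mass function
plot(0:10, ocean_p, type = "h",
     xlab = "number of ocean points", ylab = "probability")
```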
If we sum the `ocean_p` vector, what value should we get?
Reminder: this is a probability mass function.
unknown parameter, fixed data
(the usual case)
Imagine you are a Martian observing Earth for the first time, and you want to estimate the proportion of ocean on the planet from a series of random landings with your space probe.
Out of 20 trials, you made 12 ocean landings.
What is \(\mathcal{L}(p = 0.7 \,|\, \text{ocean} = 12, \text{trials} = 20)\), the likelihood of \(p = 0.7\) given the data?
Will these probabilities sum to 1?
Which value of \(p\) corresponds to the maximum likelihood?
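A sketch of the grid computation; the object name `likelihoods` and the grid step of 0.01 are assumptions consistent with the summaries printed below:

```r
library(tidyverse)

likelihoods <- tibble(
  p_grid = seq(0, 1, by = 0.01),                 # candidate values of p
  p_lik  = dbinom(12, size = 20, prob = p_grid)  # likelihood of each candidate
)

summary(likelihoods)
likelihoods %>% filter(p_lik == max(p_lik))      # p with the maximum likelihood
```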
p_grid p_lik
Min. :0.00 Min. :0.000e+00
1st Qu.:0.25 1st Qu.:2.979e-05
Median :0.50 Median :8.532e-03
Mean :0.50 Mean :4.715e-02
3rd Qu.:0.75 3rd Qu.:9.113e-02
Max. :1.00 Max. :1.797e-01
# A tibble: 1 × 2
p_grid p_lik
<dbl> <dbl>
1 0.6 0.180
The following slides are based on a lab by Maurice Goodman.
\[ \begin{aligned} Y &= \alpha + \beta X + \epsilon \\ \epsilon &\sim \text{Normal}(0, \sigma^2) \end{aligned} \]
\[ \begin{aligned} Y &\sim \text{Normal}(\mu, \sigma^2) \\ \mu &= \alpha + \beta X \end{aligned} \]
Three parameters: the intercept \(\alpha\), the slope \(\beta\), and the error standard deviation \(\sigma\).
a <- 1 ## intercept
b <- 2 ## slope
sigma <- 1 ## error standard deviation
n <- 50 ## sample size
## simulate data
set.seed(13)
sim_data <- data.frame(x = runif(n, min = -3, max = 3))
sim_data$y <- rnorm(n, mean = a + b*sim_data$x, sd = sigma)
## Look at the data structure
head(sim_data)
x y
1 1.2619347 3.673224
2 -1.5231762 -3.505669
3 -0.6621933 -2.351430
4 -2.4516980 -4.960354
5 2.7723873 5.816631
6 -2.9344000 -4.877011
Likelihood function:
\[ \mathcal{L}(\theta) = f_X(x|\theta) = \prod_{i = 1}^n f_X(x_i | \theta) \]
Log-likelihood function:
\[ l(\theta) = \ln (\mathcal{L}(\theta)) = \sum_{i = 1}^n \ln [f_X(x_i | \theta)] \]
We want to maximize the log-likelihood, which is equivalent to minimizing the negative log-likelihood.
\[ \begin{aligned} Y &= \alpha + \beta X + \epsilon \\ \epsilon &\sim \text{Normal}(0, \sigma^2) \end{aligned} \]
becomes:
\[ Y \sim \text{Normal}(\underbrace{\alpha + \beta X}_{\text{mean}}, \underbrace{\sigma^2}_{\text{variance}}) \]
We can express the negative log-likelihood function in R as:
-sum(dnorm(y, mean = ..., sd = ..., log = TRUE))
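The `...` placeholders stand for the model’s mean and standard deviation. A minimal complete version is sketched below; parameterizing \(\sigma\) on the log scale (matching the `log_sigma` parameter in the output that follows) lets `optim` search unconstrained, and the starting values are an assumption:

```r
## negative log-likelihood for the linear model
## par = c(a, b, log_sigma); sigma is estimated on the log scale
lm_nll <- function(par, x, y) {
  -sum(dnorm(y, mean = par[1] + par[2] * x, sd = exp(par[3]), log = TRUE))
}

## minimize the negative log-likelihood
## (the Hessian is kept for standard errors later)
fit <- optim(
  par = c(a = 0, b = 0, log_sigma = 0), # starting values (assumed)
  fn = lm_nll,
  x = sim_data$x,
  y = sim_data$y,
  hessian = TRUE
)
```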
`optim`izing the function gives the following:

$par
a b log_sigma
0.8269008 2.0182844 -0.1061237
$value
[1] 65.64576
$counts
function gradient
140 NA
$convergence
[1] 0
$message
NULL
$hessian
a b log_sigma
a 61.82268539 4.243509988 -0.037625973
b 4.24350999 176.478986962 -0.002715066
log_sigma -0.03762597 -0.002715066 100.020193344
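The `(Intercept)` and `x` estimates below are the least-squares coefficients for comparison, presumably from something like:

```r
coef(lm(y ~ x, data = sim_data)) # least-squares estimates for comparison
```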
(Intercept) x
0.8265966 2.0182841
a_seq <- seq(-2, 4, length.out = 100)
b_seq <- seq(-1, 5, length.out = 100)
### data frame containing all combinations of the intercept and slope
par_grid <- expand.grid("a" = a_seq, "b" = b_seq)
### loop over data frame, store negative log likelihood
for (i in 1:nrow(par_grid)) {
  par_grid$nll[i] <- lm_nll(
    c(a = par_grid$a[i], b = par_grid$b[i], fit$par[3]),
    x = sim_data$x,
    y = sim_data$y
  )
}
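One way to visualize the resulting surface (a sketch using ggplot2; the aesthetic choices are arbitrary):

```r
library(ggplot2)

ggplot(par_grid, aes(x = a, y = b, fill = nll)) +
  geom_raster() +                                # one tile per grid cell
  geom_contour(aes(z = nll), colour = "white") + # contours of the NLL
  labs(fill = "negative\nlog-likelihood")
```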
Modify our `lm_nll` function so that the slope is a separate argument. We’ll pass this to `optim` to find only the values of the intercept and the standard deviation.
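A sketch of the profiled objective and the slope grid; the function body and the grid range are assumptions consistent with the loop below:

```r
## negative log-likelihood with the slope fixed at b
## par = c(a, log_sigma)
slope_proflik <- function(par, b, x, y) {
  -sum(dnorm(y, mean = par[1] + b * x, sd = exp(par[2]), log = TRUE))
}

## grid of slope values to profile over (range assumed)
b_profile <- data.frame(b = seq(1, 3, length.out = 100), nll = NA_real_)
```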
## loop over values of the slope, store negative log-likelihood
for (i in 1:nrow(b_profile)) {
  b_fit <- optim(
    par = c(a = 0, log_sigma = log(1)), # initial values for intercept and sd
    fn = slope_proflik,                 # objective function
    b = b_profile$b[i],                 # fixed slope value
    x = sim_data$x,                     # x values
    y = sim_data$y                      # y values
  )
  b_profile$nll[i] <- b_fit$value
}
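Plotting the profile shows the negative log-likelihood rising on either side of the slope’s MLE (a quick base-graphics sketch):

```r
plot(b_profile$b, b_profile$nll, type = "l",
     xlab = "slope (b)", ylab = "negative log-likelihood")
abline(v = fit$par["b"], lty = 2) # MLE from the full fit
```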
\[ \begin{aligned} \mathcal{J}(\hat{\theta}) &= - \frac{\partial^2}{\partial \theta^2} \ln [\mathcal{L}(\hat{\theta})] \\ \text{Var}(\hat{\theta}) &= 1 / \mathcal{J}(\hat{\theta}) \end{aligned} \]
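For our model, the observed information is the Hessian of the negative log-likelihood returned by `optim`, so the variance-covariance matrix is its inverse (assuming the `fit` object from the earlier sketch, fit with `hessian = TRUE`):

```r
var_cov <- solve(fit$hessian) # invert the observed information matrix
var_cov
```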
a b log_sigma
a 1.620204e-02 -3.895846e-04 6.084368e-06
b -3.895846e-04 5.675765e-03 7.514228e-09
log_sigma 6.084368e-06 7.514228e-09 9.997983e-03
The variances are on the diagonal, and we can get the standard errors by taking their square root:
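Continuing from the `var_cov` matrix above:

```r
fit_se <- sqrt(diag(var_cov)) # standard errors of a, b, and log_sigma
```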
Let’s compare these to the standard errors reported by `lm`:
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.8265966 | 0.1299250 | 6.362106 | 1e-07 |
| x | 2.0182841 | 0.0768989 | 26.245943 | 0e+00 |
\[ \text{CI} = (\hat{\theta} - \Phi^{-1}(1 - \gamma / 2) \times \text{SE}(\hat{\theta}), \; \hat{\theta} + \Phi^{-1}(1 - \gamma / 2) \times \text{SE}(\hat{\theta})) \]
Where \(\Phi^{-1}\) is the inverse cumulative distribution function of the standard Normal distribution and \(\gamma\) is one minus the confidence level (e.g. \(\gamma = 0.05\) for a 95% interval).
Equivalently:
\[ \text{CI} = (\Phi^{-1}_{\hat{\theta}, \text{SE}(\hat{\theta})}(\gamma / 2), \; \Phi^{-1}_{\hat{\theta}, \text{SE}(\hat{\theta})} (1-\gamma/2)) \]
Where \(\Phi^{-1}_{\hat{\theta}, \text{SE}(\hat{\theta})}\) is the inverse cumulative distribution function for a Normal distribution with mean \(\hat{\theta}\) and standard deviation \(\text{SE}(\hat{\theta})\).
The Normal inverse CDF is returned by the `qnorm` function. We can obtain a confidence interval for our intercept \(\alpha\) and slope \(\beta\):
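A sketch using the estimates and standard errors computed above (95% interval, so \(\gamma = 0.05\)):

```r
gamma <- 0.05       # one minus the confidence level
est <- fit$par[1:2] # intercept and slope estimates
se  <- fit_se[1:2]  # their standard errors

data.frame(
  lower = qnorm(gamma / 2, mean = est, sd = se),
  upper = qnorm(1 - gamma / 2, mean = est, sd = se)
)
```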
lm_fit <- function(x, y, SE = "fisher", init = c("a" = 0, "b" = 0, "log_sigma" = 0), n_boot = 1000) {

  if (!(SE %in% c("bootstrap", "fisher"))) stop("SE options are bootstrap or fisher")
  if (length(x) != length(y)) stop("x and y lengths differ")

  ## Maximum-likelihood estimate
  MLE <- optim(
    par = init,                 # initial values
    fn = lm_nll,                # our negative log-likelihood function
    x = x,                      # x values
    y = y,                      # y values
    hessian = (SE == "fisher")  # only return Hessian if Fisher information used
  )

  ## Standard error using either Fisher information or bootstrapping
  if (SE == "fisher") {
    var_cov <- solve(MLE$hessian)
    fit_se <- sqrt(diag(var_cov))
  } else {
    n_obs <- length(x) ## number of observations
    ## initialize an empty matrix to store parameter estimates for each sample
    par_samples <- matrix(nrow = n_boot, ncol = 3)
    ## recalculate MLE for each sample
    for (i in 1:n_boot) {
      samp <- sample(1:n_obs, size = n_obs, replace = TRUE) # sampled observations
      new_x <- x[samp] # subset x to sampled observations
      new_y <- y[samp] # subset y to sampled observations
      sample_fit <- optim(
        par = init,  # initial values
        fn = lm_nll, # negative log-likelihood function
        x = new_x,   # sampled x-values
        y = new_y    # sampled y-values
      )
      par_samples[i, ] <- sample_fit$par # store sample parameters
    }
    fit_se <- apply(par_samples, MARGIN = 2, FUN = sd) # calculate column SDs
  }

  ## nicely format output
  data.frame(
    coef = c("intercept", "slope"),
    estimate = MLE$par[1:2],
    SE = fit_se[1:2]
  )
}
a <- 1 ## intercept
b <- 2 ## slope
sigma <- 1 ## error standard deviation
n <- 50 ## sample size
library(tidyverse) ## provides tibble, %>%, and ggplot2

## simulate data
sim_data <- tibble(
  x = runif(n, min = -3, max = 3),         ## x values
  y = rnorm(n, mean = a + b*x, sd = sigma) ## regression equation
)
## plot data
sim_data %>% ggplot(aes(x, y)) + geom_point() + geom_smooth(method = "lm")
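Calling our function on the new data; no seed was set for this run, and which SE method produced the output below is an assumption (the default Fisher method is shown):

```r
lm_fit(sim_data$x, sim_data$y)
```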
coef estimate SE
a intercept 0.9358099 0.13386046
b slope 2.0485483 0.07308945