Maximum likelihood

Topics in Scientific and Statistical Computing

Robin Elahi

What is likelihood?



Likelihood: (relative) measure of how well our observations fit a specific hypothesis / model

Probability distribution functions vs likelihood functions

  • parameters
  • data
  • summation / integration of probabilities

Probability distribution function

known parameter, variable data

We know that the oceans cover 70% of the earth’s surface. If we were to take 10 random GPS coordinates of the earth, what’s the probability of getting 5 ocean points?

p <- 0.7

What probability distribution should we use? Which function in R?

dbinom(x = 5, size = 10, p = p)
[1] 0.1029193

If we repeat this calculation for every possible outcome of ocean points (out of 10), we can visualize the probability distribution, for a known parameter (\(p\)).

ocean_grid <- seq(0, 10, by = 1)
ocean_p <- dbinom(x = ocean_grid, size = 10, p = p)

If we sum the ocean_p vector, what value should we get?
Reminder: this is a probability mass function.

sum(ocean_p)
[1] 1

Likelihood function

unknown parameter, fixed data

(the usual case)

Imagine you are martian observing Earth for the first time and you want to estimate the proportion of ocean on the planet using a bunch of random landings with your space probe.

Out of 20 trials, you made 12 ocean landings.

What is \(\text{Pr}(p = 0.7 | \text{ocean} = 12, \text{trials} = 20)\)?

dbinom(x = 12, size = 20, prob = 0.7)
[1] 0.1143967

   

Repeat for many values of \(p\):

p_grid <- seq(0, 1, by = 0.01)
p_lik <- dbinom(x = 12, size = 20, prob = p_grid)
plot(p_grid, p_lik, type = "b", 
     xlab = "Proportion of ocean (hypothesis)", 
     ylab = "Pr(data | hypothesis)", 
     main = "Likelihood function")

Will these probabilities sum to 1?

sum(p_lik)
[1] 4.761905

Which value of \(p\) corresponds to the maximum likelihood?

df <- tibble(p_grid, p_lik)
summary(df)
     p_grid         p_lik          
 Min.   :0.00   Min.   :0.000e+00  
 1st Qu.:0.25   1st Qu.:2.979e-05  
 Median :0.50   Median :8.532e-03  
 Mean   :0.50   Mean   :4.715e-02  
 3rd Qu.:0.75   3rd Qu.:9.113e-02  
 Max.   :1.00   Max.   :1.797e-01  
df[which.max(df$p_lik), ]
# A tibble: 1 × 2
  p_grid p_lik
   <dbl> <dbl>
1    0.6 0.180

Maximum likelihood

Following slides are based on a lab by Maurice Goodman.

Outline

  • simulate data from a linear model using a known intercept and slope
  • write a function to fit a linear regression using maximum-likelihood estimation,
  • compare estimated intercept and slope compare to the known values

Linear regression

\[ \begin{aligned} Y &= \alpha + \beta X + \epsilon \\ \epsilon &\sim \text{Normal}(0, \sigma^2) \end{aligned} \]

\[ \begin{aligned} Y &\sim \text{Normal}(\mu, \sigma^2) \\ \mu &= \alpha + \beta X \end{aligned} \]

Three parameters:

  • \(\alpha\), the intercept
  • \(\beta\), the slope
  • \(\sigma\), the standard deviation of the residuals

Simulate data

a <- 1 ## intercept
b <- 2 ## slope
sigma <- 1 ## error standard deviation
n <- 50 ## sample size
## simulate data
set.seed(13)
sim_data <- data.frame(x = runif(n, min = -3, max = 3))
sim_data$y <- rnorm(n, mean = a + b*sim_data$x, sd = sigma)
## Look at the data structure
head(sim_data)
           x         y
1  1.2619347  3.673224
2 -1.5231762 -3.505669
3 -0.6621933 -2.351430
4 -2.4516980 -4.960354
5  2.7723873  5.816631
6 -2.9344000 -4.877011

Likelihood function:

\[ \mathcal{L}(\theta) = f_X(x|\theta) = \prod_{i = 1}^n f_X(x_i ) \]

Log-likelihood function:

\[ l(\theta) = \ln (\mathcal{L}(\theta)) = \sum_{i = 1}^n \ln [f_X(x_i)] \]

We want to maximize the log-likelihood of this function, which is equivalent to minimizing the negative log-likelihood.

\[ \begin{aligned} Y &= \alpha + \beta X + \epsilon \\ \epsilon &\sim \text{Normal}(0, \sigma^2) \end{aligned} \]

becomes:

\[ Y \sim \text{Normal}(\underbrace{\alpha + \beta X}_{\text{mean}}, \underbrace{\sigma^2}_{\text{variance}}) \]

We can express the negative log-likelihood function in R as:

-sum(dnorm(y, mean = ..., sd = ..., log = TRUE))

R function

## negative log-likelihood of y-values given 
## intercept, slope, and error sd
## we pass the parameters as a vector in the order a, b, sigma
lm_nll <- function(par, x, y) {
  a <- par["a"]; b <- par["b"]; sigma <- exp(par["log_sigma"])
  -sum(dnorm(y, mean = a + b*x, sd = sigma, log = TRUE))
}

optimize the function

fit <- optim(
  par = c(a = 0, b = 0, log_sigma = 1), 
  fn = lm_nll, 
  x = sim_data$x, 
  y = sim_data$y, 
  hessian = TRUE
)

fit
$par
         a          b  log_sigma 
 0.8269008  2.0182844 -0.1061237 

$value
[1] 65.64576

$counts
function gradient 
     140       NA 

$convergence
[1] 0

$message
NULL

$hessian
                    a             b     log_sigma
a         61.82268539   4.243509988  -0.037625973
b          4.24350999 176.478986962  -0.002715066
log_sigma -0.03762597  -0.002715066 100.020193344
# Compare with lm
coef(lm(y ~ x, data = sim_data))
(Intercept)           x 
  0.8265966   2.0182841 

sim_data %>% ggplot(aes(x, y)) + geom_point() + 
  geom_abline(intercept = a, slope = b, 
              color = "red") + # truth
  geom_abline(intercept = fit$par[1], slope = fit$par[2], 
              color = "blue") # fit

Likelihood slice

a_seq <- seq(-2, 4, length.out = 100)
b_seq <- seq(-1, 5, length.out = 100)

### data frame containing all combinations of the intercept and slope
par_grid <- expand.grid("a" = a_seq, "b" = b_seq)

### loop over data frame, store negative log likelihood
for (i in 1:nrow(par_grid)) {
  par_grid$loglik[i] <- lm_nll(
    c(a = par_grid$a[i], b = par_grid$b[i], fit$par[3]), 
    x = sim_data$x, 
    y = sim_data$y
  )
}

Likelihood profile for the slope

Modify our lm_log_lik function so that the slope is a separate argument.

slope_proflik <- function(pars, b, x, y) {
  a <- pars["a"]; sigma <- exp(pars["log_sigma"]) ## only two pars in this function
  -sum(dnorm(y, mean = a + b*x, sd = sigma, log = TRUE))
}

b_profile <- data.frame(
  b = seq(-1, 5, length.out = 100)  ## values of slope to loop over
)

We’ll pass this to optim to just find the values of the intercept and sd

## loop over values of the slope, store negative log-likelihood
for (i in 1:nrow(b_profile)) {
  b_fit <- optim(
    par = c(a = 0, log_sigma = log(1)), # initial values for intercept and sd
    fn = slope_proflik,  # objective function
    b = b_profile$b[i], # fixed slope value
    x = sim_data$x,   # x values
    y = sim_data$y   # y values
  )
  b_profile$nll[i] <- b_fit$value
}

## plot slope profile
b_profile %>% 
  ggplot(aes(b, nll)) + 
  geom_line() + 
  labs(x = expression(beta), y = "negative log-likelihood")

Standard errors: Fisher information

\[ \begin{aligned} \mathcal{J}(\hat{\theta}) &= - \frac{\partial^2}{\partial \theta^2} \ln [\mathcal{L}(\hat{\theta})] \\ \text{Var}(\hat{\theta}) &= 1 / \mathcal{J}(\hat{\theta}) \end{aligned} \]

var_cov <- solve(fit$hessian)
var_cov
                      a             b    log_sigma
a          1.620204e-02 -3.895846e-04 6.084368e-06
b         -3.895846e-04  5.675765e-03 7.514228e-09
log_sigma  6.084368e-06  7.514228e-09 9.997983e-03

The variances are on the diagonal, and we can get the standard errors by taking their square root:

fisher_se <- sqrt(diag(var_cov))
fisher_se
         a          b  log_sigma 
0.12728722 0.07533767 0.09998992 


Let’s compare these to the standard errors reported by lm:

term estimate std.error statistic p.value
(Intercept) 0.8265966 0.1299250 6.362106 1e-07
x 2.0182841 0.0768989 26.245943 0e+00

Confidence intervals: normal approximation

\[ \text{CI} = (\hat{\theta} - \phi^{-1}(\gamma / 2) \times \text{SE}(\hat{\theta}), \; \hat{\theta} + \phi^{-1}(1 - \gamma / 2) \times \text{SE}(\hat{\theta})) \]

Where \(\phi^{-1}\) is the inverse cumulative distribution function of the standard Normal distribution.

Equivalently:

\[ \text{CI} = (\phi^{-1}_{\hat{\theta}, \text{SE}(\theta)}(\gamma / 2), \; \phi^{-1}_{\hat{\theta}, \text{SE}(\theta)} (1-\gamma/2)) \]

Where \(\phi^{-1}_{\hat{\theta}, \text{SE}(\theta)}\) is the inverse cumulative distribution function for a Normal distribution with mean \(\hat{\theta}\) and standard deviation \(\text{SE}(\theta)\).

The Normal inverse CDF is returned by the qnorm function. We can obtain a confidence interval for our intercept \(\alpha\) and slope \(\beta\):

gamma <- 0.05 ## 95% confidence interval
alpha_CI <- qnorm(c(gamma/2, 1 - gamma/2), mean = fit$par[1], 
                  sd = fisher_se[1])
beta_CI <- qnorm(c(gamma/2, 1 - gamma/2), mean = fit$par[2], 
                 sd = fisher_se[2])
cbind(alpha_CI, beta_CI)
      alpha_CI  beta_CI
[1,] 0.5774225 1.870625
[2,] 1.0763792 2.165944

Making it a function

lm_fit <- function(x, y, SE = "fisher", init = c("a" = 0, "b" = 0, "log_sigma" = 0), n_boot = 1000) {

  if (!(SE %in% c("bootstrap", "fisher"))) stop("SE options are bootstrap or fisher")
  if (length(x) != length(y)) stop("x and y lengths differ")

  ## Maximum-likelihood estimate
  MLE <- optim(
    par = init, # initial values
    fn = lm_nll,  # our negative log-likelihood function
    x = x,   # x values
    y = y,   # y values
    hessian = (SE == "fisher") # only return Hessian if Fisher information used
  )

  ## Standard error using either fisher information or bootstrapping
  if (SE == "fisher") {
    var_cov <- solve(MLE$hessian)
    fit_se <- sqrt(diag(var_cov))
  } else {
    n_obs <- length(x) ## number of observations

    ## initialize an empty matrix to store parameter estimates for each sample
    par_samples <- matrix(nrow = n_boot, ncol = 3)

    ## recalculate MLE for each sample
    for (i in 1:n_boot) {
      samp <- sample(1:n_obs, size = n_obs, replace = TRUE) # sampled observations
      new_x <- x[samp] # subset x to sampled observations
      new_y <- y[samp] # subset y to sampled observations

      sample_fit <- optim(
        par = init, # initial values
        fn = lm_nll, # log likelihood function
        x = new_x, # sampled x-values
        y = new_y # sampled y-values
      )
  
    par_samples[i,] <- sample_fit$par # store sample parameters
  }
  fit_se <- apply(par_samples, MARGIN = 2, FUN = sd) # calculate column SDs
 }

 ## nicely format output
 data.frame(
   coef = c("intercept", "slope"), 
   estimate = MLE$par[1:2], 
   SE = fit_se[1:2]
 )

}

Lab exercise 1

a <- 1 ## intercept
b <- 2 ## slope
sigma <- 1 ## error standard deviation
n <- 50 ## sample size

## simulate data
sim_data <- tibble(
  x = runif(n, min = -3, max = 3), ## x values
  y = rnorm(n, mean = a + b*x, sd = sigma) ## regression equation
)

## plot data
sim_data %>% ggplot(aes(x, y)) + geom_point() + geom_smooth(method = "lm")
## run regression
lm_fit(sim_data$x, sim_data$y, SE = "fisher")
       coef  estimate         SE
a intercept 0.9358099 0.13386046
b     slope 2.0485483 0.07308945