The EM Algorithm

The EM algorithm is a general method for finding maximum likelihood estimates of the parameters of an underlying distribution from observed data when the data are "incomplete" or have "missing values". The "E" stands for "Expectation" and the "M" stands for "Maximization". To set up the EM algorithm successfully, one has to come up with a way of relating the unobserved complete data to the observed incomplete data, so that the complete data have a "nice" likelihood as a function of the unknown parameters, such that maximum likelihood estimation is easy.

The EM Algorithm Setup

Let $Y$ be the observed data and let $X$ be the unobserved complete data. The EM algorithm requires that:

- the observed data $Y$ can be written as a function of the complete data $X$; that is, there is some function $t(X) = Y$ that collapses or projects $X$ onto $Y$;
- the complete data $X$ have a probability density or distribution $f(X \mid \theta)$ for some parameter vector $\theta$.

We want to find $\hat\theta$ such that $f(X \mid \theta)$ is maximized, i.e., we want to find the MLE (Maximum Likelihood Estimator).

If $X$ were known, to find $\hat\theta$ we would generally take the log of the likelihood function first,
$$l(X \mid \theta) = \ln(f(X \mid \theta)),$$
and then maximize $l(X \mid \theta)$ with respect to $\theta$. But since $X$ is not observed, it is not possible to maximize $l(X \mid \theta)$ directly. The EM algorithm can instead be used to maximize a conditional expectation of the likelihood over the unobserved $X$.

In the E step (expectation step) of the algorithm, we calculate the following conditional expectation:
$$Q(\theta \mid \theta^0) = E[l(X \mid \theta) \mid Y, \theta^0] = E[\ln(f(X \mid \theta)) \mid Y, \theta^0],$$
where $\theta^0$ is some initial value of $\theta$. $Q$ is the expected complete-data log-likelihood.

The M step of the algorithm finds the $\hat\theta$ that maximizes $Q(\theta \mid \theta^0)$. Then set $\theta^1 = \hat\theta$, where $\theta^1$ is now your current estimate. Return to the E step and start the process over again by calculating $Q(\theta \mid \theta^1)$ and maximizing it with respect to $\theta$. Repeat this process until convergence, i.e., until $|\theta^n - \theta^{n-1}| \leq \epsilon$, where $\epsilon$ is some small number (e.g., 0.0001). A generic version of this iteration is sketched in code below.

The essence of the EM algorithm is that, at each iteration $i$, maximizing $Q(\theta \mid \theta^i)$ leads to an increase of the log-likelihood of the observed data $Y$.
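To make the iteration concrete before turning to an example, here is a minimal Python sketch of the generic E/M loop just described. It is not code from the slides: the function name `em` and the callables `e_step` and `m_step` are hypothetical placeholders that a specific model (such as the ABO example below) must supply.

```python
import numpy as np

def em(y, theta0, e_step, m_step, tol=1e-4, max_iter=1000):
    """Generic EM loop: iterate until |theta_n - theta_{n-1}| <= tol.

    e_step(y, theta) should return the expected complete-data sufficient
    statistics E[. | Y, theta]; m_step(stats) should return the theta
    that maximizes Q given those expected statistics.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        stats = e_step(y, theta)                            # E step
        theta_new = np.asarray(m_step(stats), dtype=float)  # M step
        if np.max(np.abs(theta_new - theta)) <= tol:        # convergence check
            return theta_new
        theta = theta_new
    return theta
```

The loop mirrors the description above: each pass maximizes $Q(\theta \mid \theta^i)$ through the expected statistics, stopping when successive estimates agree to within the tolerance $\epsilon$.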
EM: ABO Blood Type Example

The locus corresponding to the ABO blood group has three alleles, A, B, and O, and is located on chromosome 9q34. Alleles A and B are co-dominant, and alleles A and B are dominant to O. This leads to the following genotypes and phenotypes:

  Genotype     Blood Type
  AA, AO       A
  BB, BO       B
  AB           AB
  OO           O

From a sample of 521 people, the following blood types were observed:

  Blood Type   A     B    AB   O     Total
  Number       186   38   13   284   521

We want to estimate $p_A$, $p_B$, and $p_O$, the frequencies of alleles A, B, and O, respectively. How can we do this? Note that $\theta = (p_A, p_B, p_O)$. What is the observed data? What is the complete data?

Let $N$ be the number of people in the study. The complete data is $X = (n_{A/A}, n_{A/O}, n_{B/B}, n_{B/O}, n_{A/B}, n_{O/O})$, where $n_{A/A}$ is the number of people with the A/A genotype, $n_{A/O}$ is the number of people with the A/O genotype, etc. The observed data is $Y = (n_A, n_B, n_{AB}, n_O)$, where $n_A$ is the number of people with blood type A, $n_B$ is the number of people with blood type B, etc.

What is $N$, the total number of people in the sample, in terms of the unobserved complete data? What are $n_A$, $n_B$, $n_{AB}$, and $n_O$ in terms of the unobserved complete data? What is the complete-data likelihood? What is the complete-data log-likelihood? (Assume HWE, Hardy-Weinberg equilibrium, throughout.)

The observed counts collapse the complete data as follows:
$$n_A = n_{A/A} + n_{A/O}, \qquad n_B = n_{B/B} + n_{B/O}, \qquad n_{AB} = n_{A/B}, \qquad n_O = n_{O/O},$$
and $N = n_{A/A} + n_{A/O} + n_{B/B} + n_{B/O} + n_{A/B} + n_{O/O}$.

If the genotype data at the ABO gene were observed, then the likelihood function would be the following multinomial:
$$f(X \mid \theta) = \binom{N}{n_{A/A},\, n_{A/O},\, n_{B/B},\, n_{B/O},\, n_{A/B},\, n_{O/O}} (p_A^2)^{n_{A/A}} (2 p_A p_O)^{n_{A/O}} (p_B^2)^{n_{B/B}} (2 p_B p_O)^{n_{B/O}} (2 p_A p_B)^{n_{A/B}} (p_O^2)^{n_{O/O}}.$$

Expectation Step of EM Algorithm

The complete-data log-likelihood function is
$$\ln(f(X \mid \theta)) = \ln \binom{N}{n_{A/A},\, n_{A/O},\, n_{B/B},\, n_{B/O},\, n_{A/B},\, n_{O/O}} + n_{A/A} \ln(p_A^2) + n_{A/O} \ln(2 p_A p_O) + n_{B/B} \ln(p_B^2) + n_{B/O} \ln(2 p_B p_O) + n_{A/B} \ln(2 p_A p_B) + n_{O/O} \ln(p_O^2).$$

Remember that $Y = (n_A, n_B, n_{AB}, n_O)$. For the initial iteration of the EM algorithm, the E step calculates
$$Q(\theta \mid \theta^0) = E[\ln(f(X \mid \theta)) \mid Y, \theta^0].$$
So $\theta^0 = (p_A^0, p_B^0, p_O^0)$, and we want to calculate $Q(p_A, p_B, p_O \mid p_A^0, p_B^0, p_O^0)$.

What is $n_{A/A}^0 = E[n_{A/A} \mid Y, p_A^0, p_B^0, p_O^0]$? Under HWE, a person with blood type A has genotype A/A with probability $(p_A^0)^2 / ((p_A^0)^2 + 2 p_A^0 p_O^0)$, so
$$n_{A/A}^0 = n_A \, P(\text{A/A genotype} \mid \text{A blood type}) = n_A \, \frac{(p_A^0)^2}{(p_A^0)^2 + 2 p_A^0 p_O^0}.$$
What is $n_{A/O}^0 = E[n_{A/O} \mid Y, p_A^0, p_B^0, p_O^0]$? What is $n_{A/B}^0$? What are $n_{B/B}^0$ and $n_{B/O}^0$? What is $n_{O/O}^0$?

Substituting these expected counts into the complete-data log-likelihood gives
$$Q(\theta \mid \theta^0) = n_{A/A}^0 \ln(p_A^2) + n_{A/O}^0 \ln(2 p_A p_O) + n_{B/B}^0 \ln(p_B^2) + n_{B/O}^0 \ln(2 p_B p_O) + n_{AB} \ln(2 p_A p_B) + n_O \ln(p_O^2) + g(X),$$
where $g(X)$ is the expected log of the multinomial coefficient $\binom{N}{n_{A/A},\, \ldots,\, n_{O/O}}$ and is not a function of the parameters $p_A$, $p_B$, and $p_O$. (Here $n_{A/B} = n_{AB}$ and $n_{O/O} = n_O$ are observed directly.) For the M step, we want the $\hat\theta = (\hat p_A, \hat p_B, \hat p_O)$ that maximizes $Q$. How would we do this?

Maximization Step of EM Algorithm

The M step involves maximizing $Q$, the expected complete-data log-likelihood obtained in the E step, with respect to $\theta = (p_A, p_B, p_O)$, subject to the constraint $p_A + p_B + p_O = 1$. The resulting MLE counts alleles among the expected genotype counts (each of the $N$ people carries two alleles):
$$\hat p_A = \frac{2 n_{A/A}^0 + n_{A/O}^0 + n_{AB}}{2N}, \qquad \hat p_B = \frac{2 n_{B/B}^0 + n_{B/O}^0 + n_{AB}}{2N}, \qquad \hat p_O = \frac{2 n_O + n_{A/O}^0 + n_{B/O}^0}{2N}.$$

The next step is to set $p_A^1 = \hat p_A$, $p_B^1 = \hat p_B$, and $p_O^1 = \hat p_O$. Then return to the E step of the algorithm and compute $Q(\theta \mid \theta^1)$, where $\theta^1 = (p_A^1, p_B^1, p_O^1)$. Continue iterating between the E and M steps until the $\theta^i$ values converge.
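To see where these M-step formulas come from (the slides ask "How would we do this?" but state only the answer), here is a short derivation sketch using a Lagrange multiplier for the constraint $p_A + p_B + p_O = 1$; the steps are standard, but the write-up is mine rather than from the slides. Setting the partial derivatives of $Q + \lambda(1 - p_A - p_B - p_O)$ to zero gives
$$\frac{2 n_{A/A}^0 + n_{A/O}^0 + n_{AB}}{p_A} = \frac{2 n_{B/B}^0 + n_{B/O}^0 + n_{AB}}{p_B} = \frac{2 n_O + n_{A/O}^0 + n_{B/O}^0}{p_O} = \lambda.$$
Summing the three numerators counts both alleles of all $N$ people, so they add to $2N$, while the denominators add to 1; hence $\lambda = 2N$, which yields the formulas above.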
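To make the full ABO iteration concrete, here is a minimal Python sketch of the E and M steps above under the slides' HWE assumption. The function name `abo_em` and the uniform starting frequencies $(1/3, 1/3, 1/3)$ are my choices for illustration, not values given in the slides.

```python
def abo_em(n_A, n_B, n_AB, n_O, p0=(1/3, 1/3, 1/3), tol=1e-4, max_iter=1000):
    """EM estimates of (pA, pB, pO) from ABO phenotype counts under HWE."""
    N = n_A + n_B + n_AB + n_O
    pA, pB, pO = p0
    for _ in range(max_iter):
        # E step: expected genotype counts given blood types and the
        # current allele frequencies, e.g. n0_AA = n_A * P(A/A | type A).
        nAA = n_A * pA**2 / (pA**2 + 2 * pA * pO)
        nAO = n_A - nAA
        nBB = n_B * pB**2 / (pB**2 + 2 * pB * pO)
        nBO = n_B - nBB
        # n_AB and n_O identify their genotypes (A/B and O/O) directly.

        # M step: the closed-form allele-counting estimates from the slides.
        pA_new = (2 * nAA + nAO + n_AB) / (2 * N)
        pB_new = (2 * nBB + nBO + n_AB) / (2 * N)
        pO_new = (2 * n_O + nAO + nBO) / (2 * N)

        if max(abs(pA_new - pA), abs(pB_new - pB), abs(pO_new - pO)) <= tol:
            break
        pA, pB, pO = pA_new, pB_new, pO_new
    return pA_new, pB_new, pO_new

# Blood-type counts from the slides: 186 A, 38 B, 13 AB, 284 O (N = 521).
print(abo_em(186, 38, 13, 284))
```

Each pass reuses only the expected genotype counts (the sufficient statistics), so $Q$ never needs to be evaluated explicitly; convergence is declared when successive frequency estimates change by less than the tolerance.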