ON TIME SERIES OF OBSERVATIONS FROM EXPONENTIAL FAMILY DISTRIBUTIONS

Juan Carlos Abril
Facultad de Ciencias Económicas, Universidad Nacional de Tucumán y CONICET. Casilla de correo 209, 4000 Tucumán, Argentina

SUMMARY

In this work a general method is developed for handling non-Gaussian observations in linear state space time series models, and it is applied to the special case in which the observations come from exponential family distributions. The method is based on the idea of estimating the state vector by its posterior mode. Let α be the vector (α_1′, ..., α_n′)′ and let p(α | Y_n) be the conditional density of α given the whole sample Y_n. The posterior mode estimate (PME) of α is defined to be the value α̂ of α that maximises p(α | Y_n). When the model is linear and Gaussian, α̂ = E(α | Y_n). When the observations are non-Gaussian, however, E(α | Y_n) is generally difficult or impossible to compute, and using the mode instead is a natural alternative. More than that, it can be argued that in the non-Gaussian case the PME is preferable to E(α | Y_n) since it is the value of α which is most probable given the data. In this respect it can be thought of as analogous to the maximum likelihood estimate of a fixed parameter vector. The question of calculating suitable starting values for the state iteration is considered, and the estimation of the hyperparameters is investigated.

KEYWORDS: Approximate Kalman filtering and smoothing; Exponential family distributions; Kalman filtering and smoothing; Non-Gaussian time series models; Posterior mode estimate.

JEL classification: C3, C5.

1. Introduction

We begin with the linear Gaussian state space model. Although our main concern is with non-Gaussian models, the linear Gaussian model provides the basis from which all our methods will be developed. The model can be formulated in a variety of ways; we shall take the form

    y_t = Z_t α_t + ε_t,    ε_t ~ N(0, H_t),
    α_t = T_t α_{t-1} + R_t η_t,    η_t ~ N(0, Q_t),        (1)

for t = 1, ..., n, where y_t is a p × 1 vector of observations, α_t is an unobserved m × 1 state vector, R_t is a selection matrix composed of r columns of the identity matrix I_m, which need not be adjacent, and the variance matrices H_t and Q_t are nonsingular. The disturbance vectors ε_t and η_t are serially independent and independent of each other. The matrices H_t, Q_t, Z_t and T_t are assumed known apart from possible dependence on a parameter vector ψ, which in classical inference is assumed fixed and unknown and in Bayesian inference is assumed to be random. The first line of equation (1) is called the observation equation and the second line the state equation of the state space model. The matrices Z_t and T_t are permitted to depend on y_1, ..., y_{t-1}. The initial state α_0 is assumed to be N(a_0, P_0) independently of ε_1, ..., ε_n and η_1, ..., η_n, where a_0 and P_0 are first assumed known; later, we consider how to proceed in the absence of knowledge of a_0 and P_0, and particularly in the diffuse case when P_0^{-1} = 0.

Let Y_{t-1} denote the set y_1, ..., y_{t-1} together with any information prior to time t = 1. Starting at t = 1 and building up the distributions of α_t and y_t recursively, it can be shown that the conditional densities satisfy p(y_t | α_1, ..., α_t, Y_{t-1}) = p(y_t | α_t) and p(α_t | α_1, ..., α_{t-1}, Y_{t-1}) = p(α_t | α_{t-1}), thus establishing the truly Markovian nature of the model. Since in model (1) all distributions are Gaussian, conditional distributions are also Gaussian. Assume that α_t given Y_{t-1} is N(a_t, P_t) and that α_t given Y_t is N(a_{t|t}, P_{t|t}). The object of the Kalman filter and smoother (KFS) is to calculate a_{t|t}, P_{t|t}, a_{t+1} and P_{t+1} from a_t, P_t recursively.
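To fix ideas, here is a minimal NumPy sketch of these recursions: a Kalman filter pass followed by a fixed-interval smoothing pass for model (1). It is offered only as an illustration of the KFS referred to throughout the paper; the function name, the array layout (one system matrix per time point), the use of the Rauch-Tung-Striebel form of the smoother and the explicit matrix inversions are choices made here for clarity, not the fast algorithms of de Jong (1989, 1991) or Koopman (1993).

    import numpy as np

    def kalman_filter_smoother(y, Z, T, R, H, Q, a0, P0):
        """Kalman filter and fixed-interval smoother for the linear Gaussian model (1).
        System matrices are given per time point: y (n,p), Z (n,p,m), T (n,m,m),
        R (n,m,r), H (n,p,p), Q (n,r,r); a0 (m,), P0 (m,m) are the moments of alpha_0."""
        n, p = y.shape
        m = a0.shape[0]
        a_pred = np.zeros((n, m)); P_pred = np.zeros((n, m, m))
        a_filt = np.zeros((n, m)); P_filt = np.zeros((n, m, m))
        v = np.zeros((n, p));      F = np.zeros((n, p, p))
        a, P = a0, P0
        for t in range(n):
            # prediction step: alpha_t | Y_{t-1} ~ N(a_t, P_t)
            a = T[t] @ a
            P = T[t] @ P @ T[t].T + R[t] @ Q[t] @ R[t].T
            a_pred[t], P_pred[t] = a, P
            # update step: alpha_t | Y_t ~ N(a_{t|t}, P_{t|t})
            v[t] = y[t] - Z[t] @ a
            F[t] = Z[t] @ P @ Z[t].T + H[t]
            K = P @ Z[t].T @ np.linalg.inv(F[t])
            a = a + K @ v[t]
            P = P - K @ Z[t] @ P
            a_filt[t], P_filt[t] = a, P
        # backward (Rauch-Tung-Striebel) smoothing pass giving E(alpha_t | Y_n)
        a_smooth = a_filt.copy(); P_smooth = P_filt.copy()
        for t in range(n - 2, -1, -1):
            J = P_filt[t] @ T[t + 1].T @ np.linalg.inv(P_pred[t + 1])
            a_smooth[t] = a_filt[t] + J @ (a_smooth[t + 1] - a_pred[t + 1])
            P_smooth[t] = P_filt[t] + J @ (P_smooth[t + 1] - P_pred[t + 1]) @ J.T
        return a_smooth, P_smooth, v, F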
In this work a general method is developed for handling non-Gaussian observations in linear state space time series models, and it is applied to the special case in which the observations come from exponential family distributions. The method is based on the idea of estimating the state vector by its posterior mode, a device which has already been considered by some authors for spline smoothing (see, for example, Abril (1999) and the references given therein). Let α be the stacked vector (α_1′, ..., α_n′)′ and let p(α | Y_n) be the conditional density of α given Y_n. The posterior mode estimate (PME) of α is defined to be the value α̂ of α that maximises p(α | Y_n). When the model is linear and Gaussian, α̂ = E(α | Y_n). When the observations are non-Gaussian, however, E(α | Y_n) is generally difficult or impossible to compute, and using the mode instead is a natural alternative. More than that, it can be argued that in the non-Gaussian case the PME is preferable to E(α | Y_n) since it is the value of α which is most probable given the data. In this respect it can be thought of as analogous to the maximum likelihood (ML) estimate of a fixed parameter vector; see, for example, Whittle (1991) and the subsequent discussion. The tth subvector of α̂ is called the smoothed value of α_t and is denoted by α̂_t.

The idea of using the PME for departures from the linear Gaussian case can be found in earlier literature, for example in Sage and Melsa (1971). The approach has been developed for exponential families by Fahrmeir (1992) and in earlier papers referenced therein. The treatment below is based on ideas of Durbin and Koopman (1994), who developed a technique for iterative computation of the PME using fast KFS algorithms, as well as methods for calculating approximate ML estimates of the hyperparameters.

Let A_t be the stacked vector (α_1′, ..., α_t′)′, t = 1, ..., n. Then α̂ is the PME of A_n and, for filtering, the PME of A_t given Y_{t-1} is the value of A_t that maximises the conditional density p(A_t | Y_{t-1}), t = 1, ..., n. The tth subvector of this is denoted by a_t. In all cases we shall consider, the PME α̂ is the solution of the equations

    ∂ log p(α | Y_n) / ∂α_t = 0,    t = 1, ..., n.

Since, however, log p(α | Y_n) = log p(α, Y_n) − log p(Y_n), it is more easily obtained from the joint density as the solution of the equations

    ∂ log p(α, Y_n) / ∂α_t = 0,    t = 1, ..., n.        (2)

Similarly, at the filtering stage the PME of A_t given Y_{t-1} is the solution of the equations

    ∂ log p(A_t | Y_{t-1}) / ∂α_s = 0,    s = 1, ..., t.        (3)

We leave aside for the moment the initialisation question.

In the non-Gaussian case these equations are non-linear and must be solved by iteration. The idea we shall use for obtaining suitable iterative steps is the following. Write out the equations (2) for the linear Gaussian analogue of the model under consideration. Since for Gaussian densities modes are equal to means, the solution of the equations is E(α | Y_n). But we know that this is given by the KFS. It follows that the equations in the linear Gaussian case can be solved by the KFS. Now let α̃ be a trial value of α and construct a linearised form of (2) using α̃. This can be done by expanding locally about α̃ or by other methods to be illustrated later. Manipulate the resulting equations into a form that mimics the analogous Gaussian equations and solve by the KFS to obtain an improved value of α. Repeat the process and continue until suitable convergence has been achieved. Of course, this iteration for estimation of the state vector has to be combined with a parallel iteration for estimation of the hyperparameter vector.
In the next section the technique is illustrated by applying it to observations from a general exponential family. Section 3 investigates the problem of hyperparameter estimation, Section 4 considers the question of calculating suitable starting values for the state iteration, and in Section 5 we give some conclusions.

2. State space models when the observations come from a general exponential family distribution

We consider the case where the observations come from a general exponential family distribution with density

    p(y_t | α_t) = exp{ θ_t′ y_t − b_t(θ_t) + c_t(y_t) },        (4)

where θ_t = Z_t α_t, t = 1, ..., n, and we refer to θ_t as the signal. We retain the same state transition equation as in the Gaussian case (1), namely

    α_t = T_t α_{t-1} + R_t η_t,    η_t ~ N(0, Q_t),    t = 1, ..., n.        (5)

This model covers many important time series applications occurring in practice, including binomial, Poisson, multinomial and exponentially distributed time series data. In (4), b_t and c_t are assumed to be known functions and y_t is a p × 1 vector of observations which may be continuous or discrete. Assumptions regarding (5) are the same as in the Gaussian case. If the observations y_t in (4) had been independent with α_t, b_t and c_t constant, then (4) would have been a generalised linear model, for the treatment of which see McCullagh and Nelder (1989). Model (4)-(5) was proposed for appropriate types of non-Gaussian time series data by West, Harrison and Migon (1985), who called it the dynamic generalised linear model. They gave an approximate treatment based on conjugate priors. The treatment given here differs completely from theirs and is based on Durbin and Koopman (1994); in fact, it is an extension of that work.

It is evident that Poisson data with mean exp(Z_t′ α_t) satisfy (4). For binomial data, suppose y_t is the number of successes in N trials with probability π_t. Then

    p(y_t | α_t) = exp[ log{π_t / (1 − π_t)} y_t + N log(1 − π_t) + log N! − log y_t! − log(N − y_t)! ],

which satisfies (4) by taking π_t = exp(Z_t′ α_t) / {1 + exp(Z_t′ α_t)}. This is the logistic transform which is generally used for logistic regression of binomial data in the non-time series case.

The main reason for the inclusion of R_t in (5) is that some of the equations of (5) have zero error terms. This happens, for example, when some elements of the state vector are regression coefficients which are constant over time. The function of R_t is then to select those error terms which are non-zero. This is done by taking the columns of R_t to be a subset of the columns of I_m. For simplicity let us assume that this is so. Then R_t′ R_t = I_g, where g is the number of non-zero error terms. Also, R_t Q_t R_t′ and R_t Q_t^{-1} R_t′ are Moore-Penrose generalised inverses of each other (see, for example, section 16.5 of Rao (1973)). Since none of the elements of η_t need be degenerate, we assume that Q_t is positive definite. For simplicity suppose that α_0 is fixed and known, leaving until later consideration of the initialisation question.
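For concreteness, the following sketch records the functions b_t together with the first and second derivatives b_t^* and b_t^{**} that are needed in the linearisation below, for the scalar Poisson and binomial cases just described; the function names are illustrative only.

    import numpy as np

    def poisson_b(theta):          # b(theta) = exp(theta) when E(y_t) = exp(theta_t)
        return np.exp(theta)

    def poisson_b1(theta):         # b*(theta) = db/dtheta = exp(theta) = E(y_t | theta_t)
        return np.exp(theta)

    def poisson_b2(theta):         # b**(theta) = d2b/dtheta2 = exp(theta) = Var(y_t | theta_t)
        return np.exp(theta)

    def binomial_b(theta, N):      # b(theta) = N log(1 + exp(theta)), theta = logit(pi_t)
        return N * np.log1p(np.exp(theta))

    def binomial_b1(theta, N):     # b*(theta) = N pi_t with pi_t = exp(theta)/(1 + exp(theta))
        return N / (1.0 + np.exp(-theta))

    def binomial_b2(theta, N):     # b**(theta) = N pi_t (1 - pi_t)
        p = 1.0 / (1.0 + np.exp(-theta))
        return N * p * (1.0 - p)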
The log density of α, Y_n is then

    log p(α, Y_n) = −(1/2) Σ_{t=1}^n (α_t − T_t α_{t-1})′ R_t Q_t^{-1} R_t′ (α_t − T_t α_{t-1}) + Σ_{t=1}^n { α_t′ Z_t′ y_t − b_t(Z_t α_t) + c_t(y_t) }.

Differentiating with respect to α_t and equating to zero, we obtain the PMEs α̂_1, ..., α̂_n as the solution of the equations

    −R_t Q_t^{-1} R_t′ (α_t − T_t α_{t-1}) + T_{t+1}′ R_{t+1} Q_{t+1}^{-1} R_{t+1}′ (α_{t+1} − T_{t+1} α_t) + Z_t′ { y_t − b_t^*(Z_t α_t) } = 0        (6)

for t = 1, ..., n, where for t = n the second term is absent. Here b_t^*(x) denotes db_t(x)/dx for any p × 1 vector x.

Suppose that α̃ is a trial value of α. Expanding b_t^*(Z_t α_t) about α̃_t gives, to the first order,

    b_t^*(Z_t α_t) ≈ b_t^*(Z_t α̃_t) + b_t^{**}(Z_t α̃_t) Z_t (α_t − α̃_t),

where b_t^{**}(x) = db_t^*(x)/dx = d²b_t(x)/dx dx′. Putting

    ỹ_t = b_t^{**}(Z_t α̃_t)^{-1} { y_t − b_t^*(Z_t α̃_t) } + Z_t α̃_t

in the last term of (6) gives

    Z_t′ b_t^{**}(Z_t α̃_t) ( ỹ_t − Z_t α_t ).        (7)

Now the equations corresponding to (6) for the linear Gaussian model (1) are

    −R_t Q_t^{-1} R_t′ (α_t − T_t α_{t-1}) + T_{t+1}′ R_{t+1} Q_{t+1}^{-1} R_{t+1}′ (α_{t+1} − T_{t+1} α_t) + Z_t′ H_t^{-1} (y_t − Z_t α_t) = 0.        (8)

Comparing (7) with the last term of (8), we see that the linearised form of (6) has the same form as (8) if we replace y_t and H_t in (8) by ỹ_t and b_t^{**}(Z_t α̃_t)^{-1}. Since the solution of (8) is given by the KFS, it follows that with these substitutions the solution of the linearised form of (6) can also be obtained by the KFS. By this means we obtain an improved value of α, which is used as a new trial value to obtain a new improved value, and so on until suitable convergence has been reached. Initialisation of the Kalman filter can be carried out by any of the techniques considered by de Jong (1989, 1991) and Koopman (1993) (see, for example, Abril (1999)). Since in the course of the iterations variances and covariances of errors of estimation of the α_t's are not required, Koopman's smoother should be used in preference to de Jong's since it is faster. After the iterations for state and parameter estimation are complete, a further smoothing pass using de Jong's smoother can be made if approximate variances and covariances are required. However, the accuracy of these is open to doubt since in the non-Gaussian case the matrix b_t^{**}(Z_t α̂_t) is a random matrix, whereas in the Gaussian case the analogous matrix H_t is fixed.

3. Hyperparameter estimation

Since construction of the exact likelihood is difficult or impossible, the ML estimation of the unknown parameter vector ψ that we consider in this section is necessarily approximate. The simplest approach is to construct an approximate likelihood by the prediction error decomposition method. For the Gaussian model (1) in the diffuse case the likelihood is given by

    L = p(Y_n | ψ) = (2π)^{-(n-d)p/2} ∏_{t=d+1}^n |F_t|^{-1/2} exp{ −(1/2) Σ_{t=d+1}^n v_t′ F_t^{-1} v_t },        (9)

where v_t = y_t − Z_t a_t, a_t is the filtered estimate of α_t, F_t = Var(v_t) and d is the number of diffuse initial elements. By analogy with this, our first approximate form for the likelihood is given by the same expression (9), where v_t and F_t are now the values calculated in a final pass of the Kalman filter using the methods of the previous section, with α̃_t equal to the final smoothed estimate of α_t.
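The following sketch, which assumes the kalman_filter_smoother and poisson_* functions of the earlier listings, carries out the state iteration for a univariate Poisson series whose signal θ_t = α_t follows a random walk (Z_t = T_t = R_t = 1), and evaluates the first approximate likelihood (9) from the final filter pass, ignoring any diffuse correction. The crude starting value and the large initial variance standing in for a diffuse prior are illustrative choices, not part of the method itself.

    import numpy as np

    def poisson_mode_iteration(y, q, alpha0=None, a0=0.0, p0=1e7, tol=1e-8, max_iter=50):
        """Posterior mode iteration for a Poisson count series y whose signal
        theta_t = alpha_t is a random walk with variance q (model (4)-(5) with
        Z_t = T_t = R_t = 1), using the Gaussian KFS on the linearised data."""
        n = len(y)
        ones = np.ones((n, 1, 1))
        Z, T, R = ones, ones, ones
        Q = q * ones
        alpha = np.log(np.maximum(y, 0.5)) if alpha0 is None else np.asarray(alpha0, float).copy()
        for _ in range(max_iter):
            b2 = poisson_b2(alpha)                           # b**(Z_t alpha~_t)
            y_tilde = (y - poisson_b1(alpha)) / b2 + alpha   # linearised observation y~_t
            H = (1.0 / b2).reshape(n, 1, 1)                  # H_t replaced by b**(.)^{-1}
            a_smooth, P_smooth, v, F = kalman_filter_smoother(
                y_tilde.reshape(n, 1), Z, T, R, H, Q,
                a0=np.array([a0]), P0=np.array([[p0]]))
            alpha_new = a_smooth[:, 0]
            if np.max(np.abs(alpha_new - alpha)) < tol:
                alpha = alpha_new
                break
            alpha = alpha_new
        # first approximate log-likelihood (9), without the diffuse correction
        loglik = -0.5 * np.sum(np.log(2 * np.pi) + np.log(F[:, 0, 0]) + v[:, 0] ** 2 / F[:, 0, 0])
        return alpha, loglik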
Assume now that the initial mean and variance a_0 and P_0 are known. For the second approximate form, denote E{(α − α̂)(α − α̂)′} for the exponential family model (4)-(5) by V_e, and denote the same matrix for the Gaussian model (1) with R_t = I_m by V_g. Let p_e(α, Y_n | ψ), p_g(α, Y_n | ψ), p_e(Y_n | ψ) and p_g(Y_n | ψ) be the joint densities of α, Y_n and the marginal densities of Y_n under the two models. Then

    p_g(α, Y_n | ψ) = p_g(Y_n | ψ) (2π)^{-nm/2} |V_g|^{-1/2} exp{ −(1/2) (α − α̂)′ V_g^{-1} (α − α̂) },

so on putting α = α̂ we have

    p_g(Y_n | ψ) = (2π)^{nm/2} |V_g|^{1/2} p_g(α̂, Y_n | ψ).        (10)

Now

    p_g(α, Y_n | ψ) = (2π)^{-n(m+p)/2} ∏_{t=1}^n |Q_t|^{-1/2} |H_t|^{-1/2} f_g(α, Y_n),

where

    f_g(α, Y_n) = exp[ −(1/2) Σ_{t=1}^n { (α_t − T_t α_{t-1})′ Q_t^{-1} (α_t − T_t α_{t-1}) + (y_t − Z_t α_t)′ H_t^{-1} (y_t − Z_t α_t) } ].        (11)

We deduce from (10) and (11) that

    p_g(Y_n | ψ) = (2π)^{-np/2} |V_g|^{1/2} ∏_{t=1}^n |Q_t|^{-1/2} |H_t|^{-1/2} f_g(α̂, Y_n).        (12)

But by the prediction error decomposition,

    p_g(Y_n | ψ) = (2π)^{-np/2} ∏_{t=1}^n |F_t|^{-1/2} exp{ −(1/2) Σ_{t=1}^n v_t′ F_t^{-1} v_t }.        (13)

Since the quadratic forms in Y_n in the exponents of (12) and (13) must be the same, the following identity holds:

    |V_g|^{1/2} = ∏_{t=1}^n |Q_t|^{1/2} |H_t|^{1/2} |F_t|^{-1/2}.        (14)

We now apply these results to the exponential family model. The joint density of α and Y_n is

    p_e(α, Y_n | ψ) = (2π)^{-nm/2} ∏_{t=1}^n |Q_t|^{-1/2} × exp[ −(1/2) Σ_{t=1}^n (α_t − T_t α_{t-1})′ Q_t^{-1} (α_t − T_t α_{t-1}) + Σ_{t=1}^n { α_t′ Z_t′ y_t − b_t(Z_t α_t) + c_t(y_t) } ].        (15)

By analogy with (10) we take for the approximate marginal density of Y_n, and hence our second approximation to the likelihood,

    L = p_e(Y_n | ψ) = (2π)^{nm/2} |V_e|^{1/2} p_e(α̂, Y_n | ψ),        (16)

where

    |V_e|^{1/2} = ∏_{t=1}^n |Q_t|^{1/2} |b_t^{**}(Z_t α̂_t)|^{-1/2} |F̂_t|^{-1/2},        (17)

and where α̂_t is the final smoothed value of α_t and F̂_t is the value of F_t that is computed in the final pass of the Kalman filter with α̃_t = α̂_t. Similar results hold in the diffuse case when the Kalman filter is initialised by any of the techniques considered by de Jong (1989, 1991) and Koopman (1993) (see, for example, Abril (1999)).

Either of these forms of the likelihood can be maximised by numerical maximisation routines, as for the prediction error decomposition in the Gaussian case, with concentration where appropriate (see, for example, Abril (1999)). However, a new point arises that was not present in the Gaussian case. Values of the likelihood are calculated from the results of the iterated estimation of α̂ and are therefore subject to small errors, since the iteration cannot be continued until the errors are infinitesimal. Consequently they are not suitable for maximising routines that calculate the gradient from adjacent values of the hyperparameters and then proceed from the last approximation along a direction solely determined by this gradient. Instead, routines should be used such as the downhill simplex method given in section 10.4 of Press et al. (1986), Powell's method given in section 10.5 of the same book, or the Gill-Murray-Pitfield algorithm.

Instead of estimating the hyperparameters by direct numerical maximisation of the log-likelihood, the EM algorithm may be used as an alternative, or the two techniques may be combined by using the EM algorithm in the early stages of maximisation, where it is relatively fast, and then switching to numerical maximisation in the later stages, where the EM algorithm is relatively slow.
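As an illustration of this point, the sketch below maximises the approximate likelihood of the Poisson example over its single hyperparameter q = Var(η_t) with SciPy's implementation of the downhill simplex (Nelder-Mead) method, which does not require gradients of the slightly noisy criterion; the log-variance parametrisation, the starting value and the function name are illustrative choices.

    import numpy as np
    from scipy.optimize import minimize

    def estimate_q(y):
        """Maximise the first approximate likelihood (9) of the Poisson sketch over
        q = Var(eta_t) using a derivative-free (downhill simplex) search."""
        def neg_loglik(log_q):
            _, loglik = poisson_mode_iteration(y, q=np.exp(log_q[0]))
            return -loglik
        result = minimize(neg_loglik, x0=np.array([np.log(0.1)]), method="Nelder-Mead")
        return np.exp(result.x[0])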
Let us now consider the use of the EM algorithm. For the model we are considering, the observational density (4) does not depend on unknown hyperparameters. Consequently, for construction of the EM algorithm we can neglect the observational component, so the E step gives, as an approximation and ignoring negligible terms,

    Ẽ_c{ log p(α, Y_n | ψ) } = −(1/2) Σ_{t=1}^n [ log |Q_t| + tr Q_t^{-1} { (α̂_t − T_t α̂_{t-1})(α̂_t − T_t α̂_{t-1})′ + Ṽ_t + T_t Ṽ_{t-1} T_t′ − T_t C̃_t − C̃_t′ T_t′ } ].        (18)

Here, α̂_t is the final smoothed value of α_t given that ψ = ψ̃, where ψ̃ is a trial value of ψ, Ẽ_c denotes expectation with respect to the density p(α | Y_n, ψ̃), α̂ = Ẽ_c(α), Ṽ_t = Ẽ_c{(α_t − α̂_t)(α_t − α̂_t)′} and C̃_t = Ẽ_c{(α_{t-1} − α̂_{t-1})(α_t − α̂_t)′}. Similarly, and preferably, if the hyperparameters occur only in Q_t and not in T_t, we can use the faster Koopman (1993) form

    Ẽ_c{ log p(α, Y_n | ψ) } = −(1/2) Σ_{t=1}^n [ log |Q_t| + tr{ Q_t^{-1} ( η̃_t η̃_t′ + Q̃_t − Q̃_t D̃_t Q̃_t ) } ],        (19)

where η̃_t, Q̃_t and D̃_t are calculated for the linearised model in the same way as for the linear model (see, for example, Abril (1999)). In either case, in the M step, Ẽ_c[·] is maximised with respect to ψ to obtain an improved estimate.

For cases in which the EM algorithm is not used, it is sometimes convenient to compute the score function in order to specify the direction along which the numerical search should be made. The approximate score function at ψ = ψ̃ is

    ∂ log p(Y_n | ψ) / ∂ψ |_{ψ=ψ̃} = ∂ Ẽ_c{ log p(α, Y_n | ψ) } / ∂ψ |_{ψ=ψ̃},

where Ẽ_c{log p(α, Y_n | ψ)} is given by (18) or (19). As we said, this is used to specify the direction of numerical search for the maximum.

4. Calculation of starting values for the state iteration

Two methods are suggested for obtaining starting values for the state iteration: the extended Kalman filter and the approximate two-filter smoother. With trial values α̃_t, the Kalman filter step during the state iteration that we have just been discussing can be written as

    a_{t+1} = T_{t+1} a_t + K̃_t ṽ_t,
    P̃_{t+1} = T_{t+1} P̃_t T_{t+1}′ − T_{t+1} P̃_t Z_t′ b_t^{**}(Z_t α̃_t) K̃_t′ + R_{t+1} Q_{t+1} R_{t+1}′,        (20)

where

    ṽ_t = y_t − b_t^*(Z_t α̃_t) + b_t^{**}(Z_t α̃_t) Z_t (α̃_t − a_t),
    K̃_t = T_{t+1} P̃_t Z_t′ b_t^{**}(Z_t α̃_t) F̃_t^{-1},
    F̃_t = b_t^{**}(Z_t α̃_t) Z_t P̃_t Z_t′ b_t^{**}(Z_t α̃_t) + b_t^{**}(Z_t α̃_t).

The extended Kalman filter is obtained merely by taking α̃_{t+1} = a_{t+1} in these formulae. Thus all we are doing is taking for α̃_t the estimate of α_t given by the filter at time t − 1. The extended Kalman filter normally gives adequate starting values for the state iteration. Indeed Fahrmeir (1992) claims that it gives an adequate approximation to the PME, though Durbin and Koopman (1994) and Abril (1999) dispute this. However, it is obvious that the estimates of α_t that it gives for small t are less accurate than those for large t.
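For the Poisson random-walk example used in the earlier sketches, the recursion (20) reduces to simple scalar updates. The following sketch implements the resulting extended Kalman filter; the update and prediction steps are written separately, which coincides with the prediction form (20) here since T_t = 1, and the function name and the large initial variance standing in for a diffuse prior are again illustrative.

    import numpy as np

    def poisson_extended_kf(y, q, a0=0.0, p0=1e7):
        """Extended Kalman filter for the Poisson random-walk example: each observation
        is linearised about the current one-step-ahead prediction, i.e. the trial value
        alpha~_t is the estimate produced by the filter at time t-1."""
        n = len(y)
        a = np.zeros(n); P = np.zeros(n)
        a_pred, P_pred = a0, p0
        for t in range(n):
            lam = np.exp(a_pred)            # b*(alpha~_t) = b**(alpha~_t) = exp(alpha~_t)
            F = lam * P_pred * lam + lam    # F~_t of (20), with Z_t = 1
            K = P_pred * lam / F            # gain for the linearised observation
            v = y[t] - lam                  # v~_t, since Z_t (alpha~_t - a_t) = 0 here
            a_filt = a_pred + K * v
            P_filt = P_pred - K * lam * P_pred
            a[t], P[t] = a_filt, P_filt
            a_pred, P_pred = a_filt, P_filt + q   # prediction step with T_t = R_t = 1
        return a, P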
This fault is rectified by the following device, which uses the entire sample to estimate α_t at each time point. Let us revert momentarily to the Gaussian model (1). Let Y_{t+1}^n = (y_{t+1}′, ..., y_n′)′. Then, because of the Markovian nature of the model,

    p(α_t | Y_n) = p(α_t, Y_n) / p(Y_n)
                 = p(Y_n | α_t) p(α_t) / p(Y_n)
                 = p(Y_t | α_t) p(Y_{t+1}^n | α_t) p(α_t) / p(Y_n)
                 = p(α_t | Y_t) p(α_t | Y_{t+1}^n) c(Y_n) / p(α_t),        (21)

where c(Y_n) depends only on Y_n and where p(α_t) is the marginal density of α_t. This formula has been derived in a wider context by Solo (1989). Now suppose, as will normally be the case in practice, that we have initialised our Kalman filter by a diffuse prior. Then p(α_t) will also be diffuse, so the estimate of α_t going forwards in time and the estimate going backwards in time will be independent. Suppose further that a valid and manageable state space model is available going backwards in time; often, of course, it will be effectively the same model as the one going forwards. Let a_t^n = E(α_t | Y_{t+1}^n) and P_t^n = E[(α_t − a_t^n)(α_t − a_t^n)′ | Y_{t+1}^n]; these are, of course, given routinely by the reverse Kalman filter. Then

    α̂_t = { P_{t|t}^{-1} + (P_t^n)^{-1} }^{-1} { P_{t|t}^{-1} a_{t|t} + (P_t^n)^{-1} a_t^n },

which is called the two-filter smoother. While the two-filter smoother gives the same values as the classical and de Jong smoothers for Gaussian models, it is not as efficient computationally. Its real value in the present context is that, after approximation and simplification, it is capable of giving a better approximation to the PME of the stacked vector than the extended Kalman filter in non-Gaussian cases (see, for example, Abril (1999)). We can use the idea to obtain state starting values after implementing extended Kalman filters in both the forwards and backwards directions.

Let us assume, as is often the case, that the dimensionality of α_t is so large that inversion of the matrices P_{t|t} and P_t^n at each time point is impractical. Let f and b be the forwards and backwards estimates of a particular element of α_t, and let v_f and v_b be the corresponding diagonal elements of the variance matrices P_{t|t} and P_t^n produced by the two filters. Then take as the estimate of this element, to be used as a starting value in the state iteration, the weighted average (f/v_f + b/v_b) / (1/v_f + 1/v_b). The starting values obtained in this way should be more accurate than those given by the extended Kalman filter and should therefore lead to fewer iterations, but whether the resulting computational gain is worthwhile depends on the particular model. We call this the approximate two-filter smoother.

5. Conclusions

This paper presents a methodology for the treatment of non-Gaussian time series observations, particularly when they come from exponential family distributions. The methodology is developed so that it can be used by applied researchers dealing with real non-Gaussian time series data without their having to be time series specialists. The idea underlying the techniques is to put everything in state space form and then linearise it to obtain an approximation to the Gaussian case, so that the KFS can be applied, estimating the state vector by its posterior mode. The PME is clearly a reasonable estimate because it maximises the corresponding density and is analogous to the maximum likelihood estimate.

The way that the state and hyperparameter iterations are tied together is as follows. Let ψ^(1) be the value of ψ that maximises the likelihood using only the extended Kalman filter or the approximate two-filter smoother. Let α_t^(1) be the value of α_t produced by this filter or smoother in the final iteration of this hyperparameter iteration. Taking ψ = ψ^(1) and α_t^(1) as starting values, let α_t^(2) be the smoothed value of α_t produced by the state iteration of Section 2. With α_t = α_t^(2) and ψ^(1) as a trial value of ψ, let ψ^(2) be the value of ψ that maximises the likelihood. Taking α_t^(2) as the starting value, let α_t^(3) be the value produced by the state iteration. The process continues in this way until the relative change in log L is less than a preassigned number, say 10^{-5} or 10^{-6}. A final state iteration is then made to give the final smoothed values α̂_t.
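In terms of the illustrative Poisson functions sketched in the earlier sections, a deliberately simplified skeleton of this alternating scheme might look as follows; in particular, the hyperparameter step shown here does not pass the current smoothed state back into the likelihood maximisation, which the scheme described above would do.

    import numpy as np

    def alternating_estimation(y, rel_tol=1e-5, max_outer=20):
        """Sketch of the combined scheme: starting values from the extended Kalman
        filter, then alternating hyperparameter and state iterations until the
        relative change in log L is below rel_tol."""
        alpha, _ = poisson_extended_kf(y, q=0.1)     # starting values (Section 4)
        loglik_old = -np.inf
        for _ in range(max_outer):
            q = estimate_q(y)                                             # hyperparameter step
            alpha, loglik = poisson_mode_iteration(y, q, alpha0=alpha)    # state iteration step
            if abs(loglik - loglik_old) <= rel_tol * abs(loglik):
                break
            loglik_old = loglik
        return alpha, q, loglik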
REFERENCES

ABRIL, JUAN CARLOS. (1999). Análisis de Series de Tiempo Basado en Modelos de Espacio de Estado. EUDEBA: Buenos Aires.
DE JONG, P. (1989). Smoothing and interpolation with the state-space model. J. Amer. Statist. Assoc., 84, 1085-88.
DE JONG, P. (1991). The diffuse Kalman filter. Ann. Statist., 19, 1073-83.
DURBIN, J. AND S. J. KOOPMAN. (1994). Filtering, smoothing and estimation for time series when the observations come from exponential family distributions. London School of Economics Statistics Research Report.
FAHRMEIR, L. (1992). Posterior mode estimation by extended Kalman filtering for multivariate dynamic generalised linear models. J. Amer. Statist. Assoc., 87, 501-9.
KOOPMAN, S. J. (1993). Disturbance smoother for state space models. Biometrika, 80, 117-26.
MCCULLAGH, P. AND J. A. NELDER. (1989). Generalised Linear Models (Second Ed.). Chapman and Hall: London.
PRESS, W., B. FLANNERY, S. TEUKOLSKY AND W. VETTERLING. (1986). Numerical Recipes: The Art of Scientific Computing. Cambridge University Press: Cambridge.
RAO, C. RADHAKRISHNA. (1973). Linear Statistical Inference and Its Applications (Second Ed.). Wiley: New York.
SAGE, A. AND J. MELSA. (1971). Estimation Theory, With Applications to Communications and Control. McGraw-Hill: New York.
SOLO, VICTOR. (1989). A simple derivation of the Granger representation theorem. Unpublished manuscript.
WEST, M., J. HARRISON AND H. S. MIGON. (1985). Dynamic generalised linear models and Bayesian forecasting. J. Amer. Statist. Assoc., 80, 73-97.
WHITTLE, PETER. (1991). Likelihood and cost as path integrals (with discussion). J. R. Statist. Soc., B, 53, 505-29.