Dynamic and Semiparametric Models

L. Fahrmeir

Dynamic and Semiparametric Models

2012, Wiley Series in Probability and Statistics

Fahrmeir, Knorr-Held: Dynamic and semiparametric models Sonderforschungsbereich 386, Paper 76 (1997) Online unter: http://epub.ub.uni-muenchen.de/ Projektpartner Dynamic and semiparametric models Ludwig Fahrmeir and Leonhard Knorr{Held Universitat Munchen Institut fur Statistik Ludwigstr. 33, 80539 Munchen 1 Introduction This chapter surveys dynamic or state space models and their relationship to non{ and semiparametric models that are based on the roughness penalty approach. We focus on recent advances in dynamic modelling of non{Gaussian, in particular discrete{valued, time series and longitudinal data, make the close correspondence to semiparametric smoothing methods evident, and show how ideas from dynamic models can be adopted for Bayesian semiparametric inference in generalized additive and varying coe cient models. Basic tools for corresponding inference techniques are penalized likelihood estimation, Kalman ltering and smoothing and Markov chain Monte Carlo (MCMC) simulation. Similarities, relative merits, advantages and disadvantages of these methods are illustrated through several applications. Section 2 gives a short introductory review of results for the classical situation of Gaussian time series observations. We start with Whittaker's (1923) \method of graduation" for estimating trends and show that it is equivalent to the posterior mean estimate from a linear Kalman lter model with known smoothing or variance parameters. We sketch extensions to general Gaussian linear dynamic or state space models and to continuous time analogues like the Bayesian version of cubic spline smoothing (Wahba, 1978). For more detailed expositions of the equivalence between Bayesian smoothness priors and penalized least squares we refer the reader to Kohn and Ansley (1988) and previous work cited there and to van der Linde (1995, 1996) for a thorough discussion of splines from a Bayesian point of view. This equivalence also suggests alternative ways of estimating unknown smoothing or variance pa1 rameters: Within a semiparametric approach, estimation by optimizing some cross{validated criterion is a common choice. Empirical Bayes models, also treating hyperparameters as xed or unknown, lead to marginal likelihood estimation. Maximization can be done by EM{type algorithms. Fully Bayesian models put a weakly informative prior on the hyperparameters and make a complete posterior analysis with MCMC techniques feasible. We then turn briey to so{called conditionally Gaussian dynamic models that are still linear but with errors distributed as scale mixtures of normals. Already with this seemingly moderate generalization, penalized least squares and posterior mean estimates are no longer equivalent. Beyond various approximate Kalman lters and smoothers, fully Bayesian approaches based on MCMC are available that make e cient use of the conditionally Gaussian structure. Fundamentally non{Gaussian time series and longitudinal data, in particular for categorical and count data, are considered in Section 3. Dynamic binomial and Poisson models are important members of the family of dynamic generalized linear models. Semiparametric counterparts based on penalized likelihood estimation can be derived as posterior mode estimators, with extended or iterative Kalman{type smoothing algorithms as e cient computational tools (Fahrmeir 1992 Fahrmeir and Tutz, 1994, ch. 8, Fahrmeir and Wagenpfeil, 1996). However, the equivalence between posterior mean and penalized likelihood estimation is lost. Fully Bayesian inference is possible with recently developed MCMC techniques for non{Gaussian dynamic models, an area of intensive current research. In Section 3.2, we outline the ideas for Metropolis{Hastings algorithms suggested by Knorr{Held (1996). These algorithms are used for the applications and are generalized in Section 3.3 to non{normal longitudinal data with additional unobserved population heterogeneity across units. Ideas from non{Gaussian dynamic modelling, in particular for non{equally spaced or continuous{time parameter models, can be transferred to semiparametric regression models for cross{sectional data (Section 4). This leads to Bayesian spline{type smoothing for generalized additive and varying coe cient models using MCMC techniques as a supplement and alternative to penalized likelihood or, equivalently, posterior mode estimation (Hastie and Tibshirani, 1990, 1993). 2 Finally Section 5 summarizes conclusions and indicates extensions to other data situations and statistical models. 2 Linear dynamic models and optimal smoothing for time series data This section gives a brief survey on the correspondence between linear dynamic or state space models and semiparametric optimal smoothing methods based on the roughness penalty approach. We illustrate this correspondence by some simple and commonly{used examples and review more general and recent work. 2.1 Gaussian models In the classical smoothing problem treated by Whittaker (1923), time series observations y = (y1 : : : yT ) are assumed to be the sum yt = t + t t = 1 : : : T (1) of a smooth trend function and an irregular noise component . Whittaker suggested to estimate by minimizing the penalized least squares criterion X X PLS( ) = (y ; ) + ( ; 2 ; + ; ) T t=1 t t T 2 t=3 t t 1 t 2 2 (2) with respect to = (1 : : : T ). Minimization of PLS( ) tries to hold the balance between t of the data, expressed by the sum of squares on the left side and smoothness of the trend, expressed by the roughness penalty term in form of the sum of squared dierences. The smoothness parameter , assumed to be given or xed, weights the two competing goals data t and smoothness. The trend model (1) and the PLS criterion (2) can be generalized in a exible way. Inclusion of a seasonal component = (1 : : : T ) with period m leads to the additive trend{seasonal 3 model yt = t + t + t t = 1 : : : T (3) and the PLS criterion XT (yt ;t ;t) + XT (t ;2t; +t; ) + XT (t +t; +: : :+t;m )2 ! min t=m t=1 t=3 (4) for estimating the trend function and the seasonal component . More generally, the inuence of covariates can be taken into account by extending the additive predictor t + t to t = t + t + x0t t + wt0 (5) PLS( ) = 2 with time{varying eects t 1 1 2 2 2 1 +1 for xt and constant eects for wt. This penalized least squares approach is reasonable if time series observations are { at least approximately { Gaussian. This is made explicit by assuming that the errors t in (1) and (3) are i.i.d. N (0 2) random variables. Then the t term in (4) corresponds to the log{ likelihood of the additive Gaussian observation model (3), and the PLS approach appears as a semiparametric method for estimating the xed, unknown sequences = (1 : : : T ), = (1 : : : T ). A dynamic model corresponding to (3) and (4) considers and as sequences of random variables. It is hierarchical and consists of two stages: The rst stage is the Gaussian observation model (3) for y given and . In the second stage, a transition model corresponding to the roughness penalty term in (4) is given by the dierence equations t ; 2t;1 + t;2 = ut t = 3 : : : T t + t;1 + : : : + t;m+1 = wt t = m : : : T (6) (7) The errors ut and wt are i.i.d. N (0 u2 ){ and N (0 w2 ){ distributed. Initial values are speci ed for example by (1 2)0 N (0 k I ) (1 : : : m;1 )0 N (0 k I ): (8) All errors and initial values are assumed as mutually independent. The dierence equation (6), also called a random walk of second order, penalizes deviations from the linear trend t = 4 2 t;1 ; t;2. The seasonal model (7) prefers \small" values of the sum t + t;1 + + t;m+1 , so that the seasonal pattern does not change too much over periods. From a Bayesian point of view (6) and (7), together with (8), de ne a multivariate normal prior p( ) = p( )p( ) for ( ) = ( 1 T 1 T ), the so{called smoothness prior. Also, the observation model (3) de nes a multivariate normal distribution p( j ) for the data given and . Here the hyperparameters, i.e. the variances 2, u2 , w2 , are regarded as known or given constants. Thus, the posterior (9) p( j ) = p( j p()p() )p( ) / p( j )p( )p( ) is also normal and characterized by the posterior expectation ( j ) and covariance ( j ). Due to normality, the posterior expectation and the posterior mode, i.e. the maximizer of (9), coincide. Taking logarithms, using the (conditional) independence assumptions and ignoring constant factors leads to the criterion 1 T ( ; ; )2 + 1 ( 2 + 2 ) + 1 T ( ; 2 + )2 ::: ::: ::: y y y y y y E y V ar y 2 X t=1 y t + 1 ( 12 + k ::: t + t 1 k 1 m;1 ) + 2 2 2 XT ( t + w t=m 2 X t u t=3 ::: + t;m+1 ) 2 t;1 t;2 ! min (10) for estimating , . Setting 1 = 2 u2 2 = 2 w2 and choosing diuse priors (8) with ! 1, ! 1, the criteria (4) and (10) are identical so that the semiparametric PLS estimate , is identical to the posterior mode estimate and, due to posterior normality, the posterior mean: ( )= ( j ) (11) k = = bb k bb E y : This equivalence remains valid for more general linear Gaussian observation and transition models, see Kohn and Ansley (1988) for a thorough treatment. Collecting trend, season and other parameters as in (5) in a so{called state vector t, e.g. t = (t t;1 t : : : t;m+1 t ) most linear dynamic models can be put in the form of Gaussian linear state space models 0 t = zt t + t y t+1 = F t t + vt 5 t N (0 v ) t (0 t ) N 2 Q (12) (13) by appropriate de nition of design vectors t and transition matrices t, see for example Harvey (1989), West and Harrison (1989) or Fahrmeir and Tutz (1994, ch. 8.1). The well{ known classical Kalman lter and smoother or recent variants like the diuse lter (de Jong, 1991) can be used for e ciently computing posterior expectations b t = ( tj ) and variances ( tj ). Because of the equivalence with a corresponding PLS criterion, the Kalman lter and smoother can also be regarded as an algorithmic tool for computing semiparametric PLS estimates, without any need for a Bayesian interpretation, see Fahrmeir and Tutz (1994, ch. 8.1). Using Kalman smoothers for semiparametric additive models like (3) avoids back tting and provides diagonal bands of smoother matrices as a by{product. However, forcing dynamic models into state space form can result in high{dimensional state vectors with singular multivariate priors, causing unnecessary algorithmic complications. z F E V ar y y Up to now it was tacitly assumed that the time series is equally spaced in time. Extensions to non{equally spaced data are possible either by modi ed dierence priors or by continuous time models. For example, a rst order random walk t ; t;1 = t, t (0 u2 ) is generalized to (14) (0 t u2) t ; t;1 = t t u where t is the time between observation non{equally spaced data can be de ned by t where kt ; 1+ t ; t 1 ! t ;1 + u ;1 yt t ; t 1 N u u N and t. A second order random walk for y ;2 = ut t is a weight function depending on t and ; t 1 ut (0 N kt 2 ) u (15) . A simple and straightforward choice is t = t as for rst order random walk priors. There are other reasonable, but more complex forms of t that are consistent with the equally{ spaced case, see Knorr{Held (1996). Corresponding PLS criteria are easily derived from these priors. k k For continuous{time models, trend, season and other time{varying parameters are considered as smooth functions of time. With a slight change in notation, the simple trend model (1) becomes =1 s = ( s) + s y t 6 s : : : T with observation times 1 s , a smooth trend function ( ) and i.i.d. errors (0 2). A continuous time version of the PLS criterion (2) is: Find as a twice{ s dierentiable function that minimizes t N < ::: < t < ::: < t t X( T s=1 b ; ( s)) + ys 2 t Z ( 00( )) 2 t (16) dt: The minimizing function is a cubic smoothing spline, see Green and Silverman (1994) for a recent treatment and Eubank (1996, this volume). Wahba (1978) showed that (16) has a Bayesian justi cation by placing the solution of the stochastic dierential equation 2 ( ) () s = ;1=2 (17) d t dW s ds2 ds t1 s as a smoothness prior over . Such a dierential equation of order two is the continuous time version of a second order random walk (6). Here ( ) is a standard Wiener process with ( 1) = 0, independent of the errors s. For diuse initial conditions W s W t ( ( 1) 0( 1)) (0 t b t N ) kI k !1 the cubic smoothing spline ( ) at coincides with the posterior expectation of ( ) given the data, i.e. ( ) = ( ( )j ) s s s b s E s y : This equivalence can also be established for more general types of splines where second derivatives are replaced by linear dierential operators, see e.g. Kohn and Ansley (1987, 1988). They also derive a discrete{time stochastic dierence equation from (17) and use state space techniques for computation of the smoothing spline. Again, pointwise Bayesian con dence bands can be computed as a by{product. For a recent discussion of splines from a Bayesian point of view we refer to van der Linde (1995). In practice, smoothing parameters or hyperparameters like 2 u2 w2 are usually unknown. Within the semiparametric roughness penalty approach, data{driven choice of smoothing parameters is often done by cross-validated optimization of some selection criterion. Already for a small number of smoothing parameters problems may occur because the selection 7 criterion can be a rather at function of = ( 1 2 ). Whithin an empirical Bayes approach, hyperparameters in dynamic models are treated as unknown constants. Then the method of maximum likelihood is a natural choice. Maximization can be carried out directly by numerical optimization routines or indirectly via the EM algorithm, see Harvey (1989, ch. 4). If the likelihood is rather at, then ML estimation also performs poorly. Fully Bayesian approaches can avoid these problems by providing additional information about hyperparameters in form of \hyperpriors". A traditional approach are discrete priors leading to multiprocess Kalman lters (Harrison and Stevens, 1976). More recently Markov chain Monte Carlo (MCMC) techniques have been developed to estimate hyperparameters by simulation from their posteriors (Carlin, Polson and Stoer, 1992, Carter and Kohn, 1994, Fruhwirth{Schnatter, 1994). An advantage of these simulation methods is that their basic concepts are also useful in conditionally non{Gaussian situations as below and in the following sections. ::: 2.2 Conditionally Gaussian models Gaussian models are not robust against outliers in the observation errors and change points in the trend function or other unobserved components. One way to robustify linear dynamic models is to assume that error distributions are scale mixtures of normals. For given values of the mixture variables the linear dynamic model is then conditionally Gaussian. Mixture variables may be discrete or continuous. A popular choice are 2{mixture variables, leading to {distributions for the errors. A conditionally Gaussian version of the simple trend model (1) with a second order random walk model for the trend is t = + ; 2 ;1 + ;2 = yt t t t t j (0 j 2 (0 t t !1t N ut ut ! t N 2 ) 2 ) =!1t 2 u =! t : Assuming 1 1 and 2 2 to be independently 2{distributed with 1 and 2 degrees of freedom, then and are independently ( 1) and ( 2) distributed. Although Kalman lters and smoothers are still best linear estimators, they perform poorly for small degrees of freedom 1 and 2. Various approximate ltering and smoothing algorithms have therefore been ! t t ! t ut t t 8 given already in early work on robusti ed state space modelling (Masreliez, 1975, Masreliez and Martin, 1977, Martin and Raftery, 1987). More recently, fully Bayesian MCMC methods have been developed to tackle this problem. Carlin, Polson and Stoer (1992) suggest a Gibbs sampling algorithm adding the mixture variables 1t and 2t to the set of unknown parameters. Their approach applies to rather general nonnormal dynamic models, but can be ine cient with respect to mixing and convergence properties. Carter and Kohn (1994, 1996) and Shephard (1994) propose a modi ed Gibbs sampling algorithm, that updates the whole \state vector" = ( 1 T ) all at once. This modi cation makes the algorithm much more e cient. The parameters 1 T are often highly correlated, so updating t , =1 one at a time, which is done in Carlin, Polson and Stoer, often results in poor mixing, i.e. the corresponding Markov chain is not moving rapidly throughout the support of the posterior distribution. Consequently, Monte{Carlo standard errors of sample averages will be large. ! ! ::: ::: t :::T As an alternative to these fully Bayesian methods one may also consider posterior mode estimation. Let 1( t) and 2( t) denote the negative log{densities of the i.i.d. errors t and t. Taking logarithms and using (conditional) independence assumptions, a robusti ed version of the PLS criterion (2) u u X T t=1 ( ; t )2 + 1 yt X T t=3 ( ;2 2 t ;1 + t;2 ) t 2 ! min (18) b can be derived. Computation of the minimizer can be carried out by iterative Kalman{ type algorithms, see Kunstler (1996). An advantage of posterior mode estimation is that it can also be extended to other {functions, for example Huber functions or ( ) = j j. Also, one may start directly from criterion (18), without Bayesian interpretation, to obtain robust semiparametric estimators, and transfer this approach to robust continuous{time spline{type estimation. It should be noted, however, that already for conditionally Gaussian dynamic linear models posterior mean estimates, obtained from a fully Bayesian approach, and posterior mode or spline{type estimators are no longer equivalent. This property holds only for linear Gaussian models with known hyperparameters as in Section (2.1). u 9 u 3 Non{Gaussian observation models This section deals with fundamentally non{Gaussian time series and longitudinal data. We progress from simple examples for discrete{valued time series to general non{Gaussian situations. 3.1 Non{Gaussian time series Figure 1 displays the number t of occurrences of rainfall over 1 mm in the Tokyo area for each calendar day during the years 1983{1984. The data, presented in Kitagawa (1987) and reanalyzed later on by several authors, is an example of a discrete{valued time series. Responses t, = 1 366, are assumed as binomial: y y t ::: yt 8 >< t ) with >: ( B nt = 2 for = 6 60 ( February 29) t = 1 for = 60 nt t n t and t the probability of rainfall on calendar day . To compare it to similar data from other areas or other years, and to see some seasonal pattern, the probabilities = ( 1 T ), = 366, will be estimated as a smooth curve. For the following we reparametrize t by a logit link to t: exp( t) t t( t) = t = log 1; 1 + exp( ) t ::: T t t : A semiparametric discrete{time roughness penalty approach will start from a penalized log{ likelihood criterion like ( )= PL T X t=1 f t log t( t) + ( t ; t) log(1 ; t( t))g; y n y T X t=3 ( t;2 t ;1 + t;2) 2 ! max (19) to obtain smooth estimates b and b of the xed, unknown sequences and . Comparison with the penalized least squares criterion (2) shows that essentially the Gaussian log{ likelihood of the observation model (1) is replaced by the sum of binomial log{likelihood contributions. Instead of second order dierences, one might also use a sum P( t ; t;1)2 of squared rst order dierences. 10 2.0 1.5 1.0 0.5 0.0 0 100 200 300 days Figure 1: Tokyo rainfall data. Using the same notation as in Section 2 the continuous{time version of (19) is: Find = f ( )g as a twice{dierentiable function that solves t ( )= PL Xf T s=1 ys log ( ( s)) + ( s ; s) log(1 ; t( ( s)))g ; t n y t Z( 00 ( ))2 ! max (20) t dt : For a given smoothing parameter , the solution is again a cubic smoothing spline, see Hastie and Tibshirani (1990) and Green and Silverman (1994). For equally{spaced data as in the example, the discrete{ and continuous{time spline solutions to (19) and (20) are usually in quite close agreement. Algorithmic e cient solutions of the high{dimensional nonlinear optimization problems (19) and (20) are usually carried out by iteratively applying smoothers for penalized least squares estimation to working observations. For a Bayesian version of the semiparametric approach (19) in form of a non{Gaussian dynamic model, we take j ( yt t t) ( ) = 1 +exp( exp( t) ( )) B nt t t t t (21) as the observation model. We supplement it as in (6) by a random walk model of rst or 11 second order t = t ;1 + ut or t = 2 t;1 + ;2 + ut (22) t as a smoothness prior for . The errors t are i.i.d. (0 u2 ) distributed, and initial values 1 resp. 1 , 2 are speci ed as in (8). Variances are assumed to be known. In addition, conditional independence is assumed among all tj . u N y In contrast to Gaussian models, the posterior j )p( ) / p( j )p( ) p( j ) = p( p( ) y y y y (23) is now non{normal. Thus, posterior expectations and posterior modes are no longer equivalent. With a diuse prior for initial values, the posterior mode b is the maximizer of (19) with smoothing parameter equal to 1 2 2 . Algorithmic solutions can be e ciently obtained by extended or iterative Kalman ltering and smoothing, see Fahrmeir (1992), Fahrmeir and Tutz (1994) and Fahrmeir and Wagenpfeil (1996). As in the Gaussian case, these techniques may also be viewed as convenient computational tools for computing penalized likelihood estimators, without Bayesian interpretation. For a fully Bayesian analysis, including computation of posterior moments and quantiles, simulation based estimation, in particular MCMC methods, are generally most appropriate. Details are given in Section 3.2. = A continuous{time dynamic model corresponding to (20) is obtained by placing the stochastic dierential equation (17) as a smoothness prior over . Again, posterior modes are still equivalent to cubic smoothing splines, but dierent from posterior expectations. Fully Bayesian spline{type smoothing will also be based on MCMC for dynamic models. For this purpose, it is useful to rewrite the continuous{time prior (17) as a stochastic dierence equation for the state vector () ) t = ( ( ) t d t =dt of the trend and its derivative. Following Kohn and Ansley (1987), the sequence of evaluations at s, = 1 obeys the stochastic dierence equation t s s := ( s) t :::T s+1 = Fs s + us 12 s =1 : : : T (24) with transition matrices F 0 B1 s = @ 0 s+1 1 1 CA and independent errors us (0 N 2 s+1 0 B s = @ ) Us = 3 s+1 = 2 s+1 = U ts+1 ; ts 1 2C A 3 2 2 s+1 = : s+1 Together with the observation model j ( ) ( ys ts ( )) B ns ts we obtain a binomial dynamic, or state space, model. Higher order splines can also be written in state space form, see Kohn & Ansley (1987). As a second example, we consider a time series of counts t of the weekly incidence of acute hemorragic conjunctivitis (AHC) in Chiba{prefecture in Japan during 1987. Kashiwagi and Yanagimoto (1992) analyze this data, assuming a loglinear Poisson model y j ( t) yt t Po t = exp( t) and a rst order random walk prior for . They obtain a posterior mean estimate based on numerical integrations similar as in Kitagawa (1987). Of course, other smoothness priors as second order random walks or the continuous{time analogue (17) might be used as well. A penalized likelihood approach would start directly from T T X X ( t ; t;1)2 ! max ( ) = ( t log t ; t) ; y PL t=2 t=1 or with other forms of the roughness penalty term. Again, penalized likelihood estimators are equivalent to posterior mode estimators, but dierent from corresponding posterior means. Both examples belong to the class of dynamic generalized linear models. The general observation model is as following: The conditional density of t, given the unknown state vector family type with conditional expectation y ( j t) = E yt t 13 = ( t) h t is of the linear exponential related to the linear predictor t = t0 t by a suitable link . As in the Gaussian case the components of t may consist of trend t, season t and possibly time{varying eects t of covariates t and t is a suitable design vector. For example an additive predictor z h x z t = t+ t+ 0 t t x can be written in this form. Although time{constant eects can be incorporated formally by setting t = t;1, it is often advantageous to split up the predictor in 0 = t+ t+ t 0 t t + wt x : For the second stage, smoothness priors p( ) are put on the sequence = ( 1 T ) in form of a transition model. Linear Gaussian transition models like dierence equation (6), (7) or the state space form t+1 = t + t are often retained as a common choice, but we will also use priors for non{equally spaced observations or continuous times priors. ::: F u As for the examples, we can always write down a corresponding semiparametric model and an associated penalized likelihood criterion ( )= PL X T t=1 t( tj t ) ; l y X p j =1 j I ( j ) ! max : (25) Here j = ( j1 jT ) is the {th component of , ( j ) a penalty function and j a smoothing parameter. For given smoothing parameters j , estimates j are obtained by iterative smoothing, with back tting in an inner loop, see Hastie and Tibshirani (1990, 1993), Green and Silverman (1994), Fahrmeir and Tutz (1994, ch. 5). As in the examples, (25) can always be derived from the corresponding dynamic model relying on the principle of posterior mode estimation. ::: j I Estimation of unknown smoothing parameters j or corresponding hyperparameters j2 can be based on the same principles as for Gaussian models. Relying on the roughness penalty approach, smoothing parameters are selected as minimizer of a generalized cross{validation criterion, see O'Sullivan, Yandell and Raynor (1986), Wahba, Wang, Gu, Klein and Klein (1995). Empirical Bayes approaches consider hyperparameters j2 as unknown but xed and use (approximate) maximum likelihood estimation, for example an EM{type algorithm as 14 suggested in Fahrmeir (1992). Wagenpfeil (1996) compares some of these approaches. In a fully Bayesian setting, hyperparameters j2 are considered as random and independent inverse gamma priors 2 ( j j) =1 (26) j are a common choice for hyperpriors. By appropriate choice of j , j , these priors can be made more or less informative. IG a b j :::p a b 3.2 MCMC in non{Gaussian dynamic models The design of e cient MCMC algorithms in dynamic models with non{Gaussian observation model is currently an intense research area. For easier presentation, we rst discuss several MCMC algorithms for simple non{Gaussian dynamic trend models, like the dynamic binomial or Poisson models in the examples above. Extensions to the general case are outlined at the end of this subsection. Supplementing model (21), (22) with a hyperprior p( u2) for u2, for example an inverse gamma prior as in (26), the posterior distribution of the parameters and u2 is given by: p( u2j ) / p( j )p( j u2)p( u2) (27) MCMC methods construct Markov chains that converge to a given distribution, here the posterior. Once the chain has reached equilibrium, it provides (dependent) samples from that posterior distribution. Quantities of interest, such as the posterior mean or median, can now be estimated by the appropriate empirical versions. y y : The well{known Gibbs sampling algorithm (e.g. Gelfand and Smith, 1990) is based on samples from the full conditional distributions of all parameters. In general, a full conditional distribution is proportional to the posterior (27) but often considerable simpli cations can be done. To implement the Gibbs sampler in dynamic trend models, we have to sample from p( tj 2 ) / p( tj t)p( tj s=t and p( 2j ) / p( j t2)p( u2) y y y : 6 2) u (28) (29) If inverse gamma priors are assigned to u2, (29) is still inverse gamma and samples can be generated easily using standard algorithms. 15 Suppose we could also easily generate samples from (28), = 1 . The Gibbs sampling algorithm iteratively updates 1 and 2 by samples from their full conditionals. Markov chain theory shows, that the so generated sequence of random numbers converges to the posterior (27) for any starting value of the Markov chain. Such an algorithm is proposed in Fahrmeir, Hennevogl and Klemme (1992), following suggestions of Carlin, Polson and Stoer (1992). However, there are some drawbacks of pure Gibbs sampling in non{Gaussian dynamic models. Firstly, samples from (28), which is non{standard for non{Gaussian observation models, can only be obtained by carefully designed rejection algorithms which may require already a considerable amount of computation time in itself. Fortunately, instead of sampling from the full conditional distribution, a member of the more general class of Hastings algorithms (Hastings, 1970) can be used to update , = 1 . Here so{called proposals are generated from an arbitrary distribution and a speci c accept{reject step is added. Such a Hastings step is typically easier to implement and more e cient in terms of CPU time. A thorough discussion of the Hastings algorithm is given in Tierney (1994) and Besag, Green, Higdon and Mengersen (1995). t : : : T :::T u t t :::T For example, to update (28), it is su cient to generate a proposal from the conditional prior distribution p( 6= 2) and to accept the proposal as the new state of the Markov chain with probability ( ) ) p( = min 1 p( ) here denotes the current state of the chain. The resulting algorithm requires less computation time than pure Gibbs sampling since the conditional prior distribution is Gaussian with known moments so proposals are easy to generate. t t js t u yt jt yt jt t However, the generated Markov chain might show signs of slow convergence and does not mix rapidly. That is, the Markov chain is not moving rapidly throughout the support of the posterior distribution so that subsequent samples are highly dependent and Monte Carlo estimates become imprecise. This is a consequence of the underlying single move strategy, i.e. parameters = 1 are updated one by one. Various attempts have been made to design algorithms that converge fast and mix rapidly. A fruitful idea is the use of blocking here blocks of parameters, say ( ) = ( +1 ;1 ), are updated simultaneously t t :::T ab a a 16 : : : b b rather than step by step. Such a blocking strategy is a compromise between updating all at once, which is infeasible for fundamentally non{Gaussian time series, and updating one at a time. The algorithms of Shephard and Pitt (1995) and Knorr{Held (1996) are based on blocking Knorr{Held generalizes of the conditional prior proposal above to block move algorithms: Generate a proposal ( ) form the conditional prior distribution p( ( )j (1 ;1) ( +1 ) 2) and accept the proposal as the new state of the Markov chain with probability 9 8 Q > >> < = p( j ) >= = min >1 Q > >: p( j )> = ab a b T ab u b t yt t a : b yt t t a One of the advantages of MCMC is the possibility to calculate exact posterior distributions of functionals of parameters. For the Tokyo rainfall data, the posterior estimates of the probabilities ) = 1 +exp( (30) exp( ) t t t are of main interest. Instead of plugging an estimate for f g in (30), we calculate posterior samples from f g, using the original samples from p( j ). The posterior distributions p( j ) can now be explored in detail without any approximation. In contrast, posterior mode or splines estimation do not have this feature. Here plug{in estimates, especially con dence bands, are typically biased due to the non{linearity in (30). Similar considerations apply to the AHC example, where = exp( ) is to be estimated. t t y t y t Figure 2 shows the posterior estimates of the probabilities f g for the Tokyo rainfall data, calculated by a conditional prior block{MCMC algorithm. A highly dispersed but proper inverse gamma hyperprior (26) with = 1, = 0 00005 was assigned to 2. This prior has a mode at 0.000025. The estimated posterior median was 0.0001. The pattern in Figure 2, with peaks for wet seasons, nicely reects the climate in Tokyo. It would be di cult to see this by looking only at the raw data (Figure 1). In Fahrmeir and Tutz (1994, ch. 5.3), the probabilities f g are tted by a cubic smoothing spline, with the smoothing parameter estimated by generalized cross{validation criterion. This criterion had two local minima at = 32 and = 4064. The smoothing spline for = 4064 is quite close to the posterior median t in 2, while the smoothing spline for = 32 is much rougher. Such rougher t a b : u t 17 posterior median estimates are also obtained if the parameter for the inverse gamma prior is set to higher values. For example, with = 1, = 0 005, the prior mode equals 0.0025. This prior is in favor of larger values for 2, so that posterior median estimates for f g become rougher. As a third approach, posterior mode estimation, with an EM{type algorithm for estimating 2 by maximization of the marginal likelihood, also gives estimates that are in good agreement. These results correspond to empirical evidence experienced in other applications: If smoothing and variance parameters are properly adjusted, posterior mean and medians are often rather close to posterior modes or penalized likelihood estimates. Also, estimation of hyperparameters by cross-validation or marginal likelihood can be helpful for the choice of parameters of the hyperprior in a fully Bayesian model. Similar evidence is provided by the next example. b a b : t u 1.0 u • • • • • •• • •••• •• • • • • ••• •• 0.6 0.8 • • •• • • 0.0 0.2 0.4 •• ••• •• • • •• • •••••••••••••••• •• • •• ••• •• ••• ••• ••••• •• • • ••••• ••••• •••••• ••••••• • •• •• • ••••••••••••••••••••••••••••••••• •••• ••••••••••• ••••••••••• •• ••• •• •••••••••• ••••••••• ••• • •••••••• ••••••••••••••••••••••• ••••• 0 100 200 300 days Figure 2: Tokyo rainfall data. Data and tted probabilities (posterior median within 50, 80 and 95 % credible regions). The data is reproduced as relative frequencies with values 0, 0.5 and 1. Estimates for the AHC data are shown in Figure 3a) and b) for both rst and second order random walk priors. The posterior distribution of the intensities f g shows a peak around weak 33 similar to the results of Kashiwagi and Yanagimoto (1992). Compared to the model with second order random walk priors, estimates in Figure 3a) are somewhat rougher and the t 18 peak around week 33 is lower and more at. This reects the fact that rst order random walk priors are in favor of horizontal, locally straight lines. Figure 3c) shows Bayesian cubic spline{type estimates with thec continuous{time prior (17). As was to be expected with equally spaced observations, these estimates are in very close agreement with those in Figure 3b). Figure 3d) shows displays the cubic smoothing spline, which is the posterior mode estimator from the Bayesian point of view. As with the rainfall data example, it is again quite close to the posterior median in 3c). In more general dynamic models, response t is related to some unknown parameter vector t , see for example the state space representation (24) of the spline{type prior (17). MCMC simulation in dynamic models can be performed similarly as for the simple dynamic trend model, where t = t is a scalar, by single{ or block{move algorithms. Shephard and Pitt (1995) make speci c Fisher scoring type steps to construct a proposal that approximates the full conditional distribution taking the observation into account. In contrast, conditional prior proposals are built independently of the observation and are therefore easier to construct. Their performance is good for situations, where the posterior is not very dierent from the conditional prior. This is typically the case for discrete valued observations such as bi{ or multinomial logistic models as in our examples. Sometimes components j of =( 1 T ) (compare the notation in (25)) are a priori independent and a componentwise updating strategy with conditional prior proposals can have advantages. Componentwise updating becomes inevitable in problems with multiple time scales or, more general, generalized additive models, see Section 4. y ::: Finally we note, that it is possible to combine robust transition models with non{Gaussian observation models similarly as in Section 2.2. For example, on may use random walk priors with t{distributed errors for trend components, allowing for abrupt large jumps. MCMC simulation in such models is often straightforward, since error terms are still Gaussian, given unknown mixture values. An example is given in Knorr{Held (1996) with ( ){distributed errors and an additional hyperprior on the degrees of freedom . t 19 5 second order RW prior 5 first order RW prior 0 10 20 30 40 3 • • •••••• •• • • 2 ••••• •••• • • • •• • • •• • ••• • •• • 1 1 • • • •• • • ••• • • • 0 4 3 2 • 0 • • ••••• •••• • • • •• • 50 • • 4 • • • • •• • • ••• 0 10 • •• • ••• • •• 20 • • •••••• •• • • 30 40 50 stoch. diff. eq. prior spline type estimate 5 weeks (b) 5 weeks (a) 0 10 20 • • •••••• •• • • 30 40 50 3 • •• • ••• • 2 ••••• •••• • • • •• • •• • 1 1 • • • •• • • ••• • ••••• •••• • • • •• • 0 weeks (c) • • 0 4 3 2 • 0 • • • • 4 • • • • •• • • ••• •• 10 • •• • ••• • 20 • • •••••• •• • • 30 40 50 weeks (d) AHC data. Data and tted probabilities (posterior median within 50, 80 and 95% credible regions). Figure 3: 20 3.3 Non{Gaussian longitudinal data In this section, we consider longitudinal data where observations ( ) yti xti t =1 : : : T i =1 : : : n on a response variable and a vector of covariates are made for a cross{section of units at the same time points = 1 . Models for Gaussian outcomes have been treated already extensively, but much less has been done in the non{Gaussian case. As an example, we will consider monthly business test data collected by the IFO institute in Munich for a large cross{section of rms. Answers given in a monthly questionnaire are categorical, most of them trichotomous with categories like \increase" (+), \no change" (=) or \decrease" (;), compare Fahrmeir and Tutz (1994, Examples 6.3, 8.5). Selecting a speci c response variable , say answers on production plans, we obtain categorical longitudinal data. y x t n :::T yti y Observation models for longitudinal data can be de ned by appropriate extensions of models for time series data. A straightforward generalization within the exponential family framework is as follows: For given covariates and a possibly time{varying parameter vector , the {dimensional response comes from a linear exponential density with conditional mean xti q t yti ( j E yti xti t )= ( ) (31) h ti and linear predictor ti = Zti (32) t: Here : ! is a {dimensional link and the matrix is a function of the covariates and possibly past responses. Individual responses are assumed to be conditionally independent. A dynamic model for longitudinal data is obtained by supplementing the observation model (31) and (32) with transition models as smoothness priors for as in Section 3.1. Just as for time series, some subvector e of may indeed be time{constant. Such partially dynamic models are formally covered by (32) with the additional restriction e = e ;1 or by making this explicit and rewriting the predictor in additive form = + . h IR q IR q q Zti t xti t t ti 21 Zti t t Vti The posterior mode or penalized likelihood approach leads to XX ( )= T PL n t=1 i=1 ti( t l X ); p j =1 j I ( j ) ! max (33) : Here, ti( t) = log ( tij ti t) is the conditional likelihood contribution of observation ti. Computationally e cient solutions can be obtained, for example, by extended or iterative Kalman{type smoothers, see Fahrmeir and Tutz (1994, ch. 8.4) and Wagenpfeil (1996). l f y x y Observation models of the form (31), (32) may be appropriate if heterogeneity among units is su ciently described by observed covariates. This will not always be the case, in particular for larger cross{sections. A natural way to deal with this problem is an additive extension of the linear predictor to ti = ti t + ti i Z W b where i are unit{speci c parameters and ti an appropriate design matrix. A dynamic mixed model is obtained with usual transition models for and a \random eects" model for the unit{speci c parameter. A common assumption is to assume the i are i.i.d. Gaussian, b W 0 b s b i (0 ) N (34) D with covariance matrix . For posterior mode or penalized likelihood estimation of ( 1 T ) and = ( 1 n ), a further penalty term D ::: b b :::b ( )= I b X n i=1 0 i = i b Db corresponding to the Gaussian prior (34) is added to (33). An algorithmic solution for the resulting joint posterior mode or penalized likelihood estimates ( ) is worked out in Biller (1993), also in combination with an EM{type algorithm for estimation of smoothing parameters. However, computation times become large for multicategorical responses. Moreover, serious bias may occur, see Breslow and Clayton (1993), Breslow and Lin (1995). bb b MCMC techniques are more attractive for dynamic mixed models through their model exibility. The additional parameters 1 n are added to the set of unknown parameters and are updated with some well designed proposals, for example with Metropolis random walk b :::b 22 proposals, in every MCMC cycle. Besides, a hyperprior for D has to be introduced. The usual choice is the inverted Wishart distribution p(D) / jDj;;(m+1)=2 exp(;tr(BD;1)) with parameters > (m ; 1)=2 and jB j > 0 here m is the dimension of bi. A Gibbs step can then be used to update D. Turning to the IFO business test example, we investigate the dependency of current production plans on demand and orders in hand in the speci c branch \Vorprodukte Steine und Erden". We have complete longitudinal observations of 51 rms for the period from 1980 to 1994. Our model allows for time{changing eects of covariates and for trend and seasonal variation of threshold parameters, which represent corresponding probabilities of the response categories. Additional unit{speci c parameters bi are introduced to allow for rm{speci c dierences of these probabilities. The response variable "production plans" is given in three ordered categories: "increase" (+), "no change" (=) and "decrease" (;). Its conditional distribution is assumed to depend on the covariates "orders in hand", "expected business conditions" as well as on the production plans of the previous month. All these covariates are trichotomous. We used a dummy coding approach for comparison with previous analyses with the category (;) as reference category. The corresponding dummies are denoted by A+, A= (orders in hand), G+ , G= (expected business conditions) and P +, P = (production plans of the previous month) and de ne the covariate vector xti. The inclusion of P as a covariate reduces the panel length by 1 to T = 179 (February 1980 to December 1994). A cumulative logistic model (e.g. Fahrmeir & Tutz, 1994a, ch. 3) was used due to the ordinal nature of the response variable: Let yeti = 1 and yeti = 2 denote the response categories \increase" and \no change" respectively. Then P (yeti j ) = F (tij + x0ti t) j = 1 2 is assumed with xti = (G+ G= P + P = A+ A=)0 and F (x) = 1=(1 + exp(;x)). We decompose both threshold parameters ti1 and ti2 into trend parameters t, seasonal 23 parameters and unit speci c parameters b , one for each threshold: t i = + + b j = 1 2: tij tj tj ij Note that the threshold parameters have to follow the restriction 1 < 2 for all combinations of t and i. A seasonal model (7) with period m = 12 was chosen for the seasonal parameters of both thresholds. First order random walk priors are assigned to all covariate eect parameters and to both trend parameters 1 2. All time{changing parameters are assumed to be mutually independent with proper but highly dispersed inverse gamma hyperpriors (a=1, b=0.005). The rm{speci c parameters b = (b 1 b 2) are assumed to follow a Gaussian distribution with mean zero and dispersion D. We used the parameter values = 1 and B = diag(0:005 0:005) for the inverted Wishart hyperprior speci cation for D. ti t t ti t i i i 0 This model can be written as a dynamic mixed model with = h( ) = h(Z ti where 0 t ti = ( 1 1 2 2 ), t t t 0 1 W =B @ 0 0 t t ti and ti t + W b ) ti i 1 0C A 1 1 0 1 1 0 0 x C Z =B A: @ 0 0 1 1 x The responses variable y is multinomially distributed 0 ti ti 0 ti ti y M2(1 ) ti ti where y = (1 0) , (0 1) or (0 0) , if the rst (+), second (=) or third (;) category is observed. The link function h is given by 1 0 F ( 1 ) CA : h( ) = B @ F ( 2) ; F ( 1) ti 0 0 0 ti ti ti ti Figure 4 displays the temporal pattern of the trend parameters , j = 1 2, and of both threshold parameters = + , j = 1 2. The rst trend parameter is slightly decreasing tj tj tj tj 24 -2 0 Estimates of trend components -10 -8 -6 -4 post. median 50 % cred. region 80 % cred. region 95 % cred. region 1980 1982 1984 1986 1988 1990 1992 1994 -10 -8 -6 -4 -2 0 Estimates of threshold parameters 1980 1982 1984 1986 1988 1990 1992 1994 Estimates of trends and thresholds. Dashed vertical lines represent the month January of each year. Figure 4: 25 while the second remains constant over the whole period. A distinct seasonal pattern can be seen with higher probabilities of positive response in spring and negative response in fall. However, rm{speci c deviations from this pattern are substantial as can be seen from Figure 5. Here, posterior median estimates of the rst and second rm{speci c parameter 1 and 2 are plotted against each other for all 51 rms. Interestingly, these two parameters bi 1 0 -1 -2 second threshold 2 bi -2 -1 0 1 2 first threshold Figure 5: Plot of the estimates of b 1 against b 2 for each unit. i i are often highly negatively correlated. The estimated dispersion matrix of the random eect distribution is 1 0 0 78 ; 0 28 CA c = B@ ;0 28 0 23 and the estimated correlation, based on posterior samples of the corresponding functional of : : : : D 26 , is ;0 67. Both estimates are posterior median estimates. It seems that some rms are more conservative in their answers and often choose "no change" for the response variable, relative to the overall frequencies. Such rms have negative values for 1 and positive values for 2. Other rms avoid the category \no change" and answer often more extremely with "decrease" or "increase". For these rms 1 is positive and 2 negative. D : bi bi bi bi The estimated patterns of time{dependent covariate eects (Figure 6) show an interesting temporal pattern, in particular the eect of the dummy + (Figure 7), which stands for expected improved business conditions, relative to ;: A distinct low can be seen end at the of 1991, when the German economy was shaken by a recession. In 1982 a new government under the leadership of chancellor Helmut Kohl was established. From that time onwards the eect increases until 1989/1990 with some additional variation and can be interpreted as a growing trust in the government. G G The peak in 1989/1990 coincidences with the German reuni cation, which was expected to have a catalytic eect on the economy due to the sudden opening of the market in former east Germany. In the years 1986, 90 and 94, parliament elections were held in fall. In these years the eect is always decreasing towards the end of the year, which may be due to the uncertainty regarding the election results. 4 Generalized additive and varying coecient models Let us now turn to a cross{sectional regression situation where observations ( 1 ), =1 on a response and a vector ( 1 ) of covariates are given. Generalized additive models (Hastie and Tibshirani, 1990) assume that, given = ( 1 ), the distribution of belongs to an exponential family with mean = ( j ) linked to an additive predictor by yi xi : : : xip i :::n y x : : : xp xi yi i xi : : : xip E yi xi i i = ( ) h i i = 1 ( 1) + f xi ::: + ( ) (35) fp xip : Here 1 are unknown, smooth functions of continuous covariates 1 . If some covariates are assumed to have a linear eect on the predictor, then semiparametric or f : : : fp x : : : xp 27 Point estimates 1 2 3 4 5 P+ P= G+ G= A+ A= 1980 1982 1984 1986 1988 1990 1992 1994 Figure 6: Estimated time{changing covariate eects. Dashed vertical lines represent the month January of each year. generalized partial linear models like i = f1(xi1) + : : : + pxip (36) are appropriate modi cations of (35). In (36), xip could also be a binary or categorical covariate. In the following, we will focus on model (35). Nonparametric estimation of the functions f1 : : : fp can be based on the penalized likelihood criterion p n (37) PL(f1 : : : fp) = li(yiji) ; j (fj (u))2du ! fmax :::fp X Z X j =1 i=1 00 1 with individual likelihood contributions li from yijxi. The maximizing functions are cubic smoothing splines f1 : : : fp. Other types of penalty terms may also be used, replacing for example second derivatives by m{th order derivatives fj(m)(u) or using discretized penalty terms. Computation is usually carried out by back tting algorithms, see Hastie and Tibshirani (1990) or Fahrmeir and Tutz (1994, ch. 5). As a drawback, construction of con dence bands relies on conjectures of approximate normality of penalized likelihood estimators, b b 28 2.5 3.0 3.5 4.0 4.5 5.0 post. median 50 % cred. region 1980 1982 1984 1986 1988 1990 1992 1994 Figure 7: Estimated time{changing covariate eect of G+. Dashed vertical lines represent the month January of each year. with a rigorous proof still missing. Also, data{driven choice of smoothing parameters can be problematic. Bayesian inference in generalized additive models, as outlined in the sequel, uses ideas from dynamic models for time series data. Basically, time is replaced by metrical covariates, with dierent covariates j , = 1 , corresponding to dierent time scales j . x j :::p t For each covariate, let tj 1 < : : : < tjs < : : : < tjTj Tj denote the strictly ordered, dierent values of observations smoothness priors for the unknown values ( ) f tj 1 ( ) < : : : < f tjs ( < : : : < f tjTj n xij i = 1 :::n . Bayesian ) can now be de ned by adapting random walk models (5), (12) for non{equally spaced time{ points or continuous{time priors like (17) to the present situation. Setting js = j ( js ), and f 29 t js = tj s ; tjs ;1 , rst and second order random walk priors are given by js and ; 1+ ;1 js ! js ; ;1 + js ;1 = ujs js js js ujs js ;1 js (0 N ;2 = js ujs 2 j uj s ) (0 N kjs 2 j ) (38) with mutually independent errors . Priors for Bayesian cubic spline{type smoothing, corresponding to the penalized log{likelihood (37) are given by the stochastic dierential equations 2 ( ) () = 1 (39) 2 ujs d fj s dWj s j ds ds s with mutually independent standard Wiener processes, ditions for = ( ( ) 0 ( )) fj s fj s js tj ( ) = 0, and diuse initial con- Wj tj 1 : In complete analogy to Section 3.1, the priors (39) can be written in state space form (24) and some hyperpriors are assigned to . j The likelihood p( j ), the priors p( ), p( ) and as a consequence, the posterior p( j ) have the same structure as in Section (3.2). Therefore single{ or block{move schemes as outlined there can be used to simulate from the posterior. Details and some generalizations are given in Lang (1996) for random walk priors and Biller and Fahrmeir (1997) for stochastic dierential equation priors. y y As an application, we consider the credit{scoring problem described in Fahrmeir & Tutz (1996, ch. 2.1). In credit business banks are interested in estimating the risk that consumers will pay back their credits as agreed upon by contract or not. The aim of credit{scoring is to model or predict the probability that a client with certain covariates (\risk factors") is to be considered as a potential risk. The data set consists of 1000 consumers's credits from a South German bank. The response variable of interest is \creditability", which is given in dichotomous form ( = 0 for creditworthy, = 1 for not creditworthy). In addition, 20 covariates that are assumed to inuence creditability were collected. As in Fahrmeir and Tutz, we will use a subset of these data, containing only the following covariates, which are y y 30 partly metrical and partly categorical: x1 x3 x4 x5 x6 x8 running account, trichotomous with categories \no running account" (= 1), \good running account" (= 2), \medium running account" (\less than 200 DM" = 3 = reference category) duration of credit in months, metrical amount of credit in DM, metrical payment of previous credits, dichotomous with categories \good", \bad" (=reference category) intended use, dichotomous with categories \private" or \professional" (=reference category) marital status, with reference category \living alone". A parametric logit model for the probability P( = 1j ) of being not creditworthy leads to the somewhat surprising conclusion that the covariate \amount of credit" has no signi cant inuence on the risk. Here, we reanalyze the data with a partial linear logit model y log 1 ;P(P(= =1j 1j) ) = y x y x 0 + 1 1x1 + 2 2x1 x + 3( 3 ) + 4( 4 ) + f x f x 5x5 + 6x6 + 8 x8: Here 11 and 21 are dummies for the categories \good" and \medium" running account, respectively. The predictor has semiparametric or partial linear form: The smooth functions 3 ( 3), 4 ( 4) of the metrical covariates \duration of credit" and \amount of credit", are estimated by usual cubic splines and by Bayesian spline{type smoothing using second order random walk models (38) for non{equally spaced observations. The constant 0 and the eects 1 8 of the remaining categorical covariates are considered as xed for penalized likelihood estimation and estimated jointly with the curves 3 and 4. For Bayesian estimation, diuse priors are chosen for 0 1 2 5 6 8, and additional M{H{steps with random walk proposals are included for MCMC simulation. x f x x f x ::: f f Figure 8 shows the estimates for the curves 3 and 4. Again, the posterior mean of the spline{type smoother and the posterior mode or penalized likelihood estimator (full line) are not far away from each other. While the eect of the variable \duration of credit" is not f 31 f too far away from linearity, the eect of \amount of credit" is clearly nonlinear. The curve has bathtub shape and indicates that not only high credits but also low credits increase the risk, compared to \medium" credits between 3000{6000 DM. Apparently, if the inuence is misspeci ed by assuming a linear function 4 4 instead of 4( 4), the estimated eect b4 will be near zero, corresponding to an almost horizontal line b4 4 near zero, and falsely considered as nonsigni cant. Table 1 gives the posterior means together with 80% credible intervals. and maximum likelihood estimates of the remaining eects. Both estimates are in good agreement. x f x x posterior mean 0.662 {1.468 {1.085 {0.442 {0.578 1 x1 2 x1 x5 x6 x8 80% CI ML estimator {0.004 1.335 0.633 {2.243 {0.733 {1.324 {2.051 {0.135 {0.998 {1.035 0.209 {0.440 {1.180 0.016 {0.516 Table 1: Estimates of constant parameters in the credit{scoring data. Finally we note, that the whole approach can be extended to varying coe cient models (Hastie and Tibshirani, 1993), where the predictor has the form i = 1( 1 ) 1 + f xi zi ::: + ( ) (40) fp xip zip with 1 as further \factors". For the special case 1 = = = 1, (40) reduces to a generalized additive model (35). If 1 are further covariates, possibly including some components of , then a term ( ) can be interpreted as an interaction term between and , or ( ) can be considered as an eect of , varying over the \eect{ modi er" ( ). For , i.e. if covariate is time , ( ) is a time{varying eect, and for 1 = = = the linear predictor has the same form as in dynamic models for time series or longitudinal data. zi : : : zip zi ::: z : : : zp x x z fj xij fj xj x ::: xp fj xij zij xj z t xj t 32 t fj t zip 1 0 -1 -3 -2 Bayes GAM 80% credible region 20 40 60 duration of credit in months -1 0 1 2 Bayes GAM 80% credible region 0 5000 10000 15000 amount of credit in DM Figure 8: Estimated functions of the covariate \duration of credit" and \amount of credit". 33 5 Conclusions In this chapter we showed that dynamic models with Bayesian smoothness priors and semiparametric models based on the roughness penalty approach provide supplementary ways for nonparametric function estimation. Semiparametric Bayesian smoothing has some attractive features: It provides a natural framework for Bayesian analysis beyond posterior mode or MAP estimation and recent advances in MCMC techniques allow to estimate posterior means, medians, quantiles and other functionals of regression functions or other parameters. No approximation, based on conjectures of asymptotic normality, have to be made. Bayesian data{driven choice of smoothing parameters is automatically incorporated in the model. Due to the hierarchical model formulation and modular estimation techniques, the Bayesian approach oers much exibility in modifying or extending methods to other situations, for example to dynamic mixed models for longitudinal data (Section 3.3), to generalized additive and varying coe cient models (Section 4) or to data with missing values, an issue not treated here. To some extent, of course, one has to pay for these advantages: MCMC techniques produce a rich output, but computation times can also be quite high. Metropolis{Hastings algorithms provide a wide variety of possibilities for updating steps, but convergence and mixing of the so constructed Markov chain has also to be checked empirically. Careful convergence diagnostics deserve much attention in particular applications. Above all, the choice of reasonable priors on the unknown functions remains subjective and may not be easily accepted. Semiparametric models based on the roughness penalty approach are useful supplementary tools for data analysis: Roughness penalties corresponding to smoothness priors can be interpreted without any underlying Bayesian framework. Thus, if the roughness penalty looks reasonable it supports the choice of the smoothness prior. As we have shown, the penalized likelihood estimator can always be interpreted as a corresponding posterior mode estimator from a Bayesian point of view. Computation is done by numerically e cient solutions of a nonlinear maximization problem, relying on commonly accepted and well{ understood optimization routines. As we demonstrated by examples, the posterior mode is 34 often quite near to posterior means or medians and, therefore, can be quite useful to check convergence of MCMC simulations. We focused on non{Gaussian models for times series, longitudinal and regression data within the set up of generalized linear models with a prespeci ed link functions of known parametric form, as for example the logistic or the exponential functions. Tis restriction could be relaxed by de ning a generalized parametric family of link functions (as for example in Stukel, 1988, Czado, 1992) and estimating unknown parameters in the link function jointly with unknowns in the predictor. A non{parametric Bayesian approach, avoiding any parametric speci cation of a link function, has been proposed by Arjas and Gasbarra (1994) and Arjas and Liu (1996) in the related context of hazard regression. Generally, we believe that in situations with many covariates exible non{ or semiparametric modelling and exploration of the predictor is more important compared to nonparametric choice of the link function while retaining linear parametric predictors. For Gaussian models, Bayesian analysis of regression splines with adaptive knot selection has been recently proposed by Smith and Kohn (1994), Smith, Wong and Kohn (1996) and Denison, Mallick and Smith (1996). It would be interesting to adapt these methods for non{Gaussian regression models. Extensions to other data structures are possible by choosing other observation models and smoothness assumptions. In particular, event history analysis and spatial statistics are a wide and promising eld of research, e.g. Fahrmeir & Knorr{Held (1997), Arjas and Liu (1996), Arjas and Heikkinen (1996) and Besag, York and Mollie (1991). Also, problems of model diagnostics and model choice have to dealt with convincingly. Here again, Bayesian and non{Bayesian data analyses could complement one another in a productive way. References !1] Arjas, E. & Gasbarra, D. (1994). Nonparametric Bayesian inference from right censored survival data, using the Gibbs sampler. Statistica Sinica, 4, 505{524. !2] Arjas, E. & Heikkinen, J. (1996). An algorithm for nonparametric Bayesian estimation 35 of a Poisson intensity. Computional Statistics, to appear. !3] Arjas, E. & Liu, L. (1996). A Non{parametric Bayesian Approach to Hazard Regression: A Case Study with a Large Number of Covariate Values. Statistics in Medicine, 15, 1757{1770. !4] Besag, J. E., Green, P. J., Higdon, D. & Mengersen, K. (1995). Bayesian computation and stochastic systems (with discussion). Statistical Science, 10, 3{66. !5] Besag, J. E., York, J. & Mollie, A. (1991). Bayesian image restoration with two applications in spatial statistics (with discussion). Annals of the Institute of Statistical Mathematics, 43, 1{59. !6] Biller, C. (1993). Modelle mit zufalligen Eekten fur kategoriale Langschnittdaten. Diplomarbeit, Universitat Munchen. !7] Biller, C. & Fahrmeir, L. (1997). Bayesian Spline{type Smoothing in Generalized Regression Models, to appear in Computational Statistics. !8] Breslow, N. E. & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9{25. !9] Breslow, N. E. & Lin, X. (1995). Bias correction in generalised linear mixed models with a single component of dispersion. Biometrika, 82, 81{91. !10] Carlin, B. P., Polson, N. G. & Stoer, D. S. (1992). A Monte Carlo approach to nonnormal and nonlinear state-space-modeling. Journal of the American Statistical Association, 87, 493{500. !11] Carter, C. K. & Kohn, R. (1994). On Gibbs sampling for state space models. Biometrika, 81, 541{553. !12] Carter, C. K. & Kohn, R. (1996). Robust Bayesian nonparametric regression. In: Statistical Theory and Computational Aspects of Smoothing (W. Hardle & M. G. Schimek, eds.). Heidelberg: Physica{Verlag, 128{148. 36 !13] Czado, C. (1992). On Link Selection in Generalized Linear Models. In: Advances in GLIM and Statistical Modelling (L. Fahrmeir, B. Francis, R. Gilchrist & G. Tutz, eds.). Heidelberg: Springer, 60{65. !14] de Jong, P. (1991). The diuse Kalman lter. The Annals of Statistics, 19, 1073{1083. !15] Denison, D. G. T., Mallick, K. B. & Smith, A. F. M. (1996). Automatic Bayesian curve tting. Unpublished Manuscript. !16] Fahrmeir, L. (1992). Posterior mode estimation by extended Kalman ltering for multivariate dynamic generalized linear models. Journal of the American Statistical Association, 87, 501{509. !17] Fahrmeir, L., Hennevogl, W. & Klemme, K. (1992). Smoothing in Dynamic Generalized Linear Models by Gibbs Sampling. In: Advances in GLIM and Statistical Modelling (L. Fahrmeir, B. Francis, R. Gilchrist & G. Tutz, eds.). Heidelberg: Springer, 85{90. !18] Fahrmeir, L. & Knorr{Held, L. (1997). Dynamic discrete time duration models, Sociological Methodology 1997, to appear. !19] Fahrmeir, L. & Tutz, G. (1994). Multivariate Statistical Modelling Based on Generalized Linear Models. New York: Springer{Verlag. !20] Fahrmeir, L. & Wagenpfeil, S. (1996). Penalized Likelihood Estimation and Iterative Kalman Smoothing for Non{Gaussian Dynamic Regression Models. Computational Statistics and Data Analysis, to appear. !21] Fruhwirth{Schnatter, S. (1994). Data augmentation and dynamic linear models. Journal of Time Series Analysis, 15, 183{202. !22] Gelfand, A. E. & Smith, A. F. M. (1990). Sampling{Based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association, 85, 398-409. !23] Green, P. J. & Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models. London: Chapman & Hall. 37 !24] Harrison, P. J. & Stevens, C. F. (1976). Bayesian Forecasting (with Discussion). Journal of the Royal Statistical Society B, 38, 205{247. !25] Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge: University Press. !26] Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97{109. !27] Hastie, T. & Tibshirani, R. (1990). Generalized Additive Models. London: Chapman & Hall. !28] Hastie, T. & Tibshirani, R. (1993). Varying coe cient models. Journal of the Royal Statistical Society B, 55, 757{796. !29] Kashiwagi, & Yanagimoto (1992). Smoothing Serial Count Data Through a State{ Space Model. Biometrics, 48, 1187{1194. !30] Kitagawa, G. (1987). Non{Gaussian State{Space Modelling of Non{stationary Time Series (with discussion). Journal of the American Statistical Association, 82, 1032{ 1063. !31] Knorr{Held, L. (1996). Hierarchical Modelling of Discrete Longitudinal Data: Applications of Markov Chain Monte Carlo. Ph.D. dissertation, Universitat Munchen. !32] Kunstler, R. (1996). Robuste Zustandsraummodelle. Ph.D. dissertation, Universitat Munchen. !33] Kohn, R. & Ansley, C. (1987). A New Algorithm for Spline Smoothing Based on Smoothing a Stochastic Process. SIAM Journal Scientic Statistical Computing, 8, 33{48. !34] Kohn, R. & Ansley, C. (1988). Equivalence between Bayesian smoothing priors and optimal smoothing for function estimation. In Bayesian Analysis of Time Series and Dynamic Models (J. C. Spall, eds.). New York: Dekker, 393{430. 38 !35] Lang, S. (1996). Bayesianische Inferenz in Modellen mit variierenden Koe zienten. Diplomarbeit, Universitat Munchen. !36] Martin, R. D. & Raftery (1987). Robustness, Computation, and Non{Euclidean Models (Comment). Journal of the American Statistical Association, 82, 1044{1050. !37] Masreliez, C. J. (1975). Approximate Non{Gaussian Filtering with Linear State and Observation Relations. IEEE Transactions on Automatic Control, AC{20, 107{110. !38] Masreliez, C. J. & Martin, R. D. (1977). Robust Bayesian Estimation for the Linear Model and Robustifying the Kalman Filter. IEEE Transactions on Automatic Control, AC{22, 361{371. !39] Shephard, N. (1994). Partial non{Gaussian state space. Biometrika, 81, 115{131. !40] Shephard, N. & Pitt, M. K. (1995). Parameter{driven exponential family models. Unpublished manuscript, Nu eld College, Oxford, UK. !41] Smith, M. & Kohn, R. (1994). Nonparametric Regression using Bayesian Variable Selection. Journal of Econometrics, to appear. !42] Smith, M., Wong, C.{M. & Kohn, R. (1996). Additive Nonparametric Regression for Time Series. Preprint, Australian Graduate School of Management. !43] Stukel, T. A. (1988). Generalized Logistic Models. Journal of the American Statistical Association, 83, 426{431. !44] Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). The Annals of Statistics, 22, 1701{1762. !45] O'Sullivan, F., Yandell, B. & Raynor, W. (1986). Automatic Smoothing of Regression Functions in Generalized Linear Models. Journal of the American Statistical Association, 81, 96{103. !46] van der Linde, A. (1995). Splines from a Bayesian point of view. Test, 4, 63{81. !47] van der Linde, A. (1996). 39 !48] Wahba, G. (1978). Improper priors, spline smoothing and the problem of guarding against model errors in regressions. Journal of the Royal Statistical Society B, 40, 364{372. !49] Wahba, G. Wang, Y., Gu, C., Klein, R. & Klein, B. (1995). Smoothing spline ANOVA for exponential families, with application to the Wisconsin epidemiological study of diabetic retinopathy. The Annals of Statistics, 23, 1865{1896. !50] Wagenpfeil, S. (1996). Dynamische Modelle zur Ereignisanalyse. Ph.D. dissertation, Universitat Munchen. !51] West, M. & Harrison, P. J. (1989). Bayesian Forecasting and Dynamic Models. New York: Springer{Verlag. !52] Whittaker, E. T. (1923). On a New Method of Graduation. Proc. Edinborough Math. Assoc., 78, 81{89. 40

Log In

Dynamic and Semiparametric Models

Related papers

Related papers

Related topics