Longitudinal Data Analysis, Panel Data Analysis

CHRISTIANE GRILL
University of Vienna, Austria

Longitudinal or panel data is a special type of pooled data which consists of a cross-section of units (e.g., countries, firms, households, individuals) for which there exist repeated observations over time. Consequently, observations in panel data involve at least two dimensions: a cross-sectional dimension and a time-series dimension. Panel data may be generated by pooling time-series observations across units. Longitudinal or panel data analysis refers to the statistical analysis of such datasets. In general, methods associated with the terms panel or longitudinal analysis focus on short panels, for which the number of observed units (N) is large and the number of repeated observations over time (T) is small. In contrast, methods under the umbrella of time-series cross-section analysis focus on long panels, for which N is rather small compared to a relatively large T.

Examples

Longitudinal or panel data has become widely available to empirical researchers. Well-known examples of U.S. panel data are the Panel Study of Income Dynamics (PSID) and the National Longitudinal Surveys of Labor Market Experience (NLS). The PSID, conducted by the University of Michigan, collects annual economic information from a representative national sample of about 6,000 U.S. families and 15,000 individuals. Its datasets contain over 5,000 variables. The NLS contains five separate longitudinal databases covering distinct segments of the labor force. Its measured variables focus on the supply side of the labor market.

The most well-known socioeconomic panels in Europe are, among others, the German Socio-Economic Panel (GSOEP), the British Household Panel Survey (BHPS), and the Dutch Socio-Economic Panel. On a European level, the EU statistics on income and living conditions (EU-SILC) collect data on income distribution and social inclusion in the European Union (EU).
The EU-SILC nowadays contains longitudinal data on topics such as poverty, housing, education, or health from all EU member states as well as Iceland, Norway, Switzerland, and Turkey.

Panel designs are also prominent in the field of electoral studies. For instance, the American National Election Studies (ANES), established in 1948, conducts national surveys of voters in the United States before and after every presidential election. Its international counterparts are, for example, the British Election Study (BES), the Dutch Parliamentary Election Study (DPES), and the Brazilian Electoral Panel Study (BEPS). Besides the field of political communication, panel designs are on the rise in various research areas of communication science (e.g., children, adolescents and media use, environmental communication, health communication, and public relations).

The International Encyclopedia of Communication Research Methods. Jörg Matthes (General Editor), Christine S. Davis and Robert F. Potter (Associate Editors). © 2017 John Wiley & Sons, Inc. Published 2017 by John Wiley & Sons, Inc. DOI: 10.1002/9781118901731.iecrm0134

Benefits and limitations of panel data

Some of the benefits and limitations of panel data for statistical analysis include the inference of causal propositions, the ability to control for heterogeneity, and the existence of heteroscedasticity and serial correlation (Frees, 2004).

One of the key advantages of panel data is that such data provides the opportunity to thoroughly analyze causal propositions. Cross-sectional data allows observations of covariances and therefore does not, strictly speaking, allow drawing conclusions about causality. Panel data, in contrast, allows analyzing whether a change in an input precedes a change in the outcome. In other words, panel data allows observing shifts in responses as reactions to an input.
For instance, the analysis of cross-sectional data might reveal a significantly positive relation between media exposure and being politically informed. However, the analysis does not provide evidence on the cause-and-effect relationship. In contrast, the analysis of longitudinal or panel data might reveal that increased media exposure causes heightened levels of being politically informed.

Another benefit of using panel data relates to the fact that its datasets are by nature much larger, since the data consists of multiple observations on the same units over time. The large number of data points increases the degrees of freedom and results in more variability and less collinearity among the measured variables than in cross-sectional designs. Overall, these characteristics improve the efficiency of estimates and thus allow more accurate inferences about model parameters. For example, turnout in national elections and public support for the government to be elected may be highly correlated in annual time-series observations for a given country. By stacking or pooling these observations across different countries, the variation in the data is increased and collinearity is reduced. As a result, researchers obtain more reliable model estimates and are able to test more sophisticated behavioral models using less restrictive assumptions.

Another advantage of panel data is the possibility to control for individual heterogeneity. In many datasets, subjects (i.e., units) are unlike one another, that is, they are heterogeneous. In cross-sectional regression analysis, models ascribe the uniqueness of subjects to a disturbance term. In contrast, longitudinal data allows modeling this uniqueness. Due to the large number of observations in panel data, researchers are able to incorporate subject-specific parameters and hence are able to control for the heterogeneity of individuals. Not controlling for these unobserved individual specific effects would result in biased estimates.
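The bias from ignoring unit effects can be illustrated with a small simulation. The sketch below is not part of the original entry: the function name, the data-generating process, and all parameter values are assumptions chosen for illustration. It generates a panel in which an unobserved, time-constant unit effect is correlated with the regressor, then compares a pooled OLS slope with the slope obtained after differencing each unit's observations over time, which is one simple way to remove a time-constant effect.

```python
import numpy as np

def pooled_and_differenced_slopes(n_units=1000, n_waves=4, beta=1.0, seed=0):
    """Simulate a panel and return (pooled OLS slope, first-difference slope)."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=(n_units, 1))                # unobserved unit effect (hypothetical)
    x = mu + rng.normal(size=(n_units, n_waves))      # regressor correlated with mu
    y = beta * x + mu + rng.normal(size=(n_units, n_waves))

    # Pooled OLS: the unit effect is left in the error term.
    xc, yc = x - x.mean(), y - y.mean()
    pooled = float((xc * yc).sum() / (xc ** 2).sum())

    # Differencing within each unit removes the time-constant mu.
    dx, dy = np.diff(x, axis=1), np.diff(y, axis=1)
    differenced = float((dx * dy).sum() / (dx ** 2).sum())
    return pooled, differenced

pooled, differenced = pooled_and_differenced_slopes()
print(f"pooled OLS slope: {pooled:.2f}, first-difference slope: {differenced:.2f}")
```

Under these assumed settings the pooled slope should settle near beta + Cov(x, mu) / Var(x) = 1.5, while the differenced slope should stay close to the true value of 1.0.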
For example, suppose children's media use is regressed on various individual attributes, such as peer interactions, family structures, gender, race, and so on. The error term may still include unobserved individual characteristics, such as family lifestyle, which are correlated with some of the regressors, such as family structure. By using panel data, one can difference the data over time and eliminate the unobserved individual specific effect of family lifestyle, as long as that effect is constant over time.

Moreover, panel datasets are better suited to study complex issues of dynamic behavior. They can, therefore, be utilized to study dynamics of change with the help of more complicated behavioral models and hypotheses. In doing so, panel data analysis uncovers dynamic relationships between the dependent and independent variables. For example, with cross-section data one can estimate the turnout rate in elections at a particular point in time. Repeated cross-sections show how this proportion changes over time. But only panel data allows estimating what proportion of those who voted in one election also voted in another election.

Limitations of panel data encompass problems in the design, data collection, and data management. Specifically, these include problems of coverage (incomplete account of the units of interest), measurement errors (caused by unclear questions, memory errors, deliberate distortion, or interviewer effects), nonresponse (due to the lack of cooperation among units), recall, or frequency of interviewing. In particular, with panel data, a key concern is that it may not be possible to observe each unit at every wave. When this nonobservance is unrelated to any characteristics of the units, the missing observations are missing completely at random (MCAR).
In this case, data could be analyzed by complete-case analysis (analyzing only cases for which all waves are observed) or available-data analysis (i.e., methods which do not require response vectors of equal length).

A more serious cause of missing data in panels is attrition (i.e., dropout, panel mortality). Whereas data missing due to censoring are nearly always MCAR, data missing due to dropout may not fit this criterion. The failure to re-interview the units of interest may result in a selection bias if the attrition is correlated with substantively relevant characteristics. For instance, the dropout in a panel survey on environmental behavior might be related to an individual's disinterest in environmental protection. If the data are missing at random, imputation methods may yield unbiased estimates of model quantities. An alternative is multiple random imputation, which models the probability of missingness and matches missing observations with observed ones that have similar probabilities of being missing.

Another strategy, which indirectly remedies the problem of panel attrition, is to refresh the sample by adding new observations toward the end of the study. These new observations are called a refreshment sample. This sample allows adjusting for panel effects. Another alternative is to use rotating panels. In rotating panel designs, a part of the sample is replaced at each subsequent point in time. In doing so, rotating designs reduce respondent burden and also provide an opportunity to refresh the sample with units that better reflect the targeted units of interest.

Another pitfall of panel data is heteroscedasticity. One of the most important assumptions of classical linear regression models is that the variance of each disturbance term is some constant number. This is the assumption of homoscedasticity or equal variance.
But if the unmodeled variance differs from one individual to the next, heteroscedasticity is present in the panel data. Heteroscedasticity does not result in biased estimators, but these estimators no longer have minimum variance. This problem can be addressed by utilizing a generalized least squares (GLS) estimator that allows for unique variances among individuals.

Moreover, panel data might suffer from an endogeneity bias. An endogeneity bias occurs if the mean responses vary cross-sectionally via unobserved unique means and these differences are not modeled and are therefore left in the error term. As a result, any cross-sectionally varying covariate can correlate with the error term. In other words, independent variables correlate with the error term.

Serial correlation is another drawback of panel data. Serial correlation refers to the fact that repeated observations on the same units are highly correlated and thereby violate the assumption of uncorrelated errors. These correlations might result from panel conditioning, as a unit's response is influenced by prior interviews. Serial correlations tend to be large and positive, but diminish as the time between the observations increases.

Linear longitudinal data models

Many of the longitudinal data applications that appear in the literature are based on linear model theory. Hence, this contribution is devoted to these linear longitudinal data models. However, nonlinear models represent an area of recent development. Nonlinear models refer to instances where the distribution of the response cannot be reasonably approximated using a normal curve.

While there exists consensus that longitudinal data is best suited for making causal inferences, there has not been as much consensus on the best methods for analyzing such data.
There exist many traditions for analyzing panel data. While economics and political science traditionally analyze trends and thus aim at modeling the level of the dependent variable (Y), social, behavioral, or educational scientists are often concerned with assessing individual changes and hence aim at modeling changes of the dependent variable (ΔY). Overall, there exist several estimation techniques to address one or more of the previously outlined pitfalls of panel data. The most prominent linear panel data models are (i) the fixed-effects model and (ii) the random-effects model, both of which are applied to model the level of the dependent variable. To model the change of the dependent variable, (iii) the lagged dependent variable approach and (iv) the change score method are most frequently used (Andress, Golsch, & Schmidt, 2013).

The fixed-effects model

Whenever scholars aim at modeling the level of the dependent variable (Y), the fixed- or the random-effects model is preferably applied. The terminology of fixed- and random-effects models has caused quite some misunderstanding and confusion in the past, since the terms fixed and random effects have multiple meanings. While a fixed effect relates to any model quantity estimated, a random effect relates to any parameter that is unique to the individual but can be predicted separately. In line with the classic view, a fixed-effects model treats unobserved differences between units as fixed parameters, whereas a random-effects model associates these differences with random variables. From a more modern perspective, these two approaches are distinguished by the assumptions these models make about the association between observed and unobserved variables.
While in a fixed-effects model the unobserved variables can have any association with the observed variables, in a random-effects model observed and unobserved variables are uncorrelated. As a side note, the random-effects model is considered a special case of the mixed-effects model (Allison, 2009). Both approaches, the fixed- and the random-effects model, are able to overcome unit effects and the endogeneity bias of panel data.

In general, most panel applications are based on a simple regression with an error disturbance term, such as the following model:

$$ y_{it} = x_{it}'\beta + \mu_i + v_{it}, \qquad i = 1, \dots, N; \; t = 1, \dots, T \tag{1} $$

where $y_{it}$ is the dependent variable for the $i$th individual at time $t$, $x_{it}$ is a vector of observations on $k$ explanatory variables, $\beta$ is a $k$-vector of unknown coefficients, $\mu_i$ is an unobserved individual specific effect, and $v_{it}$ is a zero-mean random error disturbance term with variance $\sigma_v^2$.

If $\mu_i$ (i.e., the unobserved individual specific effect) in Equation 1 refers to fixed parameters to be estimated, this model is called a fixed-effects (FE) model. In the modern view, an FE model treats the individual-specific effect as a random variable that is allowed to correlate with the explanatory variables. Moreover, the model assumes that the time-varying explanatory variables are not perfectly collinear and that they have non-zero within-variance.

A fixed-effects model is typically estimated with least squares dummy variables (LSDV). This approach estimates the model by ordinary least squares (OLS) and includes a dummy variable for each unit (N − 1 dummies in total) in order to estimate the time-invariant individual effects. This in turn leads to a large loss in degrees of freedom and may aggravate multicollinearity among the regressors. Although this approach is computationally simple and accounts for a known source of variance in the model specification, the unit dummies are perfectly collinear with any variable that varies only cross-sectionally, that is, with any time-invariant variable.
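The LSDV procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the entry's own implementation: the function name `lsdv`, the choice of the first unit as the reference category, and the toy data are all hypothetical.

```python
import numpy as np

def lsdv(y, x, unit_ids):
    """OLS with an intercept and N - 1 unit dummies.

    Returns (slopes on x, full coefficient vector), where the full vector is
    ordered as [intercept, slopes..., dummies for units 2..N].
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    units, codes = np.unique(np.asarray(unit_ids).ravel(), return_inverse=True)
    dummies = np.eye(len(units))[codes][:, 1:]     # first unit is the reference
    design = np.column_stack([np.ones(len(y)), x, dummies])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    k = x.shape[1]
    return coef[1:1 + k], coef

# Toy panel (hypothetical): 3 units, 3 waves, no noise, true slope 2.0.
unit_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
x = np.array([1.0, 2.0, 3.0, 2.0, 4.0, 5.0, 0.0, 1.0, 4.0])
unit_effects = np.array([5.0, -3.0, 10.0])
y = 2.0 * x + unit_effects[unit_ids]

slopes, coef = lsdv(y, x, unit_ids)
print(slopes, coef)
```

Because the toy data contain no noise, the fit is exact: the slope equals the generating value, the intercept equals the reference unit's effect, and each dummy coefficient is that unit's effect relative to the reference unit.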
Consequently, the LSDV approach excludes any time-invariant covariates from the model. Moreover, if the number of observed units is large (especially relative to the number of waves), estimating an LSDV model is inefficient. In this case, an alternative fixed-effects estimator to LSDV is the within estimator. In this approach, the dependent variable and all covariates enter the model as deviations from the unit's mean of the respective variable. In doing so, the within estimator avoids estimating unique intercepts for each unit. Most importantly, the within estimator produces the same coefficient estimates as the LSDV approach does. However, the within estimator is not able to estimate the effects of any time-invariant variables such as gender, race, or religion. These variables are eliminated in both the LSDV and the within estimator approach. Consequently, the main disadvantage of the fixed-effects model is that it cannot accommodate time-invariant covariates (Hsiao, 2003).

The random-effects model

If $\mu_i$ (i.e., the unobserved individual specific effect) in Equation 1 relates to independent random variables with zero mean and a constant variance $\sigma_\mu^2$, this model is called a random-effects (RE) model. A random-effects model assumes that the individual specific effect is a random disturbance at the individual level that is uncorrelated with the explanatory variables. Furthermore, this model assumes that the regressors have a non-zero variance. More importantly, the random-effects model allows the inclusion of time-invariant covariates in the model specification, which is the most apparent difference between the fixed- and random-effects model. The random-effects model can be estimated by generalized least squares (GLS), which can be implemented as a least squares regression on suitably transformed data.
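One common way to write this GLS estimator is as OLS on quasi-demeaned data. The expression below is a standard textbook result rather than a formula from the entry itself. It assumes a balanced panel with $T$ waves and treats the variance components $\sigma_\mu^2$ and $\sigma_v^2$ as known, although in practice they have to be estimated. The symbol $\theta$ and the unit means $\bar{y}_i$, $\bar{x}_i$, $\bar{v}_i$ are notation introduced here for illustration:

$$ y_{it} - \theta \bar{y}_i = (x_{it} - \theta \bar{x}_i)'\beta + (1 - \theta)\mu_i + (v_{it} - \theta \bar{v}_i), \qquad \theta = 1 - \sqrt{\frac{\sigma_v^2}{\sigma_v^2 + T\sigma_\mu^2}} $$

For $\theta = 0$ (no unit-level variance) the transformation leaves the data unchanged and the estimator reduces to pooled OLS. As $\theta$ approaches 1, the transformation approaches the full demeaning used by the within estimator.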
The random-effects model is characterized by a compound symmetry covariance structure and specifically recognizes that repeated observations covary. Therefore, the model includes a term that forces all repeated measures to correlate at a constant level with each other. Because it does not estimate a separate parameter for each unit and allows time-invariant covariates, the random-effects model avoids the inefficiency problem of LSDV. Consequently, the random-effects model is the most practical option for short panels as well as for any models for which the effects of time-invariant covariates are to be estimated. On the downside, this model assumes that unit effects are independent of covariates. If the unit effects do correlate with any covariate, the estimates of the random-effects model are biased (Hsiao, 2003).

The lagged dependent variable approach

Since panel data provides the ability to accommodate temporal trends, researchers are frequently interested in studying changes of the dependent variable (ΔY). The longitudinal datasets used to that end are typically characterized by a relatively large number of observed units compared to a relatively small number of observations over time. In order to model the changes of the dependent variable and to draw causal inferences about dynamics, the lagged dependent variable approach and the change score method are widely used.

The underlying idea of the lagged dependent variable approach, or regressor variable method, is that while controlling for the dependent variable Y at a prior point in time ($y_{it-1}$), one or more independent variables ($x_{it}$) cause a change in the dependent variable Y at a subsequent point in time ($y_{it}$). As already outlined, serial correlation represents a major pitfall of panel data. One solution to overcome this problem is the inclusion of such a lagged term of the dependent variable as a covariate. This term accounts for serial correlation and makes the remaining errors independent. Specifically, in this approach $y_{it}$ is regressed on both $x_{it}$ and $y_{it-1}$.
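The regression just described, $y_{it}$ on $x_{it}$ and $y_{it-1}$, can be sketched as follows. This is an illustrative pooled OLS version under stated assumptions: a single regressor, a balanced panel stored as units-by-waves arrays, and no unit-specific effects. The function name and the toy data are hypothetical.

```python
import numpy as np

def lagged_dv_regression(y, x):
    """Pooled OLS of y_t on an intercept, x_t, and y_{t-1}.

    y, x: arrays of shape (n_units, n_waves).
    Returns (intercept, coefficient on x, coefficient on lagged y).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    y_now = y[:, 1:].ravel()       # the first wave has no lag and is dropped
    y_lag = y[:, :-1].ravel()
    x_now = x[:, 1:].ravel()
    design = np.column_stack([np.ones_like(y_now), x_now, y_lag])
    coef, *_ = np.linalg.lstsq(design, y_now, rcond=None)
    return tuple(float(c) for c in coef)

# Toy panel (hypothetical): 5 units, 4 waves, generated without noise from
# y_t = 1.0 + 2.0 * x_t + 0.5 * y_{t-1}.
rng = np.random.default_rng(1)
x = rng.normal(size=(5, 4))
y = np.empty((5, 4))
y[:, 0] = rng.normal(size=5)
for t in range(1, 4):
    y[:, t] = 1.0 + 2.0 * x[:, t] + 0.5 * y[:, t - 1]

print(lagged_dv_regression(y, x))
```

With noise-free toy data the three generating values are recovered up to floating-point error. Note that the first wave only serves as a lagged predictor, so a panel with T waves contributes T − 1 observations per unit.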
A regression model of this kind, which might include one or more lagged values of the dependent variable among its explanatory variables, is also called an autoregressive model or dynamic model and takes the form:

$$ y_{it} = \alpha + x_{it}'\beta + \gamma y_{it-1} + v_{it} \tag{2} $$

Specifically, this regression (Equation 2) models the time path of the dependent variable in relation to its past value(s). The lagged dependent variable approach provides appropriate measures for studying causality in longitudinal designs. The obtained parameters are interpreted in terms of predicting change. It is an appealing approach for modeling dynamics from a practical viewpoint. Whereas some scholars argue that a lagged response most effectively accounts for unit effects and serial correlations, other scholars argue that this approach might yield an endogeneity bias if the lagged term does not eliminate all serial correlations (Finkel, 1995). Although the approach can be applied with only two waves of data, the usage of a lagged dependent variable costs one wave of data, which means that the first wave of observations cannot be modeled by this approach. Consequently, the lagged dependent variable approach is usually used for studying dynamics in long panels.

The change score method

Another option to model changes in the dependent variable is the change score method or difference score method. This approach utilizes the change or difference score between two observations over time as the dependent variable. Specifically, in this method $y_{it} - y_{it-1}$ is regressed on the independent variables, in the form of:

$$ y_{it} - y_{it-1} = \alpha + x_{it}'\beta + v_{it} \tag{3} $$

There exist two major objections against the use of change scores, which therefore favor the lagged dependent variable approach. Firstly, change scores tend to be much less reliable measures than the individual variables.
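This first objection can be made concrete with the classical test theory expression for the reliability of a difference score. The formula is a standard result and is not taken from the entry itself. It assumes that the two measurements have equal variances, and the notation is introduced here for illustration: $\rho_{11}$ and $\rho_{22}$ are the reliabilities of the two measurements, $\rho_{12}$ is their correlation, and $\rho_D$ is the reliability of the difference.

$$ \rho_D = \frac{\tfrac{1}{2}(\rho_{11} + \rho_{22}) - \rho_{12}}{1 - \rho_{12}} $$

For example, with $\rho_{11} = \rho_{22} = 0.8$ and $\rho_{12} = 0.6$, the difference score has a reliability of $(0.8 - 0.6) / (1 - 0.6) = 0.5$, which is clearly lower than that of either measurement.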
If the measures of the dependent variable at the various points in time have a reliability of less than 1.00, and thus do not perfectly assess the measurement concept, the reliability and the validity of the difference score are lower than those of the separate scores. Secondly, change scores are frequently negatively correlated with the score of the dependent variable at the earlier point in time, $y_{it-1}$. This negative correlation is often substantial. Consequently, if there exists a relation between an independent variable and the difference score, it remains unclear whether the change of the independent variable has caused this relation or whether this relation reflects the relationship of the independent variable with the dependent variable at $t-1$. One solution might be to incorporate $y_{it-1}$ into the regression (Equation 3) as a control variable so that the relation between the difference score and the independent variable is adjusted for confounded effects. Interestingly, this approach is rarely applied or discussed in the literature (Dalecki & Willits, 1991).

Recommendations for the analysis of panel data

Although panel data offer many advantages for studying causal propositions, the power of panel or longitudinal data analysis largely depends on the compatibility of the assumptions of the respective statistical models with the generated data. Choosing an analytical method whose assumptions do not fit the data might result in misleading inferences.

Whether the fixed- or random-effects model is better suited for modeling the level of the dependent variable in longitudinal data depends on the assumption researchers make about the correlation between the individual specific error and the regressors. If one assumes that there exists no correlation between the error term and the regressors, the random-effects model is the appropriate choice.
In contrast, if a correlation between the error term and the regressors is assumed, then the fixed-effects model is better suited for the analysis. Bearing this fundamental difference in mind, the following observations might provide additional guidelines: If T is large and N is small, fixed-effects models might be preferable. If N is large and T is small and the units are considered to be random drawings, random-effects models might be preferable (Andress et al., 2013).

The best-known test for deciding whether to use a fixed-effects or a random-effects model is the Hausman test. This test aims at detecting whether the unit effects are indeed uncorrelated with any input variables. The respective null hypothesis postulates that the unobserved individual specific effects do not correlate with the independent variables. The basic idea is that since the fixed-effects transformation eliminates the unobserved individual specific effects from the model specification, the fixed-effects estimator is consistent regardless of whether there exist correlations between the specific effects and the input variables. If the null hypothesis is true, the random-effects estimator is consistent and efficient. If the null hypothesis is false, the random-effects estimator is inconsistent, and the fixed-effects estimator is to be preferred.

Fixed-effects models are frequently applied in randomized experiments since these models increase efficiency and reduce bias (Allison, 2009). As a rule of thumb, these models are preferably used to make inferences about the sample, whereas random-effects models are generally applied to draw conclusions about the larger population (Allison, 1994).

In order to model the change of the dependent variable and to assess dynamics, the lagged dependent variable approach and the change score method are strongly recommended. The lagged dependent variable approach is frequently used in experimental research.
This approach is able to remedy possible imbalances of randomization procedures when the assignment of subjects to treatment categories has resulted in groups that are significantly different regarding the dependent variable of interest. Even though there exist major objections to the use of change scores, this method might turn out to be superior to the lagged dependent variable approach if the independent variable is temporally subsequent to the dependent variable and uncorrelated with the transient component of the dependent variable (Allison, 1990).

SEE ALSO: Panel Research Methods; Regression Analysis, Linear; Time-Series Analysis

References

Allison, P. D. (1990). Change scores as dependent variables in regression analysis. Sociological Methodology, 20(1), 93–114. doi:10.2307/271083
Allison, P. D. (1994). Using panel data to estimate the effects of events. Sociological Methods & Research, 23(2), 174–199. doi:10.1177/0049124194023002002
Allison, P. D. (2009). Fixed effects regression models. Thousand Oaks, CA: SAGE.
Andress, H. J., Golsch, K., & Schmidt, A. W. (2013). Applied panel data analysis for economic and social surveys. Berlin/Heidelberg: Springer.
Dalecki, M., & Willits, F. K. (1991). Examining change using regression analysis: Three approaches compared. Sociological Spectrum, 11(2), 127–145. doi:10.1080/02732173.1991.9981960
Finkel, S. E. (1995). Causal analysis with panel data. Thousand Oaks, CA: SAGE.
Frees, E. W. (2004). Longitudinal and panel data: Analysis and applications in the social sciences. New York: Cambridge University Press.
Hsiao, C. (2003). Analysis of panel data (2nd ed.). Cambridge, UK/New York: Cambridge University Press.

Further reading

Beck, N., & Katz, J. N. (1995). What to do (and not to do) with time-series cross-section data. American Political Science Review, 89(3), 634–647. doi:10.2139/ssrn.1658640
Beck, N., & Katz, J. N. (2011). Modeling dynamics in time-series-cross-section political economy data. Annual Review of Political Science, 14, 331–352. doi:10.1146/annurev-polisci-071510-103222
Gillespie, D. F., & Streeter, C. L. (1994). Fitting regression models to research questions for analyzing change in nonexperimental research. Social Work Research, 18(4), 239–245. doi:10.1093/swr/18.4.239
Gujarati, D. N. (2003). Basic econometrics (4th ed.). Boston: McGraw-Hill.
Hamaker, E. L., Kuiper, R. M., & Grasman, R. P. (2015). A critique of the cross-lagged panel model. Psychological Methods, 20(1), 102–116. doi:10.1037/a0038889
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. Cambridge, MA/London: MIT Press.

Christiane Grill is a researcher at the Department of Communication at the University of Vienna. Her focus of research is on political offline and online communication with a particular emphasis on EU politics and EU elections. Moreover, her research is dedicated to media reception and its effects on public opinion. Dr. Grill is also interested in the development of empirical methods in social sciences and within this realm published the paper “Clarifying and Expanding the Use of Confirmatory Factor Analysis in Journalism and Mass Communication Research” together with Lance Holbert in Journalism & Mass Communication Quarterly in 2015.