A note on panel data methods

data science
econometrics
Statisticians and econometricians often don’t speak the same language.
Author

Tim Cosemans

Published

6 April 2025

Defining Panel Data

Panel, or longitudinal, data has become widely available to empirical researchers across economics. Many countries conduct periodic national surveys, such as the British Household Panel Survey or the National Longitudinal Survey in the United States. In addition, commercial organizations offer a wide array of longitudinal datasets for research, such as Nielsen’s Consumer Panel. The units of analysis (individuals, firms, etc.) are observed not only on different variables (as in cross-sectional data), but also over time. Key to this type of data is that the observations, i.e., the individual data points, are no longer independent, since they belong to the same unit of analysis. The most notable benefit of this type of data lies in this within-unit variance, which adds information on top of the traditional between-unit variance observed in cross-sectional data (Capitaine, Genuer, and Thiébaut 2021).

Capitaine, Louis, Robin Genuer, and Rodolphe Thiébaut. 2021. ‘Random Forests for High-Dimensional Longitudinal Data’. Statistical Methods in Medical Research 30 (1): 166–84.

Panel or longitudinal data is therefore a special case of clustered data, where data points belong to a cluster that induces correlation between them. A different type of clustered data is hierarchical data, where data points are connected not because they belong to the same unit of analysis, but because the unit of analysis belongs to some overarching group. For example, test scores of students have a hierarchical dimension, since the unit of analysis, the student, belongs to a group, the class. All grades from students of the same class can be expected to show some correlation due to characteristics of the class, such as the teacher or the class size.
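As a minimal illustration of the panel case (a sketch with made-up numbers), a panel dataset in R has repeated rows per unit:

```r
# A toy panel: each unit (id) is observed in several periods (year),
# so rows sharing an id are not independent
panel <- data.frame(
  id     = rep(1:3, each = 3),         # unit of analysis
  year   = rep(2020:2022, times = 3),  # time dimension
  income = c(30, 32, 35, 50, 49, 53, 41, 44, 46)
)
```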

Analyzing Panel Data

Both econometrics and statistics have a large array of methods to analyze panel data, but the terminology used to denote them can often be conflicting and confusing (Gelman 2005). In general, we can define a longitudinal model with individual specific intercepts as follows:

Gelman, Andrew. 2005. ‘Why I Don’t Use the Term “Fixed and Random Effects”’. Statistical Modeling, Causal Inference, and Social Science (blog). https://statmodeling.stat.columbia.edu/2005/01/25/why_i_dont_use/.

\[ Y_{it} = f(X_{it}) + c_i + e_{it} \]

where:

  • \(f(X_{it})\) is an unknown function
  • \(Y_{it}\) represents the outcome
  • \(e_{it}\) is an idiosyncratic error term, assumed to be standard normally distributed
  • \(c_i\) is an unobserved, time-constant factor specific to individual i
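To make the notation concrete, here is a minimal simulation of this model in R (a sketch with arbitrary parameter values, assuming a linear \(f\)):

```r
set.seed(1)
n_id <- 100  # number of individuals
n_t  <- 5    # number of time periods per individual

id     <- rep(1:n_id, each = n_t)   # individual index i
period <- rep(1:n_t, times = n_id)  # time index t
x      <- rnorm(n_id * n_t)         # time-varying covariate X_it
c_i    <- rnorm(n_id)[id]           # unobserved, time-constant effect c_i
e      <- rnorm(n_id * n_t)         # idiosyncratic error e_it

y  <- 2 * x + c_i + e               # Y_it = f(X_it) + c_i + e_it, with f linear
df <- data.frame(id, period, x, y)
```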

In econometrics, this model is called the unobserved effects model (Wooldridge 2016). There, the discussion usually centers around whether \(c_i\) must be treated as a random variable (i.e., a random effect) or a parameter to be estimated (i.e., a fixed effect).

Over time:

  • “Random effects” has become synonymous with zero correlation between \(c_i\) and \(X_i\)
  • “Fixed effects” implies correlation between \(c_i\) and \(X_i\)

In statistics, on the other hand, ‘fixed effect’ is the common term for any parameter in a regression model that is a population average to be estimated, usually denoted by \(\beta\).

Estimation under the assumption of non-zero correlation in econometrics (“fixed effects estimation”) is more realistic, but requires removing the influence of \(c_i\). This can be done using the “within” transformation of the equation to be estimated, which is equivalent to estimating the following equation with OLS:

\[ Y_{it} - \bar{Y}_i = f(X_{it} - \bar{X}_i) + e_{it} - \bar{e}_i \]
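Using the simulated df from above, the within transformation can be carried out by hand and compared with plm’s “within” estimator (a sketch; the plm package is assumed to be installed):

```r
library(plm)

# Fixed effects ("within") estimation
fe <- plm(y ~ x, data = df, index = c("id", "period"), model = "within")

# Manual within transformation: demean y and x per individual, then OLS
df$y_dm <- df$y - ave(df$y, df$id)        # y_it - ybar_i
df$x_dm <- df$x - ave(df$x, df$id)        # x_it - xbar_i
fe_ols <- lm(y_dm ~ 0 + x_dm, data = df)  # no intercept: demeaning removes it

coef(fe)      # within estimator
coef(fe_ols)  # identical slope on x
```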

While this approach successfully removes the influence of \(c_i\), it also removes all other time-constant variables, whose effects therefore cannot be estimated. Estimation under the assumption of no correlation (“random effects estimation”) does allow for the inclusion of time-constant variables, but is often less realistic. This involves estimating

\[ Y_{it} = f(X_{it}) + v_{it} \]

where \(v_{it} = c_i + e_{it}\). Econometricians account for the serial correlation introduced in the composite error term by using generalised least squares. Similar approaches exist in statistics to model the influence of individual-specific factors; they all fall under the heading of the “generalised linear mixed model” (GLMM), which can be written as

\[ Y_{it} = f(X_{it}) + W_{it} c_i + e_{it} \]

where:

  • \(X_{it}\) are covariates with fixed effects
  • \(W_{it}\) are covariates with random effects

These last two equations are equivalent if we only include a random intercept (\(W_{it} = 1\)). What the econometrician calls unobserved effects, the statistician calls random intercepts.

Caution is needed, however. While the econometrician’s random effects model is equivalent to the statistician’s random intercept model, estimating the two (e.g., using the plm and lmer packages in R) can yield different results, as mixed models are usually estimated by maximum likelihood or restricted maximum likelihood (REML) rather than generalised least squares.

“The econometric GLS approach has closed-form analytical solutions computable by standard linear algebra and, although the latter can sometimes get computationally heavy on the machine, the expressions for the estimators are usually rather simple. ML estimation of longitudinal models, on the contrary, is based on numerical optimization of nonlinear functions without closed-form solutions and is thus dependent on approximations and convergence criteria.” (Croissant and Millo 2008)

Croissant, Yves, and Giovanni Millo. 2008. ‘Panel Data Econometrics in R: The plm Package’. Journal of Statistical Software 27 (2): 1–43.
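To see this in practice, the same random intercept model can be fitted with both packages and compared (a sketch, assuming the simulated df from above and that plm and lme4 are installed):

```r
library(plm)
library(lme4)

# Econometrician's random effects model: feasible GLS via plm
re_gls <- plm(y ~ x, data = df, index = c("id", "period"), model = "random")

# Statistician's random intercept model: (restricted) ML via lme4
re_ml <- lmer(y ~ x + (1 | id), data = df)

coef(re_gls)  # GLS estimates
fixef(re_ml)  # ML estimates: typically close, but not identical
```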

The GLMM also adds the opportunity to include random, individual-specific slopes for covariates, in addition to their general, fixed effect. This can be done by including a variable in both \(X_{it}\) and \(W_{it}\), as sketched below. These random effects represent random variation in the model that can be attributed to a group to which the observations belong. It is with these types of models that opportunities arise to model not only longitudinal but also hierarchical effects simultaneously.
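In lme4’s formula syntax, for instance, a covariate can enter both the fixed part and the random part, giving each individual its own slope (a sketch using the simulated df; since the toy data contains no true slope variation, the estimated slope variance will be close to zero):

```r
library(lme4)

# x enters both X_it (fixed part) and W_it (random part):
# a common fixed slope plus an individual-specific random deviation
rs <- lmer(y ~ x + (1 + x | id), data = df)
summary(rs)
```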

The econometrician’s worry about the violation of the “random effects” assumption still holds when working with mixed models and observational data: there might still be correlation between the error term and the explanatory variables due to other time-constant unobserved effects. Experimental data does not suffer from this type of endogeneity.

Specifically for the case of the random intercept, the most prevalent case in econometrics, we can employ a “correlated random effects” approach, which explicitly models the correlation between \(X_{it}\) and \(c_i\). This approach assumes

\[ c_i = \psi + \bar{X}_i \phi + a_i \]

where:

  • \(\bar{X}_i\) is the vector of time averages of all time-varying explanatory variables
  • \(E(a_i|X_i) = 0\)

So that we can estimate:

\[ Y_{it} = f(X_{it}) + \psi + Z_i \delta + \bar{X}_i \phi + v_{it} \]

where \(Z_i\) is a vector of observed time-constant variables. Econometricians again use GLS to deal with the serial correlation in the composite error \(v_{it}\), all while allowing for observed time-constant factors and accounting for unobserved time-constant effects that are correlated with \(X_{it}\) (Wooldridge 2016).

Wooldridge, Jeffrey. 2016. Introductory Econometrics. Cengage AU.
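A correlated random effects specification can be sketched by adding the individual time averages of the time-varying covariates to the random effects model (assuming the simulated df from above; the coefficient on x then reproduces the within estimate):

```r
library(plm)

# Time average of x per individual (the correlated random effects device)
df$x_bar <- ave(df$x, df$id)

# Random effects GLS with the time average added; observed time-constant
# variables could be included here as well
cre <- plm(y ~ x + x_bar, data = df, index = c("id", "period"),
           model = "random")
summary(cre)
```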

Final Note

In addition to the methods described above, econometrics has a separate set of methods for analyzing “dynamic” panel data, in which lags of the variables enter the model. I consider these to be outside the scope of the current discussion.