Comparing `panelr` and `plm`

May 21, 2019 8 min read R

It has been just a little more than a day since I announced my new R package, panelr, to the wider world.

My new #rstats package, panelr, is now on CRAN. Learn more about this package that has become essential for my workflow with panel data here: https://t.co/1tlwlcA5qc
— Jacob Long (@jacobandrewlong) May 20, 2019

Quite frankly, I’ve been surprised by the level of attention! Some responses have been calls for economists to check the package out (they are certainly welcome to do so) as well as things along the lines of there finally being a package for panel data in R. So let me make something clear: There’s at least one comparable package for R, called plm, which is very good and should be particularly appealing for economists. This leads to the understandable question as to how panelr differs from plm.

Super cool! Does your package offer some advantages/differences from plm that I should think about?
— Nicholas Potter (@econpotter) May 21, 2019

The plm package has been around since 2006 and is quite good. I didn’t make panelr out of any kind of deep dissatisfaction with plm nor the idea that it needed to be superseded.

Yves Croissant and Giovanni Millo discuss in plm’s main vignette the fact that there is a great deal of overlap between econometric treatment of panel data and other statistical approaches to panel data, like multilevel modeling. They note, among, other things,

while a very comprehensive software framework for (among many other features) maximum likelihood estimation of linear regression models for longitudinal data, packages nlme (Pinheiro et al. 2007) and lme4 (Bates 2007), is available in the R environment and can be used, e.g., for estimation of random effects panel models, its use is not intuitive for a practicing econometrician, and maximum likelihood estimation is only one of the possible approaches to panel data econometrics.

and

Furthermore, we felt there was a need for automation of some basic data management tasks such as lagging, summing and, more in general, applying (in the R sense) functions to the data, which, although conceptually simple, become cumbersome and error-prone on two-dimensional data, especially in the case of unbalanced panels.

I totally agree with these things. I, however, am not an econometrician, so I came to this area with the opposite problem as Croissant and Millo. I like and respect plm a great deal, but

a package doing panel data “from the econometrician’s viewpoint”

is the opposite of what I was looking for. The way I am accustomed to thinking and talking about panel data analysis is different from the standard econometric approach, for better or worse. My training in this area comes mostly from sociologists, who are of course not ignorant of econometrics but have different preferences and norms. There is a lot of overlap, but ultimately I was motivated to fit a type of model that I would not expect to see in plm even though it is closely related. Those efforts led to panelr.

So here’s the TL;DR:

I wanted to simplify the fitting of panel models that use multilevel models for estimation, especially the kind that produces within-entity effects equivalent to econometric fixed effects models.
I subsequently wanted to streamline GEE estimation of these models.
I wanted to create a function that estimates asymmetric effects models (Allison, 2019).
The panel_data object inherits from grouped tibbles and should fit well into workflows that rely on the “tidyverse.”
I have included tools that make reshaping data from wide to long and vice versa more user-friendly.
The documentation and general approach talk about panel data in ways that are more familiar to me and people similarly trained.

Longer version:

Bell and Jones (2015) describe a model specification for panel data that uses the estimation technique econometricians would use for what they call “random effects” models but generates estimates equivalent to what econometricians call “fixed effects” models. This is achieved using multilevel (also known as mixed) models and including individual-level means of time-varying predictors in the model¹. What you get are within-entity estimates (exactly equivalent to fixed effects) along with between-entity effect estimates, which are not robust to confounding from stable predictors but nonetheless may be of substantive interest. This further enables the analyst to include other time-stable covariates, incorporate random slopes or additional grouping factors, and even move to generalized models (e.g., logit).

The equivalence of these models was first noted by Mundlak (1978). Multilevel modeling researchers have been estimating models like this for some time (e.g., Kreft, de Leeuw, & Aiken, 1995; Hofmann & Gavin, 1998; Raudenbush & Bryk, 2002). Only recently has there been wider recognition of the near-equivalence of fixed effects models and this multilevel model that I and some others refer to as the “within-between” model (Allison, 2009; Bell & Jones, 2015).

Wanting to fit these models is what got me started on the road to panelr. The transformations needed to properly fit the models were quite tedious and, to quote Croissant and Millo, I “felt there was a need for automation of some basic data management tasks such as lagging, summing and, more in general, applying (in the R sense) functions to the data, which, although conceptually simple, become cumbersome and error-prone.” With the popularity of dplyr for data manipulation, and the fact that it can make the necessary transformations much easier, I thought others would find these tools to be useful and accessible.

Later on, more models I was interested in became fairly straightforward to add to the package. GEE estimation of the within-between model may be desirable in some cases, for instance (McNeish, 2019). I wanted to fit asymmetric effects models that allow positive and negative changes in variables to differ (Allison, 2019). I was also able to implement a better method for calculating interactions in within-between models (Giesselmann & Schmidt-Catran, 2018). Get the details on these things in the introductory vignette.

In general, plm has a lot more stuff. For instance, for fixed effects models there are many different methods for calculating standard errors included rather painlessly with plm. In the mutlilevel modeling framework for panelr’s wbm() function, the multilevel model inherently deals with the within-entity correlation of errors, but you are limited if you would like different kinds of adjustments (GEE estimation offers some more flexibility). plm also includes many tests of various model assumptions, like the Hausman test (which can be replicated on a per-coefficient basis in the within-between model). plm has many tools for including instrumental variables, but panelr has none and I don’t foresee any being added in the near future.

Overall, I am not motivated to duplicate features of plm just for the sake of feature parity. There are some models which are nearly or exactly equivalent across the two packages, but this is just a happy coincidence. I will expand panelr as is prudent, which may sometimes involve duplication of plm functionality but only for reasons relating to substantive differences in implementation. As an example, panelr includes the function are_varying() that allows the user to assess whether variables vary over time. It is substantively equivalent to plm’s pvar(), but I was motivated to give users a means for asking for specific variables using a “tidy” selection interface. Although I would not generally promise that my packages are highly performant, I later realized that are_varying() is much faster than pvar(), which can be quite noticeable for larger datasets.

This is just to say that there may be cases in which panelr does something very similar, and it may sometimes be better or different in a way that you would prefer. But I generally see panelr as a complementary package, filling in some gaps and giving an alternative way to do panel data analysis in R.

References

Allison, P. D. (2009). Fixed effects regression models. Thousand Oaks, CA: SAGE Publications. https://doi.org/10.4135/9781412993869.d33

Allison, P. D. (2019). Asymmetric fixed-effects models for panel data. Socius, 5, 1–12. https://doi.org/10.1177/2378023119826441

Bell, A., & Jones, K. (2015). Explaining fixed effects: Random effects modeling of time-series cross-sectional and panel data. Political Science Research and Methods, 3, 133–153. https://doi.org/10.1017/psrm.2014.7

Giesselmann, M., & Schmidt-Catran, A. (2018). Interactions in fixed effects regression models (Discussion Papers of DIW Berlin No. 1748). DIW Berlin, German Institute for Economic Research. Retrieved from https://ideas.repec.org/p/diw/diwwpp/dp1748.html

Hofmann, D. A., & Gavin, M. B. (1998). Centering decisions in hierarchical linear models: Implications for research in organizations. Journal of Management, 24, 623–641. https://doi.org/10.1177/014920639802400504

Kreft, I. G. G., de Leeuw, J., & Aiken, L. S. (1995). The effect of different forms of centering in hierarchical linear models. Multivariate Behavioral Research, 30, 1–21. https://doi.org/10.1207/s15327906mbr3001_1

McNeish, D. (2019). Effect partitioning in cross-sectionally clustered data without multilevel models. Multivariate Behavioral Research, 1–20. https://doi.org/10.1080/00273171.2019.1602504

Mundlak, Y. (1978). On the pooling of time series and cross section data. Econometrica, 46, 69–85. https://doi.org/10.2307/1913646

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed). Thousand Oaks, CA: Sage.

Conventionally, one also subtracts the individual means from the occasion-level predictor values, as one often does to estimate fixed effects models via OLS. This is not strictly necessary, though. ↩︎

panel data panelr R statistics plm

Comparing `panelr` and `plm`

References

Jacob A. Long

Assistant Professor of Mass Communications