Abstract
Linear regressions that use aggregated values from a group variable such as a school or a neighborhood are commonplace in the social sciences. This paper uses Monte Carlo methods to demonstrate that aggregated variables produce spurious relationships with other dependent and independent variables in a model even when there are no underlying relationships among those variables. The size of the spurious relationships (or postulated effects) increases as the number of observations per group decreases. Although this problem is remedied by including the individual-level variable in the regression, the problem has not been discussed in the methodological literature. Accordingly, studies using aggregate variables must be interpreted with caution if the individual-level measurements are not available.
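The mechanism described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' simulation design; it is a minimal example under stated assumptions: individuals are assigned to groups at random, the outcome A depends only on the individual-level variable I (the slope `beta_i` and the group sizes are arbitrary illustrative values), and the aggregate S is the group mean of I. By construction there is no group-level effect, so any association between S and A is spurious.

```python
import random
from math import sqrt


def cov(x, y):
    """Sample covariance of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def one_replication(rng, n_groups, group_size, beta_i):
    """Simulate one data set with NO true group-level effect."""
    i_vals, s_vals, a_vals = [], [], []
    for _ in range(n_groups):
        members = [rng.gauss(0, 1) for _ in range(group_size)]
        group_mean = sum(members) / group_size  # aggregate variable S
        for i in members:
            i_vals.append(i)
            s_vals.append(group_mean)
            # Outcome depends on the individual value only (assumption).
            a_vals.append(beta_i * i + rng.gauss(0, 1))
    v_s, v_i, v_a = cov(s_vals, s_vals), cov(i_vals, i_vals), cov(a_vals, a_vals)
    c_sa, c_ia, c_si = cov(s_vals, a_vals), cov(i_vals, a_vals), cov(s_vals, i_vals)
    corr_sa = c_sa / sqrt(v_s * v_a)
    slope_s_alone = c_sa / v_s  # OLS slope of A on S only
    # OLS slope of S when I is also in the regression (two-predictor formula).
    slope_s_given_i = (c_sa * v_i - c_ia * c_si) / (v_s * v_i - c_si ** 2)
    return corr_sa, slope_s_alone, slope_s_given_i


def simulate(n_groups, group_size, n_reps=200, beta_i=0.5, seed=0):
    """Average the three statistics over n_reps simulated data sets."""
    rng = random.Random(seed)
    results = [one_replication(rng, n_groups, group_size, beta_i)
               for _ in range(n_reps)]
    return tuple(sum(col) / n_reps for col in zip(*results))


if __name__ == "__main__":
    for size in (5, 30):
        corr, alone, with_i = simulate(n_groups=50, group_size=size)
        print(f"group size {size}: corr(S, A)={corr:.3f}, "
              f"slope of S alone={alone:.3f}, slope of S given I={with_i:.3f}")
```

Under these assumptions the average correlation between S and A should be larger for the smaller group size, and the coefficient on S should fall to approximately zero once I enters the regression, which mirrors the pattern the abstract describes.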
Notes
The shape of the actual distributions is unknown, but assuming they are normal, the 100,000 samples would generate an extremely small standard error, so a correlation as small as .01 would be significant at the .05 level.
Simulations were also run for 60 and 80 schools, but the results differed only slightly from the 50-school case.
In the dataset, the variable was named "metasum".
It is understood that the coefficients for S and P in model (4) can differ from those in model (3), even though the same β symbols are used.
The full set of simulation correlations that go into model (4) is available from the authors.
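The first note's claim about significance can be checked with a back-of-the-envelope calculation. The note does not spell out its computation, so the sketch below makes an assumption: it treats 100,000 as the number of independent observations behind a single correlation, and it approximates the t distribution by the standard normal, which is reasonable at this sample size. The function name is illustrative.

```python
from math import erfc, sqrt


def correlation_significance(r, n):
    """Return the t statistic and an approximate two-sided p-value for a
    Pearson correlation r from n independent observations.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2) and a normal approximation
    to the t distribution (an assumption that suits very large n).
    """
    t = r * sqrt(n - 2) / sqrt(1 - r * r)
    p = erfc(abs(t) / sqrt(2))  # two-sided standard-normal tail area
    return t, p


if __name__ == "__main__":
    t, p = correlation_significance(0.01, 100_000)
    print(f"t = {t:.2f}, two-sided p = {p:.4f}")
```

With r = .01 and n = 100,000 the statistic comes out at about 3.16, above the 1.96 threshold for the .05 level. If the 100,000 instead refers to the number of replications being averaged, the standard error would be smaller still, so this reading is the more conservative one.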
Technical appendix: on multilevel modeling
Multilevel modeling does not affect the analyses presented here because the within-school (or within-group) estimates barely vary from school to school. This is demonstrated in Table 8 using Model 1(b) for the case of 10 schools and 30 students per school (N = 300) from the USA PISA data. The table shows the within-school expected values of the standard deviations of school SES (S), achievement (A), and individual SES (I), as well as the covariance between A and I. The correlation was computed from the corresponding covariance and standard deviations. There is almost no variation among the within-school estimates of these values, where the sample size for each school is 30 students. These within-cell values were estimated at a later time than the original simulations for Table 2, so the correlation for the full sample of 300 cases (.409) is slightly smaller than the original correlation of .456.
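The kind of within-school summary described here can be sketched on synthetic data. The example below does not use the PISA data and is not the authors' procedure. It assumes students are randomly assigned to 10 schools of 30, draws A and I as standard normal variables with a hypothetical population correlation `rho` = 0.4, averages each school's statistics over repeated draws to approximate expected values, and then computes the correlation as the covariance divided by the product of the two standard deviations.

```python
import random
from math import sqrt
from statistics import mean, stdev


def within_school_stats(a, i):
    """Standard deviations of A and I and their covariance for one school."""
    mean_a, mean_i = mean(a), mean(i)
    cov_ai = sum((x - mean_a) * (y - mean_i) for x, y in zip(a, i)) / (len(a) - 1)
    return stdev(a), stdev(i), cov_ai


def expected_within_school(n_schools=10, n_students=30, rho=0.4,
                           n_reps=500, seed=0):
    """Average each school's statistics over n_reps synthetic samples.

    Returns one (sd_A, sd_I, cov_AI, corr_AI) tuple per school.
    """
    rng = random.Random(seed)
    totals = [[0.0, 0.0, 0.0] for _ in range(n_schools)]
    for _ in range(n_reps):
        for school in range(n_schools):
            i = [rng.gauss(0, 1) for _ in range(n_students)]
            # A correlates with I at rho; there is no school effect (assumption).
            a = [rho * x + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in i]
            for k, value in enumerate(within_school_stats(a, i)):
                totals[school][k] += value / n_reps
    rows = []
    for sd_a, sd_i, cov_ai in totals:
        # Correlation from the averaged covariance and standard deviations.
        rows.append((sd_a, sd_i, cov_ai, cov_ai / (sd_a * sd_i)))
    return rows


if __name__ == "__main__":
    for school, row in enumerate(expected_within_school(), start=1):
        print(school, " ".join(f"{value:.3f}" for value in row))
```

Under random assignment the per-school rows should be nearly identical, which is the pattern the appendix reports for its within-school estimates.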
Cite this article
Armor, D.J., Cotla, C.R. & Stratmann, T. Spurious relationships arising from aggregate variables in linear regression. Qual Quant 51, 1359–1379 (2017). https://doi.org/10.1007/s11135-016-0335-0
DOI: https://doi.org/10.1007/s11135-016-0335-0