Spurious relationships arising from aggregate variables in linear regression

Quality & Quantity

Abstract

Linear regressions that use aggregated values from a group variable such as a school or a neighborhood are commonplace in the social sciences. This paper uses Monte Carlo methods to demonstrate that aggregated variables produce spurious relationships with other dependent and independent variables in a model even when there are no underlying relationships among those variables. The size of the spurious relationships (or postulated effects) increases as the number of observations per group decreases. Although this problem is remedied by including the individual-level variable in the regression, the problem has not been discussed in the methodological literature. Accordingly, studies using aggregate variables must be interpreted with caution if the individual-level measurements are not available.
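The mechanism described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' simulation design: the function names, the standard-normal data, the slope `beta`, and the group sizes are all assumptions chosen for illustration. Units are assigned to groups at random, the outcome depends only on the individual-level variable, and the aggregate is the group mean of that variable. The sketch reports the average correlation between the aggregate and the outcome, and the average coefficient on the aggregate once the individual-level variable is also in the regression.

```python
import random
from statistics import fmean


def cov(x, y):
    """Population covariance of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    return fmean([(a - mx) * (b - my) for a, b in zip(x, y)])


def replicate(rng, n_groups, n_per_group, beta):
    """One simulated dataset with no true group-level effect.

    Returns the correlation between the aggregate and the outcome, and the
    OLS slope on the aggregate when the individual variable is controlled.
    Requires n_per_group >= 2, otherwise the two predictors are identical.
    """
    ind, agg, out = [], [], []
    for _ in range(n_groups):
        group = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_mean = fmean(group)  # aggregate includes the unit's own value
        for x in group:
            ind.append(x)
            agg.append(group_mean)
            # Outcome depends on the individual value only (assumed model).
            out.append(beta * x + rng.gauss(0.0, 1.0))

    v_i, v_s, v_a = cov(ind, ind), cov(agg, agg), cov(out, out)
    c_is, c_ia, c_sa = cov(ind, agg), cov(ind, out), cov(agg, out)

    r_sa = c_sa / (v_s * v_a) ** 0.5
    # Closed-form two-predictor OLS slope for the aggregate.
    b_s = (c_sa * v_i - c_ia * c_is) / (v_i * v_s - c_is ** 2)
    return r_sa, b_s


def simulate(n_groups=50, n_per_group=5, beta=0.5, n_reps=100, seed=1):
    """Average both quantities over n_reps simulated datasets."""
    rng = random.Random(seed)
    results = [replicate(rng, n_groups, n_per_group, beta) for _ in range(n_reps)]
    return fmean([r for r, _ in results]), fmean([b for _, b in results])


if __name__ == "__main__":
    for size in (5, 50):
        r_sa, b_s = simulate(n_per_group=size)
        print(f"n per group={size}: mean corr(aggregate, outcome)={r_sa:.3f}, "
              f"mean slope on aggregate with individual control={b_s:.3f}")
```

Under these assumptions the aggregate correlates with the outcome only because each group mean contains the unit's own value, so the correlation shrinks as groups grow, and the controlled slope averages close to zero.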

Notes

  1. The shapes of the actual distributions are unknown, but assuming they are normal, the 100,000 samples would generate an extremely small standard error, so that a correlation as small as .01 would be significant at the .05 level.

  2. Simulations were also run for 60 and 80 schools, but results differed only slightly from the 50 school case.

  3. In the dataset, the variable was named "metasum".

  4. The coefficients for S and P in model (4) can differ from those in model (3), even though the same β symbols are used.

  5. The full set of simulation correlations that go into model (4) is available from the authors.

Author information

Corresponding author

Correspondence to David J. Armor.

Technical appendix: on multilevel modeling

Multilevel modeling does not affect the analyses presented here because of a lack of within-school (or within-group) variation. This is demonstrated in Table 8 using Model 1b for the case of 10 schools and 30 students per school (N = 300) from the U.S. PISA data. The table shows the within-school expected values of the standard deviations of school SES (S), achievement (A), and individual SES (I), as well as the covariance between A and I. The correlation was computed from the corresponding covariance and standard deviations. There is almost no variation among the within-school estimates of these values, where the sample size for each school is 30 students. These within-school values were estimated at a later time than the original simulations for Table 2, so the correlation for the full sample of 300 cases (.409) is slightly smaller than the original correlation of .456.
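The appendix does not spell out how the correlation was computed from the covariance and standard deviations. It is presumably the standard product-moment formula, sketched here with s_A and s_I denoting the within-school standard deviations of A and I (notation assumed for illustration, not taken from the article):

```latex
r_{AI} = \frac{\operatorname{cov}(A, I)}{s_A \, s_I}
```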

Table 8 Within-school estimates of parameter variation for Model 1b (case of 10 schools & 30 students per school)

Cite this article

Armor, D.J., Cotla, C.R. & Stratmann, T. Spurious relationships arising from aggregate variables in linear regression. Qual Quant 51, 1359–1379 (2017). https://doi.org/10.1007/s11135-016-0335-0

  • DOI: https://doi.org/10.1007/s11135-016-0335-0
