Comparing Standard Regression Modeling to Ensemble Modeling: How Data Mining Software Can Improve Economists’ Predictions

  • Article
  • Eastern Economic Journal

Abstract

Economists’ wariness of data mining may be misplaced, even in cases where economic theory provides a well-specified model for estimation. We discuss how new data mining/ensemble modeling software, for example the program TreeNet, can be used to create predictive models. We then show that, for a standard labor economics problem, the estimation of wage equations, TreeNet outperforms standard OLS regression by achieving lower prediction error. Ensemble modeling resists the tendency to overfit data. We conclude by considering additional types of economic problems that are well suited to data mining techniques.

Notes

  1. We do not consider whether the dependent variable should be transformed, but rather stay with this standard method; like regression modeling, TreeNet does not automatically test different transformations of the dependent variable.

  2. There are so many such studies that no single survey of the wage regression literature exists. See Willis [1986] for a classic survey of human capital earnings functions, and Jacobsen [2007], Chapter 7, for discussion and examples of many of these articles in the gender wage differential context.

  3. A fourth model was run in which experience and education were included as categorical variables (i.e., with as many dummies as the number of values minus one); this model did not outperform either model 2 or model 3 on either the adjusted R² or the prediction error metric, so we do not include its results.
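The dummy coding described in note 3 can be sketched as below. This is only an illustration of the "number of values minus one" rule, not the authors' code: the helper name `dummy_columns`, the choice of the smallest value as the omitted reference category, and the sample schooling values are all assumptions.

```python
def dummy_columns(values):
    """Expand a categorical column into k-1 indicator columns.

    The smallest observed value is treated as the omitted reference
    category (an assumption), so the indicators are not collinear with
    a regression intercept.
    """
    levels = sorted(set(values))
    reference, kept = levels[0], levels[1:]
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return reference, kept, rows


# Hypothetical years-of-schooling values for five people.
education = [12, 16, 12, 14, 18]
reference, kept, rows = dummy_columns(education)
```

With four distinct schooling values, each person gets three indicator columns, and anyone in the reference category has all three set to zero.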

References

  • Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014. High-dimensional Methods and Inference on Structural and Treatment Effects. Journal of Economic Perspectives, 28 (2): 29–50.

  • Blinder, Alan. 1973. Wage Discrimination: Reduced Form and Structural Estimates. Journal of Human Resources, 8 (4): 436–455.

  • Bowles, Mike. 2014. Ensemble Packages in R. Revolutions Blog (April 8), Revolution Analytics, http://blog.revolutionanalytics.com/2014/04/ensemble-packages-in-r.html.

  • Cook, Thomas D. 2014. “Big Data” in Research on Social Policy. Journal of Policy Analysis and Management, 33 (2): 544–547.

  • Friedman, Jerome H. 1999a. Stochastic Gradient Boosting. Technical Report, Dept. of Statistics, Stanford University.

  • Friedman, Jerome H. 1999b. Greedy Function Approximation: A Gradient Boosting Machine. Technical Report, Dept. of Statistics, Stanford University.

  • Jacobsen, Joyce P. 2007. The Economics of Gender, 3rd ed., Malden, Mass.: Blackwell.

  • Munnell, Alicia H., Geoffrey M. B. Tootell, Lynne E. Browne, and James McEneaney. 1996. Mortgage Lending in Boston: Interpreting HMDA Data. American Economic Review, 86 (1): 25–53.

  • Oaxaca, Ronald. 1973. Male-female Wage Differentials in Urban Labor Markets. International Economic Review, 14 (3): 693–709.

  • Salford Systems. 2001–05. TreeNet: An Exclusive Implementation of Jerome Friedman’s MART Methodology: Robust Multi-tree Technology for Data Mining, Predictive Modeling and Data Processing, http://www.salford-systems.com/.

  • Schonlau, Matthias. 2005. Boosted Regression (Boosting): An Introductory Tutorial and a Stata Plugin. The Stata Journal, 5 (3): 330–354.

  • Stock, James H. 2010. The Other Transformation in Econometric Practice: Robust Tools for Inference. Journal of Economic Perspectives, 24 (2): 83–94.

  • Varian, Hal R. 2014. Big Data: New Tricks for Econometrics. Journal of Economic Perspectives, 28 (2): 3–28.

  • Wichard, Jorg D., and Maciej Ogorzalek. 2007. Time Series Prediction with Ensemble Models Applied to the CATS Benchmark. Neurocomputing, 70 (13–15): 2371–2378.

  • Willis, Robert J. 1986. Wage Determinants: A Survey and Reinterpretation of Human Capital Earnings Functions. Handbook of Labor Economics, 1: 525–602.

Acknowledgements

The authors thank session participants at the May 2013 Eastern Economic Association meetings and an anonymous referee for helpful suggestions, and Wesleyan University for research support.

Appendix

Variables used in regressions and TreeNet

Dependent variable

Lnearn — The log real hourly wage, calculated as the log of the person’s earnings in the previous year divided by the product of hours usually worked per week and weeks worked last year.
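As a rough sketch of this definition, the calculation can be read as the log of annual earnings over annual hours. The function name and the sample figures are hypothetical, and the adjustment to real terms is left out because the price index used is not stated here.

```python
import math


def log_hourly_wage(annual_earnings, usual_hours_per_week, weeks_worked):
    """Return ln(annual earnings / (usual weekly hours * weeks worked)).

    Assumes earnings are already expressed in real terms; the deflator
    is not specified in the variable description.
    """
    annual_hours = usual_hours_per_week * weeks_worked
    if annual_earnings <= 0 or annual_hours <= 0:
        raise ValueError("earnings and hours must be positive to take a log")
    return math.log(annual_earnings / annual_hours)


# Hypothetical worker: 40,000 dollars over 40 hours * 50 weeks = 20 per hour.
lnearn = log_hourly_wage(40000.0, 40, 50)
```

Observations with zero earnings or zero hours have no defined log wage; this sketch raises an error for them, which is one possible choice rather than a documented one.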

Variables in “sparse” model

Education — A recoding from the level of schooling completed to the approximate years of schooling that would normally have been spent to achieve that level of attainment.

Experience — Potential experience, calculated as age minus years of education minus six.
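The potential experience formula is simple enough to state directly in code. The function name and the sample values are illustrative assumptions.

```python
def potential_experience(age, years_of_education):
    """Potential experience: age minus years of education minus six."""
    return age - years_of_education - 6


# Hypothetical 40-year-old with 16 years of schooling.
experience = potential_experience(40, 16)
```

The formula can produce negative values for very young people with many years of schooling (for example, age 16 with 12 years of education gives -2); how such cases were handled is not stated here, so the sketch leaves them unchanged.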

Gender — A dummy variable coded 1 for woman.

Race — A dummy variable coded 1 for anyone whose ethnicity is more than 66 percent from a white non-Hispanic background.

Marriage — A dummy variable coded 1 for a person who is legally married.

Serve — A dummy variable coded 1 for people who have ever served on active duty in the United States military.

Rural — A dummy variable based on the metropolitan status provided by the CPS, coded 1 for people whose primary living space is non-metropolitan.

South — A dummy variable coded 1 for a person living in the southern region of the United States.

Additional variables in “all variable” model

Age — An integer variable representing the age of the person.

Non-English — A dummy variable coded 1 for people who were born in countries where English is not the primary language spoken.

NotborninUS — A dummy variable coded 1 if the person was born outside the United States.

h_numper — An integer variable for the number of people living in the person’s household.

Losewks — An integer variable that represents the number of self-reported weeks that the person failed to show up for scheduled work.

ss_val — The total payment in dollars that the person receives in yearly social security payments.

csp_val — The total payment in dollars that the person receives in yearly child support payments.

vet_yn — A dummy variable for whether or not the person receives veterans payments.

fin_yn — A dummy variable for whether or not the person receives financial assistance.

oi_yn — A dummy variable for whether or not the person receives any other income.

Additional interaction and polynomial terms in “complex” model

Agesq — The square of the age variable.

Edsq — The square of the education variable.

Exsq — The square of the experience variable.

Wed — An interaction term between being a female and education.

Wedsq — The Wed variable squared.

Wex — An interaction term between being a female and experience.

Wexsq — The Wex variable squared.

(Note: Additional higher-order terms and interactions were tested and rejected.)
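The polynomial and interaction terms listed above can be built from the base variables as in the following sketch. The function name `complex_terms` and the sample values are assumptions for illustration; `female` is the Gender dummy (1 for a woman, 0 otherwise).

```python
def complex_terms(age, education, experience, female):
    """Build the squared and gender-interaction terms for one person."""
    wed = female * education    # education for women, 0 for men
    wex = female * experience   # experience for women, 0 for men
    return {
        "Agesq": age ** 2,
        "Edsq": education ** 2,
        "Exsq": experience ** 2,
        "Wed": wed,
        "Wedsq": wed ** 2,
        "Wex": wex,
        "Wexsq": wex ** 2,
    }


# Hypothetical woman aged 40 with 16 years of education and 18 of experience.
terms = complex_terms(age=40, education=16, experience=18, female=1)
```

Because the gender dummy takes only the values 0 and 1, Wedsq equals the gender dummy times education squared, and Wexsq equals the gender dummy times experience squared.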

Cite this article

Jacobsen, J., Levin, L. & Tausanovitch, Z. Comparing Standard Regression Modeling to Ensemble Modeling: How Data Mining Software Can Improve Economists’ Predictions. Eastern Econ J 42, 387–398 (2016). https://doi.org/10.1057/eej.2015.8

  • DOI: https://doi.org/10.1057/eej.2015.8
