Comparing Standard Regression Modeling to Ensemble Modeling: How Data Mining Software Can Improve Economists’ Predictions

Jacobsen, Joyce P; Levin, Laurence M; Tausanovitch, Zachary

doi:10.1057/eej.2015.8

Published: 30 March 2015

Eastern Economic Journal, Volume 42, pages 387–398 (2016)

Joyce P Jacobsen¹,
Laurence M Levin² &
Zachary Tausanovitch³

Abstract

Economists’ wariness of data mining may be misplaced, even in cases where economic theory provides a well-specified model for estimation. We discuss how new data mining/ensemble modeling software, for example the program TreeNet, can be used to create predictive models. We then show how for a standard labor economics problem, the estimation of wage equations, TreeNet outperforms standard OLS regression in terms of lower prediction error. Ensemble modeling resists the tendency to overfit data. We conclude by considering additional types of economic problems that are well-suited to use of data mining techniques.
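The comparison described in the abstract can be illustrated with a small sketch. It is not the authors' procedure and does not use TreeNet, which is proprietary; instead it hand-rolls a minimal stochastic gradient boosting of one-split trees in the spirit of Friedman [1999a, 1999b]. The simulated data, the function names, and the settings (200 rounds, a learning rate of 0.1, 50 percent subsampling) are all assumptions made for illustration. The OLS line deliberately omits the squared term, so the example shows only how an ensemble can pick up a nonlinearity that the analyst did not specify.

```python
import random


def fit_ols(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_stump(xs, rs):
    """Best one-split regression tree on a single feature (squared error)."""
    pairs = sorted(zip(xs, rs))
    n = len(pairs)
    total = sum(r for _, r in pairs)
    best = (float("-inf"), pairs[0][0], total / n, total / n)
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        right_sum = total - left_sum
        # Maximizing this gain is equivalent to minimizing the split's SSE.
        gain = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if gain > best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (gain, threshold, left_sum / i, right_sum / (n - i))
    return best[1:]  # (threshold, left_value, right_value)


def stump_value(stump, x):
    threshold, left, right = stump
    return left if x <= threshold else right


def fit_boosting(xs, ys, rounds=200, rate=0.1, subsample=0.5, seed=0):
    """Each round fits a stump to current residuals on a random subsample."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    k = max(2, int(subsample * len(xs)))
    for _ in range(rounds):
        idx = rng.sample(range(len(xs)), k)
        stump = fit_stump([xs[i] for i in idx], [ys[i] - preds[i] for i in idx])
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + rate * stump_value(stump, x) for p, x in zip(preds, xs)]
    return base, rate, stumps


def predict_boosting(model, x):
    base, rate, stumps = model
    return base + rate * sum(stump_value(s, x) for s in stumps)


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def simulate(n, rng):
    """Hypothetical concave log-wage profile in experience, plus noise."""
    xs = [rng.uniform(0, 40) for _ in range(n)]
    ys = [2.0 + 0.08 * x - 0.002 * x * x + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


rng = random.Random(42)
train_x, train_y = simulate(400, rng)
test_x, test_y = simulate(200, rng)  # held out; never used in fitting

intercept, slope = fit_ols(train_x, train_y)
model = fit_boosting(train_x, train_y)

ols_mse = mse([intercept + slope * x for x in test_x], test_y)
gbm_mse = mse([predict_boosting(model, x) for x in test_x], test_y)
print(f"OLS (linear only) test MSE: {ols_mse:.4f}")
print(f"Boosted stumps test MSE:    {gbm_mse:.4f}")
```

Prediction error is measured on observations held out from fitting, which is what makes the comparison informative about overfitting. Shrinkage and subsampling are the two devices in this sketch that slow the ensemble's fit to the training data.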

Notes

We do not consider whether the dependent variable should be transformed but rather stay with this standard method; like regression modeling, TreeNet does not automatically test different transformations of the dependent variable.
Such studies are so numerous that no single survey covers the wage regression literature. See Willis [1986] for a classic survey of human capital earnings functions and Jacobsen [2007], Chapter 7, for discussion and examples of many of these articles in the gender wage differential context.
A fourth model was run where experience and education were included as categorical values (i.e., as many dummies as the number of values minus one); this model did not outperform either model 2 or model 3 by either the adjusted R² or the prediction error metrics, so we do not include results from it.
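The categorical coding mentioned in the last note (as many dummies as the number of values minus one) can be sketched as follows. The function name and the choice of the lowest level as the omitted reference category are assumptions for illustration; the source does not say which category was omitted.

```python
def dummy_columns(values):
    """Encode a variable as (number of distinct values - 1) dummy columns."""
    levels = sorted(set(values))
    kept = levels[1:]  # assumption: the lowest level is the reference category
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# Hypothetical years-of-education values for four people.
kept, rows = dummy_columns([12, 16, 12, 18])
print(kept)
print(rows)
```

Dropping one level avoids perfect collinearity with the regression's constant term.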

References

Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014. High-dimensional Methods and Inference on Structural and Treatment Effects. Journal of Economic Perspectives, 28 (2): 29–50.
Blinder, Alan. 1973. Wage Discrimination: Reduced Form and Structural Estimates. Journal of Human Resources, 8 (4): 436–455.
Bowles, Mike. 2014. Ensemble Packages in R. Revolutions Blog (April 8), Revolution Analytics, http://blog.revolutionanalytics.com/2014/04/ensemble-packages-in-r.html.
Cook, Thomas D. 2014. “Big Data” in Research on Social Policy. Journal of Policy Analysis and Management, 33 (2): 544–547.
Friedman, Jerome H. 1999a. Stochastic Gradient Boosting. Technical Report, Dept. of Statistics, Stanford University.
Friedman, Jerome H. 1999b. Greedy Function Approximation: A Gradient Boosting Machine. Technical Report, Dept. of Statistics, Stanford University.
Jacobsen, Joyce P. 2007. The Economics of Gender, 3rd ed., Malden, Mass.: Blackwell.
Munnell, Alicia H., Geoffrey M. B. Tootell, Lynne E. Browne, and James McEneaney. 1996. Mortgage Lending in Boston: Interpreting HMDA Data. American Economic Review, 86 (1): 25–53.
Oaxaca, Ronald. 1973. Male-female Wage Differentials in Urban Labor Markets. International Economic Review, 14 (3): 693–709.
Salford Systems. 2001–05. TreeNet: An Exclusive Implementation of Jerome Friedman’s MART Methodology: Robust Multi-tree Technology for Data Mining, Predictive Modeling and Data Processing, http://www.salford-systems.com/.
Schonlau, Matthias. 2005. Boosted Regression (Boosting): An Introductory Tutorial and a Stata Plugin. The Stata Journal, 5 (3): 330–354.
Stock, James H. 2010. The Other Transformation in Econometric Practice: Robust Tools for Inference. Journal of Economic Perspectives, 24 (2): 83–94.
Varian, Hal R. 2014. Big Data: New Tricks for Econometrics. Journal of Economic Perspectives, 28 (2): 3–28.
Wichard, Jorg D., and Maciej Ogorzalek. 2007. Time Series Prediction with Ensemble Models Applied to the CATS Benchmark. Neurocomputing, 70 (13–15): 2371–78.
Willis, Robert J. 1986. Wage Determinants: A Survey and Reinterpretation of Human Capital Earnings Functions. Handbook of Labor Economics, 1: 525–602.

Acknowledgements

The authors thank session participants at the May 2013 Eastern Economic Association meetings and an anonymous referee for helpful suggestions, and Wesleyan University for research support.

Author information

Authors and Affiliations

Economics Department, Public Affairs Center, Wesleyan University, 238 Church Street, Middletown, CT 06459, USA
Joyce P Jacobsen
VISA Inc., 800 Metro Center Boulevard, Foster City, CA 94404
Laurence M Levin
Network for Teaching Entrepreneurship, 120 Wall Street, 18th floor, New York, NY 10005
Zachary Tausanovitch

Appendix

Variables used in regressions and TreeNet

Dependent variable

Lnearn — The log real wage, calculated as the log of the person’s earnings from the previous year divided by total hours worked (hours usually worked times weeks worked last year).
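As a minimal sketch of this calculation: the function and argument names below are hypothetical, "hours usually worked" is assumed to be a weekly figure, and returning None for non-positive earnings or hours is an illustrative choice rather than the authors' documented rule. Any deflation of earnings to real terms is assumed to have been done beforehand.

```python
import math


def log_hourly_wage(annual_earnings, usual_weekly_hours, weeks_worked):
    """Log of last year's earnings per hour worked, or None if undefined."""
    hours = usual_weekly_hours * weeks_worked
    if annual_earnings <= 0 or hours <= 0:
        return None  # assumption: treat as missing rather than take log of <= 0
    return math.log(annual_earnings / hours)


# Hypothetical worker: 40,000 in earnings, 40 hours a week, 50 weeks.
print(log_hourly_wage(40000, 40, 50))
```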

Variables in “sparse” model

Education — A recoding from the level of schooling completed to the approximate years of schooling that would normally have been spent to achieve that level of attainment.

Experience — Potential experience, calculated as age minus years of education minus six.

Gender — A dummy variable coded 1 for woman.

Race — A dummy variable coded 1 for anyone whose ethnicity is more than 66 percent from a white non-Hispanic background.

Marriage — A dummy variable coded 1 for a person who is legally married.

Serve — A dummy variable coded 1 for people who have ever served on active duty in the United States military.

Rural — A dummy variable based on the metropolitan status provided by the CPS, coded 1 for people whose primary living space is non-metropolitan.

South — A dummy variable coded 1 for a person living in the southern region of the United States.
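The Experience definition above reduces to a one-line formula. The sketch below uses a hypothetical function name and applies the formula exactly as stated, with no floor at zero, since the source does not mention one.

```python
def potential_experience(age, education_years):
    """Potential experience: age minus years of education minus six."""
    return age - education_years - 6


# Hypothetical 30-year-old with 16 years of schooling.
print(potential_experience(30, 16))
```

The constant six stands for the years before schooling normally begins, so the measure counts years since leaving school rather than years actually worked.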

Additional variables in “all variable” model

Age — An integer variable representing the age of the person.

Non-English — A dummy variable coded 1 for people who were born in countries where English is not the primary language spoken.

NotborninUS — A dummy variable coded 1 if the person was born outside the United States.

h_numper — An integer variable for the number of people living in the person’s household.

Losewks — An integer variable that represents the number of self-reported weeks that the person failed to show up for scheduled work.

ss_val — The total payment in dollars that the person receives in yearly social security payments.

csp_val — The total payment in dollars that the person receives in yearly child support payments.

vet_yn — A dummy variable for whether or not the person receives veterans payments.

fin_yn — A dummy variable for whether or not the person receives financial assistance.

oi_yn — A dummy variable for whether or not the person receives any other income.

Additional interaction and polynomial terms in “complex” model

Agesq — The square of the age variable.

Edsq — The square of the education variable.

Exsq — The square of the experience variable.

Wed — An interaction term between being female and education.

Wedsq — The Wed variable squared.

Wex — An interaction term between being female and experience.

Wexsq — The Wex variable squared.

(Note: Additional higher-order terms and interactions were tested and rejected.)
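The terms listed for the "complex" model can be built mechanically from the base variables. The sketch below assumes each observation is a dictionary keyed by the variable names used in this appendix; the function name is hypothetical.

```python
def add_complex_terms(row):
    """Return a copy of row with the polynomial and interaction terms added."""
    out = dict(row)
    out["Agesq"] = row["Age"] ** 2
    out["Edsq"] = row["Education"] ** 2
    out["Exsq"] = row["Experience"] ** 2
    # Gender is coded 1 for women, so these terms are zero for men.
    out["Wed"] = row["Gender"] * row["Education"]
    out["Wedsq"] = out["Wed"] ** 2
    out["Wex"] = row["Gender"] * row["Experience"]
    out["Wexsq"] = out["Wex"] ** 2
    return out


# Hypothetical observation.
example = add_complex_terms(
    {"Age": 30, "Education": 16, "Experience": 8, "Gender": 1}
)
print(example)
```

Because Gender takes only the values 0 and 1, Wedsq equals Gender times Edsq, and likewise Wexsq equals Gender times Exsq.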

About this article

Cite this article

Jacobsen, J., Levin, L. & Tausanovitch, Z. Comparing Standard Regression Modeling to Ensemble Modeling: How Data Mining Software Can Improve Economists’ Predictions. Eastern Econ J 42, 387–398 (2016). https://doi.org/10.1057/eej.2015.8

Published: 30 March 2015
Issue Date: 01 June 2016
DOI: https://doi.org/10.1057/eej.2015.8
