Abstract
Economists’ wariness of data mining may be misplaced, even in cases where economic theory provides a well-specified model for estimation. We discuss how new data mining/ensemble modeling software, for example the program TreeNet, can be used to create predictive models. We then show that for a standard labor economics problem, the estimation of wage equations, TreeNet outperforms standard OLS regression by yielding lower prediction error. Ensemble modeling resists the tendency to overfit data. We conclude by considering additional types of economic problems that are well-suited to the use of data mining techniques.
Notes
We do not consider whether the dependent variable should be transformed but rather retain this standard method; like regression modeling, TreeNet does not automatically test different transformations of the dependent variable.
So many such studies exist that no single survey of the wage regression literature is available. See Willis [1986] for a classic survey of human capital earnings functions, and Jacobsen [2007], Chapter 7, for discussion and examples of many of these articles in the gender wage differential context.
A fourth model was run in which experience and education were included as categorical variables (i.e., as many dummies as the number of values minus one); this model did not outperform either model 2 or model 3 on either the adjusted R² or the prediction error metric, so we do not include results from it.
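The dummy coding described in this note can be sketched as follows. This is a minimal illustration only, not the authors' code: the function name `one_hot_drop_first`, the `d_` column prefix, and the choice of the lowest value as the omitted baseline are all assumptions made here.

```python
def one_hot_drop_first(values):
    """Encode values as (number of distinct levels - 1) 0/1 dummies."""
    levels = sorted(set(values))
    # Assumption: the lowest level is the omitted baseline category.
    kept = levels[1:]
    return [{f"d_{level}": int(v == level) for level in kept} for v in values]


# Hypothetical years-of-education values with three distinct levels,
# so each row gets two dummies.
education_years = [12, 16, 12, 18]
rows = one_hot_drop_first(education_years)
```

A row at the baseline level has all dummies equal to zero, which is what keeps the dummies from being collinear with the regression constant.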
References
Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014. High-dimensional Methods and Inference on Structural and Treatment Effects. Journal of Economic Perspectives, 28 (2): 29–50.
Blinder, Alan. 1973. Wage Discrimination: Reduced Form and Structural Estimates. Journal of Human Resources, 8 (4): 436–455.
Bowles, Mike. 2014. Ensemble Packages in R. Revolutions Blog (April 8), Revolution Analytics, http://blog.revolutionanalytics.com/2014/04/ensemble-packages-in-r.html.
Cook, Thomas D. 2014. “Big Data” in Research on Social Policy. Journal of Policy Analysis and Management, 33 (2): 544–547.
Friedman, Jerome H. 1999a. Stochastic Gradient Boosting. Technical Report, Dept. of Statistics, Stanford University.
Friedman, Jerome H. 1999b. Greedy Function Approximation: A Gradient Boosting Machine. Technical Report, Dept. of Statistics, Stanford University.
Jacobsen, Joyce P. 2007. The Economics of Gender, 3rd ed., Malden, Mass.: Blackwell.
Munnell, Alicia H., Geoffrey M. B. Tootell, Lynne E. Browne, and James McEneaney. 1996. Mortgage Lending in Boston: Interpreting HMDA Data. American Economic Review, 86 (1): 25–53.
Oaxaca, Ronald. 1973. Male-female Wage Differentials in Urban Labor Markets. International Economic Review, 14 (3): 693–709.
Salford Systems. 2001–05. TreeNet: An Exclusive Implementation of Jerome Friedman’s MART Methodology: Robust Multi-tree Technology for Data Mining, Predictive Modeling and Data Processing, http://www.salford-systems.com/.
Schonlau, Matthias. 2005. Boosted Regression (Boosting): An Introductory Tutorial and a Stata Plugin. The Stata Journal, 5 (3): 330–354.
Stock, James H. 2010. The Other Transformation in Econometric Practice: Robust Tools for Inference. Journal of Economic Perspectives, 24 (2): 83–94.
Varian, Hal R. 2014. Big Data: New Tricks for Econometrics. Journal of Economic Perspectives, 28 (2): 3–28.
Wichard, Jorg D., and Maciej Ogorzalek. 2007. Time Series Prediction with Ensemble Models Applied to the CATS Benchmark. Neurocomputing, 70 (13–15): 2371–2378.
Willis, Robert J. 1986. Wage Determinants: A Survey and Reinterpretation of Human Capital Earnings Functions. Handbook of Labor Economics, 1: 525–602.
Acknowledgements
The authors thank session participants at the May 2013 Eastern Economics Association meetings and an anonymous referee for helpful suggestions, and Wesleyan University for research support.
Appendix
Variables used in regressions and TreeNet
Dependent variable
Lnearn — The log real wage, calculated by taking the log of the person’s earnings from the previous year, divided by hours usually worked times weeks worked last year.
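As a rough sketch of this calculation, the function below reads the definition as earnings divided by the product of usual weekly hours and weeks worked. The function and argument names are hypothetical, no price deflator is applied because the source does not say how earnings were converted to real terms, and returning `None` for non-positive inputs is an assumption rather than the authors' documented treatment.

```python
import math


def log_real_wage(annual_earnings, usual_hours_per_week, weeks_worked):
    """Log hourly wage: log(earnings / (hours per week * weeks worked))."""
    hours_worked = usual_hours_per_week * weeks_worked
    # Assumption: the log is undefined for non-positive earnings or hours,
    # so such records are flagged with None instead of raising an error.
    if annual_earnings <= 0 or hours_worked <= 0:
        return None
    return math.log(annual_earnings / hours_worked)


# Illustrative values only: 40,000 / (40 * 50) gives 20 per hour before the log.
lnearn = log_real_wage(40000.0, 40, 50)
```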
Variables in “sparse” model
Education — A recoding from the level of schooling completed to the approximate years of schooling that would normally have been spent to achieve that level of attainment.
Experience — Potential experience, calculated as age minus years of education minus six.
Gender — A dummy variable coded 1 for women.
Race — A dummy variable coded 1 for anyone whose ethnicity is more than 66 percent from a white non-Hispanic background.
Marriage — A dummy variable coded 1 for a person who is legally married.
Serve — A dummy variable coded 1 for people who have ever served on active duty in the United States military.
Rural — A dummy variable based on the metropolitan status provided by the CPS, coded 1 for people whose primary living space is non-metropolitan.
South — A dummy variable coded 1 for a person living in the southern region of the United States.
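The sparse-model regressors listed above could be assembled from a single person record roughly as follows. This is a sketch under stated assumptions: the input field names (`education_years`, `sex`, `white_non_hispanic_share`, and so on) are hypothetical and are not the CPS variable names, and the education recoding is taken as already done.

```python
def potential_experience(age, education_years):
    """Potential experience: age minus years of education minus six."""
    return age - education_years - 6


def sparse_model_row(person):
    """Map one hypothetical person record to the sparse-model regressors."""
    return {
        "Education": person["education_years"],
        "Experience": potential_experience(person["age"], person["education_years"]),
        "Gender": int(person["sex"] == "female"),
        "Race": int(person["white_non_hispanic_share"] > 0.66),
        "Marriage": int(person["legally_married"]),
        "Serve": int(person["ever_active_duty"]),
        "Rural": int(person["metro_status"] == "non-metropolitan"),
        "South": int(person["region"] == "south"),
    }


# Illustrative record; every field name and value here is made up.
example = {
    "age": 40, "education_years": 16, "sex": "female",
    "white_non_hispanic_share": 1.0, "legally_married": True,
    "ever_active_duty": False, "metro_status": "metropolitan", "region": "south",
}
row = sparse_model_row(example)
```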
Additional variables in “all variable” model
Age — An integer variable representing the age of the person.
Non-English — A dummy variable coded 1 for people who were born in countries where English is not the primary language spoken.
NotborninUS — A dummy variable coded 1 if the person was born outside the United States.
h_numper — An integer variable for the number of people living in the person’s household.
Losewks — An integer variable that represents the number of self-reported weeks that the person failed to show up for scheduled work.
ss_val — The total payment in dollars that the person receives in yearly social security payments.
csp_val — The total payment in dollars that the person receives in yearly child support payments.
vet_yn — A dummy variable for whether or not the person receives veterans payments.
fin_yn — A dummy variable for whether or not the person receives financial assistance.
oi_yn — A dummy variable for whether or not the person receives any other income.
Additional interaction and polynomial terms in “complex” model
Agesq — The square of the age variable.
Edsq — The square of the education variable.
Exsq — The square of the experience variable.
Wed — An interaction term between being female and education.
Wedsq — The Wed variable squared.
Wex — An interaction term between being female and experience.
Wexsq — The Wex variable squared.
(Note: Additional higher-order terms and interactions were tested and rejected.)
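The polynomial and interaction terms listed above can be constructed from the base variables as in the sketch below. The function name and argument names are hypothetical; the output keys follow the appendix labels. Interaction terms are taken to be simple products with the 0/1 gender dummy, which is the usual construction but is an assumption here.

```python
def complex_terms(age, education, experience, female):
    """Build the squared and gender-interaction terms from base variables.

    `female` is the 0/1 gender dummy (1 for women).
    """
    wed = female * education    # gender x education interaction
    wex = female * experience   # gender x experience interaction
    return {
        "Agesq": age ** 2,
        "Edsq": education ** 2,
        "Exsq": experience ** 2,
        "Wed": wed,
        "Wedsq": wed ** 2,
        "Wex": wex,
        "Wexsq": wex ** 2,
    }


# Illustrative values only.
terms_woman = complex_terms(age=40, education=16, experience=18, female=1)
terms_man = complex_terms(age=40, education=16, experience=18, female=0)
```

Because the gender dummy takes only the values 0 and 1, Wedsq is the same as the dummy multiplied by Edsq, and Wexsq is the same as the dummy multiplied by Exsq; all four interaction terms are zero for men.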
About this article
Cite this article
Jacobsen, J., Levin, L. & Tausanovitch, Z. Comparing Standard Regression Modeling to Ensemble Modeling: How Data Mining Software Can Improve Economists’ Predictions. Eastern Econ J 42, 387–398 (2016). https://doi.org/10.1057/eej.2015.8
DOI: https://doi.org/10.1057/eej.2015.8